When tasked to choose a database product for a “web scale” project that constitutes gigantic data and traffic, the amount of criteria to consider might be overwhelming. This post aims to put the focus on the top 5 most important criteria that should drive your decision and demonstrate how the popular vendors differ in this regard – MongoDB, Cassandra, Couchbase and MySql
The challenge
Designing a project for mass scale is different than our old comfy approach of using traditional normalized RDBMS – classic DB patterns like foreign key, transaction and auto-increment numbers are performance killers (read below why) and must be avoided or implemented with great care. Suddenly, the amount of DB products and paradigms to choose from is overwhelming, when recently tried to compare and find the perfect match for my DB layer, I found myself struggling with dozens of criteria to consider. Then while scratching my head in front of a huge comparison table I decided to curate the table, delete most of the criteria and focus on those that matters most.
Why did I narrowed my selection to these DB products? where is Oracle, MS SQL, HBase, Amazon Dynamo and Azure DocumentDB? they are all great and have their place, howeve this post is focused on large-scale, internet gigantic project which is cloud-vendor agnostic thus DB which meets the following criteria were stripped out: (1) classic, pricy RDBMS (2) niche products (3) products that are not built for mass-scale (4) avilable only under a specific cloud provider (e.g. Azure, Amazon). Consequently, I chose what I found as the most popular DB products for a web-scale project.
The critical factors
The following outlines what I found to be the most significant capabilities to consider when choosing your DB approach and DB product. Goes without saying that different requirements pose focus on different criteria, nothing can replace the judgement of a software designer, however this paper might serve as a useful starting point to kick your DB discussion and then consider when you have to deviate and promote other factors. These factors comes first mostly because there is no easy way to compensate for their abcense. For example, if your DB struggles to scale horizontally and the traffic increases dramatically – there’s no simple ‘hook’ that can get you out of this challenge, many companies chose to replace the entire DB in response when encountered similar issue.
- Atomic operations – how can you ensure that multiple writes occurs at once and upon failure roll back at once (i.e. ACID transactions)? say you wish to save a new Payment record and then increase the User’s balance column, each row exists in a different table/collection but if the later fails everything must be rolled-back. RDBMS system (i.e. SQL) solves this with ACID transactions, easy isn’t it? not really because transactions are double edge sword that might kill your performance as they lock rows/tables (most products allow to tune the locking) and blocks any other operations. Moreover transactions are not applicable in distributed systems (e.g. SOA/Micro services architecture) where the Order and User modules are separated and sometimes use different databases. The NoSQL product-line offers different approaches, for example by re-modeling the data and storing both User and Order on the same document which is always guarantees to be atomic or by using two-phase commit. These techniques are also not a free-lunch as they roll the heavy lifting work to the data scientist or to the developer. So, are you brave enough to design a financial system using NoSQL DB which lacks ACID transactions? are you optimistic enough to use RDBMS transactions in a system that holds 50K concurrent users every second? there’s no magic solution here, every project has its own constraints, requirements and budget, but I won’t leave you here dry with questions only – my 2cent for the general purpose case is opt for the second approach of avoiding transactions. Unless on a very tight budget, it’s better to sweat and gain control instead of run fast and encounter dead-ends later.
- Elastic horizontal scaling – this is the most significant factor, what can you do when your server begins to beath heavily and sweat? vertical scaling is the most straightforward approach in which you feed the machine with beefy hardware, to some extent this might work until there are no more available expansion slots in the DB server… This is where the most significant scaling tool fits in – Horizontal scaling. It comes in two flavours: (a) X-Axis scale which is about adding more nodes (machines), each handles different type of data or copies the SAME rows, for example in a typical e-commerce site, a cluster (group of servers) will handle the Users data and another cluster might deal with Orders. While this partition the gigantic data challenge into smaller problems, it carries major complexity as diferent server holds different data schema and also will fail to scale once specific data type (e.g. Orders) will grow unwieldy. (b) Z-Axis scale (a.k.a sharding) means spanning our data over multiple servers, each server handles a subset of the data (the coined Z-Axis termed in the great book – “The art of scalability”). For example, in our User-Orders example, we might store all the users from France and their orders at Server A, and those from Mexicon in a different server. Practically we partition the data by rows which is a killer technique to narrow a problem of heavily breathing DB machine with 500M users into DB machine with 5M users only (if we split these users among 100 servers). Z-Axis scaling is typically called “Sharding” (abbreviation of the term “Share Nothing”)
Two types of horizontal scaling: X-AXIA and Z-AXIS
A typical RDBMS will struggle to scale horizontally using row partitioning (i.e. sharding), how theoretically can it perform joins, transaction, ensure unique values and incremental ID if the data is spread among 100 machines? this is where NoSQL shines, it takes advantage over its loose data model and offers built-in sharding features that distribute the data among X servers. While this theoretically transparent to the developer and manageable for the ops-team – in practice it still demands a lot of planning and maintenance and consequently this is the square where NoSQL vendors compete – how easy is it to scale horizontally? - Development speed – obviously you seek not only performant DB but also to deliver on time and budget, this might not happen when dealing with emerging paradigm that carries steep learning curve, complex development model and lack of supporting libraries. Dealing with NoSQL technologies poses challenge to some developer who has to learn new concepts like map-reduce, document oriented design, two phase commits and other – this might get even worse on some DB that expect from the developer to state within queries which consistency model shall be used, which replica should be queries and other DBA-ish narratives that complicates the development. The least you can do is rely on standard and familiar NoSQL technologies when possible, for example unless dealing with huge data volume opt for commodity NoSQL products like MongoDB over specialized products like Hadoop.
- Maturity – as of 2015, no product handles gigantic data and traffic fluently without heavy planning and configuration, some still embodies cumbersome and complicated management. Though at the basic level all the top players can deliver, their differ in maturity aspects like the supporting eco-system (e.g. third party extensions), knowledge and use cases (e.g. installing a cluster with Docker over Google cloud engine), etc. Simply put, the trade-off is: MongoDB and MySQL gained more experience and larger community, Couchbase and Cassandra has better and modern built-in tooling.
Popularity comparison of the major vendors
- The backend architecture – the DB tier relations with the backhand affects the design of both, for example when the backend constitutes loosely coupled autonomous modules such as Microservices architecture, then the DB design is not allowed to assume that all the data lives in a single DB (different services/modules might utilize different databases) which obviates the possibility to use relationship, transactions or any normalized data structure.
Microservices – database per module
A Comparison table