When a database is limited to running on a single computer, only certain load can be served with acceptable latency. Adding hardware to this single computer will help only so much. Doubling or tripling the load may require significantly more than twice or thrice cost of hardware to scale up. Such approach is expensive to scale and it eventually hits its limit. Ideally, we would start with a single inexpensive computer and as load increases we would keep on adding same inexpensive computers resulting in a near-linear function between load and cost. Such horizontal scaling out is common place in today’s web applications because it provides a more predictable cost model.
Clusters of inexpensive commodity hardware lead to a disruption in the database ecosystem. What further amplifies this disruption in falling prices of RAM and solid state storage. A number NoSQL of databases are truly leveraging this disruption. At Meshin, one of our key requirements is the lightning-fast delivery of query results. Think of Google Instant search. Showing results with “as you type” latency enables a whole new class of use cases. This lead us to considering various in-memory database engines. Redis, with its simple and elegant data model and very transparent performance characteristics came out on top. For in-memory databases, single-threaded design becomes an obvious choice, essentially removing significant overhead common to traditional database architectures. Another assumption in Meshin design is that if the index is partially or completely lost it can be recovered by re-indexing original data. Sure, this introduces short-term inconsistencies, but on the other hand it allows to relax durability and further simplify the design. Such a trade-off is a perfect fit to the Redis persistence strategy.
Single-threaded design of Redis brings up another interesting issue. Scaling of a single instance of Redis is not limited just by the computer it runs on, which would amount to RAM size, but is further limited by the single core or hardware thread it will utilize. So if your load exceeds the single hardware thread or size of available RAM, you need a second instance of Redis. This requires some approach to clustering. Effective clustering is all about figuring out the right partitioning scheme. Ideally, you need to split load into identically-sized partitions. If load becomes biased towards one or more partitions you will have a bottleneck. Such a bottleneck will limit your system’s scalability or in other words will make your load-to-cost function non-linear. It is important that partitions are as independent as possible. If an operation spans a group of partitions its scalability will be limited by the number of such groups in the cluster. In extreme cases of spanning all partitions in a cluster, the scalability will be as good as running on a single partition.
Another aspect is differentiating the load by reads and writes. If the Redis hardware thread capacity is exceeded by reads, it is easy to scale out by putting additional read-only replicas of the same data. It is much harder to scale out writes by replication where trading off consistency is often required. Redis again takes simple a approach with its master-slave replication. A writable master replica asynchronously updates one or more read-only slave replicas. Right after replying to a write operation the master notifies all replicas. No acknowledgment is required by the master. This means that there is short period of time when slave replicas may return old data. This provides with write-write consistency guarantees that are as strong as without replication. However, the write-read consistency guarantees are now less strong or “eventual”. It is important to note that replication not only helps scalability of reads, but also improves reliability when replicas are placed on different machines. For the Meshin application eventual write-read consistency is acceptable tradeoff for higher reliability.
With the Meshin application, we have a large number of user indexes, each of roughly equal size. Meshin keeps a sliding window index of email, Twitter, Facebook etc. messages to maintain predictable maximum size of index. This provides a great opportunity for partitioning. If a partition handles roughly an equal number of users, we have balanced storage and load requirements and at the same time made the most frequent operations directed at only one partition. One approach is choosing between N partitions is hashing a unique user identifier, for example an email address, such that the hash space is in the [0..N) range. If the quality of the hash function is sufficient, we will get a well-balanced distribution.
Now we can place our Redis instances on M = N / K computer nodes in the cluster. K is number of Redis instances that we can comfortably run on single node. This number is limited by the number of cores or hardware threads. With a large number of hardware threads (8 and up) it makes sense to map each Redis instance to one thread. RAM is then equally divided between these instances. Now, knowing each user’s maximum required RAM, we can calculate how many users our cluster can accommodate. One thing to keep in mind is that if you use Redis snapshotting for persistence — which employs the OS copy-on-write mechanism — it may require as much as twice amount the RAM dedicated to a partition to take a snapshot.
It is important to note that in the case of the Meshin application with regard to specific RAM and thread count per computer the load is “memory-bound”. Meaning that a partition having a full memory pool sill won’t saturate the hardware thread with Redis. It is important to note that for some applications the load may become “thread-bound”. In this case the hardware thread is being saturated before the memory pool is maxed out. Interestingly, with the “memory-bound” case we have underutilized CPUs and with the “thread-bound” case we may have underutilized RAM.
Great so far, but what if we keep on adding users and they start exceeding the capacity of a partition? In the “memory-bound” case they exceed the memory pool allocated for a partition and in the “thread-bound” case they fully saturate the corresponding hardware thread. In our case, the number of partitions is fixed. Also, since partitions are equally populated with same size users they all about to overflow. As we add new nodes to the cluster, we may try to move some partitions from existing ones and rebalance such that our nodes have an equal number of partitions. There are two problems with this approach. In the “thread-bound” case this won’t help at all — we can’t really utilize new cores, since partitions are single threaded! In “memory-bound” case the machines will have fewer number of Redis instances than number of hardware threads. This leaves those threads idle and CPUs even more underutilized! One way to fix this problem is to overprovision partitions initially, so that K is a multiple of the number of hardware threads. The multiplier is essentially the number of times we want to grow our cluster by. Other approach is to allow changing number of partitions.
At Meshin we took approach where we are able to add partitions over time. This is accomplished by changing the way initial partitioning is done. Instead of limiting the hash function space to (0..N] we allow it to generate 64-bit unsigned integers. These hashes are mapped to a ring and wrapped from 2^64-1 to 0. We compared several hash functions homogeneity performance for a large set of email addresses as unique user identifiers. The best homogeneity was achieved by the SHA1 hash reduced to 64 bits by XOR-ing its parts.
Each partition is placed on the ring with equal distances in between. When a user identifier is hashed we get its position on the ring and thus can find the closest partition where user’s index resides by going clockwise.
When new partitions are added they are placed in between existing partitions to maintain equal distances. We also have the flexibility to add partitions in multiple points on the ring if partitions are of variable size to enable more users’ indexes allocated in larger partitions. Now, to lookup an existing user’s index you may hit a newly added partition and miss the old partition where the index was originally allocated. One way to address this problem is to move the corresponding user indexes to new partitions or rebalance the whole hash ring. We decided to employ simple routing instead. If a partition doesn’t have the user index, it proceeds to the next partition in clockwise direction.
Such routing is fairly scalable. It does not have a single point of contention since each partition participates in routing. We also use a caching layer on the application server to remember partition addresses for recent user identities. As with lookups, it is also possible to route new user index allocations. When a partition is full it will simply forward to next partition clockwise until a partition available for allocation is found. If new partitions’ points are added in between old ones this will keep routing paths short.
To add reliability and further scale reads we replicate each partition to a different node in the cluster. Replicas are placed so that each node has an equal number of masters and slaves. All writes are directed at the master replica, while reads are balanced between master and slave replicas. Balancing uses stickiness by user identity to keep CPU caches warm. For example, replicating N partitions twice, K partitions per node, 2 * K replicas per node and M nodes in cluster.
Partitioned replicas are taking snapshots asynchronously. We place the snapshots on redundant networked storage. All snapshot traffic goes through a dedicated network. Taking snapshots requires additional memory to be allocated by Redis. The amount of this memory depends on write activity on a partition and I/O latency of snapshot operation. To minimize snapshotting overhead on a specific node in the cluster we run only one snapshotting operation at the time. A dedicated agent calls the SAVE command for each partition on a node in a round-robin fashion. The SAVE command initiates snapshotting and returns when it is completed. The period of full cycle of snapshotting defines the maximum time slice of changes that could be lost if cluster a node fails.
Cluster configuration, including partition IP addresses and ports, positions on hash ring and health check status are stored in dedicated Redis instances. We call this dedicated instance the “directory”. The directory is replicated as well with the master doing the writes, which are quite rare. The directory reads are served by the master and one or more slave replicas in round-robin manner. When the application fails to access a certain partition replica, it marks it as offline in the directory. This means other instances will stop trying to access it since it’s been removed from the list of available replicas for a partition. If a failed replica was in slave role the cluster will remain fully functional, just with less redundancy for failed partition. We call this partial outage. In turn, if the replica was in master role, the partition will switch to read-only outage mode. Read-only outage mode still allows the partition to handle read queries. If all replicas in a partition have failed cluster is in full outage. Full outage is a rare event since it requires two nodes to fail simultaneously. A special agent monitors partitions for outages by polling the directory. If there is master-less partition it will elect one of the slave replicas to become master, reconfigure the rest of the replicas and adjust the directory to clear the read-only outage. While we attempt to recover from read-only outages automatically, partial outages and full outages must be addressed manually. At some point in scaling this will not work and we realize the need for a fully-automated recovery system down the road.
Currently the cluster runs on 14 nodes each with 96GB of RAM — this amounts a total of 1.3TB. The nodes have 16 hardware threads each. We deployed 160 partitions each replicated twice. Separate computers handle the directory and networked storage for snapshotting.