Key-value Store - System Design

CREATED 2021/09/13 22:10 PM

Basic Requirements

put(K,V) or put(K,V,timestamp)

get(K)

The size of each K-V pair is small, less than 10 KB.

 

Choose AP or CP

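According to the CAP theorem (Consistency, Availability, Partition tolerance), a distributed system can provide at most two of the three guarantees at the same time. Since network partitions cannot be avoided in practice, the real choice is between CP and AP.

A banking system needs up-to-date data and strong consistency, so it chooses CP.

Most other systems choose AP and accept that stale data may still be returned.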

 

Partition & Replica

Partition or Sharding: Split the data across multiple servers/machines/nodes.

Replica: Keep additional copies of the data so that an outage or data loss on one machine does not lose it. Replicas are placed in distinct data centers, and the data centers are connected by high-speed networks.

 

Consistency models

Strong consistency

Weak consistency

Eventual consistency: used by systems like Dynamo and Cassandra. Given enough time, all updates are propagated and all replicas become consistent.

 

Client <-> Coordinator <-> different nodes/servers (distributed on a ring by consistent hashing)
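A minimal sketch of how a coordinator might pick the node for a key on a consistent-hashing ring. The class name HashRing, the use of md5, and the 100 virtual nodes per server are assumptions made for illustration, not details from the notes above.

```python
import bisect
import hashlib


def ring_position(key: str) -> int:
    """Map a string to a position on the ring (md5 is an arbitrary choice here)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Hypothetical consistent-hash ring with virtual nodes."""

    def __init__(self, nodes, vnodes=100):
        self._positions = []  # sorted ring positions of all virtual nodes
        self._owner = {}      # ring position -> physical node
        for node in nodes:
            self.add_node(node, vnodes)

    def add_node(self, node, vnodes=100):
        # Each physical node is placed on the ring many times to even out load.
        for i in range(vnodes):
            pos = ring_position(f"{node}#{i}")
            bisect.insort(self._positions, pos)
            self._owner[pos] = node

    def remove_node(self, node):
        self._positions = [p for p in self._positions if self._owner[p] != node]
        self._owner = {p: n for p, n in self._owner.items() if n != node}

    def get_node(self, key):
        """Walk clockwise from the key's position to the first virtual node."""
        if not self._positions:
            raise LookupError("ring is empty")
        idx = bisect.bisect_right(self._positions, ring_position(key))
        return self._owner[self._positions[idx % len(self._positions)]]
```

When a node is removed, only the keys it owned move to another node; every other key keeps its owner, which is the property that makes the ring attractive for partitioning.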

 

Write Path

1. Write to the commit log.

2. Write to the memory cache.

3. The memory cache is flushed to SSTables.
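The three write steps can be sketched roughly as follows. TinyStore, the JSON-lines file format, and the memtable_limit threshold are hypothetical choices for illustration; real engines use binary formats and size-based thresholds.

```python
import json
import os
import tempfile


class TinyStore:
    """Hypothetical write path: commit log -> memory cache -> SSTable files."""

    def __init__(self, directory, memtable_limit=3):
        self.directory = directory
        self.memtable_limit = memtable_limit  # assumed flush threshold (entry count)
        self.memtable = {}
        self.sstables = []  # file paths, oldest first
        self.log_path = os.path.join(directory, "commit.log")
        self.log = open(self.log_path, "a", encoding="utf-8")

    def put(self, key, value):
        # 1. Append to the commit log and force it to disk.
        self.log.write(json.dumps([key, value]) + "\n")
        self.log.flush()
        os.fsync(self.log.fileno())
        # 2. Apply the write to the memory cache.
        self.memtable[key] = value
        # 3. Flush the memory cache once it is full.
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        path = os.path.join(self.directory, f"sstable-{len(self.sstables):04d}.jsonl")
        with open(path, "w", encoding="utf-8") as f:
            for key in sorted(self.memtable):  # SSTables are sorted by key
                f.write(json.dumps([key, self.memtable[key]]) + "\n")
        self.sstables.append(path)
        self.memtable.clear()
        # The flushed entries now live in the SSTable, so the log can start over.
        self.log.close()
        self.log = open(self.log_path, "w", encoding="utf-8")

    def close(self):
        self.log.close()


def demo():
    """Write two keys out of order and return the key order of the flushed file."""
    with tempfile.TemporaryDirectory() as d:
        store = TinyStore(d, memtable_limit=2)
        store.put("b", "2")
        store.put("a", "1")  # fills the memory cache, which triggers a flush
        with open(store.sstables[0], encoding="utf-8") as f:
            keys = [json.loads(line)[0] for line in f]
        store.close()
        return keys
```

Writing the log first means a crash between steps 1 and 2 can be repaired by replaying the log into a fresh memory cache.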

 

Read Path

1. Read from the memory cache.

2. On a miss, check the Bloom filter.

3. If the Bloom filter says the key may exist, check the SSTables.

4. Return the result to the client.
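A rough sketch of the read steps above. The SSTable class and the lookup function are hypothetical, and a plain Python set stands in for the Bloom filter, so unlike a real filter it never gives a false positive.

```python
from dataclasses import dataclass, field


@dataclass
class SSTable:
    """Hypothetical in-memory stand-in for one on-disk SSTable."""
    rows: dict
    keys: set = field(init=False)

    def __post_init__(self):
        # Stand-in for a Bloom filter built when the table is written.
        self.keys = set(self.rows)

    def might_contain(self, key):
        return key in self.keys


def lookup(key, memtable, sstables):
    """sstables is ordered oldest first, so it is scanned in reverse."""
    # 1. Read from the memory cache.
    if key in memtable:
        return memtable[key]
    for table in reversed(sstables):
        # 2. Ask the filter first; a "no" means the table can be skipped.
        if not table.might_contain(key):
            continue
        # 3. Check the SSTable itself.
        if key in table.rows:
            return table.rows[key]
    # 4. Nothing found: report a miss to the client.
    return None
```

Scanning from the newest table to the oldest ensures that the most recent value for a key wins.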

 

Sequential IO is faster than Random IO for disk throughput.

LSM Tree

SSTable (Sorted String Table)

 

Bloom filter

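A Bloom filter can return false positives but never false negatives: it may report that a key exists when it does not, but it never reports an existing key as missing. It reduces query latency by filtering out lookups for keys that do not exist before any SSTable is read.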
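A minimal Bloom filter sketch to make the false-positive/false-negative asymmetry concrete. The sizes (1024 bits, 3 hash functions) and the use of seeded sha256 as the hash family are assumptions for illustration only.

```python
import hashlib


class BloomFilter:
    """Hypothetical Bloom filter backed by a bytearray used as a bit array."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key):
        # Derive num_hashes bit positions by hashing the key with different seeds.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        # True means "possibly present"; False means "definitely absent".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key)
        )
```

Every added key sets all of its bits, so a later check on that key always passes. A key that was never added can still pass if other keys happen to have set the same bits, which is where false positives come from.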

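File Compaction

After some time, segments are merged together to reduce the number of segments.

The LSM tree periodically runs a compaction that merges several segments into one larger segment and then removes the old segments. Because the data inside each segment is already sorted, the merge works like the merge step of merge sort and is very efficient, taking only O(n) time.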
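A small sketch of the merge step, assuming each segment is a list of (key, value) pairs sorted by key with unique keys, and that segments are passed oldest first. The function name compact is hypothetical. It uses heapq.merge, which performs a k-way merge of already sorted inputs.

```python
import heapq


def compact(segments):
    """Merge sorted segments (oldest first) into one sorted segment.

    When the same key appears in several segments, the value from the
    newest segment is kept.
    """
    # Tag every entry with the negated segment index so that, for equal keys,
    # the entry from the newest segment sorts first.
    tagged = (
        [(key, -age, value) for key, value in segment]
        for age, segment in enumerate(segments)
    )
    merged = []
    last_key = object()  # sentinel that equals no real key
    for key, _, value in heapq.merge(*tagged):
        if key != last_key:  # first occurrence of a key is the newest one
            merged.append((key, value))
            last_key = key
    return merged


old = [("a", "1"), ("c", "3")]
new = [("a", "9"), ("b", "2")]
result = compact([old, new])
```

Here result is [("a", "9"), ("b", "2"), ("c", "3")]: the output stays sorted and the newer value for "a" replaces the older one. With a fixed, small number of segments the merge is a single linear pass over the input.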

Delete Operation

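A delete is actually an overwrite: the value is set to a tombstone.

In essence, a delete is an overwriting write rather than the removal of a record, which may seem counterintuitive at first. Tombstones are cleaned up during compaction, so data marked with a tombstone no longer exists in the new segment.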
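The tombstone idea can be sketched as below. SegmentStore and its methods are hypothetical, and segments are plain dicts rather than files, so this only illustrates the logic.

```python
TOMBSTONE = object()  # special marker value meaning "this key was deleted"


class SegmentStore:
    """Hypothetical store made of immutable segments plus one writable segment."""

    def __init__(self):
        self.segments = [{}]  # oldest first; the last one accepts writes

    def put(self, key, value):
        self.segments[-1][key] = value

    def delete(self, key):
        # A delete is just a write whose value is the tombstone.
        self.put(key, TOMBSTONE)

    def seal(self):
        """Freeze the current segment and start a new writable one."""
        self.segments.append({})

    def get(self, key):
        for segment in reversed(self.segments):  # newest first
            if key in segment:
                value = segment[key]
                return None if value is TOMBSTONE else value
        return None

    def compact(self):
        merged = {}
        for segment in self.segments:  # oldest to newest, newer entries win
            merged.update(segment)
        # All segments were merged, so no older copy can resurface and the
        # tombstones themselves can be dropped.
        self.segments = [{k: v for k, v in merged.items() if v is not TOMBSTONE}]
```

Before compaction the old value still sits in an older segment, hidden by the tombstone in a newer one. After compaction neither the value nor the tombstone remains.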

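Summary

Adding, updating, and deleting data

LevelDB writes new data in two steps:

  1. Append the operation to the end of the log file. Although this is a disk operation, sequential file writes are quite efficient, so it does not slow down writes.
  2. If the log write succeeds, insert the key-value record into the in-memory MemTable.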

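When LevelDB updates a record, it does not modify the SST file in place. The update is written to the MemTable as a new record and later flushed to an SST file. On a read, newer data is always consulted before older data, so the new value is always the first one found.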

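When LevelDB deletes a record, it does not modify the SST file either. Instead, it writes a key-value pair whose value is a special marker (a tombstone) as a new record, and keys carrying this marker are dropped during SST file compaction.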

The core idea is to turn writes into sequential appends, which improves write efficiency.

 

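The LSM tree storage engine works as follows:

1. On a write, the data is first buffered in an ordered in-memory tree structure (called the memtable). This also triggers updates to related structures such as the Bloom filter and the sparse index.

2. When the memtable grows large enough, it is written to disk in one pass, producing a segment file whose contents are sorted. This is a sequential write, so it is very efficient.

3. On a query, the Bloom filter is checked first. If the Bloom filter reports that the data does not exist, the query returns "not found" immediately. Otherwise, the segments are searched one by one, from newest to oldest.

4. Within each segment, a binary search over the sparse index first finds the offset range that may hold the data. That range is then read from disk and searched again with binary search to get the result.

5. When there are many segment files, compaction runs periodically in the background and merges multiple files into larger ones so that query performance does not degrade.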
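The two-level lookup in point 4 can be sketched as follows. The Segment class and the stride of 4 entries per index slot are assumptions for illustration; a Python list stands in for the on-disk file, and a list slice stands in for the ranged disk read.

```python
import bisect


class Segment:
    """Hypothetical sorted segment with a sparse index over every stride-th key."""

    def __init__(self, sorted_rows, stride=4):
        self.rows = sorted_rows  # list of (key, value), sorted by key
        self.stride = stride
        # Sparse index: only every stride-th key is kept, with its offset.
        self.index_offsets = list(range(0, len(sorted_rows), stride))
        self.index_keys = [sorted_rows[i][0] for i in self.index_offsets]

    def get(self, key):
        # First binary search: find the last indexed key <= the wanted key.
        slot = bisect.bisect_right(self.index_keys, key) - 1
        if slot < 0:
            return None  # key sorts before everything in this segment
        start = self.index_offsets[slot]
        # "Disk read" of just the range that can contain the key.
        block = self.rows[start:start + self.stride]
        # Second binary search: inside the block that was read.
        block_keys = [k for k, _ in block]
        i = bisect.bisect_left(block_keys, key)
        if i < len(block_keys) and block_keys[i] == key:
            return block[i][1]
        return None
```

The sparse index holds only a fraction of the keys, so it stays small enough to keep in memory, while each lookup still reads only one short range of the file.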

(Reference - https://www.zhihu.com/question/19887265/answer/1714901833)

 

Structure of LevelDB:

 

Log: Write-Ahead Log (WAL). Each log file consists of several blocks, and each block is 32 KB.
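To show what fixed 32 KB blocks imply for records of different sizes, here is a rough sketch of how one record could be split into fragments. The 7-byte header (4-byte checksum, 2-byte length, 1-byte type) and the FULL/FIRST/MIDDLE/LAST fragment types follow my reading of LevelDB's log format notes and should be treated as assumptions; the function fragment is hypothetical and only computes sizes, it writes nothing.

```python
BLOCK_SIZE = 32 * 1024  # 32 KB, as stated above
HEADER_SIZE = 7         # assumed: checksum (4) + length (2) + type (1)


def fragment(payload_len, block_offset=0):
    """Return (type, payload_bytes) pairs for one record.

    block_offset is the write position inside the current block.
    """
    fragments = []
    remaining = payload_len
    first = True
    while True:
        left = BLOCK_SIZE - block_offset
        if left < HEADER_SIZE:
            # Not even a header fits: the tail is padded, move to a new block.
            block_offset = 0
            left = BLOCK_SIZE
        chunk = min(remaining, left - HEADER_SIZE)
        last = chunk == remaining
        if first and last:
            kind = "FULL"
        elif first:
            kind = "FIRST"
        elif last:
            kind = "LAST"
        else:
            kind = "MIDDLE"
        fragments.append((kind, chunk))
        remaining -= chunk
        block_offset = (block_offset + HEADER_SIZE + chunk) % BLOCK_SIZE
        first = False
        if last:
            return fragments


small = fragment(100)
large = fragment(70000)
```

Under these assumptions, small is [("FULL", 100)], while the 70000-byte record spans three blocks: [("FIRST", 32761), ("MIDDLE", 32761), ("LAST", 4478)].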

Reference

[1] https://blog.csdn.net/weixin_44039270/article/details/106934601

[2] https://soulmachine.gitbooks.io/system-design/content/cn/key-value-store.html

[3] https://www.zhihu.com/question/19887265

 

posted @ 2021-09-14 13:49  YBgnAW