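initServer is the entry point where redis initializes the server. main calls it after initServerConfig, command-line argument parsing, and the daemonize check, which makes it the most important entry point of the server.
Although the code looks simple (102 lines, most of them plain assignments), following the threads turns up many points worth a careful look. Let's walk through it line by line:
The first thing the function does is deal with signals:

    signal(SIGHUP, SIG_IGN);
    signal(SIGPIPE, SIG_IGN);
    setupSignalHandlers();

redis mostly runs as a daemon, in which case it has no controlling terminal, so SIGHUP is ignored first (see APUE2, p. 237);
SIGPIPE is raised when a process writes to a pipe whose reader has terminated; writing to a SOCK_STREAM socket whose peer has already closed raises it as well. As a server, redis inevitably meets all kinds of clients, and the signal caused by a client terminating unexpectedly should also be ignored once the server starts;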
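To make the effect of ignoring SIGPIPE concrete, here is a minimal sketch in C. The helper name write_to_closed_pipe is hypothetical and not part of redis; it only assumes a POSIX system. With the signal ignored, a write to a pipe with no reader fails with EPIPE instead of killing the process.

```c
#include <errno.h>
#include <signal.h>
#include <unistd.h>

/* Hypothetical helper (not redis code): writes one byte into a pipe whose
 * read end is already closed and returns the errno reported by write().
 * Returns 0 if the write unexpectedly succeeded, -1 if pipe() failed. */
int write_to_closed_pipe(void) {
    int fds[2];
    char byte = 'x';

    signal(SIGPIPE, SIG_IGN);   /* the same call initServer makes */
    if (pipe(fds) == -1) return -1;
    close(fds[0]);              /* the "reader" goes away */

    ssize_t n = write(fds[1], &byte, 1);
    int err = (n == -1) ? errno : 0;   /* save errno before close() */
    close(fds[1]);
    return err;
}
```

Without the signal(SIGPIPE, SIG_IGN) call, the default action would terminate the process at the write, which is exactly what a long-running server cannot afford.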
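setupSignalHandlers handles two classes of signals:
1) SIGTERM
SIGTERM is the default termination signal sent by the kill command, i.e. the signal raised when we try to stop the server. redis does not terminate immediately on this signal. Instead, the handler sets the server.shutdown_asap flag, and the next run of serverCron calls prepareForShutdown to clean up and then exits. This avoids losing data to a blind kill and lets the server exit gracefully.
2) SIGSEGV, SIGBUS, SIGFPE, SIGILL
These signals indicate, respectively, an invalid memory reference (the familiar segmentation fault), an implementation-defined hardware fault, an arithmetic error (such as division by zero), and an illegal hardware instruction. These are very serious errors; redis handles them in sigsegvHandler, which records the state at the point of failure, does the necessary cleanup, and then kills the process itself.
Apart from the 7 signals mentioned above, redis handles no other signals and leaves them at their default actions.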
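The SIGTERM handling described above follows a common "set a flag in the handler, act on it later" pattern. The sketch below illustrates that pattern only; shutdown_asap, install_sigterm_handler, and cron_tick are hypothetical stand-ins for server.shutdown_asap, setupSignalHandlers, and serverCron, and plain signal() is used for brevity.

```c
#include <signal.h>

/* Hypothetical stand-in for server.shutdown_asap. */
static volatile sig_atomic_t shutdown_asap = 0;

/* The handler does the bare minimum: it only records the request. */
static void on_sigterm(int sig) {
    (void)sig;
    shutdown_asap = 1;
}

/* Returns 0 on success, -1 if the handler could not be installed. */
int install_sigterm_handler(void) {
    return signal(SIGTERM, on_sigterm) == SIG_ERR ? -1 : 0;
}

/* Stand-in for the periodic cron function: returns 1 when the process
 * should clean up and exit, 0 otherwise. In redis this is the point
 * where prepareForShutdown would be called. */
int cron_tick(void) {
    return shutdown_asap ? 1 : 0;
}
```

Keeping the handler this small matters because very little is safe to do inside a signal handler; the real cleanup runs later in normal program context.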
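Next, initServer sets up the logging facility in four lines:

    if (server.syslog_enabled) {
        openlog(server.syslog_ident, LOG_PID | LOG_NDELAY | LOG_NOWAIT,
            server.syslog_facility);
    }

It records its own thread ID:

    server.mainthread = pthread_self();

It then sets the client currently being served (current_client) to NULL and initializes clients, slaves, monitors, and unblocked_clients as empty lists.
Next it calls createSharedObjects() to initialize the shared objects. A word on this function: at startup redis builds the objects that will be needed all over the place while the server runs, such as the success reply "+OK" and specific error replies like "-ERR no such key\r\n". Most of these objects are used in the plain-text protocol replies sent to clients. The current version has 30+ of them; they avoid the cost of allocating objects on the fly and also simplify resource management.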
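The idea behind createSharedObjects can be sketched as a small table of preallocated replies that are handed out by pointer instead of being rebuilt for each response. This is a simplified miniature under stated assumptions: reply_obj, shared_replies, create_shared_replies, and reply_ok are hypothetical names, and the real robj carries more fields than a refcount and a string.

```c
#include <stddef.h>

/* Simplified stand-in for a redis object: a refcount plus a payload. */
typedef struct {
    int refcount;
    const char *ptr;
} reply_obj;

/* Hypothetical miniature of the shared-object table. */
struct shared_replies {
    reply_obj ok;
    reply_obj nokeyerr;
    reply_obj pong;
};

static struct shared_replies shared;

/* Build every shared reply once, at startup. */
void create_shared_replies(void) {
    shared.ok       = (reply_obj){1, "+OK\r\n"};
    shared.nokeyerr = (reply_obj){1, "-ERR no such key\r\n"};
    shared.pong     = (reply_obj){1, "+PONG\r\n"};
}

/* Hand out the shared object (bumping its refcount) instead of
 * allocating a fresh reply for every successful command. */
const reply_obj *reply_ok(void) {
    shared.ok.refcount++;
    return &shared.ok;
}
```

Every caller of reply_ok gets the same pointer, so a success reply costs no allocation and nothing has to be freed afterwards.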
After this function runs, the event loop server.el and the data structures needed to maintain the dbs are initialized:

    server.el = aeCreateEventLoop();
    server.db = zmalloc(sizeof(redisDb)*server.dbnum);

aeCreateEventLoop was already covered in the post on redis's event framework ae.c (http://www.cnblogs.com/liuhao/archive/2012/05/15/2502322.html), so it is not repeated here.
Next, the listening sockets are initialized: a TCP socket and a Unix domain socket. If creating either one fails, or if neither is configured, the program exits.

    if (server.port != 0) {
        server.ipfd = anetTcpServer(server.neterr,server.port,server.bindaddr);
        if (server.ipfd == ANET_ERR) {
            redisLog(REDIS_WARNING, "Opening port %d: %s",
                server.port, server.neterr);
            exit(1);
        }
    }
    if (server.unixsocket != NULL) {
        unlink(server.unixsocket); /* don't care if this fails */
        server.sofd = anetUnixServer(server.neterr,server.unixsocket,server.unixsocketperm);
        if (server.sofd == ANET_ERR) {
            redisLog(REDIS_WARNING, "Opening socket: %s", server.neterr);
            exit(1);
        }
    }
    if (server.ipfd < 0 && server.sofd < 0) {
        redisLog(REDIS_WARNING, "Configured to not listen anywhere, exiting.");
        exit(1);
    }

Next, the program initializes the server's db data structures:

    for (j = 0; j < server.dbnum; j++) {
        server.db[j].dict = dictCreate(&dbDictType,NULL);
        server.db[j].expires = dictCreate(&keyptrDictType,NULL);
        server.db[j].blocking_keys = dictCreate(&keylistDictType,NULL);
        server.db[j].watched_keys = dictCreate(&keylistDictType,NULL);
        if (server.vm_enabled)
            server.db[j].io_keys = dictCreate(&keylistDictType,NULL);
        server.db[j].id = j;
    }

Here is what each dict inside the db structure is for.
db.dict has type dbDictType. It is the master store and index for all data in the database, a string->redisObject mapping. For a simple key-value pair the redisObject is a string; for a stored list, the redisObject holds the list.
db.expires has type keyptrDictType. It stores the keys that have a timeout set together with their expiry time, i.e. a string->time_t mapping; this was covered in the post on how redis handles expired values (http://www.cnblogs.com/liuhao/archive/2012/05/25/2518185.html).
db.blocking_keys and db.watched_keys are both of type keylistDictType: the key is a redisObject and the value is a list. The value list holds a series of clients, meaning that when the state of that particular redisObject changes (for BLPOP, a new element arriving in the list counts as a state change), every client in the list is notified.
Since vm has been dropped entirely in newer versions, the vm-related code is skipped.
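After the db data structures, pubsub_channels is initialized. pubsub_channels is likewise a dict of type keylistDictType, used to record all subscribed clients.
Then pubsub_patterns is initialized. (As an aside, redis's pubsub is an extremely bare-bones implementation with no handling for persistence or brief network interruptions; I do not recommend using it in projects.)
Then the pids of the two background save child processes (bgsavechildpid and bgrewritechildpid) are initialized to -1, the buffers used for aof and rewrite are initialized to empty strings, and a series of statistics are initialized, which we skip.
Two points need explaining:

    server.dirty = 0;

This is used later to work out whether the data held by the server has changed; if it has, the change must be recorded in the aof and passed on to replication.

    server.unixtime = time(NULL);

This keeps a time value with one-second precision, acting as a kind of cache. redis needs the current time in many places; as long as high precision is not required and server.unixtime is refreshed by a reasonable mechanism, this avoids an expensive time system call every time a time value is needed.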
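The cached-clock idea can be sketched in a few lines of C. The names mini_server, srv, update_cached_time, and is_expired are hypothetical; the point is only that hot paths read the cached field while a periodic task is the single place that calls time().

```c
#include <time.h>

/* Hypothetical miniature of the server struct: only the cached clock. */
struct mini_server {
    time_t unixtime;   /* second-precision cache of the current time */
};

static struct mini_server srv;

/* Called from the periodic cron task: the one place that calls time(). */
void update_cached_time(void) {
    srv.unixtime = time(NULL);
}

/* Hot-path check that reads the cache instead of calling time(). */
int is_expired(time_t when) {
    return when <= srv.unixtime;
}
```

The trade-off is staleness: between two refreshes the cached value can lag behind the real clock, which is acceptable only where second-level precision is enough.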
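Next, serverCron is registered. It is a periodically executed function with a period of 100ms; it is an important function and will get a post of its own later. The registration asks for serverCron to be scheduled after 1ms, but :-) nothing actually requires (or guarantees) that serverCron runs exactly 1ms later. aeCreateTimeEvent only registers the function; when it really runs depends on aeMain, which is called after initServer returns and is what actually sets the event loop spinning.

    aeCreateTimeEvent(server.el, 1, serverCron, NULL, NULL);
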
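The gap between registering a timer and actually running it can be shown with a toy single-slot timer loop. Everything here (mini_loop, mini_create_timer, mini_process_timers, count_calls) is hypothetical and much simpler than ae.c; the convention that the callback's return value is the delay until its next run is an assumption of this sketch, chosen to mirror the 100ms period described above.

```c
#include <stddef.h>

/* Callback type: the return value is the delay in ms until the next run,
 * or a value <= 0 to stop (an assumption of this sketch). */
typedef int (*timer_proc)(void *client_data);

typedef struct {
    long long when_ms;   /* absolute time at which the timer is due */
    timer_proc proc;
    void *client_data;
    int active;
} mini_timer;

typedef struct {
    mini_timer timer;    /* a single timer slot keeps the sketch short */
} mini_loop;

/* Registration only records the timer; nothing is called here. */
void mini_create_timer(mini_loop *el, long long now_ms, long long delay_ms,
                       timer_proc proc, void *client_data) {
    el->timer.when_ms = now_ms + delay_ms;
    el->timer.proc = proc;
    el->timer.client_data = client_data;
    el->timer.active = 1;
}

/* One pass of the loop: fire the timer if it is due. Returns the number
 * of callbacks fired (0 or 1). */
int mini_process_timers(mini_loop *el, long long now_ms) {
    if (!el->timer.active || now_ms < el->timer.when_ms) return 0;
    int next = el->timer.proc(el->timer.client_data);
    if (next > 0) el->timer.when_ms = now_ms + next;
    else el->timer.active = 0;
    return 1;
}

/* Example callback: counts its invocations and asks to run every 100ms. */
int count_calls(void *client_data) {
    (*(int *)client_data)++;
    return 100;
}
```

A timer registered with a 1ms delay fires only when mini_process_timers is next called with a late enough clock, however long that takes, which is the same reason the 1ms in the real call is a lower bound rather than a promise.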
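Then initServer adds the listening descriptors (ipfd for TCP, sofd for the Unix domain socket) to the event loop's watch list. Taking ipfd as the example:

    if (server.ipfd > 0 && aeCreateFileEvent(server.el,server.ipfd,AE_READABLE,
        acceptTcpHandler,NULL) == AE_ERR) oom("creating file event");

When a connection request arrives, acceptTcpHandler is called. It calls accept to take the connection, then uses the file descriptor returned by accept to create a client stub (a redisClient object) that represents the real client on the server side. When the stub is created, that descriptor is likewise added to the watch list for READABLE events; such an event means the client has sent data, and readQueryFromClient is then called to receive the client's query.
After creating the listening events above, if the server is configured to persist with aof, the corresponding file is opened and its descriptor saved:

    if (server.appendonly) {
        server.appendfd = open(server.appendfilename,O_WRONLY|O_APPEND|O_CREAT,0644);
        if (server.appendfd == -1) {
            redisLog(REDIS_WARNING, "Can't open the append-only file: %s",
                strerror(errno));
            exit(1);
        }
    }

Next, on a 32-bit system, if no maximum memory limit (maxmemory) is set, the limit is set to 3.5G and maxmemory_policy is set to REDIS_MAXMEMORY_NO_EVICTION, meaning that once the program reaches the memory limit it refuses client commands that would increase memory use. That said, with redis being such a memory hog, 3.5G and 32-bit systems really can no longer keep up with ever-growing demand.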
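The 32-bit safeguard amounts to a small conditional. The sketch below restates it with hypothetical names (mem_config, apply_32bit_limit, and the POLICY_* constants); passing the architecture width as a parameter is an assumption made so the logic can be exercised on any machine.

```c
/* Hypothetical policy constants for the sketch. */
enum { POLICY_VOLATILE_LRU = 0, POLICY_NO_EVICTION = 1 };

typedef struct {
    unsigned long long maxmemory;   /* 0 means "no limit configured" */
    int maxmemory_policy;
} mem_config;

/* On a 32-bit build with no explicit limit, cap memory at 3.5 GB and
 * switch to the no-eviction policy; otherwise leave the config alone. */
void apply_32bit_limit(mem_config *cfg, int arch_bits) {
    if (arch_bits == 32 && cfg->maxmemory == 0) {
        cfg->maxmemory = 3584ULL * 1024 * 1024;   /* 3584 MB = 3.5 GB */
        cfg->maxmemory_policy = POLICY_NO_EVICTION;
    }
}
```

An explicitly configured maxmemory is never overridden; the cap only protects instances that would otherwise grow until the 32-bit address space runs out.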
At the very end, the function initializes slowlog, bio, and a random seed.
For slowlogInit(), see http://www.cnblogs.com/liuhao/archive/2012/05/20/2510725.html
For bioInit(), see http://www.cnblogs.com/liuhao/archive/2012/05/17/2506810.html
That is the end of the tour, over!
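The redis protocol is a plain-text protocol following a request-response model. See this link: http://redis.io/topics/protocol
"The Art of Unix Programming" explicitly advocates plain-text protocols. At the start of the specification, the author of redis points out that the protocol design is a compromise among the following three points: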
- Simple to implement
- Fast to parse by a computer
- Easy enough to parse by a human
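That an implementation so focused on performance chose such a simple, readable protocol should give pause to developers who still stubbornly insist on designing binary protocols.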
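To show how little code the request side of the protocol needs, here is a minimal sketch that encodes a command in the multi-bulk request format from the specification linked above (`*<argc>`, then `$<length>` and the bytes for each argument, every line ending in CRLF). The function name encode_command is hypothetical, and because it uses %s it is limited to NUL-terminated strings, whereas the protocol itself is binary safe.

```c
#include <stdio.h>
#include <string.h>

/* Encodes argc string arguments as one multi-bulk request into buf.
 * Returns the number of bytes written (excluding the trailing NUL),
 * or -1 if buf is too small. Hypothetical helper, not redis code. */
int encode_command(char *buf, size_t cap, int argc, const char **argv) {
    size_t used;
    int n = snprintf(buf, cap, "*%d\r\n", argc);
    if (n < 0 || (size_t)n >= cap) return -1;
    used = (size_t)n;

    for (int i = 0; i < argc; i++) {
        /* "$<len>\r\n<bytes>\r\n" for each argument */
        n = snprintf(buf + used, cap - used, "$%zu\r\n%s\r\n",
                     strlen(argv[i]), argv[i]);
        if (n < 0 || (size_t)n >= cap - used) return -1;
        used += (size_t)n;
    }
    return (int)used;
}
```

Encoding SET mykey myvalue this way yields a buffer that a person can read directly, while the length prefixes let a parser skip scanning for delimiters inside the arguments.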
redis can be used as an application-level cache similar to memcached: when memory exceeds the limit, it evicts keys and values according to the configured policy so that enough room remains for new data.
The redis conf file has a good explanation of this mechanism:

    # Don't use more memory than the specified amount of bytes.
    # When the memory limit is reached Redis will try to remove keys
    # accordingly to the eviction policy selected (see maxmemory-policy).
    #
    # If Redis can't remove keys according to the policy, or if the policy is
    # set to 'noeviction', Redis will start to reply with errors to commands
    # that would use more memory, like SET, LPUSH, and so on, and will continue
    # to reply to read-only commands like GET.
    #
    # This option is usually useful when using Redis as an LRU cache, or to set
    # an hard memory limit for an instance (using the 'noeviction' policy).
    #
    # WARNING: If you have slaves attached to an instance with maxmemory on,
    # the size of the output buffers needed to feed the slaves are subtracted
    # from the used memory count, so that network problems / resyncs will
    # not trigger a loop where keys are evicted, and in turn the output
    # buffer of slaves is full with DELs of keys evicted triggering the deletion
    # of more keys, and so forth until the database is completely emptied.
    #
    # In short... if you have slaves attached it is suggested that you set a lower
    # limit for maxmemory so that there is some free RAM on the system for slave
    # output buffers (but this is not needed if the policy is 'noeviction').
    #
    # maxmemory <bytes>

Note that when redis is used in a master-slave setup, maxmemory should be set somewhat below the actual physical memory, leaving enough room for the slave output buffers.
redis supports the following eviction policies (five that evict keys, plus noeviction):

    # MAXMEMORY POLICY: how Redis will select what to remove when maxmemory
    # is reached? You can select among five behavior:
    #
    # volatile-lru -> remove the key with an expire set using an LRU algorithm
    # allkeys-lru -> remove any key accordingly to the LRU algorithm
    # volatile-random -> remove a random key with an expire set
    # allkeys-random -> remove a random key, any key
    # volatile-ttl -> remove the key with the nearest expire time (minor TTL)
    # noeviction -> don't expire at all, just return an error on write operations

The comments explain it clearly enough, so nothing more needs to be said.
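This cache management is implemented by the freeMemoryIfNeeded function in redis.c. If maxmemory is set, the function is called before every command is executed to check whether enough memory is available, and it either frees memory or returns an error. If it cannot free enough memory, the main logic refuses to run commands that carry the REDIS_CMD_DENYOOM flag and replies to them with the error message command not allowed when used memory > 'maxmemory'.
The code is as follows: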
    int freeMemoryIfNeeded(void) {
        size_t mem_used, mem_tofree, mem_freed;
        int slaves = listLength(server.slaves);

        /* Remove the size of slaves output buffers and AOF buffer from the
         * count of used memory. */
        /* Note: the slave output buffers and the aof buffer are not counted
         * as used memory, so maxmemory should be set below the actual memory
         * size to leave enough room for these two buffers. */
        mem_used = zmalloc_used_memory();
        if (slaves) {
            listIter li;
            listNode *ln;

            listRewind(server.slaves,&li);
            while((ln = listNext(&li))) {
                redisClient *slave = listNodeValue(ln);
                unsigned long obuf_bytes = getClientOutputBufferMemoryUsage(slave);
                if (obuf_bytes > mem_used)
                    mem_used = 0;
                else
                    mem_used -= obuf_bytes;
            }
        }
        if (server.appendonly) {
            mem_used -= sdslen(server.aofbuf);
            mem_used -= sdslen(server.bgrewritebuf);
        }

        /* Check if we are over the memory limit. */
        if (mem_used <= server.maxmemory) return REDIS_OK;

        if (server.maxmemory_policy == REDIS_MAXMEMORY_NO_EVICTION)
            return REDIS_ERR; /* We need to free memory, but policy forbids. */

        /* Compute how much memory we need to free. */
        mem_tofree = mem_used - server.maxmemory;
        mem_freed = 0;
        while (mem_freed < mem_tofree) {
            int j, k, keys_freed = 0;

            for (j = 0; j < server.dbnum; j++) {
                long bestval = 0; /* just to prevent warning */
                sds bestkey = NULL;
                struct dictEntry *de;
                redisDb *db = server.db+j;
                dict *dict;

                if (server.maxmemory_policy == REDIS_MAXMEMORY_ALLKEYS_LRU ||
                    server.maxmemory_policy == REDIS_MAXMEMORY_ALLKEYS_RANDOM)
                {
                    dict = server.db[j].dict;
                } else {
                    dict = server.db[j].expires;
                }
                if (dictSize(dict) == 0) continue;

                /* volatile-random and allkeys-random policy */
                if (server.maxmemory_policy == REDIS_MAXMEMORY_ALLKEYS_RANDOM ||
                    server.maxmemory_policy == REDIS_MAXMEMORY_VOLATILE_RANDOM)
                {
                    /* For random eviction, pick one random key from dict. */
                    de = dictGetRandomKey(dict);
                    bestkey = dictGetEntryKey(de);
                }

                /* volatile-lru and allkeys-lru policy */
                else if (server.maxmemory_policy == REDIS_MAXMEMORY_ALLKEYS_LRU ||
                    server.maxmemory_policy == REDIS_MAXMEMORY_VOLATILE_LRU)
                {
                    /* To cut the computation, redis's LRU, like its expire
                     * eviction, is an approximation rather than an exact
                     * solution: it samples maxmemory_samples keys (default 3)
                     * from the relevant dict and evicts the least recently
                     * used among them. */
                    for (k = 0; k < server.maxmemory_samples; k++) {
                        sds thiskey;
                        long thisval;
                        robj *o;

                        de = dictGetRandomKey(dict);
                        thiskey = dictGetEntryKey(de);
                        /* When policy is volatile-lru we need an additional lookup
                         * to locate the real key, as dict is set to db->expires. */
                        /* (db->expires does not record the key's last
                         * access time.) */
                        if (server.maxmemory_policy == REDIS_MAXMEMORY_VOLATILE_LRU)
                            de = dictFind(db->dict, thiskey);
                        o = dictGetEntryVal(de);
                        thisval = estimateObjectIdleTime(o);

                        /* Higher idle time is better candidate for deletion */
                        if (bestkey == NULL || thisval > bestval) {
                            bestkey = thiskey;
                            bestval = thisval;
                        }
                    }
                }

                /* volatile-ttl */
                else if (server.maxmemory_policy == REDIS_MAXMEMORY_VOLATILE_TTL) {
                    /* Note that ttl works like the LRU case above: the choice
                     * is made among maxmemory_samples sampled keys. */
                    for (k = 0; k < server.maxmemory_samples; k++) {
                        sds thiskey;
                        long thisval;

                        de = dictGetRandomKey(dict);
                        thiskey = dictGetEntryKey(de);
                        thisval = (long) dictGetEntryVal(de);

                        /* Expire sooner (minor expire unix timestamp) is better
                         * candidate for deletion */
                        if (bestkey == NULL || thisval < bestval) {
                            bestkey = thiskey;
                            bestval = thisval;
                        }
                    }
                }

                /* Finally remove the selected key. */
                if (bestkey) {
                    long long delta;

                    robj *keyobj = createStringObject(bestkey,sdslen(bestkey));
                    propagateExpire(db,keyobj); /* propagate the DEL to the slaves */
                    /* We compute the amount of memory freed by dbDelete() alone.
                     * It is possible that actually the memory needed to propagate
                     * the DEL in AOF and replication link is greater than the one
                     * we are freeing removing the key, but we can't account for
                     * that otherwise we would never exit the loop.
                     *
                     * AOF and Output buffer memory will be freed eventually so
                     * we only care about memory used by the key space. */
                    delta = (long long) zmalloc_used_memory();
                    dbDelete(db,keyobj);
                    delta -= (long long) zmalloc_used_memory();
                    mem_freed += delta;
                    server.stat_evictedkeys++;
                    decrRefCount(keyobj);
                    keys_freed++;

                    /* When the memory to free starts to be big enough, we may
                     * start spending so much time here that is impossible to
                     * deliver data to the slaves fast enough, so we force the
                     * transmission here inside the loop. */
                    if (slaves) flushSlavesOutputBuffers();
                }
            } /* one pass over all dbs; the while condition then checks
               * whether the deleted keys freed enough space */
            if (!keys_freed) return REDIS_ERR; /* nothing to free... */
        }
        return REDIS_OK;
    }

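The sampling step at the heart of the LRU policies can be isolated into a short sketch: look at a handful of randomly chosen keys and evict the one that has been idle longest. The names pick_eviction_candidate, index_picker, and round_robin_pick are hypothetical, and the random source is passed in as a function so the behavior can be demonstrated deterministically; redis itself draws the samples with dictGetRandomKey.

```c
#include <stddef.h>

/* Source of sample indexes in [0, n); injected so that the sketch can be
 * driven deterministically. */
typedef size_t (*index_picker)(size_t n, void *state);

/* Returns the index with the largest idle time among `samples` picked
 * keys. idle[i] is the idle time of key i; n must be greater than 0. */
size_t pick_eviction_candidate(const long *idle, size_t n, int samples,
                               index_picker pick, void *state) {
    size_t best = 0;
    long bestval = 0;
    int have_best = 0;

    for (int k = 0; k < samples; k++) {
        size_t i = pick(n, state);
        /* Higher idle time is a better candidate for eviction. */
        if (!have_best || idle[i] > bestval) {
            best = i;
            bestval = idle[i];
            have_best = 1;
        }
    }
    return best;
}

/* Deterministic picker for demonstration: walks the indexes in order. */
size_t round_robin_pick(size_t n, void *state) {
    size_t *next = state;
    size_t i = *next % n;
    (*next)++;
    return i;
}
```

With idle times {5, 40, 7, 99, 1} and three samples covering indexes 0, 1, and 2, the function returns index 1 even though index 3 is the true least recently used key. That is the approximation the policy accepts in exchange for not scanning the whole keyspace.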
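Note that freeMemoryIfNeeded is called before a command is executed and returns OK as soon as current memory use is at or under the limit. The command that runs afterwards may therefore push redis's memory use above maxmemory. So maxmemory is the memory bound that redis enforces before executing a command, not the actual maximum amount of memory redis will occupy (and that is before taking the slave buffers and the aof buffer into account).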
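The check-before-execute order described above can be sketched as a small gate function. The names mini_command, may_execute, CMD_DENYOOM, free_fails, and free_succeeds are hypothetical, and the flag value is arbitrary; the memory-freeing step is passed in as a function pointer so both outcomes can be tried.

```c
#define CMD_DENYOOM 4   /* hypothetical flag value for this sketch */

typedef struct {
    const char *name;
    int flags;
} mini_command;

/* Stand-in for the memory-freeing step: returns 0 on success, -1 when
 * not enough memory could be freed. */
typedef int (*free_memory_fn)(void);

/* Returns 1 if the command may run, 0 if it must be rejected with the
 * "command not allowed when used memory > 'maxmemory'" error. */
int may_execute(const mini_command *cmd, unsigned long long maxmemory,
                free_memory_fn free_memory) {
    if (maxmemory) {
        int retval = free_memory();   /* runs BEFORE the command */
        if ((cmd->flags & CMD_DENYOOM) && retval != 0) return 0;
    }
    return 1;
}

int free_fails(void)    { return -1; }
int free_succeeds(void) { return 0; }
```

Only commands flagged as memory-increasing are refused; a read-only command such as GET passes the gate even when nothing could be freed. Since the gate runs before the command, whatever the command allocates afterwards is not checked until the next command arrives.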
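redis lets you set a timeout on a key so that expired keys are evicted automatically. This post analyzes its adaptive eviction mechanism.
The loop that redis runs on a timer every 100ms (the serverCron function) contains the following statement:

    /* Expire a few keys per cycle, only if this is a master.
     * On slaves we wait for DEL operations synthesized by the master
     * in order to guarantee a strict consistency. */
    if (server.masterhost == NULL) activeExpireCycle();

As the comment says, only the master runs the expire cycle; slaves wait for the DEL messages passed on by the master, which keeps master and slave consistent in how expired values are handled. (As the code below shows, redis picks expired values by random sampling, so master and slave could well pick different ones; hence the master has to synchronize via DEL messages. It also means this expire mechanism is not a reliable one: a key that has timed out is not guaranteed to be deleted by this cycle.)
The activeExpireCycle function is as follows: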

    /* Try to expire a few timed out keys. The algorithm used is adaptive and
     * will use few CPU cycles if there are few expiring keys, otherwise
     * it will get more aggressive to avoid that too much memory is used by
     * keys that can be removed from the keyspace. */
    void activeExpireCycle(void) {
        int j;

        for (j = 0; j < server.dbnum; j++) {
            int expired;
            redisDb *db = server.db+j;

            /* Continue to expire if at the end of the cycle more than 25%
             * of the keys were expired. */
            do {
                long num = dictSize(db->expires);
                time_t now = time(NULL);

                expired = 0;
                if (num > REDIS_EXPIRELOOKUPS_PER_CRON)
                    num = REDIS_EXPIRELOOKUPS_PER_CRON;
                while (num--) {
                    dictEntry *de;
                    time_t t;

                    if ((de = dictGetRandomKey(db->expires)) == NULL) break;
                    t = (time_t) dictGetEntryVal(de);
                    if (now > t) {
                        sds key = dictGetEntryKey(de);
                        robj *keyobj = createStringObject(key,sdslen(key));

                        /* Propagates the delete to all slaves, after first
                         * recording the DEL in the aof. */
                        propagateExpire(db,keyobj);
                        /* Deletes from db->expires first, then from db->dict. */
                        dbDelete(db,keyobj);
                        decrRefCount(keyobj);
                        expired++;
                        server.stat_expiredkeys++;
                    }
                }
            } while (expired > REDIS_EXPIRELOOKUPS_PER_CRON/4);
        }
    }

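Each round of activeExpireCycle samples up to 10 keys (REDIS_EXPIRELOOKUPS_PER_CRON). If more than REDIS_EXPIRELOOKUPS_PER_CRON/4 of them have timed out, which with C integer division means more than 2, i.e. at least 3 of the 10, it goes on to sample another 10. The idea is that when the share of expired keys is high, many are handled within one call; otherwise it waits for the next serverCron cycle to sample again.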
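The adaptive loop can be tried out on a plain array instead of a dict. The sketch below is a simulation under stated assumptions: expire_cycle and LOOKUPS_PER_CRON are hypothetical names, the array of expiry timestamps stands in for db->expires, rand() stands in for dictGetRandomKey, and removal is done by swapping in the last element.

```c
#include <stdlib.h>
#include <time.h>

#define LOOKUPS_PER_CRON 10   /* mirrors the 10 lookups described above */

/* expires[] holds one expiry timestamp per key and *count is the number
 * of live entries. Removes sampled keys whose timestamp is before `now`
 * and returns how many keys were removed in this call. */
long expire_cycle(time_t *expires, long *count, time_t now) {
    long removed = 0;
    int expired;

    do {
        long num = *count;

        expired = 0;
        if (num > LOOKUPS_PER_CRON) num = LOOKUPS_PER_CRON;
        while (num--) {
            if (*count == 0) break;
            long i = rand() % *count;              /* random sample */
            if (now > expires[i]) {
                expires[i] = expires[*count - 1];  /* swap-remove */
                (*count)--;
                expired++;
                removed++;
            }
        }
        /* 10/4 is 2 in integer arithmetic: repeat only when at least
         * 3 of the sampled keys were expired. */
    } while (expired > LOOKUPS_PER_CRON / 4);
    return removed;
}
```

When every key is already expired, a single call keeps looping until the array is empty; when none are expired, it stops after one round of 10 lookups, so the cost stays proportional to how much there is to clean up.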