MySQL Source Code Walkthrough: IO_CACHE

Posted on 2022-05-05 13:00 by 面具下的戏命师


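I/O is critical in a database. To manage it well, most databases handle page data and dirty-page flushing themselves (the Buffer Pool in InnoDB, for example) rather than leaving the scheduling to the file system or the operating system. For sequentially written log data, however, the file system interface is far more convenient. The file system also manages data in pages and presents the application with a contiguous writable space. Its management unit, usually called a block (the term sector properly belongs to the device level), is typically 4 KB, so reads and writes at 4 KB-aligned addresses avoid straddling several units, which helps file system performance considerably. IO_CACHE in MySQL buffers a stream of sequential file reads and writes and turns it into 4 KB-aligned file reads and writes.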

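A file read or write smaller than the IO_CACHE goes into the buffer. When the IO_CACHE fills up, one 4 KB-aligned write is issued. If a single read or write exceeds the IO_CACHE size, the 4 KB-aligned portion is read or written directly and the remainder stays in the IO_CACHE, to be merged with the next request. (One open question: if a read arrives before the next write, can the data still sitting in the IO_CACHE be read? Calling flush_io_cache looks like the answer, but I have not read that far yet, so the question stays open for now.)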
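The buffering rule above can be sketched in a few lines of C++. This is a simplified illustration, not MySQL code: TinyWriteCache, kBlock, and io_sizes are hypothetical names, the "file" is an in-memory vector, and the buffer capacity is fixed at two blocks.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical, simplified stand-in for a write cache; not MySQL's IO_CACHE.
constexpr std::size_t kBlock = 4096;  // plays the role of IO_SIZE

struct TinyWriteCache {
  std::vector<unsigned char> file;    // stands in for the file on disk
  std::vector<unsigned char> buf;     // bytes buffered but not yet written
  std::size_t capacity;               // buffer size, a multiple of kBlock
  std::vector<std::size_t> io_sizes;  // size of each simulated file write

  explicit TinyWriteCache(std::size_t cap = 2 * kBlock) : capacity(cap) {}

  // Write out whatever is buffered as one simulated I/O.
  void flush() {
    if (buf.empty()) return;
    file.insert(file.end(), buf.begin(), buf.end());
    io_sizes.push_back(buf.size());
    buf.clear();
  }

  void write(const unsigned char *data, std::size_t count) {
    if (buf.size() + count <= capacity) {  // small write: just buffer it
      buf.insert(buf.end(), data, data + count);
      return;
    }
    // Fill the buffer completely, then flush it in one I/O.
    const std::size_t rest = capacity - buf.size();
    buf.insert(buf.end(), data, data + rest);
    data += rest;
    count -= rest;
    flush();
    // Write whole blocks directly; keep only the unaligned tail buffered.
    const std::size_t direct = count & ~(kBlock - 1);
    if (direct) {
      file.insert(file.end(), data, data + direct);
      io_sizes.push_back(direct);
      data += direct;
      count -= direct;
    }
    buf.insert(buf.end(), data, data + count);
  }
};
```

Ten 100-byte writes followed by one 20000-byte write end up as two block-aligned file writes (8192 and 12288 bytes), with 520 bytes left in the buffer for the next call.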

Source code analysis

[Source paths: include/my_sys.h, mysys/mf_iocache.cc]

IO_CACHE comes in several types, defined in cache_type:

enum cache_type {
  TYPE_NOT_SET = 0,
  READ_CACHE,
  WRITE_CACHE,
  SEQ_READ_APPEND /* sequential read or append */,
  READ_FIFO,
  READ_NET,
  WRITE_NET
};

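The general log, slow log, error log, and binlog mainly use READ_CACHE, WRITE_CACHE, and SEQ_READ_APPEND, so this post focuses on those three. IO_CACHE also offers an AIO interface and supports concurrent access to one IO_CACHE from several threads, but these currently see little use and are not covered here.

The main code lives in mysys/mf_iocache.cc.

READ_CACHE is a read buffer and WRITE_CACHE is a write buffer. SEQ_READ_APPEND supports both at once: a writer thread keeps appending data to the end of the file while a reader thread reads it. Appends use IO_CACHE::write_buffer and reads use IO_CACHE::buffer; once the reader reaches data that is still in write_buffer, it takes the data from write_buffer. The replication module uses SEQ_READ_APPEND: the IO thread appends data to the relay log, and the SQL thread reads it back and applies it. (Why do the writing threads and the Dump thread on the source server use plain read and write instead of this scheme? Because order_commit on the source is already a likely performance bottleneck, and having it compete with the Dump thread for append_buffer_lock does not look like a good idea.)

Basic data structure

The basic structure is IO_CACHE. The comments in the code are clear enough; it is reproduced here for reference later on.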

struct IO_CACHE /* Used when cacheing files */
{
  /* Offset in file corresponding to the first byte of uchar* buffer. */
  my_off_t pos_in_file{0};
  /*
    The offset of end of file for READ_CACHE and WRITE_CACHE.
    For SEQ_READ_APPEND it the maximum of the actual end of file and
    the position represented by read_end.
  */
  my_off_t end_of_file{0};
  /* Points to current read position in the buffer */
  uchar *read_pos{nullptr};
  /* the non-inclusive boundary in the buffer for the currently valid read */
  uchar *read_end{nullptr};
  uchar *buffer{nullptr}; /* The read buffer */
  /* Used in ASYNC_IO */
  uchar *request_pos{nullptr};

  /* Only used in WRITE caches and in SEQ_READ_APPEND to buffer writes */
  uchar *write_buffer{nullptr};
  /*
    Only used in SEQ_READ_APPEND, and points to the current read position
    in the write buffer. Note that reads in SEQ_READ_APPEND caches can
    happen from both read buffer (uchar* buffer) and write buffer
    (uchar* write_buffer).
  */
  uchar *append_read_pos{nullptr};
  /* Points to current write position in the write buffer */
  uchar *write_pos{nullptr};
  /* The non-inclusive boundary of the valid write area */
  uchar *write_end{nullptr};

  /*
    Current_pos and current_end are convenience variables used by
    my_b_tell() and other routines that need to know the current offset
    current_pos points to &write_pos, and current_end to &write_end in a
    WRITE_CACHE, and &read_pos and &read_end respectively otherwise
  */
  uchar **current_pos{nullptr}, **current_end{nullptr};

  /*
    The lock is for append buffer used in SEQ_READ_APPEND cache
    need mutex copying from append buffer to read buffer.
  */
  mysql_mutex_t append_buffer_lock;
  /*
    The following is used when several threads are reading the
    same file in parallel. They are synchronized on disk
    accesses reading the cached part of the file asynchronously.
    It should be set to NULL to disable the feature.  Only
    READ_CACHE mode is supported.
  */
  IO_CACHE_SHARE *share{nullptr};

  /*
    A caller will use my_b_read() macro to read from the cache
    if the data is already in cache, it will be simply copied with
    memcpy() and internal variables will be accordinging updated with
    no functions invoked. However, if the data is not fully in the cache,
    my_b_read() will call read_function to fetch the data. read_function
    must never be invoked directly.
  */
  int (*read_function)(IO_CACHE *, uchar *, size_t){nullptr};
  /*
    Same idea as in the case of read_function, except my_b_write() needs to
    be replaced with my_b_append() for a SEQ_READ_APPEND cache
  */
  int (*write_function)(IO_CACHE *, const uchar *, size_t){nullptr};
  /*
    Specifies the type of the cache. Depending on the type of the cache
    certain operations might not be available and yield unpredicatable
    results. Details to be documented later
  */
  cache_type type{TYPE_NOT_SET};
  /*
    Callbacks when the actual read I/O happens. These were added and
    are currently used for binary logging of LOAD DATA INFILE - when a
    block is read from the file, we create a block create/append event, and
    when IO_CACHE is closed, we create an end event. These functions could,
    of course be used for other things
  */
  IO_CACHE_CALLBACK pre_read{nullptr};
  IO_CACHE_CALLBACK post_read{nullptr};
  IO_CACHE_CALLBACK pre_close{nullptr};
  /*
    Counts the number of times, when we were forced to use disk. We use it to
    increase the binlog_cache_disk_use and binlog_stmt_cache_disk_use status
    variables.
  */
  ulong disk_writes{0};
  void *arg{nullptr};       /* for use by pre/post_read */
  char *file_name{nullptr}; /* if used with 'open_cached_file' */
  /* directory and name prefix of the temporary file made by open_cached_file */
  char *dir{nullptr}, *prefix{nullptr};
  File file{-1};                               /* file descriptor */
  PSI_file_key file_key{PSI_NOT_INSTRUMENTED}; /* instrumented file key */

  /*
    seek_not_done is set by my_b_seek() to inform the upcoming read/write
    operation that a seek needs to be preformed prior to the actual I/O
    error is 0 if the cache operation was successful, -1 if there was a
    "hard" error, and the actual number of I/O-ed bytes if the read/write was
    partial.
  */
  bool seek_not_done{false};
  int error{0};
  /* buffer_length is memory size allocated for buffer or write_buffer */
  size_t buffer_length{0};
  /* read_length is the same as buffer_length except when we use async io */
  size_t read_length{0};
  myf myflags{0}; /* Flags used to my_read/my_write */
  /*
    alloced_buffer is 1 if the buffer was allocated by init_io_cache() and
    0 if it was supplied by the user.
    Currently READ_NET is the only one that will use a buffer allocated
    somewhere else
  */
  bool alloced_buffer{false};
  // This is an encryptor for encrypting the temporary file of the IO cache.
  Stream_cipher *m_encryptor = nullptr;
  // This is a decryptor for decrypting the temporary file of the IO cache.
  Stream_cipher *m_decryptor = nullptr;
  // Synchronize flushed buffer with disk.
  bool disk_sync{false};
  // Delay in milliseconds after disk synchronization of the flushed buffer.
  // Requires disk_sync = true.
  uint disk_sync_delay{0};
};
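To make the role of current_pos concrete, here is a small sketch of how a logical file offset can be derived from pos_in_file plus the cursor's distance into the buffer, in the spirit of my_b_tell(). MiniCache and tell are hypothetical names, and the sketch measures from buffer for simplicity; which field the real function measures from should be checked in my_sys.h.

```cpp
#include <cassert>
#include <cstdint>

// Simplified, hypothetical subset of the fields above; not the real struct.
struct MiniCache {
  std::uint64_t pos_in_file = 0;          // file offset of buffer[0]
  unsigned char *buffer = nullptr;        // start of the buffer
  unsigned char *read_pos = nullptr;      // current read position
  unsigned char *write_pos = nullptr;     // current write position
  unsigned char **current_pos = nullptr;  // &read_pos or &write_pos
};

// Logical file offset of the cursor: where the buffer starts in the file,
// plus how far the active cursor has moved into the buffer.
inline std::uint64_t tell(const MiniCache &c) {
  return c.pos_in_file + static_cast<std::uint64_t>(*c.current_pos - c.buffer);
}
```

Because current_pos is a pointer to the cursor rather than a copy of it, the same tell() works for read and write caches and always sees the cursor's latest value.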

Initialization

The initialization function is init_io_cache. It initializes an IO_CACHE object and is a thin wrapper around init_io_cache_ext, defined as follows:

int init_io_cache(IO_CACHE *info, File file, size_t cachesize,
                  enum cache_type type, my_off_t seek_offset, bool use_async_io,
                  myf cache_myflags) {
  return init_io_cache_ext(
      info, file, cachesize, type, seek_offset, use_async_io, cache_myflags,
      info->file_key != PSI_NOT_INSTRUMENTED ? info->file_key
                                             : key_file_io_cache);
}

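init_io_cache_ext does the real initialization. It mainly does the following:

Binds the cache to the given file descriptor and initializes the IO_CACHE member variables.
Allocates space for write_buffer and the read buffer.
Initializes the mutex append_buffer_lock (for the SEQ_READ_APPEND type).
Calls init_functions to set up the matching file read and write functions.

The buffer is sized from the cachesize argument. The requested size is usually modest; the binlog's IO_CACHE, for instance, is initialized with IO_SIZE (4 KB). The file system has its own page cache, and data is only guaranteed to reach disk after fsync, so IO_CACHE has no need to buffer much and only has to align the writes. The amount allocated is not simply the amount requested, though. Look at the code: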

  min_cache = use_async_io ? IO_SIZE * 4 : IO_SIZE * 2;
  if (type == READ_CACHE ||
      type == SEQ_READ_APPEND) { /* Assume file isn't growing */
    if (!(cache_myflags & MY_DONT_CHECK_FILESIZE)) {
      /* Calculate end of file to avoid allocating oversized buffers */
      end_of_file = mysql_encryption_file_seek(info, 0L, MY_SEEK_END, MYF(0));
      /* Need to reset seek_not_done now that we just did a seek. */
      info->seek_not_done = !(end_of_file == seek_offset);
      if (end_of_file < seek_offset) end_of_file = seek_offset;
      /* Trim cache size if the file is very small: cap it at the unread
         part of the file plus IO_SIZE * 2 - 1 bytes */
      if ((my_off_t)cachesize > end_of_file - seek_offset + IO_SIZE * 2 - 1) {
        cachesize = (size_t)(end_of_file - seek_offset) + IO_SIZE * 2 - 1;
        use_async_io = false; /* No need to use async */
      }
    }
  }
  cache_myflags &= ~MY_DONT_CHECK_FILESIZE;
  if (type != READ_NET && type != WRITE_NET) {
    /* Retry allocating memory in smaller blocks until we get one.
       First round cachesize up to a multiple of min_cache. */
    cachesize = ((cachesize + min_cache - 1) & ~(min_cache - 1));
    for (;;) {
      size_t buffer_block;
      /*
        Unset MY_WAIT_IF_FULL bit if it is set, to prevent conflict with
        MY_ZEROFILL.
      */
      myf flags = (myf)(cache_myflags & ~(MY_WME | MY_WAIT_IF_FULL));

      if (cachesize < min_cache) cachesize = min_cache;
      buffer_block = cachesize;
      if (type == SEQ_READ_APPEND) buffer_block *= 2;  // read + write buffer
      if (cachesize == min_cache) flags |= (myf)MY_WME;

      if ((info->buffer = (uchar *)my_malloc(key_memory_IO_CACHE, buffer_block,
                                             flags)) != nullptr) {
        info->write_buffer = info->buffer;
        if (type == SEQ_READ_APPEND)
          info->write_buffer = info->buffer + cachesize;
        info->alloced_buffer = true;
        break; /* Enough memory found */
      }
      if (cachesize == min_cache) return 2; /* Can't alloc cache */
      /* Try with less memory: 3/4 of the previous size */
      cachesize = (cachesize * 3 / 4 & ~(min_cache - 1));
    }
  }

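Without AIO the minimum allocation is 8 KB; this matters later. SEQ_READ_APPEND allocates twice the space because it has both a read buffer and a write buffer. If the requested amount cannot be allocated, the code retries with a smaller size.

init_functions sets IO_CACHE::read_function and IO_CACHE::write_function according to the IO_CACHE type. When the buffer cannot satisfy a file I/O request, these two functions are called to exchange data with the file.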
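The sizing arithmetic is easy to try out on its own. The sketch below isolates the two bit tricks from the excerpt; round_up_cache_size and shrink_cache_size are hypothetical helper names, and the clamping in the second helper is a simplification (the real code returns an error once the minimum size itself cannot be allocated).

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t IO_SIZE = 4096;

// Round the requested size up to a multiple of min_cache (a power of two),
// never going below min_cache itself.
inline std::size_t round_up_cache_size(std::size_t cachesize,
                                       bool use_async_io) {
  const std::size_t min_cache = use_async_io ? IO_SIZE * 4 : IO_SIZE * 2;
  cachesize = (cachesize + min_cache - 1) & ~(min_cache - 1);
  return cachesize < min_cache ? min_cache : cachesize;
}

// Size to try after a failed allocation: three quarters of the current
// size, rounded down to a multiple of min_cache. Clamped here for
// simplicity; the excerpt above gives up instead.
inline std::size_t shrink_cache_size(std::size_t cachesize,
                                     std::size_t min_cache) {
  const std::size_t next = (cachesize * 3 / 4) & ~(min_cache - 1);
  return next < min_cache ? min_cache : next;
}
```

So a request for IO_SIZE (4096 bytes) without AIO still yields an 8192-byte buffer, and a failed 32768-byte allocation is retried with 24576 bytes.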

    case SEQ_READ_APPEND:
      info->read_function = _my_b_seq_read;
      info->write_function = nullptr; /* Force a core if used */
      break;
    default:
      info->read_function = info->share ? _my_b_read_r : _my_b_read;
      info->write_function = _my_b_write;

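Writes to a SEQ_READ_APPEND cache call my_b_append directly.

Interface functions

Here are a few commonly used interface functions. The main interfaces are in include/my_sys.h, mostly as macros in older versions and as inline functions in the version quoted here.

my_b_read is defined in my_sys.h. If the requested data is already in the buffer, my_b_read serves it with a plain memory copy.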

inline int my_b_read(IO_CACHE *info, uchar *buffer, size_t count) {
  if (info->read_pos + count <= info->read_end) {  // enough data buffered: read it directly
    memcpy(buffer, info->read_pos, count);
    info->read_pos += count;
    return 0;
  }
  return (*info->read_function)(info, buffer, count);  // not enough buffered, or just initialized: call the read function
}
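The split between an inlined fast path and a function-pointer slow path can be shown with a toy cache. MiniReadCache, refill_stub, and mini_read are hypothetical names; the stub only counts calls, whereas a real slow path would refill the buffer from the file.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical miniature cache, only to illustrate the fast/slow path split.
struct MiniReadCache {
  const unsigned char *read_pos = nullptr;
  const unsigned char *read_end = nullptr;
  int (*read_function)(MiniReadCache *, unsigned char *, std::size_t) = nullptr;
  int slow_calls = 0;  // how many times the slow path ran
};

// Slow-path stand-in: counts the call and reports failure.
inline int refill_stub(MiniReadCache *c, unsigned char *, std::size_t) {
  ++c->slow_calls;
  return 1;
}

// Fast path: a bounds check and a memcpy; anything else is delegated.
inline int mini_read(MiniReadCache *c, unsigned char *out, std::size_t count) {
  if (count <= static_cast<std::size_t>(c->read_end - c->read_pos)) {
    std::memcpy(out, c->read_pos, count);
    c->read_pos += count;
    return 0;
  }
  return c->read_function(c, out, count);
}
```

With eight bytes buffered, a five-byte read is served from memory, and a second five-byte read falls through to the function pointer because only three bytes remain.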

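The read function _my_b_read

The read function first copies out everything left in the cache. It then reads the bulk of the request directly from the file in multiples of IO_SIZE, and finally refills the IO_CACHE buffer from the file and serves the rest of the request from it.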

int _my_b_read(IO_CACHE *info, uchar *Buffer, size_t Count) {
  size_t length, diff_length, left_length, max_length;
  my_off_t pos_in_file;
  DBUG_TRACE;

  /* If the buffer is not empty yet, copy what is available. */
  if ((left_length = (size_t)(info->read_end - info->read_pos))) {
    assert(Count >= left_length); /* User is not using my_b_read() */
    // Copy out whatever is left in the buffer.
    memcpy(Buffer, info->read_pos, left_length);
    Buffer += left_length;  // advance the output pointer past the copied bytes
    Count -= left_length;   // bytes still to be read
  }

  /* pos_in_file always point on where info->buffer was read */
  // file offset of the buffer's first byte + bytes held in the buffer
  // = file offset at which the next file read starts
  pos_in_file = info->pos_in_file + (size_t)(info->read_end - info->buffer);

 /*
    Whenever a function which operates on IO_CACHE flushes/writes
    some part of the IO_CACHE to disk it will set the property
    "seek_not_done" to indicate this to other functions operating
    on the IO_CACHE.
  */
  if (info->seek_not_done) {
    if ((mysql_encryption_file_seek(info, pos_in_file, MY_SEEK_SET, MYF(0)) !=
         MY_FILEPOS_ERROR)) {
      /* No error, reset seek_not_done flag. The file position is now at
         pos_in_file. */
      info->seek_not_done = false;
    } else {
      /*
        If the seek failed and the error number is ESPIPE, it is because
        info->file is a pipe or socket or FIFO.  We never should have tried
        to seek on that.  See Bugs#25807 and #22828 for more info.
      */
      assert(my_errno() != ESPIPE);
      info->error = -1;
      return 1;
    }
  }

  /*
    Calculate, how much we are within a IO_SIZE block. Ideally this
    should be zero.
    The bitwise AND computes pos_in_file modulo IO_SIZE, so that file
    reads can be kept aligned to IO_SIZE.
  */
  diff_length = (size_t)(pos_in_file & (IO_SIZE - 1));

  /*
    If more than a block plus the rest of the current block is wanted,
    we do read directly, without filling the buffer.
  */
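  if (Count >=
      (size_t)(IO_SIZE +
               (IO_SIZE - diff_length))) { /* Fill first intern buffer */
    /* The request covers at least one full block beyond the rest of the
       current block: read a chunk that ends on a block boundary straight
       from the file into the caller's buffer. The remainder is served
       from the cache further down. */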
    size_t read_length;
    if (info->end_of_file <= pos_in_file) {
      /* End of file. Return, what we did copy from the buffer. */
      info->error = (int)left_length;
      return 1;
    }
    /*
      Crop the wanted count to a multiple of IO_SIZE and subtract,
      what we did already read from a block. That way, the read will
      end aligned with a block.
    */
    /* length is Count rounded down to a multiple of IO_SIZE, minus
       diff_length, so that the read ends on an IO_SIZE boundary. */
    length = (Count & (size_t) ~(IO_SIZE - 1)) - diff_length;
    if ((read_length = mysql_encryption_file_read(info, Buffer, length,
                                                  info->myflags)) != length) {
      /*
        If we didn't get, what we wanted, we either return -1 for a read
        error, or (it's end of file), how much we got in total.
      */
      info->error =
          (read_length == (size_t)-1 ? -1 : (int)(read_length + left_length));
      return 1;
    }
    Count -= length;
    Buffer += length;
    pos_in_file += length;
    left_length += length;
    diff_length = 0;  // now aligned
  }

  /*
    At this point, we want less than one and a partial block.
    We will read a full cache, minus the number of bytes, we are
    within a block already. So we will reach new alignment.
  */
  /* Refill the cache buffer with up to read_length bytes from the file. */
  max_length = info->read_length - diff_length;
  /* We will not read past end of file. */
  if (info->type != READ_FIFO && max_length > (info->end_of_file - pos_in_file))
    max_length = (size_t)(info->end_of_file - pos_in_file);
  /*
    If there is nothing left to read,  
      we either are done, or we failed to fulfill the request.
    Otherwise, we read max_length into the cache.
  */
  if (!max_length) {
    if (Count) {
      /* We couldn't fulfil the request. Return, how much we got. */
      info->error = (int)left_length;
      return 1;
    }
    length = 0; /* Didn't read any chars */
  } else if ((length = mysql_encryption_file_read(
                  info, info->buffer, max_length, info->myflags)) < Count ||
             length == (size_t)-1) {
    /*
      We got an read error, or less than requested (end of file).
      If not a read error, copy, what we got.
    */
    if (length != (size_t)-1) memcpy(Buffer, info->buffer, length);
    info->pos_in_file = pos_in_file;
    /* For a read error, return -1, otherwise, what we got in total. */
    info->error = length == (size_t)-1 ? -1 : (int)(length + left_length);
    info->read_pos = info->read_end = info->buffer;
    return 1;
  }
  /*
    Count is the remaining number of bytes requested.
    length is the amount of data in the cache.
    Read Count bytes from the cache.
  */
  /* Update read_pos, read_end and pos_in_file. */
  info->read_pos = info->buffer + Count;
  info->read_end = info->buffer + length;
  info->pos_in_file = pos_in_file;
  memcpy(Buffer, info->buffer, Count);
  return 0;
}

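Note that file reads are done in units of IO_SIZE; if the starting position is not block-aligned, diff_length is used to realign. info->buffer is the start of the buffer, info->read_pos is the current read position, and info->read_end is the end of the valid data in the buffer. info->pos_in_file is the file offset of the first byte of the buffer, so the next file read starts at info->pos_in_file + (info->read_end - info->buffer). The number of bytes left in the buffer is info->read_end - info->read_pos.

The write function my_b_write

In WRITE_CACHE mode, IO_CACHE allocates a buffer as the write cache. Content destined for the file is first written into this buffer and later written to the file in one go, which merges I/O operations and improves I/O performance.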
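The alignment arithmetic of the direct-read branch can be checked in isolation. The sketch below copies only the two expressions involving diff_length from _my_b_read; DirectRead and plan_direct_read are hypothetical names introduced for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t IO_SIZE = 4096;

struct DirectRead {
  bool direct;         // true if the request bypasses the cache buffer
  std::size_t length;  // bytes read straight into the caller's buffer
};

// Decide whether a request of `count` bytes starting at file offset
// `pos_in_file` takes the direct path, and how many bytes that read covers.
inline DirectRead plan_direct_read(std::uint64_t pos_in_file,
                                   std::size_t count) {
  const std::size_t diff_length =
      static_cast<std::size_t>(pos_in_file & (IO_SIZE - 1));
  if (count >= IO_SIZE + (IO_SIZE - diff_length)) {
    // Round count down to whole blocks, then give back the bytes by which
    // the start is already inside a block, so the read ends aligned.
    return {true, (count & ~(IO_SIZE - 1)) - diff_length};
  }
  return {false, 0};
}
```

For example, at offset 5000 (904 bytes into a block) a 10000-byte request reads 7288 bytes directly and ends at offset 12288, which is exactly three blocks.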

inline int my_b_write(IO_CACHE *info, const uchar *buffer, size_t count) {
  if (info->write_pos + count <= info->write_end) {
    // enough room left in the buffer: write straight into it
    memcpy(info->write_pos, buffer, count);
    info->write_pos += count;
    return 0;
  }
  // otherwise call the write function
  return (*info->write_function)(info, buffer, count);
}

The write function _my_b_write first fills the remaining space in the cache with as much of the data as fits, then flushes the full cache to disk.

  /* Fill the remaining space in the buffer. */
  rest_length = (size_t)(info->write_end - info->write_pos);
  memcpy(info->write_pos, Buffer, (size_t)rest_length);
  Buffer += rest_length;
  Count -= rest_length;
  info->write_pos += rest_length;
  /* Flush the buffer to the file. */
  if (my_b_flush_io_cache(info, 1)) return 1;

  /* If at least IO_SIZE bytes remain, write the IO_SIZE-aligned part
     directly to the file. */
  if (Count >= IO_SIZE) { /* Fill first intern buffer */
    length = Count & (size_t) ~(IO_SIZE - 1); /* round down to IO_SIZE */
    if (DBUG_EVALUATE_IF("verify_mysql_encryption_file_write_bytes", false,
                         true)) {
      if (mysql_encryption_file_write(info, Buffer, length,
                                      info->myflags | MY_NABP))
        return info->error = -1;
    }
    /* Account for the bytes written directly (other details of the real
       function are left out of this excerpt). */
    Count -= length;
    Buffer += length;
    info->pos_in_file += length;
  }
  /* Copy what is left into the IO_CACHE buffer. */
  memcpy(info->write_pos, Buffer, (size_t)Count);
  info->write_pos += Count;

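The function my_b_flush_io_cache flushes the cache to the file. info->write_buffer is the start of the write cache, info->write_pos is the current write position in it, and info->write_end is the furthest position that may be written.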

    if ((length = (size_t)(info->write_pos - info->write_buffer))) {
      /* length is the number of bytes held in the write buffer */
      pos_in_file = info->pos_in_file; /* saved for use below */

      if (!append_cache) info->pos_in_file += length; /* advance past the flushed bytes */
      /*
        Shrink the usable buffer by the amount by which the file position
        after this flush (pos_in_file + length) lies past an IO_SIZE
        boundary. Filling the shortened buffer then makes the next flush
        end on an aligned position. With a small buffer this requirement
        leaves less usable space and so triggers more file I/O. If the
        buffer was flushed full, the position is already aligned.
      */
      info->write_end = (info->write_buffer + info->buffer_length -
                         ((pos_in_file + length) & (IO_SIZE - 1)));

      if (mysql_encryption_file_write(info, info->write_buffer, length,
                                      info->myflags | MY_NABP))
        info->error = -1;
      else
        info->error = 0;

      /* reset write_pos and append_read_pos */
      info->append_read_pos = info->write_pos = info->write_buffer;
      ++info->disk_writes; /* one more flush to disk */
    }
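The effect of the write_end adjustment is easiest to see with numbers. The helper below evaluates just that expression; usable_after_flush is a hypothetical name, and the result is the number of bytes the buffer may hold before the next flush.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t IO_SIZE = 4096;

// Usable buffer capacity after flushing `length` bytes that started at file
// offset `pos_in_file`, for a buffer of `buffer_length` bytes.
inline std::size_t usable_after_flush(std::uint64_t pos_in_file,
                                      std::size_t length,
                                      std::size_t buffer_length) {
  return buffer_length -
         static_cast<std::size_t>((pos_in_file + length) & (IO_SIZE - 1));
}
```

After an early flush of 100 bytes at offset 0 with an 8192-byte buffer, only 8092 bytes are usable, so the next full flush ends at offset 8192 and the file position is aligned again.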

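SEQ_READ_APPEND mode

SEQ_READ_APPEND is a shared read/write mode. For one file it allocates two buffers as the file cache, one for reading and one for writing. When the read position catches up with the write position, data can be read straight from the write cache, which saves file reads and improves I/O efficiency. The IO thread opens the relay log in this mode when writing it; when the SQL thread and the IO thread work on the same file, they share the same IO_CACHE.

Initialization follows roughly the same flow as in READ_CACHE mode, except that two buffers are allocated, one for reading and one for writing. The process is: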

int init_io_cache_ext(IO_CACHE *info, File file, size_t cachesize, ...cache_myflags...
| if (type == SEQ_READ_APPEND)
    buffer_block *= 2;                                  /* double the buffer size */
| info->buffer= (uchar*) my_malloc(......,buffer_block, flags)
| if (type == SEQ_READ_APPEND)
     /* info->buffer and info->write_buffer point at the start of the two buffers */
     info->write_buffer = info->buffer + cachesize;
| if (type == SEQ_READ_APPEND)
    info->append_read_pos = info->write_pos = info->write_buffer;   /* append_read_pos is used when reading */
    info->write_end = info->write_buffer + info->buffer_length;
    mysql_mutex_init(key_IO_CACHE_append_buffer_lock,......         /* lock protecting reads and writes of the same buffer */
| init_functions
    info->read_function = _my_b_seq_read;
    info->write_function = 0;                           /* not used */

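The function my_b_append

Writes go through a dedicated append function, and every write is appended to the end of the file. The buffer is used as in WRITE_CACHE mode: while there is room, data is written into the buffer first; when there is not, the buffer contents are appended to the end of the file and the buffer is reused.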

int my_b_append(IO_CACHE *info, const uchar *Buffer, size_t Count)
| lock_append_buffer(info);
| rest_length= (size_t) (info->write_end - info->write_pos);      /* space left in the buffer */
/* if there is enough space */
| memcpy(info->write_pos,Buffer,(size_t) Count);                   /* write straight into the buffer */
| info->write_pos+=Count;

/* if there is not enough space */
| memcpy(info->write_pos, Buffer, rest_length);                    /* first fill what is left of the buffer */
| my_b_flush_io_cache(info,0)                                      /* append the buffer contents to the file and reset the write buffer */
| length=Count & (size_t) ~(IO_SIZE-1);                            /* IO_SIZE-aligned part of what remains, written directly to the file */
| memcpy(info->write_pos,Buffer,(size_t) Count);                   /* copy the unaligned remainder into the fresh buffer */
| info->write_pos+=Count;

| unlock_append_buffer(info);

The function _my_b_seq_read

When the read buffer cannot satisfy a request, the read function is called; it differs from the one used in READ_CACHE mode.

_my_b_seq_read
| if ((left_length=(size_t) (info->read_end-info->read_pos)))
     DBUG_ASSERT(Count > left_length);                   /* the requested length must exceed what is left in the read buffer */
| lock_append_buffer(info);
| if ((pos_in_file=info->pos_in_file + (size_t) (info->read_end - info->buffer)) >= info->end_of_file)
|      goto read_append_buffer;                          /* reached the end of the file: continue from write_buffer */

When the read has not reached write_buffer:
| if (mysql_file_seek(info->file, pos_in_file, MY_SEEK_SET, MYF(0)) == MY_FILEPOS_ERROR)
| normal read, the same way as in READ_CACHE mode

When the read has reached write_buffer:
  /* initially info->append_read_pos equals info->write_pos; appends advance info->write_pos */
| len_in_buff = (size_t) (info->write_pos - info->append_read_pos);
| copy_len= MY_MIN(Count, len_in_buff);                  /* copy the smaller of the two to the caller's buffer */

/* copy the requested bytes to the caller's buffer */
| memcpy(Buffer, info->append_read_pos, copy_len);
| info->append_read_pos += copy_len;

/* move the bytes remaining in write_buffer into the read buffer */
| memcpy(info->buffer, info->append_read_pos,
       (size_t) (transfer_len=len_in_buff - copy_len));
| info->read_pos= info->buffer;
  info->read_end= info->buffer+transfer_len;             /* end of the transferred bytes */
  info->append_read_pos=info->write_pos;                 /* advance append_read_pos to write_pos */
  info->pos_in_file=pos_in_file+copy_len;                /* file offset of the start of info->buffer */
  info->end_of_file+=len_in_buff;                        /* maximum readable offset */

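info->read_pos, info->read_end, and info->buffer are used exactly as in READ_CACHE mode.

info->write_pos, info->write_end, and info->write_buffer are used exactly as in WRITE_CACHE mode.

info->append_read_pos is the variable used when a read of the IO_CACHE reaches write_buffer (the end of the file). info->write_pos - info->append_read_pos gives the number of bytes that can be read from write_buffer.

Summary:

IO_CACHE is the module MySQL uses to improve I/O performance. For file operations (reads or writes) shorter than IO_SIZE (4096 bytes by default), it merges small I/O operations and so improves I/O performance. If every read or write is longer than IO_SIZE * 2, IO_CACHE operates on the file directly each time and brings no such benefit.

SEQ_READ_APPEND mode also deserves attention. It exists to improve performance when the IO thread and the SQL thread work on the same file, and in 5.7 and earlier versions it is the mode in which the relay log shared by the two threads is opened. Because of the locking involved, in MySQL 8 the IO thread and the SQL thread each use their own IO_CACHE and no longer share one in SEQ_READ_APPEND mode.

Note: this post draws on the two blog posts below and quotes parts of them directly.

References: https://zhuanlan.zhihu.com/p/414142517

http://www.javashuo.com/article/p-cywhccvd-k.html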
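The idea of a reader that continues into the writer's unflushed tail can be sketched with a toy class. Everything here is hypothetical and simplified: MiniAppendCache keeps the "file" in memory, tracks a single logical read offset instead of separate read and append_read pointers, and takes the lock on every call, whereas the outline above only locks once the read buffer is exhausted.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <vector>

// Hypothetical miniature of the SEQ_READ_APPEND idea; not MySQL code.
class MiniAppendCache {
 public:
  explicit MiniAppendCache(std::size_t capacity) : capacity_(capacity) {}

  // Writer side: append under the lock, spilling to the "file" when the
  // pending buffer cannot take the new bytes.
  void append(const unsigned char *data, std::size_t count) {
    std::lock_guard<std::mutex> guard(lock_);
    if (pending_.size() + count > capacity_) {
      file_.insert(file_.end(), pending_.begin(), pending_.end());
      pending_.clear();
    }
    if (count > capacity_) {
      file_.insert(file_.end(), data, data + count);  // too big to buffer
    } else {
      pending_.insert(pending_.end(), data, data + count);
    }
  }

  // Reader side: consume flushed bytes first, then the unflushed tail.
  std::size_t read(unsigned char *out, std::size_t count) {
    std::lock_guard<std::mutex> guard(lock_);
    const std::size_t total = file_.size() + pending_.size();
    std::size_t done = 0;
    for (; done < count && read_offset_ < total; ++done, ++read_offset_) {
      out[done] = read_offset_ < file_.size()
                      ? file_[read_offset_]
                      : pending_[read_offset_ - file_.size()];
    }
    return done;
  }

  // Number of bytes that have reached the "file".
  std::size_t flushed() {
    std::lock_guard<std::mutex> guard(lock_);
    return file_.size();
  }

 private:
  std::mutex lock_;
  std::size_t capacity_;
  std::size_t read_offset_ = 0;         // logical offset of the next read
  std::vector<unsigned char> file_;     // bytes already flushed
  std::vector<unsigned char> pending_;  // bytes still in the write buffer
};
```

The reader never has to wait for a flush: bytes appended a moment ago are visible through the pending buffer, which is the property the relay log use case relies on.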
