MySQL Source Code: Studying the innodb_flush_log_at_trx_commit Parameter

The innodb_flush_log_at_trx_commit parameter

The parameter innodb_flush_log_at_trx_commit accepts the following values:

  • 0: Logs are written and flushed to disk once per second.
  • 1: Logs are written and flushed to disk at each transaction commit.
  • 2: Logs are written after each transaction commit and flushed to disk once per second.

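When innodb_flush_log_at_trx_commit=0, a transaction commit performs no redo log write or flush of its own; the log is written and flushed only by the periodic background task. A crash of either the database instance or the database server can lose the transactions committed since the last background flush. Performance is the best of the three settings and safety is the worst.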

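When innodb_flush_log_at_trx_commit=1, each commit writes the transaction log to the file system cache and flushes it to disk. Neither a crash of the database instance nor a crash of the database server loses committed data. Safety is the best of the three settings and performance is the worst.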

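When innodb_flush_log_at_trx_commit=2, each commit writes the transaction log to the file system cache but does not flush it to disk. A crash of the database instance loses nothing, because the operating system still holds the written log; a crash of the database server can lose whatever has not yet been flushed. Safety and performance are both in the middle.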

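The InnoDB master thread runs a loop that sleeps for 1 second in every iteration:

  • When the work done in an iteration is short, the loop runs roughly once per second.
  • When the work done in an iteration is long, the loop completes only once every N seconds.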

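In this loop, both srv_master_do_active_tasks and srv_master_do_idle_tasks call log_buffer_sync_in_background (older versions such as 5.7 name it srv_sync_log_buffer_in_background) to write the transaction log to the file system cache and flush it to disk. This function never looks at innodb_flush_log_at_trx_commit, so with a value of 0 or 2 the log is still flushed about once per second. In 8.0.26 the function performs the write itself only when the log writer threads are paused; otherwise the dedicated log threads do the work.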

Function calls made while the InnoDB storage engine commits a transaction:

innobase_commit()
	====>trx_commit_complete_for_mysql()
		====>trx_flush_log_if_needed()
			====>log_write_up_to()--for ddl
			====>trx_flush_log_if_needed_low()--for dml
				====>log_write_up_to()

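Note: the code below is based on MySQL 8.0.26.

Mapping between the configuration parameter and the system variable

The parameter innodb_flush_log_at_trx_commit is mapped to the variable srv_flush_log_at_trx_commit, and srv_flush_log_at_trx_commit determines how the log is written to the cache and flushed to disk.

## File storage\innobase\handler\ha_innodb.cc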
static MYSQL_SYSVAR_ULONG(flush_log_at_trx_commit, srv_flush_log_at_trx_commit,
                          PLUGIN_VAR_OPCMDARG,
                          "Set to 0 (write and flush once per second),"
                          " 1 (write and flush at each commit),"
                          " or 2 (write at commit, flush once per second).",
                          nullptr, nullptr, 1, 0, 2, 0);

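Flushing the transaction log file to disk once per second

In the master thread, every iteration of the loop calls srv_master_sleep to sleep for 1 second, but this does not guarantee that the loop runs once per second.

## File storage\innobase\srv\srv0srv.cc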

/** Executes the main loop of the master thread.
@param[in]   slot     slot reserved as SRV_MASTER */
static void srv_master_main_loop(srv_slot_t *slot) {
  if (srv_force_recovery >= SRV_FORCE_NO_BACKGROUND) {
    /* When innodb_force_recovery is at least SRV_FORCE_NO_BACKGROUND,
    we avoid performing active/idle master's tasks. However, we still
    need to ensure that:
      srv_shutdown_state >= SRV_SHUTDOWN_PRE_DD_AND_SYSTEM_TRANSACTIONS,
    after we exited srv_master_main_loop(). Keep waiting until that
    is satisfied and then exit. */
    while (srv_shutdown_state.load() <
           SRV_SHUTDOWN_PRE_DD_AND_SYSTEM_TRANSACTIONS) {
      srv_master_wait(slot);
    }
    return;
  }

  ulint old_activity_count = srv_get_activity_count();

  while (srv_shutdown_state.load() <
         SRV_SHUTDOWN_PRE_DD_AND_SYSTEM_TRANSACTIONS) {
    srv_master_sleep();
	........
    if (srv_check_activity(old_activity_count)) {
      old_activity_count = srv_get_activity_count();
      srv_master_do_active_tasks();
    } else {
      srv_master_do_idle_tasks();
    }
    ........
  }
}

/** Perform the tasks that the master thread is supposed to do when the
 server is active. There are two types of tasks. The first category is
 of such tasks which are performed at each inovcation of this function.
 We assume that this function is called roughly every second when the
 server is active. The second category is of such tasks which are
 performed at some interval e.g.: purge, dict_LRU cleanup etc. */
static void srv_master_do_active_tasks(void) {
  ...............
  /* Flush logs if needed */
  log_buffer_sync_in_background();
}

/** Perform the tasks that the master thread is supposed to do whenever the
 server is idle. We do check for the server state during this function
 and if the server has entered the shutdown phase we may return from
 the function without completing the required tasks.
 Note that the server can move to active state when we are executing this
 function but we don't check for that as we are suppose to perform more
 or less same tasks when server is active. */
static void srv_master_do_idle_tasks(void) {
  ...............
  /* Flush logs if needed */
  log_buffer_sync_in_background();
}

Both srv_master_do_active_tasks and srv_master_do_idle_tasks call log_buffer_sync_in_background to write the log to the cache and flush it to disk.

## File storage\innobase\log\log0buf.cc
/** Writes the log buffer to the log file. It is intended to be called from
background master thread periodically. If the log writer threads are active,
this function writes nothing. */
void log_buffer_sync_in_background() {
  log_t &log = *log_sys;

  /* Just to be sure not to miss advance */
  log.recent_closed.advance_tail();

  /* If the log flusher thread is working, no need to call. */
  if (log.writer_threads_paused.load(std::memory_order_acquire)) {
    log.recent_written.advance_tail();
    log_buffer_flush_to_disk(log, true);
  }
}

void log_buffer_flush_to_disk(log_t &log, bool sync) {
  ut_a(!srv_read_only_mode);
  ut_a(!recv_recovery_is_on());

  const lsn_t lsn = log_get_lsn(log);

  log_write_up_to(log, lsn, sync);
}

The function log_buffer_flush_to_disk calls log_write_up_to(log, lsn, flush_to_disk=true) to write the log to the cache and flush it to disk.

## File storage\innobase\log\log0write.cc
Wait_stats log_write_up_to(log_t &log, lsn_t end_lsn, bool flush_to_disk) {
  ut_a(!srv_read_only_mode);

  /* If we were updating log.flushed_to_disk_lsn while parsing redo log
  during recovery, we would have valid value here and we would not need
  to explicitly exit because of the recovery. However we do not update
  the log.flushed_to_disk during recovery (it is zero).

  On the other hand, when we apply log records during recovery, we modify
  pages and update their oldest/newest_modification. The modified pages
  become dirty. When size of the buffer pool is too small, some pages
  have to be flushed from LRU, to reclaim a free page for a next read.

  When flushing such dirty pages, we notice that newest_modification != 0,
  so the redo log has to be flushed up to the newest_modification, before
  flushing the page. In such case we end up here during recovery.

  Note that redo log is actually flushed, because changes to the page
  are caused by applying the redo. */

  if (recv_no_ibuf_operations) {
    /* Recovery is running and no operations on the log files are
    allowed yet, which is implicitly deduced from the fact, that
    still ibuf merges are disallowed. */
    return (Wait_stats{0});
  }

  /* We do not need to have exact numbers and we do not care if we
  lost some increments for heavy workload. The value only has usage
  when it is low workload and we need to discover that we request
  redo write or flush only from time to time. In such case we prefer
  to avoid spinning in log threads to save on CPU power usage. */
  log.write_to_file_requests_total.store(
      log.write_to_file_requests_total.load(std::memory_order_relaxed) + 1,
      std::memory_order_relaxed);

  ut_a(end_lsn != LSN_MAX);

  ut_a(end_lsn % OS_FILE_LOG_BLOCK_SIZE == 0 ||
       end_lsn % OS_FILE_LOG_BLOCK_SIZE >= LOG_BLOCK_HDR_SIZE);

  ut_a(end_lsn % OS_FILE_LOG_BLOCK_SIZE <=
       OS_FILE_LOG_BLOCK_SIZE - LOG_BLOCK_TRL_SIZE);

  ut_ad(end_lsn <= log_get_lsn(log));

  Wait_stats wait_stats{0};
  bool interrupted = false;

retry:
  if (log.writer_threads_paused.load(std::memory_order_acquire)) {
    /* the log writer threads are paused not to waste CPU resource. */
    wait_stats +=
        log_self_write_up_to(log, end_lsn, flush_to_disk, &interrupted);

    if (UNIV_UNLIKELY(interrupted)) {
      /* the log writer threads might be working. retry. */
      goto retry;
    }

    DEBUG_SYNC_C("log_flushed_by_self");
    return (wait_stats);
  }

  /* the log writer threads are working for high concurrency scale */
  if (flush_to_disk) {
    if (log.flushed_to_disk_lsn.load() >= end_lsn) {
      DEBUG_SYNC_C("log_flushed_by_writer");
      return (wait_stats);
    }

    if (srv_flush_log_at_trx_commit != 1) {
      /* We need redo flushed, but because trx != 1, we have
      disabled notifications sent from log_writer to log_flusher.

      The log_flusher might be sleeping for 1 second, and we need
      quick response here. Log_writer avoids waking up log_flusher,
      so we must do it ourselves here.

      However, before we wake up log_flusher, we must ensure that
      log.write_lsn >= lsn. Otherwise log_flusher could flush some
      data which was ready for lsn values smaller than end_lsn and
      return to sleeping for next 1 second. */

      if (log.write_lsn.load() < end_lsn) {
        /* Wait for the log to be written to the file system cache */
        wait_stats += log_wait_for_write(log, end_lsn, &interrupted);
      }
    }

    /* Wait until log gets flushed up to end_lsn. */
    /* Wait for the log to be flushed to storage */
    wait_stats += log_wait_for_flush(log, end_lsn, &interrupted);

    if (UNIV_UNLIKELY(interrupted)) {
      /* the log writer threads might be paused. retry. */
      goto retry;
    }

    DEBUG_SYNC_C("log_flushed_by_writer");
  } else {
    if (log.write_lsn.load() >= end_lsn) {
      return (wait_stats);
    }

    /* Wait until log gets written up to end_lsn. */
    wait_stats += log_wait_for_write(log, end_lsn, &interrupted);

    if (UNIV_UNLIKELY(interrupted)) {
      /* the log writer threads might be paused. retry. */
      goto retry;
    }
  }

  return (wait_stats);
}

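The case innodb_flush_log_at_trx_commit==1

Both innobase_flush_logs and trx_flush_log_if_needed_low check whether innodb_flush_log_at_trx_commit equals 1. If it does, they end up calling log_write_up_to to write the log to the cache and flush it to disk (trx_flush_log_if_needed_low skips the flush on non-Windows platforms when the flush method is SRV_UNIX_NOSYNC).

## File storage\innobase\handler\ha_innodb.cc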
/** Flush InnoDB redo logs to the file system.
@param[in]	hton			InnoDB handlerton
@param[in]	binlog_group_flush	true if we got invoked by binlog
group commit during flush stage, false in other cases.
@return false */
static bool innobase_flush_logs(handlerton *hton, bool binlog_group_flush) {
  DBUG_TRACE;
  assert(hton == innodb_hton_ptr);

  if (srv_read_only_mode) {
    return false;
  }

  /* If !binlog_group_flush, we got invoked by FLUSH LOGS or similar.
  Else, we got invoked by binlog group commit during flush stage. */

  if (binlog_group_flush && srv_flush_log_at_trx_commit == 0) {
    /* innodb_flush_log_at_trx_commit=0
    (write and sync once per second).
    Do not flush the redo log during binlog group commit. */

    /* This could be unsafe if we grouped at least one DDL transaction,
    and we removed !trx->ddl_must_flush from condition which is checked
    inside trx_commit_complete_for_mysql() when we decide if we could
    skip the flush. */
    return false;
  }

  /* Signal and wait for all GTIDs to persist on disk. */
  if (!binlog_group_flush) {
    auto &gtid_persistor = clone_sys->get_gtid_persistor();
    gtid_persistor.wait_flush(true, true, nullptr);
  }

  /* Flush the redo log buffer to the redo log file.
  Sync it to disc if we are in FLUSH LOGS, or if
  innodb_flush_log_at_trx_commit=1
  (write and sync at each commit). */
  log_buffer_flush_to_disk(!binlog_group_flush ||
                           srv_flush_log_at_trx_commit == 1);

  return false;
}

## File storage\innobase\log\log0buf.cc
void log_buffer_flush_to_disk(log_t &log, bool sync) {
  ut_a(!srv_read_only_mode);
  ut_a(!recv_recovery_is_on());

  const lsn_t lsn = log_get_lsn(log);

  log_write_up_to(log, lsn, sync);
}

## File storage\innobase\trx\trx0trx.cc
/** If required, flushes the log to disk based on the value of
 innodb_flush_log_at_trx_commit. */
static void trx_flush_log_if_needed_low(lsn_t lsn) /*!< in: lsn up to which logs
                                                   are to be flushed. */
{
#ifdef _WIN32
  bool flush = true;
#else
  bool flush = srv_unix_file_flush_method != SRV_UNIX_NOSYNC;
#endif /* _WIN32 */

  Wait_stats wait_stats;

  switch (srv_flush_log_at_trx_commit) {
    case 2:
      /* Write the log but do not flush it to disk */
      flush = false;
      /* fall through */
    case 1:
      /* Write the log and optionally flush it to disk */
      wait_stats = log_write_up_to(*log_sys, lsn, flush);

      MONITOR_INC_WAIT_STATS(MONITOR_TRX_ON_LOG_, wait_stats);

      return;
    case 0:
      /* Do nothing */
      return;
  }
}

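The case innodb_flush_log_at_trx_commit==2

In trx_flush_log_if_needed_low above, a switch statement tests the value of srv_flush_log_at_trx_commit. case 2 comes first: it sets flush to false and, because it has no break, execution falls through into case 1, which calls log_write_up_to. With flush=false, log_write_up_to writes the log to the operating system cache but does not flush it to storage.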

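Simulation code:

#include <iostream>
#include <string>

using namespace std;

// Mimics the switch in trx_flush_log_if_needed_low for the value 2.
int flush_data() {
  int srv_flush_log_at_trx_commit = 2;
  string flush = "flush";
  switch (srv_flush_log_at_trx_commit) {
    case 2:
      cout << "case 2: " << flush << endl;
      flush = "no flush";
      /* fall through */
    case 1:
      cout << "case 1: " << flush << endl;
      cout << "Writing log, flush policy: " << flush << endl;
      return 0;
    case 0:
      /* Do nothing */
      return 0;
  }
  // Not reached for 0, 1 or 2; keeps every path of the function returning.
  return 0;
}

int main() {
  cout << "Starting simulation......" << endl;
  flush_data();
  cout << "Simulation finished......" << endl;
  return 0;
}

Simulation output:

Starting simulation......
case 2: flush
case 1: no flush
Writing log, flush policy: no flush
Simulation finished......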

In earlier MySQL versions the code was:

/**********************************************************************//**
If required, flushes the log to disk based on the value of
innodb_flush_log_at_trx_commit. */
static __attribute__((nonnull))
void trx_flush_log_if_needed_low(
/*========================*/
	lsn_t	lsn)	/*!< in: lsn up to which logs are to be
			flushed. */
{
	switch (srv_flush_log_at_trx_commit) {
	case 0:
		/* Do nothing */
		break;
	case 1:
		/* Write the log and optionally flush it to disk */
		log_write_up_to(lsn, LOG_WAIT_ONE_GROUP,
				srv_unix_file_flush_method != SRV_UNIX_NOSYNC);
		break;
	case 2:
		/* Write the log but do not flush it to disk */
		log_write_up_to(lsn, LOG_WAIT_ONE_GROUP, FALSE);

		break;
	default:
		ut_error;
	}
}

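The old version ends every case with a break, so execution leaves the switch as soon as a case finishes. The new version is more compact, but it relies on case 2 falling through into case 1, which can confuse readers who are not familiar with C++.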

posted @ 2021-08-05 09:12  TeyGao