NVMe and NVMe-oF

http://www.ssdfans.com/?p=8137 蛋蛋读nvme之一
http://www.ssdfans.com/?p=8139 蛋蛋读nvme之二
http://www.ssdfans.com/?p=8141 蛋蛋读nvme之三
http://www.ssdfans.com/?p=8146 蛋蛋读nvme之四
http://www.ssdfans.com/?p=8148 蛋蛋读nvme之五
http://www.ssdfans.com/?p=8157 蛋蛋读nvme之六

Study notes and open questions based on the articles above.

蛋蛋读nvme之一

  1. What are the maximum command queue count and depth for AHCI and for NVMe? Why not simply increase AHCI's queue count and depth?
    AHCI: 1 queue, 32 deep; NVMe: up to 64K queues, each up to 64K deep. The limit is a hardware/software compatibility issue: AHCI was designed to manage drives (HDD/SSD) behind a SATA interface, supports NCQ (Native Command Queuing), and the SATA link tops out at about 600 MB/s, so deeper queues would not help. Later M.2 M-key SSDs moved to PCIe, but SSD controllers on the market do not pair PCIe with AHCI; they all use NVMe.

  2. How do I benchmark a SATA drive and an NVMe drive on a PC? When is sequential I/O versus random I/O the relevant workload?
    CrystalDiskMark results for a PCIe 4.0 x4 NVMe SSD (on Linux, fio is the usual equivalent):
    [Read]
    SEQ 1MiB (Q= 8, T= 1): 6835.239 MB/s [ 6518.6 IOPS] < 1226.48 us>
    SEQ 128KiB (Q= 32, T= 1): 6969.867 MB/s [ 53175.9 IOPS] < 601.49 us>
    RND 4KiB (Q= 32, T=16): 4459.859 MB/s [1088832.8 IOPS] < 469.77 us>
    RND 4KiB (Q= 1, T= 1): 94.125 MB/s [ 22979.7 IOPS] < 43.43 us>
    [Write]
    SEQ 1MiB (Q= 8, T= 1): 5002.495 MB/s [ 4770.8 IOPS] < 1675.04 us>
    SEQ 128KiB (Q= 32, T= 1): 5169.363 MB/s [ 39439.1 IOPS] < 810.83 us>
    RND 4KiB (Q= 32, T=16): 4355.104 MB/s [1063257.8 IOPS] < 481.13 us>
    RND 4KiB (Q= 1, T= 1): 276.426 MB/s [ 67486.8 IOPS] < 14.74 us>
    [Mix] Read 70%/Write 30%
    SEQ 1MiB (Q= 8, T= 1): 6077.409 MB/s [ 5795.9 IOPS] < 1376.67 us>
    SEQ 128KiB (Q= 32, T= 1): 5668.832 MB/s [ 43249.8 IOPS] < 739.20 us>
    RND 4KiB (Q= 32, T=16): 4269.803 MB/s [1042432.4 IOPS] < 490.15 us>
    RND 4KiB (Q= 1, T= 1): 111.143 MB/s [ 27134.5 IOPS] < 36.76 us>

  3. Why does NVMe-oF have no commands for creating CQs and SQs?
    Message-based controllers use the Connect command (refer to section 6.3) to create controller Admin Queues or I/O Queues. The creation of an Admin Queue establishes an association between a host and the corresponding controller. The message-based transport queue model does not support the Admin commands associated with I/O Queue creation and deletion (Create I/O Completion Queue, Create I/O Submission Queue, Delete I/O Completion Queue, Delete I/O Submission Queue).
    Once an RDMA queue pair is connected, its Send Queue and Receive Queue map one-to-one onto the NVMe SQ and CQ.
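    For reference, a sketch of the Fabrics Connect command layout, modeled on SPDK's struct spdk_nvmf_fabric_connect_cmd (the struct name used here is illustrative, not a real kernel or SPDK identifier):

    struct nvmf_fabric_connect_cmd {
    	uint8_t  opcode;        /* 0x7F: Fabrics command */
    	uint8_t  reserved1;
    	uint16_t cid;           /* command identifier */
    	uint8_t  fctype;        /* 0x01: Connect */
    	uint8_t  reserved2[19];
    	uint8_t  sgl1[16];      /* SGL descriptor for the Connect data */
    	uint16_t recfmt;        /* connect record format */
    	uint16_t qid;           /* 0 = Admin Queue, >0 = I/O Queue */
    	uint16_t sqsize;        /* submission queue size (0's based) */
    	uint8_t  cattr;         /* connect attributes */
    	uint8_t  reserved3;
    	uint32_t kato;          /* keep-alive timeout, in ms */
    	uint8_t  reserved4[12];
    };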

  4. Why use MSI-X interrupts?

    Feature               | Legacy (pin-based) interrupt       | MSI-X interrupt
    Trigger mechanism     | Physical signal line (level/edge)  | Memory write (message)
    Interrupt vectors     | Few (IRQ lines shared)             | Many (up to 2048 independent vectors per function)
    Multi-core behavior   | Weak (static core binding)         | Strong (dynamic load balancing)
    Latency               | Higher                             | Lower
    Virtualization        | Relies on emulation                | Native support (e.g. SR-IOV)
    Typical devices       | Low-speed peripherals              | High-speed NICs / storage devices

    Feature               | Hardware interrupt                  | Software interrupt
    Trigger source        | External device or hardware module  | Executing a specific instruction (e.g. INT n)
    Interrupt controller  | Always involved                     | Bypassed
    Priority arbitration  | Managed by the interrupt controller | Implicit in instruction order
    Typical uses          | Real-time device events (keys, timers) | System calls, debug breakpoints
    Latency               | Depends on controller and hardware  | Very low (immediate)
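    As a concrete illustration of the MSI-X column above, this is roughly how a Linux PCI driver allocates interrupt vectors (a hedged sketch: pdev, my_handler and my_dev are placeholders, and the real NVMe driver uses the pci_alloc_irq_vectors_affinity() variant to spread vectors across cores):

    #include <linux/pci.h>
    #include <linux/interrupt.h>

    static int my_setup_irqs(struct pci_dev *pdev, void *my_dev,
    			 irq_handler_t my_handler)
    {
    	int nvec, i, ret;

    	/* Prefer one MSI-X vector per CPU; fall back to MSI, then INTx */
    	nvec = pci_alloc_irq_vectors(pdev, 1, num_online_cpus(),
    				     PCI_IRQ_MSIX | PCI_IRQ_MSI | PCI_IRQ_LEGACY);
    	if (nvec < 0)
    		return nvec;

    	/* Each vector gets its own handler registration and can be
    	 * routed to a different core by the IRQ affinity layer. */
    	for (i = 0; i < nvec; i++) {
    		ret = request_irq(pci_irq_vector(pdev, i), my_handler,
    				  0, "my_dev", my_dev);
    		if (ret)
    			return ret;
    	}
    	return 0;
    }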

    A hardware interrupt is triggered by a hardware event, while a software interrupt is triggered by the program executing a specific instruction.
    External pin interrupts (key detection), timer interrupts, UART receive interrupts, and ADC conversion-complete interrupts are all hardware interrupts, because each is triggered by a hardware event: a pin level change, a timer overflow, data reception, or an ADC conversion finishing.

    The generic hardware-interrupt flow

    • Event: the hardware detects a specific condition (a pin level change, a timer overflow).
    • Request: the hardware raises an interrupt request to the interrupt controller over a physical interrupt line (IRQ) or a message (MSI/MSI-X).
    • Response:
      • The interrupt controller arbitrates priority and routes the interrupt to the target CPU core (on multi-core systems).
      • If the interrupt is enabled and its priority allows, the CPU suspends the current task and saves its context.
      • The CPU jumps to the interrupt service routine (ISR).
    • Handling: the ISR clears the interrupt flag and services the event (e.g. reads data, toggles a pin).
    • Return: the context is restored and the CPU resumes the original task.

    The generic software-interrupt flow

    • Event: the program executes a specific instruction that raises a software interrupt (e.g. x86's INT 0x80 for Linux system calls); the CPU derives the vector number from the instruction operand (INT 0x80 maps to vector 128).
    • Response and context switch:
      • After finishing the current instruction, the CPU detects the software interrupt and suspends the current program.
      • Privilege switch: if the handler requires a higher privilege level (e.g. user mode to kernel mode), the CPU switches automatically (x86: Ring 3 -> Ring 0).
      • Context save: the CPU pushes the current program's state onto the stack (or into dedicated registers), including the program counter (PC/IP), the status register (EFLAGS/CPSR), and general-purpose registers.
    • Handler execution:
      • Vector lookup: the CPU uses the vector number to index the Interrupt Descriptor Table (IDT) or the exception vector table (VBAR) and obtain the ISR entry address.
      • The CPU jumps to the ISR entry and runs the handler code.
    • Return and restore:
      • Context restore: when the ISR finishes, a dedicated instruction (x86's IRET, ARM's ERET) pops the saved register state.
      • Privilege switch back: the CPU returns to the original privilege level (e.g. kernel Ring 0 -> user Ring 3).
      • The CPU resumes the interrupted program from the saved program counter (PC/IP).

    A typical software-interrupt use case: the system-call (syscall) path
    The user program calls write().
    The C library wraps it in INT 0x80 or the syscall instruction and passes the system-call number (__NR_write = 4 on 32-bit x86).
    The CPU switches to kernel mode and executes sys_write().
    When it finishes, control returns to user mode and the user program continues.
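    A minimal sketch of that path on x86-64 Linux, first through the libc syscall() wrapper and then with the raw syscall instruction (illustrative only; note that on 64-bit x86 the call number is __NR_write = 1, unlike the 32-bit value 4 above):

    #include <unistd.h>
    #include <sys/syscall.h>

    int main(void)
    {
    	const char msg[] = "hello via syscall\n";
    	long ret;

    	/* libc wrapper: loads the call number into rax, runs `syscall` */
    	syscall(SYS_write, 1, msg, sizeof(msg) - 1);

    	/* The same thing in raw inline assembly: rax = call number,
    	 * rdi/rsi/rdx = arguments; `syscall` clobbers rcx and r11. */
    	__asm__ volatile ("syscall"
    			  : "=a"(ret)
    			  : "0"((long)SYS_write), "D"((long)1), "S"(msg),
    			    "d"(sizeof(msg) - 1)
    			  : "rcx", "r11", "memory");
    	return 0;
    }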

  5. Which is created first, the CQ or the SQ?

    • For NVMe, the CQ is created first, then the SQ.
      Creation and deletion of memory-based transport Submission Queue and associated Completion Queues are required to be ordered correctly by host software. Host software creates the Completion Queue before creating any associated Submission Queue. Submission Queues may be created at any time after the associated Completion Queue is created. Host software deletes all associated Submission Queues prior to deleting a Completion Queue. To abort all commands submitted to the Submission Queue host software issues a Delete I/O Submission Queue command for that queue (refer to section 3.3.1.3).
      Because NVMe supports multiple queues, each SQ must be bound to a CQ at creation time: the Create I/O Submission Queue command specifies the ID of an already existing CQ via the CQID field in Command Dword 11, as the struct below shows.
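    /* Layout of Command Dword 11 in the Create I/O Submission Queue command: */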
    struct {
    	/* Physically Contiguous */
    	uint32_t pc       : 1;
    	/* Queue Priority */
    	uint32_t qprio    : 2;
    	uint32_t reserved : 13;
    	/* Completion Queue Identifier */
    	uint32_t cqid     : 16;
    } create_io_sq;
    
    • For NVMe-oF there are no separate queue-creation commands: the CQs are created first with ibv_create_cq, and ibv_create_qp then creates the QP's Send and Receive Queues bound to those CQs in a single call (see the sketch below).
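    A hedged sketch of that sequence with libibverbs (error handling omitted; ctx, pd and depth are assumed to exist):

    #include <infiniband/verbs.h>

    struct ibv_qp *create_rdma_qp(struct ibv_context *ctx, struct ibv_pd *pd,
    			      int depth)
    {
    	/* CQ first: completions for both directions can land here */
    	struct ibv_cq *cq = ibv_create_cq(ctx, depth, NULL, NULL, 0);

    	/* QP creation binds its Send/Recv Queues to existing CQs */
    	struct ibv_qp_init_attr attr = {
    		.send_cq = cq,
    		.recv_cq = cq,
    		.cap = {
    			.max_send_wr  = depth,
    			.max_recv_wr  = depth,
    			.max_send_sge = 1,
    			.max_recv_sge = 1,
    		},
    		.qp_type = IBV_QPT_RC,	/* reliable connected, as NVMe-oF/RDMA uses */
    	};
    	return ibv_create_qp(pd, &attr);
    }
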
  6. Doorbell register source: where does the Linux kernel write the CQ head and SQ tail doorbells?

    /* Write the cq_head doorbell */
    static inline int nvme_process_cq(struct nvme_queue *nvmeq)
    {
    	int found = 0;
    
    	while (nvme_cqe_pending(nvmeq)) {
    		found++;
    		/*
    		 * load-load control dependency between phase and the rest of
    		 * the cqe requires a full read memory barrier
    		 */
    		dma_rmb();
    		nvme_handle_cqe(nvmeq, nvmeq->cq_head);
    		nvme_update_cq_head(nvmeq);
    	}
    
    	if (found)
    		nvme_ring_cq_doorbell(nvmeq);
    	return found;
    }
    static inline void nvme_ring_cq_doorbell(struct nvme_queue *nvmeq)
    {
    	u16 head = nvmeq->cq_head;
    
    	if (nvme_dbbuf_update_and_check_event(head, nvmeq->dbbuf_cq_db,
    					      nvmeq->dbbuf_cq_ei))
    		writel(head, nvmeq->q_db + nvmeq->dev->db_stride);
    }
    
    /* Write the sq_tail doorbell. blk-mq hands over requests in batches; only the last request of a batch rings the doorbell. */
    /*
     * Write sq tail if we are asked to, or if the next command would wrap.
     */
    static inline void nvme_write_sq_db(struct nvme_queue *nvmeq, bool write_sq)
    {
    	if (!write_sq) {
    		u16 next_tail = nvmeq->sq_tail + 1;
    
    		if (next_tail == nvmeq->q_depth)
    			next_tail = 0;
    		if (next_tail != nvmeq->last_sq_tail)
    			return;
    	}
    
    	if (nvme_dbbuf_update_and_check_event(nvmeq->sq_tail,
    			nvmeq->dbbuf_sq_db, nvmeq->dbbuf_sq_ei))
    		writel(nvmeq->sq_tail, nvmeq->q_db);
    	nvmeq->last_sq_tail = nvmeq->sq_tail;
    }
    
    /**
     * nvme_submit_cmd() - Copy a command into a queue and ring the doorbell
     * @nvmeq: The queue to use
     * @cmd: The command to send
     * @write_sq: whether to write to the SQ doorbell
     */
    static void nvme_submit_cmd(struct nvme_queue *nvmeq, struct nvme_command *cmd,
    			    bool write_sq)
    {
    	spin_lock(&nvmeq->sq_lock);
    	memcpy(nvmeq->sq_cmds + (nvmeq->sq_tail << nvmeq->sqes),
    	       cmd, sizeof(*cmd));
    	if (++nvmeq->sq_tail == nvmeq->q_depth)
    		nvmeq->sq_tail = 0;
    	nvme_write_sq_db(nvmeq, write_sq);
    	spin_unlock(&nvmeq->sq_lock);
    }
    
    /**
     * struct blk_mq_queue_data - Data about a request inserted in a queue
     *
     * @rq:   Request pointer.
     * @last: If it is the last request in the queue.
     */
    struct blk_mq_queue_data {
    	struct request *rq;
    	bool last;
    };
    
    static blk_status_t nvme_queue_rq(struct blk_mq_hw_ctx *hctx,
    			 const struct blk_mq_queue_data *bd)
    {
    	struct nvme_ns *ns = hctx->queue->queuedata;
    	struct nvme_queue *nvmeq = hctx->driver_data;
    	struct nvme_dev *dev = nvmeq->dev;
    	struct request *req = bd->rq;
    	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
    	struct nvme_command *cmnd = &iod->cmd;
    	blk_status_t ret;
    
    	iod->aborted = 0;
    	iod->npages = -1;
    	iod->nents = 0;
    
    	/*
    	 * We should not need to do this, but we're still using this to
    	 * ensure we can drain requests on a dying queue.
    	 */
    	if (unlikely(!test_bit(NVMEQ_ENABLED, &nvmeq->flags)))
    		return BLK_STS_IOERR;
    
    	ret = nvme_setup_cmd(ns, req, cmnd);
    	if (ret)
    		return ret;
    
    	if (blk_rq_nr_phys_segments(req)) {
    		ret = nvme_map_data(dev, req, cmnd);
    		if (ret)
    			goto out_free_cmd;
    	}
    
    	if (blk_integrity_rq(req)) {
    		ret = nvme_map_metadata(dev, req, cmnd);
    		if (ret)
    			goto out_unmap_data;
    	}
    
    	blk_mq_start_request(req);
    	nvme_submit_cmd(nvmeq, cmnd, bd->last);
    	return BLK_STS_OK;
    out_unmap_data:
    	nvme_unmap_data(dev, req);
    out_free_cmd:
    	nvme_cleanup_cmd(req);
    	return ret;
    }
    
  7. NVMe-oF connection establishment
    An NVMe Transport connection is established between a host and an NVM subsystem prior to the transfer of any capsules or data.
    NVMe-oF first establishes an RDMA connection, which becomes the Admin Queue; the host then sends a Connect command on that AQ to initialize the controller (host-side sketch below).
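    A hedged host-side sketch of that flow with librdmacm (event-loop waits elided to comments; dst_addr, pd and qp_attr are placeholders, and this mirrors the generic CM sequence rather than SPDK's actual implementation):

    #include <rdma/rdma_cma.h>

    int nvmf_rdma_connect(struct sockaddr *dst_addr, struct ibv_pd *pd,
    		      struct ibv_qp_init_attr *qp_attr)
    {
    	struct rdma_event_channel *ch = rdma_create_event_channel();
    	struct rdma_cm_id *id;
    	struct rdma_conn_param param = {};

    	rdma_create_id(ch, &id, NULL, RDMA_PS_TCP);
    	rdma_resolve_addr(id, NULL, dst_addr, 2000 /* ms */);
    	/* ... wait for RDMA_CM_EVENT_ADDR_RESOLVED ... */
    	rdma_resolve_route(id, 2000);
    	/* ... wait for RDMA_CM_EVENT_ROUTE_RESOLVED ... */
    	rdma_create_qp(id, pd, qp_attr);	/* SQ/RQ for this queue pair */
    	rdma_connect(id, &param);
    	/* ... wait for RDMA_CM_EVENT_ESTABLISHED; the host then sends the
    	 * Fabrics Connect command on this queue pair, which for qid 0
    	 * creates the Admin Queue association with the controller. */
    	return 0;
    }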

蛋蛋读nvme之二

  1. Are SQ and CQ one-to-one, many-to-one, or one-to-many? Does NVMe-oF support mappings other than 1:1?
    When using a memory-based transport queue model, multiple Submission Queues may utilize the same Completion Queue.
    When using a message-based transport queue model each Submission Queue maps to a single Completion Queue.
    NVMe supports one-to-one and many-to-one; NVMe-oF supports only one-to-one.

  2. In what scenarios are multiple SQs mapped to one CQ? What needs attention there?

  3. Does the Admin Queue only execute the Admin Command Set, and do I/O Queues only execute the I/O Command Set? How does NVMe-oF organize its Admin and I/O Queues?
    Yes. NVMe-oF allows one Admin Queue and up to 128 I/O Queues (SPDK's default per-controller limit).
    When _nvmf_ctrlr_connect handles a Fabrics Connect command, the controller is created dynamically if the queue is the Admin Queue (qid == 0):

     if (cmd->qid == 0) {
     	if (cmd->sqsize >= transport->opts.max_aq_depth) {
     		SPDK_ERRLOG("Invalid SQSIZE for admin queue %u (min 1, max %u)\n",
     			    cmd->sqsize, transport->opts.max_aq_depth - 1);
     		SPDK_NVMF_INVALID_CONNECT_CMD(rsp, sqsize);
     		return SPDK_NVMF_REQUEST_EXEC_STATUS_COMPLETE;
     	}
     } else if (cmd->sqsize >= transport->opts.max_queue_depth) {
     	SPDK_ERRLOG("Invalid SQSIZE %u (min 1, max %u)\n",
     		    cmd->sqsize, transport->opts.max_queue_depth - 1);
     	SPDK_NVMF_INVALID_CONNECT_CMD(rsp, sqsize);
     	return SPDK_NVMF_REQUEST_EXEC_STATUS_COMPLETE;
     }
    
     if (cmd->qid == 0) {
     	SPDK_DEBUGLOG(nvmf, "Connect Admin Queue for controller ID 0x%x\n", data->cntlid);
     	if (spdk_nvme_trtype_is_fabrics(transport->ops->type) && data->cntlid != 0xFFFF) {
     		/* This NVMf target only supports dynamic mode. */
     		SPDK_ERRLOG("The NVMf target only supports dynamic mode (CNTLID = 0x%x).\n", data->cntlid);
     		SPDK_NVMF_INVALID_CONNECT_DATA(rsp, cntlid);
     		return SPDK_NVMF_REQUEST_EXEC_STATUS_COMPLETE;
     	}
     	/* Establish a new ctrlr */
     	ctrlr = nvmf_ctrlr_create(subsystem, req, cmd, data);
     	if (!ctrlr) {
     		SPDK_ERRLOG("nvmf_ctrlr_create() failed\n");
     		rsp->status.sc = SPDK_NVME_SC_INTERNAL_DEVICE_ERROR;
     		return SPDK_NVMF_REQUEST_EXEC_STATUS_COMPLETE;
     	} else {
     		return SPDK_NVMF_REQUEST_EXEC_STATUS_ASYNCHRONOUS;
     	}
     } else {
     	spdk_thread_send_msg(subsystem->thread, _nvmf_ctrlr_add_io_qpair, req);
     	return SPDK_NVMF_REQUEST_EXEC_STATUS_ASYNCHRONOUS;
     }
    
  4. How are the NVMe queue depth and queue count negotiated, and what do they typically end up as? What about NVMe-oF?

  5. Why does the controller report SQ consumption with a 2-byte SQ Head Pointer in the CQE, yet signal new CQ entries with only a 1-bit Phase Tag?
    Host software checks completion queue entry Phase Tag (P) bits in memory to determine whether new completion queue entries have been posted (refer to section 3.3.3.2.2). The Completion Queue Tail pointer is only used internally by the controller and is not visible to the host. The controller uses the SQ Head Pointer (SQHD) field in completion queue entries to communicate new values of the Submission Queue Head Pointer to the host. A new SQHD value indicates that submission queue entries have been consumed, but does not indicate either execution or completion of any command. Refer to section 3.3.3.2.
    When multiple SQs share one CQ, SQHD (together with SQID) is needed to identify which queue's entries were consumed; with a strict 1:1 SQ/CQ mapping it would in theory be unnecessary.
    The host detects new CQEs by checking whether the Phase Tag has flipped, not by checking whether it is set to 1 (that would require zeroing each CQE after consuming it). For each CQ the host tracks an expected phase: the CQ memory starts zeroed and the controller posts the first pass of CQEs with P = 1, so the expected phase starts at 1 (as in SPDK) and is inverted every time the CQ wraps. A CQE is new when its Phase Tag equals the expected phase:

    while (1) {
     	cpl = &pqpair->cpl[pqpair->cq_head];
     	if (!next_is_valid && cpl->status.p != pqpair->flags.phase) {
     		break;
     	}
     	if (spdk_likely(pqpair->cq_head + 1 != pqpair->num_entries)) {
     		next_cq_head = pqpair->cq_head + 1;
     		next_phase = pqpair->flags.phase;
     	} else {
     		next_cq_head = 0;
     		next_phase = !pqpair->flags.phase;
     	}
     	next_cpl = &pqpair->cpl[next_cq_head];
     	next_is_valid = (next_cpl->status.p == next_phase);
         /* ... */
         if (spdk_unlikely(++pqpair->cq_head == pqpair->num_entries)) {
     	   pqpair->cq_head = 0;
     	   pqpair->flags.phase = !pqpair->flags.phase;
         }
         /* ... */
    }
    
  6. For NVMe-oF, where in the source are the SQ/CQ head and tail updated?

  7. When does the host actually use the SQ head and the CQ tail? Where in the NVMe-oF and kernel sources?

蛋蛋读nvme之三

  1. Can one PRP Entry describe multiple pages?
  2. How does an NVMe command describe a PRP List containing multiple PRP Entries?
  3. How does an NVMe command describe an SGL List containing multiple SGL Descriptors? Why is the Segment concept needed?
  4. When does an NVMe command use an SGL Last Segment descriptor? A Bit Bucket? A Segment?
  5. Where in the source does NVMe-oF translate SGLs?

蛋蛋读nvme之四

  1. How many doorbell-register writes happen while one NVMe command executes?
  2. How does the controller notice that a doorbell register's value changed?
  3. The four NVMe/PCIe mechanisms an SSD can use to interrupt the host:
    • Pin-based interrupt
    • single message MSI
    • multiple message MSI
    • MSI-X
  4. How does the host address exactly the right Tail DB register in the SSD controller? What is the memory-mapping mechanism behind it?
  5. Does a PCIe Memory Write/Read TLP require CPU involvement?

蛋蛋读nvme之五

  1. How does the CRC mechanism turn 512 bytes into a 2-byte tag? (See the sketch after this list.)
  2. How do the Application Tag and Reference Tag work?
  3. How is Protection Information transferred?
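    On question 1: the T10 DIF Guard tag is a CRC-16 over the 512-byte data block with polynomial 0x8BB7, initial value 0, no reflection and no final XOR (this matches the Linux kernel's crc_t10dif). A bit-at-a-time reference sketch, not an optimized implementation:

    #include <stdint.h>
    #include <stddef.h>

    uint16_t crc16_t10dif(const uint8_t *buf, size_t len)
    {
    	uint16_t crc = 0;

    	for (size_t i = 0; i < len; i++) {
    		crc ^= (uint16_t)buf[i] << 8;	/* feed next byte, MSB first */
    		for (int bit = 0; bit < 8; bit++)
    			crc = (crc & 0x8000) ? (uint16_t)((crc << 1) ^ 0x8BB7)
    					     : (uint16_t)(crc << 1);
    	}
    	return crc;	/* the 2-byte Guard tag for one 512-byte sector */
    }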

蛋蛋读nvme之Final

  1. What identifies a controller? Host + subsystem?
  2. Can one NVMe-oF subsystem create multiple controllers? What is the point of doing so?
  3. NVMe gates namespace access through controllers; how does NVMe-oF control namespace access?
  4. SR-IOV? Physical Function vs. Virtual Function?
  5. What are the use cases for multiple namespaces, multiple controllers, multiple PCIe ports, and multiple subsystems?
posted @ 2025-03-10 10:55 3yearleft