MSI Interrupts

https://www.codenong.com/cs106676560/

 

MSI supports only 32 interrupt vectors, while MSI-X supports up to 2048; even so, the MSI-X registers occupy less room in configuration space. This is because the vector information is not stored there directly, but in a region of device memory (MMIO), located via the BIR (BAR Indicator Register, also called the BAR Index Register). Both MSI and MSI-X are ultimately based on Memory Writes, so they are subject to the same errors as any other write, for example PCIe ECRC errors.
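To make the BIR lookup concrete, here is a minimal sketch (helper name mine; the layout is per the PCIe spec, where the dword at MSI-X capability offset 4 holds the Table Offset in bits 31:3 and the BIR in bits 2:0) of how the MSI-X table's MMIO address is derived:

#include <stdint.h>

/* Sketch: derive the MSI-X table address from the Table Offset/BIR dword.
 * The low 3 bits select which BAR the table lives in; the remaining bits
 * are the byte offset of the table within that BAR. */
uint64_t msix_table_addr(uint32_t table_offset_bir, const uint64_t bar_base[6])
{
    uint32_t bir    = table_offset_bir & 0x7;    /* BAR Indicator Register */
    uint32_t offset = table_offset_bir & ~0x7u;  /* offset within that BAR */
    return bar_base[bir] + offset;               /* MMIO address of the table */
}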

Pending Table

The structure of the Pending Table is shown in Figure 6-4.

In the Pending Table, an entry is 64 bits wide, and each bit corresponds to one entry in the MSI-X Table; that is, one Pending Table entry covers 64 MSI-X Table entries. As with the MSI mechanism, the Pending bits are used together with the Per Vector Mask bits.
When a Per Vector Mask bit is 1, the PCIe device may not send the corresponding MSI-X interrupt request immediately; instead it sets the corresponding Pending bit to 1. When system software clears the Per Vector Mask bit, the device must issue the pending MSI-X interrupt request and clear the Pending bit.
[1] In this case the "Interrupt Disable" bit of the Command register in the PCI device's configuration space is 1.
[2] MSI delivers interrupt requests in a manner similar to edge triggering, and with edge triggering the processor may lose some interrupt requests, so these two fields may be needed during device-driver development.

Checking a device's MSI/MSI-X capabilities

lspci -v shows the capabilities a device supports. Look for an MSI, MSI-X, or Message Signaled Interrupts entry; each carries an Enable flag, where "+" means enabled and "-" means disabled.
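As a rough sketch of what lspci is reading (hypothetical code; capability IDs and offsets per the PCI spec), the following walks the capability list in a 256-byte config-space dump, looking for MSI (ID 0x05) and MSI-X (ID 0x11) and their Enable bits:

#include <stdint.h>
#include <stdio.h>

#define PCI_CAP_ID_MSI  0x05
#define PCI_CAP_ID_MSIX 0x11

/* cfg holds the device's config space (e.g. read from
 * /sys/bus/pci/devices/<BDF>/config). The capability list head pointer
 * lives at offset 0x34; each capability starts with an ID byte and a
 * next-capability pointer byte. */
void list_msi_caps(const uint8_t cfg[256])
{
    uint8_t pos = cfg[0x34];
    while (pos) {
        uint8_t id = cfg[pos];
        if (id == PCI_CAP_ID_MSI)        /* MSI Enable: Message Control bit 0 */
            printf("MSI   at 0x%02x, Enable%c\n", pos,
                   (cfg[pos + 2] & 0x01) ? '+' : '-');
        else if (id == PCI_CAP_ID_MSIX)  /* MSI-X Enable: Message Control bit 15 */
            printf("MSI-X at 0x%02x, Enable%c\n", pos,
                   (cfg[pos + 3] & 0x80) ? '+' : '-');
        pos = cfg[pos + 1];
    }
}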

 

[root@localhost ixgbe]# lspci -xxx -vv -s 05:00.0
05:00.0 Ethernet controller: Huawei Technologies Co., Ltd. Hi1822 Family (2*25GE) (rev 45)
        Subsystem: Huawei Technologies Co., Ltd. Device d139
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0, Cache Line Size: 32 bytes
        NUMA node: 0
        Region 0: Memory at 80007b00000 (64-bit, prefetchable) [size=128K]
        Region 2: Memory at 80008a20000 (64-bit, prefetchable) [size=32K]
        Region 4: Memory at 80000200000 (64-bit, prefetchable) [size=1M]
        Expansion ROM at e9200000 [disabled] [size=1M]
        Capabilities: [40] Express (v2) Endpoint, MSI 00
                DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s unlimited, L1 unlimited
                        ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 0.000W
                DevCtl: Report errors: Correctable+ Non-Fatal+ Fatal+ Unsupported-
                        RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+ FLReset-
                        MaxPayload 256 bytes, MaxReadReq 512 bytes
                DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr+ TransPend-
                LnkCap: Port #0, Speed 8GT/s, Width x16, ASPM not supported, Exit Latency L0s unlimited, L1 unlimited
                        ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
                LnkCtl: ASPM Disabled; RCB 128 bytes Disabled- CommClk-
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 8GT/s, Width x16, TrErr- Train- SlotClk- DLActive- BWMgmt- ABWMgmt-
                DevCap2: Completion Timeout: Range B, TimeoutDis+, LTR-, OBFF Not Supported
                DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
                LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
                         Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
                         Compliance De-emphasis: -6dB
                LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete+, EqualizationPhase1+
                         EqualizationPhase2+, EqualizationPhase3+, LinkEqualizationRequest-
        Capabilities: [80] MSI: Enable- Count=1/32 Maskable+ 64bit+
                Address: 0000000000000000  Data: 0000
                Masking: 00000000  Pending: 00000000
        Capabilities: [a0] MSI-X: Enable+ Count=32 Masked-
                Vector table: BAR=2 offset=00000000
                PBA: BAR=2 offset=00004000
        Capabilities: [b0] Power Management version 3
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1+,D2+,D3hot+,D3cold+)
                Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [c0] Vital Product Data
                Product Name: Huawei IN200 2*100GE Adapter
                Read-only fields:
                        [PN] Part number: SP572
                End
        Capabilities: [100 v1] Advanced Error Reporting
                UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
                CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
                CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
                AERCap: First Error Pointer: 00, GenCap+ CGenEn- ChkCap+ ChkEn-
        Capabilities: [150 v1] Alternative Routing-ID Interpretation (ARI)
                ARICap: MFVC- ACS-, Next Function: 0
                ARICtl: MFVC- ACS-, Function Group: 0
        Capabilities: [200 v1] Single Root I/O Virtualization (SR-IOV)
                IOVCap: Migration-, Interrupt Message Number: 000
                IOVCtl: Enable- Migration- Interrupt- MSE- ARIHierarchy+
                IOVSta: Migration-
                Initial VFs: 120, Total VFs: 120, Number of VFs: 0, Function Dependency Link: 00
                VF offset: 1, stride: 1, Device ID: 375e
                Supported Page Size: 00000553, System Page Size: 00000010
                Region 0: Memory at 0000080007b20000 (64-bit, prefetchable)
                Region 2: Memory at 00000800082a0000 (64-bit, prefetchable)
                Region 4: Memory at 0000080000300000 (64-bit, prefetchable)
                VF Migration: offset: 00000000, BIR: 0
        Capabilities: [310 v1] #19
        Capabilities: [4e0 v1] Device Serial Number 44-a1-91-ff-ff-a4-9b-eb
        Capabilities: [4f0 v1] Transaction Processing Hints
                Device specific mode supported
                No steering table available
        Capabilities: [600 v1] Vendor Specific Information: ID=0000 Rev=0 Len=028 <?>
        Capabilities: [630 v1] Access Control Services
                ACSCap: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
                ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
        Kernel driver in use: vfio-pci
        Kernel modules: hinic
00: e5 19 00 02 06 04 10 00 45 00 00 02 08 00 00 00
10: 0c 00 b0 07 00 08 00 00 0c 00 a2 08 00 08 00 00
20: 0c 00 20 00 00 08 00 00 00 00 00 00 e5 19 39 d1
30: 00 00 40 e6 40 00 00 00 00 00 00 00 ff 00 00 00
40: 10 80 02 00 e2 8f 00 10 37 29 10 00 03 f1 43 00
50: 08 00 03 01 00 00 00 00 00 00 00 00 00 00 00 00
60: 00 00 00 00 92 03 00 00 00 00 00 00 0e 00 00 00
70: 03 00 1f 00 00 00 00 00 00 00 00 00 00 00 00 00
80: 05 a0 8a 01 00 00 00 00 00 00 00 00 00 00 00 00
90: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
a0: 11 b0 1f 80 02 00 00 00 02 40 00 00 00 00 00 00
b0: 01 c0 03 f8 00 00 00 00 00 00 00 00 00 00 00 00
c0: 03 00 28 80 37 32 78 ff 00 00 00 00 00 00 00 00
d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

[root@localhost ixgbe]# 

 

 

[root@localhost ixgbe]# lspci -n -v -s 06:00.0
06:00.0 0200: 19e5:0200 (rev 45)
        Subsystem: 19e5:d139
        Flags: fast devsel, NUMA node 0
        [virtual] Memory at 80010400000 (64-bit, prefetchable) [size=128K]
        [virtual] Memory at 80011320000 (64-bit, prefetchable) [size=32K]
        [virtual] Memory at 80008b00000 (64-bit, prefetchable) [size=1M]
        Expansion ROM at e9300000 [disabled] [size=1M]
        Capabilities: [40] Express Endpoint, MSI 00
        Capabilities: [80] MSI: Enable- Count=1/32 Maskable+ 64bit+
        Capabilities: [a0] MSI-X: Enable- Count=32 Masked-
        Capabilities: [b0] Power Management version 3
        Capabilities: [c0] Vital Product Data
        Capabilities: [100] Advanced Error Reporting
        Capabilities: [150] Alternative Routing-ID Interpretation (ARI)
        Capabilities: [200] Single Root I/O Virtualization (SR-IOV)
        Capabilities: [310] #19
        Capabilities: [4e0] Device Serial Number 44-a1-91-ff-ff-a4-9b-ec
        Capabilities: [4f0] Transaction Processing Hints
        Capabilities: [600] Vendor Specific Information: ID=0000 Rev=0 Len=028 <?>
        Capabilities: [630] Access Control Services
        Kernel driver in use: vfio-pci
        Kernel modules: hinic

 

root@zj-x86:~# lspci -n -v -s 1a:00.1
1a:00.1 0200: 8086:37d0 (rev 09)
        Subsystem: 19e5:d123
        Flags: bus master, fast devsel, latency 0, IRQ 31, NUMA node 0
        Memory at a0000000 (64-bit, prefetchable) [size=16M]
        Memory at a3010000 (64-bit, prefetchable) [size=32K]
        Expansion ROM at a3700000 [disabled] [size=512K]
        Capabilities: [40] Power Management version 3
        Capabilities: [50] MSI: Enable- Count=1/1 Maskable+ 64bit+
        Capabilities: [70] MSI-X: Enable+ Count=129 Masked-
        Capabilities: [a0] Express Endpoint, MSI 00
        Capabilities: [e0] Vital Product Data
        Capabilities: [100] Advanced Error Reporting
        Capabilities: [140] Device Serial Number 5c-ac-f7-ff-ff-6b-1d-f4
        Capabilities: [150] Alternative Routing-ID Interpretation (ARI)
        Capabilities: [160] Single Root I/O Virtualization (SR-IOV)
        Capabilities: [1a0] Transaction Processing Hints
        Capabilities: [1b0] Access Control Services
        Kernel driver in use: i40e
        Kernel modules: i40e

root@zj-x86:~# 

 

 

The I350 NIC sits at bus 3, device 0, function 0. From the configuration space we can see the NIC requested a BAR3, which is exactly the BAR used by MSI-X: the MSI-X table structure is stored at BAR3 base + 0, and the PBA structure at BAR3 base + 0x2000.
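For reference, each MSI-X table entry is 16 bytes; a sketch of the layout (struct name mine, fields per the PCIe spec):

#include <stdint.h>

/* One MSI-X table entry as it appears in device MMIO (here BAR3 + 0 for
 * the table; the PBA at BAR3 + 0x2000 is just one pending bit per entry). */
struct msix_entry_hw {
    uint32_t msg_addr_lo;  /* Message Address [31:0]  */
    uint32_t msg_addr_hi;  /* Message Address [63:32] */
    uint32_t msg_data;     /* Message Data (carries the vector on x86) */
    uint32_t vector_ctrl;  /* bit 0 = Per Vector Mask */
};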

root@zj-x86:~# lspci | grep -i  ether
1a:00.0 Ethernet controller: Intel Corporation Ethernet Connection X722 for 10GbE SFP+ (rev 09)
1a:00.1 Ethernet controller: Intel Corporation Ethernet Connection X722 for 10GbE SFP+ (rev 09)
1a:00.2 Ethernet controller: Intel Corporation Ethernet Connection X722 for 1GbE (rev 09)
1a:00.3 Ethernet controller: Intel Corporation Ethernet Connection X722 for 1GbE (rev 09)
root@zj-x86:~# lspci -s 1a:00.1  -v
1a:00.1 Ethernet controller: Intel Corporation Ethernet Connection X722 for 10GbE SFP+ (rev 09)
        Subsystem: Huawei Technologies Co., Ltd. Ethernet Connection X722 for 10GbE SFP+
        Flags: bus master, fast devsel, latency 0, IRQ 31, NUMA node 0
        Memory at a0000000 (64-bit, prefetchable) [size=16M]
        Memory at a3010000 (64-bit, prefetchable) [size=32K]
        Expansion ROM at a3700000 [disabled] [size=512K]
        Capabilities: [40] Power Management version 3
        Capabilities: [50] MSI: Enable- Count=1/1 Maskable+ 64bit+
        Capabilities: [70] MSI-X: Enable+ Count=129 Masked-
        Capabilities: [a0] Express Endpoint, MSI 00
        Capabilities: [e0] Vital Product Data
        Capabilities: [100] Advanced Error Reporting
        Capabilities: [140] Device Serial Number 5c-ac-f7-ff-ff-6b-1d-f4
        Capabilities: [150] Alternative Routing-ID Interpretation (ARI)
        Capabilities: [160] Single Root I/O Virtualization (SR-IOV)
        Capabilities: [1a0] Transaction Processing Hints
        Capabilities: [1b0] Access Control Services
        Kernel driver in use: i40e
        Kernel modules: i40e

root@zj-x86:~# 

 

[root@localhost ~]# lspci -s 05:00.0 -v
05:00.0 Ethernet controller: Huawei Technologies Co., Ltd. Hi1822 Family (2*25GE) (rev 45)
        Subsystem: Huawei Technologies Co., Ltd. Device d139
        Flags: fast devsel, NUMA node 0
        [virtual] Memory at 80007b00000 (64-bit, prefetchable) [size=128K]
        [virtual] Memory at 80008a20000 (64-bit, prefetchable) [size=32K]
        [virtual] Memory at 80000200000 (64-bit, prefetchable) [size=1M]
        Expansion ROM at e9200000 [disabled] [size=1M]
        Capabilities: [40] Express Endpoint, MSI 00
        Capabilities: [80] MSI: Enable- Count=1/32 Maskable+ 64bit+
        Capabilities: [a0] MSI-X: Enable- Count=32 Masked-
        Capabilities: [b0] Power Management version 3
        Capabilities: [c0] Vital Product Data
        Capabilities: [100] Advanced Error Reporting
        Capabilities: [150] Alternative Routing-ID Interpretation (ARI)
        Capabilities: [200] Single Root I/O Virtualization (SR-IOV)
        Capabilities: [310] #19
        Capabilities: [4e0] Device Serial Number 44-a1-91-ff-ff-a4-9b-eb
        Capabilities: [4f0] Transaction Processing Hints
        Capabilities: [600] Vendor Specific Information: ID=0000 Rev=0 Len=028 <?>
        Capabilities: [630] Access Control Services
        Kernel driver in use: vfio-pci
        Kernel modules: hinic

[root@localhost ~]# 

 


4. How does a device use MSI/MSI-X interrupts?

With legacy interrupts, interrupt numbers are assigned automatically when the system scans the PCI bus tree at initialization; if a device wants to use MSI, however, the driver must do some extra setup.
The current Linux kernel provides pci_alloc_irq_vectors() to initialize the MSI/MSI-X capability and allocate interrupt numbers.

int pci_alloc_irq_vectors(struct pci_dev *dev, unsigned int min_vecs,
                unsigned int max_vecs, unsigned int flags);

The return value is the number of interrupt vectors allocated to the PCI device.
min_vecs is the minimum number of vectors the device requires; if fewer are available, an error is returned.
max_vecs is the maximum number of vectors desired.
flags selects which interrupt types the device and driver can use; there are four:

#define PCI_IRQ_LEGACY      (1 << 0) /* Allow legacy interrupts */
#define PCI_IRQ_MSI     (1 << 1) /* Allow MSI interrupts */
#define PCI_IRQ_MSIX        (1 << 2) /* Allow MSI-X interrupts */
#define PCI_IRQ_ALL_TYPES   (PCI_IRQ_LEGACY | PCI_IRQ_MSI | PCI_IRQ_MSIX)

PCI_IRQ_ALL_TYPES requests any available interrupt type.
In addition, PCI_IRQ_AFFINITY can be set to spread the interrupts across the available CPUs.
Usage example:

 i = pci_alloc_irq_vectors(dev->pdev, min_msix, msi_count, PCI_IRQ_MSIX | PCI_IRQ_AFFINITY);

Its counterpart for releasing interrupt resources is pci_free_irq_vectors(), which must be called when the device is removed:

void pci_free_irq_vectors(struct pci_dev *dev);

Linux also provides pci_irq_vector() to obtain the Linux IRQ number for a given vector:

int pci_irq_vector(struct pci_dev *dev, unsigned int nr);
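Putting the pieces together, a minimal driver-side sketch (hypothetical device name and handler; error handling abbreviated):

#include <linux/pci.h>
#include <linux/interrupt.h>

static irqreturn_t mydev_handler(int irq, void *data)
{
    /* device-specific acknowledgment and dispatch would go here */
    return IRQ_HANDLED;
}

static int mydev_setup_irqs(struct pci_dev *pdev)
{
    int nvecs, i, err;

    /* Ask for 1..8 vectors, preferring MSI-X but accepting MSI or INTx. */
    nvecs = pci_alloc_irq_vectors(pdev, 1, 8,
                                  PCI_IRQ_ALL_TYPES | PCI_IRQ_AFFINITY);
    if (nvecs < 0)
        return nvecs;

    for (i = 0; i < nvecs; i++) {
        /* pci_irq_vector() maps a vector index to its Linux IRQ number. */
        err = request_irq(pci_irq_vector(pdev, i), mydev_handler,
                          0, "mydev", pdev);
        if (err)
            goto out_free;
    }
    return 0;

out_free:
    while (--i >= 0)
        free_irq(pci_irq_vector(pdev, i), pdev);
    pci_free_irq_vectors(pdev);
    return err;
}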

5. How are a device's MSI/MSI-X interrupts set up and handled?

5.1 MSI interrupt allocation: pci_alloc_irq_vectors()

Let's dig into pci_alloc_irq_vectors():
pci_alloc_irq_vectors() --> pci_alloc_irq_vectors_affinity()

int pci_alloc_irq_vectors_affinity(struct pci_dev *dev, unsigned int min_vecs,
                   unsigned int max_vecs, unsigned int flags,
                   struct irq_affinity *affd)
{
    struct irq_affinity msi_default_affd = {0};
    int msix_vecs = -ENOSPC;
    int msi_vecs = -ENOSPC;

    if (flags & PCI_IRQ_AFFINITY) {                        
        if (!affd)
            affd = &msi_default_affd;
    } else {
        if (WARN_ON(affd))
            affd = NULL;
    }

    if (flags & PCI_IRQ_MSIX) {
        msix_vecs = __pci_enable_msix_range(dev, NULL, min_vecs,
                            max_vecs, affd, flags);               ------(1)
        if (msix_vecs > 0)
            return msix_vecs;
    }

    if (flags & PCI_IRQ_MSI) {
        msi_vecs = __pci_enable_msi_range(dev, min_vecs, max_vecs,
                          affd);                             ----- (2)
        if (msi_vecs > 0)
            return msi_vecs;
    }

    /* use legacy IRQ if allowed */
    if (flags & PCI_IRQ_LEGACY) {
        if (min_vecs == 1 && dev->irq) {
            /*
             * Invoke the affinity spreading logic to ensure that
             * the device driver can adjust queue configuration
             * for the single interrupt case.
             */
            if (affd)
                irq_create_affinity_masks(1, affd);
            pci_intx(dev, 1);                                 ------ (3)
            return 1;
        }
    }

    if (msix_vecs == -ENOSPC)                
        return -ENOSPC;
    return msi_vecs;
}

(1) First, check whether MSI-X interrupts were requested:

__pci_enable_msix_range()
    +-> __pci_enable_msix()
        +-> msix_capability_init()
            +-> pci_msi_setup_msi_irqs()

msix_capability_init() configures the MSI-X capability.
The key function, pci_msi_setup_msi_irqs(), creates the MSI IRQ numbers:

static int pci_msi_setup_msi_irqs(struct pci_dev *dev, int nvec, int type)
{
    struct irq_domain *domain;

    domain = dev_get_msi_domain(&dev->dev);      
    if (domain && irq_domain_is_hierarchy(domain))
        return msi_domain_alloc_irqs(domain, &dev->dev, nvec);

    return arch_setup_msi_irqs(dev, nvec, type);
}

The irq_domain obtained here is dev->msi_domain, defined in the PCI device structure.
Where is this msi_domain set?
In drivers/irqchip/irq-gic-v3-its-pci-msi.c, at kernel boot:

its_pci_msi_init()
    +-> its_pci_of_msi_init()
        +-> its_pci_msi_init_one()
            +-> pci_msi_create_irq_domain(handle, &its_pci_msi_domain_info, parent)

pci_msi_create_irq_domain() creates the pci_msi irq_domain, passing its_pci_msi_domain_info and setting the ITS irq_domain as its parent.
So the logic is now fairly clear:
when the GIC interrupt controller initializes, it adds the GIC irq_domain; the GIC irq_domain is the parent of the ITS irq_domain, and the host data of the ITS irq_domain corresponds to the pci_msi irq_domain.

        gic irq_domain --> irq_domain_ops(gic_irq_domain_ops)
              ^                --> .alloc(gic_irq_domain_alloc)
              |
        its irq_domain --> irq_domain_ops(its_domain_ops)
              ^                --> .alloc(its_irq_domain_alloc)
              |                --> ...
              |        --> host_data(struct msi_domain_info)
              |            --> msi_domain_ops(its_msi_domain_ops)
              |                --> .msi_prepare(its_msi_prepare)
              |            --> irq_chip, chip_data, handler...
              |            --> void *data(struct its_node)

The ops associated with the pci_msi irq_domain:

static const struct irq_domain_ops msi_domain_ops = {
        .alloc          = msi_domain_alloc,
        .free           = msi_domain_free,
        .activate       = msi_domain_activate,
        .deactivate     = msi_domain_deactivate,
};

Back in pci_msi_setup_msi_irqs() above: after obtaining the pci_msi irq_domain, it calls msi_domain_alloc_irqs() to allocate the IRQ numbers.

msi_domain_alloc_irqs()
    // corresponds to its_pci_msi_prepare in its_pci_msi_ops
    +-> msi_domain_prepare_irqs()
    // allocates the IRQ numbers
    +-> __irq_domain_alloc_irqs()

msi_domain_prepare_irqs() corresponds to its_msi_prepare(), which creates an its_device.
__irq_domain_alloc_irqs() allocates the virtual interrupt number, taking the first free bit in the allocated_irqs bitmap.

At this point MSI-X interrupt allocation is complete, and so is the MSI-X configuration.

(2) If MSI-X was not requested, check whether MSI interrupts were requested; the flow is similar to MSI-X.
(3) If neither MSI nor MSI-X was requested, check whether a legacy INTx interrupt was requested.

5.2 MSI interrupt registration

kernel/irq/manage.c

request_irq()
    +-> __setup_irq()
        +-> irq_activate()
            +-> msi_domain_activate()
                // irq_chip_write_msi_msg defined in msi_domain_info
                +-> irq_chip_write_msi_msg()
                    // the irq_chip here is its_msi_irq_chip, associated in pci_msi_create_irq_domain
                    +-> data->chip->irq_write_msi_msg(data, msg);
                            +-> pci_msi_domain_write_msg()

From this flow we can see that MSI sets up an interrupt by having irq_write_msi_msg() write a message to an address.
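In the MSI-X case, that message write is simply an MMIO store into the vector's table entry. A simplified sketch modeled on the kernel's __pci_write_msi_msg() (the PCI_MSIX_ENTRY_* offsets are the kernel's names for the 16-byte entry layout):

static void write_msix_entry(void __iomem *entry, const struct msi_msg *msg)
{
    /* Store the address/data pair into this vector's MSI-X table entry;
     * the device later replays it as a Memory Write TLP to raise the IRQ. */
    writel(msg->address_lo, entry + PCI_MSIX_ENTRY_LOWER_ADDR);
    writel(msg->address_hi, entry + PCI_MSIX_ENTRY_UPPER_ADDR);
    writel(msg->data,       entry + PCI_MSIX_ENTRY_DATA);
}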

Interrupt generation

1. Generating an MSI interrupt request

For a detailed treatment of MSI and MSI-X, see Wang Qi's 《PCI Express体系结构导读》 and the Intel® 64 and IA-32 Architectures Software Developer's Manual.

  A PCIe device raises an MSI/MSI-X interrupt by writing the Message Data to the Message Address recorded in its MSI/MSI-X capability, forming a Memory Write TLP that delivers the request to the processor. Different processors handle MSI requests with different mechanisms; x86 uses FSB Interrupt Messages. The Intel manual defines the Message Data format, in which bits 0-7 are the Vector, so every MSI request carries the value of its interrupt vector.
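As a concrete illustration of that format (bit positions per the Intel SDM; helper names mine):

#include <stdint.h>

/* Decode an x86 MSI address/data pair per the Intel SDM: the Message
 * Address is 0xFEExxxxx with the destination APIC ID in bits 19:12; the
 * Message Data carries the vector in bits 7:0 and the delivery mode in
 * bits 10:8. */
static inline uint8_t msi_vector(uint32_t data)    { return data & 0xff; }
static inline uint8_t msi_delivery(uint32_t data)  { return (data >> 8) & 0x7; }
static inline uint8_t msi_dest_apic(uint32_t addr) { return (addr >> 12) & 0xff; }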

  How is the Vector in Message Data set? When the device driver calls request_irq() to register the interrupt, msi_set_affinity() is eventually invoked; it writes the vector value into the low 8 bits of msg.data and then calls __write_msi_msg() to write the Message Data into the PCIe configuration space.

2. Overall interrupt flow

  The MSI message carries the Vector, which is used to look up the handler in the IDT. On x86, whatever the vector, the IDT entry jumps to common_interrupt and then runs do_IRQ, which consists mainly of the following steps (a simplified sketch of do_IRQ follows the list):

  1. Call irq_enter() to increment the HARDIRQ count in preempt_count, marking the start of HARDIRQ context;

    preemption is not allowed while preempt_count is non-zero.

  2. irq = __this_cpu_read(vector_irq[vector]) obtains the irq value from the vector;

  3. handle_irq() performs the HARDIRQ processing;

  4. If handle_irq() returns false, ack_APIC_irq() writes 0 to the APIC's EOI register to tell the APIC that the interrupt has been serviced;

  5. irq_exit() calls sub_preempt_count(HARDIRQ_OFFSET) to remove the count added by irq_enter(), marking the end of HARDIRQ; then, if in_interrupt() shows that preempt_count is 0 and a softirq is pending, it calls invoke_softirq().
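A simplified do_IRQ, condensed from an older x86 kernel, with the five steps marked:

__visible unsigned int __irq_entry do_IRQ(struct pt_regs *regs)
{
    struct pt_regs *old_regs = set_irq_regs(regs);
    unsigned vector = ~regs->orig_ax;   /* the entry stub stored ~vector */
    unsigned irq;

    irq_enter();                                 /* (1) HARDIRQ begins */

    irq = __this_cpu_read(vector_irq[vector]);   /* (2) vector -> irq */

    if (!handle_irq(irq, regs)) {                /* (3) HARDIRQ processing */
        ack_APIC_irq();                          /* (4) EOI if unhandled */
        pr_emerg("%s: %d.%d No irq handler for vector\n",
                 __func__, smp_processor_id(), vector);
    }

    irq_exit();                                  /* (5) HARDIRQ ends, maybe softirq */

    set_irq_regs(old_regs);
    return 1;
}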

2.1. preempt_count

preempt_count is a member of thread_info. The kernel describes preempt_count as follows; it contains a softirq count, a hardirq count, and so on, so preempt_count serves both as a preemption counter and as a way to determine the current context.

/*
* We put the hardirq and softirq counter into the preemption
* counter. The bitmask has the following meaning:
*
* - bits 0-7 are the preemption count (max preemption depth: 256)
* - bits 8-15 are the softirq count (max # of softirqs: 256)
*
* The hardirq count can in theory reach the same as NR_IRQS.
* In reality, the number of nested IRQS is limited to the stack
* size as well. For archs with over 1000 IRQS it is not practical
* to expect that they will all nest. We give a max of 10 bits for
* hardirq nesting. An arch may choose to give less than 10 bits.
* m68k expects it to be 8.
*
* - bits 16-25 are the hardirq count (max # of nested hardirqs: 1024)
* - bit 26 is the NMI_MASK
* - bit 27 is the PREEMPT_ACTIVE flag
*
* PREEMPT_MASK: 0x000000ff
* SOFTIRQ_MASK: 0x0000ff00
* HARDIRQ_MASK: 0x03ff0000
* NMI_MASK: 0x04000000
*/
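The matching shifts and context tests, abridged from the same era of the kernel:

#define PREEMPT_SHIFT   0
#define SOFTIRQ_SHIFT   (PREEMPT_SHIFT + 8)   /* 8  */
#define HARDIRQ_SHIFT   (SOFTIRQ_SHIFT + 8)   /* 16 */

#define HARDIRQ_OFFSET  (1UL << HARDIRQ_SHIFT)

/* irq_enter() adds HARDIRQ_OFFSET to preempt_count; irq_exit() removes it. */
#define irq_count()     (preempt_count() & (HARDIRQ_MASK | SOFTIRQ_MASK | NMI_MASK))
#define in_interrupt()  (irq_count())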
 

2.2. irq and vector

vector_irq is a per-CPU array recording the vector-to-irq mapping on each CPU: the index is the vector value, and the element stored is the irq.

#define NR_VECTORS			 256
typedef int vector_irq_t[NR_VECTORS];
DECLARE_PER_CPU(vector_irq_t, vector_irq);
 

2.3. handle_irq

handle_irq() looks up the interrupt descriptor irq_desc from the irq and then calls generic_handle_irq_desc():

bool handle_irq(unsigned irq, struct pt_regs *regs)
{
    struct irq_desc *desc;

    stack_overflow_check(regs);

    desc = irq_to_desc(irq);
    if (unlikely(!desc))
        return false;

    generic_handle_irq_desc(irq, desc);
    return true;
}

static inline void generic_handle_irq_desc(unsigned int irq, struct irq_desc *desc)
{
    desc->handle_irq(irq, desc);
}
 

To dig deeper, first understand that the interrupt code has three main abstraction layers:

  • High-level driver API
  • High-level IRQ flow handlers
  • Chip-level hardware encapsulation

The analysis so far (common_interrupt and so on) has been low-level architecture code. When an interrupt fires, that low-level code enters the generic interrupt code by calling desc->handle_irq. The function handle_irq points to belongs to the high-level IRQ flow handler layer. The kernel provides a set of predefined irq-flow handlers, which the architecture assigns to specific interrupts (i.e., assigns to desc->handle_irq) during boot or device initialization:

/*
* Built-in IRQ handlers for various IRQ types,
* callable via desc->handle_irq()
*/
extern void handle_level_irq(unsigned int irq, struct irq_desc *desc);
extern void handle_fasteoi_irq(unsigned int irq, struct irq_desc *desc);
extern void handle_edge_irq(unsigned int irq, struct irq_desc *desc);
extern void handle_edge_eoi_irq(unsigned int irq, struct irq_desc *desc);
extern void handle_simple_irq(unsigned int irq, struct irq_desc *desc);
extern void handle_percpu_irq(unsigned int irq, struct irq_desc *desc);
extern void handle_percpu_devid_irq(unsigned int irq, struct irq_desc *desc);
extern void handle_bad_irq(unsigned int irq, struct irq_desc *desc);
extern void handle_nested_irq(unsigned int irq);
 

A high-level IRQ flow handler calls the desc->irq_data.chip primitives (those in irq_chip, e.g. irq_ack), i.e. the chip-level hardware encapsulation, and if the interrupt has a specific handler registered it also invokes the peripheral's specific handler.

Since I happened to break into handle_edge_irq (the generic implementation for edge-triggered interrupts) under GDB, I'll use it as the example; see the code and comments below.

/**
 * handle_edge_irq - edge type IRQ handler
 * @irq: the interrupt number
 * @desc: the interrupt description structure for this irq
 *
 * Interrupt occurs on the falling and/or rising edge of a hardware
 * signal. The occurrence is latched into the irq controller hardware
 * and must be acked in order to be reenabled. After the ack another
 * interrupt can happen on the same source even before the first one
 * is handled by the associated event handler. If this happens it
 * might be necessary to disable (mask) the interrupt depending on the
 * controller hardware. This requires to reenable the interrupt inside
 * of the loop which handles the interrupts which have arrived while
 * the handler was running. If all pending interrupts are handled, the
 * loop is left.
 */
void
handle_edge_irq(unsigned int irq, struct irq_desc *desc)
{
    raw_spin_lock(&desc->lock);
    /*
     * Clear the IRQS_REPLAY and IRQS_WAITING flags.
     * IRQS_REPLAY relates to irq resend: check_irq_resend() sets
     * IRQS_REPLAY when it finds IRQS_PENDING set.
     */
    desc->istate &= ~(IRQS_REPLAY | IRQS_WAITING);
    /*
     * If we're currently running this IRQ, or its disabled,
     * we shouldn't process the IRQ. Mark it pending, handle
     * the necessary masking and go out
     */
    /*
     * 1. The interrupt was disabled by another CPU: set PENDING, mask and
     *    ack it; when the other CPU re-enables the interrupt it will be
     *    resent.
     * 2. The descriptor is being handled on another CPU ("currently running
     *    this IRQ" means an earlier interrupt with the same irq number, not
     *    this one): set PENDING, mask and ack; the other CPU will handle it
     *    shortly.
     * 3. The descriptor has no irqaction, so there is no point running the
     *    specific-handler flow.
     */
    if (unlikely(irqd_irq_disabled(&desc->irq_data) ||
                 irqd_irq_inprogress(&desc->irq_data) || !desc->action)) {
        if (!irq_check_poll(desc)) {
            desc->istate |= IRQS_PENDING;
            mask_ack_irq(desc);
            goto out_unlock;
        }
    }
    kstat_incr_irqs_this_cpu(irq, desc);    /* update the irq statistics */

    /* Start handling the irq */
    desc->irq_data.chip->irq_ack(&desc->irq_data);  /* ack so the source can fire again */

    do {
        /*
         * If, during the phase of handle_irq_event where the spinlock is
         * not held, another CPU unregistered the specific handler, mask
         * the irq and bail out.
         */
        if (unlikely(!desc->action)) {
            mask_irq(desc);
            goto out_unlock;
        }

        /*
         * When another irq arrived while we were handling
         * one, we could have masked the irq.
         * Renable it, if it was not disabled in meantime.
         */
        /*
         * If the descriptor is pending (for the reasons above), unmask
         * what was masked earlier.
         */
        if (unlikely(desc->istate & IRQS_PENDING)) {
            if (!irqd_irq_disabled(&desc->irq_data) &&
                irqd_irq_masked(&desc->irq_data))
                unmask_irq(desc);
        }

        handle_irq_event(desc);     /* run the interrupt event handlers */

    } while ((desc->istate & IRQS_PENDING) &&
             !irqd_irq_disabled(&desc->irq_data));

out_unlock:
    raw_spin_unlock(&desc->lock);
}
 

handle_irq_event(desc) calls handle_irq_event_percpu(), which walks the action list and runs each specific handler. handle_irq_event_percpu() is not analyzed in depth here; a simplified sketch follows.
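Abridged from the same era of the kernel (slightly simplified; irq_wake_thread defers work to the threaded handler):

irqreturn_t handle_irq_event_percpu(struct irq_desc *desc, struct irqaction *action)
{
    irqreturn_t retval = IRQ_NONE;
    unsigned int irq = desc->irq_data.irq;

    do {
        irqreturn_t res = action->handler(irq, action->dev_id); /* specific handler */

        if (res == IRQ_WAKE_THREAD)
            irq_wake_thread(desc, action);   /* defer to the irq thread */

        retval |= res;
        action = action->next;               /* walk the shared-irq action list */
    } while (action);

    return retval;
}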

Note the locking around handle_irq_event_percpu(): the lock and unlock at the start and end of handle_edge_irq are not a matched pair. The raw_spin_unlock before handle_irq_event_percpu() pairs with the raw_spin_lock at the top of handle_edge_irq, and the raw_spin_lock after handle_irq_event_percpu() pairs with the raw_spin_unlock at the end of handle_edge_irq.

irqreturn_t handle_irq_event(struct irq_desc *desc)
{
    struct irqaction *action = desc->action;
    irqreturn_t ret;

    desc->istate &= ~IRQS_PENDING;                    /* clear the IRQS_PENDING flag */
    irqd_set(&desc->irq_data, IRQD_IRQ_INPROGRESS);   /* mark this CPU as handling this desc's irq */
    raw_spin_unlock(&desc->lock);                     /* drop the desc lock */

    ret = handle_irq_event_percpu(desc, action);

    raw_spin_lock(&desc->lock);                       /* retake the desc lock */
    irqd_clear(&desc->irq_data, IRQD_IRQ_INPROGRESS); /* clear IRQD_IRQ_INPROGRESS */
    return ret;
}
 

The other high-level IRQ flow handlers won't be analyzed in detail; below are simplified excerpts from the kernel documentation for some of them:

  • handle_level_irq

    desc->irq_data.chip->irq_mask_ack();
    handle_irq_event(desc->action);
    desc->irq_data.chip->irq_unmask();

  • handle_fasteoi_irq

    handle_irq_event(desc->action);
    desc->irq_data.chip->irq_eoi();

  • handle_edge_irq

    if (desc->status & running) {
        desc->irq_data.chip->irq_mask_ack();
        desc->status |= pending | masked;
        return;
    }
    desc->irq_data.chip->irq_ack();
    desc->status |= running;
    do {
        if (desc->status & masked)
            desc->irq_data.chip->irq_unmask();
        desc->status &= ~pending;
        handle_irq_event(desc->action);
    } while (status & pending);
    desc->status &= ~running;

  • handle_simple_irq

    handle_irq_event(desc->action);

  • handle_percpu_irq

    if (desc->irq_data.chip->irq_ack)
        desc->irq_data.chip->irq_ack();
    handle_irq_event(desc->action);
    if (desc->irq_data.chip->irq_eoi)
        desc->irq_data.chip->irq_eoi();

2.4. ack_APIC_irq

Back to do_IRQ: if handle_irq() returns false, ack_APIC_irq() writes 0 to the APIC's EOI register to report that the interrupt has been serviced. Does handle_irq() returning true mean ack_APIC_irq() is skipped? No: it happens when the high-level IRQ flow handler invokes the desc->irq_data.chip primitives. For example, when handle_edge_irq (analyzed above) calls irq_ack, that call is in fact ack_APIC_irq.
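For reference, ack_APIC_irq() itself is tiny (as in older kernels; APIC_EOI_ACK is 0, matching the "write 0" above):

static inline void ack_APIC_irq(void)
{
    /* Compiles down to a single store of APIC_EOI_ACK (0) to the EOI register. */
    apic_write(APIC_EOI, APIC_EOI_ACK);
}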

2.5. invoke_softirq

irq_exit() removes the count added by irq_enter(), marking the end of HARDIRQ; it then uses in_interrupt() to check that preempt_count is 0 and, if a softirq is pending, calls invoke_softirq().

static inline void invoke_softirq(void)
{
    if (!force_irqthreads) {
        /*
         * We can safely execute softirq on the current stack if
         * it is the irq stack, because it should be near empty
         * at this stage. But we have no way to know if the arch
         * calls irq_exit() on the irq stack. So call softirq
         * in its own stack to prevent from any overrun on top
         * of a potentially deep task stack.
         */
        do_softirq();
    } else {
        wakeup_softirqd();
    }
}

 

posted on 2020-09-04 16:40 by tycoon3