Understanding the Linux Kernel, Chapter 4: Interrupts and Exceptions
Interrupts and Exceptions
classification
- Interrupts
  - Maskable interrupts
  - Nonmaskable interrupts
- Exceptions
  - Processor-detected exceptions
    - Faults
    - Traps
    - Aborts
  - Programmed exceptions
IRQs and Interrupts
Each hardware device controller capable of issuing interrupt requests usually has a single output line designated as the Interrupt ReQuest (IRQ) line; more sophisticated devices use several IRQ lines. All IRQ lines are connected to the input pins of a hardware circuit called the Programmable Interrupt Controller (PIC).
Each IRQ line can be selectively disabled. The PIC can be told to stop issuing interrupts that refer to a given IRQ line, or to resume issuing them. Disabled interrupts are not lost; the PIC sends them to the CPU as soon as they are enabled again.
Selective enabling/disabling of IRQs is not the same as global masking/unmasking of maskable interrupts. When the IF flag of the eflags register is clear, each maskable interrupt issued by the PIC is temporarily ignored by the CPU. The cli and sti assembly language instructions, respectively, clear and set that flag.
The Advanced Programmable Interrupt Controller(APIC)
The APIC is designed for multiprocessor systems. Each CPU includes a local APIC, and all local APICs are connected to an external I/O APIC, giving rise to a multi-APIC system.

distributing interrupts
The I/O APIC uses the Interrupt Redirection Table to specify, for each IRQ, the interrupt vector and priority, the destination processor, and how the processor is selected. Every entry of the table can be individually programmed.
Interrupt requests coming from external hardware devices can be distributed among the available CPUs in two ways:
- Static distribution
The IRQ signal is delivered to the local APICs listed in the corresponding Interrupt Redirection Table entry.
- Dynamic distribution
The IRQ signal is delivered to the local APIC of the processor that is executing the process with the lowest priority.
generating interprocessor interrupts
When a CPU wishes to send an interrupt to another CPU, it stores the interrupt vector and the identifier of the target’s local APIC in the Interrupt Command Register (ICR) of its own local APIC. Interprocessor interrupts (IPIs) are actively used by Linux to exchange messages among CPUs.
Exceptions

Interrupt Descriptor Table
A system table called Interrupt Descriptor Table (IDT) associates each interrupt or exception vector with the address of the corresponding interrupt or exception handler.
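idtr CPU register
Stores the base linear address of the IDT and its limit (maximum length), so the IDT can be located anywhere in memory.
Linux uses interrupt gates to handle interrupts and trap gates to handle exceptions.
The IDT may contain three types of descriptors (Intel classification):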
- Task gate
Includes the TSS(Task State Segment) selector of the process that must replace the current one when an interrupt signal occurs.
- Interrupt gate
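Includes the Segment Selector and the offset inside the segment of an interrupt or exception handler. While transferring control to the segment, the processor clears the IF flag, thus disabling further maskable interrupts.
- Trap gate
Same as an interrupt gate, except that the processor does not modify the IF flag while transferring control to the segment.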
Hardware handling of Interrupts and Exceptions
privilege check of exception handler
The control unit makes sure the interrupt was issued by an authorized source. First, it compares the Current Privilege Level (CPL), which is stored in the two least significant bits of the cs register, with the Descriptor Privilege Level (DPL) of the handler's Segment Descriptor included in the GDT. It raises a “General protection” exception if the CPL is numerically lower than the DPL (that is, more privileged), because the interrupt handler cannot have a lower privilege than the program that caused the interrupt. For programmed exceptions, it makes a further security check: it compares the CPL with the DPL of the gate descriptor included in the IDT and raises a “General protection” exception if the DPL is numerically lower than the CPL. This last check makes it possible to prevent access by user applications to specific trap or interrupt gates.
In other words, for programmed exceptions the running code must be at least as privileged as the gate requires in order to go through the gate, and the handler that the gate points to must be at least as privileged as the interrupted code so that it can resolve the problem.
In conclusion, the DPL of the Segment Descriptor is the numerically lowest CPL from which this interrupt handler can be reached. Conversely, the DPL of the IDT gate descriptor is the numerically highest CPL that may invoke the handler pointed to by the gate through a programmed exception.
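The two comparisons can be written down as a small predicate. This is only a sketch of the rule as stated above, not CPU microcode or kernel code; the struct and function names are hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the inputs to the two checks. */
struct gate_check {
    unsigned cpl;          /* Current Privilege Level, 0 (kernel) to 3 (user) */
    unsigned handler_dpl;  /* DPL of the handler's code Segment Descriptor */
    unsigned gate_dpl;     /* DPL of the IDT gate descriptor */
    bool programmed;       /* true for int, int3, into, bound */
};

/* Returns true if the transfer would be accepted, false if a
 * "General protection" exception would be raised instead. */
bool gate_access_allowed(const struct gate_check *c)
{
    assert(c->cpl <= 3 && c->handler_dpl <= 3 && c->gate_dpl <= 3);

    /* The handler must be at least as privileged as the interrupted code. */
    if (c->cpl < c->handler_dpl)
        return false;

    /* Only programmed exceptions are checked against the gate's DPL. */
    if (c->programmed && c->gate_dpl < c->cpl)
        return false;

    return true;
}
```

For instance, a User Mode program (CPL 3) issuing a programmed exception through a gate with DPL 0 is rejected, while a hardware interrupt arriving at the same gate while the CPU runs in User Mode is accepted.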
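prepare stage
Find the handler through the IDT (checking privileges), save the execution state and switch stacks through the TSS if the privilege level changes, then jump to the handler.
return from handler
The iret instruction reloads the state saved on the stack (switching back to the original stack if it was changed). When returning to a lower privilege level, it also clears any of the ds, es, fs, and gs segment registers whose selector refers to a Segment Descriptor with a DPL lower than the new CPL.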
Nested Execution of Interrupts and Exceptions

for exceptions
Because the “Page Fault” exception handler never gives rise to further exceptions, at most two kernel control paths associated with exceptions (the first one caused by a system call invocation, the second one caused by a Page Fault) may be stacked, one on top of the other.
for interrupts
An interrupt handler may preempt both other interrupt handlers and exception handlers. Conversely, an exception handler never preempts an interrupt handler. The only exception that can be triggered in Kernel Mode is “Page Fault”, which we just described. But interrupt handlers never perform operations that can induce page faults, and thus, potentially, a process switch.
On multiprocessor systems, several kernel control paths may execute concurrently. Moreover, a kernel control path associated with an exception may start executing on a CPU and, due to a process switch, migrate to another CPU (no process switch can take place while an interrupt handler is running).
Initializing the Interrupt Descriptor Table
Interrupt, Trap, and System Gates
Linux classifies the gate descriptors in the IDT as follows:
- Interrupt gate
An Intel interrupt gate that cannot be accessed by a User Mode process (the gate’s DPL field is equal to 0), which means that user code cannot invoke it with a programmed exception.
- System gate
An Intel trap gate that can be accessed by a User Mode process (the gate’s DPL field is equal to 3). The three Linux exception handlers associated with the vectors 4, 5, and 128 are activated by means of system gates, so the three assembly language instructions into, bound, and int $0x80 can be issued in User Mode.
- System interrupt gate
An Intel interrupt gate that can be accessed by a User Mode process (the gate’s DPL field is equal to 3). The exception handler associated with the vector 3 is activated by means of a system interrupt gate, so the assembly language instruction int3 can be issued in User Mode.
- Trap gate
An Intel trap gate that cannot be accessed by a User Mode process (the gate’s DPL field is equal to 0). Most Linux exception handlers are activated by means of trap gates.
- Task gate
An Intel task gate that cannot be accessed by a User Mode process (the gate’s DPL field is equal to 0). The Linux handler for the “Double fault” exception is activated by means of a task gate.
set gate in IDT
set_intr_gate(n,addr);
set_system_gate(n,addr);
set_system_intr_gate(n,addr);
set_trap_gate(n,addr);
set_task_gate(n,gdt);
Preliminary Initialization of the IDT
During kernel initialization, the setup_idt( ) assembly language function starts by filling all 256 entries of idt_table with the same interrupt gate, which refers to the ignore_int( ) interrupt handler. ignore_int( ) is a null handler: it does nothing except print an “Unknown interrupt” message.
The ignore_int( ) handler should never be executed. The occurrence of “Unknown interrupt” messages on the console or in the log files denotes either a hardware problem (an I/O device is issuing unforeseen interrupts) or a kernel problem (an interrupt or exception is not being handled properly).
Exception Handling
Most exceptions issued by the CPU are interpreted by Linux as error conditions. When one of them occurs, the kernel sends a signal to the process that caused the exception to notify it of an anomalous condition.
Exception handlers have a standard structure consisting of three steps:
- Save the contents of most registers in the Kernel Mode stack (this part is coded in assembly language)
- Handle the exception by means of a high-level C function
- Exit from the handler by means of the ret_from_exception( ) function.
handle the Double fault
The “Double fault” exception is handled by means of a task gate instead of a trap or system gate, because it denotes a serious kernel misbehavior.
Thus, the exception handler that tries to print out the register values does not trust the current value of the esp register. When such an exception occurs, the CPU fetches the Task Gate Descriptor stored in the entry at index 8 of the IDT. This descriptor points to the special TSS segment descriptor stored in the 32nd entry of the GDT. Next, the CPU loads the eip and esp registers with the values stored in the corresponding TSS segment. As a result, the processor executes the doublefault_fn() exception handler on its own private stack.
Saving the Registers for the Exception handler
Each exception handler saves the error code and the address of the high-level C handler on the stack, then jumps to the assembly code labeled error_code, which:
- saves the registers and the information needed to invoke the handler
- invokes the handler whose address was stored on the stack.
Entering and leaving the Exception handler
steps:
- do_exception_name() invokes do_trap(), which saves the exception information in current->thread and sends a signal to the process.
- do_trap() checks whether the exception occurred in User Mode or in Kernel Mode. If it occurred in Kernel Mode, there are two cases:
- case 0: a kernel bug; die() is invoked to print all the information on the console, and do_exit() terminates the current process.
- case 1: an invalid argument was passed to the kernel.
- Finally, the handler jumps to the code labeled ret_from_exception.
Interrupt Handling
three main classes of interrupts
- I/O interrupts
An I/O device requires attention; the corresponding interrupt handler must query the device to determine the proper course of action.
- Timer interrupts
Some timer, either a local APIC timer or an external timer, has issued an interrupt; this kind of interrupt tells the kernel that a fixed-time interval has elapsed. These interrupts are handled mostly as I/O interrupts.
- Interprocessor interrupts
A CPU issued an interrupt to another CPU of a multiprocessor system.
I/O Interrupt Handling
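Several devices may share the same IRQ line. The kernel supports this in two ways:
- IRQ sharing (the interrupt handler executes several interrupt service routines (ISRs), one for each device sharing the line)
- IRQ dynamic allocation (an IRQ line is associated with a device driver only while the device is in use)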

Linux divides the actions to be performed following an interrupt into three classes:
- Critical
Actions such as acknowledging an interrupt to the PIC, reprogramming the PIC or the device controller, or updating data structures accessed by both the device and the processor.
- Noncritical
Actions such as updating data structures that are accessed only by the processor (for instance, reading the scan code after a keyboard key has been pushed). These actions can also finish quickly, so they are executed by the interrupt handler immediately, with the interrupts enabled.
- Noncritical deferrable
Actions such as copying a buffer’s contents into the address space of a process. These may be delayed for a long time interval without affecting the kernel operations; the interested process will just keep waiting for the data.
Basic actions performed by every I/O interrupt handler:
- Save the IRQ value and the register’s contents on the Kernel Mode stack.
- Send an acknowledgment to the PIC that is servicing the IRQ line.
- Execute the interrupt service routines (ISRs) associated with all the devices that share the IRQ.
- Terminate by jumping to the ret_from_intr() address.
IRQ data structures

unexpected interrupt
An interrupt is unexpected either if there is no ISR associated with the IRQ line, or if no ISR associated with the line recognizes the interrupt as raised by its own hardware device.
1.irq_desc_t
Every interrupt vector has its own irq_desc_t descriptor. All such descriptors are grouped together in the irq_desc array.
status
A set of flags describing the IRQ line status.
IRQ_INPROGRESS
IRQ_DISABLED//The IRQ line has been deliberately disabled by a device driver.
IRQ_PENDING
IRQ_REPLAY
IRQ_AUTODETECT
IRQ_WAITING
IRQ_LEVEL
IRQ_MASKED
IRQ_PER_CPU
depth
The depth field and the IRQ_DISABLED flag of the irq_desc_t descriptor specify whether the IRQ line is enabled or disabled.
Every time the disable_irq() or disable_irq_nosync() function is invoked, the depth field is increased; if depth was equal to 0 before the increment, the function disables the IRQ line and sets its IRQ_DISABLED flag. Conversely, each invocation of the enable_irq() function decreases the field; if depth becomes 0, the function enables the IRQ line and clears its IRQ_DISABLED flag.
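The nesting behavior of the depth counter may be easier to see in a toy model. The sketch below is not kernel code: the struct and the model_* functions are hypothetical stand-ins that keep only the counter and the flag, and they start from an enabled line with depth 0 for simplicity.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified stand-in for the depth/status pair of irq_desc_t. */
struct irq_line {
    unsigned depth;   /* number of outstanding disable requests */
    bool disabled;    /* models the IRQ_DISABLED flag */
};

void model_disable_irq(struct irq_line *l)
{
    /* Only the first disable request actually turns the line off. */
    if (l->depth++ == 0)
        l->disabled = true;
}

void model_enable_irq(struct irq_line *l)
{
    assert(l->depth > 0);   /* an unbalanced enable is a caller bug */

    /* Only the last matching enable turns the line back on. */
    if (--l->depth == 0)
        l->disabled = false;
}
```

Two nested disable requests therefore need two enable requests before the line becomes active again, which lets independent code paths disable the same line without coordinating with each other.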
handler
Points to the PIC object (hw_irq_controller descriptor, see below) that services the IRQ line.
action
Identifies the interrupt service routines (ISRs) to be invoked when the IRQ occurs. The field points to the first element of the list of irqaction descriptors associated with the IRQ; each irqaction descriptor describes one device sharing this IRQ line (see below).
2.hw_interrupt_type
PIC objects, consisting of the PIC name and seven PIC standard methods.
3.irqaction
Each irqaction descriptor refers to a specific hardware device and a specific interrupt.
handler
Points to the interrupt service routine(ISR) for an I/O device. This is the key field that allows many devices to share the same IRQ.
flags
This field includes a few flags that describe the relationships between the IRQ line and the I/O device.
SA_INTERRUPT//The handler must execute with interrupts disabled
SA_SHIRQ//The device permits its IRQ line to be shared with other devices
SA_SAMPLE_RANDOM
next
Points to the next element of a list of irqaction descriptors. The elements in the list refer to hardware devices that share the same IRQ.
4.irq_stat
The irq_stat array includes NR_CPUS entries, one for every possible CPU in the system. Each entry of type irq_cpustat_t includes a few counters and flags used by the kernel to keep track of what each CPU is currently doing.
IRQ distribution in multiprocessor systems
TPR(task priority register)
arbitration priority registers
If the TPR values are the same, the interrupt is assigned based on the arbitration priority registers of the local APICs.
IRQ affinity of a CPU
IRQs can be assigned to CPUs by hand when the automatic distribution is unfair. Linux 2.6 makes use of a special kernel thread called kirqd to correct, if necessary, the automatic assignment of IRQs to CPUs.
By modifying the Interrupt Redirection Table entries of the I/O APIC, it is possible to route an interrupt signal to a specific CPU. This can be done by invoking the set_ioapic_affinity_irq() function, or by changing the CPU bitmap mask in the /proc/irq/n/smp_affinity file (n denotes the interrupt vector).
Multiple Kernel Mode stacks
If thread_union is 8 KB, the Kernel Mode stack of the current process is used for every type of kernel control path. Conversely, if it is 4 KB, there are three types of stack:
- The exception stack
Used for exceptions; contained in the per-process thread_union data structure.
- The hard IRQ stack
Used for handling interrupts. There is one hard IRQ stack for each CPU in the system, and each stack is contained in a single page frame.
- The soft IRQ stack
Used for handling deferrable functions. There is one soft IRQ stack for each CPU in the system, and each stack is contained in a single page frame.
All hard IRQ stacks are contained in the hardirq_stack array, while all soft IRQ stacks are contained in the softirq_stack array. Each array element is a union of type irq_ctx that spans a single page. At the bottom of this page is stored a thread_info structure, while the spare memory locations are used for the stack.
execution of interrupt handler
functions interrupt[n]
interrupt[n] is used to initialize entries in IDT.
for (i = 0; i < NR_IRQS; i++)
    if (i+32 != 128)
        set_intr_gate(i+32,interrupt[i]);
The interrupt array is built through a few assembly language instructions.
pushl $n-256
jmp common_interrupt
The kernel represents all IRQs through negative numbers, because it reserves positive interrupt numbers to identify system calls.
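A short illustration of the encoding: the stub pushes n-256, which is negative for every IRQ number below 256, and the original number can be recovered from the low 8 bits. The helper names are hypothetical, and recovering the value with an 8-bit mask is an assumption about how the stored value is decoded, not something stated in these notes.

```c
#include <assert.h>

/* Value pushed by the interrupt[n] stub: negative for n in 0..255. */
int model_encode_irq(int n)
{
    assert(n >= 0 && n < 256);
    return n - 256;
}

/* Recovers n by keeping the low 8 bits of the stored value
 * (assumed decoding step). */
int model_decode_irq(int orig_eax)
{
    return (int)((unsigned)orig_eax & 0xffu);
}
```

Because every encoded IRQ is negative and system call numbers are positive, code that inspects the saved value can tell the two apart with a sign test.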
common_interrupt
The code labeled common_interrupt saves the registers, calls do_IRQ(), and then jumps to ret_from_intr to return.
common_interrupt:
    SAVE_ALL
    movl %esp,%eax
    call do_IRQ
    jmp ret_from_intr
eax points to the stack location containing the last register value pushed on by SAVE_ALL.
The do_IRQ() function
The do_IRQ() function is invoked to execute all interrupt service routines associated with an interrupt. It is declared as follows:
__attribute__((regparm(3))) unsigned int do_IRQ(struct pt_regs *regs)
The regparm keyword instructs the function to go to the eax register to find the value of the regs argument.
steps performed by do_IRQ()
- Executes the irq_enter() macro, which increases the counter representing the number of nested interrupt handlers (stored in the preempt_count field).
- Switches to the hard IRQ stack if needed.
- Invokes the __do_IRQ() function passing to it the pointer regs and the IRQ number obtained from the regs->orig_eax field.
- Switches back to the original stack if the hard IRQ stack was used.
- Executes the irq_exit() macro, which decreases the interrupt counter and checks whether deferrable functions are waiting to be executed.
The __do_IRQ() function
The __do_IRQ() function receives as its parameters an IRQ number (through the eax register) and a pointer to the pt_regs structure where the User Mode register values have been saved.
steps
- Local interrupts are disabled until the handler terminates (the same interrupt can still be accepted by other CPUs).
- Sets a few flags of the irq_desc_t descriptor.
- Checks whether the IRQ line is disabled; if so, no ISR is executed.
- Sets the IRQ_INPROGRESS flag and invokes handle_IRQ_event().
- Invokes the end method of the PIC object (irq_desc_t->handler->end).
handle_IRQ_event()
- Enables the local interrupts with the sti assembly language instruction if the SA_INTERRUPT flag is clear.
- Executes each interrupt service routine of the interrupt (calls each action->handler in the irq_desc_t->action list).
- Disables local interrupts with the cli assembly language instruction.
- Returns 0 if no interrupt service routine has recognized the interrupt, 1 otherwise.
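The list walk at the heart of these steps can be sketched in user-space C. This is a simplified model, not the kernel function: the struct is a trimmed-down, hypothetical version of irqaction, the sti/cli steps are left out, and the example ISR decides ownership from a plain flag instead of querying hardware.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, trimmed-down irqaction: one entry per device on the line. */
struct model_irqaction {
    int (*handler)(int irq, void *dev_id);  /* returns 1 if the device raised the IRQ */
    void *dev_id;
    struct model_irqaction *next;
};

/* Walks the whole list; returns 1 if any ISR claimed the interrupt, else 0. */
int model_handle_irq_event(int irq, struct model_irqaction *action)
{
    int handled = 0;

    assert(irq >= 0);
    for (; action != NULL; action = action->next)
        handled |= action->handler(irq, action->dev_id);
    return handled;
}

/* Example ISR: the device "raised" the interrupt if its pending flag is set. */
int model_isr(int irq, void *dev_id)
{
    int *pending = dev_id;

    (void)irq;
    if (!*pending)
        return 0;      /* not ours; the other ISRs on the line still run */
    *pending = 0;      /* service the device */
    return 1;
}
```

Every ISR on the line is called even after one of them has claimed the interrupt, since several devices sharing the line may have raised it at the same time. A return value of 0 corresponds to the unexpected interrupt case.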
SA_INTERRUPT
The SA_INTERRUPT flag of the main IRQ descriptor determines whether interrupts must be enabled or disabled when the do_IRQ( ) function invokes an ISR.
Dynamic allocation of IRQ lines
There is a way in which the same IRQ line can be used by several hardware devices even if they do not allow IRQ sharing. The trick is to serialize the activation of the hardware devices so that just one owns the IRQ line at a time.
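//request_irq() creates a new irqaction descriptor, initializes it, and invokes setup_irq() to insert it in the list for IRQ 6; it returns an error code, not the descriptor
int err = request_irq(6, floppy_interrupt, SA_INTERRUPT|SA_SAMPLE_RANDOM, "floppy", NULL);
//when the operations on the device are concluded, release the IRQ line
free_irq(6,NULL);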
Interprocessor Interrupt Handling
Interprocessor interrupts allow a CPU to send interrupt signals to any other CPU in the system.
three kinds of interprocessor interrupts
- CALL_FUNCTION_VECTOR(vector 0xfb)
Sent to all CPUs but the sender, forcing those CPUs to run a function passed by the sender. The corresponding interrupt handler is named call_function_interrupt( ).
- RESCHEDULE_VECTOR(vector 0xfc)
When a CPU receives this type of interrupt, the corresponding handler—named reschedule_interrupt()—limits itself to acknowledging the interrupt.
- INVALIDATE_TLB_VECTOR(vector 0xfd)
Sent to all CPUs but the sender, forcing them to invalidate their Translation Lookaside Buffers.
send_IPI_all( )
//Sends an IPI to all CPUs (including the sender)
send_IPI_allbutself( )
//Sends an IPI to all CPUs except the sender
send_IPI_self( )
//Sends an IPI to the sender CPU
send_IPI_mask()
//Sends an IPI to a group of CPUs specified by a bit mask
Softirqs and Tasklets
Softirqs and tasklets are strictly correlated, because tasklets are implemented on top of softirqs. Softirqs are statically allocated (i.e., defined at compile time), while tasklets can also be allocated and initialized at runtime. Softirqs can run concurrently on several CPUs, even if they are of the same type. Thus, softirqs are reentrant functions. Tasklets' execution is controlled more strictly by the kernel. Tasklets of the same type are always serialized: in other words, the same type of tasklet cannot be executed by two CPUs at the same time. However, tasklets of different types can be executed concurrently on several CPUs.
terms
softirq
The term softirq, which appears in the kernel source code, often denotes both kinds of deferrable functions.
interrupt context
The term interrupt context specifies that the kernel is currently executing either an interrupt handler or a deferrable function.
Softirqs

The index of a softirq determines its priority: a lower index means higher priority because softirq functions will be executed starting from index 0.
Data structures used for softirqs
1.softirq_vec
The softirq_vec array includes 32 elements of type softirq_action. The priority of a softirq is the index of the corresponding softirq_action element inside the array.
2.softirq_action
The softirq_action data structure consists of two fields: an action pointer to the softirq function and a data pointer to a generic data structure that may be needed by the softirq function.
3.preempt_count
Stored in the thread_info descriptor (of the current process or of the irq_ctx union); it is used to keep track both of kernel preemption and of the nesting of kernel control paths.

There is a good reason for the name of the preempt_count field: kernel preemptability has to be disabled either when it has been explicitly disabled by the kernel code (preemption counter not zero) or when the kernel is running in interrupt context.
The in_interrupt() macro checks the hardirq and softirq counters in the current_thread_info()->preempt_count field. If either one of these two counters is positive, the macro yields a nonzero value, otherwise it yields the value zero.
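The check can be sketched with a few masks. The bit widths below (8 bits for the preemption counter, 8 for the softirq counter, 12 for the hardirq counter) are an assumption about the field layout that these notes do not spell out, and all MODEL_* names and model_* functions are hypothetical.

```c
#include <assert.h>

/* Assumed layout of preempt_count, lowest bits first. */
#define MODEL_PREEMPT_BITS   8
#define MODEL_SOFTIRQ_BITS   8
#define MODEL_HARDIRQ_BITS   12

#define MODEL_SOFTIRQ_SHIFT  MODEL_PREEMPT_BITS
#define MODEL_HARDIRQ_SHIFT  (MODEL_SOFTIRQ_SHIFT + MODEL_SOFTIRQ_BITS)

#define MODEL_SOFTIRQ_MASK   (((1u << MODEL_SOFTIRQ_BITS) - 1) << MODEL_SOFTIRQ_SHIFT)
#define MODEL_HARDIRQ_MASK   (((1u << MODEL_HARDIRQ_BITS) - 1) << MODEL_HARDIRQ_SHIFT)

#define MODEL_SOFTIRQ_OFFSET (1u << MODEL_SOFTIRQ_SHIFT)
#define MODEL_HARDIRQ_OFFSET (1u << MODEL_HARDIRQ_SHIFT)

/* Nonzero when either the hardirq or the softirq counter is positive. */
unsigned model_in_interrupt(unsigned preempt_count)
{
    return preempt_count & (MODEL_HARDIRQ_MASK | MODEL_SOFTIRQ_MASK);
}

/* irq_enter()-like step: one more nested interrupt handler. */
unsigned model_irq_enter(unsigned preempt_count)
{
    /* The hardirq counter must not overflow into the next field. */
    assert((preempt_count & MODEL_HARDIRQ_MASK) != MODEL_HARDIRQ_MASK);
    return preempt_count + MODEL_HARDIRQ_OFFSET;
}
```

Packing the three counters into one word means a single comparison against zero answers whether the kernel can be preempted, while a single mask answers whether it is running in interrupt context.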
4.irq_cpustat_t->__softirq_pending
per-CPU 32-bit mask describing the pending softirqs.
The do_softirq() function
steps
- Returns immediately if in_interrupt() yields a nonzero value.
- Executes local_irq_save to save the state of the IF flag and disable interrupts on the local CPU.
- If needed, switches to the soft IRQ stack of the local CPU (softirq_ctx array).
- Invokes __do_softirq().
- Switches back to the original stack if the soft IRQ stack was used.
- Executes local_irq_restore to restore the state of the IF flag.
The __do_softirq() function
The __do_softirq() function reads the softirq bit mask of the local CPU and executes the deferrable functions corresponding to every set bit. Because new softirqs may become pending while it runs, it repeats this for a fixed number of iterations at most.
steps
- Initializes the iteration counter to 10.
- Copies the softirq bit mask of the local CPU into a local variable (pending).
- Invokes local_bh_disable() to increase the softirq counter.
- Clears the softirq bitmap of the local CPU, so that new softirqs can be activated.
- Executes local_irq_enable() to enable local interrupts.
- For each bit set in pending, calls the corresponding softirq_vec[n]->action function.
- Executes local_irq_disable() to disable local interrupts.
- Copies the softirq bit mask again, decreases the iteration counter, and jumps back to the step that clears the bitmap, until the counter reaches 0 or no softirq is pending.
- If there are more pending softirqs, it invokes wakeup_softirqd() to wake up the kernel thread that takes care of the softirqs for the local CPU.
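The bounded loop described by these steps can be modeled in user space. The sketch below keeps only the bit mask, the iteration limit, and the hand-off; interrupt enabling and local_bh_disable() are left out, and every model_* name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_NR_SOFTIRQS 32
#define MODEL_MAX_RESTART 10

typedef void (*model_softirq_fn)(void);

struct model_cpu {
    uint32_t pending;                         /* per-CPU softirq bit mask */
    model_softirq_fn vec[MODEL_NR_SOFTIRQS];  /* models softirq_vec */
    int woke_ksoftirqd;                       /* set when work is handed off */
};

struct model_cpu model_cpu0;
int model_runs;

/* A softirq that always reactivates itself, to exercise the limit. */
void model_greedy_softirq(void)
{
    model_runs++;
    model_cpu0.pending |= 1u << 3;
}

void model_do_softirq(struct model_cpu *cpu)
{
    int max_restart = MODEL_MAX_RESTART;
    uint32_t pending = cpu->pending;

    do {
        cpu->pending = 0;   /* clear so that handlers can raise new softirqs */
        for (int n = 0; pending != 0; n++, pending >>= 1) {
            if (pending & 1) {
                assert(cpu->vec[n] != 0);
                cpu->vec[n]();          /* lowest index runs first */
            }
        }
        pending = cpu->pending;         /* softirqs raised in the meantime */
    } while (pending != 0 && --max_restart > 0);

    if (pending != 0)
        cpu->woke_ksoftirqd = 1;        /* models wakeup_softirqd() */
}
```

With model_greedy_softirq installed at index 3 and its bit set, the loop runs the function ten times and then hands the remaining work off instead of spinning forever, which is the point of the iteration counter.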
The ksoftirqd kernel threads
for(;;) {
    set_current_state(TASK_INTERRUPTIBLE);
    schedule();
    /* now in TASK_RUNNING state */
    while (local_softirq_pending()) {
        preempt_disable();
        do_softirq();
        preempt_enable();
        cond_resched();
    }
}
When awakened, the kernel thread checks the local_softirq_pending() softirq bit mask and invokes, if necessary, do_softirq().
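high frequency softirqs
Softirqs may reactivate themselves, which can make them run at a very high frequency.
The ksoftirqd kernel threads are one solution: softirqs still pending after the iteration limit are executed by these threads instead.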
Tasklets
Tasklets are built on top of softirqs.
Tasklets and high-priority tasklets are stored in the tasklet_vec and tasklet_hi_vec arrays, respectively. Both of them include NR_CPUS elements of type tasklet_head, and each element consists of a pointer to a list of tasklet descriptors(struct tasklet_struct).
tasklet_struct
state
- TASKLET_STATE_SCHED//means has been scheduled for execution
- TASKLET_STATE_RUN
ops
//disable the tasklet (by increasing the count field of tasklet_struct)
tasklet_disable_nosync()//returns immediately
tasklet_disable()//does not return until any running instance of the tasklet has terminated
//reenable
tasklet_enable()
//activate the tasklet
tasklet_schedule()
tasklet_hi_schedule()
//execute the tasklets; these two functions are registered in softirq_vec and invoked by do_softirq()
tasklet_hi_action()
tasklet_action()
tasklet_action()
- Disables local interrupts.
- Gets the local CPU number n.
- Stores the address of the list pointed to by tasklet_vec[n] in a local variable and sets tasklet_vec[n] to NULL.
- Enables local interrupts.
- For each tasklet descriptor in the list, executes the tasklet function unless the tasklet is disabled or is already executing.
- In multiprocessor systems, checks the TASKLET_STATE_RUN flag of the tasklet, to avoid running the same tasklet concurrently on another CPU.
Notice that, unless the tasklet function reactivates itself, every tasklet activation triggers at most one execution of the tasklet function.
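The "at most one execution per activation" property follows from how the scheduled flag is handled, and a single-CPU toy model shows it. This is a sketch, not the kernel implementation: it has no atomic operations or TASKLET_STATE_RUN handling, and all model_* names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct model_tasklet {
    struct model_tasklet *next;
    bool sched;                 /* models TASKLET_STATE_SCHED */
    int count;                  /* nonzero means disabled */
    void (*func)(unsigned long);
    unsigned long data;
};

struct model_tasklet *model_list;   /* models one CPU's tasklet_vec entry */
int model_tasklet_runs;

/* Example tasklet function: adds data to a global run counter. */
void model_counting_func(unsigned long data)
{
    model_tasklet_runs += (int)data;
}

void model_tasklet_schedule(struct model_tasklet *t)
{
    if (t->sched)
        return;                 /* already pending: nothing to add */
    t->sched = true;
    t->next = model_list;
    model_list = t;
}

void model_tasklet_action(void)
{
    struct model_tasklet *list = model_list;

    model_list = NULL;          /* detach the list before running it */
    while (list != NULL) {
        struct model_tasklet *t = list;

        list = list->next;
        if (t->count != 0) {    /* disabled: keep it pending for later */
            t->next = model_list;
            model_list = t;
            continue;
        }
        assert(t->sched);
        t->sched = false;       /* cleared first, so func may reschedule */
        t->func(t->data);
    }
}
```

Scheduling the same tasklet twice before model_tasklet_action() runs still produces a single call of its function, because the second request finds the scheduled flag already set.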
Work Queues
Work queues allow kernel functions to be activated (much like deferrable functions) and later executed by special kernel threads called worker threads.
differences between work queues and deferrable functions
Deferrable functions run in interrupt context, while functions in work queues run in process context (and can therefore block). However, a function in a work queue is executed by a kernel thread, so there is no User Mode address space to access (it cannot access user data).
Work queue data structures
1.workqueue_struct
Contains an array of NR_CPUS elements of type cpu_workqueue_struct.
2.work_struct
Represents a pending function; it is stored in the worklist field of cpu_workqueue_struct.
Work queue functions
//create
create_workqueue("workqueue_name");//returns the address of the newly created workqueue_struct descriptor and creates n worker threads (n is the number of CPUs)
create_singlethread_workqueue();//creates just one worker thread
//destroy
destroy_workqueue()
//insert work
queue_work();//does nothing if the work is already in the work queue
//insert work into the worklist only after a timer expires
queue_delayed_work()//takes an additional parameter: the time delay before the work is inserted
//cancel a delayed work that has not yet been inserted into the worklist
cancel_delayed_work()
//block the calling process until all functions that are pending in the work queue terminate
flush_workqueue();//does not wait for work inserted after flush_workqueue() was invoked
The predefined work queue
The kernel offers a predefined work queue called events, which can be freely used by every kernel developer.
ops

Functions executed in the predefined work queue should not block for a long time: because the execution of the pending functions in the work queue list is serialized on each CPU, a long delay negatively affects the other users of the predefined work queue.
Returning from Interrupts and Exceptions
Several issues must be considered before returning:
- Number of kernel control paths being concurrently executed
- Pending process switch requests
- Pending signals
- Single-step mode
- Virtual-8086 mode
A few flags are used to keep track of pending process switch requests, of pending signals, and of single step execution; they are stored in the flags field of the thread_info descriptor.

flow of return from interrupts and exceptions
The two flows differ only if support for kernel preemption has been selected as a compilation option: in this case, local interrupts are immediately disabled when returning from exceptions.

