Notes: Understanding the Linux Kernel, Chapter 10: System Calls

System Call Handler and Service Routines

In the 80 × 86 architecture, a Linux system call can be invoked in two different ways, and both of them jump to an assembly language function called the system call handler. The system call handler:

Saves the contents of most registers in the Kernel Mode stack (this operation is common to all system calls and is coded in assembly language).
Handles the system call by invoking a corresponding C function called the system call service routine.
Exits from the handler: the registers are loaded with the values saved in the Kernel Mode stack, and the CPU is switched back from Kernel Mode to User Mode.

Entering and Exiting a System Call

Issuing a System Call via the int $0x80 Instruction

The vector 128—in hexadecimal, 0x80—is associated with the kernel entry point. The trap_init() function, invoked during kernel initialization, sets up the Interrupt Descriptor Table entry corresponding to vector 128 as follows:

set_system_gate(0x80, &system_call);	/* install the IDT gate for vector 0x80 */

The system_call() function

The system_call( ) function starts by saving the system call number and all the CPU registers that may be used by the exception handler on the stack—except for eflags, cs, eip, ss, and esp, which have already been saved automatically by the control unit.

	# system call handler stub
ENTRY(system_call)
	pushl %eax			# save orig_eax
	SAVE_ALL
	GET_THREAD_INFO(%ebp)
    ......

/* How to get the address of the thread_info structure from assembly.
 * The result goes into the register passed as reg (ebp in system_call).
 * -THREAD_SIZE is 0xffffe000 for 8-KB stacks or 0xfffff000 for 4-KB stacks. */
#define GET_THREAD_INFO(reg) \
	movl $-THREAD_SIZE, reg; \
	andl %esp, reg
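
The masking trick works because the thread_info structure sits at the lowest address of the stack area, which is aligned to its own size. A minimal user-space sketch of the same arithmetic follows; demo_thread_info_base() is a hypothetical helper, and the 8-KB stack size is an assumption:

```c
#include <stdint.h>

/* Assumed stack size: 8 KB, matching the 0xffffe000 mask. */
#define DEMO_THREAD_SIZE 8192u

/* Hypothetical helper: models "movl $-THREAD_SIZE, reg; andl %esp, reg". */
uint32_t demo_thread_info_base(uint32_t esp)
{
    /* 0 - 8192 wraps to 0xffffe000 in 32-bit unsigned arithmetic. */
    uint32_t mask = (uint32_t)(0u - DEMO_THREAD_SIZE);

    /* Clearing the low 13 bits yields the base of the 8-KB stack area. */
    return esp & mask;
}
```

For example, a kernel stack pointer of 0xc0123f40 would map to a thread_info address of 0xc0122000.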

Next, the system_call( ) function checks whether either one of the TIF_SYSCALL_TRACE and TIF_SYSCALL_AUDIT flags included in the flags field of the thread_info structure is set—that is, whether the system call invocations of the executed program are being traced by a debugger. If this is the case, system_call( ) invokes the do_syscall_trace( ) function twice: once right before and once right after the execution of the system call service routine (as described later). This function stops current and thus allows the debugging process to collect information about it.

ENTRY(system_call)
......
	testb $(_TIF_SYSCALL_TRACE|_TIF_SYSCALL_AUDIT),TI_flags(%ebp)
	jnz syscall_trace_entry
......
syscall_trace_entry:
	movl $-ENOSYS,EAX(%esp)
	movl %esp, %eax
	xorl %edx,%edx
	call do_syscall_trace
	movl ORIG_EAX(%esp), %eax
	cmpl $(nr_syscalls), %eax
	jnae syscall_call
	jmp syscall_exit

A validity check is then performed on the system call number passed by the User Mode process. If it is greater than or equal to the number of entries in the system call dispatch table, the system call handler stores -ENOSYS as the return value and terminates.

	cmpl $(nr_syscalls), %eax
	jae syscall_badsys

Invoking the service routine

syscall_call:
	call *sys_call_table(,%eax,4)	# eax holds the system call number; each table entry is 4 bytes
	movl %eax,EAX(%esp)		# store the return value
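
The validity check and the indirect call through the dispatch table can be pictured in C. This is only a sketch: the table, the two service routines, and the three-argument signature are hypothetical stand-ins, not the kernel's real definitions.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical, simplified service routine signature. */
typedef long (*demo_syscall_fn)(long, long, long);

long demo_getpid(long a, long b, long c)
{
    (void)a; (void)b; (void)c;
    return 1234;                      /* made-up process ID */
}

long demo_add(long a, long b, long c)
{
    (void)c;
    return a + b;
}

/* Stand-in for sys_call_table: indexed by system call number. */
static const demo_syscall_fn demo_call_table[] = { demo_getpid, demo_add };

#define DEMO_NR_SYSCALLS (sizeof(demo_call_table) / sizeof(demo_call_table[0]))

long demo_dispatch(unsigned long nr, long a, long b, long c)
{
    /* Numbers greater than or equal to the table size are rejected. */
    if (nr >= DEMO_NR_SYSCALLS)
        return -ENOSYS;

    /* Indirect call through the table, like "call *sys_call_table(...)". */
    return demo_call_table[nr](a, b, c);
}
```

The return value of demo_dispatch() plays the role of the value that the handler stores in the saved eax slot.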

Exiting from the system call
The system_call() function then disables local interrupts and checks the flags in the thread_info structure of current.

#define _TIF_ALLWORK_MASK	0x0000FFFF	/* work to do on any return to u-space */

syscall_exit:
	cli				# make sure we don't miss an interrupt
					# setting need_resched or sigpending
					# between sampling and the iret
	movl TI_flags(%ebp), %ecx
	testw $_TIF_ALLWORK_MASK, %cx	# test flag current->work
	jne syscall_exit_work
restore_all:
   ......
syscall_exit_work:
	testb $(_TIF_SYSCALL_TRACE|_TIF_SYSCALL_AUDIT|_TIF_SINGLESTEP), %cl
  ......

Issuing a System Call via the sysenter Instruction

The int assembly language instruction is inherently slow because it performs several consistency and security checks.
The sysenter instruction, dubbed in Intel documentation as “Fast System Call,” provides a faster way to switch from User Mode to Kernel Mode.

The sysenter instruction

The sysenter assembly language instruction makes use of three special registers that must be loaded with the following information. “MSR” is an acronym for “Model-Specific Register” and denotes a register that is present only in some models of 80 × 86 microprocessors.

SYSENTER_CS_MSR

The Segment Selector of the kernel code segment

SYSENTER_EIP_MSR

The linear address of the kernel entry point

SYSENTER_ESP_MSR

The kernel stack pointer

When the sysenter instruction is executed, the CPU control unit:

Copies the content of SYSENTER_CS_MSR into cs.
Copies the content of SYSENTER_EIP_MSR into eip.
Copies the content of SYSENTER_ESP_MSR into esp. (This value is only temporary: it points to the end of the local TSS, from which the handler fetches the kernel stack pointer of the current process.)
Adds 8 to the value of SYSENTER_CS_MSR, and loads this value into ss. That is, ss receives the Segment Selector of the kernel data segment, which coincides with the kernel stack segment.

The three model-specific registers are initialized by the enable_sep_cpu() function, which is executed once by every CPU in the system during the initialization of the kernel. The function performs the following steps:

Writes the Segment Selector of the kernel code (__KERNEL_CS) in the SYSENTER_CS_MSR register.
Writes in the SYSENTER_EIP_MSR register the linear address of the sysenter_entry() function described below.
Computes the linear address of the end of the local TSS, and writes this value in the SYSENTER_ESP_MSR register. (Each CPU has its own TSS, so every CPU must set this value itself.)

At every process switch, the kernel saves the kernel stack pointer of the current process in the esp0 field of the local TSS. Thus, the system call handler reads the esp register, computes the address of the esp0 field of the local TSS, and loads into the same esp register the proper kernel stack pointer. (That is, the per-CPU TSS holds the kernel stack pointer of the process currently running on that CPU.)

The vsyscall page

In the initialization phase, the sysenter_setup() function builds a page frame called the vsyscall page, which contains a small ELF shared object (i.e., a tiny ELF dynamic library). When a process issues an execve() system call to start executing an ELF program, the code in the vsyscall page is dynamically linked to the process address space. The code in the vsyscall page makes use of the best available instruction to issue a system call. The sysenter_setup() function allocates a new page frame for the vsyscall page and associates its physical address with the FIX_VSYSCALL fix-mapped linear address.

If the CPU does not support sysenter, the function builds a vsyscall page that includes the code:

__kernel_vsyscall:
int $0x80
ret

If the CPU does support sysenter, the function builds a vsyscall page that includes the code:

__kernel_vsyscall:
pushl %ecx
pushl %edx
pushl %ebp
movl %esp, %ebp
sysenter

When a wrapper routine in the standard library must invoke a system call, it calls the __kernel_vsyscall() function, whatever it may be.

Entering the system call

The sequence of steps performed when a system call is issued via the sysenter instruction is the following.

1. The wrapper routine in the standard library loads the system call number into the eax register and calls the __kernel_vsyscall() function.
2. The __kernel_vsyscall() function saves on the User Mode stack the contents of ebp, edx, and ecx (these registers are going to be used by the system call handler), copies the user stack pointer into ebp (see step 4d), then executes the sysenter instruction.
3. The CPU switches from User Mode to Kernel Mode, and the kernel starts executing the sysenter_entry() function (pointed to by the SYSENTER_EIP_MSR register).
4. The sysenter_entry() assembly language function performs the following steps:
a. Sets up the kernel stack pointer: movl -508(%esp), %esp. (On entry, esp points to the end of the local TSS, so this instruction loads the esp0 field of the TSS into esp.)
b. Enables local interrupts: sti
c. Saves in the Kernel Mode stack the Segment Selector of the user data segment, the current user stack pointer, the eflags register, the Segment Selector of the user code segment, and the address of the instruction to be executed when exiting from the system call. This emulates the operations performed by the int instruction:
pushl $(__USER_DS)
pushl %ebp
pushfl
pushl $(__USER_CS)
pushl $SYSENTER_RETURN
d. Restores in ebp the original value of the register passed by the wrapper routine: movl (%ebp), %ebp. (Before this instruction, ebp holds the user stack pointer, and the top of the User Mode stack holds the original ebp saved by __kernel_vsyscall().)
e. Invokes the system call handler.
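
The -508 offset in step 4a follows from the TSS layout: esp0 is the second 32-bit field of the TSS, and esp initially points just past the end of the structure. The sketch below reproduces that arithmetic. The struct is a hypothetical, simplified stand-in, and its 512-byte total size is an assumption chosen to match the offset used above.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified TSS. Only the first two fields are modeled;
 * the 512-byte total size is an assumption that matches the -508 offset. */
struct demo_tss {
    uint32_t back_link;   /* offset 0 */
    uint32_t esp0;        /* offset 4: kernel stack pointer of the running process */
    uint8_t  rest[504];   /* remaining fields and padding, not modeled */
};

/* Value written to SYSENTER_ESP_MSR: the address just past the local TSS. */
uintptr_t demo_sysenter_esp(const struct demo_tss *tss)
{
    return (uintptr_t)tss + sizeof(*tss);
}

/* Model of "movl -508(%esp), %esp": 4 - 512 = -508 reaches the esp0 field. */
uint32_t demo_load_kernel_esp(uintptr_t esp)
{
    const uint32_t *slot = (const uint32_t *)(esp - 508);
    return *slot;
}
```

With these definitions, offsetof(struct demo_tss, esp0) - sizeof(struct demo_tss) evaluates to 4 - 512, which is where the constant comes from.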

Exiting from the system call

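The operations are essentially the same as those performed on return by system_call(): local interrupts are disabled and the flags in the thread_info structure of current are checked.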

If the sysenter_entry() function determines that the flags are cleared, it performs a quick return to User Mode:

movl 40(%esp), %edx
movl 52(%esp), %ecx
xorl %ebp, %ebp
sti
sysexit

The edx and ecx registers are loaded with two of the stack values saved by sysenter_entry() in step 4c in the previous section: edx gets the address of the SYSENTER_RETURN label, while ecx gets the current user stack pointer.

The sysexit instruction

The sysexit assembly language instruction is the companion of sysenter: it allows a fast switch from Kernel Mode to User Mode. When the instruction is executed, the CPU control unit performs the following steps:

Adds 16 to the value in the SYSENTER_CS_MSR register, and loads the result in the cs register. (In the GDT, the __USER_CS descriptor lies two 8-byte entries above the __KERNEL_CS descriptor, hence 2 × 8 = 16.)
Copies the content of the edx register into the eip register. (edx points to SYSENTER_RETURN.)
Adds 24 to the value in the SYSENTER_CS_MSR register, and loads the result in the ss register.
Copies the content of the ecx register into the esp register. (ecx holds the current user stack pointer.)
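
The selector arithmetic of sysenter and sysexit can be checked with a few lines of C. This is a sketch under stated assumptions: the value 0x60 for __KERNEL_CS is what Linux 2.6 uses on i386 as far as I know, the struct and function names are hypothetical, and the forcing of the privilege level bits to 3 on sysexit follows the Intel description of the instruction rather than the notes above.

```c
#include <stdint.h>

/* Assumed content of SYSENTER_CS_MSR: __KERNEL_CS, taken to be 0x60 here. */
#define DEMO_SYSENTER_CS 0x60u

struct demo_selectors {
    uint16_t cs;
    uint16_t ss;
};

/* sysenter: cs = MSR value, ss = MSR value + 8 (the next GDT descriptor). */
struct demo_selectors demo_sysenter(uint16_t msr_cs)
{
    struct demo_selectors s = { msr_cs, (uint16_t)(msr_cs + 8) };
    return s;
}

/* sysexit: cs = MSR value + 16, ss = MSR value + 24.
 * The two low bits (requested privilege level) are set to 3 for User Mode. */
struct demo_selectors demo_sysexit(uint16_t msr_cs)
{
    struct demo_selectors s = {
        (uint16_t)((msr_cs + 16) | 3),
        (uint16_t)((msr_cs + 24) | 3)
    };
    return s;
}
```

Under these assumptions, the four selectors come out as 0x60 and 0x68 for the kernel code and data segments, and 0x73 and 0x7b for the user code and data segments, which is why the four descriptors must be adjacent in the GDT.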

The SYSENTER_RETURN code

The code at the SYSENTER_RETURN label simply restores the original contents of the ebp, edx, and ecx registers saved in the User Mode stack, and returns control to the wrapper routine in the standard library:

SYSENTER_RETURN:
popl %ebp
popl %edx
popl %ecx
ret

Parameter Passing

Why parameters are passed in registers rather than on the stack

Working with two stacks at the same time is complex (the parameters would have to be fetched from the User Mode stack and stored in the Kernel Mode stack).
The use of registers makes the structure of the system call handler similar to that of other exception handlers.

Restrictions

The length of each parameter cannot exceed the length of a register (32 bits).
The number of parameters must not exceed six, besides the system call number passed in eax, because 80 × 86 processors have a very limited number of registers.

However, system calls that require more than six parameters exist. In such cases, a single register is used to point to a memory area in the process address space that contains the parameter values; the libc wrapper routine takes care of setting this up.

The registers used to store the system call number and its parameters are, in increasing order, eax (for the system call number), ebx, ecx, edx, esi, edi, and ebp. As seen before, system_call( ) and sysenter_entry( ) save the values of these registers on the Kernel Mode stack by using the SAVE_ALL macro. Therefore, when the system call service routine goes to the stack, it finds the return address to system_call( ) or to sysenter_entry( ), followed by the parameter stored in ebx (the first parameter of the system call), the parameter stored in ecx, and so on.

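Passing more than six parameters

The extra parameters are placed in a memory area of the process address space, and a single register carries the address of that area.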
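
A rough illustration of the idea of passing one pointer instead of many values is given below. Everything here is hypothetical: the struct, the seven-parameter routine, and the wrapper are invented for the example, and a real service routine would first copy the block from user space (for instance with copy_from_user()) rather than dereference the pointer directly.

```c
#include <stddef.h>

/* Hypothetical parameter block for a call that needs seven values. */
struct demo_args {
    long a[7];
};

/* Service routine side: a single register-sized pointer is received. */
long demo_sum7(const struct demo_args *args)
{
    long sum = 0;
    for (size_t i = 0; i < 7; i++)
        sum += args->a[i];
    return sum;
}

/* Wrapper side: packs the values into memory and passes their address. */
long demo_wrapper(long a0, long a1, long a2, long a3,
                  long a4, long a5, long a6)
{
    struct demo_args args = { { a0, a1, a2, a3, a4, a5, a6 } };
    return demo_sum7(&args);
}
```

The caller still sees an ordinary seven-argument function; only the wrapper knows that the values travel through memory.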

Verifying the Parameters

Whenever a parameter specifies an address, the kernel must check whether it is inside the process address space. Since version 2.2, the kernel performs only a coarse check: it verifies just that the linear address is lower than PAGE_OFFSET, instead of verifying that the address belongs to the process address space, which would be much more time consuming. This check is performed by access_ok().

int access_ok(const void * addr, unsigned long size)
{
	unsigned long a = (unsigned long) addr;
	if (a + size < a ||	/* a + size wrapped around (32-bit overflow) */
	    a + size > current_thread_info()->addr_limit.seg)	/* addr_limit.seg is usually PAGE_OFFSET */
		return 0;
	return 1;
}
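
The two conditions are easier to see with concrete numbers. The following user-space sketch mirrors the check with explicit 32-bit arithmetic; demo_access_ok() is a hypothetical name, and the limit of 0xc0000000 assumes the default value of PAGE_OFFSET.

```c
#include <stdint.h>

/* Assumed limit: PAGE_OFFSET with the default 3 GB / 1 GB split. */
#define DEMO_ADDR_LIMIT 0xc0000000u

/* Hypothetical model of access_ok() using explicit 32-bit arithmetic. */
int demo_access_ok(uint32_t addr, uint32_t size, uint32_t limit)
{
    uint32_t end = addr + size;   /* wraps modulo 2^32 on overflow */

    /* First test catches wraparound, second test catches kernel addresses. */
    if (end < addr || end > limit)
        return 0;
    return 1;
}
```

An interval ending exactly at the limit is accepted, one byte more is rejected, and an interval such as 0xfffffff0 plus 0x20 is rejected by the wraparound test even though its wrapped end address is small.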

Accessing the Process Address Space

The kernel provides the get_user( ) and put_user( ) macros for this purpose. The first can be used to read 1, 2, or 4 consecutive bytes from an address, while the second can be used to write data of those sizes into an address.

In get_user(x,ptr), the size of the variable pointed to by ptr causes the macro to expand into a call to the __get_user_1( ), __get_user_2( ), or __get_user_4( ) assembly language function.

......
__get_user_2:
addl $1, %eax
jc bad_get_user
movl $0xffffe000, %edx /* or 0xfffff000 for 4-KB stacks */
andl %esp, %edx
cmpl 24(%edx), %eax  # 24(%edx) is current_thread_info()->addr_limit.seg
jae bad_get_user     # the instructions above perform the same checks as access_ok()
2: movzwl -1(%eax), %edx
xorl %eax, %eax
ret
bad_get_user:      # returns the -EFAULT error code to the process that issued the system call
xorl %edx, %edx
movl $-EFAULT, %eax  
ret
......
.section __ex_table,"a" # the "a" attribute specifies that the section must be loaded into memory together with the rest of the kernel image
.long 1b, bad_get_user
.long 2b, bad_get_user
.long 3b, bad_get_user
.previous      # The .previous directive forces the assembler to insert the code that follows into the 
               # section that was active when the last .section directive was encountered.
......

If the addresses are valid, the function executes the movzwl instruction, which stores the data read in the two least significant bytes of the edx register and sets the high-order bytes of edx to 0; then it sets a 0 return code in eax and terminates.
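
The control flow of __get_user_2 can be modeled in plain C. In this sketch a small byte array stands in for the process address space and DEMO_LIMIT plays the role of addr_limit.seg; both, like the function name, are hypothetical. The out-parameter corresponds to edx and the return value to eax.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical stand-in for user memory, addressed from 0. */
#define DEMO_LIMIT 8u
static const uint8_t demo_mem[DEMO_LIMIT] = {
    0x34, 0x12, 0xcd, 0xab, 0, 0, 0, 0
};

/* Returns 0 and stores the zero-extended 16-bit value in *val,
 * or returns -EFAULT and stores 0, like the bad_get_user path. */
int demo_get_user_2(uint32_t addr, uint32_t *val)
{
    uint32_t last = addr + 1;   /* address of the last byte to read */

    /* "jc" catches the wraparound, "jae" catches addresses at or above the limit. */
    if (last < addr || last >= DEMO_LIMIT) {
        *val = 0;
        return -EFAULT;
    }

    /* Like movzwl: little-endian 16-bit load with the upper bits cleared. */
    *val = (uint32_t)demo_mem[addr] | ((uint32_t)demo_mem[addr + 1] << 8);
    return 0;
}
```

The model covers only the range check; the fault that can still occur on a valid-looking address is what the exception table described later in these notes handles.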

The put_user(x,ptr) macro is similar to the one discussed before, except it writes the value x into the process address space starting from address ptr. Depending on the size of x, it invokes either the __put_user_asm( ) macro (size of 1, 2, or 4 bytes) or the __put_user_u64( ) macro (size of 8 bytes). Both macros return the value 0 in the eax register if they succeed in writing the value, and -EFAULT otherwise.

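Functions provided by the kernel

The kernel provides several other functions and macros to access the process address space, such as copy_from_user() and copy_to_user(). Notice that many of them also have a variant prefixed by two underscores (__). The ones without initial underscores take extra time to check the validity of the linear address interval requested, while the ones with the underscores bypass that check.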

Dynamic Address Checking: The Fix-up Code

There are four cases in which Page Fault exceptions may occur in Kernel Mode. These cases must be distinguished by the Page Fault handler, because the actions to be taken are quite different.

The kernel attempts to address a page belonging to the process address space, but either the corresponding page frame does not exist or the kernel tries to write a read-only page. In these cases, the handler must allocate and initialize a new page frame.
The kernel addresses a page belonging to its address space, but the corresponding Page Table entry has not yet been initialized.
Some kernel functions include a programming bug that causes the exception to be raised when that program is executed; alternatively, the exception might be caused by a transient hardware error.
A system call service routine attempts to read or write into a memory area whose address has been passed as a system call parameter, but that address does not belong to the process address space.

The Exception Tables

The kernel puts the address of each kernel instruction that accesses the process address space into a structure called the exception table. When a Page Fault exception occurs in Kernel Mode, the do_page_fault( ) handler examines the exception table: if it includes the address of the instruction that triggered the exception, the error is caused by a bad system call parameter; otherwise, it is caused by a more serious bug.

Generation of the exception tables

The main exception table

The main exception table is automatically generated by the C compiler when building the kernel program image. It is stored in the __ex_table section of the kernel code segment, and its starting and ending addresses are identified by two symbols produced by the C compiler: __start___ex_table and __stop___ex_table.

Local exception table of each module

Each dynamically loaded module of the kernel (see Appendix B) includes its own local exception table. This table is automatically generated by the C compiler when building the module image, and it is loaded into memory when the module is inserted in the running kernel.

struct exception_table_entry
fields:

insn : The linear address of an instruction that accesses the process address space
fixup : The address of the assembly language code to be invoked when a Page Fault exception triggered by the instruction located at insn occurs.

The fixup code usually consists of a sequence of instructions that forces the service routine to return an error code to the User Mode process. These instructions, which are usually defined in the same macro or function that accesses the process address space, are placed by the C compiler into a separate section of the kernel code segment called .fixup.

search_exception_tables()

The search_exception_tables( ) function is used to search for a specified address in all exception tables: if the address is included in a table, the function returns a pointer to the corresponding exception_table_entry structure; otherwise, it returns NULL.

/* usage in do_page_fault() */
if ((fixup = search_exception_tables(regs->eip))) {	/* regs->eip holds the address of the faulting instruction */
	regs->eip = fixup->fixup;	/* execution resumes at the fixup code when the handler returns */
	return 1;
}
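
The lookup itself amounts to searching an array of (insn, fixup) pairs. The sketch below uses a binary search, which assumes the table is sorted by insn; the table contents, the addresses, and all demo_ names are made up for illustration and do not reflect the kernel's actual implementation.

```c
#include <stddef.h>
#include <stdint.h>

struct demo_ex_entry {
    uint32_t insn;    /* address of an instruction that may fault */
    uint32_t fixup;   /* address of the matching fixup code */
};

/* Hypothetical table, sorted by insn; the addresses are made up. */
static const struct demo_ex_entry demo_table[] = {
    { 0xc0101000u, 0xc0301000u },
    { 0xc0102040u, 0xc0301010u },
    { 0xc0105a10u, 0xc0301020u },
};

/* Binary search: returns the matching entry, or NULL if addr is not listed. */
const struct demo_ex_entry *demo_search(uint32_t addr)
{
    size_t lo = 0;
    size_t hi = sizeof(demo_table) / sizeof(demo_table[0]);

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (demo_table[mid].insn == addr)
            return &demo_table[mid];
        if (demo_table[mid].insn < addr)
            lo = mid + 1;
        else
            hi = mid;
    }
    return NULL;
}
```

A NULL result corresponds to the "more serious bug" case, while a match gives the address at which execution should resume.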

Generating the Exception Tables and the Fixup Code

See the comments on the .section __ex_table,"a" assembly code above.

posted @ 2024-06-04 10:45 A2023

