Analyzing the General Execution of the Linux Kernel Through Interrupt Context Switching and Process Context Switching

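Lab requirements

Analyze the general execution of the Linux kernel by combining interrupt context switching and process context switching:

  • Use the fork and execve system calls as examples to analyze interrupt context switching
  • Analyze what is special about the interrupt context of the execve system call
  • Analyze what is special about the process context when a forked child process starts executing
  • Treating a system call as a special kind of interrupt, combine interrupt context switching and process context switching to analyze the general execution of a Linux system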

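Lab procedure

The fork system call:

The fork system call creates a new process, called the child process, which runs concurrently with the process that called fork, called the parent process. Once the child has been created, both processes continue with the instruction that follows the fork() call. The child starts with a copy of the parent's program counter and CPU registers (apart from the return value) and inherits the files the parent has open. fork() takes no arguments and returns an integer, with the following meanings: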

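  • Negative value: creating the child process failed.
  • Zero: returned in the newly created child process.
  • Positive value: returned in the parent process, that is, the caller. The value is the process ID of the newly created child.
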
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

int main(void)
{
    pid_t pid = fork();

    if (pid < 0) {
        /* fork failed: no child was created */
        printf("error in fork!\n");
    } else if (pid == 0) {
        /* fork returned 0: this is the child */
        printf("i am the child process, my process id is %d\n", getpid());
    } else {
        /* fork returned the child's PID: this is the parent */
        printf("i am the parent process, my process id is %d\n", getpid());
    }
    return 0;
}

 

How the kernel handles fork:

The code of do_fork:

long do_fork(unsigned long clone_flags,
    unsigned long stack_start,
    unsigned long stack_size,
    int __user *parent_tidptr,
    int __user *child_tidptr)
{
    struct task_struct *p;
    int trace = 0;
    long nr;

    /*
    * Determine whether and which event to report to ptracer.  When
    * called from kernel_thread or CLONE_UNTRACED is explicitly
    * requested, no event is reported; otherwise, report if the event
    * for the type of forking is enabled.
    */
    if (!(clone_flags & CLONE_UNTRACED)) {
        if (clone_flags & CLONE_VFORK)
            trace = PTRACE_EVENT_VFORK;
        else if ((clone_flags & CSIGNAL) != SIGCHLD)
            trace = PTRACE_EVENT_CLONE;
        else
            trace = PTRACE_EVENT_FORK;

        if (likely(!ptrace_event_enabled(current, trace)))
            trace = 0;
    }

    p = copy_process(clone_flags, stack_start, stack_size,
        child_tidptr, NULL, trace);
    /*
    * Do this prior waking up the new thread - the thread pointer
    * might get invalid after that point, if the thread exits quickly.
    */
    if (!IS_ERR(p)) {
        struct completion vfork;
        struct pid *pid;

        trace_sched_process_fork(current, p);

        pid = get_task_pid(p, PIDTYPE_PID);
        nr = pid_vnr(pid);

        if (clone_flags & CLONE_PARENT_SETTID)
            put_user(nr, parent_tidptr);

        if (clone_flags & CLONE_VFORK) {
            p->vfork_done = &vfork;
            init_completion(&vfork);
            get_task_struct(p);
        }

        wake_up_new_task(p);

        /* forking complete and child started to run, tell ptracer */
        if (unlikely(trace))
            ptrace_event_pid(trace, pid);

        if (clone_flags & CLONE_VFORK) {
            if (!wait_for_vfork_done(p, &vfork))
                ptrace_event_pid(PTRACE_EVENT_VFORK_DONE, pid);
        }

        put_pid(pid);
    } else {
        nr = PTR_ERR(p);
    }
    return nr;
}

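This code handles many details, but the actual creation of the new process is done by copy_process(). do_fork calls copy_process() to duplicate the parent, obtains the new pid, and then calls wake_up_new_task to put the child on the run queue, where it waits to be scheduled. In Linux, every process except process 0, which is set up by hand (statically) in the kernel source, is created by copying an existing process. That is the main job of fork, and the concrete work is delegated to copy_process.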

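copy_process() proceeds as follows:

1) Call dup_task_struct to duplicate the current process's task_struct;

2) Initialize the new process's data structures and process state;

3) Copy the parent's information (open files, address space, signal handlers, and so on);

4) Call copy_thread_tls (copy_thread in older kernels) to initialize the child's kernel stack;

5) Set the child's pid;

6) Establish the family links and insert the new process into the global process list.

copy_thread_tls copies the contents of the parent's kernel stack (the saved register context) and then makes two adjustments:

7) It executes childregs->ax = 0, which sets the child's saved eax to 0. When the child returns to user mode, the system call return value is taken from eax, so fork returns 0 in the child;

8) It executes p->thread.ip = (unsigned long) ret_from_fork; which sets the child's saved instruction pointer to the address of ret_from_fork, so the child starts executing at ret_from_fork.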

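The execve function

Prototype: int execve(const char *filename, char *const argv[], char *const envp[]);

Return value: on success the function does not return; on failure it returns -1.

Description: execve() executes the file whose path is given by the string filename. The second argument is an array of argument strings passed to the new program and must end with a null pointer (NULL). The last argument is the array of environment variables passed to the new program. There are six exec functions in total: execve is the actual kernel system call, and the others (execl, execle, execlp, execv, execvp) are library functions that call execve.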

 

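How the execve system call proceeds:

  • The execve system call traps into the kernel, passing in the command-line arguments and the shell environment
  • The system call entry for execve calls do_execve, which wraps the command-line arguments and the shell environment

    SYSCALL_DEFINE3(execve,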
            const char __user *, filename,
            const char __user *const __user *, argv,
            const char __user *const __user *, envp)
    {
        return do_execve(getname(filename), argv, envp);
    }

     

  • do_execve calls do_execveat_common, which in turn calls __do_execve_file; the latter opens the ELF file and packs all the relevant information into a linux_binprm structure

    int do_execve(struct filename *filename,
        const char __user *const __user *__argv,
        const char __user *const __user *__envp)
    {
        struct user_arg_ptr argv = { .ptr.native = __argv };
        struct user_arg_ptr envp = { .ptr.native = __envp };
        return do_execveat_common(AT_FDCWD, filename, argv, envp, 0);
    }

    static int do_execveat_common(int fd, struct filename *filename,
    struct user_arg_ptr argv,
    struct user_arg_ptr envp,
    int flags)
    {
    return __do_execve_file(fd, filename, argv, envp, flags, NULL);
    }

      

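  • __do_execve_file calls exec_binprm, which in turn calls search_binary_handler to find the function that parses the ELF file

    static int __do_execve_file(int fd, struct filename *filename,
                    struct user_arg_ptr argv,
                    struct user_arg_ptr envp,
                    int flags, struct file *file)
    {
        char *pathbuf = NULL;
        struct linux_binprm *bprm;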
        struct files_struct *displaced;
        int retval;  
    
        if (IS_ERR(filename))
            return PTR_ERR(filename);
    
        /*
         * We move the actual failure in case of RLIMIT_NPROC excess from
         * set*uid() to execve() because too many poorly written programs
         * don't check setuid() return code.  Here we additionally recheck
         * whether NPROC limit is still exceeded.
         */
        if ((current->flags & PF_NPROC_EXCEEDED) &&
            atomic_read(&current_user()->processes) > rlimit(RLIMIT_NPROC)) {
            retval = -EAGAIN;
            goto out_ret;
        }
    
        /* We're below the limit (still or again), so we don't want to make
         * further execve() calls fail. */
        current->flags &= ~PF_NPROC_EXCEEDED;
    
        retval = unshare_files(&displaced);
        if (retval)
            goto out_ret;
    
        retval = -ENOMEM;
        bprm = kzalloc(sizeof(*bprm), GFP_KERNEL);  /* allocate bprm; the environment and command-line arguments are copied into it further down */
        if (!bprm)
            goto out_files;
    
        retval = prepare_bprm_creds(bprm);
        if (retval)
            goto out_free;
    
        check_unsafe_exec(bprm);
        current->in_execve = 1;
    
        if (!file)
            file = do_open_execat(fd, filename, flags);
        retval = PTR_ERR(file);
        if (IS_ERR(file))
            goto out_unmark;
    
        sched_exec();
    
        bprm->file = file;
        if (!filename) {
            bprm->filename = "none";
        } else if (fd == AT_FDCWD || filename->name[0] == '/') {
            bprm->filename = filename->name;
        } else {
            if (filename->name[0] == '\0')
                pathbuf = kasprintf(GFP_KERNEL, "/dev/fd/%d", fd);
            else
                pathbuf = kasprintf(GFP_KERNEL, "/dev/fd/%d/%s",
                            fd, filename->name);
            if (!pathbuf) {
                retval = -ENOMEM;
                goto out_unmark;
            }
            /*
             * Record that a name derived from an O_CLOEXEC fd will be
             * inaccessible after exec. Relies on having exclusive access to
             * current->files (due to unshare_files above).
             */
            if (close_on_exec(fd, rcu_dereference_raw(current->files->fdt)))
                bprm->interp_flags |= BINPRM_FLAGS_PATH_INACCESSIBLE;
            bprm->filename = pathbuf;
        }
        bprm->interp = bprm->filename;
    
        retval = bprm_mm_init(bprm);
        if (retval)
            goto out_unmark;
    
        retval = prepare_arg_pages(bprm, argv, envp);
        if (retval < 0)
            goto out;
    
        retval = prepare_binprm(bprm);
        if (retval < 0)
            goto out;
    
        retval = copy_strings_kernel(1, &bprm->filename, bprm);
        if (retval < 0)
            goto out;
    
        bprm->exec = bprm->p;
        retval = copy_strings(bprm->envc, envp, bprm);  /* copy the shell environment that was passed in into bprm */
        if (retval < 0)
            goto out;
    
        retval = copy_strings(bprm->argc, argv, bprm);   /* copy the command-line arguments that were passed in into bprm */
        if (retval < 0)
            goto out;
    
        would_dump(bprm, bprm->file);
    
        retval = exec_binprm(bprm);  /* hand over to the actual executable loader */
        if (retval < 0)
            goto out;
    
        /* execve succeeded */
        current->fs->in_exec = 0;
        current->in_execve = 0;
        rseq_execve(current);
        acct_update_integrals(current);
        task_numa_free(current, false);
        free_bprm(bprm);
        kfree(pathbuf);
        if (filename)
            putname(filename);
        if (displaced)
            put_files_struct(displaced);
        return retval;
    
    out:
        if (bprm->mm) {
            acct_arg_size(bprm, 0);
            mmput(bprm->mm);
        }
    
    out_unmark:
        current->fs->in_exec = 0;
        current->in_execve = 0;
    
    out_free:
        free_bprm(bprm);
        kfree(pathbuf);
    
    out_files: 
        if (displaced)
            reset_files_struct(displaced);
    out_ret:
        if (filename)
            putname(filename);
        return retval;
    }

     

  • search_binary_handler finds load_elf_binary, the function that parses ELF files

    static int exec_binprm(struct linux_binprm *bprm)
    {
        pid_t old_pid, old_vpid;
        int ret;
    
        /* Need to fetch pid before load_binary changes it */
        old_pid = current->pid;
        rcu_read_lock();
        old_vpid = task_pid_nr_ns(current, task_active_pid_ns(current->parent));
        rcu_read_unlock();
    
        ret = search_binary_handler(bprm);
        if (ret >= 0) {
            audit_bprm(bprm);
            trace_sched_process_exec(current, old_pid, bprm);
            ptrace_event(PTRACE_EVENT_EXEC, old_vpid);
            proc_exec_connector(current);
        }
    
        return ret;
    }
    int search_binary_handler(struct linux_binprm *bprm)
    {
        bool need_retry = IS_ENABLED(CONFIG_MODULES);
        struct linux_binfmt *fmt;
        int retval;
    
        /* This allows 4 levels of binfmt rewrites before failing hard. */
        if (bprm->recursion_depth > 5)
            return -ELOOP;
    
        retval = security_bprm_check(bprm);
        if (retval)
            return retval;
    
        retval = -ENOENT;
     retry:
        read_lock(&binfmt_lock);
        list_for_each_entry(fmt, &formats, lh) {
            if (!try_module_get(fmt->module))
                continue;
            read_unlock(&binfmt_lock);
    
            bprm->recursion_depth++;
            retval = fmt->load_binary(bprm);
            bprm->recursion_depth--;
    
            read_lock(&binfmt_lock);
            put_binfmt(fmt);
            if (retval < 0 && !bprm->mm) {
                /* we got to flush_old_exec() and failed after it */
                read_unlock(&binfmt_lock);
                force_sigsegv(SIGSEGV);
                return retval;
            }
            if (retval != -ENOEXEC || !bprm->file) {
                read_unlock(&binfmt_lock);
                return retval;
            }
        }
        read_unlock(&binfmt_lock);
    
        if (need_retry) {
            if (printable(bprm->buf[0]) && printable(bprm->buf[1]) &&
                printable(bprm->buf[2]) && printable(bprm->buf[3]))
                return retval;
            if (request_module("binfmt-%04x", *(ushort *)(bprm->buf + 2)) < 0)
                return retval;
            need_retry = false;
            goto retry;
        }
    
        return retval;
    }

     

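  • load_elf_binary parses the ELF file, loads it into memory, modifies the process's user-mode stack (mainly by placing the command-line arguments and the shell environment on it), and replaces the process's data and code segments
  • load_elf_binary calls start_thread to modify the register context saved on the process's kernel stack (in particular the saved ip)
  • When the process returns from execve to user mode, ip points to the entry point of the new program (the ELF entry address, from which execution eventually reaches main; for a dynamically linked program it is the entry point of the dynamic linker), and the user-mode stack contains the command-line arguments and the shell environment

    void
    start_thread(struct pt_regs *regs, unsigned long new_ip, unsigned long new_sp)
    {
             set_user_gs(regs, 0);
             regs->fs            = 0;
             regs->ds            = __USER_DS;
             regs->es            = __USER_DS;
             regs->ss            = __USER_DS;
             regs->cs            = __USER_CS;
             regs->ip            = new_ip;
             regs->sp            = new_sp;
             regs->flags         = X86_EFLAGS_IF;
             /*
              * force it to the iret return path by making it look as if there was
              * some work pending.
              */
             set_thread_flag(TIF_NOTIFY_RESUME);
    }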

     

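General execution of a Linux system

  Switching between two processes:

  • User-mode process X is running
  • An interrupt occurs: save cs:eip/esp/eflags (of the current process) to the kernel stack, then load cs:eip (the entry of the specific ISR) and ss:esp (pointing to the kernel stack)
  • SAVE_ALL // save the context
  • schedule() is called during interrupt handling or just before the interrupt returns; switch_to inside it performs the key process context switch
  • After label 1 (inside switch_to), user-mode process Y starts running (Y was itself switched out earlier through the same steps, so it can resume from label 1)
  • restore_all // restore the context
  • iret - pop cs:eip/ss:esp/eflags from kernel stack
  • User-mode process Y continues running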

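  Special cases:

  • Through scheduling points during interrupt handling, user-mode processes and kernel threads switch to each other, and kernel threads switch among themselves. This closely resembles the general case, except that an interrupt arriving while a kernel thread runs involves no transition between user mode and kernel mode;
  • A kernel thread calls schedule() voluntarily. Only a process context switch takes place, with no interrupt context switch, so this is slightly simpler than the general case;
  • The point where a child process starts executing and returns to user mode after a process-creating system call such as fork;
  • Returning to user mode after loading a new executable, as with execve.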
posted @ 2020-06-15 09:41  deepfeel