How gVisor system calls work

https://terassyi.net/posts/2020/04/14/gvisor.html

Wenbo Shen 申文博

 https://wenboshen.org/posts/2018-12-25-gvisor-inside.html

System calls

For the Linux kernel, Anatomy of a system call, part 1 gives a good overview of how a syscall is handled in the kernel. MSR_LSTAR is a Model-Specific Register that holds the “Target RIP for the called procedure when SYSCALL is executed in 64-bit mode”; details are in the Intel® 64 and IA-32 Architectures Software Developer’s Manual Volume 4: Model-Specific Registers, Table 2-2. As of kernel v4.20 (the latest at the time of writing), syscall_init sets MSR_LSTAR to entry_SYSCALL_64, which calls do_syscall_64, where the handler is selected by syscall number.

For gvisor, from How gvisor trap to syscall handler in kvm platform, “On the KVM platform, system call interception works much like a normal OS. When running in guest mode, the platform sets MSR_LSTAR to a system call handler sysenter, which is invoked whenever an application (or the sentry itself) executes a SYSCALL instruction.”

SyscallTable is a struct. All the implemented syscalls are listed in var AMD64.

 

On the KVM platform, system call interception works much like a normal OS. When running in guest mode, the platform sets MSR_LSTAR to a system call handler, which is invoked whenever an application (or the sentry itself) executes a SYSCALL instruction.
 
System calls from the sentry to the host are a bit more involved, as they require the sentry to switch from guest mode back to host mode before calling into the host kernel.

 

The sentry is developed as a normal user-space application (see "How is gVisor different from other container isolation mechanisms?" and the following Architecture section of our README). As such, it may make host system calls for many different reasons. For example, external file system access performs read()s and write()s to a 9p server over a Unix Domain Socket. The Go runtime itself uses clone(), futex(), and mmap() (among others) for host system thread creation, synchronization primitives, and memory allocation, respectively.
 
The vast majority of sentry code (anything outside of pkg/sentry/platform/kvm or pkg/sentry/platform/ring0) assumes that it is a normal Linux process. Those packages are responsible for ensuring that interactions with the host (syscalls) still work properly.

 

So the overall architecture looks like the diagram below?
 
Ring 3    User App         |     Sentry
------------------------------------------------    guest
Ring 0                Sentry.ring0
 
///////////////////////////////////////////////////////////////////////
 
Ring 3                Sentry.kvm_platform   host
 
Is it correct that when the user app makes a syscall, it is first intercepted by the sentry at ring 0 in the guest, and then actually handled by the sentry emulator running at ring 3 in the guest? And if the sentry emulator hits a syscall or needs some resources, it switches to the host and is handled by the host Linux?
 
Almost, except in guest mode, the sentry always executes in ring 0. You can see the core flow here: https://github.com/google/gvisor/blob/master/pkg/sentry/platform/ring0/kernel_amd64.go#L215-L231
 
The sentry is normally mapped at a normal userspace address which cannot be mapped into application address spaces (since it would conflict with application mappings). So there is a sentry page table with the normal mappings, plus a mirror of relevant sentry mappings in the kernel range (bit 63 set) in all application page tables. This mirrored copy is what executes between jumpToKernel() and jumpToUser().
 
iret()/sysret() save RSP/RBP so that the syscall handler (sysenter()) can restore them and then "return" to the call site in SwitchToUser.
 
The full execution path looks like:
kernel.runApp.execute -> kernel.Task.p.Switch (kvm.context.Switch) -> kvm.vCPU.SwitchToUser -> ring0.CPU.SwitchToUser
 
kernel.runApp is part of the core task lifecycle state machine which handles application syscalls (eventually calling one of the handlers). The kernel package is independent of the execution platform.
 
 
 

How does KVM system call redirection work?

 

  1. During setup, the sentry sets LSTAR to the syscall handler, sysenter (Just like any OS).
  2. In SwitchToUser, the sentry calls sysret (or iret), which saves the sentry stack state and executes SYSRET to switch to ring 3 and execute user code.
  3. User code eventually executes SYSCALL, and the core switches to (guest) ring 0 and jumps to sysenter.
  4. sysenter restores the sentry stack state and returns. This effectively makes the sysret() call in SwitchToUser "return", and the sentry runs the remainder of SwitchToUser.
  5. This ultimately makes Platform.Switch return in the core sentry, where we ultimately handle the syscall by selecting an implementation from the syscall table.
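The steps above can be modelled as a loop in which "entering user code" is an ordinary call that only returns once user code executes SYSCALL. The following is a toy sketch of that control flow only: userProgram, regs, switchToUser and handle are hypothetical names invented for illustration, and nothing here touches KVM, LSTAR or real registers.

```go
package main

import "fmt"

// regs is a hypothetical, minimal register file.
type regs struct {
	sysno uintptr // syscall number requested by user code
	ret   int64   // value the sentry hands back to user code
}

// userProgram models guest ring 3 code as the list of syscalls it will issue.
type userProgram struct {
	pending []regs
	pc      int
}

// switchToUser stands in for steps 2-4: it "enters" user code and only
// returns when that code executes SYSCALL (true) or has nothing left to
// run (false).
func (p *userProgram) switchToUser(r *regs) bool {
	if p.pc >= len(p.pending) {
		return false
	}
	*r = p.pending[p.pc]
	p.pc++
	return true
}

// handle stands in for step 5: pick an implementation by syscall number.
func handle(r *regs) {
	switch r.sysno {
	case 39: // getpid on Linux x86-64
		r.ret = 1
	default:
		r.ret = -38 // -ENOSYS on Linux
	}
}

// run is the sentry side: switch to user code, handle the syscall, repeat.
func run(p *userProgram) []int64 {
	var results []int64
	var r regs
	for p.switchToUser(&r) {
		handle(&r)
		results = append(results, r.ret)
	}
	return results
}

func main() {
	p := &userProgram{pending: []regs{{sysno: 39}, {sysno: 9999}}}
	fmt.Println(run(p)) // prints "[1 -38]"
}
```

The point of the model is that the sentry never sees a callback from user code: from its point of view a function call simply returns, and the reason for returning is found in the saved registers.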
   
 
With the KVM platform, the sentry runs in both host ring 3 (HR3) and guest ring 0 (GR0), depending on the current context. It runs in GR0 before running user code because it must be in ring 0 to execute SYSRET/IRET and switch to GR3 user code. It runs in HR3 when making a syscall to the host kernel because it must be in host mode for the host kernel to intercept the syscall. It never runs in host ring 0, though of course KVM syscalls, VM exits, and other host syscalls are handled by the standard host Linux kernel in host ring 0.
 
WARNING: DATA RACE
Write at 0x00c00014f0c8 by goroutine 332:
  gvisor.googlesource.com/gvisor/pkg/sentry/fs/tty.(*queue).WriteFromBlocks()
      pkg/sentry/fs/tty/queue.go:246 +0x2a6
  gvisor.googlesource.com/gvisor/pkg/sentry/safemem.Writer.WriteFromBlocks-fm()
      pkg/sentry/safemem/io.go:46 +0x75
  gvisor.googlesource.com/gvisor/pkg/sentry/mm.(*MemoryManager).withInternalMappings()
      pkg/sentry/mm/io.go:503 +0x8ac
  gvisor.googlesource.com/gvisor/pkg/sentry/mm.(*MemoryManager).withVecInternalMappings()
      pkg/sentry/mm/io.go:572 +0x964
  gvisor.googlesource.com/gvisor/pkg/sentry/mm.(*MemoryManager).CopyInTo()
      pkg/sentry/mm/io.go:309 +0x1f1
  gvisor.googlesource.com/gvisor/pkg/sentry/fs/tty.(*queue).write()
      pkg/sentry/usermem/usermem.go:543 +0x164
  gvisor.googlesource.com/gvisor/pkg/sentry/fs/tty.(*lineDiscipline).inputQueueWrite()
      pkg/sentry/fs/tty/line_discipline.go:205 +0x147
  gvisor.googlesource.com/gvisor/pkg/sentry/fs/tty.(*masterFileOperations).Write()
      pkg/sentry/fs/tty/master.go:141 +0x11c
  gvisor.googlesource.com/gvisor/pkg/sentry/fs.(*File).Writev()
      pkg/sentry/fs/file.go:314 +0x1fc
  gvisor.googlesource.com/gvisor/pkg/sentry/syscalls/linux.writev()
      pkg/sentry/syscalls/linux/sys_write.go:261 +0xe0
  gvisor.googlesource.com/gvisor/pkg/sentry/syscalls/linux.Write()
      pkg/sentry/syscalls/linux/sys_write.go:71 +0x293
  gvisor.googlesource.com/gvisor/pkg/sentry/kernel.(*Task).executeSyscall()
      pkg/sentry/kernel/task_syscall.go:165 +0x407
  gvisor.googlesource.com/gvisor/pkg/sentry/kernel.(*Task).doSyscallInvoke()
      pkg/sentry/kernel/task_syscall.go:283 +0xb4
  gvisor.googlesource.com/gvisor/pkg/sentry/kernel.(*Task).doSyscallEnter()
      pkg/sentry/kernel/task_syscall.go:244 +0x109
  gvisor.googlesource.com/gvisor/pkg/sentry/kernel.(*Task).doSyscall()
      pkg/sentry/kernel/task_syscall.go:219 +0x1b6
  gvisor.googlesource.com/gvisor/pkg/sentry/kernel.(*runApp).execute()
      pkg/sentry/kernel/task_run.go:215 +0x1852
  gvisor.googlesource.com/gvisor/pkg/sentry/kernel.(*Task).run()
      pkg/sentry/kernel/task_run.go:91 +0x2e5

Previous read at 0x00c00014f0c8 by goroutine 113:
  gvisor.googlesource.com/gvisor/pkg/sentry/fs/tty.(*lineDiscipline).masterReadiness()
      pkg/sentry/fs/tty/queue.go:121 +0x43
  gvisor.googlesource.com/gvisor/pkg/sentry/fs/tty.(*masterFileOperations).Readiness()
      pkg/sentry/fs/tty/master.go:131 +0x71
  gvisor.googlesource.com/gvisor/pkg/sentry/syscalls.(*PollFD).initReadiness()
      pkg/sentry/fs/file.go:199 +0x2d0
  gvisor.googlesource.com/gvisor/pkg/sentry/syscalls.Poll()
      pkg/sentry/syscalls/polling.go:96 +0x139
  gvisor.googlesource.com/gvisor/pkg/sentry/syscalls/linux.doPoll()
      pkg/sentry/syscalls/linux/sys_poll.go:70 +0x2ac
  gvisor.googlesource.com/gvisor/pkg/sentry/syscalls/linux.Ppoll()
      pkg/sentry/syscalls/linux/sys_poll.go:343 +0x113
  gvisor.googlesource.com/gvisor/pkg/sentry/kernel.(*Task).executeSyscall()
      pkg/sentry/kernel/task_syscall.go:165 +0x407
  gvisor.googlesource.com/gvisor/pkg/sentry/kernel.(*Task).doSyscallInvoke()
      pkg/sentry/kernel/task_syscall.go:283 +0xb4
  gvisor.googlesource.com/gvisor/pkg/sentry/kernel.(*Task).doSyscallEnter()
      pkg/sentry/kernel/task_syscall.go:244 +0x109
  gvisor.googlesource.com/gvisor/pkg/sentry/kernel.(*Task).doSyscall()
      pkg/sentry/kernel/task_syscall.go:219 +0x1b6
  gvisor.googlesource.com/gvisor/pkg/sentry/kernel.(*runApp).execute()
      pkg/sentry/kernel/task_run.go:215 +0x1852
  gvisor.googlesource.com/gvisor/pkg/sentry/kernel.(*Task).run()
      pkg/sentry/kernel/task_run.go:91 +0x2e5

 

 

// SyscallTable is a lookup table of system calls.
//
// Note that a SyscallTable is not savable directly. Instead, they are saved as
// an OS/Arch pair and lookup happens again on restore.
type SyscallTable struct {
    // OS is the operating system that this syscall table implements.
    OS abi.OS

    // Arch is the architecture that this syscall table targets.
    Arch arch.Arch

    // The OS version that this syscall table implements.
    Version Version

    // AuditNumber is a numeric constant that represents the syscall table. If
    // non-zero, auditNumber must be one of the AUDIT_ARCH_* values defined by
    // linux/audit.h.
    AuditNumber uint32

    // Table is the collection of functions.
    Table map[uintptr]Syscall

    // lookup is a fixed-size array that holds the syscalls (indexed by
    // their numbers). It is used for fast look ups.
    lookup []SyscallFn

    // Emulate is a collection of instruction addresses to emulate. The
    // keys are addresses, and the values are system call numbers.
    Emulate map[usermem.Addr]uintptr

    // The function to call in case of a missing system call.
    Missing MissingFn

    // Stracer traces this syscall table.
    Stracer Stracer

    // External is used to handle an external callback.
    External func(*Kernel)

    // ExternalFilterBefore is called before External is called before the syscall is executed.
    // External is not called if it returns false.
    ExternalFilterBefore func(*Task, uintptr, arch.SyscallArguments) bool

    // ExternalFilterAfter is called before External is called after the syscall is executed.
    // External is not called if it returns false.
    ExternalFilterAfter func(*Task, uintptr, arch.SyscallArguments) bool

    // FeatureEnable stores the strace and one-shot enable bits.
    FeatureEnable SyscallFlagsTable
}

 

 

 

// Init initializes the system call table.
//
// This should normally be called only during registration.
func (s *SyscallTable) Init() {
    if s.Table == nil {
        // Ensure non-nil lookup table.
        s.Table = make(map[uintptr]Syscall)
    }
    if s.Emulate == nil {
        // Ensure non-nil emulate table.
        s.Emulate = make(map[usermem.Addr]uintptr)
    }

    max := s.MaxSysno() // Checked during RegisterSyscallTable.

    // Initialize the fast-lookup table.
    s.lookup = make([]SyscallFn, max+1)
    for num, sc := range s.Table {
        s.lookup[num] = sc.Fn // build the fast lookup slice from the syscall map
    }

    // Initialize all features.
    s.FeatureEnable.init(s.Table, max)
}

 

gvisor/pkg/sentry/syscalls/linux/linux64.go

// AMD64 is a table of Linux amd64 syscall API with the corresponding syscall
// numbers from Linux 4.4.
var AMD64 = &kernel.SyscallTable{
        OS:   abi.Linux,
        Arch: arch.AMD64,
        Version: kernel.Version{
                // Version 4.4 is chosen as a stable, longterm version of Linux, which
                // guides the interface provided by this syscall table. The build
                // version is that for a clean build with default kernel config, at 5
                // minutes after v4.4 was tagged.
                Sysname: LinuxSysname,
                Release: LinuxRelease,
                Version: LinuxVersion,
        },
        AuditNumber: linux.AUDIT_ARCH_X86_64,
        Table: map[uintptr]kernel.Syscall{
                0:   syscalls.Supported("read", Read),
                1:   syscalls.Supported("write", Write),
                2:   syscalls.PartiallySupported("open", Open, "Options O_DIRECT, O_NOATIME, O_PATH, O_TMPFILE, O_SYNC are not supported.", nil),
                3:   syscalls.Supported("close", Close),
                4:   syscalls.Supported("stat", Stat),
                5:   syscalls.Supported("fstat", Fstat),
                6:   syscalls.Supported("lstat", Lstat),
                7:   syscalls.Supported("poll", Poll),
                8:   syscalls.Supported("lseek", Lseek),
                9:   syscalls.PartiallySupported("mmap", Mmap, "Generally supported with exceptions. Options MAP_FIXED_NOREPLACE, MAP_SHARED_VALIDATE, MAP_SYNC MAP_GROWSDOWN, MAP_HUGETLB are not supported.", nil),
                10:  syscalls.Supported("mprotect", Mprotect),
                11:  syscalls.Supported("munmap", Munmap),
                12:  syscalls.Supported("brk", Brk),
                13:  syscalls.Supported("rt_sigaction", RtSigaction),
                14:  syscalls.Supported("rt_sigprocmask", RtSigprocmask),
                15:  syscalls.Supported("rt_sigreturn", RtSigreturn),
                16:  syscalls.PartiallySupported("ioctl", Ioctl, "Only a few ioctls are implemented for backing devices and file systems.", nil),
                17:  syscalls.Supported("pread64", Pread64),
                18:  syscalls.Supported("pwrite64", Pwrite64),
                19:  syscalls.Supported("readv", Readv),
                20:  syscalls.Supported("writev", Writev),
                21:  syscalls.Supported("access", Access),
                22:  syscalls.Supported("pipe", Pipe),
                23:  syscalls.Supported("select", Select),
                24:  syscalls.Supported("sched_yield", SchedYield),
                25:  syscalls.Supported("mremap", Mremap),

 

 

 

 

goroutine 974 [running]:
panic(0x10a1140, 0xc00043c070)
    GOROOT/src/runtime/panic.go:1064 +0x470 fp=0xc000985508 sp=0xc000985450 pc=0x437030
gvisor.dev/gvisor/pkg/sentry/fsimpl/tmpfs.(*inodeRefs).IncRef(0xc000d6e008)
    bazel-out/k8-fastbuild-ST-3bfd66f45e612c1a5c797474a25664e227d81bf914f3b08a40e00b2e2692afa4/bin/pkg/sentry/fsimpl/tmpfs/inode_refs.go:88 +0x18c fp=0xc000985580 sp=0xc000985508 pc=0x92828c
gvisor.dev/gvisor/pkg/sentry/fsimpl/tmpfs.(*inode).incRef(...)
    pkg/sentry/fsimpl/tmpfs/tmpfs.go:512
gvisor.dev/gvisor/pkg/sentry/fsimpl/tmpfs.(*dentry).IncRef(0xc0000c4aa0)
    pkg/sentry/fsimpl/tmpfs/tmpfs.go:357 +0x49 fp=0xc000985598 sp=0xc000985580 pc=0x92ef89
gvisor.dev/gvisor/pkg/sentry/vfs.(*Dentry).IncRef(...)
    pkg/sentry/vfs/dentry.go:150
gvisor.dev/gvisor/pkg/sentry/vfs.(*FileDescription).Init(0xc000d66500, 0x140d420, 0xc000d66500, 0xc000008241, 0xc000532660, 0xc0000c4aa0, 0xc000985624, 0x47a03f, 0xc000557358)
    pkg/sentry/vfs/file_description.go:151 +0x167 fp=0xc0009855c0 sp=0xc000985598 pc=0x7d3c87
gvisor.dev/gvisor/pkg/sentry/fsimpl/tmpfs.(*dentry).open(0xc0000c4aa0, 0x1402d60, 0xc000bdaa80, 0xc000d6a000, 0xc000985878, 0x1, 0x0, 0x0, 0x0)
    pkg/sentry/fsimpl/tmpfs/filesystem.go:584 +0x1dd fp=0xc000985660 sp=0xc0009855c0 pc=0x923abd
gvisor.dev/gvisor/pkg/sentry/fsimpl/tmpfs.(*filesystem).OpenAt(0xc000557300, 0x1402d60, 0xc000bdaa80, 0xc000d6a000, 0x8241, 0x0, 0x0, 0x0)
    pkg/sentry/fsimpl/tmpfs/filesystem.go:519 +0xa1e fp=0xc000985858 sp=0xc000985660 pc=0x92309e
gvisor.dev/gvisor/pkg/sentry/vfs.(*VirtualFilesystem).OpenAt(0xc000228908, 0x1402d60, 0xc000bdaa80, 0xc000cec300, 0xc000985aa0, 0xc000985a88, 0x100, 0xc000532420, 0xc0002ac000)
    pkg/sentry/vfs/vfs.go:515 +0x1ee fp=0xc0009859e8 sp=0xc000985858 pc=0x7ebe6e
gvisor.dev/gvisor/pkg/sentry/syscalls/linux/vfs2.openat(0xc000bdaa80, 0x2b4bffffff9c, 0x20000180, 0x241, 0x0, 0x0, 0x0, 0x0, 0x0)
    pkg/sentry/syscalls/linux/vfs2/filesystem.go:219 +0x2bc fp=0xc000985b38 sp=0xc0009859e8 pc=0xe4d2bc
gvisor.dev/gvisor/pkg/sentry/syscalls/linux/vfs2.Creat(0xc000bdaa80, 0x20000180, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
    pkg/sentry/syscalls/linux/vfs2/filesystem.go:200 +0x71 fp=0xc000985b90 sp=0xc000985b38 pc=0xe4cfb1
gvisor.dev/gvisor/pkg/sentry/kernel.(*Task).executeSyscall(0xc000bdaa80, 0x55, 0x20000180, 0x0, 0x0, 0x0, 0x0, 0x0, 0xea72d7, 0x1272f60, ...)
    pkg/sentry/kernel/task_syscall.go:116 +0x1b9 fp=0xc000985c50 sp=0xc000985b90 pc=0xa470f9
gvisor.dev/gvisor/pkg/sentry/kernel.(*Task).doSyscallInvoke(0xc000bdaa80, 0x55, 0x20000180, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
    pkg/sentry/kernel/task_syscall.go:291 +0x70 fp=0xc000985cd8 sp=0xc000985c50 pc=0xa48410
gvisor.dev/gvisor/pkg/sentry/kernel.(*Task).doSyscallEnter(0xc000bdaa80, 0x55, 0x20000180, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
    pkg/sentry/kernel/task_syscall.go:238 +0xb4 fp=0xc000985d38 sp=0xc000985cd8 pc=0xa47eb4
gvisor.dev/gvisor/pkg/sentry/kernel.(*Task).doSyscall(0xc000bdaa80, 0x2, 0xc000bdaa80)
    pkg/sentry/kernel/task_syscall.go:205 +0x198 fp=0xc000985e08 sp=0xc000985d38 pc=0xa47798
gvisor.dev/gvisor/pkg/sentry/kernel.(*runApp).execute(0x0, 0xc000bdaa80, 0x13d5ba0, 0x0)
    pkg/sentry/kernel/task_run.go:327 +0xd8c fp=0xc000985f60 sp=0xc000985e08 pc=0xa3a10c
gvisor.dev/gvisor/pkg/sentry/kernel.(*Task).run(0xc000bdaa80, 0x2d)
    pkg/sentry/kernel/task_run.go:100 +0x1e2 fp=0xc000985fd0 sp=0xc000985f60 pc=0xa38c02
runtime.goexit()
    src/runtime/asm_amd64.s:1374 +0x1 fp=0xc000985fd8 sp=0xc000985fd0 pc=0x4705a1
created by gvisor.dev/gvisor/pkg/sentry/kernel.(*Task).Start
    pkg/sentry/kernel/task_start.go:374 +0x116

 

 

 

 

pkg/sentry/syscalls/linux/sys_thread.go

 

 

// Fork implements Linux syscall fork(2).
func Fork(t *kernel.Task, args arch.SyscallArguments) (uintptr, *kernel.SyscallControl, error) {
    // "A call to fork() is equivalent to a call to clone(2) specifying flags
    // as just SIGCHLD." - fork(2)
    return clone(t, int(syscall.SIGCHLD), 0, 0, 0, 0)
}

 

clone (a helper in the syscalls/linux package; the t.Clone it calls is implemented in pkg/sentry/kernel/task_clone.go)

func clone(t *kernel.Task, flags int, stack usermem.Addr, parentTID usermem.Addr, childTID usermem.Addr, tls usermem.Addr) (uintptr, *kernel.SyscallControl, error) {
    opts := kernel.CloneOptions{
        SharingOptions: kernel.SharingOptions{
            NewAddressSpace:     flags&syscall.CLONE_VM == 0,
            NewSignalHandlers:   flags&syscall.CLONE_SIGHAND == 0,
            NewThreadGroup:      flags&syscall.CLONE_THREAD == 0,
            TerminationSignal:   linux.Signal(flags & exitSignalMask),
            NewPIDNamespace:     flags&syscall.CLONE_NEWPID == syscall.CLONE_NEWPID,
            NewUserNamespace:    flags&syscall.CLONE_NEWUSER == syscall.CLONE_NEWUSER,
            NewNetworkNamespace: flags&syscall.CLONE_NEWNET == syscall.CLONE_NEWNET,
            NewFiles:            flags&syscall.CLONE_FILES == 0,
            NewFSContext:        flags&syscall.CLONE_FS == 0,
            NewUTSNamespace:     flags&syscall.CLONE_NEWUTS == syscall.CLONE_NEWUTS,
            NewIPCNamespace:     flags&syscall.CLONE_NEWIPC == syscall.CLONE_NEWIPC,
        },
        Stack:         stack,
        SetTLS:        flags&syscall.CLONE_SETTLS == syscall.CLONE_SETTLS,
        TLS:           tls,
        ChildClearTID: flags&syscall.CLONE_CHILD_CLEARTID == syscall.CLONE_CHILD_CLEARTID,
        ChildSetTID:   flags&syscall.CLONE_CHILD_SETTID == syscall.CLONE_CHILD_SETTID,
        ChildTID:      childTID,
        ParentSetTID:  flags&syscall.CLONE_PARENT_SETTID == syscall.CLONE_PARENT_SETTID,
        ParentTID:     parentTID,
        Vfork:         flags&syscall.CLONE_VFORK == syscall.CLONE_VFORK,
        Untraced:      flags&syscall.CLONE_UNTRACED == syscall.CLONE_UNTRACED,
        InheritTracer: flags&syscall.CLONE_PTRACE == syscall.CLONE_PTRACE,
    }
    ntid, ctrl, err := t.Clone(&opts)
    return uintptr(ntid), ctrl, err
}
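As a worked example of the flag decoding in clone, the sketch below applies the same tests to the flags that fork passes (just SIGCHLD) and to a typical thread-creation flag set. The constants are the Linux x86-64 values copied in by hand so the block is self-contained; the sharing struct and decode function are hypothetical, and exitSignalMask is assumed to be 0xff (Linux's CSIGNAL).

```go
package main

import "fmt"

// Linux x86-64 values, copied here so the sketch needs no syscall package.
const (
	sigchld        = 17
	cloneVM        = 0x100
	cloneFS        = 0x200
	cloneFiles     = 0x400
	cloneSighand   = 0x800
	cloneThread    = 0x10000
	exitSignalMask = 0xff // assumption: low byte holds the exit signal
)

// sharing is a hypothetical subset of the sharing options built by clone.
type sharing struct {
	NewAddressSpace   bool
	NewSignalHandlers bool
	NewThreadGroup    bool
	NewFiles          bool
	NewFSContext      bool
	TerminationSignal int
}

// decode mirrors the pattern above: a CLONE_* sharing bit that is absent
// means the child gets its own copy of that resource.
func decode(flags int) sharing {
	return sharing{
		NewAddressSpace:   flags&cloneVM == 0,
		NewSignalHandlers: flags&cloneSighand == 0,
		NewThreadGroup:    flags&cloneThread == 0,
		NewFiles:          flags&cloneFiles == 0,
		NewFSContext:      flags&cloneFS == 0,
		TerminationSignal: flags & exitSignalMask,
	}
}

func main() {
	// fork(): nothing is shared, and the parent gets SIGCHLD on exit.
	fmt.Printf("%+v\n", decode(sigchld))
	// A thread: everything is shared, and there is no exit signal.
	fmt.Printf("%+v\n", decode(cloneVM|cloneFS|cloneFiles|cloneSighand|cloneThread))
}
```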

 

    log.Infof("Process should have started...")
    l.watchdog.Start()
    return l.k.Start()

 

type Loader struct {
    // k is the kernel.
    k *kernel.Kernel

 

 

//
// threadID a dummy value set to the task's TID in the root PID namespace to
// make it visible in stack dumps. A goroutine for a given task can be identified
// searching for Task.run()'s argument value.
func (t *Task) run(threadID uintptr) {
        atomic.StoreInt64(&t.goid, goid.Get())

        // Construct t.blockingTimer here. We do this here because we can't
        // reconstruct t.blockingTimer during restore in Task.afterLoad(), because
        // kernel.timekeeper.SetClocks() hasn't been called yet.
        blockingTimerNotifier, blockingTimerChan := ktime.NewChannelNotifier()
        t.blockingTimer = ktime.NewTimer(t.k.MonotonicClock(), blockingTimerNotifier)
        defer t.blockingTimer.Destroy()
        t.blockingTimerChan = blockingTimerChan

        // Activate our address space.
        t.Activate()
        // The corresponding t.Deactivate occurs in the exit path
        // (runExitMain.execute) so that when
        // Platform.CooperativelySharesAddressSpace() == true, we give up the
        // AddressSpace before the task goroutine finishes executing.

        // If this is a newly-started task, it should check for participation in
        // group stops. If this is a task resuming after restore, it was
        // interrupted by saving. In either case, the task is initially
        // interrupted.
        t.interruptSelf()

        for {
                // Explanation for this ordering:
                //
                // - A freshly-started task that is stopped should not do anything
                // before it enters the stop.
                //
                // - If taskRunState.execute returns nil, the task goroutine should
                // exit without checking for a stop.
                //
                // - Task.Start won't start Task.run if t.runState is nil, so this
                // ordering is safe.
                t.doStop()
                t.runState = t.runState.execute(t)
                if t.runState == nil {
                        t.accountTaskGoroutineEnter(TaskGoroutineNonexistent)
                        t.goroutineStopped.Done()
                        t.tg.liveGoroutines.Done()
                        t.tg.pidns.owner.liveGoroutines.Done()
                        t.tg.pidns.owner.runningGoroutines.Done()
                        t.p.Release()

                        // Deferring this store triggers a false positive in the race
                        // detector (https://github.com/golang/go/issues/42599).
                        atomic.StoreInt64(&t.goid, 0)
                        // Keep argument alive because stack trace for dead variables may not be correct.
                        runtime.KeepAlive(threadID)
                        return
                }
        }
}

 

func (ts *TaskSet) newTask(cfg *TaskConfig) (*Task, error) {
        tg := cfg.ThreadGroup
        image := cfg.TaskImage
        t := &Task{
                taskNode: taskNode{
                        tg:       tg,
                        parent:   cfg.Parent,
                        children: make(map[*Task]struct{}),
                },
                runState:           (*runApp)(nil),
                interruptChan:      make(chan struct{}, 1),
                signalMask:         cfg.SignalMask,
                signalStack:        arch.SignalStack{Flags: arch.SignalStackFlagDisable},
                image:              *image,
                fsContext:          cfg.FSContext,
                fdTable:            cfg.FDTable,
                p:                  cfg.Kernel.Platform.NewContext(),
                k:                  cfg.Kernel,
                ptraceTracees:      make(map[*Task]struct{}),
                allowedCPUMask:     cfg.AllowedCPUMask.Copy(),
                ioUsage:            &usage.IO{},
                niceness:           cfg.Niceness,
                netns:              cfg.NetworkNamespace,
                utsns:              cfg.UTSNamespace,
                ipcns:              cfg.IPCNamespace,
                abstractSockets:    cfg.AbstractSocketNamespace,
                mountNamespaceVFS2: cfg.MountNamespaceVFS2,
                rseqCPU:            -1,
                rseqAddr:           cfg.RSeqAddr,
                rseqSignature:      cfg.RSeqSignature,
                futexWaiter:        futex.NewWaiter(),
                containerID:        cfg.ContainerID,
        }

 

 

 

 pkg/sentry/kernel/task_run.go

 

func (*runApp) execute(t *Task) taskRunState {
    // ...
    switch err {
    case nil:
        // Handle application system call.
        return t.doSyscall()
    // ...
    }
    // ...
}

 

doSyscall pkg/sentry/kernel/task_syscall.go

 

// doSyscall is the entry point for an invocation of a system call specified by
// the current state of t's registers.
//
// The syscall path is very hot; avoid defer.
func (t *Task) doSyscall() taskRunState {
    sysno := t.Arch().SyscallNo()
    args := t.Arch().SyscallArgs()

    // Tracers expect to see this between when the task traps into the kernel
    // to perform a syscall and when the syscall is actually invoked.
    // This useless-looking temporary is needed because Go.
    tmp := uintptr(syscall.ENOSYS)
    t.Arch().SetReturn(-tmp)

    // Check seccomp filters. The nil check is for performance (as seccomp use
    // is rare), not needed for correctness.
    if t.syscallFilters.Load() != nil {
        switch r := t.checkSeccompSyscall(int32(sysno), args, usermem.Addr(t.Arch().IP())); r {
        case linux.SECCOMP_RET_ERRNO, linux.SECCOMP_RET_TRAP:
            t.Debugf("Syscall %d: denied by seccomp", sysno)
            return (*runSyscallExit)(nil)
        case linux.SECCOMP_RET_ALLOW:
            // ok
        case linux.SECCOMP_RET_KILL_THREAD:
            t.Debugf("Syscall %d: killed by seccomp", sysno)
            t.PrepareExit(ExitStatus{Signo: int(linux.SIGSYS)})
            return (*runExit)(nil)
        case linux.SECCOMP_RET_TRACE:
            t.Debugf("Syscall %d: stopping for PTRACE_EVENT_SECCOMP", sysno)
            return (*runSyscallAfterPtraceEventSeccomp)(nil)
        default:
            panic(fmt.Sprintf("Unknown seccomp result %d", r))
        }
    }

    return t.doSyscallEnter(sysno, args)
}
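The "useless-looking temporary" in doSyscall is most likely there because Go evaluates constant expressions exactly: negating an unsigned constant is a compile-time overflow error, while negating a variable wraps at run time. The sketch below shows the run-time form. The enosys constant is the Linux value written out by hand and negErrno is a hypothetical helper, not a gVisor function.

```go
package main

import "fmt"

// enosys is ENOSYS on Linux, written out so the sketch is portable.
const enosys = 38

// negErrno returns -errno as an unsigned register value.
//
// Writing -uintptr(enosys) directly would be a constant expression, and the
// compiler rejects it because -38 does not fit in uintptr. Negating a
// variable instead is ordinary two's-complement arithmetic.
func negErrno(errno uintptr) uintptr {
	return -errno
}

func main() {
	r := negErrno(enosys)
	// Reinterpreting the register as a signed integer gives the value that
	// user space sees as the syscall's return code.
	fmt.Println(int(r)) // prints "-38"
}
```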

 

func (t *Task) executeSyscall(sysno uintptr, args arch.SyscallArguments) (rval uintptr, ctrl *SyscallControl, err error) {
    s := t.SyscallTable()

    fe := s.FeatureEnable.Word(sysno)

    var straceContext interface{}
    if bits.IsAnyOn32(fe, StraceEnableBits) {
        straceContext = s.Stracer.SyscallEnter(t, sysno, args, fe)
    }

    if bits.IsOn32(fe, ExternalBeforeEnable) && (s.ExternalFilterBefore == nil || s.ExternalFilterBefore(t, sysno, args)) {
        t.invokeExternal()
        // Ensure we check for stops, then invoke the syscall again.
        ctrl = ctrlStopAndReinvokeSyscall
    } else {
        fn := s.Lookup(sysno)
        if fn != nil {
            // Call our syscall implementation.
            rval, ctrl, err = fn(t, args)
        } else {
            // Use the missing function if not found.
            rval, err = t.SyscallTable().Missing(t, sysno, args)
        }
    }

    if bits.IsOn32(fe, ExternalAfterEnable) && (s.ExternalFilterAfter == nil || s.ExternalFilterAfter(t, sysno, args)) {
        t.invokeExternal()
        // Don't reinvoke the syscall.
    }

    if bits.IsAnyOn32(fe, StraceEnableBits) {
        s.Stracer.SyscallExit(straceContext, t, sysno, rval, err)
    }

    return
}

 

 

pkg/sentry/syscalls/linux/vfs2/fscontext.go

// Chdir implements Linux syscall chdir(2).
func Chdir(t *kernel.Task, args arch.SyscallArguments) (uintptr, *kernel.SyscallControl, error) {
        addr := args[0].Pointer()

        path, err := copyInPath(t, addr)
        if err != nil {
                return 0, nil, err
        }
        tpop, err := getTaskPathOperation(t, linux.AT_FDCWD, path, disallowEmptyPath, followFinalSymlink)
        if err != nil {
                return 0, nil, err
        }
        defer tpop.Release(t)

        vd, err := t.Kernel().VFS().GetDentryAt(t, t.Credentials(), &tpop.pop, &vfs.GetDentryOptions{
                CheckSearchable: true,
        })
        if err != nil {
                return 0, nil, err
        }
        t.FSContext().SetWorkingDirectoryVFS2(t, vd)
        vd.DecRef(t)
        return 0, nil, nil
}

 

root@cloud:~/onlyGvisor# ps -elf | grep 947729
4 S nobody   947729 947703  0  80   0 - 68452015012 sys_po 11:17 ?  00:00:00 runsc-sandbox --root=/var/run/docker/runtime-runsc-kvm/moby --log=/run/containerd/io.containerd.runtime.v1.linux/moby/b0b3e6a9f9c469275fe320d9b2b433902337cd66993b793ac79121e911d5bf88/log.json --log-format=json --platform=kvm --log-fd=3 boot --bundle=/run/containerd/io.containerd.runtime.v1.linux/moby/b0b3e6a9f9c469275fe320d9b2b433902337cd66993b793ac79121e911d5bf88 --controller-fd=4 --mounts-fd=5 --spec-fd=6 --start-sync-fd=7 --io-fds=8 --io-fds=9 --io-fds=10 --io-fds=11 --device-fd=12 --stdio-fds=13 --stdio-fds=14 --stdio-fds=15 --pidns=true --cpu-num 64 b0b3e6a9f9c469275fe320d9b2b433902337cd66993b793ac79121e911d5bf88
0 S root     948093 947631  0  80   0 -  1418 pipe_r 11:33 pts/3    00:00:00 grep --color=auto 947729
root@cloud:~/onlyGvisor# docker inspect test2 | grep Pid | head -n 1
            "Pid": 947729,
root@cloud:~/onlyGvisor# 

 

Debugging Socket with dlv

root@cloud:/gvisor# docker run --runtime=runsc-kvm --rm --name=test -d alpine sleep 1000
1076ade686c4ccea6e8c40e6d6881e4f5e9c403ff21aab9febf4557218a10e17
root@cloud:/gvisor# docker inspect test | grep Pid | head -n 1
            "Pid": 927424,
root@cloud:/gvisor# docker exec -it test ping 8.8.8.8
PING 8.8.8.8 (8.8.8.8): 56 data bytes
64 bytes from 8.8.8.8: seq=1 ttl=42 time=57.058 ms
64 bytes from 8.8.8.8: seq=2 ttl=42 time=56.148 ms
64 bytes from 8.8.8.8: seq=3 ttl=42 time=56.321 ms
64 bytes from 8.8.8.8: seq=4 ttl=42 time=69.416 ms
64 bytes from 8.8.8.8: seq=5 ttl=42 time=55.813 ms
64 bytes from 8.8.8.8: seq=6 ttl=42 time=68.444 ms
64 bytes from 8.8.8.8: seq=7 ttl=42 time=56.031 ms
^C
--- 8.8.8.8 ping statistics ---
8 packets transmitted, 7 packets received, 12% packet loss
round-trip min/avg/max = 55.813/59.890/69.416 ms
root@cloud:/gvisor# docker exec -it test ping 8.8.8.8
PING 8.8.8.8 (8.8.8.8): 56 data bytes

64 bytes from 8.8.8.8: seq=0 ttl=42 time=111.545 ms
64 bytes from 8.8.8.8: seq=1 ttl=42 time=55.150 ms
64 bytes from 8.8.8.8: seq=2 ttl=42 time=55.362 ms
64 bytes from 8.8.8.8: seq=3 ttl=42 time=58.652 ms
64 bytes from 8.8.8.8: seq=4 ttl=42 time=56.521 ms
64 bytes from 8.8.8.8: seq=5 ttl=42 time=55.958 ms
64 bytes from 8.8.8.8: seq=6 ttl=42 time=55.386 ms
64 bytes from 8.8.8.8: seq=7 ttl=42 time=54.869 ms
64 bytes from 8.8.8.8: seq=8 ttl=42 time=54.373 ms
64 bytes from 8.8.8.8: seq=9 ttl=42 time=74.912 ms
64 bytes from 8.8.8.8: seq=10 ttl=42 time=55.755 ms

 

root@cloud:/mycontainer# dlv attach 927424
Type 'help' for list of commands.
(dlv) b Socket
Command failed: Location "Socket" ambiguous: golang.org/x/sys/unix.Socket, syscall.Socket, type..eq.gvisor.dev/gvisor/pkg/unet.Socket, gvisor.dev/gvisor/pkg/sentry/syscalls/linux.Socket, gvisor.dev/gvisor/pkg/sentry/socket/netstack.(*provider).Socket…
(dlv) c
received SIGINT, stopping process (will not forward signal)
> syscall.Syscall6() src/syscall/asm_linux_arm64.s:43 (PC: 0x8dccc)
Warning: debugging optimized function
(dlv) b  linux.Socket
Breakpoint 1 set at 0x587f30 for gvisor.dev/gvisor/pkg/sentry/syscalls/linux.Socket() pkg/sentry/syscalls/linux/sys_socket.go:172
(dlv) b netstack.Socket
Command failed: Location "netstack.Socket" ambiguous: gvisor.dev/gvisor/pkg/sentry/socket/netstack.(*provider).Socket, gvisor.dev/gvisor/pkg/sentry/socket/netstack.(*providerVFS2).Socket…
(dlv) b netstack.(*provider).Socket
Breakpoint 2 set at 0x647270 for gvisor.dev/gvisor/pkg/sentry/socket/netstack.(*provider).Socket() pkg/sentry/socket/netstack/provider.go:94
(dlv) b netstack.(*providerVFS2).Socket
Breakpoint 3 set at 0x647960 for gvisor.dev/gvisor/pkg/sentry/socket/netstack.(*providerVFS2).Socket() pkg/sentry/socket/netstack/provider_vfs2.go:38
(dlv) c
> gvisor.dev/gvisor/pkg/sentry/syscalls/linux.Socket() pkg/sentry/syscalls/linux/sys_socket.go:172 (hits goroutine(306):1 total:1) (PC: 0x587f30)
Warning: debugging optimized function
(dlv) bt
0  0x0000000000587f30 in gvisor.dev/gvisor/pkg/sentry/syscalls/linux.Socket
   at pkg/sentry/syscalls/linux/sys_socket.go:172
1  0x0000000000522ea4 in gvisor.dev/gvisor/pkg/sentry/kernel.(*Task).executeSyscall
   at pkg/sentry/kernel/task_syscall.go:104
2  0x0000000000523c5c in gvisor.dev/gvisor/pkg/sentry/kernel.(*Task).doSyscallInvoke
   at pkg/sentry/kernel/task_syscall.go:239
3  0x00000000005238dc in gvisor.dev/gvisor/pkg/sentry/kernel.(*Task).doSyscallEnter
   at pkg/sentry/kernel/task_syscall.go:199
4  0x00000000005233e0 in gvisor.dev/gvisor/pkg/sentry/kernel.(*Task).doSyscall
   at pkg/sentry/kernel/task_syscall.go:174
5  0x0000000000518e00 in gvisor.dev/gvisor/pkg/sentry/kernel.(*runApp).execute
   at pkg/sentry/kernel/task_run.go:282
6  0x0000000000517d9c in gvisor.dev/gvisor/pkg/sentry/kernel.(*Task).run
   at pkg/sentry/kernel/task_run.go:97
7  0x0000000000077c84 in runtime.goexit
   at src/runtime/asm_arm64.s:1136
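
The backtrace above ends with Task.executeSyscall calling linux.Socket, which suggests a lookup by syscall number followed by a call to the registered handler. A minimal sketch of that dispatch pattern follows. SyscallFn, the map-based Table field, errNoSys, and the fake socket handler are hypothetical simplifications for illustration, not gVisor's actual types. The number 198 is socket(2) in the generic Linux syscall table used on arm64.

```go
package main

import (
	"errors"
	"fmt"
)

// SyscallFn is a hypothetical handler signature. gVisor's real handlers
// receive a *kernel.Task and architecture-specific arguments instead.
type SyscallFn func(args [6]uintptr) (uintptr, error)

// SyscallTable is a simplified stand-in: syscall number -> handler.
type SyscallTable struct {
	Table map[uintptr]SyscallFn
}

// errNoSys is returned for numbers that have no registered handler.
var errNoSys = errors.New("function not implemented")

// executeSyscall looks up the handler for sysno and invokes it.
func (s *SyscallTable) executeSyscall(sysno uintptr, args [6]uintptr) (uintptr, error) {
	fn, ok := s.Table[sysno]
	if !ok {
		return 0, errNoSys
	}
	return fn(args)
}

// sysSocket is the socket(2) syscall number on arm64 Linux.
const sysSocket = 198

// socket is a fake handler that pretends to allocate file descriptor 3.
func socket(args [6]uintptr) (uintptr, error) {
	return 3, nil
}

func main() {
	table := &SyscallTable{Table: map[uintptr]SyscallFn{sysSocket: socket}}

	// AF_INET = 2, SOCK_STREAM = 1, protocol = 0.
	fd, err := table.executeSyscall(sysSocket, [6]uintptr{2, 1, 0})
	fmt.Println(fd, err)

	// An unregistered number falls through to errNoSys.
	_, err = table.executeSyscall(9999, [6]uintptr{})
	fmt.Println(err)
}
```

In the real sentry the handler would go on to call into the netstack provider, which is where breakpoints 2 and 3 from the session above were set.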

 

 

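entersyscall + exitsyscall

        entersyscall()                          // mark this goroutine as being in a system call
        bluepill(c)                             // enter guest mode on this vCPU
        vector = c.CPU.SwitchToUser(switchOpts) // run the application until it traps back
        exitsyscall()                           // hand the goroutine back to the Go scheduler


//go:linkname entersyscall runtime.entersyscall
func entersyscall()

//go:linkname exitsyscall runtime.exitsyscall
func exitsyscall()

In gVisor, entersyscall and exitsyscall are the Go runtime's own functions, pulled in through go:linkname. The first dlv session below confirms this: the breakpoint on entersyscall resolves to runtime.entersyscall in GOROOT/src/runtime/proc.go.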
root@cloud:~/onlyGvisor# dlv attach 947729
Type 'help' for list of commands.
(dlv) b entersyscall
Breakpoint 1 set at 0x72780 for runtime.entersyscall() GOROOT/src/runtime/proc.go:3126
(dlv) c
> runtime.entersyscall() GOROOT/src/runtime/proc.go:3126 (hits goroutine(203):1 total:1) (PC: 0x72780)
Warning: debugging optimized function
(dlv) bt
0  0x0000000000072780 in runtime.entersyscall
   at GOROOT/src/runtime/proc.go:3126
1  0x000000000008dcb0 in syscall.Syscall6
   at src/syscall/asm_linux_arm64.s:35
2  0x00000000005458e8 in gvisor.dev/gvisor/pkg/fdnotifier.epollWait
   at pkg/fdnotifier/poll_unsafe.go:76
3  0x0000000000545564 in gvisor.dev/gvisor/pkg/fdnotifier.(*notifier).waitAndNotify
   at pkg/fdnotifier/fdnotifier.go:149
4  0x0000000000077c84 in runtime.goexit
   at src/runtime/asm_arm64.s:1136
(dlv) clearall
Breakpoint 1 cleared at 0x72780 for runtime.entersyscall() GOROOT/src/runtime/proc.go:3126
(dlv) b pkg/sentry/platform/kvm/machine_arm64_unsafe.go:248
Breakpoint 2 set at 0x87f504 for gvisor.dev/gvisor/pkg/sentry/platform/kvm.(*vCPU).SwitchToUser() pkg/sentry/platform/kvm/machine_arm64_unsafe.go:248
(dlv) c
> gvisor.dev/gvisor/pkg/sentry/platform/kvm.(*vCPU).SwitchToUser() pkg/sentry/platform/kvm/machine_arm64_unsafe.go:248 (hits goroutine(94):1 total:1) (PC: 0x87f504)
Warning: debugging optimized function
(dlv) bt
0  0x000000000087f504 in gvisor.dev/gvisor/pkg/sentry/platform/kvm.(*vCPU).SwitchToUser
   at pkg/sentry/platform/kvm/machine_arm64_unsafe.go:248
1  0x000000000087bb1c in gvisor.dev/gvisor/pkg/sentry/platform/kvm.(*context).Switch
   at pkg/sentry/platform/kvm/context.go:75
2  0x00000000005186d0 in gvisor.dev/gvisor/pkg/sentry/kernel.(*runApp).execute
   at pkg/sentry/kernel/task_run.go:271
3  0x0000000000517d9c in gvisor.dev/gvisor/pkg/sentry/kernel.(*Task).run
   at pkg/sentry/kernel/task_run.go:97
4  0x0000000000077c84 in runtime.goexit
   at src/runtime/asm_arm64.s:1136
(dlv) s
> gvisor.dev/gvisor/pkg/sentry/platform/kvm.(*vCPU).SwitchToUser() pkg/sentry/platform/kvm/machine_arm64_unsafe.go:249 (PC: 0x87f508)
Warning: debugging optimized function
(dlv) bt
0  0x000000000087f508 in gvisor.dev/gvisor/pkg/sentry/platform/kvm.(*vCPU).SwitchToUser
   at pkg/sentry/platform/kvm/machine_arm64_unsafe.go:249
1  0x000000000087bb1c in gvisor.dev/gvisor/pkg/sentry/platform/kvm.(*context).Switch
   at pkg/sentry/platform/kvm/context.go:75
2  0x00000000005186d0 in gvisor.dev/gvisor/pkg/sentry/kernel.(*runApp).execute
   at pkg/sentry/kernel/task_run.go:271
3  0x0000000000517d9c in gvisor.dev/gvisor/pkg/sentry/kernel.(*Task).run
   at pkg/sentry/kernel/task_run.go:97
4  0x0000000000077c84 in runtime.goexit
   at src/runtime/asm_arm64.s:1136
(dlv) s
Stopped at: 0x881b80
=>no source available

 

posted on 2021-01-08 14:55 tycoon3
