
Implementing the docker run command

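This article implements the first command, Mydocker run, which is analogous to docker run -it [command]. It creates new Namespaces so that the new process gets an isolated view of the system. Two problems have to be solved: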


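  • Command-line argument parsing: the github.com/urfave/cli library parses the command line typed by the user; the commands to parse are run and init;
  • Isolation of system information between containers, and how that information is obtained (this can be done by mounting /proc);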


docker run execution flow

Mydocker has to parse the argument list typed by the user, for example Mydocker run -it /bin/sh. The first step is to recognize and parse the run argument.

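The handler for run creates and initializes the container process; container processes are isolated from one another by Namespaces. The implementation creates the container process by invoking the /proc/self/exe executable (/proc/self refers to the current process). It invokes it as /proc/self/exe init, so the init argument tells the forked child process to perform the container initialization.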
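The re-exec step can be sketched in isolation as follows. This is a minimal illustration, not the project's code: newInitCmd and containerCloneFlags are hypothetical names, and the command is only built, not started, because starting it with these clone flags requires root privileges.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"syscall"
)

// containerCloneFlags lists the namespaces requested for the child
// (hypothetical name, same flag set as in the article's listing).
const containerCloneFlags = syscall.CLONE_NEWUTS | syscall.CLONE_NEWPID |
	syscall.CLONE_NEWNS | syscall.CLONE_NEWNET | syscall.CLONE_NEWIPC

// newInitCmd builds, but does not start, a command that re-runs the
// given binary with the "init" argument inside new namespaces.
func newInitCmd(exePath string) *exec.Cmd {
	cmd := exec.Command(exePath, "init")
	cmd.SysProcAttr = &syscall.SysProcAttr{Cloneflags: containerCloneFlags}
	cmd.Stdin, cmd.Stdout, cmd.Stderr = os.Stdin, os.Stdout, os.Stderr
	return cmd
}

func main() {
	// /proc/self/exe is a symlink to the binary of the calling process.
	exePath, err := os.Readlink("/proc/self/exe")
	if err != nil {
		fmt.Fprintln(os.Stderr, "readlink failed:", err)
		os.Exit(1)
	}
	cmd := newInitCmd(exePath)
	fmt.Println("would start:", cmd.Args)
	// Calling cmd.Start() here would need root because of the clone flags.
}
```

Because the child runs the same binary, the parent and the init logic live in one executable and are selected purely by the first argument.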

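Once the container process has been initialized, it has to run the actual command, for example /bin/sh. The parent process passes the command-line arguments to the child through an anonymous pipe.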
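The article does not show the pipe helpers themselves, so the following is only a sketch of how such a hand-off might look. sendCommand, readCommand, and roundTrip are hypothetical names, and joining the arguments with spaces is an assumption; it loses arguments that contain whitespace. Both ends run in one process here to keep the example runnable. In the real flow the child would obtain the read end from file descriptor 3, for example with os.NewFile(uintptr(3), "pipe"), since ExtraFiles entries start at fd 3.

```go
package main

import (
	"fmt"
	"io"
	"os"
	"strings"
)

// sendCommand writes the command as one space-separated string and
// closes the write end so that the reader sees EOF.
func sendCommand(w *os.File, args []string) error {
	defer w.Close()
	_, err := w.WriteString(strings.Join(args, " "))
	return err
}

// readCommand reads until EOF and splits the data back into arguments.
func readCommand(r *os.File) ([]string, error) {
	defer r.Close()
	data, err := io.ReadAll(r)
	if err != nil {
		return nil, err
	}
	return strings.Fields(string(data)), nil
}

// roundTrip sends args through an anonymous pipe and reads them back.
func roundTrip(args []string) ([]string, error) {
	r, w, err := os.Pipe()
	if err != nil {
		return nil, err
	}
	if err := sendCommand(w, args); err != nil {
		r.Close()
		return nil, err
	}
	return readCommand(r)
}

func main() {
	args, err := roundTrip([]string{"/bin/sh", "-c", "ls"})
	if err != nil {
		fmt.Fprintln(os.Stderr, "pipe round trip failed:", err)
		os.Exit(1)
	}
	fmt.Println(args)
}
```

Closing the write end matters: a reader that waits for EOF blocks until every copy of the write end has been closed.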

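The child reads the data from the pipe and calls execve(fileName, argv, env), which replaces the current process's image, data, and stack. The command then runs in a fresh, isolated address space.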
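As a small illustration of the execve step, the sketch below resolves a command through PATH and then replaces the current process with it. resolveCommand is a hypothetical helper, and the program takes the command from its own arguments instead of from a pipe.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"syscall"
)

// resolveCommand looks up args[0] in PATH and returns the path and the
// argv to hand to execve. argv[0] stays as the user typed it.
func resolveCommand(args []string) (string, []string, error) {
	if len(args) == 0 {
		return "", nil, fmt.Errorf("empty command")
	}
	path, err := exec.LookPath(args[0])
	if err != nil {
		return "", nil, err
	}
	return path, args, nil
}

func main() {
	if len(os.Args) < 2 {
		fmt.Println("usage: prog <command> [args...]")
		return
	}
	path, argv, err := resolveCommand(os.Args[1:])
	if err != nil {
		fmt.Fprintln(os.Stderr, "lookup failed:", err)
		os.Exit(1)
	}
	// On success Exec never returns: the process keeps its PID but now
	// runs the new program.
	if err := syscall.Exec(path, argv, os.Environ()); err != nil {
		fmt.Fprintln(os.Stderr, "exec failed:", err)
		os.Exit(1)
	}
}
```

Keeping the PID is what makes the user command PID 1 inside the container: the init process does not spawn it, it becomes it.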


The main function registers the container's core commands and how they are parsed, using the urfave/cli library.

// main.go

package main

import (
	"os"

	log "github.com/sirupsen/logrus"
	"github.com/urfave/cli"
)

const usage = `mydocker is a simple container runtime implementation.
			   The purpose of this project is to learn how docker works and how to write a docker by ourselves
			   Enjoy it, just for fun.`

func main() {
	app := cli.NewApp()
	app.Name = "Mydocker"
	app.Usage = usage

	// register the supported commands: initCommand and runCommand
	app.Commands = []cli.Command{
		initCommand,
		runCommand,
	}

	// configure logrus before any command runs; JSON output matches
	// the log lines shown later in this article
	app.Before = func(ctx *cli.Context) error {
		log.SetFormatter(&log.JSONFormatter{})
		log.SetOutput(os.Stdout)
		return nil
	}

	if err := app.Run(os.Args); err != nil {
		log.Fatal(err)
	}
}


  • Create the urfave/cli app object and define how command arguments are parsed;
  • Define the log output format;


When using docker, a container is started and a command is executed in it with docker run xxx. The command format is docker run [OPTIONS] IMAGE [COMMAND] [ARG...]

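  • -a stdin: attach to a standard stream; the value is one of STDIN, STDOUT, or STDERR;
  • -d: run the container in the background and print the container ID;
  • -i: run the container in interactive mode; usually used together with -t;
  • -P: random port mapping; container ports are mapped to random host ports;
  • -p: explicit port mapping, in the form host port:container port;
  • -t: allocate a pseudo-terminal for the container; usually used together with -i;
  • --expose=[]: expose a port or a range of ports;
  • --volume, -v: bind-mount a volume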

package main

import (
	"fmt"

	log "github.com/sirupsen/logrus"
	"github.com/urfave/cli"

	"Mydocker/container" // assumed import path; use the module name from your go.mod
)

/*
 * container start command
 * for example: Mydocker run xxx -it /bin/bash
 *
 * start procedure:
 * 1. the user runs Mydocker run by hand;
 * 2. urfave/cli parses the user's command;
 * 3. the runCommand action is called to build the cmd object;
 * 4. NewParentProcess returns the cmd object to the run logic;
 * 5. based on that cmd, /proc/self/exe init re-executes mydocker, which initializes the container's environment;
 * 6. all init procedures end;
 */
var runCommand = cli.Command{
	Name: "run",
	Usage: `Create a container with namespace and cgroups limit
			mydocker run -it [command]`,
	// Flag types other than it, name, and e are inferred from the usage text.
	Flags: []cli.Flag{
		cli.BoolFlag{Name: "it", Usage: "enable tty"},
		cli.BoolFlag{Name: "d", Usage: "detach container"},
		cli.StringFlag{Name: "m", Usage: "memory limit"},
		cli.StringFlag{Name: "cpushare", Usage: "cpushare limit"},
		cli.StringFlag{Name: "cpuset", Usage: "cpuset limit"},
		cli.StringFlag{Name: "name", Usage: "container name"},
		cli.StringFlag{Name: "v", Usage: "volume"},
		cli.StringSliceFlag{Name: "e", Usage: "set environment"},
		cli.StringFlag{Name: "net", Usage: "container network"},
		cli.StringSliceFlag{Name: "p", Usage: "port mapping"},
	},
	// parse the command line; tty enables an interactive terminal
	Action: func(context *cli.Context) error {
		if len(context.Args()) < 1 {
			return fmt.Errorf("missing container command")
		}

		// collect everything after the flags as the container command
		var cmdArray []string
		for _, arg := range context.Args() {
			cmdArray = append(cmdArray, arg)
		}
		// i: interact through the console
		// t: allocate a tty so that a shell can be used
		tty := context.Bool("it")
		// name: container name
		containerName := context.String("name")
		// environment variables, read from the -e flag
		envSlice := context.StringSlice("e")
		imageName := cmdArray[0]
		log.Infof("exec run command, bashMode:%v, imageName:%v", tty, imageName)
		// start creating the container process
		Run(tty, cmdArray, containerName, imageName, envSlice)
		return nil
	},
}
/*
 * container initialization command
 */
var initCommand = cli.Command{
	Name:  "init",
	Usage: "Init container process run user's process in container. Do not call it outside",
	// initialize the process resources after the container has been created
	Action: func(context *cli.Context) error {
		log.Infof("exec init command")
		return container.ContainerResourceInit()
	},
}

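Note that after the handler for run has finished its work, it executes /proc/self/exe init, that is, it starts the same executable again with init as the command-line argument. The new child process then runs the handler for init.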

The handler for init has to initialize the process resources, mount /proc, and execute the shell command.


/*
 * Clone a process isolated by namespaces and use /proc/self/exe to initialize its resources.
 * Note:
 * 1. the parent may write to writePipe only after the child process has been started
 */
func Run(tty bool, cmdArray []string, containerName, imageName string, envSlice []string) {
	// build the container process (argument order matches NewParentProcess)
	cmdProcess, writePipe := container.NewParentProcess(tty, containerName, imageName, envSlice)
	if cmdProcess == nil {
		log.Errorf("run::Run create child process failed")
		return
	}
	// start the container process
	if err := cmdProcess.Start(); err != nil {
		log.Errorf("run::Run parent Start failed %v", err)
		return
	}
	// send the user command to the child once it is running
	sendInitCommands(cmdArray, writePipe)
	if tty {
		// in interactive mode, wait for the container process to exit
		cmdProcess.Wait()
	}
}

/*
 * Build a new process and return the command that starts it.
 * 1. use /proc/self/exe to create a child process isolated by namespaces;
 * 2. pass the init argument so that the child initializes itself;
 * 3. redirect stdin/stdout/stderr;
 * Note:
 * 1. parameters are passed from parent to child through a pipe, which avoids overly long command lines
 */
func NewParentProcess(tty bool, containerName, imageName string, envSlice []string) (*exec.Cmd, *os.File) {
	// create the pipe that carries parameters from parent to child
	readPipe, writePipe, err := os.Pipe()
	if err != nil {
		log.Errorf("container_process::NewParentProcess new pipe failed")
		return nil, nil
	}
	// resolve /proc/self/exe to the path of the running executable
	exePath, err := os.Readlink("/proc/self/exe")
	if err != nil {
		log.Errorf("container_process::NewParentProcess can't find /proc/self/exe link")
		return nil, nil
	}
	processCmd := exec.Command(exePath, "init")
	processCmd.SysProcAttr = &syscall.SysProcAttr{
		Cloneflags: syscall.CLONE_NEWUTS | syscall.CLONE_NEWNET | syscall.CLONE_NEWPID | syscall.CLONE_NEWNS | syscall.CLONE_NEWIPC,
	}
	// redirect input/output
	if tty {
		processCmd.Stdin = os.Stdin
		processCmd.Stdout = os.Stdout
		processCmd.Stderr = os.Stderr
	} else {
		// when running in the background, redirect stdout to a log file
		dirURL := fmt.Sprintf(InfoLogFormat, containerName)
		if err := os.MkdirAll(dirURL, Perm0622); err != nil {
			log.Errorf("container_process::NewParentProcess mkdir log directory failed %s", dirURL)
			return nil, nil
		}
		logPath := dirURL + LogFileName
		file, err := os.Create(logPath)
		if err != nil {
			log.Errorf("container_process::NewParentProcess create logFile failed %s", logPath)
			return nil, nil
		}
		processCmd.Stdout = file
	}
	// hand readPipe to the child as an extra file; it becomes fd 3, the fourth descriptor
	processCmd.ExtraFiles = []*os.File{readPipe}
	return processCmd, writePipe
}

/*
 * Runs as the first process after the container process has been created and initializes its resources.
 * 1. read the user command from readPipe;
 * 2. mount /proc for the current process;
 * 3. replace the init process with the user command via exec.
 */
func ContainerResourceInit() error {
	// read parameters from readPipe
	cmdArrays := readUserCommands()
	if len(cmdArrays) == 0 {
		return errors.New("init::ContainerResourceInit userCommands is nil")
	}
	// proc mount
	mountProc()
	// execute commands
	path, err := exec.LookPath(cmdArrays[0])
	if err != nil {
		log.Errorf("init::ContainerResourceInit exec lookPath failed, err=%v", err)
		return err
	}
	log.Infof("init::ContainerResourceInit execuatble path=%v", path)
	// on success Exec does not return; the user command replaces this process
	if err = syscall.Exec(path, cmdArrays[0:], os.Environ()); err != nil {
		log.Errorf("init::ContainerResourceInit exec failed, err=%v", err)
	}
	return nil
}

/*
 * Mount the proc filesystem for the current process.
 * mountFlags:
 *   1. syscall.MS_NOEXEC: programs cannot be executed from this filesystem;
 *   2. syscall.MS_NOSUID: set-user-ID and set-group-ID bits are ignored when running programs from this filesystem;
 *   3. syscall.MS_NODEV: device files on this filesystem cannot be accessed; mounts carry this flag by default.
 * On systemd-based Linux systems mount propagation is shared by default, so the mount namespace
 * must be explicitly made private to keep it independent of the host.
 */
func mountProc() {
	if err := syscall.Mount("", "/", "", syscall.MS_PRIVATE|syscall.MS_REC, ""); err != nil {
		log.Errorf("mount default namespace failed, err = %v", err)
	}
	defaultMountFlags := syscall.MS_NOEXEC | syscall.MS_NOSUID | syscall.MS_NODEV
	if err := syscall.Mount("proc", "/proc", "proc", uintptr(defaultMountFlags), ""); err != nil {
		log.Errorf("mount proc failed, err = %v", err)
	}
}

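/proc is a virtual filesystem that exposes information about the kernel and the running processes, including runtime system information such as memory, mounted devices, and hardware configuration. It lives in memory and takes no disk space. Mounting /proc lets us inspect this kernel information.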

Inside a container, /proc has to be remounted during the init step so that it is isolated from the host's /proc. The equivalent shell command is mount -t proc proc /proc

syscall.Mount("proc", "/proc", "proc", uintptr(mountFlags), "")
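Why the remount matters can be seen from an unprivileged experiment. A procfs instance shows only the processes visible in the PID namespace of the process that mounted it, so a container that kept the host's /proc would still see host PIDs. The sketch below reads the PID that procfs reports for the caller and compares it with the PID from the getpid system call; procSelfPID is a hypothetical helper name. The two values agree when /proc was mounted from the caller's own PID namespace.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// procSelfPID returns the PID that procfs reports for the caller.
// /proc/self is a symlink whose target is the caller's PID as seen
// by this procfs instance.
func procSelfPID() (int, error) {
	target, err := os.Readlink("/proc/self")
	if err != nil {
		return 0, err
	}
	return strconv.Atoi(target)
}

func main() {
	fromProc, err := procSelfPID()
	if err != nil {
		fmt.Fprintln(os.Stderr, "cannot read /proc/self:", err)
		os.Exit(1)
	}
	fmt.Println("pid according to /proc:", fromProc)
	fmt.Println("pid according to getpid:", os.Getpid())
}
```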


root@mydocker:~/mydocker# ./mydocker run -it /bin/ls
{"level":"info","msg":"init come on","time":"2024-01-03T15:07:27+08:00"}
{"level":"info","msg":"command: /bin/ls","time":"2024-01-03T15:07:27+08:00"}
LICENSE  Makefile  README.md  container  example  go.mod  go.sum  main.go  main_command.go  mydocker  run.go
root@mydocker:~/mydocker# ./mydocker run -it /bin/ls
{"level":"error","msg":"fork/exec /proc/self/exe: no such file or directory","time":"2024-01-03T15:07:28+08:00"}

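Starting the container a second time fails because /proc/self/exe cannot be found. On Linux systems that use systemd, mount propagation between the host and the process in the new mount namespace is shared by default, so the /proc mount made inside the container leaks back to the host. The mount propagation therefore has to be set to private explicitly before mounting, which keeps the host's /proc intact. The implementation is as follows: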

func mountProc() {
	// make mount propagation private, recursively from the root
	if err := syscall.Mount("", "/", "", syscall.MS_PRIVATE|syscall.MS_REC, ""); err != nil {
		log.Errorf("mount default namespace failed, err = %v", err)
	}
	// mount /proc for this process
	defaultMountFlags := syscall.MS_NOEXEC | syscall.MS_NOSUID | syscall.MS_NODEV
	if err := syscall.Mount("proc", "/proc", "proc", uintptr(defaultMountFlags), ""); err != nil {
		log.Errorf("mount proc failed, err = %v", err)
	}
}
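Whether a mount is shared can be checked without root by reading /proc/self/mountinfo, where a shared mount carries a shared:N tag among the optional fields that precede the "-" separator. The following sketch parses that format; parsePropagation is a hypothetical helper name, and only the shared tag is examined.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// parsePropagation extracts the mount point from one mountinfo line and
// reports whether the mount is shared. ok is false for malformed lines.
func parsePropagation(line string) (mountPoint string, shared bool, ok bool) {
	fields := strings.Fields(line)
	if len(fields) < 7 {
		return "", false, false
	}
	// Field 5 (index 4) is the mount point. Optional fields start at
	// index 6 and run until the "-" separator.
	for _, f := range fields[6:] {
		if f == "-" {
			return fields[4], shared, true
		}
		if strings.HasPrefix(f, "shared:") {
			shared = true
		}
	}
	return "", false, false // no separator found
}

func main() {
	file, err := os.Open("/proc/self/mountinfo")
	if err != nil {
		fmt.Fprintln(os.Stderr, "cannot open mountinfo:", err)
		os.Exit(1)
	}
	defer file.Close()

	scanner := bufio.NewScanner(file)
	for scanner.Scan() {
		mountPoint, shared, ok := parsePropagation(scanner.Text())
		if ok && (mountPoint == "/" || mountPoint == "/proc") {
			fmt.Printf("%s shared=%v\n", mountPoint, shared)
		}
	}
}
```

Running it on the host before and after the MS_PRIVATE remount inside a container is one way to confirm that the container's mounts no longer propagate.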




[root@localhost Mydocker]# go build .
[root@localhost Mydocker]# ./Mydockker run -it /bin/sh
{"level":"info","msg":"exec run command, bashMode:true, imageName:/bin/sh","time":"2024-01-31T23:06:07+08:00"}
{"level":"info","msg":"run::sendInitCommands all commands:/bin/sh","time":"2024-01-31T23:06:07+08:00"}
{"level":"info","msg":"exec init command","time":"2024-01-31T23:06:07+08:00"}
{"level":"info","msg":"init::ContainerResourceInit execuatble path=/bin/sh","time":"2024-01-31T23:06:07+08:00"}

# list the container's working directory
sh-4.2# ls
container  go.sum  mainCommands.go  Mydockker
go.mod     log     main.go          run.go

# inside the container, ps -af shows /bin/sh as the first process (PID 1), as expected
sh-4.2# ps -af
UID         PID   PPID  C STIME TTY          TIME CMD
root          1      0  0 23:06 pts/0    00:00:00 /bin/sh
root          7      1  0 23:06 pts/0    00:00:00 ps -af
