Minimal Docker

A tiny container runtime built in Go to understand namespaces, chroot, proc mounts, and basic cgroup v2 limits.

Go, Linux

I wanted to understand what a container runtime is really doing when you strip away everything extra. Not image layers, registries, or orchestration. Just the core mechanics: give a process its own hostname, its own PID tree, a different root filesystem, and a few resource limits.

So I built a tiny Docker-like toy in Go. It only runs on Linux, and it is very much not a real Docker replacement.

The parent-child split

The CLI is small: minimal-docker run <rootfs> <command> [args...]. The interesting part is that the parent process does not set up the container directly. It re-executes the same binary with a hidden child subcommand and starts that child with Linux namespace flags.

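// Re-execute this binary with the hidden "child" subcommand.
cmd := exec.Command("/proc/self/exe", append([]string{"child", rootfs, command}, args...)...)
cmd.SysProcAttr = &syscall.SysProcAttr{
  // Start the child in new UTS, PID, and mount namespaces.
  Cloneflags: syscall.CLONE_NEWUTS |
    syscall.CLONE_NEWPID |
    syscall.CLONE_NEWNS,
}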

That split makes the code easier to follow.

  • The parent stays on the host and acts like a launcher
  • The child starts inside the new namespaces and finishes the container setup

Using the same binary for both jobs keeps the project small and still makes the control flow obvious.
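The dispatch between the two roles can be sketched as follows. This is a minimal illustration, not the project's actual code: the helper names childArgv, run, and child are hypothetical, the child body is left as a placeholder, and the parent here simply waits with cmd.Run instead of doing any cgroup work. It compiles on Linux only, and running the run path needs root.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"syscall"
)

// childArgv builds the arguments for the re-executed binary: the hidden
// "child" subcommand, then the rootfs, the command, and the command's args.
func childArgv(rootfs, command string, args []string) []string {
	return append([]string{"child", rootfs, command}, args...)
}

// run is the parent side: it re-executes this binary inside new namespaces
// and waits for it to finish.
func run(rootfs, command string, args []string) error {
	cmd := exec.Command("/proc/self/exe", childArgv(rootfs, command, args)...)
	cmd.Stdin, cmd.Stdout, cmd.Stderr = os.Stdin, os.Stdout, os.Stderr
	cmd.SysProcAttr = &syscall.SysProcAttr{
		Cloneflags: syscall.CLONE_NEWUTS | syscall.CLONE_NEWPID | syscall.CLONE_NEWNS,
	}
	return cmd.Run()
}

// child is the container side; the real setup steps are omitted in this sketch.
func child(rootfs, command string, args []string) error {
	return fmt.Errorf("child setup for %s in %s is omitted in this sketch", command, rootfs)
}

func main() {
	if len(os.Args) < 4 {
		fmt.Fprintln(os.Stderr, "usage: minimal-docker run <rootfs> <command> [args...]")
		return
	}
	rootfs, command, args := os.Args[2], os.Args[3], os.Args[4:]
	var err error
	switch os.Args[1] {
	case "run":
		err = run(rootfs, command, args)
	case "child":
		err = child(rootfs, command, args)
	default:
		err = fmt.Errorf("unknown command %q", os.Args[1])
	}
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```

Both roles share the same argument layout, so the child can read rootfs and command from the same positions the parent did.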

Namespaces do most of the magic

Most of the trick is just three Linux namespaces:

  • CLONE_NEWUTS gives the process its own hostname
  • CLONE_NEWPID gives it its own PID namespace, in which the child runs as PID 1
  • CLONE_NEWNS gives it its own mount namespace
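One way to check that the flags took effect is to compare the namespace links under /proc/self/ns on the host and inside the container. Each link target has the form type:[inode], and two processes share a namespace only if the inode numbers match. The sketch below is a hypothetical helper (parseNsLink is not part of the project) that reads and parses those links.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// parseNsLink splits a namespace symlink target such as "uts:[4026531838]"
// into its type and inode number.
func parseNsLink(target string) (string, uint64, error) {
	kind, rest, ok := strings.Cut(target, ":[")
	if !ok || !strings.HasSuffix(rest, "]") {
		return "", 0, fmt.Errorf("unexpected namespace link %q", target)
	}
	inode, err := strconv.ParseUint(strings.TrimSuffix(rest, "]"), 10, 64)
	if err != nil {
		return "", 0, fmt.Errorf("bad inode in %q: %w", target, err)
	}
	return kind, inode, nil
}

func main() {
	// The three namespaces used by the runtime described above.
	for _, name := range []string{"uts", "pid", "mnt"} {
		target, err := os.Readlink("/proc/self/ns/" + name)
		if err != nil {
			fmt.Println(name, "unavailable:", err)
			continue
		}
		kind, inode, err := parseNsLink(target)
		if err != nil {
			fmt.Println(err)
			continue
		}
		fmt.Printf("%s namespace inode: %d\n", kind, inode)
	}
}
```

Running it once on the host and once as the container command should print different inode numbers for all three namespaces.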

Inside the child process, I set the hostname to minimal-docker, make mount propagation private, and then prepare the filesystem view for the final command.

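// Only affects the new UTS namespace, not the host.
syscall.Sethostname([]byte("minimal-docker"))
// Recursively mark every mount private so later mounts stay in this namespace.
syscall.Mount("", "/", "", syscall.MS_PRIVATE|syscall.MS_REC, "")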

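That MS_PRIVATE | MS_REC line matters more than it looks. On many hosts, systemd-based ones in particular, / is mounted with shared propagation, and the new mount namespace starts as a copy of the host's mounts with that propagation intact. Without the remount, mounts made inside the container, such as /proc, can propagate back to the host or behave in ways that are annoying to debug.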
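The propagation state of a mount is visible in /proc/self/mountinfo: the optional fields between the mount options and the "-" separator carry tags like shared:N or master:N, and a mount with neither is private. The sketch below is a hypothetical checker (the function name propagation is an illustration, and it simplifies by ignoring the unbindable tag and by reporting a mount that is both shared and a slave as shared).

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// propagation reports the propagation type recorded in one line of
// /proc/self/mountinfo by inspecting the optional fields before "-".
func propagation(line string) string {
	fields := strings.Fields(line)
	// Fields 0-5 are fixed; optional fields start at index 6 and end at "-".
	for i := 6; i < len(fields) && fields[i] != "-"; i++ {
		switch {
		case strings.HasPrefix(fields[i], "shared:"):
			return "shared"
		case strings.HasPrefix(fields[i], "master:"):
			return "slave"
		}
	}
	return "private"
}

func main() {
	f, err := os.Open("/proc/self/mountinfo")
	if err != nil {
		fmt.Println("cannot read mountinfo:", err)
		return
	}
	defer f.Close()

	sc := bufio.NewScanner(f)
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		// Field 4 is the mount point; report only the root mount.
		if len(fields) > 4 && fields[4] == "/" {
			fmt.Println("/ is", propagation(sc.Text()))
		}
	}
}
```

Printing this before and after the MS_PRIVATE remount is a quick way to confirm the change applied.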

Changing the root filesystem

Once the child is inside its own namespaces, it switches into the provided root filesystem with chroot, changes directory to /, mounts /proc, and then executes the requested command.

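syscall.Chroot(rootfs)
// Move the working directory inside the new root.
os.Chdir("/")
// Mount a fresh procfs that reflects the new PID namespace.
syscall.Mount("proc", "/proc", "proc", 0, "")
// Replace this process with the command; Exec does not search PATH.
syscall.Exec(command, append([]string{command}, args...), os.Environ())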

This is the point where the demo actually starts to feel like a container. If the rootfs only contains BusyBox and a few symlinks, then /bin/sh, /bin/ls, and /bin/ps all come from that tiny filesystem, not from the host.

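It also makes one thing obvious pretty quickly: the command has to exist inside the rootfs. There is no magic here. syscall.Exec does not search PATH, so the command has to be a path that resolves inside the new root, such as /bin/sh. If that file is missing, Exec just fails with "no such file or directory".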
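The excerpts above drop every return value, which hides exactly the failures described here. One possible way to structure the child setup with error checks is sketched below. The step type, runSteps, and childSetup are hypothetical names, and requiring an absolute command path is an assumption made for clarity rather than something the project necessarily does. The file compiles on Linux only, and the real setup path needs root and a prepared rootfs.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"syscall"
)

// step pairs a label with one setup action so failures are easy to locate.
type step struct {
	name string
	fn   func() error
}

// runSteps executes steps in order and stops at the first failure,
// prefixing the error with the step's name.
func runSteps(steps []step) error {
	for _, s := range steps {
		if err := s.fn(); err != nil {
			return fmt.Errorf("%s: %w", s.name, err)
		}
	}
	return nil
}

// childSetup mirrors the sequence described in the text, with error checks.
func childSetup(rootfs, command string, args []string) error {
	// Exec does not search PATH, so a bare name like "sh" would not be found.
	if !filepath.IsAbs(command) {
		return fmt.Errorf("command %q must be an absolute path inside the rootfs", command)
	}
	return runSteps([]step{
		{"sethostname", func() error { return syscall.Sethostname([]byte("minimal-docker")) }},
		{"make mounts private", func() error {
			return syscall.Mount("", "/", "", syscall.MS_PRIVATE|syscall.MS_REC, "")
		}},
		{"chroot", func() error { return syscall.Chroot(rootfs) }},
		{"chdir", func() error { return os.Chdir("/") }},
		{"mount proc", func() error { return syscall.Mount("proc", "/proc", "proc", 0, "") }},
		// Exec only returns on failure; on success the process image is replaced.
		{"exec", func() error {
			return syscall.Exec(command, append([]string{command}, args...), os.Environ())
		}},
	})
}

func main() {
	// The real setup needs root and a rootfs, so this only exercises validation.
	fmt.Println(childSetup("/tmp/rootfs", "sh", nil))
}
```

With named steps, a missing /bin/sh surfaces as an error prefixed with "exec:" instead of a silent exit.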

Basic cgroup v2 limits

Resource limits are surprisingly mundane. After starting the child, the parent creates a small cgroup under /sys/fs/cgroup, writes a few limits, and then adds the child PID to cgroup.procs.

// 256 MiB, in bytes.
os.WriteFile(filepath.Join(cgroupPath, "memory.max"), []byte("268435456"), 0o644)
os.WriteFile(filepath.Join(cgroupPath, "pids.max"), []byte("64"), 0o644)
// "<quota> <period>" in microseconds: 50 ms of CPU time per 100 ms.
os.WriteFile(filepath.Join(cgroupPath, "cpu.max"), []byte("50000 100000"), 0o644)
// Move the child into the cgroup so the limits apply to it.
os.WriteFile(filepath.Join(cgroupPath, "cgroup.procs"), []byte(strconv.Itoa(cmd.Process.Pid)), 0o644)

The limits are intentionally basic:

  • memory.max caps memory at 256 MiB (268435456 bytes)
  • pids.max limits the cgroup to 64 processes
  • cpu.max allows 50000 µs of CPU time per 100000 µs period, roughly half of one CPU
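The literal values written to those files come from simple arithmetic. As an illustration, the hypothetical helpers below (memoryMax and cpuMax are not part of the project) derive the same strings from more readable inputs.

```go
package main

import (
	"fmt"
	"math"
	"strconv"
)

// memoryMax converts a limit in MiB to the byte count memory.max expects.
func memoryMax(mib int64) string {
	return strconv.FormatInt(mib*1024*1024, 10)
}

// cpuMax formats a cpu.max value "<quota> <period>", both in microseconds,
// for a given number of CPUs (0.5 means half of one CPU).
func cpuMax(cpus float64, periodUs int64) string {
	quota := int64(math.Round(cpus * float64(periodUs)))
	return fmt.Sprintf("%d %d", quota, periodUs)
}

func main() {
	fmt.Println(memoryMax(256))      // prints 268435456
	fmt.Println(cpuMax(0.5, 100000)) // prints 50000 100000
}
```

Raising the quota above the period, for example cpuMax(2, 100000), is how cpu.max expresses more than one CPU.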

One of the more useful takeaways for me: a lot of resource isolation is just writing the right values to the cgroup filesystem. The harder part is knowing which files matter and what tradeoffs they imply.
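Part of knowing which files matter is knowing whether they exist at all. In cgroup v2, files like memory.max only appear in a child cgroup once the parent's cgroup.subtree_control enables that controller, and cgroup.controllers lists what the parent has available. The sketch below is a hypothetical pre-flight check (missingControllers and enableLine are illustrative names), assuming cgroup v2 is mounted at /sys/fs/cgroup and the new cgroup is created directly under it.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// missingControllers returns the needed controllers that are absent from the
// space-separated contents of a cgroup.controllers file.
func missingControllers(available string, needed []string) []string {
	have := make(map[string]bool)
	for _, c := range strings.Fields(available) {
		have[c] = true
	}
	var missing []string
	for _, c := range needed {
		if !have[c] {
			missing = append(missing, c)
		}
	}
	return missing
}

// enableLine builds the string to write to cgroup.subtree_control,
// for example "+cpu +memory +pids".
func enableLine(controllers []string) string {
	parts := make([]string, len(controllers))
	for i, c := range controllers {
		parts[i] = "+" + c
	}
	return strings.Join(parts, " ")
}

func main() {
	needed := []string{"cpu", "memory", "pids"}
	// Assumption: the cgroup v2 hierarchy is mounted at /sys/fs/cgroup.
	data, err := os.ReadFile("/sys/fs/cgroup/cgroup.controllers")
	if err != nil {
		fmt.Println("cannot read cgroup.controllers:", err)
		return
	}
	if missing := missingControllers(string(data), needed); len(missing) > 0 {
		fmt.Println("unavailable controllers:", missing)
		return
	}
	fmt.Println("would write to cgroup.subtree_control:", enableLine(needed))
}
```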

What I learned

  • A container is mostly a normal process started with the right Linux primitives. The container feeling comes from namespaces, filesystem setup, and cgroups working together.
  • Re-executing the same binary is a clean way to separate launcher logic from container setup. Once I saw that pattern, a lot of container runtime code became easier to read.
  • chroot alone is not a container. It changes the filesystem view, but without namespaces and cgroups you do not get the isolation people usually mean by containers.
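  • /proc has to be mounted fresh inside the new root, from within the new PID namespace, if you want tools like ps to show the container's processes. A procfs mount reflects the PID namespace of the process that mounted it, so reusing the host's /proc would still show the host's view.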
  • This kind of project is great for learning, but nowhere near production-safe. Real runtimes need user namespaces, better mount handling, signal forwarding, cleanup, capabilities, seccomp, networking, and a lot more defensive work.
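The /proc point is easy to verify by hand, because process listings are built from the numeric directory names under /proc. The sketch below is a hypothetical stand-in for that part of ps (pidsFromNames is an illustrative name): run inside the container with a fresh procfs it should list only the container's PIDs, starting at 1.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// pidsFromNames keeps the entries that are purely numeric, which is how
// process directories appear in /proc.
func pidsFromNames(names []string) []int {
	var pids []int
	for _, n := range names {
		if pid, err := strconv.Atoi(n); err == nil && pid > 0 {
			pids = append(pids, pid)
		}
	}
	return pids
}

func main() {
	entries, err := os.ReadDir("/proc")
	if err != nil {
		fmt.Println("cannot read /proc:", err)
		return
	}
	names := make([]string, 0, len(entries))
	for _, e := range entries {
		names = append(names, e.Name())
	}
	fmt.Println("visible PIDs:", pidsFromNames(names))
}
```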

Small project, but it made container internals feel a lot less mysterious.