I wanted to understand what a container runtime is really doing when you strip away everything extra. Not image layers, registries, or orchestration. Just the core mechanics: give a process its own hostname, its own PID tree, a different root filesystem, and a few resource limits.
So I built a tiny Docker-like toy in Go. It only runs on Linux, and it is very much not a real Docker replacement.
The parent-child split
The CLI is small: `minimal-docker run <rootfs> <command>`. The interesting part is that the parent process does not set up the container directly. It re-executes the same binary with a hidden child command and starts that child with Linux namespace flags.
```go
cmd := exec.Command("/proc/self/exe", append([]string{"child", rootfs, command}, args...)...)
cmd.SysProcAttr = &syscall.SysProcAttr{
	Cloneflags: syscall.CLONE_NEWUTS |
		syscall.CLONE_NEWPID |
		syscall.CLONE_NEWNS,
}
```

That split makes the code easier to follow.
- The parent stays on the host and acts like a launcher
- The child starts after namespace creation and finishes the container setup
Using the same binary for both jobs keeps the project small and still makes the control flow obvious.
Namespaces do most of the magic
Most of the trick is just three Linux namespaces:
- `CLONE_NEWUTS` gives the process its own hostname
- `CLONE_NEWPID` gives it its own PID namespace
- `CLONE_NEWNS` gives it its own mount namespace
Inside the child process, I set the hostname to `minimal-docker`, make mount propagation private, and then prepare the filesystem view for the final command.
```go
syscall.Sethostname([]byte("minimal-docker"))
syscall.Mount("", "/", "", syscall.MS_PRIVATE|syscall.MS_REC, "")
```

That `MS_PRIVATE | MS_REC` line matters more than it looks. Without it, mount changes can leak back to the host or behave in ways that are annoying to debug.
Changing the root filesystem
Once the child is inside its own namespaces, it switches into the provided root filesystem with chroot, changes directory to /, mounts /proc, and then executes the requested command.
```go
syscall.Chroot(rootfs)
os.Chdir("/")
syscall.Mount("proc", "/proc", "proc", 0, "")
syscall.Exec(command, append([]string{command}, args...), os.Environ())
```

This is the point where the demo actually starts to feel like a container. If the rootfs only contains BusyBox and a few symlinks, then `/bin/sh`, `/bin/ls`, and `/bin/ps` all come from that tiny filesystem, not from the host.
It also makes one thing obvious pretty quickly: the command has to exist inside the rootfs. There is no magic here. If `/bin/sh` is missing, it just fails.
Basic cgroup v2 limits
Resource limits are surprisingly mundane. After starting the child, the parent creates a small cgroup under /sys/fs/cgroup, writes a few limits, and then adds the child PID to cgroup.procs.
```go
os.WriteFile(filepath.Join(cgroupPath, "memory.max"), []byte("268435456"), 0o644)
os.WriteFile(filepath.Join(cgroupPath, "pids.max"), []byte("64"), 0o644)
os.WriteFile(filepath.Join(cgroupPath, "cpu.max"), []byte("50000 100000"), 0o644)
os.WriteFile(filepath.Join(cgroupPath, "cgroup.procs"), []byte(strconv.Itoa(cmd.Process.Pid)), 0o644)
```

The limits are intentionally basic:
- `memory.max` caps memory at 256 MB
- `pids.max` limits the number of processes
- `cpu.max` gives the container roughly half of one CPU (50,000 µs of quota per 100,000 µs period)
That was one of the more useful takeaways for me. A lot of resource isolation is just writing the right values to the cgroup filesystem. The harder part is knowing which files matter and what tradeoffs they imply.
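Since the setup really is just "write the right strings to the right files", it can be sketched as a small table of filenames to values plus a loop. The helper names `limitFiles` and `applyLimits` are mine, and the demo writes into a temp directory rather than the real `/sys/fs/cgroup`, which needs root:

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
)

// limitFiles maps cgroup v2 filenames to the values used above:
// 256 MB of memory, 64 processes, and "50000 100000" for cpu.max
// (50 ms of CPU time per 100 ms period, i.e. roughly half a core).
func limitFiles(memBytes, maxPids, quotaUS, periodUS int) map[string]string {
	return map[string]string{
		"memory.max": strconv.Itoa(memBytes),
		"pids.max":   strconv.Itoa(maxPids),
		"cpu.max":    fmt.Sprintf("%d %d", quotaUS, periodUS),
	}
}

// applyLimits writes each limit file under cgroupPath.
func applyLimits(cgroupPath string, limits map[string]string) error {
	for name, value := range limits {
		path := filepath.Join(cgroupPath, name)
		if err := os.WriteFile(path, []byte(value), 0o644); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	limits := limitFiles(256<<20, 64, 50000, 100000)
	fmt.Println(limits["memory.max"]) // 268435456

	// Stand-in for /sys/fs/cgroup/<name>; the real thing needs root.
	dir, err := os.MkdirTemp("", "cg")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir)
	if err := applyLimits(dir, limits); err != nil {
		panic(err)
	}
}
```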
What I learned
- A container is mostly a normal process started with the right Linux primitives. The container feeling comes from namespaces, filesystem setup, and cgroups working together.
- Re-executing the same binary is a clean way to separate launcher logic from container setup. Once I saw that pattern, a lot of container runtime code became easier to read.
- `chroot` alone is not a container. It changes the filesystem view, but without namespaces and cgroups you do not get the isolation people usually mean by containers.
- `/proc` has to be mounted inside the new root if you want tools like `ps` to reflect the container's PID namespace instead of the host's view.
- This kind of project is great for learning, but nowhere near production-safe. Real runtimes need user namespaces, better mount handling, signal forwarding, cleanup, capabilities, seccomp, networking, and a lot more defensive work.
Small project, but it made container internals feel a lot less mysterious.