Imagine buying a new VPS, setting up a new VM, or finally installing Linux on that old laptop.
Booting into that tty or desktop environment for the first time feels great, right?
It’s almost as if you can feel how clean and pristine the system currently is.
There has been no opportunity for crud to accumulate yet.
`ls -al ~ | wc -l` might even print 3!
But then you install that first package. You tweak that configuration file in /etc.
Step by step your system does more of what you want, and step by step the entropy of your system increases and it becomes harder to reason about its state.
Leave the system alone for a few months and you’ll probably forget half the changes you made.
For a lot of use cases this is perfectly fine.
You install Docker, tweak `sshd_config`, and spin up some containers.
The context of such a setup is small enough that you can keep it in your head for years on end.
But what if that context is much larger, or changes often? This is a write-up of how Nix made my most chaotic project manageable.
## The Context
One of the personal projects I’ve poured my time into is a project called STRIKER. It allows users to generate highlights from their Counter-Strike 2 matchmaking games through a Discord bot. One of the central components of STRIKER is the “recorder” which listens to a queue of recording jobs, generates the recording by automating the Counter-Strike 2 client itself[^1], and uploads the final clip to a local cache.
*Example output from the recorder from one of my clipping requests*
If some of that made no sense to you, the important part is that we need to run and automate the Counter-Strike 2 game client. In fact, we want to run the game client inside a container because the STRIKER stack runs on a Kubernetes cluster, and we want to be able to horizontally scale the recorders if we need to.
So, how hard can it possibly be to containerize the Counter-Strike 2 game client? Let’s run through two primary pain points.
### Hardware Acceleration
If we want to run hardware accelerated workloads (the game), then we need GPU support in our containers. The vendor I use for this project, NVIDIA, has the NVIDIA Container Toolkit which enables this use case. Setting this up for local development or getting hardware acceleration in Kubernetes containers is relatively straightforward.
However, just giving the container access to a GPU isn’t enough; the game client wants to open windows and listen to input as well! For this we need a display server like X11 and a desktop environment. Launching a display server and desktop environment inside of a container has proven to be a Sisyphean task (for me), so I’ve concluded that running it on the host and passing it through[^2] to the recorder container works best.
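To make the passthrough concrete, here’s a rough sketch of running a recorder-style container locally with the host GPU and X11 socket exposed. The image name and display number are illustrative placeholders; the privileged/IPC/shm flags correspond to what the X11 passthrough ended up requiring in my setup:

```shell
# Sketch: GPU via the NVIDIA Container Toolkit, X11 passed through from the host.
# "recorder:latest" and DISPLAY=:0 are illustrative placeholders.
docker run --rm \
  --gpus all \
  --privileged \
  --ipc=host \
  --shm-size=2g \
  -e DISPLAY=:0 \
  -v /tmp/.X11-unix:/tmp/.X11-unix \
  recorder:latest
```

This is a deployment command fragment, not a complete recipe; depending on the host you may also need to allow the container through X11 authorization.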
### Steam
Counter-Strike 2 is heavily integrated with the Steam ecosystem, to the point where the game will not launch without a Steamworks socket, which the Steam client exposes. There are theoretically ways to run Steam games without running Steam itself, but that introduces another brittle component that can break with updates, so simply running the Steam client inside the container is the easiest way to go about it.
Unfortunately, this leads to another issue: the Steam client is still a 32-bit application, and the NVIDIA Container Toolkit does not make 32-bit driver libraries available in the container. So we’ll need to get these into the container some other way.
## Okay, but weren’t we talking about Nix?
Yes! This post is about a Nix success story, but to fully appreciate how much heavy-lifting Nix can do in a situation like this, let’s have a quick look at how I solved these problems before integrating Nix into the project.
Before migrating to NixOS, my distro of choice was Debian, so my Kubernetes nodes ran Debian 12. To enable NVIDIA GPU hardware acceleration on a Kubernetes cluster, you would install NVIDIA drivers on the node and install the NVIDIA GPU Operator onto the cluster, which would do node feature discovery and install the NVIDIA Container Toolkit on applicable nodes.
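For reference, installing the GPU Operator itself came down to a couple of Helm commands; the repo and chart names below follow NVIDIA’s documentation, and the namespace is just a common convention:

```shell
# Add NVIDIA's Helm repository and install the GPU Operator chart.
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --create-namespace
```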
To expose an X11 socket to the container we have to run a display server on the host, and the easiest way to go about this is to choose a lightweight DE like XFCE upon installing Debian[^3].
We can install Steam in our container, along with a painstakingly curated list of required libraries, as such:

```dockerfile
FROM debian:12.10

# get the steam repository keyring
RUN wget -O - https://repo.steampowered.com/steam/archive/stable/steam.gpg > /usr/share/keyrings/steam.gpg
RUN tee /etc/apt/sources.list.d/steam-stable.list <<'EOF'
deb [arch=amd64,i386 signed-by=/usr/share/keyrings/steam.gpg] https://repo.steampowered.com/steam/ stable steam
deb-src [arch=amd64,i386 signed-by=/usr/share/keyrings/steam.gpg] https://repo.steampowered.com/steam/ stable steam
EOF

# add 32-bit packages and update our repositories
RUN dpkg --add-architecture i386
RUN apt update

# add all packages required by steam
RUN apt install -y --no-install-recommends --no-install-suggests \
    libdrm2:i386 libgl1:i386 libegl1 libxext6 libvulkan1 \
    libc6 libc6:i386 libc6-i386 libegl1 libegl1:i386 libgbm1 libgbm1:i386 \
    libgl1-mesa-dri libgl1-mesa-dri:i386 libgl1 libgl1:i386 \
    steam-libs-amd64 steam-libs-i386:i386 \
    libgl1-mesa-dri:amd64 libgl1-mesa-dri:i386 \
    libgl1-mesa-glx:amd64 libgl1-mesa-glx:i386 \
    dbus dbus-x11 zenity python3-apt xdg-user-dirs \
    file libnss3 pkexec policykit-1 pulseaudio \
    steam-launcher # <- actually installing steam here
```

You might read that and think “that doesn’t sound too bad!” - but stop and actually consider the magnitude of sources for all involved components, and the differing ways to manage them:
| Component | Source | Managed through |
|---|---|---|
| Operating system | Debian installer/repository | Disk image / apt |
| Kubernetes | K3S install script | Script |
| NVIDIA drivers | Driver install script | Script |
| NVIDIA Container Toolkit | k8s operator | Helm |
| Steam (in container) | Steam repository | apt |
| 32-bit libraries (in container) | Debian repository | apt |
| OCI image | Docker | Dockerfile |
Let’s think about some of the things that could go wrong:
- The `debian:12.10` tag could change and the image build breaks in CI
- A Steam update could require another 32-bit library that I need to hunt down
- Debian’s repository could update and break any of the packages I rely on
- Valve could stop packaging .deb’s or providing a repository
- I have to consciously note and handle GPU driver versions, Toolkit versions, etc.
And what if I want to reinstall the OS? I’ll have to set up the nodes from scratch again - and we’re covering only a minority of the setup required in this post!
Seriously, if you’d offered me $20 to delete my VM snapshots and run `apt update && apt upgrade` on my recorder nodes, I probably would have declined.
That’s how flaky the entire setup used to be.
## Emergent Confidence from Determinism
Let’s have a look at how to solve these problems from a Nix-centric view.
But what even is Nix? If you haven’t had hands-on experience with Nix this is likely to be confusing, as people can mean different or multiple things when they just say “Nix”:
- The Nix language, a functional and domain-specific language for building (as in, compiling and packaging) software
- nixpkgs, a repository of Nix code that defines derivations (“blueprints” for building specific software) among other things
- NixOS, a Linux distribution that is declaratively configured with Nix
Here is an example of a derivation (that builds lazygit) in the nixpkgs repository. Here’s an example of a simple configuration.nix used to configure a NixOS system that sets the hostname, timezone, keyboard layout, creates a user, etc. Both of these examples are in the Nix language.
So let’s try to solve the same issues using Nix, nixpkgs, and Kubernetes nodes running NixOS.
### Solving Hardware Acceleration with Nix
We can install the NVIDIA drivers on our nodes with these NixOS `configuration.nix` options:
```nix
hardware.graphics.enable = true;
hardware.nvidia = {
  modesetting.enable = true;
  open = false; # nodes have Pascal cards, no open kernel modules
  package = config.boot.kernelPackages.nvidiaPackages.stable;
};
```

And then install the NVIDIA Container Toolkit (which also needs a k8s device plugin):
```nix
hardware.nvidia-container-toolkit.enable = true;
hardware.nvidia-container-toolkit.mount-nvidia-executables = true;
```

… and install our display manager and desktop environment (GNOME this time around):
```nix
services.displayManager.gdm.enable = true;
services.desktopManager.gnome.enable = true;
```

### Solving Steam with Nix
Nix can do more than just build system closures; it can also build OCI images!
Now, nixpkgs gives us derivations that package software, and Nix can build OCI images. Can I just say “give me an image with Steam and 32-bit GPU drivers in it?” - yes!
```nix
striker-krecorder-base = pkgs.dockerTools.buildImage {
  name = "striker-krecorder-base";
  tag = "latest";
  copyToRoot = with pkgs; [
    # get 32-bit mesa libraries
    pkgsi686Linux.mesa
    # get 32-bit nvidia libraries
    (linuxKernel.packages.linux_zen.nvidia_x11_production.override {
      libsOnly = true;
      acceptLicense = true;
    }).lib32
    # install steam
    steam
  ];
  runAsRoot = ''
    # copy 32-bit drivers into /run/opengl-driver-32
    mkdir -p /run/opengl-driver-32
    cp -rs ${pkgs.linuxKernel.packages.linux_zen.nvidia_x11_production.lib32}/* \
      /run/opengl-driver-32
    cp -rs ${pkgs.pkgsi686Linux.mesa}/* /run/opengl-driver-32
  '';
};
```

Wow. So that messy Dockerfile earlier, the manual setup of the desktop environment, NVIDIA drivers & container toolkit, and the entire NVIDIA GPU operator was all replaced by… 40 lines of Nix code and a Kubernetes device plugin?
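As a usage sketch: `dockerTools.buildImage` produces an image tarball, so assuming the derivation above is exposed as a flake output named `striker-krecorder-base` (the attribute path is my assumption), building it and loading it into a local Docker daemon looks roughly like this:

```shell
# Build the image derivation; ./result becomes a symlink to the image
# tarball, which `docker load` accepts directly.
nix build .#striker-krecorder-base
docker load < result
```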
Let’s recreate the matrix I made earlier, but for Nix:
| Component | Source | Managed through |
|---|---|---|
| Operating system | nixpkgs (initially installer) | Nix |
| Kubernetes | nixpkgs | Nix |
| NVIDIA drivers | nixpkgs | Nix |
| NVIDIA Container Toolkit | nixpkgs | Nix |
| Steam (in container) | nixpkgs | Nix |
| 32-bit libraries (in container) | nixpkgs | Nix |
| OCI image | nixpkgs | Nix |
In fact, there’s even one dimension this table doesn’t reflect. While all these components are instantiated through nixpkgs, they are all instantiated through the exact same version of nixpkgs, pinned to the same git commit. This is enabled through the use of flakes, which, among other things, act as a lock file for your inputs (in this case, nixpkgs).
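As a minimal sketch of what that pinning looks like (the branch name here is illustrative, not my actual input):

```nix
{
  inputs.nixpkgs.url = "github:NixOS/nixpkgs/nixos-unstable";

  outputs = { self, nixpkgs }: {
    # NixOS configurations, packages, and OCI images can all be built
    # from this single input; flake.lock records the exact git commit
    # that "nixos-unstable" resolved to at lock time.
  };
}
```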
So, to return to the question in the intro: what if the context is large or changes often? It seems to me you have two options: keep the system and the knowledge about the system (documentation) separate and pray they don’t drift so drastically that the latter becomes useless, or commit to Infrastructure as Code principles and adopt tools like Nix, which act as both.
The result of incorporating Nix in my project is complete confidence in making changes and experimenting, as the entire stack that supports the recorder component, from operating system to container image, is declarative, version controlled, and based on deterministic build artifacts. Using deployment tools like Colmena I can even set up new recorder nodes with (effectively) a single command.
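With a Colmena hive describing the nodes, rolling out an updated configuration is roughly the following (the node name is hypothetical):

```shell
# Build the new system closures and activate them on the matching nodes.
# "recorder-1" is a hypothetical node name from the Colmena hive.
colmena apply --on recorder-1
```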
I think that’s pretty exciting :)
## Additional reading
- *Building Dockerfiles with Nix* by Mitchell Hashimoto
## Footnotes

[^1]: Which is an endless rabbit-hole in and of itself. I’ll probably write about it some day.

[^2]: Passing through the X11 socket also involves making the container privileged, sharing host IPC, and increasing the shared memory (shmem) size.

[^3]: Other tweaks include auto-login, disabling authorization with `xhost +`, and some kind of remote desktop for debugging.