There are plenty of good reasons to run Docker containers read-only, and plenty of articles recommending it, but what I have a difficult time finding are descriptions of how to do this for many popular images. Some software needs to write to a few important and predictable locations. It surprises me how often image providers neglect to offer instructions or details required to run their image this way.
Even setting aside read-only containers, counting on writing to the writable layer just feels wrong. Per the Docker documentation, both read and write speeds in the writable layer are lower because of the copy-on-write/overlay process in the storage driver. And in my experience, any output from docker diff means I haven’t taken the time to configure my volume declarations, whether through tmpfs mounts, volumes, or bind mounts.
For many images, simply creating tmpfs mounts for /run and /tmp is enough. For others, a more careful analysis is required. In any case, I wanted to share my experience of turning a popular image into a read-only container.
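As a minimal sketch of that simple case (the image name below is only a placeholder), the Compose file format supports a short-form tmpfs option on a service alongside read_only:

version: '2'
services:
  app:
    # Placeholder image; substitute something that only needs to write to /run and /tmp
    image: my.registry.host/simple-image:tag
    read_only: true
    # Short-form tmpfs mounts for the paths the process still needs to write
    tmpfs:
      - /run
      - /tmp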
First, let’s start with an image I use a lot: nginx. Even the engineers from nginx at DockerCon this year had no idea what would be required to run their Docker images read-only. I decided that very evening to figure this out, and it wasn’t really that hard.
A great and quick way to see which paths an image writes to is to run docker diff against a running container. With all logging (at least to log files) disabled and no proxy_cache configured, I still saw nginx writing to /var/run, /var/cache/nginx, and /tmp, and most of these writes were tiny files (PID files, temporary files, and buffers). Ideally these writes could be consolidated or simply disabled. I couldn’t find a way to do that here, but I was able to get a good sense of how much space was used. These values may vary, so please be sure to actually test your application if you choose to use my values.
In any case, I relied on tiny tmpfs volumes for these, but using any type of Docker volume should do. Here is an example from a docker-compose.yml:
version: '2'
volumes:
  var-run:
    #per_container: true
    driver_opts:
      type: tmpfs
      device: tmpfs
      o: "size=1m,noexec,nosuid,nodev"
  var-cache-nginx:
    #per_container: true
    driver_opts:
      type: tmpfs
      device: tmpfs
      o: "size=16m,noexec,nosuid,nodev"
  tmp:
    #per_container: true
    driver_opts:
      type: tmpfs
      device: tmpfs
      o: "size=8m,noexec,nosuid,nodev"

services:
  app:
    # 256M memory limit
    mem_limit: 268435456
    image: my.registry.host/image/name:tag
    volumes:
      - var-run:/var/run
      - var-cache-nginx:/var/cache/nginx
      - tmp:/tmp
    read_only: true
Notice I have a few comments in there to explain the settings. I also have per_container: true in there, but commented out; it is a Rancher-specific volume setting, but it is worth noting. Since tmpfs resides in memory on a given host, if your orchestration tool is managing multiple hosts, you may have high-availability or capacity issues. This is because a service will share a volume across all containers for that service, and for host-specific volumes like tmpfs, that means all containers for this service will run on the same host.
Given the memory limit on the service (256MB) and the maximum sizes set for the tmpfs volumes (25MB combined), the containers for this service can each use a maximum of 281MB of memory. On top of that, they are read-only except for these temporary filesystems, which are severely restricted (no executables, not a lot of space, and no devices, so bind-mounting them should be difficult) and which reset when the container restarts.
Clearly this isn’t the end of security tuning for a public-facing web server, but it isn’t a bad start. Also, don’t worry, Kubernetes users: you can do the same thing in a Pod spec, as sketched below.
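For reference, here is a rough Pod spec sketch along the same lines, assuming the same image, paths, and sizes as the Compose example above; readOnlyRootFilesystem stands in for read_only, and in-memory emptyDir volumes with a sizeLimit stand in for the tmpfs volumes (mount flags like noexec and nosuid are not expressed here):

apiVersion: v1
kind: Pod
metadata:
  name: nginx-readonly   # hypothetical name
spec:
  containers:
    - name: app
      image: my.registry.host/image/name:tag
      securityContext:
        # Equivalent of read_only: true in the compose file
        readOnlyRootFilesystem: true
      resources:
        limits:
          memory: 256Mi
      volumeMounts:
        - name: var-run
          mountPath: /var/run
        - name: var-cache-nginx
          mountPath: /var/cache/nginx
        - name: tmp
          mountPath: /tmp
  volumes:
    # In-memory emptyDir volumes play the role of the tmpfs volumes
    - name: var-run
      emptyDir:
        medium: Memory
        sizeLimit: 1Mi
    - name: var-cache-nginx
      emptyDir:
        medium: Memory
        sizeLimit: 16Mi
    - name: tmp
      emptyDir:
        medium: Memory
        sizeLimit: 8Mi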