Building smaller Python Docker images

Changes are afoot at Zeit Now, my preferred hosting provider for the past year (see previous posts). They have announced Now 2.0, an intriguing new approach to providing auto-scaling immutable deployments. It’s built on top of lambdas, and comes with a whole host of new constraints: code needs to fit into a 5MB bundle, for example (though it looks like this restriction will soon be relaxed a little; update November 19th: you can now bump this up to 50MB).

Unfortunately, they have also announced their intent to deprecate the existing Now v1 Docker-based solution.

“We will only start thinking about deprecation plans once we are able to accommodate the most common and critical use cases of v1 on v2”—Matheus Fernandes

“When we reach feature parity, we still intend to give customers plenty of time to upgrade (we are thinking at the very least 6 months from the time we announce it)”—Guillermo Rauch

This is pretty disastrous news for many of my projects, most crucially Datasette and Datasette Publish.

Datasette should be fine—it supports Heroku as an alternative to Zeit Now out of the box, and the publish_subcommand plugin hook makes it easy to add further providers (I’m exploring several new options at the moment).

Datasette Publish is a bigger problem. The whole point of that project is to make it easy for less-technical users to deploy their data as an interactive API to a Zeit Now account that they own themselves. Talking these users through what they need to do to upgrade should v1 be shut down in the future is not an exciting prospect.

So I’m going to start hunting for an alternative backend for Datasette Publish, but in the meantime I’ve had to make some changes to how it works in order to handle a new size limit of 100MB for Docker images deployed by free users.

Building smaller Docker images

Zeit appear to have introduced a new limit for free users of their Now v1 platform: Docker images need to be no larger than 100MB.

Datasette Publish was creating final image sizes of around 350MB, blowing way past that limit. I spent some time today figuring out how to get it to produce images within the new limit, and learned a lot about Docker image optimization in the process.

I ended up using Docker’s multi-stage build feature, which allows you to create temporary images during a build, use them to compile dependencies, then copy just the compiled assets into the final image.

An example of the Dockerfile previously generated by Datasette Publish can be seen here. Here’s a rough outline of what it does:

  • Start with the python:3.6-slim-stretch image
  • apt-installs python3-dev and gcc so it can compile Python libraries with binary dependencies (pandas and uvloop for example)
  • Use pip to install csvs-to-sqlite and datasette
  • Add the uploaded CSV files, then run csvs-to-sqlite to convert them into a SQLite database
  • Run datasette inspect to cache a JSON file with information about the different tables
  • Run datasette serve to serve the resulting web application

There’s a lot of scope for improvement here. The final image contains all sorts of cruft that isn’t actually needed to serve the application: csvs-to-sqlite and all of its dependencies, plus the original uploaded CSV files.

Here’s the workflow I used to build a Dockerfile and check the size of the resulting image. My work-in-progress can be found in the datasette-small repo.
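
Roughly speaking (using a datasette-small tag to match the repo; the exact invocations may have varied a little):

    # build the image from the Dockerfile in the current directory
    docker build . -t datasette-small
    # list the image to see its reported size
    docker images | grep datasette-small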

Alpine Linux

When you start looking for ways to build smaller Dockerfiles, the first thing you will encounter is Alpine Linux. Alpine is a Linux distribution that’s perfect for containers: it builds on top of BusyBox to strip down to the smallest possible image that can still do useful things.

The python:3.6-alpine container should be perfect: it gives you the smallest possible container that can run Python 3.6 applications (including the ability to pip install additional dependencies).

There’s just one problem: in order to install C-based dependencies like pandas (used by csvs-to-sqlite) and Sanic (used by Datasette) you need a compiler toolchain. Alpine doesn’t have this out-of-the-box, but you can install one using Alpine’s apk package manager. Of course, now you’re bloating your container with a bunch of compilation tools that you don’t need to serve the final image.

This is what makes multi-stage builds so useful! We can spin up an Alpine image with the compilers installed, build our modules, then copy the resulting binary blobs into a fresh container.

Here’s the basic recipe for doing that:
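
A minimal sketch of that pattern (the package and paths here are illustrative):

    # builder stage: Alpine plus a compiler toolchain, used only to build the wheels
    FROM python:3.6-alpine as builder
    RUN apk add --no-cache gcc musl-dev python3-dev
    RUN pip install datasette

    # final stage: a fresh Alpine image with no build tools installed
    FROM python:3.6-alpine
    # copy across just the installed packages and the datasette entry point
    COPY --from=builder /usr/local/lib/python3.6 /usr/local/lib/python3.6
    COPY --from=builder /usr/local/bin/datasette /usr/local/bin/datasette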

This pattern works really well, and produces delightfully slim images. My first attempt at this wasn’t quite slim enough to fit the 100MB limit though, so I had to break out some Docker tools to figure out exactly what was going on.

Inspecting docker image layers

Part of the magic of Docker is the concept of layers. When Docker builds a container it uses a layered filesystem (UnionFS) and creates a new layer for every instruction in the Dockerfile. This dramatically speeds up future builds (since layers can be reused if they have already been built) and also provides a powerful tool for inspecting different stages of the build.

When you run docker build part of the output is IDs of the different image layers as they are constructed:
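
Output along these lines (the step numbers and IDs here are illustrative):

    Step 5/12 : RUN pip install datasette
     ---> Running in 2a7c6de31b7a
     ...
     ---> 971a5d5dad01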

Given a layer ID, like 971a5d5dad01, it’s possible to spin up a new container that exposes the exact state of that layer (thanks, Stack Overflow). Here’s how to do that:
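
    # spin up a throwaway container from layer 971a5d5dad01 and open a shell in it
    docker run -it --rm 971a5d5dad01 sh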

The -it argument attaches standard input to the container (-i) and allocates a pseudo-TTY (-t). The --rm option means that the container will be removed when you Ctrl+D back out of it. sh is the command we want to run in the container—using a shell lets us start interacting with it.

Now that we have a shell against that layer, we can use regular unix commands to start exploring it. du -m (m for MB) is particularly useful here, as it will show us the largest directories in the filesystem. I pipe it through sort like so:
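
    # run from the directory you want to inspect; the largest directories end up at the bottom
    du -m | sort -n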

Straight away we can start seeing where the space is being taken up in our image.

Deleting unnecessary files

It turned out the uvloop folder still contained a bunch of files that were used as part of the compilation, including a 6.7MB loop.c file and a bunch of .pyx and .pxd files used by Cython. None of these files are needed after the extension has been compiled, but they were there, taking up a bunch of precious space.

So I added the following to my Dockerfile:
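
Something like this, assuming the packages are installed under /usr/local/lib/python3.6:

    # delete Cython source and declaration files left over from compilation
    RUN find /usr/local/lib/python3.6 -name '*.c' -delete \
        && find /usr/local/lib/python3.6 -name '*.pyx' -delete \
        && find /usr/local/lib/python3.6 -name '*.pxd' -delete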

Then I noticed that there were __pycache__ files that weren’t needed either, so I added this as well:
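
Again assuming packages under /usr/local/lib/python3.6:

    # remove every __pycache__ directory from the installed packages
    RUN find /usr/local/lib/python3.6 -name '__pycache__' | xargs rm -r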

(The -delete flag didn’t work correctly for that one, so I used xargs instead.)

This shaved off around 15MB, putting me safely under the limit.

Running csvs-to-sqlite in its own stage

csvs-to-sqlite has some pretty heavy dependencies of its own in the form of Pandas and NumPy. Even with the build chain installed I was having trouble installing these under Alpine, especially since building numpy for Alpine is notoriously slow.

Then I realized that thanks to multi-stage builds there’s no need for me to use Alpine at all for this step. I switched back to python:3.6-slim-stretch and used it to install csvs-to-sqlite and compile the CSV files into a SQLite database. I also ran datasette inspect there for good measure.

Then in my final Alpine container I could use the following to copy in just those compiled assets:
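
Along these lines (the file paths are illustrative):

    # pull the built database and inspection metadata out of the csvbuilder stage
    COPY --from=csvbuilder /data.db /data.db
    COPY --from=csvbuilder /inspect-data.json /inspect-data.json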

Tying it all together

Here’s an example of a full Dockerfile generated by Datasette Publish that combines all of these tricks (a sketch of it follows the list below). To summarize, here’s what it does:

  • Spin up a python:3.6-slim-stretch—call it csvbuilder
    • apt-get install -y python3-dev gcc so we can install compiled dependencies
    • pip install csvs-to-sqlite datasette
    • Copy in the uploaded CSV files
    • Run csvs-to-sqlite to convert them into a SQLite database
    • Run datasette inspect data.db to generate an inspect-data.json file with statistics about the tables. This can later be used to reduce startup time for datasette serve.
  • Spin up a python:3.6-alpine—call it buildit
    • We need a build chain to compile a copy of datasette for Alpine Linux…
    • apk add --no-cache gcc python3-dev musl-dev alpine-sdk
    • Now we can pip install datasette, plus any requested plugins
    • Reduce the final image size by deleting any __pycache__ or *.c, *.pyx and *.pxd files.
  • Spin up a fresh python:3.6-alpine for our final image
    • Copy in data.db and inspect-data.json from csvbuilder
    • Copy across /usr/local/lib/python3.6 and /usr/local/bin/datasette from buildit
    • … and we’re done! Expose port 8006 and set datasette serve to run when the container is started
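
A sketch along those lines, assuming a single uploaded file called example.csv (a hypothetical name) and absolute paths for the generated files:

    # stage 1: convert the uploaded CSVs into a SQLite database
    FROM python:3.6-slim-stretch as csvbuilder
    RUN apt-get update && apt-get install -y python3-dev gcc
    RUN pip install csvs-to-sqlite datasette
    COPY example.csv /csvs/example.csv
    RUN csvs-to-sqlite /csvs/example.csv /data.db
    RUN datasette inspect /data.db --inspect-file=/inspect-data.json

    # stage 2: compile datasette and its dependencies for Alpine Linux
    FROM python:3.6-alpine as buildit
    RUN apk add --no-cache gcc python3-dev musl-dev alpine-sdk
    RUN pip install datasette
    # strip compilation leftovers to keep the copied packages small
    RUN find /usr/local/lib/python3.6 -name '*.c' -delete \
        && find /usr/local/lib/python3.6 -name '*.pyx' -delete \
        && find /usr/local/lib/python3.6 -name '*.pxd' -delete \
        && find /usr/local/lib/python3.6 -name '__pycache__' | xargs rm -r

    # stage 3: the final image, containing only what is needed to serve the data
    FROM python:3.6-alpine
    COPY --from=csvbuilder /data.db /data.db
    COPY --from=csvbuilder /inspect-data.json /inspect-data.json
    COPY --from=buildit /usr/local/lib/python3.6 /usr/local/lib/python3.6
    COPY --from=buildit /usr/local/bin/datasette /usr/local/bin/datasette
    EXPOSE 8006
    CMD ["datasette", "serve", "/data.db", "--host", "0.0.0.0", "--port", "8006", "--inspect-file", "/inspect-data.json"]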

Now that I’ve finally learned how to take advantage of multi-stage builds I expect I’ll be using them for all sorts of interesting things in the future.

Posted 19th November 2018 at 3:13 am