Containerization in Practice

Part Two: How Can a Data Science Project Benefit from Docker?

Anastasia Lebedeva
Analytics Vidhya

--

photo from https://www.pexels.com/@cottonbro

This post shows how you can benefit from Docker at different stages of a data science project, from research through deployment to data and application monitoring in production. In particular, I discuss the advantages as well as the aspects to consider when applying Docker to

  • Examine an existing solution
  • Prepare a development environment
  • Create a test environment (including a test database)
  • Monitor your data and application
  • Deploy your project
  • Share your project

Examine an existing solution

A new data science project typically starts with surveying the field and testing existing solutions for their applicability to the problem at hand. With Docker, research outcomes are easily shared and applied to other problems and data.

There are plenty of ready-to-use Docker images for machine learning tools and algorithms. To verify this claim, I reviewed the documentation of the most popular open source ML projects of 2018, announced by Mybridge in this article. Indeed, 13 of the 32 projects (about 40%) have a Dockerfile or a Docker image available. Note that I excluded projects available as packages from both counts.

That is, algorithms such as Mask R-CNN, DeepVariant, and DensePose are just a few commands away. Applying those tools is as easy as (1) installing Docker Desktop on your machine, (2) building or pulling the Docker image, and (3) running a container. The exact commands and their arguments are provided by the tool authors. Further, you may integrate such a tool into your project using a client library for the Docker Engine API, which is available for many programming languages (e.g. Python, Java, or .NET).
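
For illustration, pulling and running such a published image might look like the following sketch (the image name and arguments here are placeholders; the actual ones come from the tool's documentation):

    # Pull the published image from a registry (image name is illustrative)
    docker pull research-group/mask-rcnn-demo:latest

    # Run the tool in a container, mounting a local folder with your input data
    docker run --rm \
      -v "$(pwd)/data:/data" \
      research-group/mask-rcnn-demo:latest \
      --input /data/images --output /data/predictions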

Prepare a development environment

Often, an environment for your project can be created and reproduced using a dependency management tool such as Miniconda. In this case, it is enough to keep careful track of your dependencies. Setting up a development environment with Docker might help when your project requires a combination of multiple runtimes (like Python and Java simultaneously) or when different stages of your pipeline depend on incompatible dependencies.

But even in this case, consider developing the components independently and in isolation using, for instance, virtual environments, so that you can keep working in your favorite IDE with all its benefits, and turn to Docker only for testing and deployment. In general, using Docker to set up a development environment slows things down, makes debugging (and development in general) more complicated, and requires additional management.

photo from https://www.pexels.com/@cottonbro

Create a test environment

Consider using Docker to create a test environment and/or a test database for your project.

Test database

Often, input data for your solution are stored in a database hosted elsewhere. With Docker, you can easily simulate a production database, fill it with test data, and run it as an independent server. This allows you to test the complete pipeline, including reading from and writing to the database. There are official images for PostgreSQL, MySQL, MongoDB, Redis, and many other databases publicly available on Docker Hub.
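
For example, a disposable PostgreSQL instance for testing can be started with a single command (the credentials, database name, and port mapping below are illustrative):

    # Start a throwaway PostgreSQL server for tests
    docker run --name test-db -d \
      -e POSTGRES_USER=test \
      -e POSTGRES_PASSWORD=test \
      -e POSTGRES_DB=testdb \
      -p 5432:5432 \
      postgres:13

    # Remove the container (together with its data) when you are done
    docker rm -f test-db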

Two aspects to consider:

  • You may fill the database as part of each test, as best practice suggests, or fill it once on creation, e.g. from a CSV file prepared earlier. Either way, remember that the database, and hence your data, will be gone when the database container is removed (read more about data persistence here).
  • Remember that the goal is to create an environment as similar to production as possible. Therefore, use the same tool you apply for database migration in production to create the test database schema.

Test environment

Running your tests in a Docker container has several benefits:

  • It makes your testing pipeline portable and easy to apply. For instance, you may create, modify, and run tests on your local machine and then, with only slightly modified commands, run them as part of a CI/CD pipeline.
  • It allows you to imitate an environment nearly identical to production.
  • Moreover, it allows you to test against multiple different environments (e.g. Ubuntu and Windows).

The test Docker image (or images) is then defined in much the same way as the development/production environment (i.e. using the same configs, the same requirements files, etc.). When executed, the container runs a script that launches the tests.

There are several aspects to consider:

  • It is more convenient to mount the source code of your project and of the test project as volumes in the run command, i.e. when launching the container. This way you don't have to rebuild the image after every change to the project or test sources (see the sketch after this list).
  • Make sure your pipeline distinguishes between test/dev/prod and addresses the appropriate data sources (e.g. the database container created earlier) and other environment-specific configs.
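
A minimal sketch of such a run command, assuming a test image named my-project-tests built from the same requirements as the main image, and an APP_ENV variable your code reads to pick its configuration:

    # Launch the tests, mounting the current sources instead of baking them into the image
    # (image name, paths, and the APP_ENV variable are illustrative; pytest must be installed in the image)
    docker run --rm \
      -v "$(pwd)/src:/app/src" \
      -v "$(pwd)/tests:/app/tests" \
      -e APP_ENV=test \
      my-project-tests \
      pytest /app/tests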

Monitor your data and application

Docker Hub contains official images for a variety of software that might be beneficial for your ML project. For instance, there are images for tools such as Grafana and Kibana, which can be used to inspect and visualize the state of your data and/or application.

With Docker, these applications can be launched in seconds on your local machine or elsewhere. Using such tools gives you a better overview of your data and logs: you can effortlessly create visualizations, list outliers, track execution times and exceptions, inspect the system or data state, and much more. This can be of enormous benefit to a data scientist, whose priority is to ensure the system inputs and outputs are well understood and under control.
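
For instance, a local Grafana instance can be started from the official image with one command and is then available at http://localhost:3000:

    # Run Grafana locally from the official image
    docker run -d --name grafana -p 3000:3000 grafana/grafana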

Deploy your project

Docker greatly simplifies the deployment of applications. Even if your development environment lives outside a container (which, as discussed earlier, is often the case), it is worth containerizing your application for deployment.

A container that works on the development machine will work exactly the same way in testing, staging, production, or any other environment. No further installation (except for the Docker engine) is required to launch the application on any server. Moreover, many cloud providers offer multiple options for easy deployment and management of containerized applications.
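
A minimal sketch, assuming your project already contains a Dockerfile (the image name and port below are placeholders):

    # Build the application image from the project's Dockerfile
    docker build -t myapp:1.0 .

    # The same image runs identically on any machine with a Docker engine
    docker run -d --name myapp -p 8080:8080 myapp:1.0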

Share your project

Just as you experimented with existing solutions at the beginning of the project, you can share your own discoveries with the community. After packaging your application in a container, you may push the Docker image to a public registry (such as Docker Hub), or provide the Dockerfile so that users with access to the source code can build the image themselves. This way, anyone with Docker installed will be able to try out and even integrate your tool within a few minutes.
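
Publishing an image to Docker Hub boils down to tagging it with your repository name and pushing it (the username and image name below are placeholders):

    # Log in, tag the local image with your Docker Hub repository, and push it
    docker login
    docker tag myapp:1.0 yourusername/myapp:1.0
    docker push yourusername/myapp:1.0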

photo from https://www.pexels.com/@cottonbro

To conclude, Docker is a very powerful tool that a data science project can benefit from in many cases and at various stages. When applied properly, it not only boosts developers' efficiency but also brings a more trustworthy way to approach data science.

In the next post, we will take a look at how to bring such a project to life by deploying it to AWS, selecting the most appropriate services the provider offers for containerized applications.
