Cookiecutter Docker Science

Data scientists perform many machine learning and data mining tasks. For these tasks, researchers rely on various tools and system libraries, which are constantly updated. Unfortunately, installing and updating them often causes problems in local environments. Even when we work in hosted environments such as EC2, we are not free from this problem: an experiment may succeed on one instance but fail on another, because the library versions on each EC2 instance can differ.

By contrast, with Docker we can create identical environments in which the needed tools are installed with the correct versions, without changing the system libraries on the host machine.

This aspect of Docker is important for the reproducibility of experiments and for keeping projects running in continuous integration systems.

However, working in Docker containers is troublesome. Adding a new library to requirements.txt or the Dockerfile does not install it the way it would on a local machine: we need to rebuild the Docker image and recreate the container each time. We also need to add port forwarding settings to reach servers running in the container, such as the Jupyter Notebook UI, from our local environment. Cookiecutter Docker Science provides utilities that make working in a Docker container simple.

This project is a tiny template for machine learning projects developed in Docker environments. In machine learning tasks, projects grow in their own ways to fit their target tasks, but in the initial state most of the directory structure and Makefile targets are common. Cookiecutter Docker Science generates an initial directory layout that fits simple machine learning tasks.

Cycle of a project

Generally, machine learning projects consist of three phases (experiments, code simplification, and deployment). The following image shows the cycle of a machine learning project.

In the beginning, we run experiments interactively in Jupyter Notebook. Then we simplify the code written in the notebooks: in this step we add tests, refactor the code, apply a linter, package the code as a library, and set up CI. After we finish the code simplification, we deploy the model for production use. In this phase, we add batch scripts or a service that uses the library created in the previous step. After deployment, we continue improving the model in Jupyter Notebook again.

Work in Docker container

Ideally, we should work in a Docker container built from the same Dockerfile at every step of a machine learning project, since this lets us move on to code simplification and deployment seamlessly. The following image illustrates this.

Cookiecutter Docker Science supports machine learning projects from experiments to deployment with Docker containers.

Usage

This section describes the usage of Cookiecutter Docker Science.

Requirements

Cookiecutter Docker Science needs the following.

Generate new project

To generate a project with the Cookiecutter Docker Science template, please run the following command.

$ cookiecutter git@github.com:docker-science/cookiecutter-docker-science.git

Cookiecutter then asks several questions about the generated project, as follows.

$ cookiecutter git@github.com:docker-science/cookiecutter-docker-science.git
project_name [project_name]: food-image-classification
project_slug [food_image_classification]:
jupyter_host_port [8888]:
description [Please Input a short description]: Classify food images into several categories
data_source [Please Input data source in S3]: s3://research-data/food-images
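In the session above, the default offered for project_slug looks like a normalized form of project_name. The following is a minimal sketch of that idea, assuming the rule is "lowercase, with spaces and hyphens replaced by underscores"; the template's actual rule may differ. The helper names (default_slug, build_context, generate) are hypothetical, and generate assumes the cookiecutter Python package is installed.

```python
def default_slug(project_name: str) -> str:
    """Assumed rule: lowercase, with spaces and hyphens turned into underscores."""
    return project_name.strip().lower().replace(" ", "_").replace("-", "_")


def build_context(project_name: str, description: str, data_source: str,
                  jupyter_host_port: int = 8888) -> dict:
    """Collect the answers from the prompt session into one dictionary."""
    return {
        "project_name": project_name,
        "project_slug": default_slug(project_name),
        "jupyter_host_port": str(jupyter_host_port),
        "description": description,
        "data_source": data_source,
    }


def generate(template: str, context: dict):
    """Generate a project without interactive prompts (needs cookiecutter)."""
    # Imported lazily so the helpers above work without the package installed.
    from cookiecutter.main import cookiecutter
    return cookiecutter(template, no_input=True, extra_context=context)


context = build_context(
    project_name="food-image-classification",
    description="Classify food images into several categories",
    data_source="s3://research-data/food-images",
)
print(context["project_slug"])
```

Passing this context to generate together with the template URL shown above would skip the prompts, which can be convenient in scripted setups.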

The following movie shows the generation of a project with Cookiecutter Docker Science.

Create Docker Container

The following movie shows an example of creating a Docker image and container and then launching Jupyter Notebook in the container.

Directory structure

When we generate a project with Cookiecutter Docker Science, the project has the following files and directories.

├── Makefile                          <- Makefile contains many targets, such as creating the Docker container or
│                                        getting input files.
├── config                            <- This directory contains configuration files used in scripts
│   │                                    or Jupyter Notebook.
│   └── jupyter_config.py
├── data                              <- data directory contains the input resources.
├── docker                            <- docker directory contains the Dockerfile.
│   └── Dockerfile                    <- Dockerfile holds the container settings. Users modify the Dockerfile
│                                        if additional libraries are needed for experiments.
├── model                             <- model directory stores the model files created in the experiments.
├── my_data_science_project           <- Cookiecutter Docker Science creates a directory with the same name
│   │                                    as the project. In this directory users put the Python files used in scripts
│   │                                    or Jupyter Notebook.
│   └── __init__.py
├── notebook                          <- This directory stores the ipynb files saved in Jupyter Notebook.
├── requirements.txt                  <- Libraries needed to run experiments. The libraries listed in this file
│                                        are installed in the Docker container.
├── scripts                           <- Users add the script files that generate model files or run evaluation.
└── tests                             <- tests directory stores test code and fixture files.
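As a quick sanity check of the layout above, a small script can report which of the listed entries are missing from a generated project. This is only a sketch: the function names are hypothetical, and the list of paths is copied from the tree above rather than taken from the template itself.

```python
from pathlib import Path


def expected_paths(project_slug: str) -> list:
    """Entries listed in the directory tree, relative to the project root."""
    return [
        "Makefile",
        "config/jupyter_config.py",
        "data",
        "docker/Dockerfile",
        "model",
        f"{project_slug}/__init__.py",  # library directory named after the project
        "notebook",
        "requirements.txt",
        "scripts",
        "tests",
    ]


def missing_paths(project_root: str, project_slug: str) -> list:
    """Return the expected entries that do not exist under project_root."""
    root = Path(project_root)
    return [p for p in expected_paths(project_slug) if not (root / p).exists()]
```

For example, missing_paths(".", "my_data_science_project") run from the root of a freshly generated project should return an empty list if the layout matches the tree.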

Consistent support

As described in the previous section, machine learning projects consist of three phases (experiments, code simplification, and deployment). Cookiecutter Docker Science supports machine learning projects in all three phases.

Experiments

Projects start with exploratory experiments. In this phase, data scientists analyze the given data interactively in Jupyter Notebook. Cookiecutter Docker Science launches the Jupyter Notebook server in the container with the make jupyter command. Note that the port forwarding that connects the container port to the host PC is set up by Cookiecutter Docker Science.
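To make the port forwarding concrete, here is a rough sketch of the kind of docker run invocation that publishes a Jupyter port to the host. It is not the command that make jupyter actually runs: the image name, the /work mount point, and the function name are assumptions made for illustration. Only standard docker run options (-p, -v, -w) and standard jupyter notebook options are used.

```python
import shlex


def jupyter_run_command(image, project_dir, host_port=8888, container_port=8888,
                        workdir="/work"):
    """Build an argv list that runs Jupyter Notebook in a container."""
    return [
        "docker", "run", "--rm", "-it",
        # Publish the container port on the host so the notebook UI is reachable.
        "-p", f"{host_port}:{container_port}",
        # Mount the project directory into the container (assumed mount point).
        "-v", f"{project_dir}:{workdir}",
        "-w", workdir,
        image,
        # Listen on all interfaces inside the container, without opening a browser.
        "jupyter", "notebook", "--ip=0.0.0.0", f"--port={container_port}",
        "--no-browser", "--allow-root",
    ]


# Hypothetical image name and project path, for illustration only.
command = jupyter_run_command("food-image-classification", "/path/to/project")
print(shlex.join(command))
```

The host side of the -p pair corresponds to the jupyter_host_port answer given at generation time, so two projects can run side by side by choosing different host ports.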

Users create notebooks in the notebook directory and store the results of their analysis there.

Code simplification

In the code simplification phase, users extract the components used in the experiments into the library directory (commonly the directory named after the project) generated by Cookiecutter Docker Science.

Then software engineers refactor the code and add tests. The test files are stored in the tests directory. The steps that generate the model files are added to the Makefile as targets. With these targets in place, we can generate the model with just the make command.
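As an illustration of this step, the sketch below shows a function extracted from a notebook together with a unit test for it. Both are hypothetical: scale_pixels stands in for whatever component a project extracts into its library directory, and the test class is the kind of file that would live in the tests directory. In a real project the function and the test would be in separate files.

```python
import unittest


def scale_pixels(pixels):
    """Hypothetical library function: scale 8-bit pixel values to [0, 1]."""
    return [p / 255 for p in pixels]


class ScalePixelsTest(unittest.TestCase):
    """Hypothetical test case for the function above."""

    def test_bounds(self):
        # The smallest and largest 8-bit values map to 0.0 and 1.0.
        self.assertEqual(scale_pixels([0, 255]), [0.0, 1.0])

    def test_empty_input(self):
        self.assertEqual(scale_pixels([]), [])
```

Tests written this way can be run with python -m unittest, which makes them easy to call from a Makefile target or a CI job.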

Deployment

After the code simplification phase, we deploy the model as a service or batch script. We can reuse the project as is, since the directory structure is the same and we can generate the model with the make command.

When we use the project as a library, Cookiecutter Docker Science provides setup.py to upload the library to an (internal) PyPI repository.

License

Apache License, Version 2.0

Links