Introduction to using DVC to manage machine learning project datasets

By: David Herron; Date: July 7, 2019

Tags: Machine Learning

DVC is a powerful set of tools for managing data files associated with data science or machine learning projects. The code for such a project is committed to a Git repository, and DVC manages the data files in parallel to that repository.

DVC is written in Python, so your computer must have Python installed. It runs on Windows, Linux or Mac OS X. We will go through setting up a simple prebaked DVC project to explore its capabilities. But first we must install the software.

Installing DVC on Linux, Mac OS X or Windows

The basic installation instruction is:

$ pip install dvc

If you prefer to use a system package manager, one of the following methods can be used instead.

It can be installed from a deb repository:

$ sudo wget https://dvc.org/deb/dvc.list -O /etc/apt/sources.list.d/dvc.list
$ sudo apt-get update
$ sudo apt-get install dvc

From an rpm repository:

$ sudo wget https://dvc.org/rpm/dvc.repo -O /etc/yum.repos.d/dvc.repo
$ sudo yum update
$ sudo yum install dvc

Or with Homebrew (Mac OS X), using one of these mechanisms:

$ brew install iterative/homebrew-dvc/dvc
...OR...
$ brew cask install iterative/homebrew-dvc/dvc

Whichever method you used, it's helpful to verify the installation worked. These two commands will be useful:

$ dvc help
.... prints help output including the available commands
$ dvc --version
.... prints the version you installed

Setting up the example project

Now that we have DVC installed let's set up a sample project.

The DVC team has created a Getting Started example that we can install. It is a text classifier example. The advantage here is that it is quick to set up and doesn't take up much space or execution time, while still being real machine learning code.

This is also an example of how easy DVC makes it to share workspaces with colleagues.

$ git clone https://github.com/iterative/example-get-started
$ cd example-get-started
$ virtualenv -p python3 .env   # optional
$ source .env/bin/activate     # optional
$ pip install -r requirements.txt
$ dvc pull

You should of course already have Python, Git, virtualenv and pip installed on your computer.

This downloads source code from a Git repository. In the source tree is a script we can use to install dependencies like Pandas or scikit-learn. The last step, dvc pull, downloads a prebaked cache directory from a remote cache. As we said, this demonstrates how easy DVC makes it to share both the code and the dataset with colleagues, or to otherwise deploy a project into any environment you like.

The source tree looks like so:

$ tree
.
├── README.md
├── auc.metric
├── data
│   ├── data.xml
│   ├── data.xml.dvc
│   ├── features
│   │   ├── test.pkl
│   │   └── train.pkl
│   └── prepared
│       ├── test.tsv
│       └── train.tsv
├── evaluate.dvc
├── featurize.dvc
├── model.pkl
├── prepare.dvc
├── requirements.txt
├── src
│   ├── evaluate.py
│   ├── featurization.py
│   ├── prepare.py
│   └── train.py
└── train.dvc

4 directories, 18 files

DVC maintains a hidden directory, .dvc, containing a configuration file (.dvc/config), the aforementioned cache, and some other data. One of the settings in this example workspace is the URL for a prebaked DVC remote cache. DVC supports both local and remote caches, the latter being used for sharing data with colleagues.

There is a pair of commands, dvc push and dvc pull, that are analogous to the git push and git pull commands. With dvc push we send data to a remote cache. With dvc pull we download data from a remote cache. In this case, we end up with this directory structure:

$ tree .dvc
.dvc
├── cache
│   ├── 38
│   │   └── 63d0e317dee0a55c4e59d2ec0eef33
│   ├── 42
│   │   ├── c7025fc0edeb174069280d17add2d4.dir
│   │   └── c7025fc0edeb174069280d17add2d4.dir.unpacked
│   │       ├── test.pkl
│   │       └── train.pkl
│   ├── 58
│   │   └── 245acfdc65b519c44e37f7cce12931
│   ├── 68
│   │   ├── 36f797f3924fb46fcfd6b9f6aa6416.dir
│   │   └── 36f797f3924fb46fcfd6b9f6aa6416.dir.unpacked
│   │       ├── test.tsv
│   │       └── train.tsv
│   ├── 9d
│   │   └── 603888ec04a6e75a560df8678317fb
│   ├── a3
│   │   └── 04afb96060aad90176268345e10355
│   ├── aa
│   │   └── 35101ce881d04b41d5b4ff3593b423
│   └── dc
│       └── a9c512fda11293cfee7617b66648dc
├── config
├── lock
├── state
├── updater
└── updater.lock
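The layout of the cache in this listing follows a visible pattern: each entry is named by a 32-character hexadecimal checksum, with the first two characters used as a subdirectory name and the rest as the file name. The following is a minimal sketch of that mapping, assuming the checksums are MD5 digests of the file contents. The function names md5_hex and cache_path are hypothetical and are not part of DVC's API.

```python
import hashlib
from pathlib import PurePosixPath


def md5_hex(data: bytes) -> str:
    """Return the MD5 checksum of some bytes as 32 hex characters."""
    return hashlib.md5(data).hexdigest()


def cache_path(checksum: str, cache_dir: str = ".dvc/cache") -> str:
    """Map a checksum to a location in the cache directory.

    Assumption: the first two hex characters name the subdirectory and
    the remaining characters name the file, as in the listing above.
    """
    return str(PurePosixPath(cache_dir) / checksum[:2] / checksum[2:])


# One of the checksums from the listing above.
print(cache_path("3863d0e317dee0a55c4e59d2ec0eef33"))
# prints .dvc/cache/38/63d0e317dee0a55c4e59d2ec0eef33

# The checksum of arbitrary content maps to a path the same way.
print(cache_path(md5_hex(b"example file contents")))
```

Because the path depends only on the content, two identical files would share one cache entry under this scheme.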

Now that we have learned a little about DVC, and have a working DVC project directory, let’s explore how DVC efficiently deals with the local cache directory.

Associating data files with Git commits and experiments

The workspace we checked out has a number of tags describing each commit.

$ git tag
0-empty
1-initialize
2-remote
3-add-file
4-sources
5-preparation
6-featurization
7-train
8-evaluation
9-bigrams
baseline-experiment
bigrams-experiment

The first several tags concern setting up the project, and the last couple of tags are different experiments with the machine learning model.

We can easily revisit the state of the workspace at any of these tags. The first step is obvious:

$ git checkout 9-bigrams

If you inspect the workspace afterward, the source files (such as src/featurization.py) will have changed, but the data files will not. DVC provides a command to update which data files are checked out:

$ dvc checkout

The files with names ending in .dvc list the data files that belong to the workspace at a given commit. When we check out a different Git commit, a different version of each .dvc file is checked out along with the source code. The dvc checkout command goes through the .dvc files and retrieves the matching version of each data file from the DVC cache.
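To make the mechanism concrete, here is a simplified model of that lookup, not DVC's actual implementation. It assumes each .dvc file has already been parsed into a list of entries, each holding a checksum (md5) and a workspace path, and it copies files out of a throwaway cache, whereas real DVC can link files instead of copying them. The checkout and demo functions and the checksums are made up for illustration.

```python
import shutil
import tempfile
from pathlib import Path


def checkout(outputs, cache_dir: Path, workspace: Path) -> None:
    """Copy each listed output from the cache into the workspace.

    `outputs` is a hypothetical stand-in for the parsed entries of a
    .dvc file: a list of {"md5": ..., "path": ...} dictionaries.
    """
    for out in outputs:
        checksum = out["md5"]
        cached = cache_dir / checksum[:2] / checksum[2:]
        target = workspace / out["path"]
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copyfile(cached, target)  # overwrites any older version


def demo() -> str:
    """Fill a temporary cache with two versions of a file, then switch."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        cache = root / ".dvc" / "cache"
        # Two made-up checksums standing in for two versions of data.xml.
        old_sum, new_sum = "aa" + "0" * 30, "bb" + "1" * 30
        for checksum, text in ((old_sum, "version one"), (new_sum, "version two")):
            entry = cache / checksum[:2] / checksum[2:]
            entry.parent.mkdir(parents=True)
            entry.write_text(text)

        # A different Git commit means a different checksum is listed.
        checkout([{"md5": old_sum, "path": "data/data.xml"}], cache, root)
        first = (root / "data" / "data.xml").read_text()
        checkout([{"md5": new_sum, "path": "data/data.xml"}], cache, root)
        second = (root / "data" / "data.xml").read_text()
        return first + " -> " + second


print(demo())  # prints: version one -> version two
```

The workspace file changes only because the listed checksum changed. Both versions stay in the cache, which is why switching back and forth is cheap.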

One way to see the difference between tags is with this command:

$ git diff baseline-experiment bigrams-experiment

In the diff output you’ll see that src/featurization.py was changed, but so too were all the .dvc files. As we use git checkout to navigate through the tags in the project, the .dvc files refer to different instances of each file in the DVC cache. The dvc checkout command then synchronizes the checked-out data files.
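The same idea can be sketched in code: if the outputs recorded at two commits are reduced to path-to-checksum mappings, the data files that dvc checkout needs to replace are those whose checksum differs. The changed_outputs function is hypothetical, and the checksums below are invented and do not come from the example repository.

```python
def changed_outputs(old: dict, new: dict) -> list:
    """List paths whose recorded checksum differs between two commits."""
    return sorted(p for p in set(old) | set(new) if old.get(p) != new.get(p))


# Made-up checksums, for illustration only.
commit_a = {
    "data/data.xml": "a" * 32,
    "data/features": "b" * 32 + ".dir",
    "model.pkl": "c" * 32,
}
commit_b = {
    "data/data.xml": "a" * 32,
    "data/features": "d" * 32 + ".dir",
    "model.pkl": "e" * 32,
}

print(changed_outputs(commit_a, commit_b))
# prints ['data/features', 'model.pkl']
```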

As a convenience, this command installs Git hooks that automatically run dvc checkout every time git checkout is run:

$ dvc install
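Without the hooks, the two-step switch can also be scripted. The sketch below is one hypothetical way to do that from Python, using only the git checkout and dvc checkout commands already shown. The switch_commands and switch_to helpers are invented for this example, and the dry_run flag only prints the commands, so it can be tried without touching a workspace.

```python
import subprocess


def switch_commands(rev: str) -> list:
    """Return the two commands needed to move code and data to `rev`."""
    return [["git", "checkout", rev], ["dvc", "checkout"]]


def switch_to(rev: str, dry_run: bool = False) -> list:
    """Run (or, with dry_run, just print) the commands in order."""
    commands = switch_commands(rev)
    for cmd in commands:
        if dry_run:
            print("$ " + " ".join(cmd))
        else:
            # check=True stops at the first failing command.
            subprocess.run(cmd, check=True)
    return commands


switch_to("9-bigrams", dry_run=True)
# prints:
# $ git checkout 9-bigrams
# $ dvc checkout
```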

The first time we used git checkout, Git printed a message about being in a “detached HEAD” state. It is easy to return to the normal state with this command:

$ git checkout master

While in a detached HEAD state we can create a branch to contain any work we want to do. This Stack Overflow answer (stackoverflow.com) has some good advice about that.

The advice there also applies to a DVC project. When dvc checkout is run, DVC does not know that the workspace is in a Git branch. It just knows that the .dvc files list the files that are to be checked out. That means the git checkout and dvc checkout combination can be used for any Git tag, commit, or branch.