Managing versioned machine learning datasets in DVC, and easily sharing ML projects with colleagues

; Date: Tue Jul 23 2019

Tags: Machine Learning

DVC is a powerful set of tools for managing data files associated with data science or machine learning projects. It works hand-in-hand with a Git repository to track both the code and the datasets in an ML project. A core feature is dataset versioning: DVC correlates each Git commit with the exact dataset that existed at that commit. And by using a DVC "remote cache" it is very easy to share a project with colleagues, or to copy the dataset to a remote machine.

In this tutorial we will work through a simple image classifier. We will learn how DVC operates in a machine learning project, how it optimizes reproducing results when the project changes, and how to share the project with colleagues.

Thanks to DVC's feature set, all of those things are easily accomplished.

DVC is written in Python, so your computer must have Python installed. It runs on Windows, Linux, or Mac OS X. We will go through setting up a simple prebaked DVC project to explore its capabilities. But first we must install the software.

For installation instructions see the first part of Introduction to using DVC to manage machine learning project datasets -- then come back here for the rest of this tutorial.
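In short, DVC is distributed through the Python Package Index, so on most systems this should be enough (see that article for platform-specific alternatives):

$ pip install dvc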

Setting up the example project

This tutorial is expanded from an existing Get Started tutorial on the DVC website. Let’s quickly go through its setup.

$ git clone https://github.com/iterative/example-versioning.git
$ cd example-versioning/
$ pip install -r requirements.txt
$ wget https://dvc.org/s3/examples/versioning/data.zip
$ unzip data.zip
$ rm -f data.zip

What we've done so far is to clone a Git repository, then download the first half of the dataset. The data.zip archive contains prelabeled pictures of cats and dogs, and the code in the Git repository contains an image classifier algorithm that will be trained to distinguish cats from dogs.

Our directory contains this:

$ ls -a
.   .dvc    .gitignore  data             train.py
..  .git    README.md   requirements.txt

Besides the train.py training script and the data directory, there is a .git directory containing the Git repository, and a .dvc directory. The latter is used by DVC for its housekeeping, which we'll learn about over the course of this tutorial.

For now let's do this:

$ dvc add data

This adds the data directory to the DVC cache. The DVC cache is within the .dvc directory, and contains a mirror of the files in the workspace using MD5 checksums to track the files. If a file changes over the course of a project, the MD5 checksum will differ, and DVC will end up storing two instances of the file in its cache, each indexed by the corresponding MD5 checksum.
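To make the cache layout concrete, here is a minimal sketch using a hypothetical empty file (empty.csv is not part of this project; the MD5 of empty input is a well-known constant):

$ touch empty.csv
$ md5sum empty.csv
d41d8cd98f00b204e9800998ecf8427e  empty.csv
$ dvc add empty.csv
# DVC would store the cached copy at
#   .dvc/cache/d4/1d8cd98f00b204e9800998ecf8427e
# The first two hex digits of the checksum become a subdirectory,
# and the remaining thirty become the file name.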

The "add data" command told us this:

Saving 'data' to '.dvc/cache/b8/f4d5a78e55e88906d5f4aeaf43802e.dir'.

From the file name, that's obviously in the cache directory, and the file is named by the MD5 checksum of the data directory's contents. The .dir suffix indicates that this cache entry describes a directory rather than a single file.

We were also told to do this:

git add .dvc/.gitignore .gitignore data.dvc

We have two systems operating in this directory. Git stores the source code files, along with the files ending in .dvc. Those files, the DVC files, record the MD5 checksums that DVC uses to correlate the files in the cache with the files that are supposed to be in the workspace.
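For instance, data.dvc should now look roughly like this (a sketch based on the Dvcfile we'll examine later; the exact fields can vary between DVC versions):

$ cat data.dvc
md5: ...
outs:
- path: data
  md5: b8f4d5a78e55e88906d5f4aeaf43802e.dir
  cache: true
  metric: false
  persist: false

Note that the checksum under outs matches the cache entry we saw a moment ago.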

The other system is DVC, whose data is in the .dvc directory. It works in concert with Git, and it manages the files that are impractical to manage using Git. Git does not deal well with large binary files, and that's where DVC comes in. So that Git never tries to track those files itself, DVC adds an entry to the appropriate .gitignore file for each file it is tracking.
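For example, after the dvc add above, the top-level .gitignore should contain an entry for the data directory, alongside whatever entries the project already had:

$ cat .gitignore
...
data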

During the course of a project you'll naturally change some of the data files. DVC keeps the master copy of each instance of each file in the DVC cache, using the MD5 checksum as its file name. It then links the correct instance of each file into the workspace under the file name we expect it to have.

The next thing to do is to run the script to train the model. We run it like so:

$ dvc run -f Dvcfile \
          -d train.py -d data \
          -M metrics.json \
          -o model.h5 -o bottleneck_features_train.npy \
          -o bottleneck_features_validation.npy \
          python train.py

This command informs DVC of a processing step we want to use in this project. It says to use Python to run train.py, with input files named using the -d options and output files named using the -o options. The -M option names an output file containing a metric; as we'll see in a moment, metric files are not stored in the DVC cache.

We could have run the command by hand, of course. But by running it this way we've told DVC everything about this command, and now DVC can run it for us when needed. For example, if train.py is changed, or the contents of the data directory are changed, DVC knows to rerun train.py because its dependencies have changed. A change is detected when a file's MD5 checksum differs from the one DVC has recorded.

DVC supports creating a multi-step pipeline if you use dvc run multiple times, with each stage declaring an earlier stage's output as a dependency. In this case we have just one script to run, but many real ML projects involve a chain of steps: download a file, unpack it, process the data into a usable form, then run training and validation scripts. Such a pipeline might look like the sketch below.
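Every file and script name here is hypothetical; the point is the shape of the pipeline, where each stage's -d dependencies match an earlier stage's -o outputs:

$ dvc run -f download.dvc -o raw.zip \
          wget -O raw.zip https://example.com/raw.zip
$ dvc run -f unpack.dvc -d raw.zip -o raw \
          unzip raw.zip -d raw
$ dvc run -f prepare.dvc -d raw -d prepare.py -o prepared \
          python prepare.py
$ dvc run -f train.dvc -d prepared -d train.py -o model.h5 \
          python train.py

With the stages linked this way, dvc repro can walk the chain and rerun only the stages whose inputs actually changed.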

The dvc run command goes ahead and executes the named program, and when it's done we're told this:

Adding 'model.h5' to '.gitignore'.
Adding 'bottleneck_features_train.npy' to '.gitignore'.
Adding 'bottleneck_features_validation.npy' to '.gitignore'.
Output 'metrics.json' doesn't use cache. Skipping saving.
Saving 'model.h5' to '.dvc/cache/9a/8fabce95b8a4e94011fdc911030339'.
Saving 'bottleneck_features_train.npy' to '.dvc/cache/da/9e20b12aa5b2dc0abb02e1a1b4e4cf'.
Saving 'bottleneck_features_validation.npy' to '.dvc/cache/e5/48cc847339c990a7dbe0759d87c7c4'.
Saving information to 'Dvcfile'.

To track the changes with git, run:

    git add .gitignore .gitignore .gitignore Dvcfile

Again we have files that have been added to the DVC cache, with corresponding updates to the .gitignore files (DVC printed .gitignore once per output, but it's all the same file). The Dvcfile has the same format and purpose as the .dvc files: it records the command, the dependencies of the command, and the outputs from the command.

The first phase of the project is finished, so we commit everything and record a tag:

$ git add .
$ git commit -m "model first version, 1000 images"
$ git tag -a "v1.0" -m "model v1.0, 1000 images"

Adding more images to the dataset - second phase of the project

Suppose you’re not satisfied with the results, and you realize you need a larger set of training and validation images. After locating those images you can add them to the workspace like so:

$ wget https://dvc.org/s3/examples/versioning/new-labels.zip
$ unzip new-labels.zip
$ rm -f new-labels.zip
$ dvc add data
[##############################] 100% Created unpacked dir

Computing md5 for a large number of files. This is only done once.
[##############################] 100% 

WARNING: Output 'data' of 'data.dvc' changed because it is 'modified'

Saving 'data' to '.dvc/cache/21/060888834f7220846d1c6f6c04e649.dir'.

Saving information to 'data.dvc'.

To track the changes with git, run:

    git add data.dvc

The data.dvc file was updated because we added files to the data directory.

If you refer back to the dvc run command, the data directory is listed as a dependency. Since we've added new images to the dataset, we must retrain the model, and DVC makes this easy.

$ dvc status
[##############################] 100% Created unpacked dir

Dvcfile:

    changed deps:

        modified:           data

First, DVC notices that the dependencies of the command/stage recorded in Dvcfile have changed.

The dvc status command tells us that files have been added. Because the dependencies changed, we need to rerun the training script to generate a new model. And because we already told DVC how to run the training script, we can simply use dvc repro (short for reproduce) to re-execute it.

$ dvc repro
WARNING: assuming default target 'Dvcfile'.
Stage 'data.dvc' didn't change.
WARNING: Dependency 'data' of 'Dvcfile' changed because it is 'modified'.
WARNING: Stage 'Dvcfile' changed.
Reproducing 'Dvcfile'
Running command:
    python train.py
Using TensorFlow backend.
Found 2000 images belonging to 2 classes.
Found 800 images belonging to 2 classes.
Train on 2000 samples, validate on 800 samples
Epoch 1/10
2000/2000 [==============================] - 6s 3ms/step - loss: 0.6299 - acc: 0.7725 - val_loss: 0.2764 - val_acc: 0.8900
Epoch 2/10
2000/2000 [==============================] - 6s 3ms/step - loss: 0.3847 - acc: 0.8495 - val_loss: 0.3442 - val_acc: 0.8625
Epoch 3/10
2000/2000 [==============================] - 6s 3ms/step - loss: 0.3257 - acc: 0.8750 - val_loss: 0.2785 - val_acc: 0.8987
Epoch 4/10
2000/2000 [==============================] - 6s 3ms/step - loss: 0.2600 - acc: 0.9045 - val_loss: 0.3147 - val_acc: 0.8975
Epoch 5/10
2000/2000 [==============================] - 6s 3ms/step - loss: 0.2631 - acc: 0.9095 - val_loss: 0.3194 - val_acc: 0.8912
Epoch 6/10
2000/2000 [==============================] - 6s 3ms/step - loss: 0.2232 - acc: 0.9150 - val_loss: 0.3132 - val_acc: 0.8925
Epoch 7/10
2000/2000 [==============================] - 7s 3ms/step - loss: 0.1915 - acc: 0.9300 - val_loss: 0.4042 - val_acc: 0.8950
Epoch 8/10
2000/2000 [==============================] - 6s 3ms/step - loss: 0.1772 - acc: 0.9345 - val_loss: 0.3480 - val_acc: 0.9012
Epoch 9/10
2000/2000 [==============================] - 6s 3ms/step - loss: 0.1603 - acc: 0.9435 - val_loss: 0.3801 - val_acc: 0.9037
Epoch 10/10
2000/2000 [==============================] - 6s 3ms/step - loss: 0.1329 - acc: 0.9500 - val_loss: 0.5123 - val_acc: 0.8862
Output 'bottleneck_features_validation.npy' didn't change. Skipping saving.
Output 'metrics.json' doesn't use cache. Skipping saving.
Saving 'model.h5' to '.dvc/cache/da/c1587c45f6ee04d926039a80554e86'.
Saving 'bottleneck_features_train.npy' to '.dvc/cache/39/ce773781a79d63090ad47ce643f9e5'.
WARNING: data 'bottleneck_features_validation.npy' exists. Removing before checkout.
Saving information to 'Dvcfile'.

To track the changes with git, run:

    git add Dvcfile

$ git add Dvcfile 
$ git status -s
M Dvcfile
M data.dvc
M metrics.json

The model (model.h5) has been retrained and saved to the DVC cache. Notice that its file name in the cache changed, because its content changed and therefore its MD5 checksum changed.

There were corresponding changes to the files tracked in the Git repository as well. Those must be saved:

$ git commit -m "model second version, 2000 images"
[master 21042e1] model second version, 2000 images
 2 files changed, 11 insertions(+), 11 deletions(-)
$ git tag -a "v2.0" -m "model v2.0, 2000 images"
$ git tag
v1.0
v2.0

We now have two tags in the Git repository, because version 2 of the project is finished.

Demonstrate switching between revisions

What we have now is a little sample machine learning project that has gone through two revisions. What if we decide to take a look at an earlier revision of the project? For example, we might want to tweak something we tried earlier.

DVC makes it easy to change the dataset to match the Git repository. It does this using the DVC files that were created and updated as we went through the steps. Each committed version of each DVC file records the MD5 checksums of the related files.

$ cat Dvcfile 
deps:
- path: train.py
  md5: 78d98f3865c2fcfe1dbe95b738960d0a
- path: data
  md5: 21060888834f7220846d1c6f6c04e649.dir
cmd: python train.py
md5: b2b93f6a4f9728e7378348ae41edc67f
outs:
- cache: true
  metric: false
  path: model.h5
  md5: dac1587c45f6ee04d926039a80554e86
  persist: false
- cache: true
  metric: false
  path: bottleneck_features_train.npy
  md5: 39ce773781a79d63090ad47ce643f9e5
  persist: false
- cache: true
  metric: false
  path: bottleneck_features_validation.npy
  md5: e548cc847339c990a7dbe0759d87c7c4
  persist: false
- cache: false
  metric: true
  path: metrics.json
  md5: 8af9a62abc596ab8b8faa6f3a751096d
  persist: false
wdir: .

You'll see the cmd field records the training command, the deps section lists the files named as dependencies, and the outs section names the output files.

What happens if we use Git to checkout an earlier tag?

$ git checkout v1.0
M	metrics.json
Note: checking out 'v1.0'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by performing another checkout.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -b with the checkout command again. Example:

  git checkout -b <new-branch-name>

HEAD is now at b96e2b2 model first version, 1000 images
$ dvc checkout
...
[##############################] 100% Checkout finished!

Git tells us that we're in a detached HEAD state; we'll have to remember that for later. The result is that the old versions of data.dvc and Dvcfile were checked out, and those files contain different MD5 checksums for everything. Therefore we ran dvc checkout so DVC would restore the matching data files from the DVC cache.

If we look again at Dvcfile we see different checksums.

$ cat Dvcfile 
deps:
- path: train.py
  md5: 78d98f3865c2fcfe1dbe95b738960d0a
- path: data
  md5: b8f4d5a78e55e88906d5f4aeaf43802e.dir
cmd: python train.py
md5: 5fcd53385d82ba5e9162e883efb0e094
outs:
- cache: true
  metric: false
  path: model.h5
  md5: 9a8fabce95b8a4e94011fdc911030339
  persist: false
- cache: true
  metric: false
  path: bottleneck_features_train.npy
  md5: da9e20b12aa5b2dc0abb02e1a1b4e4cf
  persist: false
- cache: true
  metric: false
  path: bottleneck_features_validation.npy
  md5: e548cc847339c990a7dbe0759d87c7c4
  persist: false
- cache: false
  metric: true
  path: metrics.json
  md5: 22a409f25e89741b2748cea1ad1b9c30
  persist: false
wdir: .

The dvc checkout command takes a look at these MD5 checksums, and then rearranges the dataset to match the set of files specified by the contents of the DVC files.
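Incidentally, dvc checkout also accepts individual DVC files as targets, in case you want to restore just one set of outputs rather than the whole workspace (a sketch):

$ dvc checkout data.dvc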

$ git checkout v2.0
...
HEAD is now at 21042e1 model second version, 2000 images
$ dvc checkout
...
[##############################] 100% Checkout finished!

$ du -sk data
68276	data

$ git checkout v1.0
HEAD is now at b96e2b2 model first version, 1000 images
$ dvc checkout
...
[##############################] 100% Checkout finished!

$ du -sk data
43804	data

With the workspace checked out to the v2.0 tag, the data directory has 68 MB of data, and with the v1.0 tag it has 43 MB of data. If you remember, with the v1.0 tag we had one set of cat and dog images, and for the v2.0 tag we added some more.

The last thing to do is to simplify checking out a given tag.

$ git checkout master
$ dvc checkout
$ dvc install

Remember that, having checked out these tags, the workspace was in a "detached HEAD" state. The cure is to run "git checkout master", after which we were again required to follow up with "dvc checkout".

The last command, "dvc install", installs Git hooks so that "dvc checkout" is run automatically after every "git checkout".
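With the hook in place, switching versions should now be a single step. A quick sanity check, reusing the v2.0 size we measured above (assuming the checkout hook fires as expected):

$ git checkout v2.0
$ du -sk data
68276	data

No manual dvc checkout was needed this time.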

Demonstrate sharing a machine learning project with colleagues, or deploying it to a remote system

The last thing we want to go over is the method for copying a DVC workspace from one location to another, and for copying revisions from one DVC workspace to another. We might want to share the workspace with a colleague for collaboration, or deploy it to a cloud server for heavy-duty computation.

Remember that there are two parts to this workspace: the Git repository, and the DVC files, including the DVC cache.

The process of moving this from one place to another is fairly easy, but involves a little more than simply copying the directory structure.

The result of the following process is that:

  1. We have a master Git repository, and master DVC cache
  2. The workspace we've used so far on the laptop is configured with both these repositories
  3. A new workspace is setup on a remote machine also configured with both these repositories
  4. Either workspace can be used to push data or code into both these repositories
  5. Either workspace can be used to pull data or code from both these repositories

The first thing we must do is to upload the dataset to a remote DVC cache.

$ dvc remote add default ssh://david@nuc2.local:/home/david/dvc-remote/example-versioning
$ dvc config core.remote default
$ git status -s
M .dvc/config
$ git add .dvc/config 
$ git commit -a -m 'define remote'
Pipeline is up to date. Nothing to reproduce.
[master 7d7ecda] define remote
 1 file changed, 2 insertions(+)

Here we set up a DVC remote so we can push the dataset to a remote server. DVC supports several remote storage protocols for this, such as Amazon S3. An SSH remote is particularly easy to set up: simply create a directory on the remote system and then specify the SSH URL as shown here.
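In other words, the only remote-side preparation is creating the directory named in the URL:

nuc$ mkdir -p /home/david/dvc-remote/example-versioning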

In my case the "remote" machine is an Intel NUC sitting on my desk next to my laptop which serves various purposes.

A remote DVC cache is similar to the local DVC cache we've been using, except that we push data to it and pull data from it. That makes it easy to share the dataset with others.

$ dvc push
Preparing to upload data to 'ssh://david@nuc2.local/home/david/dvc-remote/example-versioning'
Preparing to collect status from ssh://david@nuc2.local/home/david/dvc-remote/example-versioning
Collecting information from local cache...
[##############################] 100% 
...

The dvc push command sends data to the remote DVC cache.

The other thing we did was to commit the .dvc/config file to the Git repository. The config file was modified because dvc remote add added an entry to that file.

That means the local Git repository is ready to be cloned somewhere else, and it will already know about the remote DVC cache.
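For reference, the relevant section of .dvc/config should now look roughly like this (DVC's config file is INI-style; the exact layout can vary between versions):

$ cat .dvc/config
['remote "default"']
url = ssh://david@nuc2.local:/home/david/dvc-remote/example-versioning
[core]
remote = default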

First, though, we need to take care of a little issue:

$ git remote  -v
origin	https://github.com/iterative/example-versioning.git (fetch)
origin	https://github.com/iterative/example-versioning.git (push)

This workspace started by cloning the existing example repository on GitHub. We don't have rights to push to that repository, so we need our own Git repository that we can push to. Therefore we remove the old remote:

$ git remote rm origin

The next question is how to properly clone the Git repository on my laptop to a server, such as this NUC. What I chose to do is set up a Git repository on a GitHub-like server (it's actually Gogs, but the same commands will work for a GitHub repository).

$ git remote add origin ssh://git@GIT-SERVER-DOMAIN:10022/david/example-versioning.git
$ git push -u origin master

This results in:

  1. Git repository is on a Git server that can be accessed via an SSH URL
  2. The DVC cache is on a server that is accessed via an SSH URL

Then on the NUC I ran these commands:

nuc$ mkdir ~/dvc
nuc$ cd ~/dvc
nuc$ git clone ssh://git@GIT-SERVER-DOMAIN:10022/david/example-versioning.git
Cloning into 'example-versioning'...
remote: Counting objects: 48, done.
remote: Compressing objects: 100% (44/44), done.
remote: Total 48 (delta 12), reused 0 (delta 0)
Receiving objects: 100% (48/48), 7.55 KiB | 7.55 MiB/s, done.
Resolving deltas: 100% (12/12), done.

This clones the Git repository, which carries along the DVC configuration file we already set up. That configuration includes the remote DVC cache, and therefore we can pull data from the remote cache like so:

nuc$ dvc remote list
default	ssh://david@nuc2.local:/home/david/dvc-remote/example-versioning
nuc$ dvc pull
Preparing to download data from 'ssh://david@nuc2.local/home/david/dvc-remote/example-versioning'
Preparing to collect status from ssh://david@nuc2.local/home/david/dvc-remote/example-versioning
Collecting information from local cache...
[##############################] 100% 
...

With that we have both the Git repository and its associated DVC dataset duplicated on the NUC.

nuc$ dvc status
Pipeline is up to date. Nothing to reproduce.
nuc$ dvc repro
WARNING: assuming default target 'Dvcfile'.
Stage 'data.dvc' didn't change.
Stage 'Dvcfile' didn't change.
Pipeline is up to date. Nothing to reproduce.

And voilà, it's all ready to be used.

From here, either workspace can make a change and publish it:

$ git push
$ dvc push

And then in the other workspace, to receive this change:

$ git pull
$ dvc pull
$ dvc status
$ dvc repro

About the Author(s)

(davidherron.com) David Herron : David Herron is a writer and software engineer focusing on the wise use of technology. He is especially interested in clean energy technologies like solar power, wind power, and electric cars. David worked for nearly 30 years in Silicon Valley on software ranging from electronic mail systems, to video streaming, to the Java programming language, and has published several books on Node.js programming and electric vehicles.