Date: Tue Jul 23 2019
Tags: Machine Learning
DVC is a powerful set of tools for managing data files associated with data science or machine learning projects. It works hand-in-hand with a Git repository to track both the code and the datasets in an ML project. A core feature is dataset versioning: DVC records which version of each data file belongs to each Git commit, so the dataset in the workspace can be made to match exactly what existed at that commit. By using a DVC "remote cache" it is easy to share a project with colleagues, or to copy the dataset to a remote machine.
In this tutorial we will go over a simple image classifier. We will learn how DVC works in a machine learning project, how it optimizes reproducing results when the project is changed, and how to share the project with colleagues.
Thanks to DVC's feature set, all of those things are easily accomplished.
DVC is written in Python, so your computer must have Python installed. It runs on Windows, Linux or Mac OS X. We will go through setting up a simple prebaked DVC project to explore its capabilities. But first we must install the software.
For installation instructions see the first part of Introduction to using DVC to manage machine learning project datasets -- then come back here for the rest of this tutorial.
Setting up the example project
This tutorial is expanded from an existing Get Started tutorial on the DVC website. Let’s quickly go through its setup.
git clone https://github.com/iterative/example-versioning.git
cd example-versioning/
pip install -r requirements.txt
wget https://dvc.org/s3/examples/versioning/data.zip
unzip data.zip
rm -f data.zip
What we've done so far is to clone a Git repository, install its Python requirements, then download the first half of the dataset. The data.zip archive contains prelabeled pictures of cats and dogs, and the code in the Git repository contains an image classifier algorithm that will be trained to distinguish cats from dogs.
Our directory contains this:
$ ls -a
.   .dvc  .gitignore  data              train.py
..  .git  README.md   requirements.txt
In addition to the train.py training script and the data directory, there is a .git directory containing the Git repository, and a .dvc directory. The .dvc directory is used by DVC for its housekeeping, which we'll learn about during the course of this tutorial.
For now let's do this:
$ dvc add data
This adds the data directory to the DVC cache. The DVC cache is within the .dvc directory, and holds a copy of each tracked file in the workspace, indexed by the file's MD5 checksum. If a file changes over the course of a project, the MD5 checksum will differ, and DVC will end up storing two instances of the file in its cache, each indexed by the corresponding MD5 checksum.
The "add data" command told us this:
Saving 'data' to '.dvc/cache/b8/f4d5a78e55e88906d5f4aeaf43802e.dir'.
The path shows that this file is in the cache directory. You'll see that an MD5 checksum makes up the rest of the path: its first two characters name a subdirectory, and the remaining characters are the file name.
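The naming scheme visible in that path can be sketched in a few lines of Python. This is only an illustration of the layout shown above, not DVC's own code: the function names md5_of_file and cache_path_for are made up for this example, and the sketch handles single files only (how DVC derives the ".dir" checksum for a whole directory is not covered here).

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def cache_path_for(checksum, cache_dir=".dvc/cache"):
    """Map a checksum to a cache path: the first two characters name the subdirectory."""
    return Path(cache_dir) / checksum[:2] / checksum[2:]


# The checksum that appears in the path DVC printed for the data directory.
print(cache_path_for("b8f4d5a78e55e88906d5f4aeaf43802e.dir").as_posix())
# prints .dvc/cache/b8/f4d5a78e55e88906d5f4aeaf43802e.dir
```

Splitting on the first two characters keeps any single cache subdirectory from growing too large.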
We were also told to do this:
git add .dvc/.gitignore .gitignore data.dvc
We have two systems operating in this directory. Git is used to store the source code files and the files ending with .dvc. Those files, the DVC files, contain the MD5 checksums that DVC uses to correlate the files in the cache with files that are supposed to be in the workspace.
The other system is DVC, whose data is in the .dvc directory. It works in concert with Git, and it manages the files which are impractical to manage using Git. Git does not deal well with large binary files, and that's where DVC comes in. DVC adds an entry to the appropriate .gitignore file for each file that DVC is tracking.
During the course of a project you'll naturally change some of the data files. DVC keeps the master copy of each instance of each file in the DVC cache, using the MD5 checksum as its file name. It then links the correct instance of each file into the workspace using the file name we think it has.
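As a rough mental model of that last step, the sketch below moves a file into a checksum-named cache slot and then links it back under its original name. It is an assumption-laden illustration rather than DVC's implementation: store_and_link and the file name cat.1.jpg are hypothetical, and the sketch uses a hard link with a plain copy as fallback, whereas the link type DVC really uses depends on its configuration and on the filesystem.

```python
import hashlib
import os
import shutil
import tempfile
from pathlib import Path


def store_and_link(workspace_file, cache_dir):
    """Move a file into a checksum-named cache slot, then link it back into place."""
    workspace_file = Path(workspace_file)
    checksum = hashlib.md5(workspace_file.read_bytes()).hexdigest()
    cached = Path(cache_dir) / checksum[:2] / checksum[2:]
    cached.parent.mkdir(parents=True, exist_ok=True)
    if not cached.exists():
        shutil.move(str(workspace_file), str(cached))
    else:
        workspace_file.unlink()  # identical content is already cached
    try:
        os.link(cached, workspace_file)  # hard link: no second copy on disk
    except OSError:
        shutil.copy2(cached, workspace_file)  # fall back to a plain copy
    return checksum


with tempfile.TemporaryDirectory() as tmp:
    picture = Path(tmp) / "cat.1.jpg"  # hypothetical file name
    picture.write_bytes(b"not really a jpeg")
    checksum = store_and_link(picture, Path(tmp) / "cache")
    # The workspace file still reads the same, but the master copy is in the cache.
    print(checksum, picture.read_bytes() == b"not really a jpeg")
```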
The next thing to do is to run the script to train the model. We run it like so:
$ dvc run -f Dvcfile \
    -d train.py -d data \
    -M metrics.json \
    -o model.h5 -o bottleneck_features_train.npy \
    -o bottleneck_features_validation.npy \
    python train.py
This command informs DVC of a processing step we want to use in this project. It says to use Python to run train.py, with input files named using the -d options, and output files named using the -o options. The -M option names an output file which contains metrics.
We could have run the command by hand, of course. But by running it this way we've told DVC everything about this command, and now DVC can run it for us when needed. For example if train.py is changed, or the contents of the data directory are changed, DVC knows to rerun train.py because its dependencies have changed. "Changed" is detected because the MD5 checksum will have changed.
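The idea of detecting change by checksum can be sketched as follows. This is a simplified illustration under stated assumptions, not how DVC is actually written: file_md5 and changed_dependencies are hypothetical helpers, and the recorded dictionary stands in for the checksums DVC saves in its DVC files.

```python
import hashlib
import tempfile
from pathlib import Path


def file_md5(path):
    """Return the hex MD5 digest of a (small) file."""
    return hashlib.md5(Path(path).read_bytes()).hexdigest()


def changed_dependencies(recorded):
    """Return the paths whose current MD5 differs from the recorded one."""
    return [path for path, old in recorded.items() if file_md5(path) != old]


with tempfile.TemporaryDirectory() as tmp:
    script = Path(tmp) / "train.py"
    script.write_text("print('v1')\n")
    recorded = {str(script): file_md5(script)}  # checksums saved after a run
    print(changed_dependencies(recorded))       # nothing has changed yet
    script.write_text("print('v2')\n")          # edit the script
    print(changed_dependencies(recorded))       # now the script is listed
```

A stage needs to be rerun whenever that list is not empty.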
DVC supports creating a multi-step pipeline if you use dvc run multiple times. For example you might have steps to download a file, unpack the file, process the data to generate usable data, then run training and validation scripts. In this case we have just one script to run, but many ML projects have several individual steps like that.
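To make the pipeline idea concrete, here is a small sketch of how a tool could decide which steps to rerun, and in what order, once one step has changed. The stage names are hypothetical (this project has only one stage), and the code illustrates the general technique of walking a dependency graph in topological order; it makes no claim about DVC's internals.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical stages; each maps to the stages whose outputs it depends on.
stages = {
    "download": set(),
    "unpack": {"download"},
    "prepare": {"unpack"},
    "train": {"prepare"},
    "validate": {"train"},
}


def stages_to_rerun(stages, changed):
    """Return the changed stages and everything downstream, in execution order."""
    dirty = set(changed)
    order = list(TopologicalSorter(stages).static_order())
    for name in order:
        if stages[name] & dirty:  # depends on something that must be rerun
            dirty.add(name)
    return [name for name in order if name in dirty]


print(stages_to_rerun(stages, {"prepare"}))
# prints ['prepare', 'train', 'validate']
```

Steps upstream of the change (download and unpack here) are left alone, which is where the time savings come from.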
The dvc run command goes ahead and executes the named program. When it's done we're told this:
Adding 'model.h5' to '.gitignore'.
Adding 'bottleneck_features_train.npy' to '.gitignore'.
Adding 'bottleneck_features_validation.npy' to '.gitignore'.
Output 'metrics.json' doesn't use cache. Skipping saving.
Saving 'model.h5' to '.dvc/cache/9a/8fabce95b8a4e94011fdc911030339'.
Saving 'bottleneck_features_train.npy' to '.dvc/cache/da/9e20b12aa5b2dc0abb02e1a1b4e4cf'.
Saving 'bottleneck_features_validation.npy' to '.dvc/cache/e5/48cc847339c990a7dbe0759d87c7c4'.
Saving information to 'Dvcfile'.

To track the changes with git, run:

    git add .gitignore .gitignore .gitignore Dvcfile
Again we have files that have been added to the DVC cache and corresponding updates to the .gitignore files. The file named Dvcfile has the same format and purpose as the .dvc files, in that it tracks the dependencies of the command, the outputs from the command, and the command itself.
At this time we want to record a tag because the first phase of the project is finished.
$ git add .
$ git commit -m "model first version, 1000 images"
$ git tag -a "v1.0" -m "model v1.0, 1000 images"
Adding more images to the dataset - second phase of the project
Suppose you’re not satisfied with the results, and you realize you need a larger set of training and validation images. After locating those images you can add them to the workspace like so:
$ wget https://dvc.org/s3/examples/versioning/new-labels.zip
$ unzip new-labels.zip
$ rm -f new-labels.zip
$ dvc add data
[##############################] 100% Created unpacked dir
Computing md5 for a large number of files. This is only done once.
[##############################] 100%
WARNING: Output 'data' of 'data.dvc' changed because it is 'modified'
Saving 'data' to '.dvc/cache/21/060888834f7220846d1c6f6c04e649.dir'.
Saving information to 'data.dvc'.

To track the changes with git, run:

    git add data.dvc
The data.dvc file was updated because we added files to the data directory.
If you refer back to the dvc run command for train.py, the data directory is listed as a dependency. Clearly since we've added new images to the dataset, we must retrain the model. DVC makes this easy.
$ dvc status
[##############################] 100% Created unpacked dir
Dvcfile:
    changed deps:
        modified: data
First, DVC notices that the dependencies of the command/stage recorded in Dvcfile have changed. The dvc status command tells us that the data directory has been modified. Because we have new files, we can rerun the training script to generate a new model. Because we already told DVC how to run the training script, we can simply use dvc repro (short for reproduce) to re-execute that script.
$ dvc repro
WARNING: assuming default target 'Dvcfile'.
Stage 'data.dvc' didn't change.
WARNING: Dependency 'data' of 'Dvcfile' changed because it is 'modified'.
WARNING: Stage 'Dvcfile' changed.
Reproducing 'Dvcfile'
Running command:
    python train.py
Using TensorFlow backend.
Found 2000 images belonging to 2 classes.
Found 800 images belonging to 2 classes.
Train on 2000 samples, validate on 800 samples
Epoch 1/10
2000/2000 [==============================] - 6s 3ms/step - loss: 0.6299 - acc: 0.7725 - val_loss: 0.2764 - val_acc: 0.8900
Epoch 2/10
2000/2000 [==============================] - 6s 3ms/step - loss: 0.3847 - acc: 0.8495 - val_loss: 0.3442 - val_acc: 0.8625
Epoch 3/10
2000/2000 [==============================] - 6s 3ms/step - loss: 0.3257 - acc: 0.8750 - val_loss: 0.2785 - val_acc: 0.8987
Epoch 4/10
2000/2000 [==============================] - 6s 3ms/step - loss: 0.2600 - acc: 0.9045 - val_loss: 0.3147 - val_acc: 0.8975
Epoch 5/10
2000/2000 [==============================] - 6s 3ms/step - loss: 0.2631 - acc: 0.9095 - val_loss: 0.3194 - val_acc: 0.8912
Epoch 6/10
2000/2000 [==============================] - 6s 3ms/step - loss: 0.2232 - acc: 0.9150 - val_loss: 0.3132 - val_acc: 0.8925
Epoch 7/10
2000/2000 [==============================] - 7s 3ms/step - loss: 0.1915 - acc: 0.9300 - val_loss: 0.4042 - val_acc: 0.8950
Epoch 8/10
2000/2000 [==============================] - 6s 3ms/step - loss: 0.1772 - acc: 0.9345 - val_loss: 0.3480 - val_acc: 0.9012
Epoch 9/10
2000/2000 [==============================] - 6s 3ms/step - loss: 0.1603 - acc: 0.9435 - val_loss: 0.3801 - val_acc: 0.9037
Epoch 10/10
2000/2000 [==============================] - 6s 3ms/step - loss: 0.1329 - acc: 0.9500 - val_loss: 0.5123 - val_acc: 0.8862
Output 'bottleneck_features_validation.npy' didn't change. Skipping saving.
Output 'metrics.json' doesn't use cache. Skipping saving.
Saving 'model.h5' to '.dvc/cache/da/c1587c45f6ee04d926039a80554e86'.
Saving 'bottleneck_features_train.npy' to '.dvc/cache/39/ce773781a79d63090ad47ce643f9e5'.
WARNING: data 'bottleneck_features_validation.npy' exists. Removing before checkout.
Saving information to 'Dvcfile'.

To track the changes with git, run:

    git add Dvcfile
$ git add Dvcfile
$ git status -s
M Dvcfile
M data.dvc
M metrics.json
The model (model.h5) has been retrained, and saved to the DVC cache. Notice that its file name in the cache changed, because its content changed and therefore its MD5 checksum changed.
There were corresponding changes to the files tracked in the Git repository as well. Those must be saved:
$ git commit -m "model second version, 2000 images"
[master 21042e1] model second version, 2000 images
 2 files changed, 11 insertions(+), 11 deletions(-)
$ git tag -a "v2.0" -m "model v2.0, 2000 images"
$ git tag
v1.0
v2.0
We now have two tags in the Git repository, because version 2 of the project is finished.
Demonstrate switching between revisions
What we have now is a little sample machine learning project that has gone through two revisions. What if we decide to take a look at an earlier revision of the project? For example we're interested in tweaking something we tried earlier.
DVC makes it easy to change the dataset to match the Git repository. It does this using the DVC files that have been created, and updated, as we went through the steps. Each instance of each DVC file contains the MD5 checksum of related files.
$ cat Dvcfile
deps:
- path: train.py
  md5: 78d98f3865c2fcfe1dbe95b738960d0a
- path: data
  md5: 21060888834f7220846d1c6f6c04e649.dir
cmd: python train.py
md5: b2b93f6a4f9728e7378348ae41edc67f
outs:
- cache: true
  metric: false
  path: model.h5
  md5: dac1587c45f6ee04d926039a80554e86
  persist: false
- cache: true
  metric: false
  path: bottleneck_features_train.npy
  md5: 39ce773781a79d63090ad47ce643f9e5
  persist: false
- cache: true
  metric: false
  path: bottleneck_features_validation.npy
  md5: e548cc847339c990a7dbe0759d87c7c4
  persist: false
- cache: false
  metric: true
  path: metrics.json
  md5: 8af9a62abc596ab8b8faa6f3a751096d
  persist: false
wdir: .
You'll see that the cmd field records the training command, the deps section lists the files named as dependencies, and the outs section names the output files.
What happens if we use Git to check out an earlier tag?
$ git checkout v1.0
M metrics.json
Note: checking out 'v1.0'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by performing another checkout.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -b with the checkout command again. Example:

  git checkout -b <new-branch-name>

HEAD is now at b96e2b2 model first version, 1000 images
$ dvc checkout
...
[##############################] 100% Checkout finished!
Git tells us that we're in a detached HEAD state. We'll have to remember that for later. The result is that the old versions of the DVC files were checked out, and those files contain different MD5 checksums for the dataset and for most of the outputs. Therefore we ran dvc checkout so that it would fix up the workspace with the matching data files from the DVC cache.
If we look again at Dvcfile we see different checksums.
$ cat Dvcfile
deps:
- path: train.py
  md5: 78d98f3865c2fcfe1dbe95b738960d0a
- path: data
  md5: b8f4d5a78e55e88906d5f4aeaf43802e.dir
cmd: python train.py
md5: 5fcd53385d82ba5e9162e883efb0e094
outs:
- cache: true
  metric: false
  path: model.h5
  md5: 9a8fabce95b8a4e94011fdc911030339
  persist: false
- cache: true
  metric: false
  path: bottleneck_features_train.npy
  md5: da9e20b12aa5b2dc0abb02e1a1b4e4cf
  persist: false
- cache: true
  metric: false
  path: bottleneck_features_validation.npy
  md5: e548cc847339c990a7dbe0759d87c7c4
  persist: false
- cache: false
  metric: true
  path: metrics.json
  md5: 22a409f25e89741b2748cea1ad1b9c30
  persist: false
wdir: .
The dvc checkout command takes a look at these MD5 checksums, and then rearranges the dataset to match the set of files specified by the contents of the DVC files.
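A minimal sketch of that behavior, assuming a manifest that maps each workspace path to a checksum (standing in for what the DVC files record): the checkout and put_in_cache functions are hypothetical names, files are copied rather than linked to keep the example short, and unlike a real checkout the sketch does not remove files that are absent from the manifest.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def checkout(manifest, cache_dir, workspace):
    """Make each workspace file match the checksum recorded for it in manifest."""
    for relative_path, checksum in manifest.items():
        target = Path(workspace) / relative_path
        cached = Path(cache_dir) / checksum[:2] / checksum[2:]
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copyfile(cached, target)  # a copy keeps the sketch simple


def put_in_cache(content, cache_dir):
    """Store bytes in the cache under their MD5 checksum and return the checksum."""
    checksum = hashlib.md5(content).hexdigest()
    cached = Path(cache_dir) / checksum[:2] / checksum[2:]
    cached.parent.mkdir(parents=True, exist_ok=True)
    cached.write_bytes(content)
    return checksum


with tempfile.TemporaryDirectory() as tmp:
    cache, work = Path(tmp) / "cache", Path(tmp) / "work"
    v1 = {"model.h5": put_in_cache(b"weights v1", cache)}
    v2 = {"model.h5": put_in_cache(b"weights v2", cache)}
    checkout(v2, cache, work)
    checkout(v1, cache, work)  # switch back to the earlier revision
    print((work / "model.h5").read_bytes())
    # prints b'weights v1'
```

Because both versions stay in the cache, switching revisions never requires retraining or downloading again.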
$ git checkout v2.0
...
HEAD is now at 21042e1 model second version, 2000 images
$ dvc checkout
...
[##############################] 100% Checkout finished!
$ du -sk data
68276 data
$ git checkout v1.0
HEAD is now at b96e2b2 model first version, 1000 images
$ dvc checkout
...
[##############################] 100% Checkout finished!
$ du -sk data
43804 data
With the workspace checked out to the v2.0 tag, the data directory has about 67 MB of data (68276 KB), and with the v1.0 tag it has about 43 MB (43804 KB). If you remember, with the v1.0 tag we had one set of cat and dog images, and for the v2.0 tag we added some more.
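If du is not available (on Windows, for instance), a similar measurement can be made from Python. This is a small sketch with a hypothetical helper name, directory_size_kb. It sums file sizes rather than allocated disk blocks, so its figures may differ slightly from what du reports.

```python
import os
import tempfile
from pathlib import Path


def directory_size_kb(root):
    """Sum the sizes of all regular files under root, in units of 1024 bytes."""
    total = 0
    for folder, _subdirs, files in os.walk(root):
        for name in files:
            total += os.path.getsize(os.path.join(folder, name))
    return total // 1024


with tempfile.TemporaryDirectory() as tmp:
    nested = Path(tmp) / "train" / "cats"
    nested.mkdir(parents=True)
    (nested / "cat.1.jpg").write_bytes(b"\0" * 4096)  # stand-in for an image
    print(directory_size_kb(tmp))
    # prints 4
```

In the tutorial workspace you would call directory_size_kb("data") after each dvc checkout.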
The last thing to do is to simplify checking out a given tag.
$ git checkout master
$ dvc checkout
$ dvc install
Remember that after checking out these tags, the workspace was in a "detached HEAD" state. The cure for the detached HEAD state is to run "git checkout master", after which we were required to follow up with "dvc checkout".
The last command, "dvc install", installs Git hooks so that the "dvc checkout" command is run automatically after each "git checkout".
Demonstrate sharing a machine learning project with colleagues, or deploying it to a remote system
The last thing we want to go over is the method for copying a DVC workspace from one location to another, and for copying revisions from one DVC workspace to another. We might want to share the workspace with a colleague, perhaps for collaboration, or we might want to deploy it to a cloud server for heavy-duty computation.
Remember there are two parts to this workspace: the Git repository, and the DVC files including the DVC cache.
The process to move this from one place to another is fairly easy, but a little more than simply copying the directory structure to another place.
The result of the following process is that:
- We have a master Git repository, and master DVC cache
- The workspace we've used so far on the laptop is configured with both these repositories
- A new workspace is set up on a remote machine, also configured with both these repositories
- Either workspace can be used to push data or code into both these repositories
- Either workspace can be used to pull data or code from both these repositories
The first thing we must do is to upload the dataset to a remote DVC cache.
$ dvc remote add default ssh://email@example.com:/home/david/dvc-remote/example-versioning
$ dvc config core.remote default
$ git status -s
M .dvc/config
$ git add .dvc/config
$ git commit -a -m 'define remote'
Pipeline is up to date. Nothing to reproduce.
[master 7d7ecda] define remote
 1 file changed, 2 insertions(+)
Here we set up a DVC remote so we can push the dataset to a remote server. DVC supports several remote file protocols for this, such as Amazon S3. An SSH remote is very easy to set up: simply create a directory on the remote system and then specify the SSH URL as shown here.
In my case the "remote" machine is an Intel NUC, used for various purposes, that sits on my desk next to my laptop.
A remote DVC cache is similar to the local DVC cache we've been using. The role of a remote cache is that we can push data to the remote cache, or pull data from the remote cache. Therefore it's easy to share the dataset with others via a remote cache.
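Conceptually, a push sends the remote whatever cache objects it does not already have, and a pull is the same operation in the other direction. The sketch below illustrates that idea with two local directories standing in for the local and remote caches. The push function and the directory names are assumptions made for this example; the real command works over SSH, S3 and other protocols and decides what to transfer in its own way.

```python
import shutil
import tempfile
from pathlib import Path


def push(local_cache, remote_cache):
    """Copy every object the remote lacks; return the relative paths that were sent."""
    local_cache, remote_cache = Path(local_cache), Path(remote_cache)
    sent = []
    for obj in sorted(p for p in local_cache.rglob("*") if p.is_file()):
        relative = obj.relative_to(local_cache)
        target = remote_cache / relative
        if not target.exists():  # checksum-named objects never change in place
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.copyfile(obj, target)
            sent.append(relative.as_posix())
    return sent


with tempfile.TemporaryDirectory() as tmp:
    local, remote = Path(tmp) / "local", Path(tmp) / "remote"
    (local / "9a").mkdir(parents=True)
    (local / "9a" / "8fabce95b8a4e94011fdc911030339").write_bytes(b"model bytes")
    print(push(local, remote))  # first push sends the object
    print(push(local, remote))  # second push finds nothing new to send
    print(push(remote, local))  # a "pull" is a push in the other direction
```

Because objects are named by checksum, comparing file names is enough to know what is missing on the other side.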
$ dvc push
Preparing to upload data to 'ssh://firstname.lastname@example.org/home/david/dvc-remote/example-versioning'
Preparing to collect status from ssh://email@example.com/home/david/dvc-remote/example-versioning
Collecting information from local cache...
[##############################] 100%
...
The dvc push command sends data to the remote DVC cache.
The other thing we did was to commit the .dvc/config file to the Git repository. The config file was modified because dvc remote add added an entry to that file.
That means the local Git repository is ready to be cloned somewhere else, and it will already know about the remote DVC cache.
We need to take care of this little issue, first:
$ git remote -v
origin https://github.com/iterative/example-versioning.git (fetch)
origin https://github.com/iterative/example-versioning.git (push)
This workspace started by cloning the existing repository on GitHub. We don't have rights to push to that repository, and so we need our own Git repository we can push to. Therefore we need to do this:
$ git remote rm origin
The next question is how to properly clone the Git repository on my laptop to a server, such as this NUC. What I chose to do is to set up a Git repository on a GitHub-like server (it's actually Gogs, but the same commands will work for a GitHub repository).
$ git remote add origin ssh://git@GIT-SERVER-DOMAIN:10022/david/example-versioning.git
$ git push -u origin master
This results in:
- The Git repository is on a Git server that can be accessed via an SSH URL
- The DVC cache is on a server that is accessed via an SSH URL
Then on the NUC I ran these commands:
nuc$ mkdir ~/dvc
nuc$ cd ~/dvc
nuc$ git clone ssh://git@GIT-SERVER-DOMAIN:10022/david/example-versioning.git
Cloning into 'example-versioning'...
remote: Counting objects: 48, done.
remote: Compressing objects: 100% (44/44), done.
remote: Total 48 (delta 12), reused 0 (delta 0)
Receiving objects: 100% (48/48), 7.55 KiB | 7.55 MiB/s, done.
Resolving deltas: 100% (12/12), done.
This clones the Git repository. That carried along the DVC configuration file we already set up. The DVC configuration includes the remote DVC cache, and therefore we can pull data from the remote cache like so:
nuc$ dvc remote list
default ssh://firstname.lastname@example.org:/home/david/dvc-remote/example-versioning
nuc$ dvc pull
Preparing to download data from 'ssh://email@example.com/home/david/dvc-remote/example-versioning'
Preparing to collect status from ssh://firstname.lastname@example.org/home/david/dvc-remote/example-versioning
Collecting information from local cache...
[##############################] 100%
...
With that we have both the Git repository and its associated DVC dataset duplicated on the NUC.
nuc$ dvc status
Pipeline is up to date. Nothing to reproduce.
nuc$ dvc repro
WARNING: assuming default target 'Dvcfile'.
Stage 'data.dvc' didn't change.
Stage 'Dvcfile' didn't change.
Pipeline is up to date. Nothing to reproduce.
And voila, it's all ready to be used.
From here either workspace could have a change to commit. To do so:
$ git push
$ dvc push
And then in the other workspace, to receive this change:
$ git pull
$ dvc pull
$ dvc status
$ dvc repro
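If you find yourself typing these sequences often, they are easy to wrap in a small script. The following is one possible sketch, assuming git and dvc are both on the PATH. The names PUBLISH_STEPS, RECEIVE_STEPS and run_steps are made up for this example, and the commands are exactly the ones listed above.

```python
import subprocess

# The command sequences from this section, in the order they were given.
PUBLISH_STEPS = [
    ["git", "push"],
    ["dvc", "push"],
]
RECEIVE_STEPS = [
    ["git", "pull"],
    ["dvc", "pull"],
    ["dvc", "status"],
    ["dvc", "repro"],
]


def run_steps(steps, runner=subprocess.run):
    """Run each command in order, stopping at the first failure."""
    for step in steps:
        runner(step, check=True)  # check=True raises CalledProcessError on failure
    return len(steps)


# Inside a cloned workspace you would call, for example:
# run_steps(RECEIVE_STEPS)
```

Stopping at the first failure matters here: there is little point in running dvc pull if git pull did not bring in the updated DVC files.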