; Date: Sun Jul 07 2019
Tags: Machine Learning
DVC is a powerful set of tools for managing data files associated with data science or machine learning projects. The code for such a project is committed to a Git repository, and DVC manages the data files in parallel to that repository.
DVC is written in Python, so your computer must have Python installed. It runs on Windows, Linux or Mac OS X. We will go through setting up a simple prebaked DVC project to explore its capabilities. But first we must install the software.
Installing DVC on Linux, Mac OS X or Windows
The basic installation instruction is:
$ pip install dvc
The following methods can be used instead if you prefer a system package manager.
It can be installed from an apt repository (Debian or Ubuntu):
$ sudo wget https://dvc.org/deb/dvc.list -O /etc/apt/sources.list.d/dvc.list
$ sudo apt-get update
$ sudo apt-get install dvc
Or from a yum repository (Fedora, CentOS or RHEL):
$ sudo wget https://dvc.org/rpm/dvc.repo -O /etc/yum.repos.d/dvc.repo
$ sudo yum update
$ sudo yum install dvc
Or with Homebrew (Mac OS X), using one of these commands:
$ brew install iterative/homebrew-dvc/dvc
...OR...
$ brew cask install iterative/homebrew-dvc/dvc
Whichever method you used, it's helpful to verify the installation worked. These two commands will be useful:
$ dvc --help
.... prints help output including the available commands
$ dvc --version
.... prints the version you installed
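The same check can be scripted, which is handy in a setup script for a shared project. The following is a minimal sketch in Python. The function name dvc_version is made up for this illustration, and it relies only on the dvc --version command shown above.

```python
import shutil
import subprocess


def dvc_version():
    """Return the output of `dvc --version`, or None if dvc is not usable."""
    # shutil.which searches PATH for the executable, like the shell does.
    exe = shutil.which("dvc")
    if exe is None:
        return None
    try:
        result = subprocess.run(
            [exe, "--version"],
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            universal_newlines=True,
            check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        # The executable exists but could not be run successfully.
        return None
    return result.stdout.strip()


if __name__ == "__main__":
    version = dvc_version()
    print("DVC not found on PATH" if version is None else "DVC version: " + version)
```

Returning None instead of raising keeps the check easy to use as a precondition before the remaining setup steps.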
Setting up the example project
Now that we have DVC installed let's set up a sample project.
The DVC team has created a Getting Started example that we can install. It is a text classifier example. Its advantage is that it is quick to set up and doesn't take up much space or execution time, while still being real machine learning code.
This is also an example of how easy DVC makes it to share workspaces with colleagues.
$ git clone https://github.com/iterative/example-get-started
$ cd example-get-started
$ virtualenv -p python3 .env        # optional
$ source .env/bin/activate          # optional
$ pip install -r requirements.txt
$ dvc pull
You should of course already have Python, Git, virtualenv and pip installed on your computer.
This downloads source code from a Git repository. The source tree includes a requirements file (requirements.txt) that we use to install dependencies like Pandas or scikit-learn. The last step, dvc pull, downloads a prebaked cache directory from a remote cache. As we said, this demonstrates how easy DVC makes it to share both the code and the dataset with colleagues, or to otherwise deploy a project into any environment you like.
The source tree looks like so:
$ tree
.
├── README.md
├── auc.metric
├── data
│   ├── data.xml
│   ├── data.xml.dvc
│   ├── features
│   │   ├── test.pkl
│   │   └── train.pkl
│   └── prepared
│       ├── test.tsv
│       └── train.tsv
├── evaluate.dvc
├── featurize.dvc
├── model.pkl
├── prepare.dvc
├── requirements.txt
├── src
│   ├── evaluate.py
│   ├── featurization.py
│   ├── prepare.py
│   └── train.py
└── train.dvc

4 directories, 18 files
DVC maintains a hidden directory (.dvc) containing a configuration file (.dvc/config), the aforementioned cache, and some other data. One of the settings in this example workspace is the URL for a prebaked DVC remote cache. DVC supports both local and remote caches, the latter being used for sharing data with colleagues.
There is a pair of commands, dvc push and dvc pull, that are analogous to the git push and git pull commands. With dvc push we send data to a remote cache. With the dvc pull command we download data from a remote cache. In this case, we end up with this directory structure:
$ tree .dvc
.dvc
├── cache
│   ├── 38
│   │   └── 63d0e317dee0a55c4e59d2ec0eef33
│   ├── 42
│   │   ├── c7025fc0edeb174069280d17add2d4.dir
│   │   └── c7025fc0edeb174069280d17add2d4.dir.unpacked
│   │       ├── test.pkl
│   │       └── train.pkl
│   ├── 58
│   │   └── 245acfdc65b519c44e37f7cce12931
│   ├── 68
│   │   ├── 36f797f3924fb46fcfd6b9f6aa6416.dir
│   │   └── 36f797f3924fb46fcfd6b9f6aa6416.dir.unpacked
│   │       ├── test.tsv
│   │       └── train.tsv
│   ├── 9d
│   │   └── 603888ec04a6e75a560df8678317fb
│   ├── a3
│   │   └── 04afb96060aad90176268345e10355
│   ├── aa
│   │   └── 35101ce881d04b41d5b4ff3593b423
│   └── dc
│       └── a9c512fda11293cfee7617b66648dc
├── config
├── lock
├── state
├── updater
└── updater.lock
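The two-character directory names in the listing above suggest how the cache is addressed. Each entry appears to be named after a 32-character MD5 checksum, with the first two hex digits used as the directory name and the remaining 30 as the file name. The Python sketch below illustrates that layout under this assumption. The names cache_path and md5_of_file are hypothetical helpers, not part of DVC's API, and the sketch does not claim to reproduce exactly how DVC computes its checksums.

```python
import hashlib
from pathlib import PurePosixPath


def md5_of_file(path, chunk_size=1024 * 1024):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def cache_path(checksum, cache_dir=".dvc/cache"):
    """Map a checksum to its assumed location inside the cache directory."""
    # First two hex digits become the directory, the rest the file name.
    return PurePosixPath(cache_dir) / checksum[:2] / checksum[2:]


# One of the entries from the listing above:
print(cache_path("a304afb96060aad90176268345e10355"))
# prints .dvc/cache/a3/04afb96060aad90176268345e10355
```

Naming cache entries by content rather than by file name is what lets several versions of the same data file sit side by side in one cache.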
Now that we have learned a little about DVC, and have a working DVC project directory, let’s explore how DVC efficiently deals with the local cache directory.
Associating data files with Git commits and experiments
The workspace we checked out has a number of tags describing each commit.
$ git tag
0-empty
1-initialize
2-remote
3-add-file
4-sources
5-preparation
6-featurization
7-train
8-evaluation
9-bigrams
baseline-experiment
bigrams-experiment
The first several tags concern setting up the project, and the last couple of tags mark different experiments with the machine learning model.
We can easily revisit the state of the workspace at any of these tags. The first step is obvious:
$ git checkout 9-bigrams
If you then inspect the workspace, the source files have changed (src/featurization.py), but the data files have not. DVC provides a command to update which data files are checked out:
$ dvc checkout
The files whose names end in .dvc record which data files belong to the current commit. When we check out a different Git commit, this changes which instance of each .dvc file is checked out. The dvc checkout command goes through the .dvc files and retrieves the correct instance of each data file from the DVC cache.
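To make the mechanism concrete, here is a toy model of that lookup in Python. It is a sketch of the idea, not DVC's implementation. The recorded dictionary is a hypothetical stand-in for what the .dvc files record, the cache layout is assumed to be the two-level checksum layout seen in the .dvc/cache listing, and files are simply copied into place.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def toy_add(content, cache_dir):
    """Store content in the toy cache and return its MD5 checksum."""
    checksum = hashlib.md5(content).hexdigest()
    cached = Path(cache_dir) / checksum[:2] / checksum[2:]
    cached.parent.mkdir(parents=True, exist_ok=True)
    cached.write_bytes(content)
    return checksum


def toy_checkout(recorded, cache_dir, workspace):
    """Copy the cached instance of each recorded file into the workspace."""
    for rel_path, checksum in recorded.items():
        cached = Path(cache_dir) / checksum[:2] / checksum[2:]
        target = Path(workspace) / rel_path
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copyfile(str(cached), str(target))


with tempfile.TemporaryDirectory() as tmp:
    cache, work = Path(tmp) / "cache", Path(tmp) / "work"
    # Two instances of the same data file, as two commits might record them.
    old = toy_add(b"features v1", cache)
    new = toy_add(b"features v2", cache)

    toy_checkout({"data/features.pkl": old}, cache, work)
    first = (work / "data" / "features.pkl").read_bytes()

    # "Checking out" another commit only changes which checksum is recorded.
    toy_checkout({"data/features.pkl": new}, cache, work)
    second = (work / "data" / "features.pkl").read_bytes()

print(first, second)
# prints b'features v1' b'features v2'
```

The point of the model is that switching versions never recomputes anything. It only changes which cached instance is placed at a given path in the workspace.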
One way to see the difference between tags is with this command:
$ git diff baseline-experiment bigrams-experiment
In the diff output you’ll see that src/featurization.py was changed, but so too were all the .dvc files. As we use git checkout to navigate through the tags in the project, the .dvc files refer to different instances of each file in the DVC cache. The dvc checkout command then synchronizes the checked out data files.
As a convenience, this command installs Git hooks that automatically execute dvc checkout every time git checkout is run:
$ dvc install
The first time we used git checkout, Git printed a message about being in a “detached HEAD” state. It is easy to return to the normal state with this command:
$ git checkout master
While in a detached HEAD state we are able to create a branch to contain any work we want to do. This Stack Overflow answer has some good advice about that.
The advice there also applies to a DVC project. When dvc checkout is run, DVC does not know that the workspace is in a Git branch. It just knows that the .dvc files list the files that are to be checked out. That means the git checkout and dvc checkout combination can be used for any Git tag, commit, or branch.
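For scripts that need to move the workspace between revisions, the two commands can be wrapped together. The following Python sketch assumes git and dvc are both on the PATH. The name switch_revision is a hypothetical helper, and the runner parameter exists only so the command sequence can be inspected without actually running it.

```python
import subprocess


def switch_revision(revision, runner=subprocess.check_call):
    """Check out a Git tag, commit, or branch, then sync the data files."""
    commands = [
        ["git", "checkout", revision],  # updates source files and .dvc files
        ["dvc", "checkout"],            # updates data files to match
    ]
    for command in commands:
        # check_call raises CalledProcessError if a command fails,
        # so dvc checkout is not attempted after a failed git checkout.
        runner(command)
    return commands
```

Calling switch_revision("9-bigrams") inside the example workspace would run both commands in order. Where the hooks from dvc install are in place, the second step already happens automatically, so a wrapper like this is mainly useful in environments where the hooks have not been installed.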