How and why to use Git Submodules

Date: Fri Dec 10 2021

Tags: Git

Sometimes we need to use the contents of one Git repository inside another. A typical example is using the source for a shared library in multiple applications. With Git Submodules, we configure one or more other repositories to check out as child repositories.

Using Git submodules will increase the complexity of your project. Before heading down this road, you should be certain the result will be worth the complexity. As they say:

A programmer had a version control problem and said, “I know, I’ll use submodules.” Now they have two problems.

That's a riff on a famous quip about regular expressions. Regular expressions are extremely useful, but you must be certain they are the correct solution. Similarly, Git submodules are extremely useful, but you must be certain they are the correct solution for your problem.

With the submodules feature, you're effectively embedding a child Git repository into a parent repository. The files of the child repository appear within the file tree of the parent. But changes in the child repository are tracked by the child rather than parent repository. The only record in the parent repository is the SHA commit identifier of the commit to use in the child.

What can you do with a Git repository containing or embedding one or more other repositories?

At its core, the submodules feature lets you construct a directory hierarchy containing data (source files) from multiple locations. The subsections, a.k.a. submodules, of this hierarchy are independently managed, with each being versioned on its own schedule. Each submodule can be used by multiple parent projects. Each parent project has control over which release of each subsidiary project is being used, and each parent project can push changes to the subsidiary projects.

Those attributes are a rough summary of the advantages. But, as said earlier, this comes with increased complexity. Two examples are:

  1. Anyone who clones the parent repository must remember to run a Git command to cause the submodule repositories to also be checked out.
  2. If there are new commits in the repository for a submodule, the submodule SHA commit reference recorded in the parent will not be updated until you run a Git command for that purpose.

Why should one use Git submodules?

I've used Git for many years without needing to know about submodules. Most of us could go a whole software development career without using them. Using them adds complexity, so it had better be worth your time.

Why did I decide to look into using submodules? It was an issue with the source code repository for the AkashaCMS website, which is hosted on github.com.

That website documents the AkashaCMS family of tools, a static website generator platform. There are many many modules involved with AkashaCMS, each of which has documentation in the corresponding repository. The akashacms/akashacms-website repository holds a portion of the content appearing on akashacms.com, and the rest is spread across the repositories for each AkashaCMS package. The documentation for each package is therefore tracked in parallel with its changes. But, getting all that documentation to show on akashacms.com requires somehow bringing everything together into one source tree.

In other words, constructing the akashacms.com website requires assembling into one directory tree the content of a dozen or more Git repositories.

In the past, this was handled by embedding the documentation inside each npm package. That meant the documentation could be found inside the node_modules directory tree. But when I decided to remove the documentation files from the npm packages, to reduce the package size, it meant finding another solution. That question led to Git submodules, and here we are.

A more typical reason to use Git submodules is that the developers of a set of applications need to include the same library (or libraries) across those applications. For example, someone might have created a really good MP3 library that's used in multiple audio player applications. Rather than rely on the MP3 library being installed as a shared library on any computer where the app is installed, the application developers might choose to directly embed the library into their compiled application.

Stated generally, one or more application authors might directly embed a given library in their applications.

Such application authors could easily be working on code in both the shared library, and in the application, at the same time. They might find it most convenient to configure the shared library repository as a submodule, so it's all in the same managed source tree, and every developer working on the application would be on the same page.

There are powerful reasons to use submodules. Make sure you understand them, and make sure they're what you need.

With that in mind, let's get started on how to use Git submodules.

Adding a child Git repository using Submodules

Remember that a Git submodule is a reference to another Git repository. That reference is stored in a parent repository.

To add a Git submodule reference to a parent Git repository use this command:

$ git submodule add GIT-URL-FOR-REPOSITORY path/to/submodule

The GIT-URL can be a local filesystem reference, or any other URL supported by Git. The typical choice will be between HTTPS and SSH URLs.

The path/to/submodule is optional, and specifies the directory where the submodule will land. If this is omitted, the directory name is taken from the last component of the URL, with any .git suffix removed. For example if the URL is git@github.com:akashacms/akashacms-external-links.git then the default submodule directory name is akashacms-external-links. Specifying path/to/submodule overrides the default with whatever directory name you like.
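As a minimal sketch of the optional path argument, the following uses a throwaway local repository in place of a GitHub URL, so it can be run offline. The directory name vendor/shared-lib and the demo identity are hypothetical. The protocol.file.allow setting is only needed because recent Git versions refuse local-path submodules by default, and git init -b assumes Git 2.28 or later.

```shell
set -eu
work=$(mktemp -d)
cd "$work"

# Hypothetical identity so commits succeed on an unconfigured machine
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# Local stand-in for the remote child repository
git init -q -b main child
touch child/child.html
git -C child add .
git -C child commit -q -m 'Initial revision'

# The parent repository, with one committed file
git init -q -b main parent
cd parent
touch main.html
git add .
git commit -q -m 'Initial revision'

# Add the child under a directory name of our choosing
git -c protocol.file.allow=always submodule --quiet add "$work/child" vendor/shared-lib

# .gitmodules records the chosen path alongside the URL
path=$(git config -f .gitmodules --get submodule.vendor/shared-lib.path)
echo "submodule path: $path"
ls vendor/shared-lib
```

With a real hosted repository you would drop the -c protocol.file.allow=always part and pass the SSH or HTTPS URL instead of the local path.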

To work with this feature in practice, let's create a pair of repositories on GitHub, or GitLab, or your favorite Git service. Name one parent and the other child. It will help if you initialize each with a README when you do so.

$ git clone git@github.com:robogeek/parent.git
Cloning into 'parent'...
...
$ cd parent
$ git checkout -B main
$ touch main.html
$ git add .
$ git commit -m 'Initial revision'
$ git push -u origin main

Start with the parent repository. Use the correct URL for the repository you created. Notice that I used the SSH URL, because that's the easiest way to be able to push changes to the repository. The remaining steps make sure the branch is named main, that there is at least one file committed to it, and that the commit is pushed to the remote repository.

For the child repository perform the same steps in a sibling directory. In the child repository, instead of creating main.html, create child.html, so we can quickly tell them apart. That gives us two repositories, with a few files in each.

Next, change your directory to the parent directory, and run this command:

$ cd ../parent
$ git submodule add git@github.com:robogeek/child.git

This adds a new submodule to the parent repository. The submodule references the repository at the named URL. Notice that I again used an SSH URL, and that you should use the correct URL for your child repository. We'll later discuss why SSH URLs are being used here, and when you should use HTTPS URLs.

This command creates a file named .gitmodules.

$ cat .gitmodules 
[submodule "child"]
    path = child
    url = git@github.com:robogeek/child.git

You'll see it is a simple data file giving particulars of the submodule which was added. There are also entries in .git/config supporting submodules. A directory, child, was created, and in that directory are the contents of that repository.

We can check the status:

$ git submodule status
 a7192d1b027f624fe250aa53a746b194f88ec72b child (heads/main)

The first part is the SHA-1 of the currently checked out commit for the submodule. The second part is the path to the submodule. The text in parentheses is the output of git describe for that commit. Remember this SHA-1 identifier because we'll be seeing it a few times in subsequent examples.

If the submodule you're adding needs to be on a specific branch, then add the -b branch option like so:

$ git submodule add -b branch-name \
                git@github.com:robogeek/child.git

The branch name is recorded in .gitmodules.
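To see where the branch name ends up, here is a small offline sketch. It assumes a local stand-in repository with a hypothetical branch named stable, a demo identity, and Git 2.28 or later for git init -b.

```shell
set -eu
work=$(mktemp -d)
cd "$work"

# Hypothetical identity so commits succeed on an unconfigured machine
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# Local stand-in for the child repository, with an extra branch
git init -q -b main child
touch child/child.html
git -C child add .
git -C child commit -q -m 'Initial revision'
git -C child branch stable

git init -q -b main parent
cd parent
touch main.html
git add .
git commit -q -m 'Initial revision'

# -b records the branch in .gitmodules (the -c option is only for local paths)
git -c protocol.file.allow=always submodule --quiet add -b stable "$work/child"

# Read the recorded branch back out of .gitmodules
branch=$(git config -f .gitmodules --get submodule.child.branch)
echo "tracked branch: $branch"
cat .gitmodules
```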

The last interesting thing to notice is in .git/modules where you'll find a new directory, child. Go into that directory and you'll find it holds the Git data (objects, refs, and configuration) for the clone of the child repository. Run git log and you'll find the HEAD of that repository is the same commit just shown.

Committing Git Submodule configuration to the repository

We've created a parent repository with a child repository as a submodule. We can repeat this as many times as we like, bringing in other submodule repositories. But, now what? In particular, how do we share this submodule configuration with other members of our team?

The git submodule add command records data about submodules in a file named .gitmodules. It also stages that file, along with the child entry, so both appear in the output of git status -s. The status for both .gitmodules and child is A, meaning they are newly added.

This means those two entries are ready to be committed and pushed to the parent repository. That is exactly what's required for sharing the submodule configuration with the origin repository, and from there with other team members.

$ git commit -a -m 'Add submodule'
$ git push

This commits the two entries, .gitmodules and child, and pushes them to the repository. Go back to your browser and inspect the parent repository. You'll find there is a .gitmodules file, and a directory named child. But, unlike regular directories in Git repositories, GitHub displays it as a link labeled child @ a7192d1.

That hex string is the start of the SHA-1 code for the commit in the child repository. Notice that it matches the leading digits of the SHA-1 shown earlier. It denotes that the child directory is in fact a reference to a specific commit in the child repository.
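The same reference can be inspected from the command line. The sketch below, which assumes local stand-in repositories and a demo identity, shows that the parent stores the child as a tree entry with mode 160000 and type commit, and that the recorded SHA-1 equals the commit checked out in the submodule.

```shell
set -eu
work=$(mktemp -d)
cd "$work"

# Hypothetical identity so commits succeed on an unconfigured machine
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# Local stand-ins for the child and parent repositories (Git 2.28+ for -b)
git init -q -b main child
touch child/child.html
git -C child add .
git -C child commit -q -m 'Initial revision'

git init -q -b main parent
cd parent
touch main.html
git add .
git commit -q -m 'Initial revision'
git -c protocol.file.allow=always submodule --quiet add "$work/child"
git commit -q -m 'Add submodule'

# The parent's tree holds a single entry for the submodule
git ls-tree HEAD child
mode=$(git ls-tree HEAD child | cut -d' ' -f1)

# The commit recorded in the parent versus the one checked out in the child
recorded=$(git rev-parse HEAD:child)
actual=$(git -C child rev-parse HEAD)
echo "recorded: $recorded"
echo "actual:   $actual"
```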

Cloning a Git repository which has Submodules

We've added a submodule to our repository, and shared the submodule configuration to the Git server. Our fellow developers should be able to just clone the repository and be ready to go, right? Sorry, no.

To see this try these commands:

$ git clone git@github.com:robogeek/parent.git parent2
...
$ cd parent2
$ ls
child    main.html
$ ls child

That is, clone the repository to a new name, to mimic a colleague cloning the repository. You'll find the files from the parent repository are checked out, but the child directory is empty.

The easiest solution is to add the --recurse-submodules option when performing the initial git clone, like so:
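One way to detect this situation in a script is to look at the first character that git submodule status prints: a leading - marks a submodule that is not initialized, and a leading + marks one whose checked-out commit differs from the recorded one. The sketch below reproduces the plain clone with local stand-in repositories and a demo identity, both of which are assumptions for illustration.

```shell
set -eu
work=$(mktemp -d)
cd "$work"

# Hypothetical identity so commits succeed on an unconfigured machine
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# Local stand-ins for the child and parent repositories (Git 2.28+ for -b)
git init -q -b main child
touch child/child.html
git -C child add .
git -C child commit -q -m 'Initial revision'

git init -q -b main parent
touch parent/main.html
git -C parent add .
git -C parent commit -q -m 'Initial revision'
git -C parent -c protocol.file.allow=always submodule --quiet add "$work/child"
git -C parent commit -q -m 'Add submodule'

# A plain clone, as a colleague might make it
git clone -q "$work/parent" parent2

# Classify the submodule by the status prefix character
status=$(git -C parent2 submodule status)
echo "$status"
case "$status" in
  -*) echo "child is registered but not initialized" ;;
  +*) echo "child is checked out at a different commit than recorded" ;;
  *)  echo "child matches the recorded commit" ;;
esac
```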

$ git clone --recurse-submodules git@github.com:robogeek/parent.git parent3
Cloning into 'parent3'...
remote: Enumerating objects: 6, done.
remote: Counting objects: 100% (6/6), done.
remote: Compressing objects: 100% (4/4), done.
Receiving objects: 100% (6/6), done.
remote: Total 6 (delta 0), reused 6 (delta 0), pack-reused 0
Submodule 'child' (git@github.com:robogeek/child.git) registered for path 'child'
Cloning into '/Volumes/Extra/akasharender/t/parent3/child'...
remote: Enumerating objects: 3, done.        
remote: Counting objects: 100% (3/3), done.        
remote: Total 3 (delta 0), reused 3 (delta 0), pack-reused 0        
Receiving objects: 100% (3/3), done.
Submodule path 'child': checked out 'a7192d1b027f624fe250aa53a746b194f88ec72b'

This time the git clone messages talk about checking out the submodules. Check in parent3/child and you'll see the files are there, and notice that the SHA-1 code shown here still matches the one we saw earlier.

Notice this command cloned into parent3. Therefore, the parent2 directory is as we left it. With it, we can think about another scenario, in which you had an existing clone of the workspace, and needed to update it to include new submodules that had been added. In other words, how do you update the submodules in an existing repository, without rerunning git clone?

To find out let's go into parent2 (which is not updated) and run a command.

$ cd ../parent2
$ git submodule update --init --recursive child
Submodule 'child' (git@github.com:robogeek/child.git) registered for path 'child'
Cloning into '/Volumes/Extra/akasharender/t/parent2/child'...
Submodule path 'child': checked out 'a7192d1b027f624fe250aa53a746b194f88ec72b'

This is one method, using the submodule update command with these two options. The last argument is the path of the submodule. If the submodule path is not specified the update operation will instead run for every submodule.

But, what about those two command line options? The first, --init, makes sure the submodules have been correctly initialized. The second, --recursive, makes sure to check out all submodules, including submodules of submodules.

Using --init is shorthand for running git submodule init before the update, as in these two commands:

$ git submodule init
...
$ git submodule update --recursive
...

The submodule update causes the registered submodules to be updated to match what's expected in the parent repository. It does this by cloning any missing submodules and checking out the commit whose SHA-1 hash is recorded in the parent.
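To make the division of labor between the two commands visible, this offline sketch runs them separately in a plain clone. It shows that init copies the submodule URL from .gitmodules into .git/config, and that update then clones and checks out the files. The local repositories and demo identity are assumptions for illustration.

```shell
set -eu
work=$(mktemp -d)
cd "$work"

# Hypothetical identity so commits succeed on an unconfigured machine
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# Local stand-ins for the child and parent repositories (Git 2.28+ for -b)
git init -q -b main child
touch child/child.html
git -C child add .
git -C child commit -q -m 'Initial revision'

git init -q -b main parent
touch parent/main.html
git -C parent add .
git -C parent commit -q -m 'Initial revision'
git -C parent -c protocol.file.allow=always submodule --quiet add "$work/child"
git -C parent commit -q -m 'Add submodule'

# A plain clone has no submodule URL in .git/config yet
git clone -q "$work/parent" parent2
cd parent2
before=$(git config --get submodule.child.url || true)

# Step 1: init registers the submodule in .git/config
git submodule --quiet init
after=$(git config --get submodule.child.url)
echo "url before init: '$before'"
echo "url after init:  '$after'"

# Step 2: update clones the child and checks out the recorded commit
git -c protocol.file.allow=always submodule --quiet update --recursive
ls child
```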

The difference between HTTPS and SSH URLs in Submodules

So far we've used SSH URLs for the submodule repositories. This sort of URL requires every user of the repository to authenticate. A user of SSH URLs must have their SSH key registered with the remote Git server, and if not they'll get error messages.

$ git clone git@github.com:robogeek/parent.git
Cloning into 'parent'...
Could not create directory '/var/lib/jenkins/.ssh'.
The authenticity of host 'github.com (192.30.255.113)' can't be established.
ECDSA key fingerprint is SHA256:p2QAMXNIC1TJYWeIOttrVc98/R1BUFWu3/LiyKgUfQM.
Are you sure you want to continue connecting (yes/no/[fingerprint])? yes
Failed to add the host to the list of known hosts (/var/lib/jenkins/.ssh/known_hosts).
git@github.com: Permission denied (publickey).
fatal: Could not read from remote repository.

Please make sure you have the correct access rights
and the repository exists.

This was performed by a user ID which did not have SSH keys, and so had no key registered with the Git server. With no key to validate, that user ID is prohibited from cloning the repository over SSH. That's what the messages show us.

Your repository might solely support developers who have SSH authentication, and who will be committing changes to the repositories. If so, it is appropriate to require SSH key authentication.

But if the repository is to be shared with the public, who obviously have not registered their SSH key with your repository, you must use a different Git URL which allows unauthenticated cloning. The simplest is to use an HTTPS URL.

$ git clone https://github.com/robogeek/parent.git
Cloning into 'parent'...
remote: Enumerating objects: 6, done.
remote: Counting objects: 100% (6/6), done.
remote: Compressing objects: 100% (4/4), done.
remote: Total 6 (delta 0), reused 6 (delta 0), pack-reused 0
Unpacking objects: 100% (6/6), 546 bytes | 182.00 KiB/s, done.
$ cd parent/
$ ls
child  main.html
$ ls child/

But, the submodules are not checked out. If we try to update the submodules:

$ git submodule update --init --recursive child
...
git@github.com: Permission denied (publickey).
fatal: Could not read from remote repository.
...

We get the same error messages as before. The issue is the SSH URL used in the submodule configuration.

Some repositories, like akashacms/akashacms-website, must be open for anyone to clone with no prior setup. Hence, HTTPS URLs are required for the submodules of such repositories.

If the submodule had been set up this way:

$ git submodule add https://github.com/robogeek/child.git

Then the submodule configuration instead has an HTTPS URL, and the submodule update command will execute correctly. But this creates a different problem if you want to push commits from the child repository to its origin.
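If a submodule was already added with an SSH URL, one possible way to switch it is git submodule set-url, which is available in Git 2.25 and later. The sketch below is an assumption-laden illustration: it adds the child from a local stand-in repository, then rewrites the URL to the HTTPS form shown above. No network access happens, because set-url only rewrites configuration.

```shell
set -eu
work=$(mktemp -d)
cd "$work"

# Hypothetical identity so commits succeed on an unconfigured machine
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# Local stand-ins for the child and parent repositories (Git 2.28+ for -b)
git init -q -b main child
touch child/child.html
git -C child add .
git -C child commit -q -m 'Initial revision'

git init -q -b main parent
cd parent
touch main.html
git add .
git commit -q -m 'Initial revision'
git -c protocol.file.allow=always submodule --quiet add "$work/child"
git commit -q -m 'Add submodule'

# Rewrite the recorded URL; this also syncs the submodule's own remote
git submodule --quiet set-url child https://github.com/robogeek/child.git

recorded=$(git config -f .gitmodules --get submodule.child.url)
remote=$(git -C child remote get-url origin)
echo "in .gitmodules:     $recorded"
echo "submodule's origin: $remote"

# .gitmodules now shows as modified, ready to commit and push
git status -s
```

On older Git versions, a similar effect should be achievable by editing the url line in .gitmodules and then running git submodule sync.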

Pushing commits to a Git repository requires authentication, of course. With the SSH URL, the authentication is the SSH key. When using an HTTPS URL, authentication is handled through other means.

In the olden days of GitHub, we would be prompted for a user name and password, and then be good to go. But security needs have dictated change. Today if we try this, we learn that on August 13, 2021, GitHub removed support for password authentication on HTTPS URLs. The blog post for that announcement (on github.blog) says we are supposed to use a Personal Access Token instead.

Fortunately, getting that token is easy, and well documented (on docs.github.com). Once we generate the personal access token, pushing works: the documentation says the token can be used in place of the password that is requested when using the HTTPS URL. We're also instructed to give the token a time limit, and limited access rights.
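Another arrangement that may suit developers who already have SSH keys is Git's url.<base>.pushInsteadOf setting, which leaves the HTTPS URL in place for fetching and rewrites it to an SSH URL only when pushing. This is not what the walkthrough here uses; it is a sketch of an alternative. The repository named demo is a hypothetical stand-in for the checked-out child, and the setting is applied to that one repository rather than globally.

```shell
set -eu
work=$(mktemp -d)
cd "$work"

# A scratch repository standing in for the checked-out child submodule
git init -q demo
cd demo
git remote add origin https://github.com/robogeek/child.git

# Push over SSH whenever the remote URL starts with https://github.com/
git config url."git@github.com:".pushInsteadOf "https://github.com/"

# Fetches keep using HTTPS; pushes are rewritten to SSH
fetch_url=$(git remote get-url origin)
push_url=$(git remote get-url --push origin)
echo "fetch: $fetch_url"
echo "push:  $push_url"
```

Because each submodule has its own configuration, the setting would need to be applied inside the submodule, or with git config --global if you want it everywhere.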

How to push changes in Submodule to its repository

We've just been talking about pushing commits to a submodule repository, without discussing how to do that. Silly us, jumping ahead of ourselves.

In a correctly cloned parent repository, where the child repository is correctly checked out, type these commands:

$ cd child
$ touch about.html
$ git add .
$ git commit -a -m 'Add about page'
$ git push
fatal: You are not currently on a branch.
To push the history leading to the current (detached HEAD)
state now, use

    git push origin HEAD:<name-of-remote-branch>

The last command fails with a somewhat inscrutable message. It is telling you that the submodule repository is in a detached HEAD state: Git checked out the specific commit recorded in the parent, rather than a branch, so there is no current branch to push.

To learn how to fix this, let's start with a freshly checked out repository.

$ git clone  --recurse-submodules  git@github.com:robogeek/parent.git parent5

This sets up a new clone of the repository, with all submodules checked out.

Try this:

$ cd parent5/child
$ git pull
You are not currently on a branch.

This confirms that the newly initialized repository is not on a branch, and is in the detached head state. On the GitHub/GitLab/etc repository, determine the name of the main branch. This might be main or master or something else.

$ git checkout main
Previous HEAD position was a7192d1 Initial revision
Switched to branch 'main'
Your branch is up to date with 'origin/main'.

We've checked out the main branch, which is the same as the main branch on the GitHub repository. This updates the HEAD of the repository, and indicates it is copacetic with origin/main. All is good.

$ touch news.html
$ git add .
$ git commit -a -m 'Add news page'
[main 6a1f930] Add news page
 1 file changed, 0 insertions(+), 0 deletions(-)
 create mode 100644 news.html
$ git push
Enumerating objects: 3, done.
Counting objects: 100% (3/3), done.
Delta compression using up to 4 threads
Compressing objects: 100% (2/2), done.
Writing objects: 100% (2/2), 254 bytes | 254.00 KiB/s, done.
Total 2 (delta 0), reused 0 (delta 0), pack-reused 0
To github.com:robogeek/child.git
   001f8e0..6a1f930  main -> main

We proceed with adding our new file, and we can simply use git push to send the change to the GitHub server.
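A script can test for the detached HEAD state before committing, using git symbolic-ref, which fails when HEAD does not point at a branch. The following sketch assumes local stand-in repositories, a demo identity, and a child branch named main. It stops short of pushing.

```shell
set -eu
work=$(mktemp -d)
cd "$work"

# Hypothetical identity so commits succeed on an unconfigured machine
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# Local stand-ins for the child and parent repositories (Git 2.28+ for -b)
git init -q -b main child
touch child/child.html
git -C child add .
git -C child commit -q -m 'Initial revision'

git init -q -b main parent
touch parent/main.html
git -C parent add .
git -C parent commit -q -m 'Initial revision'
git -C parent -c protocol.file.allow=always submodule --quiet add "$work/child"
git -C parent commit -q -m 'Add submodule'

# A fresh recursive clone leaves the submodule on a detached HEAD
git -c protocol.file.allow=always clone -q --recurse-submodules "$work/parent" parent5
cd parent5/child

detached=no
if ! git symbolic-ref -q HEAD >/dev/null; then
  detached=yes
  # Attach to the branch we intend to commit on
  git checkout -q main
fi

branch=$(git symbolic-ref --short HEAD)
echo "was detached: $detached, now on branch: $branch"
```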

How to update a Submodule if its repository has changed

In the previous section we pushed a commit to our submodule. Shouldn't we be able to make a new clone of the repository, and see that change?

$ git clone  --recurse-submodules  git@github.com:robogeek/parent.git parent6
...
$ cd parent6/child
$ ls
child.html

Okay, that didn't work as expected. We expected parent6/child to contain the files we pushed to the child repository in the previous step. What's up? This has to do with the SHA-1 commit recorded in the parent repository. If you view the parent repository on GitHub, you'll find the commit hash hasn't changed. Remember that when we clone the repository, each submodule is checked out at the commit whose SHA-1 hash is recorded in the parent, not at the latest commit in the child.

The cure is to run this command in the parent repository:

$ cd ..
$ git submodule update --recursive --remote
Submodule path 'child': checked out '6a1f93010a657c3453ea82b9d69d41462402638c'
$ ls child/
about.html child.html news.html

Now the child repository is updated to the latest commit in its repository. Notice that the SHA-1 has changed.

$ git status -s
 M child

Further, we have a change to commit in the parent repository, which is the updated SHA-1 commit hash.

$ git commit -a -m 'Update submodule references'
[main 1356324] Update submodule references
 1 file changed, 1 insertion(+), 1 deletion(-)
$ git push
Enumerating objects: 3, done.
Counting objects: 100% (3/3), done.
Delta compression using up to 4 threads
Compressing objects: 100% (2/2), done.
Writing objects: 100% (2/2), 314 bytes | 314.00 KiB/s, done.
Total 2 (delta 0), reused 0 (delta 0), pack-reused 0
To github.com:robogeek/parent.git
   b5c3802..1356324  main -> main

This updates the GitHub repository. Let's verify that we can now correctly clone the repository:

$ git clone  --recurse-submodules  git@github.com:robogeek/parent.git parent7
Cloning into 'parent7'...
...
$ ls parent7/child/
about.html child.html news.html

And, indeed, it is correctly checked out.
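The whole cycle can be rehearsed offline. This sketch assumes local stand-in repositories and a demo identity, and it adds the submodule with -b main so that the branch followed by --remote is recorded explicitly. It then makes a new commit in the child, pulls it into the parent with update --remote, and commits the moved reference.

```shell
set -eu
work=$(mktemp -d)
cd "$work"

# Hypothetical identity so commits succeed on an unconfigured machine
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# Local stand-ins for the child and parent repositories (Git 2.28+ for -b)
git init -q -b main child
touch child/child.html
git -C child add .
git -C child commit -q -m 'Initial revision'

git init -q -b main parent
cd parent
touch main.html
git add .
git commit -q -m 'Initial revision'
git -c protocol.file.allow=always submodule --quiet add -b main "$work/child"
git commit -q -m 'Add submodule'
old=$(git rev-parse HEAD:child)

# Meanwhile, the child repository gains a commit
touch "$work/child/news.html"
git -C "$work/child" add .
git -C "$work/child" commit -q -m 'Add news page'

# Move the submodule to the tip of its tracked branch
git -c protocol.file.allow=always submodule --quiet update --recursive --remote
git status -s

# Record the new commit in the parent
git add child
git commit -q -m 'Update submodule references'
new=$(git rev-parse HEAD:child)
echo "recorded commit moved from $old to $new"
```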

Deleting submodule configuration from a Git repository

For the last task to learn about, we might have decided submodules are way more complex than we want to deal with. Or there might be another reason to delete the submodule configuration.

Whatever your reason, let's learn how to remove a submodule from a Git repository, and then commit that change to the GitHub repository.

We should start by examining the traces that store the submodule configuration:

$ cat .gitmodules 
[submodule "child"]
    path = child
    url = git@github.com:robogeek/child.git

$ cat .git/config 
[core]
    ...
[submodule]
    active = .
[remote "origin"]
    url = git@github.com:robogeek/parent.git
    fetch = +refs/heads/*:refs/remotes/origin/*
[branch "main"]
    remote = origin
    merge = refs/heads/main
[submodule "child"]
    url = git@github.com:robogeek/child.git

$ ls .git/modules/
child

There are three things here, the .gitmodules file, the .git/config file, and a directory in .git/modules.

$ git submodule deinit child
Cleared directory 'child'
Submodule 'child' (git@github.com:robogeek/child.git) unregistered for path 'child'

This removes the entry from .git/config and in other ways ensures that git submodule init and other commands will not act on the child.

$ git rm child/
rm 'child'

This removes the directory containing the submodule, and removes the entry from .gitmodules.

At this point .git/modules/child still exists:

$ rm -rf .git/modules/child/

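And, this is the old-school way of removing that directory. Git leaves it in place on purpose: keeping the submodule's Git directory makes it possible to check out earlier commits of the parent, which still reference the submodule, without fetching the child again.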

$ git status -s
M  .gitmodules
D  child

$ git commit -a -m 'Remove child submodule'
[main 48380f1] Remove child submodule
 2 files changed, 4 deletions(-)
 delete mode 160000 child

$ git push
Enumerating objects: 3, done.
Counting objects: 100% (3/3), done.
Delta compression using up to 4 threads
Compressing objects: 100% (2/2), done.
Writing objects: 100% (2/2), 258 bytes | 258.00 KiB/s, done.
Total 2 (delta 0), reused 0 (delta 0), pack-reused 0
To github.com:robogeek/parent.git
   1356324..48380f1  main -> main

We have changes to commit. We commit those changes, and push them to the repository. And, if you go to GitHub you'll find the submodule reference is gone.

Summary

We have learned about an important part of Git which most of us don't use. Using submodules we can cause the contents of one repository to be embedded within another.

There are several possible reasons to do this, which we discussed earlier. It is a potentially powerful tool, but like so many powerful tools it must be used with care.

It is always best to automate administrative processes. In a case like this, all that's required is shell scripts (or the like). The more complex the process, the more important it is to automate it, to limit the chance you'll forget a step or two in an arcane sequence of commands.
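As one example of such automation, the removal steps from the previous section can be wrapped in a small shell function. This is a sketch rather than a hardened tool: the function name remove_submodule is hypothetical, it assumes the submodule's name equals its path (the default), and the demonstration runs against local stand-in repositories with a demo identity.

```shell
set -eu

# Remove a submodule in one step. $1 is the submodule path, run from the
# top of the parent repository. Assumes the submodule name equals its path.
remove_submodule() {
  sm_path=$1
  git submodule --quiet deinit -- "$sm_path"            # unregister from .git/config
  git rm -q -- "$sm_path"                               # drop the entry and its .gitmodules section
  rm -rf "$(git rev-parse --git-dir)/modules/$sm_path"  # delete the stored Git directory
}

# --- demonstration with throwaway repositories ---
work=$(mktemp -d)
cd "$work"

# Hypothetical identity so commits succeed on an unconfigured machine
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

git init -q -b main child
touch child/child.html
git -C child add .
git -C child commit -q -m 'Initial revision'

git init -q -b main parent
cd parent
touch main.html
git add .
git commit -q -m 'Initial revision'
git -c protocol.file.allow=always submodule --quiet add "$work/child"
git commit -q -m 'Add submodule'

remove_submodule child
git commit -q -m 'Remove child submodule'
git log --oneline
```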


About the Author(s)

David Herron : David Herron is a writer and software engineer focusing on the wise use of technology. He is especially interested in clean energy technologies like solar power, wind power, and electric cars. David worked for nearly 30 years in Silicon Valley on software ranging from electronic mail systems, to video streaming, to the Java programming language, and has published several books on Node.js programming and electric vehicles.