; Date: Fri Dec 10 2021
Sometimes we need to use the contents of one Git repository inside another. A typical example is using the source for a shared library in multiple applications. With Git Submodules, we configure one or more other repositories to check out as child repositories.
Using Git submodules will increase the complexity of your project. Before heading down this road, you should be certain the result will be worth the complexity. As they say:
A programmer had a version control problem and said, “I know, I’ll use submodules.” Now they have two problems.
That's a riff on a famous quip about regular expressions. Regular expressions are extremely useful, but you must be certain they are the correct solution. Similarly, Git submodules are extremely useful, but you must be certain they are the correct solution for your problem.
With the submodules feature, you're effectively embedding a child Git repository into a parent repository. The files of the child repository appear within the file tree of the parent. But changes in the child repository are tracked by the child rather than parent repository. The only record in the parent repository is the SHA commit identifier of the commit to use in the child.
What can you do with a Git repository containing or embedding one or more other repositories?
At its core, the submodules feature lets you construct a directory hierarchy containing data (source files) from multiple locations. The subsections, a.k.a. submodules, of this hierarchy are independently managed, with each being versioned on its own schedule. Each submodule can be used by multiple parent projects. Each parent project has control over which release of each subsidiary project is being used, and each parent project can push changes to the subsidiary projects.
Those attributes are a rough summary of the advantages. But, as said earlier, this comes with increased complexity. Two examples are:
- Anyone who clones the parent repository must remember to run a Git command to cause the submodule repositories to also be checked out.
- If there are new commits in the repository for a submodule, the submodule SHA commit reference recorded in the parent will not be updated until you run a Git command for that purpose.
Why should one use Git submodules?
I've used Git for many years without needing to know about submodules. I'm sure most of us might go for a whole software development career without using something like submodules. Using them adds complexity, so it had better be worth your time.
Why did I decide to look into using Submodules? It was an issue with the source code repository for the AkashaCMS website.
That website documents the AkashaCMS family of tools, a static website generator platform. There are many, many modules involved with AkashaCMS, each of which has documentation in the corresponding repository. The akashacms/akashacms-website repository holds a portion of the content appearing on akashacms.com, and the rest is spread across the repositories for each AkashaCMS package. The documentation for each package is therefore tracked in parallel with its changes. But getting all that documentation to show on akashacms.com requires somehow bringing everything together into one source tree.
In other words, constructing the akashacms.com website requires assembling into one directory tree the content of a dozen or more Git repositories.
In the past, this was handled by embedding the documentation inside each npm package. That meant the documentation could be found inside the node_modules directory tree. But when I decided to remove the documentation files from the npm packages, to reduce the package size, it meant finding another solution. That question led to Git submodules, and here we are.
A more typical case for Git submodules is a team developing a set of applications that all need to include the same library (or libraries). For example, someone might have created a really good MP3 library that's used in multiple audio player applications. Rather than rely on the MP3 library being installed as a shared library on any computer where the app is installed, the application developers might choose to directly embed the library into their compiled application.
Stated generally, one or more application authors might directly embed a given library in their applications.
Such application authors could easily be working on code in both the shared library, and in the application, at the same time. They might find it most convenient to configure the shared library repository as a submodule, so it's all in the same managed source tree, and every developer working on the application would be on the same page.
There are powerful reasons to use submodules. Make sure you understand them, and make sure they're what you need.
With that in mind, let's get started on how to use Git submodules.
Adding a child Git repository using Submodules
Remember that a Git submodule is a reference to another Git repository. That reference is stored in a parent repository.
To add a Git submodule reference to a parent Git repository use this command:
$ git submodule add GIT-URL-FOR-REPOSITORY path/to/submodule
The GIT-URL can be a local filesystem reference, or any other URL supported by Git. The typical choice will be between HTTPS and SSH URLs.
path/to/submodule is optional, and specifies the directory where the submodule will land. If this is omitted, the directory name is derived from the last component of the URL, minus any .git suffix. For example if the URL is email@example.com:akashacms/akashacms-external-links.git then the default module directory name is akashacms-external-links. Supplying path/to/submodule overrides the default with whatever directory name you like.
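To see the default and explicit directory names side by side, here is a minimal sketch that runs entirely in a temporary directory. The names shared-lib.git, app, and vendor/lib are made up for illustration, and a local filesystem path stands in for the remote URL. Recent Git releases refuse local-path submodule clones by default, so the sketch assumes it is acceptable to pass -c protocol.file.allow=always inside this throwaway sandbox.

```shell
set -eu
work=$(mktemp -d)   # throwaway sandbox, not a real project
cd "$work"
# Commit identity for the sandbox only (hypothetical values).
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.invalid
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.invalid

# Hypothetical library repository with one commit; it stands in for a remote.
git init -q shared-lib.git
git -C shared-lib.git symbolic-ref HEAD refs/heads/main
touch shared-lib.git/lib.txt
git -C shared-lib.git add lib.txt
git -C shared-lib.git commit -q -m 'Initial revision'

# Hypothetical parent repository with one commit of its own.
git init -q app
cd app
git symbolic-ref HEAD refs/heads/main
touch main.html
git add main.html
git commit -q -m 'Initial revision'

# No path given: the directory name comes from the URL, minus ".git".
git -c protocol.file.allow=always submodule add "$work/shared-lib.git"
# Explicit path: the same repository lands in the directory we name.
git -c protocol.file.allow=always submodule add "$work/shared-lib.git" vendor/lib

git submodule status   # lists both shared-lib and vendor/lib
```

With a real hosted repository you would drop the -c option and use the HTTPS or SSH URL instead of the local path.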
To work with this feature in practice, let's create a pair of repositories on GitHub, or GitLab, or your favorite Git service. Name one parent and the other child. It will help if you initialize each with a README when you do so.
$ git clone firstname.lastname@example.org:robogeek/parent.git
Cloning into 'parent'...
...
$ cd parent
$ git checkout -b main
$ touch main.html
$ git add .
$ git commit -a -m 'Initial revision'
$ git push
Start with the parent repository. Use the correct URL for the repository you created. Notice that I used the SSH URL, because that's the easiest method to be able to push changes to the repository. The remaining steps make sure there is at least one file committed and pushed to the repository, and that the main branch is named main.
For the child repository perform the same steps in a sibling directory. In the child repository, instead of creating main.html, create child.html, so we can quickly tell them apart. That gives us two repositories, with a few files in each.
Next, change your directory to the parent directory, and run this command:
$ cd ../parent
$ git submodule add email@example.com:robogeek/child.git
This adds a new submodule to the parent repository. The submodule references the repository at the named URL. Notice that I again used an SSH URL, and that you should use the correct URL for your child repository. We'll later discuss why SSH URLs are being used here, and when you should use HTTPS URLs.
This command creates a file named .gitmodules:
$ cat .gitmodules
[submodule "child"]
    path = child
    url = firstname.lastname@example.org:robogeek/child.git
You'll see it is a simple data file giving particulars of the submodule which was added. There are also entries in .git/config supporting submodules. A directory, child, was created, and in that directory are the contents of that repository.
We can check the status:
$ git submodule status
 a7192d1b027f624fe250aa53a746b194f88ec72b child (heads/main)
The first part is the SHA-1 of the currently checked out commit for the submodule. The second part is the path to the module. The part in parentheses is how git describe names that commit. Remember this SHA-1 identifier because we'll be seeing it a few times in subsequent examples.
If the submodule you're adding needs to be on a specific branch, then add the -b branch option like so:
$ git submodule add -b branch-name \
      email@example.com:robogeek/child.git
The branch name is recorded in .gitmodules.
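As a quick check of where the branch ends up, the following sketch adds a submodule with -b and reads the setting back. It runs in a temporary directory against a local stand-in for the child repository; the branch name release is invented for the example, and -c protocol.file.allow=always is only there because recent Git versions block local-path submodule clones by default.

```shell
set -eu
work=$(mktemp -d)   # throwaway sandbox
cd "$work"
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.invalid
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.invalid

# Local stand-in for the child repository, with an extra branch.
git init -q child
git -C child symbolic-ref HEAD refs/heads/main
touch child/child.html
git -C child add child.html
git -C child commit -q -m 'Initial revision'
git -C child branch release   # hypothetical branch name

# Local stand-in for the parent repository.
git init -q parent
cd parent
git symbolic-ref HEAD refs/heads/main
touch main.html
git add main.html
git commit -q -m 'Initial revision'

# Add the submodule on the chosen branch.
git -c protocol.file.allow=always submodule add -b release "$work/child"

# The branch is stored alongside path and url in .gitmodules.
git config -f .gitmodules submodule.child.branch
cat .gitmodules
```

Because the setting lives in .gitmodules, it is shared with everyone who clones the parent once that file is committed.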
The last interesting thing to notice is in .git/modules where you'll find a new directory, child. Go into that directory and you'll find it is a clone of the child Git repository. Run git log and you'll find the HEAD of that repository is the same commit just shown.
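The claim that the parent records nothing but a commit SHA can be checked directly. The sketch below, which again uses local stand-in repositories in a temporary directory (an assumption made so it runs without network access), compares three views of the same commit: the entry staged in the parent's index, the HEAD of the checked-out submodule, and the HEAD of the clone kept under .git/modules.

```shell
set -eu
work=$(mktemp -d)   # throwaway sandbox
cd "$work"
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.invalid
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.invalid

git init -q child
git -C child symbolic-ref HEAD refs/heads/main
touch child/child.html
git -C child add child.html
git -C child commit -q -m 'Initial revision'

git init -q parent
cd parent
git symbolic-ref HEAD refs/heads/main
touch main.html
git add main.html
git commit -q -m 'Initial revision'
git -c protocol.file.allow=always submodule add "$work/child"

# Mode 160000 marks a "gitlink": an index entry that stores only a commit SHA.
git ls-files --stage child

recorded=$(git ls-files --stage child | awk '{print $2}')   # SHA in the parent
actual=$(git -C child rev-parse HEAD)                       # SHA checked out in child/
stored=$(git --git-dir=.git/modules/child rev-parse HEAD)   # SHA in .git/modules/child
echo "recorded=$recorded"
echo "actual=$actual"
echo "stored=$stored"
```

All three values should be identical, which is what lets the parent pin an exact revision of the child.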
Committing Git Submodule configuration to the repository
We've created a parent repository with a child repository as a submodule. We can repeat this as many times as we like, bringing in other submodule repositories. But, now what? In particular, how do we share this submodule configuration with other members of our team?
The git submodule command records data about submodules in a file named .gitmodules. This file needs to be committed to the repository, along with the other entries listed by the git status -s command. The status for both .gitmodules and child is the A state, meaning they are newly added.
This means those two files are ready to be committed and pushed to the parent repository. That is exactly what's required for sharing the submodule configuration to the origin repository, and from there with other team members.
$ git commit -a -m 'Add submodule'
$ git push
This commits the two entries, .gitmodules and child, and pushes them to the repository. Go back to your browser and inspect the parent repository. You'll find there is a .gitmodules file, and a directory named child. But, unlike regular directories in Git repositories, its listing reads child @ a7192d1. That hex string is the SHA-1 code for the commit in the child repository. Notice that this code matches the leading digits of the SHA-1 shown earlier. It denotes that the child directory is in fact a reference to a specific commit in the child repository.
Cloning a Git repository which has Submodules
We've added a submodule to our repository, and shared the submodule configuration to the Git server. Our fellow developers should be able to just clone the repository and be ready to go, right? Sorry, no.
To see this try these commands:
$ git clone firstname.lastname@example.org:robogeek/parent.git parent2
...
$ cd parent2
$ ls
child main.html
$ ls child
That is, clone the repository to a new name, to mimic a colleague cloning the repository. You'll find the files from the main repository are checked out, but not those of the child submodule. The child directory exists, but it is empty.
The easiest solution is, when performing the initial git clone, to add the --recurse-submodules option, like so:
$ git clone --recurse-submodules email@example.com:robogeek/parent.git parent3
Cloning into 'parent3'...
remote: Enumerating objects: 6, done.
remote: Counting objects: 100% (6/6), done.
remote: Compressing objects: 100% (4/4), done.
Receiving objects: 100% (6/6), done.
remote: Total 6 (delta 0), reused 6 (delta 0), pack-reused 0
Submodule 'child' (firstname.lastname@example.org:robogeek/child.git) registered for path 'child'
Cloning into '/Volumes/Extra/akasharender/t/parent3/child'...
remote: Enumerating objects: 3, done.
remote: Counting objects: 100% (3/3), done.
remote: Total 3 (delta 0), reused 3 (delta 0), pack-reused 0
Receiving objects: 100% (3/3), done.
Submodule path 'child': checked out 'a7192d1b027f624fe250aa53a746b194f88ec72b'
This time the git clone messages talk about checking out the submodules. Check in parent3/child and you'll see the files are there, and notice that the SHA-1 code shown here still matches the one we saw earlier.
Notice this command cloned into parent3. Therefore, the parent2 directory is as we left it. With it, we can think about another scenario, in which you had an existing clone of the workspace, and needed to update it to include new submodules that had been added. In other words, how do you update the submodules in an existing repository, without rerunning git clone?
To find out let's go into parent2 (which is not updated) and run a command.
$ cd ../parent2
$ git submodule update --init --recursive child
Submodule 'child' (email@example.com:robogeek/child.git) registered for path 'child'
Cloning into '/Volumes/Extra/akasharender/t/parent2/child'...
Submodule path 'child': checked out 'a7192d1b027f624fe250aa53a746b194f88ec72b'
This is one method, using the submodule update command with these two options. The last argument is the path of the submodule. If the submodule path is not specified the update operation will instead run for every submodule.
But, what about those two command line options? The first, --init, makes sure the submodules have been correctly initialized. The second, --recursive, makes sure to check out all submodules, including submodules of submodules.
Using --init is short-hand for running git submodule init first, so the single command is equivalent to running these two commands separately:
$ git submodule init
$ git submodule update --recursive
The submodule update command causes the registered submodules to be updated to match what's expected in the parent repository. It does this by cloning any missing submodules and checking out the files at the SHA-1 commit hash recorded in the parent.
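The before-and-after of this step can be reproduced without a Git server. The sketch below builds a local parent and child in a temporary directory (hypothetical stand-ins for the hosted repositories), makes a plain clone, and then populates the submodule. The -c protocol.file.allow=always option is an assumption needed only because the stand-in URLs are local paths.

```shell
set -eu
work=$(mktemp -d)   # throwaway sandbox
cd "$work"
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.invalid
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.invalid

git init -q child
git -C child symbolic-ref HEAD refs/heads/main
touch child/child.html
git -C child add child.html
git -C child commit -q -m 'Initial revision'

git init -q parent
git -C parent symbolic-ref HEAD refs/heads/main
touch parent/main.html
git -C parent add main.html
git -C parent commit -q -m 'Initial revision'
git -C parent -c protocol.file.allow=always submodule add "$work/child"
git -C parent commit -q -m 'Add submodule'

# A plain clone: the child directory is created but left empty.
git clone -q parent parent2
[ -z "$(ls -A parent2/child)" ] && echo 'child/ is empty after a plain clone'

# Register and check out every submodule in the existing clone.
cd parent2
git -c protocol.file.allow=always submodule update --init --recursive
ls child   # child.html is now present
```

Leaving off the path argument, as done here, applies the update to every submodule rather than a single one.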
The difference between HTTPS and SSH URLs in Submodules
So far we've used SSH URLs for the submodule repositories. This sort of URL requires more authentication for the user of the repository. A user of SSH URLs must have their SSH key registered with the remote Git server, and if not they'll get error messages.
$ git clone firstname.lastname@example.org:robogeek/parent.git
Cloning into 'parent'...
Could not create directory '/var/lib/jenkins/.ssh'.
The authenticity of host 'github.com (188.8.131.52)' can't be established.
ECDSA key fingerprint is SHA256:p2QAMXNIC1TJYWeIOttrVc98/R1BUFWu3/LiyKgUfQM.
Are you sure you want to continue connecting (yes/no/[fingerprint])? yes
Failed to add the host to the list of known hosts (/var/lib/jenkins/.ssh/known_hosts).
email@example.com: Permission denied (publickey).
fatal: Could not read from remote repository.
Please make sure you have the correct access rights
and the repository exists.
This was performed by a user ID which did not have SSH keys, and was certainly not registered with the Git server. This means its nonexistent SSH key cannot be validated, and in general that user ID is prohibited from cloning the repository. That's what the messages show us.
Your repository might solely support developers who have SSH authentication, and who will be committing changes to the repositories. If so, it is appropriate to require SSH key authentication.
But if the repository is to be shared with the public, who obviously have not registered their SSH key with your repository, you must use a different Git URL which allows unauthenticated cloning. The simplest is to use an HTTPS URL.
$ git clone https://github.com/robogeek/parent.git
Cloning into 'parent'...
remote: Enumerating objects: 6, done.
remote: Counting objects: 100% (6/6), done.
remote: Compressing objects: 100% (4/4), done.
remote: Total 6 (delta 0), reused 6 (delta 0), pack-reused 0
Unpacking objects: 100% (6/6), 546 bytes | 182.00 KiB/s, done.
$ cd parent/
$ ls
child main.html
$ ls child/
But, the submodules are not checked out. If we try to update the submodules:
$ git submodule update --init --recursive child
...
firstname.lastname@example.org: Permission denied (publickey).
fatal: Could not read from remote repository.
...
We get the same error messages as before. The issue is the SSH URL used in the submodule configuration.
Some repositories, like akashacms/akashacms-website, must be open for anyone to clone the repository with no prior setup. Hence, HTTPS URLs are required for submodules of such sites.
If the submodule had been set up this way:
$ git submodule add https://github.com/robogeek/child.git
Then the submodule configuration instead has HTTPS URLs, and the submodule update command will execute correctly. But this creates a different problem if you want to push commits from the child repository to its origin.
Pushing commits to a Git repository requires authentication, of course. With the SSH URL, the authentication is the SSH key. When using an HTTPS URL, authentication is handled through other means.
In the olden days of GitHub, we would be prompted for a user-name and password, and then be good to go. But, security needs have dictated change. Today if we try this, we learn that on August 13, 2021, GitHub removed support for password authentication on HTTPS URLs. The blog post for that announcement says we are supposed to use a Personal Access Token instead.
Fortunately getting that token is easy, and well documented. The documentation says the token can be used in place of the password that is requested when using the HTTPS URL, and as soon as we generate the personal access token, pushing works. We're also instructed to give the token a time limit, and limited access rights.
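One possible way to soften this trade-off, offered as a sketch rather than as what the AkashaCMS repositories actually do, is Git's url.<base>.pushInsteadOf setting. It leaves the HTTPS URL in place for fetching, so anyone can clone, while rewriting the URL to SSH for pushes on machines where you have SSH keys. The example only inspects how URLs are rewritten and never contacts a server; the demo repository is a throwaway, and the SSH prefix is the usual GitHub form.

```shell
set -eu
work=$(mktemp -d)   # throwaway sandbox
cd "$work"
git init -q demo
cd demo

# The remote is configured with the public HTTPS URL.
git remote add origin https://github.com/robogeek/child.git

# Repository-local rule: push over SSH wherever the URL starts with the
# HTTPS prefix. Use "git config --global" to apply it to every repository.
git config url."git@github.com:".pushInsteadOf "https://github.com/"

git remote get-url origin          # fetch URL: still HTTPS
git remote get-url --push origin   # push URL: rewritten to SSH
```

Since this is local configuration and not part of .gitmodules, each developer who wants to push would need to set it themselves; people who only clone are unaffected.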
How to push changes in Submodule to its repository
We've just been talking about pushing commits to a submodule repository, without discussing how to do that. Silly us, jumping ahead of ourselves.
In a correctly cloned parent repository, where the child repository is correctly checked out, type these commands:
$ cd child
$ touch about.html
$ git add .
$ git commit -a -m 'Add about page'
$ git push
fatal: You are not currently on a branch.
To push the history leading to the current (detached HEAD)
state now, use
    git push origin HEAD:<name-of-remote-branch>
The last command fails with a somewhat inscrutable message. This message is telling you that the submodule repository is in a detached HEAD state. That is, it has a specific commit checked out rather than a branch.
To learn how to fix this, let's start with a freshly checked out repository.
$ git clone --recurse-submodules email@example.com:robogeek/parent.git parent5
This sets up a new clone of the repository, with all submodules checked out.
$ cd parent5/child
$ git pull
You are not currently on a branch.
This confirms that the newly initialized repository is not on a branch, and is in the detached HEAD state. On the GitHub/GitLab/etc repository, determine the name of the main branch. This might be main, master, or something else.
$ git checkout main
Previous HEAD position was a7192d1 Initial revision
Switched to branch 'main'
Your branch is up to date with 'origin/main'.
We've checked out the main branch, which is the same as the main branch on the GitHub repository. This updates the HEAD of the repository, and indicates it is copacetic with origin/main. All is good.
$ touch news.html
$ git add .
$ git commit -a -m 'Add news page'
[main 6a1f930] Add news page
 1 file changed, 0 insertions(+), 0 deletions(-)
 create mode 100644 news.html
$ git push
Enumerating objects: 3, done.
Counting objects: 100% (3/3), done.
Delta compression using up to 4 threads
Compressing objects: 100% (2/2), done.
Writing objects: 100% (2/2), 254 bytes | 254.00 KiB/s, done.
Total 2 (delta 0), reused 0 (delta 0), pack-reused 0
To github.com:robogeek/child.git
   001f8e0..6a1f930  main -> main
We proceed with adding our new file, and we can simply use git push to send the change to the GitHub server.
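The whole sequence, from detached HEAD to a successful push, can be rehearsed locally. In this sketch a bare repository named child.git plays the role of the child's origin on the Git server, and seed is a scratch repository used only to give it a first commit; both names are invented for the example, as is the use of -c protocol.file.allow=always to permit local-path submodule clones.

```shell
set -eu
work=$(mktemp -d)   # throwaway sandbox
cd "$work"
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.invalid
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.invalid

# Bare repository standing in for the child's origin.
git init -q seed
git -C seed symbolic-ref HEAD refs/heads/main
touch seed/child.html
git -C seed add child.html
git -C seed commit -q -m 'Initial revision'
git clone -q --bare seed child.git

# Parent repository that records child.git as a submodule.
git init -q parent
git -C parent symbolic-ref HEAD refs/heads/main
touch parent/main.html
git -C parent add main.html
git -C parent commit -q -m 'Initial revision'
git -C parent -c protocol.file.allow=always submodule add "$work/child.git" child
git -C parent commit -q -m 'Add submodule'

# A fresh clone, as a colleague would make it.
git -c protocol.file.allow=always clone -q --recurse-submodules parent parent5
cd parent5/child
git symbolic-ref -q HEAD || echo 'detached HEAD: no branch is checked out'

git checkout -q main   # attach HEAD to a branch before committing
touch news.html
git add news.html
git commit -q -m 'Add news page'
git push -q origin main
```

Checking out the branch before making changes avoids the situation above, where a commit was created on a detached HEAD and could not be pushed with a plain git push.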
How to update a Submodule if its repository has changed
In the previous section we pushed a commit to our submodule. Shouldn't we be able to make a new clone of the repository, and see that change?
$ git clone --recurse-submodules firstname.lastname@example.org:robogeek/parent.git parent6
...
$ cd parent6/child
$ ls
child.html
Okay, that didn't work as expected. We expected parent6/child to contain the files we pushed to the child repository in the previous step. What's up? This has to do with the SHA-1 commit recorded in the parent repository. If you view the parent repository on GitHub, you'll find the commit hash hasn't changed. Remember that when we clone the repository, each submodule is checked out at the SHA-1 commit hash recorded in the parent.
The cure is to run this command in the parent directory:
$ cd ..
$ git submodule update --recursive --remote
Submodule path 'child': checked out '6a1f93010a657c3453ea82b9d69d41462402638c'
$ ls child/
about.html child.html news.html
The child repository is updated to the latest commit in its repository. Notice that the SHA-1 has changed.
$ git status -s
 M child
Further, we have a change to commit in the parent repository, which is the updated SHA-1 commit hash.
$ git commit -a -m 'Update submodule references'
[main 1356324] Update submodule references
 1 file changed, 1 insertion(+), 1 deletion(-)
$ git push
Enumerating objects: 3, done.
Counting objects: 100% (3/3), done.
Delta compression using up to 4 threads
Compressing objects: 100% (2/2), done.
Writing objects: 100% (2/2), 314 bytes | 314.00 KiB/s, done.
Total 2 (delta 0), reused 0 (delta 0), pack-reused 0
To github.com:robogeek/parent.git
   b5c3802..1356324  main -> main
This updates the GitHub repository. Let's verify that we can now correctly clone the repository:
$ git clone --recurse-submodules email@example.com:robogeek/parent.git parent7
Cloning into 'parent7'...
...
$ ls parent7/child/
about.html child.html news.html
And, indeed, it is correctly checked out.
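Here is the same update cycle as a self-contained sketch: a new commit lands in the child's origin, the parent pulls it in with --remote, and the new SHA is committed in the parent. The repositories are local stand-ins in a temporary directory (seed and child.git are invented names). The sketch adds the submodule with -b main so that --remote has an explicit branch to follow; that choice, like -c protocol.file.allow=always, is an assumption made to keep the example predictable across Git versions.

```shell
set -eu
work=$(mktemp -d)   # throwaway sandbox
cd "$work"
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.invalid
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.invalid

# Bare repository standing in for the child's origin.
git init -q seed
git -C seed symbolic-ref HEAD refs/heads/main
touch seed/child.html
git -C seed add child.html
git -C seed commit -q -m 'Initial revision'
git clone -q --bare seed child.git

# Parent records the child at its first commit.
git init -q parent
git -C parent symbolic-ref HEAD refs/heads/main
touch parent/main.html
git -C parent add main.html
git -C parent commit -q -m 'Initial revision'
git -C parent -c protocol.file.allow=always submodule add -b main "$work/child.git" child
git -C parent commit -q -m 'Add submodule'

# Later, a new commit is pushed to the child's origin.
touch seed/news.html
git -C seed add news.html
git -C seed commit -q -m 'Add news page'
git -C seed push -q "$work/child.git" main

# Bring the submodule up to the newest commit and record it in the parent.
cd parent
git -c protocol.file.allow=always submodule update --remote --recursive
git status -s            # " M child": the recorded SHA differs from the checkout
git commit -q -a -m 'Update submodule references'
git ls-tree HEAD child   # the gitlink now names the new commit
```

Until the final commit is pushed, other clones of the parent keep checking out the older child revision.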
Deleting submodule configuration from a Git repository
For the last task to learn about, we might have decided submodules are way more complex than we want to deal with. Or there might be another reason to delete the submodule configuration.
Whatever your reason, let's learn how to remove a submodule from a Git repository, and then commit that change to the GitHub repository.
We should start by examining the traces that store the submodule configuration:
$ cat .gitmodules
[submodule "child"]
    path = child
    url = firstname.lastname@example.org:robogeek/child.git
$ cat .git/config
[core]
    ...
[submodule]
    active = .
[remote "origin"]
    url = email@example.com:robogeek/parent.git
    fetch = +refs/heads/*:refs/remotes/origin/*
[branch "main"]
    remote = origin
    merge = refs/heads/main
[submodule "child"]
    url = firstname.lastname@example.org:robogeek/child.git
$ ls .git/modules/
child
There are three things here: the .gitmodules file, the .git/config file, and a directory in .git/modules.
$ git submodule deinit child
Cleared directory 'child'
Submodule 'child' (email@example.com:robogeek/child.git) unregistered for path 'child'
This removes the entry from .git/config and in other ways ensures that git submodule init and other commands will not act on the child.
$ git rm child/
rm 'child'
This removes the directory containing the submodule, and removes the entry from .gitmodules.
At this point .git/modules/child still exists:
$ rm -rf .git/modules/child/
And, this is the old school way of removing that directory. Git leaves it in place on purpose, so that checking out an older commit which still references the submodule does not require fetching it again.
$ git status -s
M  .gitmodules
D  child
$ git commit -a -m 'Remove child submodule'
[main 48380f1] Remove child submodule
 2 files changed, 4 deletions(-)
 delete mode 160000 child
$ git push
Enumerating objects: 3, done.
Counting objects: 100% (3/3), done.
Delta compression using up to 4 threads
Compressing objects: 100% (2/2), done.
Writing objects: 100% (2/2), 258 bytes | 258.00 KiB/s, done.
Total 2 (delta 0), reused 0 (delta 0), pack-reused 0
To github.com:robogeek/parent.git
   1356324..48380f1  main -> main
We have changes to commit. We commit those changes, and push them to the repository. And, if you go to GitHub you'll find the submodule reference is gone.
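The removal steps fit naturally into a short script. The sketch below runs them end to end against local stand-in repositories in a temporary directory; the repository names and the -c protocol.file.allow=always option are assumptions made so the example runs without a Git server, and the push is omitted because the sandbox has no remote.

```shell
set -eu
work=$(mktemp -d)   # throwaway sandbox
cd "$work"
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.invalid
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.invalid

git init -q child
git -C child symbolic-ref HEAD refs/heads/main
touch child/child.html
git -C child add child.html
git -C child commit -q -m 'Initial revision'

git init -q parent
cd parent
git symbolic-ref HEAD refs/heads/main
touch main.html
git add main.html
git commit -q -m 'Initial revision'
git -c protocol.file.allow=always submodule add "$work/child"
git commit -q -m 'Add submodule'

# Step 1: unregister the submodule in .git/config and empty its directory.
git submodule deinit child
# Step 2: remove the gitlink entry and its section in .gitmodules.
git rm -q child
# Step 3: delete the cached clone that Git keeps under .git/modules.
rm -rf .git/modules/child
# Step 4: record the removal in the parent.
git commit -q -a -m 'Remove child submodule'

git submodule status   # prints nothing: no submodules remain
```

Skipping step 3 is harmless for correctness; it only leaves the cached clone on disk.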
We have learned about an important part of Git which most of us don't use. Using submodules we can cause the contents of one repository to be embedded within another.
There are several possible reasons to do this, which we discussed earlier. It is a potentially powerful tool, but like so many powerful tools it must be used with care.
It is always best to automate administrative processes. In a case like this, all that's required is shell scripts (or the like). The more complex the process, the more important it is to automate it, to limit the chance you'll forget a step or two in an arcane sequence of commands.