Git Submodules vs Git Subtrees

The number one issue I’ve seen when people start using Git is dealing with submodules in existing projects. Recently I’ve been considering moving everything to subtrees, but I don’t see that as a direct replacement. In this post I explain why.

Why use Submodules or Subtrees?

Every organisation has code that is shared between projects, and submodules and subtrees prevent us from duplicating code across those projects, avoiding the many problems that arise if we have multiple versions of the same code.

Subtrees vs Submodules

The simplest way to think of subtrees and submodules is that a subtree is a copy of a repository that is pulled into a parent repository while a submodule is a pointer to a specific commit in another repository.

This difference means that it is trivial to push updates back to a submodule, because we’re just pushing commits back to the original repository that is pointed to, but more complex to push updates back to a subtree, because the parent repository has no knowledge of the origin of the contents of the subtree.

It also means that subtrees are much easier for other people to come and pull, as they are just part of the parent repository.

So an ultra-dumbed-down ELI5 comparison of submodules to subtrees could be:

  • Submodules are easier to push but harder to pull – This is because they are pointers to the original repository
  • Subtrees are easier to pull but harder to push – This is because they are copies of the original repository

I will elaborate on this, so pardon the simplification.

A brief overview of git submodules

Adding a submodule

If I wanted to add a submodule to an existing git repository I’d run something like this:

$ git submodule add https://github.com/mowen/awesomelib lib/awesomelib
Cloning into 'lib/awesomelib'...
remote: Counting objects: 11, done.
remote: Compressing objects: 100% (10/10), done.
remote: Total 11 (delta 0), reused 11 (delta 0)
Unpacking objects: 100% (11/11), done.
Checking connectivity... done.

If I then ran git status I’d see this:

$ git status
On branch master
Your branch is up-to-date with 'origin/master'.

Changes to be committed:
  (use "git reset HEAD <file>..." to unstage)

    new file:   .gitmodules
    new file:   lib/awesomelib

The .gitmodules file has been created, and its contents will be:

[submodule "lib/awesomelib"]
      path = lib/awesomelib
      url = https://github.com/mowen/awesomelib

So the three key consequences of the submodule add are:

  1. The .gitmodules file has been added in the root of the repository, containing the path and URL for the added submodule.
  2. The lib/awesomelib folder now contains a full clone of the https://github.com/mowen/awesomelib repository. With one key difference…
  3. The .git folder for the submodule repository has been added in the .git/modules folder at .git/modules/lib/awesomelib rather than lib/awesomelib/.git. The location lib/awesomelib/.git contains a file with a single line gitdir: ../../.git/modules/lib/awesomelib pointing to the real .git folder (the nested repository’s alternative to a full-blown .git folder).
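The three consequences above can be checked by hand. This is a minimal sketch that uses a throwaway local repository as a stand-in for the awesomelib remote, so the temporary paths and the demo identity are assumptions of the example, not part of the original setup:

```shell
set -eu
# Throwaway identity so the commits work on any machine.
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
work=$(mktemp -d)
cd "$work"

# Hypothetical stand-in for the awesomelib remote: a local repo with one commit.
git -c init.defaultBranch=master init -q awesomelib
git -C awesomelib commit -q --allow-empty -m "Initial library commit"

git -c init.defaultBranch=master init -q parent
cd parent
git commit -q --allow-empty -m "Initial parent commit"
# protocol.file.allow is only needed because the "remote" here is a local path.
git -c protocol.file.allow=always submodule add "$work/awesomelib" lib/awesomelib >/dev/null 2>&1

cat .gitmodules                      # 1. path and url of the submodule
cat lib/awesomelib/.git              # 3. a one-line file pointing at the real .git folder
git ls-files --stage lib/awesomelib  # mode 160000: a pointer to a commit, not a folder of files

gitdir_line=$(cat lib/awesomelib/.git)
entry_mode=$(git ls-files --stage lib/awesomelib | cut -d' ' -f1)
```

The 160000 mode is how the parent repository records "a commit in another repository" rather than ordinary file contents.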

Both the advantage and the disadvantage of submodules is that they can and should be treated as repositories in their own right. They need to be committed to separately, and can be branched separately. The lib/awesomelib directory in the example above should be treated as nothing more than a pointer to a particular SHA-1 in another repository.

You may already be able to see some of the issues that can occur if you ignore the fact that the submodule needs to be kept up to date:

  • Changes to the parent could be committed and pushed without having committed and pushed the changes to the submodule.
  • If a collaborator has modified and pushed changes to a submodule but you haven’t run git submodule update to update the submodule on your machine to their latest version, you may run git add -A and downgrade to your out of date version.

Pulling from a submodule

This is just a case of:

  1. Changing directory to the submodule repository
  2. Pulling from the remote
  3. Moving up again to the root of the parent repository
  4. Committing the pointer to the new HEAD commit of the submodule

Any changes from the last committed submodule commit will be listed as modified, and can be included in the next commit to the parent repository.
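The four steps above could look something like this. It is a sketch against throwaway local repositories (the temporary paths, demo identity and commit messages are all made up for the example):

```shell
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
work=$(mktemp -d)
cd "$work"

# Hypothetical stand-in for the awesomelib remote.
git -c init.defaultBranch=master init -q awesomelib
git -C awesomelib commit -q --allow-empty -m "Initial library commit"

git -c init.defaultBranch=master init -q parent
cd parent
git commit -q --allow-empty -m "Initial parent commit"
git -c protocol.file.allow=always submodule add "$work/awesomelib" lib/awesomelib >/dev/null 2>&1
git commit -q -m "Add awesomelib submodule"

# Someone else publishes a new library commit.
git -C "$work/awesomelib" commit -q --allow-empty -m "Upstream change"

cd lib/awesomelib                      # 1. change directory to the submodule
git pull -q --ff-only origin master    # 2. pull from the remote
cd ../..                               # 3. move back up to the root of the parent
git status --short                     #    lib/awesomelib is listed as modified
git add lib/awesomelib                 # 4. commit the pointer to the new submodule HEAD
git commit -q -m "Update awesomelib to latest master"

recorded=$(git rev-parse HEAD:lib/awesomelib)
upstream=$(git -C "$work/awesomelib" rev-parse HEAD)
```

After the last commit the parent records the same SHA-1 that the library's master points to.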

Pushing to a submodule

The only difference between making changes to code within a submodule directory and a regular directory is that we must commit and push to the submodule repository before then moving up a directory and committing the pointer to the new submodule commit and pushing that to the remote of the parent repository.

This needs a more detailed example, so I’ll start by adding a file to the submodule folder:

$ cd lib/awesomelib
$ touch hello.txt
$ git status
HEAD detached at 2c81f4f
Untracked files:
  (use "git add <file>..." to include in what will be committed)

    hello.txt

nothing added to commit but untracked files present (use "git add" to track)

When the contents of a submodule folder have been modified they appear as a single line if we run git status in the parent repository:

$ cd ..
$ git status
On branch master
Your branch is up-to-date with 'origin/master'.
Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git checkout -- <file>..." to discard changes in working directory)
  (commit or discard the untracked or modified content in submodules)

    modified:   lib/awesomelib (untracked content)

no changes added to commit (use "git add" and/or "git commit -a")

This output from git status can be confusing, because it looks like only a single file has changed, when in fact there could be massive changes within the submodule directory.

If I see a modified submodule directory and I haven’t modified it myself, I tend to run git submodule update to ensure that the checked out code for the submodule is the version it’s expected to be.

If you don’t do that, you are likely to end up committing the incorrect version of the submodule that is present in your working copy.
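One way to spot this situation is git submodule status, which prefixes a submodule with "+" when the checked-out commit differs from the one the parent has recorded. The sketch below fakes a stale working copy in throwaway local repositories (all paths and commit messages are assumptions of the example) and then repairs it with git submodule update:

```shell
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
work=$(mktemp -d)
cd "$work"

# Hypothetical library with two commits.
git -c init.defaultBranch=master init -q awesomelib
git -C awesomelib commit -q --allow-empty -m "Library v1"
git -C awesomelib commit -q --allow-empty -m "Library v2"

git -c init.defaultBranch=master init -q parent
cd parent
git commit -q --allow-empty -m "Initial parent commit"
git -c protocol.file.allow=always submodule add "$work/awesomelib" lib/awesomelib >/dev/null 2>&1
git commit -q -m "Add awesomelib submodule"
recorded=$(git rev-parse HEAD:lib/awesomelib)   # the commit the parent expects

# Simulate a stale working copy by checking out the older library commit.
git -C lib/awesomelib checkout -q HEAD~1
git submodule status lib/awesomelib             # line starts with "+"
before=$(git submodule status lib/awesomelib | cut -c1)

# Check out the recorded commit again.
git submodule update lib/awesomelib >/dev/null 2>&1
after=$(git -C lib/awesomelib rev-parse HEAD)
```

Running the status check before any git add -A is a cheap way to avoid committing a downgrade by accident.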

As the changes in this example are deliberate, we should commit them, by changing directory to lib/awesomelib to commit our changes, and then pushing them:

$ cd lib/awesomelib
$ git add -A
$ git status
HEAD detached at 2c81f4f
Changes to be committed:
  (use "git reset HEAD <file>..." to unstage)

    new file:   hello.txt
$ git commit -m "Test file."
[detached HEAD 6498362] Test file.
 1 file changed, 0 insertions(+), 0 deletions(-)
 create mode 100644 hello.txt

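Ignore the “detached HEAD” for now. It’s not ideal (in practice you would check out a branch such as master before committing, so that the commit sits on a branch that can be pushed), but it’s not relevant to this example.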

So I’ve created a new commit in the submodule, but I haven’t yet pushed. If I move up a directory, I will then be back in the parent repository, and I will see that the submodule has a new commit:

$ cd ..
$ git status
On branch master
Your branch is up-to-date with 'origin/master'.
Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git checkout -- <file>..." to discard changes in working directory)

    modified:   lib/awesomelib (new commits)

no changes added to commit (use "git add" and/or "git commit -a")

There’s nothing to stop me from committing this change in the parent, even though I haven’t pushed the submodule change to the remote. So I need to make sure that after a submodule commit I also push:

$ cd lib/awesomelib
$ git push origin master
Counting objects: 62, done.
Delta compression using up to 8 threads.
Compressing objects: 100% (40/40), done.
Writing objects: 100% (62/62), 11.63 KiB | 0 bytes/s, done.
Total 62 (delta 22), reused 58 (delta 21)
To https://github.com/mowen/awesomelib

Now I’m safe to commit the submodule change in the parent repository:

$ cd ..
$ git status
On branch master
Your branch is up-to-date with 'origin/master'.
Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git checkout -- <file>..." to discard changes in working directory)

    modified:   lib/awesomelib (new commits)

no changes added to commit (use "git add" and/or "git commit -a")
$ git add -A
$ git status
On branch master
Your branch is up-to-date with 'origin/master'.
Changes to be committed:
  (use "git reset HEAD <file>..." to unstage)

   modified:   lib/awesomelib
$ git commit -m "Test file."
[master 0297f84] Test file.
 1 file changed, 1 insertion(+), 1 deletion(-)

And push it as normal:

$ git push origin master
Counting objects: 3, done.
Delta compression using up to 8 threads.
Compressing objects: 100% (3/3), done.
Writing objects: 100% (3/3), 310 bytes | 0 bytes/s, done.
Total 3 (delta 2), reused 0 (delta 0)
To https://github.com/mowen/parentrepo

That may seem quite convoluted, but we are dealing with two separate repositories, so there is always going to be twice as much work.

The order in which you commit and push changes when working with submodules is so important that I consider it the golden rule of modifying submodules…

The golden rule of modifying submodules

Always commit and push the submodule changes first, before then committing the submodule change in the parent repository.

As mentioned above, a submodule is nothing but a pointer to a specific commit in an external repository, so how can you possibly commit and push a reference to that pointer if it doesn’t exist on a server somewhere, accessible by everyone’s parent repositories?

Without following this rule you can get into a confusing state in which the parent repository is pointing to a submodule commit that only exists on your local machine. The tooling should warn about this and reject the push, but I haven’t seen it happen yet.
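Git does have an opt-in check for this: git push --recurse-submodules=check refuses to push the parent when it references submodule commits that are not on any of the submodule's remotes. The sketch below shows it with throwaway local repositories; the bare "remotes", paths and commit messages are stand-ins invented for the example:

```shell
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
work=$(mktemp -d)
cd "$work"

# Hypothetical local stand-ins for the two remotes (bare, so they accept pushes).
git -c init.defaultBranch=master init -q libsrc
git -C libsrc commit -q --allow-empty -m "Initial library commit"
git clone -q --bare libsrc awesomelib.git
git -c init.defaultBranch=master init -q --bare parentrepo.git

git -c init.defaultBranch=master init -q parent
cd parent
git remote add origin "$work/parentrepo.git"
git -c protocol.file.allow=always submodule add "$work/awesomelib.git" lib/awesomelib >/dev/null 2>&1
git commit -q -m "Add awesomelib submodule"
git push -q origin master

# Break the golden rule: commit in the submodule, skip its push, commit the pointer.
git -C lib/awesomelib commit -q --allow-empty -m "Unpushed library change"
git add lib/awesomelib
git commit -q -m "Point at unpushed submodule commit"

if git push --recurse-submodules=check origin master >/dev/null 2>&1; then
  first_push=pushed
else
  first_push=rejected
fi
echo "parent push before submodule push: $first_push"

# Follow the golden rule: push the submodule first, then the parent.
git -C lib/awesomelib push -q origin master
if git push --recurse-submodules=check origin master >/dev/null 2>&1; then
  second_push=pushed
else
  second_push=rejected
fi
echo "parent push after submodule push: $second_push"
```

To make the check permanent for a repository, git config push.recurseSubmodules check should have the same effect; the value on-demand pushes the submodule for you instead of rejecting.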

Issues with Submodules

Issues with submodules tend to arise due to the poor tooling. As mentioned, I’ve found that it is necessary to manually run a git submodule update each time I pull updates and find that a submodule has been updated, and it’s also necessary when switching between branches. You can tell if it’s been updated because a clean checkout will say that the submodule has been modified.

If you don’t notice that you need to update the submodule, all it takes is a lazy git add -A or git commit -a and you’ve downgraded the submodule to the version you’ve had in your working copy all along. This stale submodule can cause the entire project to get into a mess.

If you define an alias which runs git submodule update after every single git pull then you will be safe, but a newbie is unlikely to do this.
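Such an alias might look like the following. The name pullsub is made up for this sketch, and it is set in a throwaway repository here; adding --global to the git config call would make it available everywhere:

```shell
set -eu
work=$(mktemp -d)
cd "$work"
git init -q demo
cd demo

# "pullsub" is a hypothetical alias name; the leading "!" runs the rest as a shell command.
git config alias.pullsub '!git pull && git submodule update --init --recursive'
git config --get alias.pullsub

# Newer versions of Git can do much the same without an alias:
#   git pull --recurse-submodules
#   git config submodule.recurse true
alias_value=$(git config --get alias.pullsub)
```

With the alias in place, git pullsub replaces git pull in day-to-day use.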

A brief overview of git subtrees

Adding a subtree

The following call to git subtree will be roughly equivalent to the git submodule command above:

$ git subtree add --prefix lib/awesomelib https://github.com/mowen/awesomelib master --squash
git fetch https://github.com/mowen/awesomelib master
warning: no common commits
remote: Counting objects: 11, done.
remote: Compressing objects: 100% (10/10), done.
remote: Total 11 (delta 0), reused 11 (delta 0)
Unpacking objects: 100% (11/11), done.
Resolving deltas: 100% (7/7), done.
From https://github.com/mowen/awesomelib
 * branch            master     -> FETCH_HEAD
Added dir 'lib/awesomelib'

This will clone the remote repository into the lib/awesomelib folder, and create two commits for it.

The first is the squashing down of the entire history of the remote repository that we are cloning:

commit 70a0b8b8e2c76d9bcfd00f8f935d11941d2937d8
Author: Martin Owen <martinowenuk@gmail.com>
Date:   Sat Apr 9 19:50:49 2016 +0100

    Squashed 'lib/awesomelib/' content from commit d3abff6

    git-subtree-dir: lib/awesomelib
    git-subtree-split: d3abff6e5307227858d5323cf8aaf108c542ad2b

The second is a merge commit, whose message includes the SHA-1 of the squashed commit:

commit df09e101ac1bcb1e6d48cb4ab6b28c707b5b0402
Merge: cc78b8d 70a0b8b
Author: Martin Owen <martinowenuk@gmail.com>
Date:   Sat Apr 9 19:50:49 2016 +0100

    Merge commit '70a0b8b8e2c76d9bcfd00f8f935d11941d2937d8' as 'lib/awesomelib'

If I run git status, I’ll see nothing, as git subtree will have created the commits for me and left the working copy clean. Also there will be nothing in lib/awesomelib to indicate that the folder ever came from another git repository. And as with submodules, this is both an advantage and a disadvantage.
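The two commits and the clean working copy can be reproduced locally. This sketch assumes git subtree is installed (it ships in Git's contrib directory) and uses a throwaway local repository with a made-up README in place of the real awesomelib:

```shell
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
work=$(mktemp -d)
cd "$work"

# Hypothetical stand-in for the awesomelib remote, with one file in it.
git -c init.defaultBranch=master init -q awesomelib
printf 'awesome\n' > awesomelib/README
git -C awesomelib add README
git -C awesomelib commit -q -m "Initial library commit"

git -c init.defaultBranch=master init -q parent
cd parent
git commit -q --allow-empty -m "Initial parent commit"
git subtree add --prefix lib/awesomelib "$work/awesomelib" master --squash >/dev/null 2>&1

git log --format='%h | %s'   # the merge commit, the squashed commit, the initial commit
git status --short           # prints nothing: the working copy is clean

commit_count=$(git rev-list --count HEAD)
merge_count=$(git rev-list --merges --count HEAD)
```

The files under lib/awesomelib are ordinary tracked files in the parent, which is why a plain clone of the parent is all a collaborator needs.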

Pulling from a subtree

Pulling changes from the remote to the subtree isn’t difficult at all, and is very similar to the add:

$ git subtree pull --prefix lib/awesomelib https://github.com/mowen/awesomelib master --squash

You should be able to see that the parameters are exactly the same as for the add; we’ve just changed the command to pull. The command will also create a similar set of commits to the earlier add.

So far so good.

Pushing to a subtree

Things get really tricky when we need to push commits back to the original repository. This is understandable because our repository has no knowledge of the original repository, and has to figure out how to prepare the changes so that they can be applied to the remote before it can push.

$ git subtree push --prefix lib/awesomelib https://github.com/mowen/awesomelib master
git push using:  https://github.com/mowen/awesomelib master
Counting objects: 3, done.
Delta compression using up to 8 threads.
Compressing objects: 100% (3/3), done.
Writing objects: 100% (3/3), 325 bytes | 0 bytes/s, done.
Total 3 (delta 2), reused 0 (delta 0)
To https://github.com/mowen/awesomelib
   2c81f4f..f0a54ff  f0a54ff7151a05ae9408a45daba88164bd4ab8cd -> master

In my experience how long this takes to run depends on the amount of history in the parent repository, your OS, and your machine. I’ve seen it take so long when running the command in a large repository on Windows that I had to give up and go back to using submodules, but I’ve found it to work more quickly on OS X.

The implementation is visible at: https://github.com/git/git/blob/master/contrib/subtree/git-subtree.sh and the split command (run as part of a push) is what takes significant time, but I’ve not been able to determine exactly why.
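To try the push on something small before running it against a large repository, a full round trip can be sketched with throwaway local repositories. This assumes git subtree is installed; the bare awesomelib.git "remote", the README and the paths are all invented for the example:

```shell
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
work=$(mktemp -d)
cd "$work"

# Hypothetical library plus a bare copy that plays the role of its remote.
git -c init.defaultBranch=master init -q libsrc
printf 'awesome\n' > libsrc/README
git -C libsrc add README
git -C libsrc commit -q -m "Initial library commit"
git clone -q --bare libsrc awesomelib.git

git -c init.defaultBranch=master init -q parent
cd parent
git commit -q --allow-empty -m "Initial parent commit"
git subtree add --prefix lib/awesomelib "$work/awesomelib.git" master --squash >/dev/null 2>&1

# Change the subtree and commit it in the parent like any other file.
touch lib/awesomelib/hello.txt
git add lib/awesomelib/hello.txt
git commit -q -m "Test file."

# push runs split to rebuild library-shaped commits from the parent history, then pushes them.
git subtree push --prefix lib/awesomelib "$work/awesomelib.git" master >/dev/null 2>&1

pushed_files=$(git --git-dir="$work/awesomelib.git" ls-tree --name-only master)
echo "$pushed_files"
```

Because split has to walk the parent's history to find the commits that touched the prefix, it is plausible that the cost grows with the size of that history, which would fit the slow pushes described above.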

Issues with Subtrees

After so many issues with submodules I had high hopes for subtrees, but was quite disappointed. For a start there is very little documentation. This text file is the best official documentation I’ve found, and everything else I know has come from either Stack Overflow or blog posts.

My other main issue is with the slow push speeds on Windows that I have mentioned; I’ve found them to be so bad that they have made subtrees unviable for me.

Summary

In my opinion subtrees are not a direct replacement for submodules. The way I believe you should split your shared code between subtrees and submodules is this:

  • Is the external repository something you own yourself and are likely to push code back to? Then use a submodule. This gives you the quickest and easiest way for you to push your changes back.
  • Is the external repository third party code that you are unlikely to push anything back to? Then use a subtree. This gives the advantage of not having to give people permissions to an extra repo when you are giving them access to the code base, and also reduces the chance that someone will forget to run a git submodule update.

If you think I’m a complete idiot who has totally misunderstood and misrepresented submodules or subtrees, please let me know in the comments.