A Nibble of Git's Object Store

Power and efficiency through content-addressable storage and delta compression

Nibble: a small piece of food bitten off. In computing: half a byte of information. Every nibble explains a computing science or software engineering idea or system in five minutes.

Git, created by Linus Torvalds in 2005, has become ubiquitous. This nibble describes the architecture of Git's object store, where Git stores your files, directories, and commits, right there in the .git folder. The underlying ideas work together beautifully.

[Image: a face made out of gears. Image by author via Stable Diffusion.]

A content-addressable object store

Git's object store keeps arbitrary pieces of data, called objects. Objects are just bytes - the store doesn't care about the format of the data.

The store has two operations. You can store an object using git hash-object -w <filename>. This command calculates the SHA-1 hash of the file's content, prefixed with a short header that records the object type and size, and stores the compressed object in the object store using the hash as the key. You can retrieve the object using git cat-file -p <hash>.

For example:

$ echo "some text" > file.txt

$ git hash-object -w file.txt
7b57bd29ea8afbdeb9bac64cf7074f4b531492a8

$ git cat-file -p 7b57bd29ea8afbdeb9bac64cf7074f4b531492a8
some text

Even if you use Git every day, you probably haven't used these commands. You can use them to store pretty much anything in a Git object store.

The idea of storing values by their hash is called content-addressable storage, which also made an appearance in A Nibble of Content-Defined Chunking. It's a powerful idea because it allows deduplication. If you add two identical files to Git, or rename or move an existing file, the file's content is only stored once.
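To make the idea concrete, here is a minimal Python sketch of a content-addressable store. The key is computed the way Git computes a blob's hash: SHA-1 over a header ("blob", the size, a NUL byte) followed by the content. The ToyObjectStore class and its in-memory dictionary are hypothetical stand-ins for the real .git/objects directory, not Git's implementation.

```python
import hashlib
import zlib


def git_object_id(data: bytes, obj_type: str = "blob") -> str:
    """Return the SHA-1 key Git uses for `data` stored as `obj_type`."""
    header = f"{obj_type} {len(data)}".encode() + b"\x00"
    return hashlib.sha1(header + data).hexdigest()


class ToyObjectStore:
    """In-memory stand-in for .git/objects (hypothetical, for illustration)."""

    def __init__(self) -> None:
        self._objects = {}  # maps hex hash -> zlib-compressed object bytes

    def put(self, data: bytes, obj_type: str = "blob") -> str:
        key = git_object_id(data, obj_type)
        # Identical content always maps to the same key, so it is stored once.
        if key not in self._objects:
            header = f"{obj_type} {len(data)}".encode() + b"\x00"
            self._objects[key] = zlib.compress(header + data)
        return key

    def get(self, key: str) -> bytes:
        raw = zlib.decompress(self._objects[key])
        _header, _, body = raw.partition(b"\x00")  # split at the first NUL
        return body

    def __len__(self) -> int:
        return len(self._objects)


store = ToyObjectStore()
first = store.put(b"some text\n")
second = store.put(b"some text\n")  # same content, e.g. a copied or renamed file
print(first == second, len(store))
print(store.get(first))
```

Storing the same bytes twice yields the same key and a single entry, which is the deduplication described above. You can compare the key computed for the content of file.txt with the hash that git hash-object printed earlier.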

Blob, tree, and commit objects

Git's more familiar commands, like git add and git commit, support common source control operations, like committing files and reverting files to a previous state. They all work by reading and writing to the object store. Git uses a few different types of objects to track file and directory contents, and metadata like commit messages.

First, blob objects hold the contents of files. The content of file.txt above was stored as a blob. Blobs do not store the name of the file, or any other information about it.

Second, tree objects represent a directory. A tree object lists the names of the files and subdirectories it contains, along with their file modes. For each file, it additionally lists the hash of the blob object where the file's content is stored. For each subdirectory, it lists the hash of another tree object. Similar to a blob, which is a snapshot of a file, a tree object allows Git to recreate a snapshot of a directory.

Third, commit objects store information such as author, date, and commit message, the hash of a tree object that represents the snapshot of the directory at that commit, and the hashes of any parent commits.

When committing a single file to an empty repository, Git creates one object of each type in the store:

$ git add -A

$ git commit -m "first commit"
[master (root-commit) 212c57e] first commit
 1 file changed, 1 insertion(+)
 create mode 100644 file.txt

Note that Git shows the abbreviated hash of the commit object. The commit object points to a tree object, which points to a blob object. We can inspect the objects using git cat-file:

$ git cat-file -p 212c57e
tree fc5a436ddf54bd82f2da31dde8898cc56b51ee7a
author Kurt <email> 1670074483 +0000
committer Kurt <email> 1670074483 +0000

first commit

$ git cat-file -p fc5a
100644 blob 7b57bd29ea8afbdeb9bac64cf7074f4b531492a8    file.txt

$ git cat-file -p 7b57
some text
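The chain from commit to tree to blob can be sketched in Python by serializing each object and hashing it. This is a simplified illustration: the helper names are made up for this example, tree entries are sorted by plain name (Git's real ordering treats directory names specially), and the author line is copied from the output above with its elided email left as-is.

```python
import hashlib


def object_id(obj_type: str, body: bytes) -> str:
    """Hash an object body with Git's '<type> <size>\\0' header in front."""
    header = f"{obj_type} {len(body)}".encode() + b"\x00"
    return hashlib.sha1(header + body).hexdigest()


def tree_body(entries: list) -> bytes:
    """Serialize (mode, name, hex_hash) entries as a tree object body."""
    out = b""
    for mode, name, hex_hash in sorted(entries, key=lambda e: e[1]):
        # Each entry is "<mode> <name>", a NUL byte, then the 20-byte raw hash.
        out += f"{mode} {name}".encode() + b"\x00" + bytes.fromhex(hex_hash)
    return out


def commit_body(tree: str, parents: list, author: str, message: str) -> bytes:
    """Serialize a commit body; the same identity is used as author and committer."""
    lines = [f"tree {tree}"]
    lines += [f"parent {p}" for p in parents]
    lines += [f"author {author}", f"committer {author}", "", message]
    return ("\n".join(lines) + "\n").encode()


blob = object_id("blob", b"some text\n")
tree = object_id("tree", tree_body([("100644", "file.txt", blob)]))
commit = object_id(
    "commit",
    commit_body(tree, [], "Kurt <email> 1670074483 +0000", "first commit"),
)
print(blob, tree, commit, sep="\n")
```

Because each object embeds the hashes of the objects it points to, changing a single file changes its blob hash, then the hash of every tree above it, and finally the commit hash.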

As you add commits, files and directories over time, the objects create a directed acyclic graph, or DAG, which holds the history in commit objects, directories in tree objects, and file contents in blob objects.

Here's an animation to illustrate. On the left is a checked-out directory, which contains just a README initially. Then two changes are committed. On the right is the commit graph. Every vertex in the graph corresponds to one object in the store - red vertices are commits, green are trees, and blue are blobs. Thanks to content-based hashing, each commit reuses as many objects as possible from previous commits.

On-disk layout of the store

Objects in the object store are kept in the .git/objects directory. Each object is stored as a separate file, and the SHA-1 hash is encoded in the path. The .git/objects directory is a shallow tree structure:

$ tree .git/objects
.git/objects
├── 21
│   └── 2c57e34401cc2c6a36191bdda3d8d3a71de4ec
├── 7b
│   └── 57bd29ea8afbdeb9bac64cf7074f4b531492a8
├── fc
│   └── 5a436ddf54bd82f2da31dde8898cc56b51ee7a

The first two hexadecimal characters of each key are a directory name, and the remaining 38 make up the file name. Concatenate them to get the hash of the object.

Why not store the objects as a flat list of files in .git/objects? That spells trouble if there are a large number of objects. Some file systems are slow to list directories with large numbers of files. Many file systems have a hard limit on the number of files in one directory. To sidestep these problems, Git uses a shallow tree instead.
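The layout is simple enough to reproduce in a few lines of Python. The sketch below writes and reads loose objects in a scratch directory, not a real repository. The function names are hypothetical, and it assumes the loose object format described earlier: a zlib-compressed header plus content, keyed by SHA-1.

```python
import hashlib
import tempfile
import zlib
from pathlib import Path


def loose_object_path(objects_dir: Path, hex_hash: str) -> Path:
    """First two hex characters name the directory, the rest name the file."""
    return objects_dir / hex_hash[:2] / hex_hash[2:]


def write_loose_object(objects_dir: Path, obj_type: str, body: bytes) -> str:
    raw = f"{obj_type} {len(body)}".encode() + b"\x00" + body
    hex_hash = hashlib.sha1(raw).hexdigest()
    path = loose_object_path(objects_dir, hex_hash)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(zlib.compress(raw))
    return hex_hash


def read_loose_object(objects_dir: Path, hex_hash: str):
    """Return (type, body) for a loose object, checking the recorded size."""
    raw = zlib.decompress(loose_object_path(objects_dir, hex_hash).read_bytes())
    header, _, body = raw.partition(b"\x00")
    obj_type, size = header.decode().split(" ")
    if int(size) != len(body):
        raise ValueError("object size does not match its header")
    return obj_type, body


with tempfile.TemporaryDirectory() as tmp:
    objects = Path(tmp) / "objects"
    key = write_loose_object(objects, "blob", b"some text\n")
    print(loose_object_path(objects, key).relative_to(objects))
    print(read_loose_object(objects, key))
```

With two hexadecimal characters per directory name there are at most 256 subdirectories, so each one holds roughly 1/256 of the loose objects.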

Delta compression in pack files

As explained, git hash-object stores the content of a whole file in an object. Thanks to content-based addressing, files with identical content are stored only once, regardless of their name or location in the directory tree. But what about nearly identical files? When you edit a file, large parts of it usually remain unchanged, yet the new version gets a different hash and is stored as a complete new object. As a result, the object store will amass lots of duplicate data over time. Git has one last trick up its sleeve to deal with this.

When you first commit files that are nearly identical to existing files (or add them using git hash-object -w), they are simply added as new objects. Git uses the term loose objects for objects stored in .git/objects as separate files.

Here's a demonstration - first, let's add a larger file file.txt to a new Git repository, to make it easier to see what's going on.

$ head -c 10K </dev/urandom > file.txt

$ git add file.txt

$ git commit -m "Initial commit"
[master (root-commit) d946bf7] Initial commit
 1 file changed, 0 insertions(+), 0 deletions(-)
 create mode 100644 file.txt

$ tree -h .git/objects/
.git/objects/
├── [4.0K]  7e
│   └── [  53]  83c181b79bb609bae262bb5fcd2a12407d9a32
├── [4.0K]  c2
│   └── [ 10K]  d27d0d659b876b59ca77023b063f9cdf736dbd
├── [4.0K]  d9
│   └── [ 136]  46bf71a38c30d03e088f2ddfe320cdd0850a85
├── [4.0K]  info
└── [4.0K]  pack

As before, the new file adds three objects: a commit, a tree and a blob (c2d27). After a small change to the file, Git stores three entirely new objects:

$ echo "new line" >> file.txt

$ git add file.txt

$ git commit -m "Appended a line"
[master f70c00e] Appended a line
 1 file changed, 0 insertions(+), 0 deletions(-)

$ tree -h .git/objects/
.git/objects/
├── [4.0K]  01
│   └── [ 10K]  d35f8a800ada8dfccfd54a5615624614bf9656
├── [4.0K]  7e
│   └── [  53]  83c181b79bb609bae262bb5fcd2a12407d9a32
├── [4.0K]  c2
│   └── [ 10K]  d27d0d659b876b59ca77023b063f9cdf736dbd
├── [4.0K]  c9
│   └── [  53]  73906cc48da970575264bc9725a1d80042f82f
├── [4.0K]  d9
│   └── [ 136]  46bf71a38c30d03e088f2ddfe320cdd0850a85
├── [4.0K]  f7
│   └── [ 167]  0c00eec3a5741d5f04e31e5274597b2b39bc0c
├── [4.0K]  info
└── [4.0K]  pack

Even though the new version of file.txt is almost the same as the previous one, there are two 10KB blob objects in the store (c2d27 from before and 01d35).

To deal with this, Git periodically packs objects together in a pack file. Pack files end up in .git/objects/pack. The loose object files are deleted after packing, as they are no longer necessary.

When packing, Git searches for nearly identical objects and stores them using delta compression in the pack file. Delta compression stores differences, or deltas, between objects instead of complete snapshots. If a delta is smaller than the object it replaces, this saves space.

Git uses delta compression by picking a base object, which is stored in its entirety. Nearly identical objects are then stored as a series of “copy bytes from the base object” and “insert new bytes” operations. Git does not try every possible pairing. It uses heuristics to compare each object against a limited window of similar candidates and keeps the base that gives the smallest delta.
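The following Python sketch shows the principle with two kinds of operations: copy a range of bytes from the base object, or insert literal bytes. It is an illustration only. It uses difflib from the standard library to find matching ranges, which is not the algorithm Git uses, and the delta is a plain Python list rather than Git's compact binary encoding.

```python
from difflib import SequenceMatcher


def make_delta(base: bytes, target: bytes) -> list:
    """Describe `target` as copy/insert operations relative to `base`."""
    ops = []
    matcher = SequenceMatcher(None, base, target, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Bytes shared with the base: store only an offset and a length.
            ops.append(("copy", i1, i2 - i1))
        elif tag in ("replace", "insert"):
            # Bytes that exist only in the target: store them literally.
            ops.append(("insert", target[j1:j2]))
        # "delete": bytes that exist only in the base are simply not copied.
    return ops


def apply_delta(base: bytes, ops: list) -> bytes:
    """Rebuild the target object from the base object and the delta."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            _, offset, length = op
            out += base[offset:offset + length]
        else:
            out += op[1]
    return bytes(out)


old = b"The quick brown fox jumps over the lazy dog.\n" * 20
new = old + b"new line\n"

delta = make_delta(new, old)  # store the old version relative to the new one
print(delta)
print(apply_delta(new, delta) == old)
```

When the target shares long runs of bytes with the base, the delta consists mostly of small copy operations and is far smaller than the target itself.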

Git packs:

  1. before you push, to make data transfer efficient;

  2. when the number of loose objects in the .git/objects directory reaches a threshold;

  3. when you trigger it manually using git gc.

Let's run git gc:

$ git gc
Enumerating objects: 6, done.
Counting objects: 100% (6/6), done.
Delta compression using up to 12 threads
Compressing objects: 100% (4/4), done.
Writing objects: 100% (6/6), done.
Total 6 (delta 1), reused 0 (delta 0)

$ tree -h .git/objects/
.git/objects/
└── [4.0K]  pack
    ├── [1.2K]  pack-f2a79ca1900a5c8130748c363392c6cb74c9898d.idx
    └── [ 10K]  pack-f2a79ca1900a5c8130748c363392c6cb74c9898d.pack

All the loose objects are gone, and we now have two files under .git/objects/pack: the pack file itself and an index (.idx) that Git uses to locate objects inside the pack. Since the pack file is only 10KB, Git did a good job of detecting and removing the duplicate content. You now understand what the output of git gc means: Git found 6 objects in the store, used up to 12 threads to delta compress them, and stored one object as a delta against another object (delta 1).

Let's see what's in the pack file:

$ git verify-pack -v .git/objects/pack/pack-f2a79ca1900a5c8130748c363392c6cb74c9898d.pack
f70c00eec3a5741d5f04e31e5274597b2b39bc0c commit 254 160 12
d946bf71a38c30d03e088f2ddfe320cdd0850a85 commit 205 131 172
01d35f8a800ada8dfccfd54a5615624614bf9656 blob   10249 10263 303
c973906cc48da970575264bc9725a1d80042f82f tree   36 47 10566
7e83c181b79bb609bae262bb5fcd2a12407d9a32 tree   36 47 10613
c2d27d0d659b876b59ca77023b063f9cdf736dbd blob   6 17 10660 1 01d35f8a800ada8dfccfd54a5615624614bf9656
non delta: 5 objects
chain length = 1: 1 object
.git/objects/pack/pack-f2a79ca1900a5c8130748c363392c6cb74c9898d.pack: ok

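Great, we still have 6 objects, except now they're stored more efficiently, both in the number of files and the file size. The columns are the object hash, the object type, the size, the size in the pack file, and the offset in the pack file. A delta object has two extra columns: the depth of its delta chain and the hash of its base object. The line for blob c2d27, the previous version of file.txt, shows that it's stored as a delta to 01d35, the latest version of file.txt. The third column makes it clear that only 01d35 is stored in its entirety: its size is 10249 bytes, while the delta for c2d27 is just 6 bytes.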
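If you want to analyze larger packs, the listing is easy to parse. Here is a small Python sketch that turns the output of git verify-pack -v into records. PackEntry and parse_verify_pack are hypothetical names, and the parser assumes 40-character SHA-1 hashes and the column order seen in the output above.

```python
import string
from typing import NamedTuple, Optional


class PackEntry(NamedTuple):
    object_id: str
    obj_type: str
    size: int
    size_in_pack: int
    offset: int
    depth: Optional[int] = None    # only present for delta objects
    base_id: Optional[str] = None  # only present for delta objects


def parse_verify_pack(output: str) -> list:
    """Parse the object lines of `git verify-pack -v`, skipping summary lines."""
    entries = []
    for line in output.splitlines():
        fields = line.split()
        is_hash = (
            len(fields) >= 5
            and len(fields[0]) == 40
            and all(c in string.hexdigits for c in fields[0])
        )
        if not is_hash:
            continue
        entries.append(
            PackEntry(
                object_id=fields[0],
                obj_type=fields[1],
                size=int(fields[2]),
                size_in_pack=int(fields[3]),
                offset=int(fields[4]),
                depth=int(fields[5]) if len(fields) > 5 else None,
                base_id=fields[6] if len(fields) > 6 else None,
            )
        )
    return entries


sample = """\
01d35f8a800ada8dfccfd54a5615624614bf9656 blob   10249 10263 303
c2d27d0d659b876b59ca77023b063f9cdf736dbd blob   6 17 10660 1 01d35f8a800ada8dfccfd54a5615624614bf9656
non delta: 5 objects
"""
for entry in parse_verify_pack(sample):
    kind = "delta" if entry.base_id else "full"
    print(entry.object_id[:5], entry.obj_type, entry.size_in_pack, kind)
```

Summing size_in_pack over all entries gives the space the objects occupy in the pack, which is handy for spotting the largest objects in a repository.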

Recap

Git's object storage system uses the following key ideas:

  1. Content-addressable storage to deduplicate identical content.

  2. Delta compression to deduplicate nearly identical content.

  3. A DAG of commit, tree and blob objects to store the history of directories and files.
