How does Git work?

If you're like me and have less than twenty years of software engineering experience, the thought of a world without Git doesn't seem possible. When I started to research for this post, I almost fell out of my chair when I read that Git was created in 2005. It doesn't seem that long ago ... either that, or I'm simply getting old.

When I started programming, I asked myself a question I sometimes still ask myself today: how does Git work? I often find myself being scared of certain Git commands. It's hard to know for sure whether I should rebase, or maybe merge. When you want to undo something, it's hard to know whether git revert is the best option.

I sometimes ask myself questions like what is the appropriate use case for a force push? There have definitely been a few occasions when a wrong Git command turned into a big deal. So, I decided to bite the bullet and learn what was going on under that magical hood. In this article, you'll learn how Git works, learn some history, and get some practical tips.

A brief history of Git

Git is a distributed version control system, which means that every developer's clone contains the full history of the project rather than a single checked-out version. Teams usually still agree on one shared remote repository, but nothing in Git requires it. Before distributed systems like Git, Subversion (SVN) was the most popular way to manage software changes over time. Unlike Git, Subversion is centralized rather than distributed. With Git, changes are recorded in the local repository first, and developers use git push to update a remote with the commits from their local machine. On top of that, developers use branches and, on hosting platforms, pull requests to keep track of changes to the codebase, making software development much more collaborative.

With SVN, your data is stored on a central server, and any time you check it out, you're checking out a single version of the repository. This presented problems with multiple developers working on the same codebase at the same time, as the copies on each developer's computer can become out of sync. Git has features to help avoid these conflicts, like identifying merge conflicts and aiding in their resolution.

While most of us remember Git as the first distributed version control system, it wasn't actually the first. Before Git, there was BitKeeper, which was a proprietary source control management system. Created in 1998, BitKeeper was spun up to solve some of the growing pains of the Linux project. It offered a free license for open-source projects, with the stipulation that developers could not create a competing tool while using BitKeeper plus one additional year.

With these constraints (on Linux developers!), I'm sure you can guess what happened. In the early-to-mid 2000s, there were a plethora of license complaints, and in 2005, the free version of BitKeeper was removed. This prompted Torvalds to quickly create Git, which he named after a British slang word that means "unpleasant person."

The project was taken over by Junio Hamano (a major contributor) after its original v0.99 release, and Junio remains the core maintainer of the project. Fun fact: new features are still being developed for Git. At the time of writing, the most recent version of Git is 2.46.0, which was released in July 2024.

If you want to read more about BitKeeper, check out its Wikipedia page. The history is interesting even though BitKeeper is no longer being developed.

Understanding Git

While Git has morphed into a full-fledged version control management system over the years, this wasn't the original intent. Linus Torvalds said the following on this topic:

In many ways, you can just see Git as a filesystem -- it's content-addressable, and it has a notion of versioning, but I really designed it coming at the problem from the viewpoint of a filesystem person (hey, kernels is what I do), and I actually have zero interest in creating a traditional SCM (source control management) system.

Side note: In case you're wondering what Linus means by "content-addressable", it is a way to store information so that it can be retrieved based on content rather than the location in the filesystem. Most traditional local and networked storage devices are location-addressed, and Linus is highlighting that Git is not.

Under the hood, Git has two data structures:

a mutable index (i.e., a connection point between the object database and the working tree)
an immutable, append-only object database.

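There are four types of objects in Git:

blob: this is the content of a file
tree: this is the equivalent of a directory; it lists blobs and other trees by name
commit: this points to one tree and to its parent commit(s), which is what forms a history
tag: this is a container that contains a ref to another object, as well as other metadata

Git can also store objects in packfiles. A packfile is not an object type of its own; it is a storage format that bundles many zlib-compressed objects into a single file.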

Each object has a unique name, which is the SHA-1 hash of a short header (the object's type and size) followed by its contents.
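To make the naming concrete, here is a minimal sketch that reproduces a blob's name by hand. It assumes git plus either sha1sum (Linux) or shasum (macOS) on your PATH, and a repository that uses Git's default SHA-1 object format. The temporary file and the sha1_of_stdin helper are names I made up for the demo.

```shell
# Sketch: compute a blob's name without asking Git, then compare.
tmp=$(mktemp)                      # throwaway file, not part of any repository
printf 'this is me\n' > "$tmp"

# The name Git itself computes (no -w, so nothing is stored).
git_hash=$(git hash-object "$tmp")

# Hypothetical helper: SHA-1 of stdin, using whichever tool is installed.
sha1_of_stdin() {
  if command -v sha1sum >/dev/null 2>&1; then sha1sum; else shasum -a 1; fi | cut -d' ' -f1
}

# Git hashes "blob <size in bytes>", a NUL byte, and then the contents.
size=$(wc -c < "$tmp" | tr -d ' ')
manual_hash=$( { printf 'blob %s\0' "$size"; cat "$tmp"; } | sha1_of_stdin )

echo "git:    $git_hash"
echo "manual: $manual_hash"
rm -f "$tmp"
```

Both lines should print the same hash, which is why identical contents always end up with the same object name.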

To better understand how all of this fits together, let's walk through a basic Git workflow. Start by creating a simple example project directory and run git init.

Trying it out

Open your terminal, and create a new directory. Then, run git init. You should then see something similar to the following output:

mkdir understanding-git
cd understanding-git
git init
Initialized empty Git repository in /Users/juliekent/Documents/understanding-git/.git/

I am sure you have done this many times as you begin new projects, but you may not have really cared to know what was actually in the newly created .git directory. Let's check it out!

If you run ls -a via your terminal, you will see the .git directory. By default, it is a hidden directory, which is why you need the -a flag for it to be shown. Move into this hidden directory by running cd .git , then run ls. You should see something like this:

ls
HEAD        config      description hooks       info        objects     refs

For this article, we will be focusing on the HEAD file and the objects and refs directories. We will also run some commands so that we have an index file present, but we will do this later. The description file is only used by the GitWeb program. The config file is pretty straightforward, as it contains project configuration options. The info directory keeps an exclude file for patterns you want to ignore in this repository only. It uses the same syntax as the .gitignore file in the root of the project, which I'm sure most of you are already familiar with, but it is not committed or shared with other clones.

The Git objects directory

Let's start with exploring the objects directory. To see what is created by git init, go back to the project root (cd ..) and run find .git/objects. You should see the following:

find .git/objects
.git/objects
.git/objects/pack
.git/objects/info

Next, let's create a file in the directory by running:

echo 'this is me' > myfile.txt

This command just creates a text file named myfile.txt with the contents "this is me".

Now, let's run the command to get a checksum hash from this file. The -w flag also writes the file's contents into the object database:

git hash-object -w myfile.txt

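Your output should be a 40-character mix of numbers and letters -- this is a SHA-1 checksum hash. It looks random, but it is derived from the file's contents, so the same contents always produce the same hash. If you're not familiar with SHA-1, it's worth investigating what it is and how it's used.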

Next, copy your SHA-1, and run the following command:

git cat-file -p <insert your SHA here>

You should see "this is me", the contents of the text file that you created earlier. Cool! This is how content-addressable Git objects work. You can think of it as a key-value store where the key is the SHA-1, and the value is the contents of the file.

Next, let's dig deeper into how Git works and write some new content to our original file by running:

echo 'this is not me' > myfile.txt

Then, run the hash-object command again to retrieve the hash:

git hash-object -w myfile.txt

You now have two unique SHA-1s, one for each version of this file. If you want further proof, run find .git/objects -type f, and you should see both via your terminal window.
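If you'd like to replay these steps in one go, here is a minimal sketch. It assumes git is installed and uses a scratch directory from mktemp as a stand-in for the project, so it won't touch your real files. The variable names are mine.

```shell
# Sketch: store two versions of a file and look at where Git keeps them.
repo=$(mktemp -d)                  # scratch directory standing in for the project
cd "$repo" || exit 1
git init -q

echo 'this is me' > myfile.txt
first=$(git hash-object -w myfile.txt)
echo 'this is not me' > myfile.txt
second=$(git hash-object -w myfile.txt)

# A loose object lives at .git/objects/<first 2 hash characters>/<the rest>.
first_path=".git/objects/$(printf '%s' "$first" | cut -c1-2)/$(printf '%s' "$first" | cut -c3-)"
ls -l "$first_path"

# Both versions are blobs, and the old contents are still retrievable by hash.
git cat-file -t "$first"
git cat-file -p "$first"
git cat-file -p "$second"

object_count=$(find .git/objects -type f | wc -l | tr -d ' ')
echo "loose objects: $object_count"
```

Notice that overwriting myfile.txt did not overwrite the first object. The object database only ever gains entries, which is what "append-only" means in practice.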

If you'd like to learn more about how other objects in Git work, I recommend following this Git objects tutorial.
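For a quick taste of the other object types, here is a small sketch that makes one commit and then prints the commit and tree objects behind it. It assumes git is installed; the scratch directory and the Example User identity are placeholders so that the commit succeeds on a machine with no Git configuration.

```shell
# Sketch: follow a commit to its tree, and the tree to its blob.
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q
git config user.name 'Example User'        # placeholder identity for the demo
git config user.email 'user@example.com'

echo 'this is me' > myfile.txt
git add myfile.txt
git commit -q -m 'first commit'

# A commit object names one tree, plus author, committer, and message.
commit_type=$(git cat-file -t HEAD)
git cat-file -p HEAD

# A tree lists its entries as: mode, type, hash, file name.
tree_type=$(git cat-file -t 'HEAD^{tree}')
git cat-file -p 'HEAD^{tree}'

# The blob hash recorded in the tree is the same one hash-object computes.
tree_blob=$(git rev-parse 'HEAD:myfile.txt')
file_blob=$(git hash-object myfile.txt)
echo "tree entry: $tree_blob"
echo "file hash:  $file_blob"
```

So a commit never stores file contents itself. It points to a tree, and the tree points to blobs by hash.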

The Git refs directory

Let's move onto refs. When running find .git/refs, you should see the following output:

find .git/refs
.git/refs
.git/refs/heads
.git/refs/tags

As we explained in the previous section about objects, Git gives each object a unique SHA-1 hash. Of course, we could run all of our Git commands using each object's hash, for example git show 123abcd, but this is unreasonable and would require us to remember the hash of every object. Fortunately, Git references (refs) give human-readable names, such as branch and tag names, to those hashes.

A reference is simply a file stored in .git/refs containing the hash of a commit object. Let's go ahead and commit our myfile.txt, so we can better understand how refs work. Go ahead and run git add myfile.txt and then commit the staged changes with git commit -m 'first commit'. You should see something like this:

git add myfile.txt
git commit -m 'first commit'
[master (root-commit) 40235ba] first commit
 1 file changed, 1 insertion(+)
 create mode 100644 myfile.txt

Now, let's navigate to the .git/refs/heads directory by running cd .git/refs/heads. From there, run cat master to output the file contents. You will see the SHA-1. Finally, run git log -1 master, which should output something similar to the following:

commit Unique SHA-1 (HEAD -> master)
Author: Julie <jkent2910@gmail.com>
Date:   Mon Aug 3 15:59:59 2020 -0500

   first commit

As you see from this example, Git branches are simply references. When we change the location of the master branch, all Git has to do is change the contents of the refs/heads/master file. Likewise, creating a new branch just creates a new reference file containing the hash of the commit the branch starts from.
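Here is a minimal sketch of that idea. It assumes git is installed and that the repository uses Git's default file-based ref storage, where each new ref is a small file. The scratch directory, the Example User identity, and the branch name feature are placeholders of mine.

```shell
# Sketch: a new branch is one small file holding a commit hash.
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q
git config user.name 'Example User'        # placeholder identity for the demo
git config user.email 'user@example.com'
echo 'this is me' > myfile.txt
git add myfile.txt
git commit -q -m 'first commit'

git branch feature                          # hypothetical branch name
branch_file=$(cat .git/refs/heads/feature)  # the file's entire contents
head_commit=$(git rev-parse HEAD)
echo "refs/heads/feature contains: $branch_file"
echo "HEAD resolves to:            $head_commit"

# show-ref lists every reference next to the hash it points to.
git show-ref
```

No files were copied to make the branch, which is why branching in Git is nearly instant.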

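If you want to undo a commit, you can run git revert with the SHA-1 of the commit. The git revert command will create a new commit on the branch with a diff that is exactly the opposite of the original commit. Alternatively, if the commit lives on a feature branch that was cut from master, you could use an interactive rebase to remove the commit altogether. Run the following command from the feature branch and delete the commit's line in the editor that opens. Keep in mind that this rewrites history, so avoid it for commits that other people have already pulled.

git rebase -i master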
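To see the revert behavior in isolation, here is a minimal sketch. It assumes git is installed; the scratch directory and the Example User identity are placeholders for the demo.

```shell
# Sketch: git revert adds a commit, it does not delete one.
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q
git config user.name 'Example User'        # placeholder identity for the demo
git config user.email 'user@example.com'

echo 'this is me' > myfile.txt
git add myfile.txt
git commit -q -m 'first commit'
echo 'this is not me' > myfile.txt
git commit -q -am 'second commit'

# Undo the second commit by applying its opposite as a third commit.
git revert --no-edit HEAD
git log --oneline

commit_count=$(git rev-list --count HEAD)
file_contents=$(cat myfile.txt)
echo "commits: $commit_count, file now says: $file_contents"
```

The file is back to its first version, but the history still shows all three commits, which makes revert the safer choice on branches you share with others.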

Helpful hint: If you ever want to see all references and the hashes they point to, run git show-ref.

What is Git HEAD?

In Git, HEAD is a symbolic reference. You might wonder, when running git branch <branch>, how Git knows the SHA-1 of the last commit. Well, the HEAD file is usually a symbolic reference to your current Git branch.

You might be thinking to yourself, "You keep saying symbolic; what does that mean?" Great question! Symbolic means that it contains a pointer to another reference. If your head is spinning, I'm with you. It took me quite a bit of Googling and reading to finally understand what exactly HEAD is. Here is a great analogy, pulled from this article explaining Git HEAD:

A good analogy would be a record player and the playback and record keys on it as the HEAD. As the audio starts recording, the tape moves ahead, moving past the head by recording onto it. The stop button stops the recording while still pointing to the point it last recorded, and the point that record head stopped is where it will continue to record again when record is pressed again. If we move around, the head pointer moves to different places; however, when Record is pressed again, it starts recording from the point the head was pointing to when Record was pressed.

Go ahead and run: cat .git/HEAD to output the file contents. You should see something like this:

cat .git/HEAD
ref: refs/heads/master

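This makes sense because we are on the master branch. HEAD is, essentially, always going to be the reference to the last commit in the currently checked-out Git branch. The exception is a "detached HEAD", which you get by checking out a specific commit instead of a branch. In that case, the HEAD file contains the commit's hash directly.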
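You can watch the HEAD file change with a short sketch like this one. It assumes git is installed; the scratch directory, the Example User identity, and the branch name feature are placeholders. The default branch is read from Git rather than hard-coded, since it may be main instead of master on your setup.

```shell
# Sketch: .git/HEAD follows you as you switch branches.
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q
git config user.name 'Example User'        # placeholder identity for the demo
git config user.email 'user@example.com'
echo 'this is me' > myfile.txt
git add myfile.txt
git commit -q -m 'first commit'

default_branch=$(git symbolic-ref --short HEAD)
cat .git/HEAD                               # ref: refs/heads/<default branch>

# Switching branches rewrites .git/HEAD to name the other branch.
git checkout -q -b feature
head_on_feature=$(cat .git/HEAD)
echo "on feature: $head_on_feature"

# Checking out a commit directly detaches HEAD: the file now holds a raw hash.
git checkout -q --detach "$default_branch"
head_detached=$(cat .git/HEAD)
echo "detached:   $head_detached"
```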

Helpful Tip: You can run git diff HEAD to view the difference between HEAD and the working directory.

How Git merge works

One of Git's most used collaboration features is branching. It's common for a developer to "cut" a branch off of the master branch when working on a feature, then later merge that code back into the master branch. This sounds pointless on its own, but provides a powerful option when multiple developers are working on a codebase simultaneously. It's common for developers to git push their branch to a remote repository and then merge it through a web interface by opening a pull request.

Two (or more) developers can work on independent features without getting in each other's way because of branches. When one developer is finished, they can merge their branch into the master branch, which adds their commits to it. The other developer can then bring their branch up to date with the master branch, whether by rebase or merge, and then merge their branch into the master branch when they're ready.

Functionally, a git merge is rather simple in the most common case. HEAD points to the current branch, and the branch points to its latest commit. If the master branch has no new commits since the feature branch was cut, running git merge performs a "fast-forward": Git simply moves the master branch's reference to the latest commit of the feature branch. If both branches have new commits, Git instead creates a merge commit that has two parents, one from each branch. Either way, the net result is the main branch having all of its original commits plus the new commits from the branch. If you want to compare two branches without merging them, you can use git diff with a branch name to compare them. If the two branches have a merge conflict, a developer will have to tell Git which changes to keep.
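Here is a minimal sketch of a merge where the target branch has not moved, followed by one where both branches have new commits. It assumes git is installed; the scratch directory, the Example User identity, and the branch names feature-a and feature-b are placeholders of mine.

```shell
# Sketch: one merge that only moves a ref, and one that needs a merge commit.
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q
git config user.name 'Example User'        # placeholder identity for the demo
git config user.email 'user@example.com'
echo 'this is me' > myfile.txt
git add myfile.txt
git commit -q -m 'first commit'
main_branch=$(git symbolic-ref --short HEAD)   # master here; may be main on your setup

# Case 1: only the feature branch has new commits, so Git just moves the ref.
git checkout -q -b feature-a
echo 'feature a' > a.txt
git add a.txt
git commit -q -m 'add a.txt'
git checkout -q "$main_branch"
git merge -q feature-a
after_ff=$(git rev-parse HEAD)             # same commit as feature-a, nothing new created

# Case 2: both branches have new commits, so Git records a merge commit.
git checkout -q -b feature-b HEAD~1        # start feature-b from the first commit
echo 'feature b' > b.txt
git add b.txt
git commit -q -m 'add b.txt'
git checkout -q "$main_branch"
git merge -q --no-edit feature-b
parents=$(git cat-file -p HEAD | grep -c '^parent ')

echo "parents of the merge commit: $parents"
git log --oneline --graph
```

The graph output makes the difference visible: the first merge leaves a straight line, while the second one joins two lines of history.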

How Git stash works

You can get a lot of use out of Git without ever using git stash, but you'd be doing yourself a disservice. Running git stash gives you a way to temporarily store changes away from your working directory without committing them. This gives you a clean working directory without forcing you to commit work that isn't ready. The most frequent reason I need to use git stash is when I need to git pull changes from the remote repository and I have local changes not ready to commit. Git refuses to pull when the incoming changes would overwrite my uncommitted changes, and I might not be ready to commit them yet. Using git stash helps me store those changes so that I can git pull.

You can tell what changes you have that haven't been committed by running git status.

If you run git stash, those changes will be put on top of a stack of stashed changes for the repository. You can see all your stashed changes by running:

git stash list

Using git stash and git stash pop

When you want to get changes back from your stash so that they return to your working directory, you have a few options. The stash stack is a last-in-first-out data structure, so the first change you can access is the most recently stashed change. If you run git stash apply, it will put those changes in your working directory and leave them on the stack. If you run git stash pop, however, it will put the changes in your working directory and remove them from the stack, giving you access to the next stashed change if you need it. Either way, running the git status command can show you a confirmation that your changes are back.
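The stack behavior is easy to check with a small sketch. It assumes git is installed; the scratch directory and the Example User identity are placeholders for the demo.

```shell
# Sketch: stash two sets of changes, then pop the most recent one.
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q
git config user.name 'Example User'        # placeholder identity for the demo
git config user.email 'user@example.com'
echo 'this is me' > myfile.txt
git add myfile.txt
git commit -q -m 'first commit'

echo 'work in progress' >> myfile.txt
git stash -q                               # working directory is clean again
clean_status=$(git status --porcelain)     # empty when nothing is modified

echo 'second idea' >> myfile.txt
git stash -q
git stash list                             # stash@{0} is the newest entry

# pop restores the newest entry and removes it from the stack.
git stash pop -q
restored=$(tail -n 1 myfile.txt)
remaining=$(git stash list | wc -l | tr -d ' ')
echo "restored line: $restored, entries left on the stack: $remaining"
```

The change that comes back is the second one, and the first stash is still waiting on the stack for the next pop.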

Moving beyond answering the question - how does Git work?

We have covered a lot in this post! We explored how Git works and how to use Git effectively. We've learned a bit of fun history regarding how Git came about and examined the main plumbing that makes all of the magic happen! We also explored how the most common ways to use the git command line interface work.

If you want to continue to dive deeper into Git, as well as better understand how some of the common commands work, I highly recommend the book titled "Pro Git", which is available for free here. We've also written extensively on some incredibly useful Git tips and tricks if you're looking to move into more advanced Git commands quickly.