How Git Uses Merkle Trees for Fast Verification

Quick Answer: Git doesn't need to compare files line by line. It stores content in a Merkle tree: every file's content is hashed into a "blob," each directory is hashed into a "tree" that lists the hashes of its entries, and a top-level "commit" records the root tree hash along with metadata. If two root tree hashes match, everything beneath them is identical, so Git can confirm that with a single hash comparison instead of reading any files.

Ever wonder how Git can look at two directories containing hundreds, or even thousands, of files and instantly know if they match? If you or I were to write a script to do this naively, we'd probably loop through every single file, comparing the contents byte by byte. That is O(N) in the total size of the files. It works, but as the repository grows, it gets painfully slow.

Yet, Git does this in milliseconds. Let's break down exactly how it pulls this off without reading every file every time you type a command.

How does Git verify directory contents so fast?

Git relies on a cryptographic data structure called a Merkle tree. Instead of scanning raw file contents, Git assigns a unique hash to every file and directory, rolling them up into a single master hash.

Think of it like managing a massive library of books. If you needed to verify that two library buildings contained the exact same books, reading every page would take lifetimes. Instead, imagine you generate a unique barcode for every book based on its text. Then, you generate a barcode for each shelf based on the books it holds, and finally, a master barcode for the whole building based on its shelves.

If the master barcodes match, the libraries are perfectly identical. Git does exactly this, but with your source code.
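The barcode analogy can be sketched in a few lines of Python. This is a toy model, not Git's actual object format: the nested dictionaries, the `merkle_hash` function, and the `file:`/`dir:` prefixes are all made up for illustration.

```python
import hashlib

def merkle_hash(node):
    """Hash a 'book' (bytes) or a 'shelf/building' (dict of name -> node)."""
    if isinstance(node, bytes):
        # Leaf: the barcode depends only on the content
        return hashlib.sha1(b"file:" + node).hexdigest()
    # Container: the barcode depends on each child's name and barcode
    h = hashlib.sha1(b"dir:")
    for name in sorted(node):
        h.update(name.encode() + b"\0" + merkle_hash(node[name]).encode() + b"\n")
    return h.hexdigest()

library_a = {
    "shelf1": {"book1.txt": b"Call me Ishmael.", "book2.txt": b"It was the best of times."},
    "shelf2": {"book3.txt": b"All happy families are alike."},
}
# An independent copy with the same contents
library_b = {shelf: dict(books) for shelf, books in library_a.items()}
# A copy where a single book differs
library_c = {shelf: dict(books) for shelf, books in library_a.items()}
library_c["shelf2"]["book3.txt"] = b"All happy families are alike!"

print(merkle_hash(library_a) == merkle_hash(library_b))  # True
print(merkle_hash(library_a) == merkle_hash(library_c))  # False
```

Once the hashes exist, comparing two libraries is a single string comparison at the top level, no matter how many books they hold.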

How do Git blobs, trees, and commits work together?

When you create a commit, Git stores your data in a hierarchy of objects: blobs for file content, trees for directory structures, and commits for the snapshot metadata. Each level generates a hash that depends entirely on the hashes below it.

Let's say your team is building a standard web app. When you run a commit, Git structures your data in three distinct layers:

Blobs: Git hashes the raw content of each file, prefixed with a short header that gives the object type and size. This creates a "blob" (binary large object). The blob only cares about the content, completely ignoring the filename.
Trees: Next, Git creates a tree object for each directory. A tree lists the name and file mode of every entry, along with the hash of the blob (for a file) or tree (for a subdirectory) it points to. Git then hashes this entire tree object.
Commits: Finally, the commit object wraps around the root tree. It pairs the root tree hash with metadata like the author, committer, timestamps, commit message, and parent commit hash. Then, the commit itself gets hashed.

Git Object	What It Represents	What It Hashes
Blob	Raw file content	Only the file's content, plus a type-and-size header (no filename)
Tree	Directory structure	Entry names, file modes, and hashes of the underlying blobs/trees
Commit	Repository snapshot	Author, committer, timestamps, message, parent commit hash(es), and root tree hash
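To make the three layers concrete, here is a minimal sketch that computes object IDs the way Git's SHA-1 object format does: hash a `<type> <size>\0` header followed by the object body. The helper names (`git_object_id`, `blob_id`, `tree_id`, `commit_id`) and the sample author line are assumptions for this example, and it ignores details such as compression and on-disk storage.

```python
import hashlib

def git_object_id(obj_type: str, body: bytes) -> str:
    """SHA-1 of '<type> <size>\\0' + body, as used for Git object IDs."""
    header = f"{obj_type} {len(body)}".encode() + b"\0"
    return hashlib.sha1(header + body).hexdigest()

def blob_id(content: bytes) -> str:
    return git_object_id("blob", content)

def tree_id(entries) -> str:
    """entries: iterable of (mode, name, hex_id), e.g. ('100644', 'a.txt', '...')."""
    def sort_key(entry):
        mode, name, _ = entry
        # Git sorts directory entries as if their name ended with '/'
        return (name + "/" if mode == "40000" else name).encode()
    body = b"".join(
        mode.encode() + b" " + name.encode() + b"\0" + bytes.fromhex(hex_id)
        for mode, name, hex_id in sorted(entries, key=sort_key)
    )
    return git_object_id("tree", body)

def commit_id(tree_hex, parents, author, committer, message) -> str:
    lines = [f"tree {tree_hex}"]
    lines += [f"parent {p}" for p in parents]
    lines += [f"author {author}", f"committer {committer}", "", message]
    return git_object_id("commit", "\n".join(lines).encode())

readme = blob_id(b"hello world\n")
print(readme)  # 3b18e512dba79e4c8300dd08aeb37f8e728b8dad

root = tree_id([("100644", "README.md", readme)])
who = "Example Dev <dev@example.com> 1700000000 +0000"  # hypothetical identity
first = commit_id(root, [], who, who, "Initial commit\n")
later = commit_id(root, [first], who, who, "Same tree, different metadata\n")
print(first != later)  # True, even though both commits point at the same tree
```

The blob ID should match what `git hash-object` reports for the same bytes. The last line also shows why the root tree hash, rather than the commit hash, is the thing to compare when the question is whether two snapshots contain the same files.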

Why does changing one file change the commit hash?

Because the Merkle tree structure creates a strict chain of cryptographic dependencies. A modified file produces a new blob hash, which forces the parent tree to generate a new hash, causing a ripple effect straight to the commit object.

This is the real power of the Merkle tree design. If you fix a single typo in a CSS file deep within your project, Git doesn't rescan the rest of your codebase. That modified CSS file gets a brand new blob hash. Because the tree (directory) holding that file now points to a new blob hash, the tree's hash changes. This bubbles all the way up to the top-level commit hash.

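Therefore, if two commit hashes are identical, you have a cryptographic guarantee (barring a hash collision) that every single file and folder underneath them has exactly the same content. The reverse needs a little care: a commit hash also covers metadata such as the author, timestamp, and parent, so two different commits can still point to identical content. To find out, Git compares their root tree hashes. If those differ, something changed in the repository, and Git simply follows the changed tree hashes down the branches, skipping every subtree whose hash is unchanged, to find exactly which file was modified.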
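The "follow the changed hashes down" step can be sketched as a recursive walk. This is a simplified model rather than Git's diff machinery: `build` and `changed_paths` are hypothetical helpers, and the hashing scheme is a stand-in for Git's real object format.

```python
import hashlib

def build(node):
    """Convert nested dict/bytes into (hash, children) pairs; children is None for files."""
    if isinstance(node, bytes):
        return (hashlib.sha1(b"blob:" + node).hexdigest(), None)
    children = {name: build(child) for name, child in node.items()}
    h = hashlib.sha1(b"tree:")
    for name in sorted(children):
        h.update(f"{name}\0{children[name][0]}\n".encode())
    return (h.hexdigest(), children)

def changed_paths(old, new, path=""):
    """List paths that differ, descending only into subtrees whose hashes differ."""
    if old is not None and new is not None and old[0] == new[0]:
        return []  # same hash: the whole subtree is identical, skip it
    both_dirs = (
        old is not None and new is not None
        and old[1] is not None and new[1] is not None
    )
    if not both_dirs:
        return [path]  # a file changed, or something was added/removed
    result = []
    for name in sorted(set(old[1]) | set(new[1])):
        child_path = f"{path}/{name}" if path else name
        result += changed_paths(old[1].get(name), new[1].get(name), child_path)
    return result

before = {
    "README.md": b"My web app\n",
    "src": {"app.js": b"console.log('hi');\n", "style.css": b"body { colour: red; }\n"},
    "docs": {"guide.md": b"Read me first.\n"},
}
after = {
    "README.md": b"My web app\n",
    "src": {"app.js": b"console.log('hi');\n", "style.css": b"body { color: red; }\n"},
    "docs": {"guide.md": b"Read me first.\n"},
}

print(changed_paths(build(before), build(after)))  # ['src/style.css']
```

In this run the walk never looks inside `docs`, because its hash is the same on both sides. That pruning is what keeps the comparison cheap in a large repository where only a few files changed.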

Frequently Asked Questions

What hashing algorithm does Git use?

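Git has historically used the SHA-1 algorithm to generate object hashes. SHA-1 is no longer considered collision-resistant (a practical collision was demonstrated in 2017), but Git uses it primarily for data integrity and content addressing, and modern versions use a SHA-1 implementation that detects known collision attacks. Recent versions of Git can also create repositories that use the more secure SHA-256 algorithm, although SHA-1 remains the common default and interoperability between the two formats is still being worked on.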
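As a rough illustration of what the switch means for object IDs, the sketch below hashes the same blob with both algorithms. It assumes the SHA-256 object format hashes the same `<type> <size>\0` header and body as the SHA-1 format, only with a different hash function; `blob_id` is a hypothetical helper.

```python
import hashlib

def blob_id(content: bytes, algo: str = "sha1") -> str:
    """Hash a blob header plus content with the chosen algorithm."""
    h = hashlib.new(algo)
    h.update(b"blob " + str(len(content)).encode() + b"\0" + content)
    return h.hexdigest()

content = b"hello world\n"
sha1_id = blob_id(content, "sha1")
sha256_id = blob_id(content, "sha256")

print(len(sha1_id), sha1_id)      # 40 hex characters
print(len(sha256_id), sha256_id)  # 64 hex characters
```

The Merkle structure itself is unchanged by the choice of hash; only the length and value of each ID differ.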

Does Git hash the filename or just the content?

Git blobs only hash the raw file content (plus a small type-and-size header). The actual filename and file mode are stored and hashed inside the tree object that points to the blob. Git does not record renames explicitly. When you rename a file without changing its contents, the blob hash stays exactly the same and only the tree entry changes, so Git's rename detection can match the old and new paths at diff time.
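A small sketch makes the split visible: the blob ID is computed without any filename, while the tree ID changes as soon as the name in its entry changes. `object_id` and `single_file_tree` are hypothetical helpers, and the tree here holds just one regular file (mode `100644`).

```python
import hashlib

def object_id(kind: str, body: bytes) -> str:
    """SHA-1 over '<kind> <size>\\0' + body."""
    return hashlib.sha1(f"{kind} {len(body)}".encode() + b"\0" + body).hexdigest()

content = b"body { margin: 0; }\n"
blob = object_id("blob", content)  # no filename involved

def single_file_tree(name: str) -> str:
    """ID of a tree containing one regular file pointing at `blob`."""
    entry = b"100644 " + name.encode() + b"\0" + bytes.fromhex(blob)
    return object_id("tree", entry)

old_tree = single_file_tree("style.css")
new_tree = single_file_tree("main.css")

print(blob)                  # the same blob ID is referenced before and after the rename
print(old_tree != new_tree)  # True: only the tree (and everything above it) changes
```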

What happens to old blobs when a file is changed?

Git never overwrites your old blobs. It creates an entirely new blob object for the changed content. The old blob stays in your local object database for as long as some commit still references it, which is what lets you check out older commits. If a blob becomes completely unreachable, it is eventually cleaned up by Git's garbage collection (git gc).
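This behaviour falls out of content addressing: an object is stored under the hash of its content, so new content always lands under a new key. The `ObjectStore` class below is a toy in-memory stand-in for Git's object database, assumed purely for illustration, with no compression, packing, or garbage collection.

```python
import hashlib

class ObjectStore:
    """Toy content-addressed store: key = hash of content, values never change."""

    def __init__(self):
        self.objects = {}

    def put(self, content: bytes) -> str:
        header = b"blob " + str(len(content)).encode() + b"\0"
        oid = hashlib.sha1(header + content).hexdigest()
        # Storing identical content twice is a no-op
        self.objects.setdefault(oid, content)
        return oid

store = ObjectStore()
v1 = store.put(b"color: red;\n")
v2 = store.put(b"color: blue;\n")  # the edited file becomes a new object

print(v1 != v2)            # True
print(store.objects[v1])   # b'color: red;\n' is still retrievable
print(len(store.objects))  # 2
```

Writing the original content again would return `v1` without adding a third object, which is also why identical files across a repository's history are stored only once.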
