06. Version Control (Git)

Dated Oct 16, 2020; last modified on Sun, 14 Mar 2021

Version Control Systems track changes to a folder and its contents in a series of snapshots. Each snapshot encapsulates the entire state of files/folders within a top-level directory.

Git’s interface is a leaky abstraction. While the interface is at times ugly, its underlying design and ideas are beautiful. A bottom-up explanation of Git therefore makes more sense.

Snapshots

A blob corresponds to a file, and it’s just a bunch of bytes. A tree corresponds to a directory.

A tree maps names to blobs and trees (allowing sub-directories). A snapshot is the top-level tree that is being tracked.
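The blob/tree model above can be sketched in a few lines of Python. This is only an illustration of the idea, not how Git stores things on disk; the `Blob`, `Tree`, and `list_paths` names and the file contents are made up for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union


@dataclass(frozen=True)
class Blob:
    """A file: just a bunch of bytes."""
    data: bytes


@dataclass
class Tree:
    """A directory: maps names to blobs (files) or trees (sub-directories)."""
    entries: Dict[str, Union["Tree", Blob]] = field(default_factory=dict)


def list_paths(tree: Tree, prefix: str = "") -> List[str]:
    """Return the path of every blob reachable from this tree."""
    paths: List[str] = []
    for name, obj in sorted(tree.entries.items()):
        if isinstance(obj, Tree):
            paths.extend(list_paths(obj, prefix + name + "/"))
        else:
            paths.append(prefix + name)
    return paths


# A snapshot is simply the top-level tree (hypothetical contents).
snapshot = Tree({
    "README.md": Blob(b"# hello\n"),
    "static": Tree({"css": Tree({"main.css": Blob(b"body {}\n")})}),
})

print(list_paths(snapshot))
```

Nesting a `Tree` inside a `Tree` is all it takes to get sub-directories.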

Modeling History: Relating Snapshots

A commit contains parents, metadata (e.g. message, author) and the top-level tree.

A history is a directed acyclic graph (DAG) of commits, where each commit refers to a set of parents (commits that preceded it).

Ordering commits by time alone is insufficient. For instance, how would we deal with a commit that descends from multiple parents, e.g. in a merge?

In Git, \(A \to B\) means that \(B\) comes before \(A\): the arrows point to the parent(s). It’s easy to misread this as commit \(A\) coming before commit \(B\), but if we think about it, how could commit \(A\) know about \(B\) ahead of time?
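To make the direction of the arrows concrete, here is a toy history in Python where each commit only knows its parents. The ids `c1`..`c4` and `t1`..`t4`, the `Commit` class, and the `ancestors` helper are invented for this sketch; real commits are identified by hashes.

```python
from dataclasses import dataclass
from typing import Dict, Set, Tuple


@dataclass(frozen=True)  # frozen: commits are immutable
class Commit:
    parents: Tuple[str, ...]  # ids of the commits that preceded this one
    message: str
    tree: str                 # id of the top-level tree (the snapshot)


# c4 is a merge: it descends from both c2 and c3, which both descend from c1.
history: Dict[str, Commit] = {
    "c1": Commit((), "initial commit", "t1"),
    "c2": Commit(("c1",), "feature work", "t2"),
    "c3": Commit(("c1",), "bug fix", "t3"),
    "c4": Commit(("c2", "c3"), "merge", "t4"),
}


def ancestors(history: Dict[str, Commit], start: str) -> Set[str]:
    """Follow parent arrows from `start`; returns `start` and everything before it."""
    seen: Set[str] = set()
    stack = [start]
    while stack:
        commit_id = stack.pop()
        if commit_id in seen:
            continue
        seen.add(commit_id)
        stack.extend(history[commit_id].parents)
    return seen


print(sorted(ancestors(history, "c4")))
print(sorted(ancestors(history, "c2")))
```

Walking from `c2` never reaches `c3` or `c4`: a commit can point back to what existed when it was created, but not forward.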

Commits are immutable. “Editing” the commit history creates entirely new commits and references get updated to point to the new commits.

Objects and Content-Addressing

All objects are content-addressed by their SHA-1 hash.

Given an input, the SHA-1 hash function produces a 160-bit hash value (typically displayed as a 40-digit hex number). SHA-1 is no longer considered cryptographically secure, since practical collision attacks have been demonstrated.

Blobs, trees, and commits are all objects. An object that references another object stores the other object’s hash, not its on-disk representation, which keeps references small and cheap.

Git References

SHA-1 hashes aren’t human-friendly, hence the need for references: readable names that point to commits. Unlike objects, references are mutable, e.g. a ref can be updated to point to a different commit.

The master reference usually points to the latest commit in the main branch. HEAD is a special reference for “where we currently are”.

All git commands map to some manipulation of the commit DAG by adding objects and adding/updating references.
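A rough sketch of “adding objects and updating references”, using plain dictionaries. The commit ids, the `resolve` and `add_commit` helpers, and the assumption that HEAD is attached to a branch are all illustrative simplifications, not Git’s actual implementation.

```python
from typing import Dict

# Immutable-by-convention objects, keyed by (made-up) ids.
commits: Dict[str, dict] = {
    "c1": {"parents": [], "message": "initial commit"},
}

# Mutable references. HEAD points at a branch name, which points at a commit.
references: Dict[str, str] = {"master": "c1", "HEAD": "master"}


def resolve(name: str) -> str:
    """Follow reference names until we reach something that is not a reference."""
    while name in references:
        name = references[name]
    return name


def add_commit(commit_id: str, message: str) -> None:
    """Add a new commit object, then move the current branch to point at it."""
    parent = resolve("HEAD")
    commits[commit_id] = {"parents": [parent], "message": message}
    current_branch = references["HEAD"]  # assumes HEAD is attached to a branch
    references[current_branch] = commit_id


add_commit("c2", "second commit")
print(references)       # master now points at c2; HEAD still points at master
print(resolve("HEAD"))  # c2
```

Note that `c1` is never modified: the only things that change are the new entry in `commits` and the value stored under `master`.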


The staging area exists to satisfy a user need: “Create a snapshot, but not of the current state of my working directory. Instead, let me specify which files should go into the snapshot.”
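As a toy model of that idea: keep the working directory and the staging area as two separate mappings, and build snapshots only from the latter. The file contents and the `stage` / `take_snapshot` helpers are hypothetical; Git’s real index is a more involved binary file.

```python
from typing import Dict

# Current state of the working directory (hypothetical contents).
working_dir: Dict[str, bytes] = {
    "static/css/main.css": b"cite { font-size: small; }\n",
    "static/js/OrganizeCitations.js": b"// tidy citations\n",
    "notes.txt": b"half-finished thoughts\n",
}

# The staging area: only what the user has explicitly chosen.
index: Dict[str, bytes] = {}


def stage(path: str) -> None:
    """Copy the file's current content into the staging area."""
    index[path] = working_dir[path]


def take_snapshot() -> Dict[str, bytes]:
    """A snapshot is built from the staging area, not the working directory."""
    return dict(index)


stage("static/css/main.css")
stage("static/js/OrganizeCitations.js")
snapshot = take_snapshot()

# Later edits to the working directory do not affect the snapshot.
working_dir["static/css/main.css"] = b"changed later\n"
print(sorted(snapshot))
```

`notes.txt` exists in the working directory but never makes it into the snapshot, because it was never staged.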

Aside: Exploring Content in Git

I have a commit (17bcc8c) that involved two files. git cat-file -p 17bcc8c5126a8aaea0e08f1d5093d2616246e2e7 gives:

tree 9eec88a379362014e11e90a663b983e13b042fdc
parent e88f29b1395d076ad8b984758345e2a6777847d8
author Chege Gitau <d.chege711@gmail.com> 1602853108 -0700
committer Chege Gitau <d.chege711@gmail.com> 1602853108 -0700

[CSS] Make all citation text smaller

The author may differ from the committer, e.g. Alice sends a patch to a project, and Bob, one of the core members, applies the patch. Both Alice and Bob will get credit.
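The commit object printed above is plain text, so it is easy to pick apart. Below is a small parser sketch; `parse_commit` and `parse_when` are names invented here, and the parser ignores multi-line headers (such as signatures) that real commits may carry.

```python
from datetime import datetime, timedelta, timezone

RAW_COMMIT = """\
tree 9eec88a379362014e11e90a663b983e13b042fdc
parent e88f29b1395d076ad8b984758345e2a6777847d8
author Chege Gitau <d.chege711@gmail.com> 1602853108 -0700
committer Chege Gitau <d.chege711@gmail.com> 1602853108 -0700

[CSS] Make all citation text smaller
"""


def parse_commit(text: str) -> dict:
    """Split headers from the message; a merge commit has several parent lines."""
    head, _, message = text.partition("\n\n")
    fields: dict = {"parent": []}
    for line in head.splitlines():
        key, _, value = line.partition(" ")
        if key == "parent":
            fields["parent"].append(value)
        else:
            fields[key] = value
    fields["message"] = message.strip()
    return fields


def parse_when(identity: str) -> datetime:
    """Extract the timestamp from '<name> <email> <unix seconds> <+-HHMM>'."""
    _, seconds, offset = identity.rsplit(" ", 2)
    sign = -1 if offset[0] == "-" else 1
    delta = sign * timedelta(hours=int(offset[1:3]), minutes=int(offset[3:5]))
    return datetime.fromtimestamp(int(seconds), tz=timezone(delta))


commit = parse_commit(RAW_COMMIT)
print(commit["tree"])
print(commit["parent"])
print(parse_when(commit["author"]).isoformat())  # 2020-10-16T05:58:28-07:00
```

Here the author and committer lines are identical; in the Alice and Bob scenario they would carry different names and possibly different timestamps.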

Using the tree’s hash, git cat-file -p 9eec88a379362014e11e90a663b983e13b042fdc gives:

100644 blob b39f19d9a5954dc929d1b2f765c2e78ed2dca6b3    .gitignore
100644 blob 7995f1c22ba6510d585c36342f4319a501494565    .gitmodules
040000 tree ce56f3493f314cda8717189b41dc0769bdeed1ed    .vscode
100644 blob fd5b541a492ee1007db5ccae70a86b9154141c52    README.md
040000 tree f5b1aa13654d360771df3da98296ac958d537d88    archetypes
100644 blob 212563bd95ae431001e0bf93c41e702e784f2202    config.toml
040000 tree c507520f760cac5235055e6b6805556bc97e7941    content
040000 tree 2dee6eeac48f232c6b809589c301727dc5905e97    data
160000 commit 2aa94ae5b1161529b9cdfe8b7f62383a75d6f73c  dchege711.github.io
040000 tree c0aedf3af904988269422f85503de0b662025d00    layouts
100755 blob 0967229b76c32ef1f105aa0dcd9ee8db3d76bb76    publish_blog.sh
100755 blob 1328ff22cb52413c58f5738a153fc088b27dffd4    run_blog_server.sh
040000 tree 61e0f387bb3f6bd9bf8f9b715d3be2e7affc58a4    src
040000 tree fb86b904d22a21cfb04578209a5792cddc252dca    static

I’m surprised. I expected to see only 2 blobs: static/js/OrganizeCitations.js and static/css/main.css. Hmm… Passing -s (object size) instead of -p (pretty print) shows that 9eec88a is only 522 bytes.

git cat-file tree 9eec88a379362014e11e90a663b983e13b042fdc gives:

100644 .gitignore��٥�M�)Ѳ�e���ܦ�100644 .gitmodulesy���+�Q
X\64/C�IEe40000 .vscode�V�I?1Lڇ�A�i����100644 README.md�[TI.�}�̮p�k�TR40000 archetypes���eM6q�=������S}�100644 config.toml!%c���C࿓�p.xO"40000 content�Rv�R5^khUk�~yA40000 data-�n�ď#,k����r}Ő^�160000 dchege711.github.io*�J�)����b8:u��<40000 layouts���:���iB/�P=�b]100755 publish_blog.sh	g"�v�.���
͞��=v�v100755 run_blog_server.sh(�"�RA<X�s�?���}��40000 srca��?kٿ��q];���X�40000 static����*!ϰEx �W���%-�

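That explains both surprises. A commit’s tree is a full snapshot of the top-level directory, not a list of the files that changed. It still only takes 522 bytes because the tree stores one (mode, name, hash) entry per top-level item, and unchanged blobs and sub-trees are referenced by hash rather than copied. Jah bless Git!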
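The 522 bytes can be checked by hand. In a raw tree object, each entry is the mode (without a leading zero, which is why the raw dump shows `40000` rather than `040000`), a space, the name, a NUL byte, and the 20 raw bytes of the SHA-1. The sketch below adds that up for the 14 entries listed earlier; `entry_size` is a helper name made up for this example.

```python
# (mode, name) pairs copied from the pretty-printed tree listing.
ENTRIES = [
    ("100644", ".gitignore"),
    ("100644", ".gitmodules"),
    ("040000", ".vscode"),
    ("100644", "README.md"),
    ("040000", "archetypes"),
    ("100644", "config.toml"),
    ("040000", "content"),
    ("040000", "data"),
    ("160000", "dchege711.github.io"),
    ("040000", "layouts"),
    ("100755", "publish_blog.sh"),
    ("100755", "run_blog_server.sh"),
    ("040000", "src"),
    ("040000", "static"),
]

SHA1_RAW_BYTES = 20  # 160 bits stored as raw bytes, not as 40 hex digits


def entry_size(mode: str, name: str) -> int:
    """Bytes used by one raw tree entry: '<mode> <name>\\0<20-byte hash>'."""
    stored_mode = mode.lstrip("0")  # 040000 is stored as 40000
    return len(stored_mode) + 1 + len(name.encode()) + 1 + SHA1_RAW_BYTES


total = sum(entry_size(mode, name) for mode, name in ENTRIES)
print(total)  # 522
```

The size depends only on the number of top-level entries and the lengths of their names, not on how large the files and sub-directories behind those hashes are.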

References

  1. Topic 2: Version Control. MIT Computer Science. missing.csail.mit.edu.
  2. SHA-1 - Wikipedia. en.wikipedia.org.
  3. Git - Viewing the Commit History. git-scm.com.