To celebrate this most recent release, here’s GitHub’s look at some of the most interesting features and changes introduced since last time.
If you use Git on the command-line, you have almost certainly used
git log to peruse your project’s history. But you may not be as familiar with its cousin,
git shortlog, which summarizes the output produced by
git log. For example, many projects (including Git) use
git shortlog -ns to produce a list of unique contributors in a release, along with the number of commits they authored, like this:
$ git shortlog -ns v2.38.0.. | head -10
   166  Junio C Hamano
   118  Taylor Blau
   115  Ævar Arnfjörð Bjarmason
    43  Jeff King
    26  Phillip Wood
    21  René Scharfe
    15  Derrick Stolee
    11  Johannes Schindelin
     9  Eric Sunshine
     9  Jeff Hostetler
[...]
We’ve talked about
git shortlog in the past, most recently when 2.29 was released to show off its more flexible
--group option, which allows you to group commits by fields other than their author or committer. For example, something like:
$ git shortlog -ns --group=author --group=trailer:co-authored-by
would count each commit towards its author, as well as towards any individuals listed in the Co-authored-by trailer of the commit message, if present.
In Git 2.39, git shortlog became even more flexible by learning how to aggregate commits based on arbitrary formatting specifiers, like the ones mentioned in the pretty formats section of Git’s documentation.
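To see the trailer grouping in action, here’s a throwaway-repository sketch (Alice and Bob are made-up identities):

```shell
# Throwaway demo: one commit authored by Alice, co-authored by Bob.
git init -q demo && cd demo
git -c user.name=Alice -c user.email=alice@example.com \
    commit -q --allow-empty -m 'Add feature

Co-authored-by: Bob <bob@example.com>'

# Credit both the author and anyone named in a Co-authored-by trailer
# (requires Git 2.29 or newer):
git shortlog -ns --group=author --group=trailer:co-authored-by HEAD
# both Alice and Bob are counted with one commit each
```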
One neat use is being able to get a view of how many commits were committed each month during a release cycle. Before, you might have written something like this monstrosity:
$ git log v2.38.0.. --date="format:%Y-%m" --format="%cd" | sort | uniq -c
Here, --date="format:%Y-%m" tells Git to output each date field like 2022-10 (year and month), and
--format="%cd" tells Git to output only the committer date (using the aforementioned format) when printing each commit. Then, we sort the output, and count the number of unique values.
Now, you can ask Git to do all of that for you, by writing:
$ git shortlog v2.38.0.. --date="format:%Y-%m" --group='%cd' -s
     2  2022-08
    47  2022-09
   405  2022-10
   194  2022-11
     5  2022-12
Here, we’re asking git shortlog to output a summary where the left-hand column is the number of commits attributed to each unique group (in this case, the year and month combination), and the right-hand column is the identity of each group itself.
Since you can pass any format specifier to the
--group option, the flexibility here is limited only by the pretty formats available, and your own creativity.
When you want to tell Git to remove unreachable objects (those which can’t be found by walking along the history of any branch or tag), you might run something like:
$ git gc --cruft --prune=5.minutes.ago
That instructs Git to divvy your repository’s objects into two packs: one containing reachable objects, and another containing unreachable objects modified within the last five minutes. This makes sure that a
git gc process doesn’t race with incoming reference updates that might leave the repository in a corrupt state. As those objects continue to age, they will be removed from the repository via subsequent
git gc invocations. For (many) more details, see our post, Scaling Git’s garbage collection.
Even though the
--prune=<date> mechanism of adding a grace period before permanently removing objects from the repository is relatively effective at avoiding corruption in practice, it is not completely fool-proof. And when we do encounter repository corruption, it is useful to have the missing objects close by to allow us to recover a corrupted repository.
In Git 2.39,
git repack learned a new option to create an external copy of any objects removed from the repository:
--expire-to. When combined with
--cruft options like so:
$ git repack --cruft --cruft-expiration=5.minutes.ago -d --expire-to=../backup.git
any unreachable objects which haven’t been modified in the last five minutes are collected together and stored in a packfile that is written to
../backup.git. Then, objects you may be missing after garbage collection are readily available in the pack stored in ../backup.git.
These ideas are identical to the ones described in the “limbo repository” section of our Scaling Git’s garbage collection blog post. At the time of writing that post, those patches were still under review. Thanks to careful feedback from the Git community, the same tools that power GitHub’s own garbage collection are now available to you via Git 2.39.
On a related note, careful readers may have noticed that in order to write a cruft pack, you have to explicitly pass
--cruft to both
git gc and
git repack. This is still the case. But in Git 2.39, users who enable the
feature.experimental configuration and are running the bleeding edge of Git will now use cruft packs by default when running git gc.
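Opting in looks like the following (note that feature.experimental flips on a number of other experimental defaults, too):

```shell
# Opt a repository into experimental defaults; in Git 2.39 this makes
# git gc write cruft packs without passing --cruft each time.
git config feature.experimental true
```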
If you’ve been following along with the gradual introduction of sparse index compatibility in Git commands, this one’s for you.
In previous versions of Git, if you used
git grep --cached (to search through the index instead of the blobs in your working copy) in a repository with the sparse index feature enabled, you might have noticed that Git first had to expand your index.
In large repositories where the sparse portion of the repository is significantly smaller than the repository as a whole, this adds a substantial delay before
git grep --cached outputs any matches.
Thanks to the work of Google Summer of Code student, Shaoxuan Yuan, this is no longer the case. This can lead to some dramatic performance enhancements: when searching in a location within your sparse cone (e.g.,
git grep --cached $pattern -- 'path/in/sparse/cone'), Git 2.39 outperforms the previous version by nearly 70%.
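As a sketch of the kind of invocation that benefits (the path, layout, and pattern here are made up):

```shell
# Hypothetical layout: 'path/in/sparse/cone' is inside the sparse cone.
git sparse-checkout set path/in/sparse/cone

# Search the index, restricted to a path inside the cone; in Git 2.39
# this no longer forces the sparse index to be fully expanded first.
git grep --cached 'TODO' -- 'path/in/sparse/cone'
```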
This one is a little bit technical, but bear with us, since it ends with a nifty performance optimization that may be coming to a Git server near you.
Before receiving a push, a Git server must first tell the pusher about all of the branches and tags it already knows about. This lets the client omit any objects that it knows the server already has, and results in less data being transferred overall.
Once the server has all of the new objects, it ensures that they are “connected” before entering them into the repository. Generally speaking, this “connectivity check” ensures that none of the new objects mention nonexistent objects; in other words, that the push will not corrupt the repository.
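Conceptually, that check is an object walk from the newly pushed tips that stops at everything the server already has; a rough sketch (PUSHED_TIP stands in for a commit a client just pushed, and here we simply reuse HEAD for illustration):

```shell
# Assumption: run inside any repository with at least one commit.
PUSHED_TIP=$(git rev-parse HEAD)

# The gist of the connectivity check: walk every object reachable from
# the new tip, stopping at objects already reachable from existing refs.
# The walk errors out if any required object is missing.
git rev-list --objects --quiet "$PUSHED_TIP" --not --all && echo connected
```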
One additional factor worth noting is that some Git servers are configured to avoid advertising certain references. But those references are still used as part of the connectivity check. Taking into account the extra work necessary to incorporate those hidden references into the connectivity check, the additional runtime adds up, especially if there are a large number of hidden references.
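Hiding references is done with server-side configuration like the following (the refs/pull/ namespace is just an illustration):

```shell
# Hide an entire ref namespace from advertisements; clients won't see
# these refs, but pre-2.39 servers still walked them while checking
# connectivity.
git config --add transfer.hideRefs refs/pull/
```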
In Git 2.39, the connectivity check was enhanced to only consider the references that were advertised, in addition to those that were pushed. In a test repository with nearly 7 million references (only ~3% of which are advertised), the resulting speed-up makes Git 2.39 outperform the previous version by roughly a factor of 4.5.
As your server operators upgrade to the latest version of Git, you should notice an improvement in how fast they are able to process incoming pushes.
Last but not least, let’s round out our recap of some of the highlights from Git 2.39 with a look at a handful of new security measures.
Git added two new “defense-in-depth” changes in the latest release. First,
git apply was updated to refuse to apply patches larger than ~1 GiB in size to avoid potential integer overflows in the apply code. Git was also updated to correctly redact sensitive header information with
GIT_CURL_VERBOSE=1 when using HTTP/2.
If you happen to notice a security vulnerability in Git, you can follow Git’s own documentation on how to responsibly report the issue. Most importantly, if you’ve ever been curious about how Git handles coordinating and disclosing embargoed releases, this release cycle saw a significant effort to codify and write down exactly how Git handles these types of issues.
To read more about Git’s disclosure policy (and learn about how to participate yourself!), you can find more in the repository.