Plugging Git Leaks: Preventing and Fixing Information Exposure in Repositories

Private keys, third-party API keys, database master passwords, personally identifiable information... The consequences of exposing such sensitive information can be so dramatic for your organization and your users that leaving it out of your source code is sound advice, if only in an effort to reduce your attack surface.

What I find interesting is that, despite its many benefits, the widespread adoption of Git may have increased the risk of accidentally burying sensitive information in version-control repositories.

Don't get me wrong: I love Git, and I wouldn't trade it for the world! Just hear me out.

Committing more than intended is easy

Git's main mechanism for staging changes, git add, is arguably too powerful. For instance, running git add . will almost indiscriminately stage everything in the current directory and its subdirectories. Even with proper .gitignore hygiene, accidentally staging more than intended is deceptively easy.

Moreover, in my experience at least, many developers fail to review what they've just staged and instead immediately proceed to create a new commit. Therefore, it's not unusual for developers to unwittingly commit sensitive information along with the intended changes.
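To make the habit concrete, here is a minimal sketch of reviewing the staging area before committing. The scratch repository, the file names, and the fake key are all made up for illustration; the only point is that git diff --cached shows what the next commit would contain, and git rm --cached takes a file back out of the index while leaving it on disk.

```shell
#!/bin/sh
# Sketch only: a throwaway repo with hypothetical file names and a fake key.
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q .

echo 'print("hello")' > app.py
echo 'API_KEY=not-a-real-key' > sensitive.txt

git add .                                     # stages both files
staged=$(git diff --cached --name-only)       # what would the next commit contain?
printf 'Staged:\n%s\n' "$staged"

git rm -q --cached sensitive.txt              # unstage it; the file stays on disk
staged_after=$(git diff --cached --name-only)
printf 'Staged after review:\n%s\n' "$staged_after"
```

If you'd rather not stage everything and prune afterwards, git add -p walks you through each hunk interactively, which forces a review as a side effect.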

Detecting accidentally committed stuff can be hard

There's something more pernicious and deceptive in Git's ecosystem, though: because most code-collaboration tools (GitHub, GitLab, Bitbucket, and the like) present changes for review at the pull-request level by default, as opposed to the commit level, sensitive data can easily escape the scrutiny of even the most diligent code reviewers! The following example may suffice to convince you.

Picture Bob, a developer working on a feature branch off of the master branch. Bob accidentally stages and commits sensitive.txt, which (guess what) contains sensitive data.

git add .
git commit -m "Implement functionality foo"

Bob realizes his mistake but, having only a superficial understanding of how Git works, elects to solve the problem by creating a new commit (with a far from descriptive commit message) in order to stop tracking the problematic file:

git rm --cached sensitive.txt
git commit -m "Fix some stuff"

Bob, after tacking on a few more commits, then creates a pull request and asks Alice (a more seasoned Git user than Bob) to review his contribution.

The commit graph looks something like this:

* 330459d (HEAD -> feature) Ta da!
* ...
* 541d8e9 Fix some stuff
* cc24446 Implement functionality foo
* f7b5695 (master)

At this point, most readers will likely realize that Bob failed to solve the problem: the sensitive stuff is still present in the repo's history, both in the snapshot recorded by commit cc24446 and in the diff between commits cc24446 and 541d8e9. But wait!
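If you'd like to convince yourself, the following sketch replays Bob's two commits in a scratch repository and then asks Git for the file. The repository, the identity, and the file contents are assumptions made for the demo; the commit IDs you get will differ from the ones above.

```shell
#!/bin/sh
# Sketch only: replays Bob's mistake in a throwaway repo with made-up contents.
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.name "Bob"
git config user.email "bob@example.com"
git commit -q --allow-empty -m "Initial commit"
git checkout -q -b feature

echo 'feature code' > foo.txt
echo 'TOP-SECRET' > sensitive.txt
git add .
git commit -q -m "Implement functionality foo"

git rm -q --cached sensitive.txt
git commit -q -m "Fix some stuff"

# The file is no longer tracked at the tip of the branch...
tip_files=$(git ls-tree -r --name-only HEAD)
# ...but the previous commit still holds it, contents and all.
leaked=$(git show HEAD~1:sensitive.txt)
printf 'Tracked at tip: %s\nRecovered from history: %s\n' "$tip_files" "$leaked"

# Every commit, on any branch, that touched the path:
git log --all --oneline -- sensitive.txt
```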

To review Bob's pull request, Alice is presented with a diff between feature and master, which is only a summary of Bob's changes! As a result, unless Alice reviews the diff between each pair of commits that Bob created—and few developers have the will or luxury to do that in practice—she will not realize that Bob introduced sensitive data in the repo's history. Satisfied, Alice gives Bob the thumbs-up and approves his pull request. Sensitive data has just been buried in the repo's history!
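One cheap way for a reviewer to spot this, sketched below under the same made-up scenario, is to compare the paths in the summary diff with the paths touched by any commit in the range. The variable base is a stand-in for master, and the whole thing is an illustration rather than a feature of any review tool.

```shell
#!/bin/sh
# Sketch only: throwaway repo; "base" plays the role of master.
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.name "Bob"
git config user.email "bob@example.com"
git commit -q --allow-empty -m "Initial commit"
base=$(git rev-parse HEAD)

echo 'feature code' > foo.txt
echo 'TOP-SECRET' > sensitive.txt
git add .
git commit -q -m "Implement functionality foo"
git rm -q --cached sensitive.txt
git commit -q -m "Fix some stuff"

# What the pull-request view summarizes: the net change from base to tip.
summary=$(git diff --name-only "$base" HEAD)
# What actually happened: every path touched by any commit in the range.
touched=$(git log --format= --name-only "$base"..HEAD | sed '/^$/d' | sort -u)
# Paths present in the history but absent from the summary deserve a second look.
hidden=$(printf '%s\n' "$touched" | grep -vxF -e "$summary" || true)
printf 'Touched along the way but missing from the summary diff:\n%s\n' "$hidden"
```

A non-empty result doesn't prove that anything leaked (files get added and removed for perfectly good reasons), but it tells Alice exactly which intermediate commits to open.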

Fast-forward a few months... The team decides, perhaps as a way to give back to the community or for the sake of transparency, to open-source their hitherto closed-source project. They do a quick pass to clean up the code, but fail to audit it for sensitive information (secrets, PII, etc.) lying dormant in the repo's history. Their sensitive info has now leaked out of the organization and is accessible to the entire world!

Someone's hunting for your sensitive information

Your development team may fail to notice the presence of sensitive stuff in your repositories, but that doesn't mean that other external actors won't find it. In fact, some people have become adept at finding secrets and other forgotten treasures buried deep in the history of public repositories. They have a battery of tools at their disposal to assist their search. Here are just a few of them:

Ethical hacker @TomNomNom came up with a shell one-liner that dumps the contents of a repository's object database, and whose output you can pipe to grep, to great effect.
The aptly named trufflehog allows you to search a repository's history for high-entropy strings or for strings that match the signatures of third-party services' secrets (e.g., AWS).
gitrob attempts to identify potentially sensitive files present in a repository.
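To give a feel for the first approach, here is a rough sketch of my own; it is not the one-liner mentioned above, nor how any of these tools is implemented. It lists every object in the object database with git cat-file, keeps the blobs, and greps their contents. The scratch repository and the AWS_SECRET marker are assumptions for the demo.

```shell
#!/bin/sh
# Sketch only: throwaway repo, fake secret, hypothetical search pattern.
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.name "Bob"
git config user.email "bob@example.com"

echo 'AWS_SECRET=not-a-real-secret' > sensitive.txt
git add .
git commit -q -m "Add stuff"
git rm -q sensitive.txt
git commit -q -m "Remove stuff"

# Walk every object Git knows about, keep the blobs, and grep their contents.
hits=$(git cat-file --batch-all-objects --batch-check='%(objecttype) %(objectname)' |
  awk '$1 == "blob" { print $2 }' |
  while read -r oid; do
    if git cat-file -p "$oid" | grep -q 'AWS_SECRET'; then
      echo "$oid"
    fi
  done)
printf 'Blobs matching the pattern:\n%s\n' "$hits"
```

For a quick check limited to reachable history, git log --all -p -S'AWS_SECRET' lists the commits that added or removed the string. The object-database walk has the advantage of also catching blobs that no branch points to anymore.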

Although some organizations offer a financial incentive to ethical hackers for reporting information-exposure issues to them, there's no question that malicious actors, with way more nefarious goals than merely pocketing a bug bounty, are also scouring your public repositories for sensitive information.

Sensitive stuff in your repo: What you should do

So you've found sensitive data in your repository's history. Now what? The answer largely depends on whether revocation is an option. For instance, if the sensitive stuff consists of an API key for a third-party service, you can probably revoke that API key. Do that, adopt a proper secret-management solution, and move on.

But what if the sensitive stuff in question happens to be not a secret but personally identifiable information (PII)? Perhaps you thought using your relatives' names and addresses as test data at the prototype stage would be funny, but you're not laughing now...

Obviously, you cannot revoke PII or recall it if it has already leaked. However, what you can do is rewrite your repo's history to wipe all traces of it from your repository, thereby preventing it from leaking any further. Even though rewriting Git history has serious ramifications, including but not limited to provoking the ire of your open-source contributors, you gotta do what you gotta do...

Two tools come in handy for purging a repository of sensitive information:

Git itself provides a subcommand named filter-branch. Make sure to read the man page carefully before invoking git filter-branch, though: it can be very destructive, so much so that the Pro Git Book rightfully describes it as "the nuclear option" of history rewriting.
The BFG Repo-Cleaner is an external tool similar in spirit to git filter-branch, but it excels at more specialized tasks.
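For the first option, a minimal sketch of purging a single path might look like the following. It runs against a scratch repository with made-up contents; try something like it on a fresh clone long before you point it at the real thing. The FILTER_BRANCH_SQUELCH_WARNING variable only silences the startup warning that recent Git versions print.

```shell
#!/bin/sh
# Sketch only: throwaway repo with hypothetical PII in sensitive.txt.
set -eu
export FILTER_BRANCH_SQUELCH_WARNING=1
repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.name "Bob"
git config user.email "bob@example.com"

echo 'feature code' > foo.txt
echo 'Jane Doe, 1 Example Street' > sensitive.txt
git add .
git commit -q -m "Implement functionality foo"
git rm -q --cached sensitive.txt
git commit -q -m "Fix some stuff"

# Rewrite every commit on every ref, dropping the path from each snapshot.
git filter-branch --index-filter \
  'git rm -q --cached --ignore-unmatch sensitive.txt' -- --all >/dev/null

# The old commits remain reachable through refs/original and the reflog;
# drop those references, then prune the now-unreachable objects.
git for-each-ref --format='%(refname)' refs/original/ | xargs -n 1 git update-ref -d
git reflog expire --expire=now --all
git gc --quiet --prune=now

remaining=$(git log --all --oneline -- sensitive.txt)
printf 'Commits still touching sensitive.txt: [%s]\n' "$remaining"
```

Keep in mind that this only cleans the repository it runs in. Every existing clone and fork still holds the old objects, and the rewritten branches have to be force-pushed.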

Finally, a formal apology to the owner(s) of the leaked PII is probably in order, too.

How to keep sensitive stuff out in the first place

There is no perfect solution, but a couple of tools and practices can help.

git-secrets and Talisman work in a similar way. Both tools are meant to be installed in local repositories as pre-commit hooks. If they detect that a prospective commit may contain sensitive information, they will reject the commit and alert you to the problem.
Keep pull requests small. Allocate time to inspect intermediate diffs during code review; don't misconstrue this piece of advice as a license to squash commits willy-nilly, though. If you find this practice too time-consuming, reduce the size of your pull requests even further.
Put yourself in the attacker's shoes: using trufflehog and friends, periodically audit your repos for sensitive information.
Shift security further to the left: why not strive for automation and integrate a tool like trufflehog into your CI/CD pipeline?
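To illustrate the pre-commit idea, here is a deliberately naive hook of my own; it is not how git-secrets or Talisman work internally, and its two patterns are assumptions chosen for the demo. It rejects a commit if any added line looks like a private-key header or an AWS-style access key ID. The key in the example is fake.

```shell
#!/bin/sh
# Sketch only: a toy hook in a throwaway repo; real tools ship far better rules.
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.name "Bob"
git config user.email "bob@example.com"

mkdir -p .git/hooks
cat > .git/hooks/pre-commit <<'EOF'
#!/bin/sh
# Illustrative patterns only (assumptions, not an exhaustive rule set).
patterns='BEGIN (RSA |EC |OPENSSH )?PRIVATE KEY|AKIA[0-9A-Z]{16}'
if git diff --cached -U0 | grep '^+' | grep -Eq "$patterns"; then
  echo "pre-commit: possible secret in staged changes; commit rejected" >&2
  exit 1
fi
EOF
chmod +x .git/hooks/pre-commit

echo 'aws_key = AKIAABCDEFGHIJKLMNOP' > config.txt    # fake key for the demo
git add config.txt
if git commit -q -m "Add config" 2>/dev/null; then
  result=committed
else
  result=rejected
fi
echo "Commit with fake key: $result"
```

Hooks live in each clone and can be bypassed with git commit --no-verify, which is one more reason to back them up with periodic audits and a check in your CI/CD pipeline.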

Conclusion

Keeping sensitive information out of Git repositories is surprisingly hard, but all is not lost: a modicum of discipline, together with a handful of specialized tools, can go a long way.

Also, with the rise of DevSecOps, I wouldn't be surprised to witness the advent of more sophisticated tools in the near future. Watch this space!