The Need for Standardized Secret Scanning

Secrets in source code lead to vulnerabilities. Can we do better?

It’s no secret that leaving credentials in source code is risky, especially when tools like GitHub make it easy to share code publicly with a single command. The major players have long had tooling to help prevent this, such as GitHub Secret Scanning, and GitHub also makes it possible (at least in theory) for third-party providers to join the program. Unfortunately, out of the countless platforms that use secrets for authentication, only 100 or so have partnered with GitHub’s program.

Part of the reason for this may be the onboarding process (there are also additional terms to agree to that aren’t public). Further, there are numerous services like GitHub that could leak secrets, and even more platforms with secrets to leak. If every pair has to integrate individually, that’s O(n^2) complexity! Clearly there has to be a better way.

What if we could make secrets self-documenting? We’ll define a standardized format for secrets, one that would never be used for legitimate reasons, then we’ll make sure source code repositories can easily interpret the format to know where to notify of leaks. We also need to make sure there are no new security pitfalls, since platforms getting notified won’t have a relationship with GitHub (or another repository) anymore.

Designing a secret format

Secrets are more than just a password that a service checks (there’s a great fly.io article on all sorts of secret formats). They can also directly encode information or be used to sign requests. We need a format that’s flexible and allows any use cases platforms want. This format needs a few properties:

  • Mechanically identifiable (ideally with a RegEx)
  • Visibly sensitive (to a developer)
  • Tamper-resistant (resists basic attempts to bypass scanning)
  • Self-documenting (describes disclosure process)

Stakeholders will need to agree on a standard here, but one idea is to build on Uniform Resource Identifiers (URIs):

credential://[key_material]@example.com

Here, key_material can be any URI-encoded content (a base64 token, for instance). example.com represents the domain the key refers to. This has the benefit of using an existing standard while also being flexible. For instance, Stripe currently prefixes keys with pk_, sk_, and test_ to signify public, secret, and test keys, respectively. The domain can encode this data visibly for users while still being a valid domain. For instance, public keys could be credential://[key]@public.key.stripe.com, and so on. Any developer can visibly identify the party and purpose of keys, and the actual key content is still flexible.
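To make "mechanically identifiable" concrete, here is a minimal sketch of how a scanner might pick these URIs out of arbitrary text. The pattern and the function name find_credentials are assumptions for illustration, not part of any agreed standard; in particular, it assumes key material is limited to URI-unreserved characters plus percent-escapes.

```python
import re
from urllib.parse import unquote

# Hypothetical pattern for the credential:// format sketched above.
# Assumption: key material uses only URI-unreserved characters and
# percent-escapes, and the domain is a dotted hostname ending in a TLD.
CREDENTIAL_RE = re.compile(
    r"credential://(?P<key>(?:[A-Za-z0-9._~-]|%[0-9A-Fa-f]{2})+)"
    r"@(?P<domain>(?:[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?\.)+[A-Za-z]{2,})"
)


def find_credentials(text):
    """Return (key_material, domain) pairs for each credential URI in text."""
    return [
        (unquote(m.group("key")), m.group("domain").lower())
        for m in CREDENTIAL_RE.finditer(text)
    ]


if __name__ == "__main__":
    sample = 'API_KEY = "credential://c2VjcmV0@public.key.stripe.com"'
    print(find_credentials(sample))
```

Because the scheme prefix is fixed, a scanner only needs this one pattern rather than one per platform.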

With a well-defined key format standard, platforms can make their keys scannable without registration. Win!

Why not existing formats?

(Credit XKCD 927)

With all the existing formats out there, why not leverage one of these? Formats are great when parties on both sides are using them (client and server, for instance). But secret scanning fundamentally happens outside of whatever interprets the token itself, so it makes sense for the format to be a wrapper around the token. Requiring more complex formats could limit use cases, and also make platforms and source repositories less likely to implement them.

Handling Secret Scanning

Next, we want to make sure source repositories can disclose results of key scans. At the most basic level, a disclosure strategy could look like this:

  1. GitHub finds a credential credential://[key_material]@example.com
  2. GitHub sends a POST to https://example.com/.well-known/secret-scanning with Body credential://[key_material]@example.com
  3. Service at https://example.com invalidates the key
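Step 2 above could be sketched roughly as follows. The function name build_disclosure_request and the text/plain content type are assumptions; the post only specifies a POST to the well-known path with the credential as the body.

```python
import urllib.request


def build_disclosure_request(credential, domain):
    """Build (but do not send) the disclosure POST described in step 2."""
    url = f"https://{domain}/.well-known/secret-scanning"
    return urllib.request.Request(
        url,
        data=credential.encode("utf-8"),
        # Assumption: the body is sent as plain text.
        headers={"Content-Type": "text/plain; charset=utf-8"},
        method="POST",
    )


if __name__ == "__main__":
    req = build_disclosure_request("credential://abc123@example.com", "example.com")
    print(req.get_method(), req.full_url)
    # A real scanner would then send it, for example:
    # urllib.request.urlopen(req, timeout=10)
```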

This is a good start, but it could face several challenges. Let’s walk through a few and discuss solutions:

Inadvertent key disclosure

The key format and disclosure process make it likely that keys could get leaked to domain squatters. For instance, let’s say that a key credential://[key_material]@example.com is issued, but the developer pastes it into code as credential://[key_material]@example.co. A squatter on example.co would start receiving secrets for example.com!

One way we could fix this is by checksumming the rest of the credential. For instance, credential://[key_material]@example.com/[checksum], where [checksum] is some hash of the domain and key material.
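As a rough illustration, the checksum could be a truncated hash over the key material and domain. The choice of SHA-256, the 8-character truncation, and the helper names below are all assumptions; the post leaves the hash unspecified.

```python
import hashlib


def credential_checksum(key_material, domain, length=8):
    """Truncated SHA-256 over key material and domain (assumed construction)."""
    payload = f"{key_material}@{domain.lower()}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:length]


def format_credential(key_material, domain):
    """Issue a credential URI with the checksum appended as the path."""
    checksum = credential_checksum(key_material, domain)
    return f"credential://{key_material}@{domain}/{checksum}"


def verify_credential(key_material, domain, checksum):
    """Return True only if the checksum matches this key and domain."""
    return credential_checksum(key_material, domain, len(checksum)) == checksum


if __name__ == "__main__":
    issued = format_credential("abc123", "example.com")
    checksum = issued.rsplit("/", 1)[1]
    print(verify_credential("abc123", "example.com", checksum))  # correct domain
    print(verify_credential("abc123", "example.co", checksum))   # mistyped domain
```

A scanner would skip disclosure when verification fails, so a mistyped domain never receives the key. Note that an unkeyed checksum like this only guards against accidental edits, since anyone can recompute it.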

Key disclosure endpoint

Previously, we assumed the source repository would simply infer where to send disclosures. Ideally, we’d like this to be flexible, and the most apt analog that comes to mind is DMARC reporting. There, domains publish a TXT record that describes if and how they want to receive notices of DMARC failures, and we can follow a similar format here. For instance:

example.com TXT v=SECRET_SCANNING_v1; action=(notify|reject); disclosure=https://api.example.com/api/v1/secret_scanning;

A repository can look up this record when a secret is found, and even take different actions depending on the record. For instance, an action=reject could force the repository to reject the push. The disclosure attribute defines where to send the disclosure notice. Source repositories should make sure to validate DNSSEC and follow other best practices when retrieving this record.
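Parsing such a record might look like the sketch below. The function name parse_secret_scanning_record is hypothetical, and the validation rules (requiring an https disclosure URL and one of the two listed actions) are assumptions layered on top of the example record; fetching the TXT record and DNSSEC validation are out of scope here.

```python
def parse_secret_scanning_record(txt):
    """Parse a 'v=SECRET_SCANNING_v1; action=...; disclosure=...;' TXT value.

    Returns a dict of fields, or None if the record is not usable.
    """
    fields = {}
    for part in txt.split(";"):
        part = part.strip()
        if not part:
            continue  # tolerate the trailing semicolon
        name, sep, value = part.partition("=")
        if not sep:
            return None  # malformed tag without '='
        fields[name.strip().lower()] = value.strip()

    if fields.get("v") != "SECRET_SCANNING_v1":
        return None
    if fields.get("action") not in ("notify", "reject"):
        return None
    # Assumption: disclosures are only ever sent over HTTPS.
    if not fields.get("disclosure", "").startswith("https://"):
        return None
    return fields


if __name__ == "__main__":
    record = (
        "v=SECRET_SCANNING_v1; action=reject; "
        "disclosure=https://api.example.com/api/v1/secret_scanning;"
    )
    print(parse_secret_scanning_record(record))
```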

GitHub: Please do this!

GitHub (along with others) has done a great service to the community by scanning for and disclosing secrets. But we could do so much better! As a leader in the space, GitHub has the opportunity to bring platforms together and standardize secret scanning to make open-source software safer for all. While a standardized solution may not look exactly like what I’ve described above, I hope this can help guide the discussion towards making standardized secret scanning a reality.

Eric Pauley
PhD Student & NSF Graduate Research Fellow

My current research focuses on practical security for the public cloud.