$ echo -n Vm0 | base64
Vm0w
It can be extended indefinitely one character at a time, but there will always be some suffix. (Because the output is necessarily 8/6 the size of the input, the suffix always adds 33% to the length.)
#!/usr/bin/env python3
import base64

def len_common_prefix(a, b):
    assert len(a) < len(b)
    for i in range(len(a)):
        if a[i] != b[i]:
            return i
    return len(a)

def calculate_quasi_fixed_point(start, length):
    while True:
        tmp = base64.b64encode(start)
        l = len_common_prefix(start, tmp)
        if l >= length:
            return tmp[:length]
        print(tmp[:l].decode('ascii'), tmp[l:].decode('ascii'), sep='\v')
        # Slicing beyond end of buffer will safely truncate in Python.
        start = tmp[:l*4//3+4]  # TODO is this ideal?

if __name__ == '__main__':
    final = calculate_quasi_fixed_point(b'\0', 80)
    print(final.decode('ascii'))
This ultimately produces: Vm0wd2QyUXlVWGxWV0d4V1YwZDRWMVl3WkRSV01WbDNXa1JTVjAxV2JETlhhMUpUVmpBeFYySkVUbGho

Probably not a very useful trick outside of certain specific environments.
JWT does it as well.
Even in this example, they are double base64 encoding strings (the salt).
It's really too bad that there's nothing quite like JSON. Everything speaks it and can write it. It'd be nice if something like protobuf were easier to write and read in a schemaless fashion.
asn.1 is super nice -- everything speaks it and tooling is just great (runs away and hides)
The purpose of Base64 is to encode data—especially binary data—into a limited set of ASCII characters to allow transmission over text-based protocols.
It is not a cryptographic library nor an obfuscation tool.
Avoid encoding sensitive data using Base64, and don't include sensitive data in your JWT payload unless it is encrypted first.
And of course text-based things themselves are quite wasteful.
And before "space is cheap": JWT is used in contexts where space is generally not cheap, such as in HTTP headers.
You have to ask the question "why are we encoding this as base64 in the first place?"
The answer to that is generally that base64 plays nice with HTTP headers: it has no newlines or special characters that need special handling. Then you ask "why encode JSON?" and the answer is "because JSON is easy to handle". Then you ask "why embed a base64 field in the JSON?" and the answer is "JSON doesn't handle binary data".
These are all choices that ultimately create a much larger text blob than it needs to be. And because this blob is being used for security purposes, it gets forwarded in the request headers for every request. Now your simple "DELETE foo/bar" endpoint ends up requiring a 10kb header of security data just to make the request. Or if you are doing HTTP/2, it means your LB will end up storing that 10kb blob for every connected client.
Just wasteful. Especially since it's a total of about 3 or 4 different fields with relatively fixed sizes. It could have been base64(key_length(1 byte)|iterations(4 bytes)|hash_function(1 byte)|salt(32 bytes)), which would have produced something like a 51-character base64 string. The example is 3x that size (156 characters). It gets much worse than that on real systems I've seen.
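To make that concrete, here's a minimal Python sketch of that hypothetical layout (the field names and sizes are just the ones proposed above, not any real standard):

import base64
import struct

# Hypothetical layout from above: key_length (1 byte), iterations (4 bytes),
# hash_function id (1 byte), salt (32 bytes) = 38 bytes total.
# A real salt would come from os.urandom(32); zeros keep the sketch short.
packed = struct.pack(">BIB32s", 32, 600000, 1, bytes(32))
token = base64.b64encode(packed).decode("ascii")
print(len(packed), len(token))  # 38 bytes in -> 52 base64 chars (51 without padding)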
messagepack/cbor are very similar to json (schemaless, similar primitive types) but can support binary data. bson is another similar alternative. All three have implementations available in many languages, and have been used in big mature projects.
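For example, with the msgpack Python package (a sketch, assuming it's installed; cbor2's dumps/loads read almost identically), binary fields round-trip with no base64 layer at all:

import msgpack  # pip install msgpack

doc = {"hash": "scrypt", "iterations": 16384, "salt": b"\x00" * 32}
blob = msgpack.packb(doc)         # the 32 salt bytes are embedded raw
restored = msgpack.unpackb(blob)
print(restored == doc)            # True; bytes come back as bytes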
If you just want a generic, binary, hierarchical type-length-value encoding, have you considered https://en.wikipedia.org/wiki/Interchange_File_Format ?
It's not that there are widely-supported IFF libraries, per se; but rather that the format is so simple that as long as your language has a byte-array type, you can code a bug-free IFF encoder/decoder in said language in about five minutes.
(And this is why there are no generic IFF metaformat libraries, ala JSON or XML libraries; it's "too simple to bother everyone depending on my library with a transitive dependency", so everyone just implements IFF encoding/decoding as part of the parser + generator for their IFF-based concrete file format.)
What's IFF used in? AIFF; RIFF (and therefore WAV, AVI, ANI, and — perhaps surprisingly — WebP); JPEG2000; PNG [with tweaks]...
• There's also a descendant metaformat, the ISO Base Media File Format ("BMFF"), which in turn means that MP4, MOV, and HEIF/HEIC can all be parsed by a generic IFF parser (though you'll miss breaking some per-leaf-chunk metadata fields out from the chunk body if you don't use a BMFF-specific parser.)
• And, as an alternative, there's https://en.wikipedia.org/wiki/Extensible_Binary_Meta_Languag... ("EBML"), which is basically IFF but with varint-encoding of the "type" and "length" parts of TLV (see https://matroska-org.github.io/libebml/specs.html). This is mostly currently used as the metaformat of the Matroska (MKV) format. It's also just complex enough to have a standalone generic codec library (https://github.com/Matroska-Org/libebml).
My personal recommendation, if you have some structured binary data to dump to disk, is to just hand-generate IFF chunks inline in your dump/export/send logic, the same way one would e.g. hand-emit CSV inline in a printf call. Just say "this is an IFF-based format" or put an .iff extension on it or send it as application/x-iff, and an ecosystem should be able to run with that. (And just like with JSON, if you give the IFF chunks descriptive names, people will probably be able to suss out what the chunks "mean" from context, without any kind of schema docs being necessary.)
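As a sketch of how little code that takes (this uses the classic big-endian IFF chunk layout; RIFF flips the length field to little-endian):

import io
import struct

def write_chunk(out, chunk_id, payload):
    # Classic IFF chunk: 4-byte ASCII id, 4-byte big-endian payload length,
    # the payload, then one pad byte if the payload length is odd.
    assert len(chunk_id) == 4
    out.write(chunk_id)
    out.write(struct.pack(">I", len(payload)))
    out.write(payload)
    if len(payload) % 2:
        out.write(b"\x00")

def read_chunks(data):
    pos = 0
    while pos + 8 <= len(data):
        chunk_id = data[pos:pos + 4]
        (length,) = struct.unpack(">I", data[pos + 4:pos + 8])
        yield chunk_id, data[pos + 8:pos + 8 + length]
        pos += 8 + length + (length % 2)  # skip the pad byte on odd lengths

buf = io.BytesIO()
write_chunk(buf, b"NAME", b"hello")
print(list(read_chunks(buf.getvalue())))  # [(b'NAME', b'hello')]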
I got grief for saying that I prefer TLV data over textual data (even if the data is text) because of how easy it is to write code to output and ingest this format, and it is way, WAY faster than JSON will ever be.
It really is a very easy way to get much faster transmission of data over the wire than JSON, and it's dead easy to write viewers for. It's just an underrated way to store binary data; storing things as binary is underrated in general.
"eeey bruh, open the the API it's me"
Actual RSA OID is somewhere in the middle.
`ey` could be any JSON, but it's most likely going to be a JWT.
Neither is a perfect signal, but contextually each is more likely correct than not.
I work with this stuff often enough to recognize something that looks like a key or a hash. I don't work with it often enough to have picked up `ey` and `LS`.
The PEM format (which begins with `-----BEGIN [CERTIFICATE|CERTIFICATE REQUEST|PRIVATE KEY|X509 CRL|PUBLIC KEY]-----`) is already Base64 within the body; the header and footer are ASCII, and shouldn't be encoded[0] (there's no link for the claim, so perhaps there's another format similar to PEM?)
You can't spot private keys, unless they start with a repeating text sequence (or use the PEM format with header also encoded).
In practice, you will spot fully b64 encoded PEMs all the time once you have Kubernetes in play... create a Secret from a file and that's what you will find.
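That's also where the `LS` mentioned above comes from: the leading dashes of the PEM header. A quick check:

import base64

enc = base64.b64encode(b"-----BEGIN CERTIFICATE-----")
print(enc[:8])  # b'LS0tLS1C' -- the leading dashes always encode to "LS0t..."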
Spending hours wrangling sendmail.cf, and finally succeeding, felt like a genuine accomplishment.
Nowadays, things just work, mostly. How boring.
I recently installed Tru64 UNIX on a DEC Alpha I got off eBay. I felt like it was more sluggish than it should be, so I looked around at man pages about the VM (virtual memory, not virtual machine) subsystem, and was amazed at how cleanly and in how much detail it was described, and what insights I could get about its state. The sys_attrs_vm man page alone, which just describes every VM-layer tunable, gave a pretty good description of what the VM subsystem does, how each of those tunables affects it, and why you might want to change it.
Nowadays, things are massively complex, underdocumented (or just undocumented), constantly changing, and often inconsistent between sub-parts. Despite thinking that I have both wide and deep knowledge (I'm a low-level kernel code dev), it often takes me ages to figure out the root cause of sometimes even simple problems.
When one of my tests crashed one of those unprotected mainframes, two guys who were then close to my age now stared at an EBCDIC core dump, one of them slowly hitting page down, one Matrix-like screen after another, until they both jabbed at the screen and shouted "THERE!" simultaneously.
(One of them hand delivered the first WATFOR compiler to Yorktown, returning from Waterloo with a car full of tapes. I have thought of him - and this "THERE!" moment - every time I have come across the old saw about the bandwidth of a station wagon.)
It doesn’t even need to be much better than ROT13. Security by obscurity is good for this situation.
$ echo '{"' | base64
eyIK
$ echo "{\"" | base64
https://altcodeunicode.com/ascii-american-standard-code-for-...
The first nibble (hex digit) shows your position within the chart, approximately like 2 = punctuation, 3 = digits, 4 = uppercase letters, 6 = lowercase letters. (Yes, there's more structure than that considering it in binary.)
For digits (first nibble 3), the value of the digit is equal to the value of the second nibble.
For punctuation (first nibble 2), the punctuation is the character you'd get on a traditional U.S. keyboard layout pressing shift and the digit of the second nibble.
For uppercase letters (first nibble 4, then overflowing into first nibble 5), the second nibble is the ordinal position of the letter within the alphabet. So 41 = A (letter #1), 42 = B (letter #2), 43 = C (letter #3).
Lowercase letters do the same thing starting at 6, so 61 = a (letter #1), 62 = b (letter #2), 63 = c (letter #3), etc.
The tricky ones are the overflow/wraparound into first nibble 5 (the letters from letter #16, P) and into first nibble 7 (from letter #16, p). There you have to actually add 16 to the letter position before combining it with the second nibble, or think of it as "letter #0x10, letter #0x11, letter #0x12...", which may be less intuitive for some people.
Again, there's even more structure and pattern than that in ASCII, and it's all fully intentional, largely to facilitate meaningful bit manipulations. E.g. converting uppercase to lowercase is just a matter of adding 32, or a logical OR with 0x20 (binary 0010 0000). Converting lowercase to uppercase is just a matter of subtracting 32, or a logical AND with 0xDF (binary 1101 1111).
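A quick sketch of those bit tricks in Python:

# Uppercase to lowercase: OR in the 0x20 bit; lowercase to uppercase: mask it off.
print(chr(ord("A") | 0x20))          # a
print(chr(ord("a") & 0xDF))          # A
# The second nibble is the letter's position in the alphabet:
print(hex(ord("C")), hex(ord("c")))  # 0x43 0x63 -- letter #3 either way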
For reading hex dumps of ASCII, it's also helpful to know that the very first printable character (0x20) is, ironically, blank -- it's the space character.
0 1 2 3 4 5 6 7 8 9 A B C D E F
..
2 ! " # $ % & ' ( ) * + , - . /
3 0 1 2 3 4 5 6 7 8 9 : ; < = > ?
4 @ A B C D E F G H I J K L M N O
5 P Q R S T U V W X Y Z [ \ ] ^ _
6 ` a b c d e f g h i j k l m n o
7 p q r s t u v w x y z { | } ~
I don't have a mnemonic for punctuation characters with second nibble >9, or for the backtick. The @ can be remembered via Ctrl+@, which is a way of typing the NUL character, ASCII 00 (also not coincidental; compare to Ctrl+A, Ctrl+B, Ctrl+C... for inputting ASCII 01, 02, 03...).

I use a different layout so I'd never realised there was method to the madness! I get the following:
$ echo -n ' !@#$%^&*(' | xxd -p
2021402324255e262a28
https://en.wikipedia.org/wiki/File:Remington_2_typewriter_ke...
I forget the story about what changed for shift-6 through shift-9.
When I say "traditional U.S. keyboard layout" I mean to contrast this with the modern one, which is the same as what you and I have.
Good times.
{" is ASCII 01111011, 00100010
Base64 takes 3 bytes x 8 bits = 24 bits, groups that 24 bit-sequence into four parts of 6 bits each, and then converts each to a number between 0-63. If there aren't enough bits (we only have 2 bytes = 16 bits, we need 18 bits), pad them with 0. Of course in reality the last 2 bits would be taken from the 3rd character of the JSON string, which is variable.
The first 6 bits are 011110, which in decimal is 30.
The second 6 bits are 110010, which in decimal is 50.
The last 4 bits are 0010. Pad it with 00 and you get 001000, which is 8.
Using an encoding table (https://base64.guru/learn/base64-characters), 30 is e, 50 is y and 8 is I. There's your "ey".
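You can check this in a couple of lines; the fourth output character already depends on the byte after the quote:

import base64

print(base64.b64encode(b'{"a'))  # b'eyJh'
print(base64.b64encode(b'{"x'))  # b'eyJ4' -- "eyJ" is fixed, the fourth character varies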
Funny how CS people are so incurious now; this blog post touches the surface but didn't get into the explanation.
https://web.cs.ucdavis.edu/~rogaway/classes/188/materials/th...
I've been doing this a long time but until today the only one I'd noticed was "MII".
> I did a few tests in my terminal, and he was right!
He clearly had no clue how base64 worked. You don't need a test if you know it.
> As pointed out by gnabgib and athorax on Hacker News, this actually detects the leading dashes of the PEM format
They needed help for this. I'm not sure that they have even now opened Wikipedia to understand how base64 works. The whole article has an "it's magic!" vibe.
They could just as easily have felt the underlying reason was so obvious it wasn’t worth mentioning.
I know how base64 encoding works but had never noticed the pattern the author pointed out. As soon as I read it, I understood why. It didn't occur to me that the author should have explained it at a deeper level.
I was incredulous but gave it a go, and it worked!!
Even if you don't notice the ey specifically, the string itself just screams base64 encoding, regardless of what's actually inside.

One blog post is hardly enough to judge someone as ignorant, but after a quick look at the author's writing/coding/job history, I doubt he is that either.
I think it's fantastic that you can look at a string and feel its base64 essence come through without a decoder. Thinking about it for a minute, I suspect I could train myself to do the same. If someone who already knew how to do it well wrote a how-to, I bet it would hit the front page and inspire many people, just like this article did.
I just don't get the urge to dump on the original author for sharing a new-to-him insight.
I know eyJhbG by heart
Base64 works on blocks of 3 input bytes, and these blocks can be considered independent of each other. So for example, with the string "Hello world", you can do the following base64 transformations:
* "Hel" -> "SGVs"
* "lo " -> "bG8g"
* "wor" -> "d29y"
* "ld" -> "bGQ="
These encoded blocks can then be concatenated together and you have your final encoded string: "SGVsbG8gd29ybGQ="
(Notice that the last one ends in an equals sign. This is because the input is less than 3 characters, and so in order to produce 4 characters of output, it has to apply padding - part of which is encoded in the third digit as well.)
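A quick check of that concatenation property:

import base64

parts = [b"Hel", b"lo ", b"wor", b"ld"]
joined = b"".join(base64.b64encode(p) for p in parts)
print(joined)                                      # b'SGVsbG8gd29ybGQ='
print(joined == base64.b64encode(b"Hello world"))  # True (only the final block is short)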
It's important to note that this is simply a byproduct of the way that base64 works, not actually an intended thing. My understanding is that it's basically like how if you take an ASCII character - which could be considered a base 256 digit - and convert it to hexadecimal (base 16), the resulting hex number will always be two digits long - the same two digits, at that - even if the original was part of a larger string.
In this case, every three base 256 digits will convert to four base 64 digits, in the same way that it would convert to six base 16 digits.
Besides that, I just spent way too much time figuring out this is an encrypted OpenTofu state. It just looked way too much like a terraform state but not entirely. Tells ya what I spend a lot of time with at work.
This is probably another interesting situation in which you cannot read the state, but you can observe changes and growth by observing the ciphertext. It's probably fine, but remains interesting.
Is this the state of modern understanding of basic primitives?
Also, it seem like the really important point is kind of glossed over. Base64 is not a kind of encryption, it's an encoding that anybody can easily decode. Using it to hide secrets in a GitHub repo is a really really dumb thing to do.
It was amazing to see him decode VISA and MASTER transactions on the fly in logs and other places.
Edit: Now that I looked at it a little deeper, i'm assuming they are talking about these[0] sort of files?
- `R0lGOD` - GIF files
- `iVBOR` - PNG files
- `/9j/` - JPG files
- `eyJ` - JSON
- `PD94` - XML
- `MII` - ASN.1 file, such as a certificate or private key
These are nice to know since they show up pretty frequently (images in data: URLs, JSON/XML/ASN.1 in various protocols).
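A toy sketch of that recognition as code (the prefix-to-type table is just the list above):

PREFIXES = {
    "R0lGOD": "GIF image",
    "iVBOR": "PNG image",
    "/9j/": "JPEG image",
    "eyJ": "JSON (likely a JWT if it has three dot-separated parts)",
    "PD94": "XML",
    "MII": "ASN.1/DER, e.g. a certificate or key",
}

def sniff(b64: str) -> str:
    for prefix, kind in PREFIXES.items():
        if b64.startswith(prefix):
            return kind
    return "unknown"

print(sniff("eyJhbGciOiJIUzI1NiJ9"))  # JSON (likely a JWT ...)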
It makes more sense to transmit binary formats in binary.
You would save bandwidth, memory and a decoding step.
Then you could also inspect the header bytes, instead of memorizing how they present in some intermediate encoding.
I wrote a glTF model converter once. 99% of those millions of JSON files I wrote were base64-encoded binary data.

A single glTF model sometimes wants to be two files on disk: one for the JSON and one for the binary data, and you use the JSON to describe where in the binary data the vertices are defined, plus other windows for where the various other bits (the triangles, triangle fans, textures, and other stuff) are stored. But you can also base64 encode that data and put it in the JSON file and not have a messy double-file model. So that's what I did, and I hated it. But it still felt better than having .gltf files and .bin files which together made up a single model file.
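For the curious, the embedded form looks roughly like this (a trimmed-down sketch, not a complete glTF asset):

import base64
import json
import struct

# Illustrative only: one buffer holding three x,y,z vertex positions,
# embedded as a base64 data URI instead of a separate .bin file.
vertices = struct.pack("<9f", 0, 0, 0, 1, 0, 0, 0, 1, 0)
gltf = {
    "buffers": [{
        "byteLength": len(vertices),
        "uri": "data:application/octet-stream;base64,"
               + base64.b64encode(vertices).decode("ascii"),
    }]
}
print(json.dumps(gltf, indent=2))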