Better Binary Batter: Mixing Base64 and Uint8Array

In the same way that all of life's diversity can be boiled down to just three domains — Archaea, Bacteria, and Eukarya — every file on your computer can be split out into one of just two groups: binary and text.

These labels are admittedly unfortunate because, ultimately, all digital content is binary data, just bits of 0s and 1s. But forget about that for a second. In this context, it's about what the bits represent.

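Text files are the ones whose bits represent human-readable characters; binary files are everything else. Those are gross oversimplifications, but they should do.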

Storing Binary In Text

Sometimes it is necessary to embed binary data inside a text file. Not usually, of course. But _some_times.

For example, in order to streamline the installation process for JS Mate Poe (a fun JavaScript port of the beloved 16-bit Screen Mate Poe Windows application), the binary dependencies (a PNG image sprite and three MP3 audio files) are embedded directly inside the script file, which is text.

The trick is employing a little binary alchemy, temporarily transmuting the wild binary data into tame, text-safe data for storage purposes.

Sticking with our JavaScript example, there are two main ways to go about this.

Most front-end web developers facing such a situation would automatically reach for Base64, a common encoding scheme that plucks characters from a 64-letter alphabet, each one representing six bits of binary data.

Support for Base64 encoding runs deep through the web technology ecosystem, but internally, most browsers natively store binary data in a special TypedArray consisting of unsigned 8-bit integers: a Uint8Array. Here, the "alphabet" is the set of decimal numbers from 0 to 255, where each number represents one byte of binary information.
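To make the two options concrete, here is a minimal sketch that writes the same eight bytes (the standard signature that opens every PNG file) both ways. The variable names are made up for illustration, and it assumes an environment where btoa() is available globally, as it is in browsers.

```javascript
// The first eight bytes of any PNG file (the PNG signature).
const signature = new Uint8Array([137, 80, 78, 71, 13, 10, 26, 10]);

// Option one: Base64 text. btoa() expects a "binary string" holding
// one character per byte, so the bytes are converted to that first.
const asBase64 = btoa(String.fromCharCode(...signature));

// Option two: the array written out as text, the way it would be
// pasted into a script file.
const asLiteral = '[' + signature.join(',') + ']';

console.log(asBase64);  // iVBORw0KGgo=
console.log(asLiteral); // [137,80,78,71,13,10,26,10]
```

Both strings describe exactly the same bytes; they only differ in how much text it takes to say so.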

Consequences of Storage

Data has weight, but so too does encoding.

Take, for example, the image sprite from JS Mate Poe. Its natural weight, when stored as a binary PNG file, is 30,454 bytes.

If we were to Base64-encode the binary data from that PNG image, we'd get the following:

iVBORw0KGgoAAAANSUhEUgAAAoAAAAG4CAMAAADi0qZMAAAAhFBMVEUAAP8AAAD/3Mf//Nn/////xKH/9pH/ev9zAHOzALP/qHX…

Encoded thusly, the wild binary data is tame enough for text time, but unfortunately it clocks in at 40,609 bytes, a third larger than the original file. Base64-encoded data is always about a third larger than the raw binary equivalent because a single character from the Base64 alphabet occupies a full byte of text but can only represent six bits of information. Because a byte actually contains eight bits, this effectively means two bits out of every byte are lost to bureaucracy.
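The one-third penalty falls out of simple arithmetic: standard padded Base64 emits four characters for every three bytes of input, rounding up. A small sketch (the function names are hypothetical, not part of JS Mate Poe):

```javascript
// Length of the padded Base64 text for a given number of raw bytes:
// every group of up to three bytes becomes four characters.
function base64Length(byteCount) {
    return 4 * Math.ceil(byteCount / 3);
}

// Size of the encoded text relative to the raw data.
function base64Overhead(byteCount) {
    return base64Length(byteCount) / byteCount;
}

console.log(base64Length(30454)); // 40608
console.log(base64Overhead(30454).toFixed(3)); // 1.333
```

For the 30,454-byte sprite this gives 40,608 characters, within a byte of the figure quoted above.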

But such is the cost of encoding. How does the browser-approved Uint8Array stack up?

[137,80,78,71,13,10,26,10,0,0,0,13,73,72,68,82,0,0,2,128,0,0,1,184,8,3,0,0,0,226,210,166,76,0,0,0,1…

Much worse: 108,596 bytes.

Rather than being a third bigger, it is more than three times the size of the original file. What the hell?

In this case, the exact penalty will vary because, unlike Base64, which always uses one letter to represent six bits, a Uint8Array written out as text uses numbers of variable length to represent whole bytes. A small number, like "5", only requires one byte of storage to represent one byte of information. That's a solid trade! But for a large number, like "137", three bytes of storage are required to represent one byte of information, and every entry but the last needs a comma on top of that. And that sucks!
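The variable cost is easy to tally. The sketch below (hypothetical helper names) counts the characters needed to write a list of bytes as an array literal, and then estimates the average cost per byte under the assumption that all 256 values are equally likely, which is only an approximation for compressed data like a PNG.

```javascript
// Characters needed to write the bytes as text, e.g. "[5,137]".
function literalLength(bytes) {
    if (bytes.length === 0) return 2; // Just "[]".
    let total = 2 + (bytes.length - 1); // Brackets plus commas.
    for (const value of bytes) {
        total += String(value).length; // One to three digits each.
    }
    return total;
}

// Average characters per byte (digits plus one comma), assuming
// every value from 0 to 255 turns up equally often.
function averageCostPerByte() {
    let digits = 0;
    for (let value = 0; value < 256; value++) {
        digits += String(value).length;
    }
    return digits / 256 + 1;
}

console.log(literalLength([5])); // 3
console.log(literalLength([137, 80, 78, 71, 13, 10, 26, 10])); // 26
console.log(averageCostPerByte()); // 3.5703125
```

Roughly 3.57 characters per byte lines up well with the sprite's actual numbers, where 108,596 bytes of text stand in for 30,454 bytes of data.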

If your only concern is storage space, then the choice is clear: Base64 is much more efficient.

(While outside the scope of this article, if our theoretical text file were subsequently encoded with Gzip or Brotli, the overall script sizes would decrease, but their relative sizes would remain more or less the same. Gibberish just doesn't compress very well.)

Consequences at Runtime

But what about the runtime costs? JS Mate Poe is an animation library. In order to trick our monkey brains into believing separate, static pictures are a single, connected movement, the transitions have to happen quickly.

Here, it turns out, is where Uint8Array data shines. To understand why, it is first necessary to state that JavaScript does not execute text; it executes code.

At runtime, the first thing the browser does is parse the text file into data the JavaScript VM can run. As data, a JavaScript string, like our Base64, is a sequence of 16-bit code units, which can mean two bytes per letter (engines may store plain ASCII strings more compactly), doubling the effective size of the binary portion to as much as 81,218 bytes. Yikes! Entries in a Uint8Array, however, only require one byte each, dropping the effective size back to that of the original file (with some padding for buffers): 30,454 bytes.
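As a rough illustration of that accounting, the sketch below compares the same eight bytes held as a Base64 string and as a Uint8Array. The string figure assumes the two-bytes-per-character upper bound; what an engine actually allocates is an implementation detail and may well be smaller.

```javascript
// The PNG signature, once as Base64 text and once as raw bytes.
const asString = 'iVBORw0KGgo=';
const asBytes = new Uint8Array([137, 80, 78, 71, 13, 10, 26, 10]);

// Upper bound for the string: one 16-bit code unit per character.
// This is an assumption, not a measurement of any real engine.
const stringUpperBound = asString.length * 2;

// The typed array reports its backing storage size directly.
const typedArraySize = asBytes.byteLength;

console.log(stringUpperBound); // 24
console.log(typedArraySize);   // 8
```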

We have a winner!

But wait, there's more!

Because this is a DOM script, there are implications within the HTML and rendering scopes to consider. To understand that side of things, let's look at how they might be used:

For data in Base64 format, developers can simply use the data: URI scheme and call it a day. In practice, that would look like:

const img = new Image();
img.src = 'data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAoAAA…';
document.body.appendChild(img);

Afterward, you'd have something like this in the HTML:

<!-- Base64. -->
<img src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAoAAA…" />

That's just the straight data prefixed with a MIME type and ";base64,". Pretty easy. But note, we've just added about 40 _kilo_bytes of data to the DOM. That's in addition to the same data already existing in our script!

HTML does not understand Uint8Array data in its raw format. To inject an image into the DOM, it must first be converted to a Blob, which can then be fed through the URL API to generate a URI:

// First the Blob:
const blob = new Blob(
    [new Uint8Array([137,80,78,71,13,10,26,10,0,0,0,13,73,…])],
    { type: 'image/png' }
);

// Now the image.
const img = new Image();
img.src = URL.createObjectURL(blob);
document.body.appendChild(img);

The first takeaway is that in addition to being physically bigger, Uint8Arrays require more JavaScript code to handle, making the disk size of the script that much bigger still. But that's storage. Stop thinking about that. We've moved on!

The HTML that comes from this looks as follows:

<!-- Blob. -->
<img src="blob:null/31c9feff-bb91-4e87-a324-a9ae393f7ca3" />

The above source has not been truncated for legibility; it is what it is, roughly 40 bytes, or a string about 1,000x shorter than the Data URI was. That's… a lot smaller.

With just a single image on the page, the difference is three orders of magnitude. And if there were more than one image? Each time we'd add another 40KB for Data or another 40B for Blobs. (We're oversimplifying again, but you get the idea.)

But DOM waste is only part of the runtime story. We can't forget about the painting!

For the browser to make any sense of what we've given it, it must decode the image back into its original binary PNG format, and convert that into an uncompressed RGB(A) bitmap for display. For our Data URIs (Base64), this conversion occurs once per image element. Our Blobs (Uint8Array), on the other hand, all link to the same shared internal reference, data that was already in binary PNG form when we first ran URL.createObjectURL(). No Base64 decoding is left for the browser to do. (Again, lots of oversimplifications here, but the broad strokes are Good and True.)

Having and Eating Cake

So it would appear that in terms of storage, Base64 is the clear winner, but in terms of runtime resource usage and general performance, Uint8Arrays are better. If a project has a single limiting factor to consider, you'd just want to pick whichever peg best fits the hole in your heart.

With JS Mate Poe, though, both the file size and the runtime requirements are limiting factors. So what to do?

Easy: do it all!

Having your cake and eating it too is actually something computer systems do as a matter of course. The HTML behind this blog post, for example, was compressed server-side before being sent to your browser. Your browser had to take a moment to decompress its contents after receipt, but because your CPU is so much faster than your Internet connection, it ended up a net win.

The binary data embedded in our script can be thought of in much the same way. We can store it in the smallest, most compressed format possible (Base64), then convert it to something better (Uint8Array) for actual use.

The first thing we'll need is a function to convert a Base64 string into a Uint8Array (and from there, a Blob), as JavaScript offers atob() for decoding to a string but, at the time of writing, no one-step conversion to bytes:

/**
 * Base64 to Blob
 *
 * @param {string} data Data.
 * @param {string} type Content type.
 * @return {!Blob} Blob.
 */
const base64toBlob = function(data, type) {
	const bytes = atob(data);
	let length = bytes.length;
	let out = new Uint8Array(length);

	// Loop and convert.
	while (length--) {
		out[length] = bytes.charCodeAt(length);
	}

	return new Blob([out], { type: type });
};
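As a quick sanity check on that kind of decoding loop, the sketch below decodes a Base64 string to bytes and confirms they begin with the standard eight-byte PNG signature. The names base64ToBytes and looksLikePng are hypothetical helpers written for this illustration, not part of JS Mate Poe.

```javascript
// Every PNG file starts with these eight bytes.
const PNG_SIGNATURE = [137, 80, 78, 71, 13, 10, 26, 10];

// Decode Base64 text into raw bytes. atob() returns a string with
// one character per byte; charCodeAt() recovers each byte's value.
function base64ToBytes(data) {
    const binary = atob(data);
    const out = new Uint8Array(binary.length);
    for (let i = 0; i < binary.length; i++) {
        out[i] = binary.charCodeAt(i);
    }
    return out;
}

// True if the bytes start with the PNG signature.
function looksLikePng(bytes) {
    return bytes.length >= PNG_SIGNATURE.length &&
        PNG_SIGNATURE.every((value, i) => bytes[i] === value);
}

console.log(looksLikePng(base64ToBytes('iVBORw0KGgo='))); // true
console.log(looksLikePng(base64ToBytes('AAAA')));         // false
```

A check like this costs next to nothing and catches a mangled or mispasted string before it turns into a broken image.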

With that in place, the image-spawning code in our first two examples could look like:

const img = new Image();
img.src = URL.createObjectURL(
    base64toBlob(
        'iVBORw0KGgoAAAANSUhEUgAAAoAAA…',
        'image/png'
    )
);
document.body.appendChild(img);

The extra JavaScript code adds about 400 bytes to the script's total size, but that's a far cry from the roughly 68,000 bytes that would be added were we to store the image as a native Uint8Array literal instead.

There are also going to be some penalties during the execution of this code because we're running more operations than we would if the data were one way or the other. But in this particular case, that's okay, because this only runs once during initialization. All of the images that are spawned subsequently benefit from this pre-computed answer and so run more efficiently.
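One way to guarantee the conversion really does happen only once is to remember the generated URL. This is a minimal sketch rather than how JS Mate Poe actually does it: objectUrlCache and cachedObjectUrl are hypothetical names, and it assumes atob(), Blob, and URL.createObjectURL() are all available, as they are in browsers.

```javascript
// Hypothetical cache mapping Base64 text to its object URL.
const objectUrlCache = new Map();

// Return an object URL for the given Base64 data, decoding it and
// building the Blob only the first time it is requested.
function cachedObjectUrl(data, type) {
    let url = objectUrlCache.get(data);
    if (url === undefined) {
        const binary = atob(data);
        const bytes = new Uint8Array(binary.length);
        for (let i = 0; i < binary.length; i++) {
            bytes[i] = binary.charCodeAt(i);
        }
        url = URL.createObjectURL(new Blob([bytes], { type }));
        objectUrlCache.set(data, url);
    }
    return url;
}
```

Every image can then set its src to cachedObjectUrl(data, 'image/png') and share the one short URL. The URL stays valid until it is passed to URL.revokeObjectURL() or the page goes away.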

At the end of the day, it's what's best for Poe.

Josh Stoik
17 October 2019