Climbing MIME Improbable

A MIME type is a simple two-part identifier used to categorize (virtually all) file and content formats. It is particularly useful on the internet, where networks and browsers often have to make educated guesses about what a stream of content is before the substance of that stream has been fully downloaded. Your desktop operating system, too, can use this information to determine the best program to open a given file with, even if that file has been incorrectly named.

It's standard, universal, and perfect.

Haha. Just kidding.

(un)Standard

The official MIME type registry is maintained by the Internet Assigned Numbers Authority (IANA). Its archive is vast and the web site's spartan design has a reassuring bureaucracy-first charm to it. But it is incomplete.

Genesis 1:28

As with many problems of the modern era, the origin of IANA's struggles can be traced to the Bible, specifically Genesis 1:28. "Then God blessed Microsoft and said, 'Be fruitful and multiply. Fill the earth with your crap.'" And so they did, and others did, and everyone does still. Technology evolves at a feverish pace. Some ideas barely get off the ground, while others quickly rise to dominate the market. Any attempt to catch 'em all will be at least somewhat fruitless.

Excitement

But sometimes The Next Big Thing can't come soon enough. When a new technology shows particular promise, application vendors can't afford to wait for it to actually exist. No, they have to immediately begin working on integrating support for the format within their applications. Their users, and by extension content providers, then must hurriedly begin using the format. All this despite the fact that it may be in its infancy, with a long road and many changes yet to come before it reaches maturity.

Meanwhile, this format needs a MIME type so the applications can handle it. Not official, because it doesn't officially exist yet, but something logical. How about application/x-font-woff or application/font-woff or application/woff or, you know what, there are so many fonts now, how about something standard for all of them, like font/woff?
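To make the identity crisis concrete, here is a minimal Python sketch of an alias table that folds all four spellings above into one. The table and the `canonical_type` helper are hypothetical names invented for illustration, and picking font/woff as the canonical spelling is simply this sketch's choice.

```python
# Hypothetical alias table: every spelling of WOFF mentioned above, mapped
# to the one this sketch chooses to treat as canonical.
WOFF_ALIASES = {
    "application/x-font-woff": "font/woff",
    "application/font-woff": "font/woff",
    "application/woff": "font/woff",
    "font/woff": "font/woff",
}


def canonical_type(mime: str) -> str:
    """Return the canonical spelling of a type, or the cleaned input if unknown."""
    cleaned = mime.strip().lower()
    return WOFF_ALIASES.get(cleaned, cleaned)
```

Anything the table has never heard of passes through untouched (just trimmed and lowercased), so the helper cannot make an unfamiliar type worse.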

And so across space and time, a meme war rages. Ultimately, if the format reaches maturity, it will have an official IANA type, and then a new era of peace can be ushered in.

History

Only unlike people, technology never truly dies. Once one version of one program has said, "You Are This," there will be subsets of the population using that software decades beyond its best-by date. And so our friend the woff is left with an identity crisis it can never truly shake. As it walks down the street, people will greet it by name, but more often than not it won't be the one on its driver's license.

Which Leaves Us...

And so, we have a standard which isn't standard, or complete, and even when there is a standard standard and it is complete, many platforms of different ages or levels of competency will still disagree.

In short, we have what programmers call a Grade-A Clusterfsck.

This is particularly bad for web developers who have to deal with file and content formats on a daily basis, where getting it wrong can crash an application or open up a Pandora's Box of security vulnerabilities. But even with the stakes so high, no adequate tools existed to address the problem.

Ignoring the chorus of "Life sucks and then you die," I decided to place a foot at the base of MIME Improbable.

Make It Work

Someone (not Tim Gunn) once said I was like the investigative journalist version of a programmer. The more I thought about it, the more apt that label seemed. The MIME problem isn't impossible. It is just tedious, confusing, and full of nuance, twists, turns, and intrigue. I'm old enough to have been imbued with an attention span! I rather like crawling in the muddy streams of bits and bytes.

Call me Deep Float.

Research and Data

And so I began.

Step one was to figure out exactly what the hell MIMEs are, how they work, and where they come from. We investigative journalists call this "background". You know, all of the words in all the paragraphs preceding this one.

The next step was to come up with a plan. Oddly, the problem of history is almost always completely ignored by all applications and MIME databases. Everywhere I looked, there was just a One Format, One Type declaration. That kind of approach is fine if you're working in a bubble, but as soon as an outside source is consulted, the system breaks down like the Tower of Babel.

Last month, the ubiquitous blogging platform WordPress discovered this for itself when a new line of code inserted as a security measure in a dot-dot release ended up severely breaking the file upload functionality for its millions of users. (The issue is still unresolved 4+ months later and is beginning to look like a WONT-FIX situation. WP users are encouraged to install the plugin Lord of the Files to fix that and many other upload-related security issues.)

So I decided that one:one was no fun, and instead set about building an array. Since almost no single data source can be bothered to do the same, I clearly needed to combine multiple sources, preferably ones which disagree with each other.
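The idea of one format, many types can be sketched in a few lines of Python. This is only an illustration of the merging approach, not the code I actually used; `merge_sources` is a hypothetical helper and the sample data is made up rather than copied from any registry.

```python
from collections import defaultdict


def merge_sources(*sources):
    """Combine several {extension: [types]} maps, keeping every distinct type."""
    merged = defaultdict(list)
    for source in sources:
        for ext, types in source.items():
            for mime in types:
                # Sources that disagree both get a seat at the table;
                # exact duplicates are only recorded once.
                if mime not in merged[ext]:
                    merged[ext].append(mime)
    return dict(merged)


# Made-up sample data standing in for two sources that disagree.
source_a = {"woff": ["font/woff"]}
source_b = {"woff": ["application/font-woff"], "svg": ["image/svg+xml"]}

combined = merge_sources(source_a, source_b)
```

Types are kept in first-seen order, so whichever source is passed first gets to name the "primary" type for each extension.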

As a first pass, I settled on IANA's master list, because, duh, it's the real one, and added to that the datasets used by the two most popular server software platforms, Apache and Nginx. This gave me about 1000 MIME types right out of the gate, a good start, and about 10x what WordPress, for example, can internally decode.

Unfortunately the first thousand entries didn't do much to address the history problem. All three sources are kept fairly up-to-date.

The next breakthrough was the addition of the list compiled by Freedesktop.org, which many Linux systems use internally to make sense of mystery files. That list contains information about many alien file types unknown to the others, and also includes information about aliases and parent classifications (which, oddly, some platforms will return instead of the actual specific type for the specific thing in question).
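Here is a rough Python sketch of why aliases and parents matter when checking a file. The record layout and the `match_kind` helper are hypothetical, invented for this example; the only real-world fact leaned on is that a .docx file is a zip archive underneath, so a platform reporting application/zip for one is being vague rather than wrong.

```python
# Illustrative record only: the field names are hypothetical, not any
# particular database's schema.
DOCX = {
    "primary": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    "aliases": [],
    "parents": ["application/zip"],
}


def match_kind(reported, record):
    """Classify a reported type as 'exact', 'alias', 'parent', or 'mismatch'."""
    reported = reported.strip().lower()
    if reported == record["primary"]:
        return "exact"
    if reported in record["aliases"]:
        return "alias"
    if reported in record["parents"]:
        return "parent"
    return "mismatch"
```

A caller can then decide how forgiving to be: accept a "parent" match for a casual upload form, say, but insist on "exact" or "alias" where the stakes are higher.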

This brought the total list to around 1400 definitions. I thought I was done, patted my ego on its back, and congratulated it on an impossible job well done in a single afternoon.

Then I remembered Apple, that reclusive and spiteful Unix hermit. Apple, it seems, doesn't bother registering its file types with bodies like IANA, and since nobody actually uses iWork, none of those files end up online and so escape the notice of server software. "Pages? We don't need no stinking Pages."

So more digging.

I ended up adding a fifth dataset compiled by an offshoot of Apache called Tika. While largely drawn from the Freedesktop.org data I was already using, it is nicely expanded to include many wacky Mac-y media types.

And that, dear reader, brings us to the present, 1800 entries and counting.

The Scoop

But that isn't quite the end of the story. If you happened to click any of the above data links, you'll have noticed that they're all wildly different. Some are at least meant to be parsed, but IANA's list, for example, isn't a list at all, but a collection of hundreds of web pages.

Thankfully, I'm a fully fledged RegEx wizard (Grepindor, Class of Y2K) and so was able to cast some esoteric parsing spells to rein in the databeasts, grind them up and mold them into a more user-friendly sausage.
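For a taste of the tamer end of that spellbook, here is a small Python sketch that parses text laid out like an Apache-style mime.types file. It assumes the simple "type, whitespace, extensions" layout with # comments; the pattern, the `parse_mime_types` helper, and the sample lines are written for this example and are not the parser used for the real thing.

```python
import re

# One "type/subtype", some whitespace, then one or more extensions.
LINE = re.compile(r"^\s*([\w.+-]+/[\w.+-]+)\s+(.+?)\s*$")


def parse_mime_types(text):
    """Turn mime.types-style text into an {extension: [types]} map."""
    table = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0]  # drop comments
        match = LINE.match(line)
        if not match:
            continue  # blank line, comment, or a type with no extensions
        mime = match.group(1).lower()
        for ext in match.group(2).lower().split():
            table.setdefault(ext, []).append(mime)
    return table


# Sample lines in the assumed layout, written for this example.
sample = """
# a comment line
application/font-woff    woff
audio/ogg                oga ogg spx
text/html                html htm
"""

parsed = parse_mime_types(sample)
```

The result is keyed by extension rather than by type, which makes it easy to fold several parsed sources together later.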

With the data digestible, all that remained was the buns, so to speak, a few simple functions to complement the database, tying it all together into one tidy, palatable framework.

The end result of these labors is an open-source PHP library called blob-mimes. Any developer can now incorporate that library into their project and begin to make sense of the files and content streams floating across their code.

The future is bright and hopeful. Our children will someday be able to sit together and share an audio/ogg in harmonious agreement.

	Josh Stoik 8 March 2017