Wed 13 May 2009
One of the chores that I do for the Apache HTTP server project, every three months or so, is to slog through the IANA media type registry to see what new media types have been registered and add them to the mime.types configuration file. This is one of the few things I do that is almost all pain for little or no gain. It takes hours to do it right because IANA has gone out of their way to make the registry impossible to process automatically via simple scripts. I don’t even get the pleasure of “changing the world” in some meaningful way, since Apache doesn’t update mime.types automatically when installed to an existing configuration.
BTW, if you are responsible for an existing Apache installation, please copy the current mime.types configuration file and install it manually — your users will
thank you later not gripe as much about unsupported media types.
IANA is a quaint off-shoot of the Internet Engineering Taskforce that, much like the IETF, is still stuck in the 1980s. One would think that, given a task like “maintain a registry of all media types” so that Internet software can communicate, would lead to something that is comprehensible by software. Instead, what IANA has provided is a collection of FTP directories containing a subset of private registry templates, each in the original (random) submitted format, and nine separate inconsistently-formated index.html files that actually contain the registered types.
The first thought that any Web developer has when they look at the registry is that it should be laid out as a resource space by type. That is, each directory under “media-types” would be a major type (e.g., application, text, etc.) and then each file within those directories would correspond to exactly one subtype (e.g., html, plain, csv, etc.). Such a design would be easy to process automatically and fits with the organization’s desire to serve everything via both FTP and HTTP. Sadly, that is not the case. Most of the private registrations have some sort of like-named file within the expected directory to contain its registration template, but the names do not always correspond exactly to the subtype and the contents are whatever random text was submitted (rather than some consistent format that could be extracted). What’s worse, however, is that the standardized types do not have any corresponding file; instead, the type’s entry in the index may have some sort of link to the RFC or external specification that defines that type.
The second thought of any Web developer would be “oh, I’ll just have to process the index files to extract the media type fields.” Good luck. The HTML is not well-formed (even by HTML standards). It uses arbitrarily-created tables to contain the actual registry information. There is no consistency across the files in terms of the number of table columns, nor any column headers to indicate what they mean. There is no mark-up to distinguish the registry cells from other whitespace-arranging layout cells. And the registered types are occasionally wrapped in inconsistently-targeted anchors for links to the aforementioned template files.
Okay, so the really stubborn Web developers think that maybe a browser can grok this tag soup and generate the table in some reasonably consistent fashion, which can then be screen-scraped to get the relevant information. Nope. It doesn’t even render the same on different browsers. In any case, the index files don’t contain the relevant information: the most important information (aside from the type name) is the unique filename extension(s) that are supposed to be used for files of that type. For that information, we have to follow the link to the registry template file, or RFC containing one or more template files, and look for the optional form field for indicating extensions. Most of the time, the field is empty or just plain wrong (i.e., almost all XML-based formats suggest that the filename extension is .xml, in spite of the fact that the only reason to supply an extension is so that all files of that extension can be mapped to that specific type).
And, perhaps the most annoying thing of all: the index files are obviously being generated from some other data source that is not part of the public registry.
Normally, what I am left with is a semi-manual procedure. I keep a mirror of the registry files on my laptop and, each time I need to do an update, I pull down a new mirror and run a diff between the old and new index files. I then manually look-up the registry template for file extensions or, if that fails, do a web search for what the deployed software already does. I then do a larger Web search for documentation that various companies have published about their unregistered file types, since I’ve given up on the idea that companies like Adobe, Microsoft, and Sun will ever register their own types before deploying some half-baked experimental names that we are stuck with forever due to backwards-compatibility concerns.
Unfortunately, yesterday I messed up that normal procedure. I forgot that I had started to do the update a month ago by pulling down a new mirror, but hadn’t made the changes yet. So I blew away my last-update-point before doing the diff.
After reliving all of the above steps, I ended up with a new semi-manual procedure:
wget -m ftp://ftp.iana.org/assignments/media-types/ cd ftp.iana.org/assignments/media-types foreach typ (`find * -type d -print`) links -dump $typ/index.html | \ perl -p -e "s|^\s+|$typ/|;" >> mtypes.txt end # manually edit mtypes.txt to remove the garbage lines foreach typ (`cut -d ' ' mtypes.txt`) grep -q -i -F "$typ" mime.types || echo $typ end
That gave me a list of new registered types that were not already present in mime.types. I still had to go through the list manually, add each type to its location within mime.types, and search for its corresponding file extension within the registry templates. As usual, most of the types either had no file extension (typical for types that are only expected to be used within message envelopes) or non-unique extensions that can’t be added to the configuration file because they would override some other (more common) type.
Please, IANA folks, fix your registries so that they can be read by automated processes. Do not tell me that I have to write an RFC to specify how you store the registry files. The existing mess was not determined by an RFC, so you are free to fix it without a new RFC. If you have software generating the current registry, then I will be more than happy to fix it for you if you provide me with the source code. At the very least, include a text/csv export of whatever database you are using to construct the awful index files within the current registry.
Why am I bothering with all this? Because media types are the only means we have for an HTTP sender to express the intent for processing a given message payload. While some people have claimed that recipients should sniff the data format for type information, the fact is that all data formats correspond to multiple media types. Sniffing a media type is therefore inherently impossible: at best, it can indicate when a data format does not match the indicated media type; at worst, it breaks correct configurations and creates security holes. In any case, sniffing cannot determine the sender’s intent.
The intent can only be expressed by sending the right Content-Type for a given resource. The resource owner needs to configure their resource correctly. Even though Apache provides at least five different ways to set the media type, most authors still rely on the installed file extension mappings for representations that are not dynamically-generated. Hence, most will rely on whatever mime.types file has been installed by their webmaster, even if it hasn’t been updated in ten years.
How old is your mime.types file?