Wed 13 May 2009
Wrangling mime.types
Posted by Roy T. Fielding under open source, standards, web architecture
[13] Comments
One of the chores that I do for the Apache HTTP server project, every three months or so, is to slog through the IANA media type registry to see what new media types have been registered and add them to the mime.types configuration file. This is one of the few things I do that is almost all pain for little or no gain. It takes hours to do it right because IANA has gone out of their way to make the registry impossible to process automatically via simple scripts. I don’t even get the pleasure of “changing the world” in some meaningful way, since Apache doesn’t update mime.types automatically when installed to an existing configuration.
BTW, if you are responsible for an existing Apache installation, please copy the current mime.types configuration file and install it manually. Your users will not gripe as much about unsupported media types.
IANA is a quaint off-shoot of the Internet Engineering Task Force that, much like the IETF, is still stuck in the 1980s. One would think that a task like “maintain a registry of all media types” so that Internet software can communicate would lead to something that is comprehensible by software. Instead, what IANA has provided is a collection of FTP directories containing a subset of private registry templates, each in the original (random) submitted format, and nine separate inconsistently formatted index.html files that actually contain the registered types.
The first thought that any Web developer has when they look at the registry is that it should be laid out as a resource space by type. That is, each directory under “media-types” would be a major type (e.g., application, text, etc.) and then each file within those directories would correspond to exactly one subtype (e.g., html, plain, csv, etc.). Such a design would be easy to process automatically and fits with the organization’s desire to serve everything via both FTP and HTTP. Sadly, that is not the case. Most of the private registrations have some sort of like-named file within the expected directory to contain its registration template, but the names do not always correspond exactly to the subtype and the contents are whatever random text was submitted (rather than some consistent format that could be extracted). What’s worse, however, is that the standardized types do not have any corresponding file; instead, the type’s entry in the index may have some sort of link to the RFC or external specification that defines that type.
grumble
The second thought of any Web developer would be “oh, I’ll just have to process the index files to extract the media type fields.” Good luck. The HTML is not well-formed (even by HTML standards). It uses arbitrarily-created tables to contain the actual registry information. There is no consistency across the files in terms of the number of table columns, nor any column headers to indicate what they mean. There is no mark-up to distinguish the registry cells from other whitespace-arranging layout cells. And the registered types are occasionally wrapped in inconsistently-targeted anchors for links to the aforementioned template files.
grumble GRUMBLE
Okay, so the really stubborn Web developers think that maybe a browser can grok this tag soup and generate the table in some reasonably consistent fashion, which can then be screen-scraped to get the relevant information. Nope. It doesn’t even render the same on different browsers. In any case, the index files don’t contain the relevant information: the most important information (aside from the type name) is the unique filename extension(s) that are supposed to be used for files of that type. For that information, we have to follow the link to the registry template file, or RFC containing one or more template files, and look for the optional form field for indicating extensions. Most of the time, the field is empty or just plain wrong (e.g., almost all XML-based formats suggest that the filename extension is .xml, in spite of the fact that the only reason to supply an extension is so that all files of that extension can be mapped to that specific type).
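To make the extension hunt concrete, here is a minimal Python sketch that pulls the extension field out of a registration template and throws away answers that are too generic to be useful. It assumes the field is labeled “File extension(s):”; templates that word it differently would need their own patterns. The `GENERIC` and `PLACEHOLDERS` sets and the function name are hypothetical choices for this example, not part of any registry tooling.

```python
import re

# Extensions too generic to map to a single type (an assumed, partial list).
GENERIC = {"xml", "jar"}
# Words that show up in the field but are not extensions (also assumed).
PLACEHOLDERS = {"none", "and", "or"}

# Capture the rest of the line after a "File extension(s):" style label.
FIELD = re.compile(
    r"^[ \t]*File extension(?:\(s\)|s)?[ \t]*:[ \t]*(.*)$",
    re.IGNORECASE | re.MULTILINE,
)

def usable_extensions(template_text):
    """Return specific extensions named in a template, lowercased, without dots."""
    match = FIELD.search(template_text)
    if match is None:
        return []
    found = []
    for token in re.split(r"[,\s]+", match.group(1)):
        ext = token.lstrip("*.").lower()
        # Keep only plain alphanumeric tokens that are neither generic nor filler.
        if re.fullmatch(r"[a-z0-9]+", ext) and ext not in GENERIC | PLACEHOLDERS:
            found.append(ext)
    return found
```

An empty result still means a manual search for what deployed software actually uses; the sketch only filters out the obviously unusable cases.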
sigh
And, perhaps the most annoying thing of all: the index files are obviously being generated from some other data source that is not part of the public registry.
Normally, what I am left with is a semi-manual procedure. I keep a mirror of the registry files on my laptop and, each time I need to do an update, I pull down a new mirror and run a diff between the old and new index files. I then manually look up the registry template for file extensions or, if that fails, do a web search for what the deployed software already does. I then do a larger Web search for documentation that various companies have published about their unregistered file types, since I’ve given up on the idea that companies like Adobe, Microsoft, and Sun will ever register their own types before deploying some half-baked experimental names that we are stuck with forever due to backwards-compatibility concerns.
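The diff step of that procedure can be sketched in a few lines of Python. This is only an illustration, assuming the old and new index files have already been dumped to plain text (for example with a text-mode browser); the function name is made up for the example.

```python
import difflib

def added_lines(old_text, new_text):
    """Return the lines that appear in the new index dump but not in the old one."""
    diff = difflib.ndiff(old_text.splitlines(), new_text.splitlines())
    # ndiff marks lines unique to the second sequence with a "+ " prefix.
    return [line[2:].strip() for line in diff if line.startswith("+ ")]
```

Because the index files are inconsistently formatted, the added lines still need a human eye: a reflowed table can show up as “new” text even when no type was registered.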
Unfortunately, yesterday I messed up that normal procedure. I forgot that I had started to do the update a month ago by pulling down a new mirror, but hadn’t made the changes yet. So I blew away my last-update-point before doing the diff.
groan
After reliving all of the above steps, I ended up with a new semi-manual procedure:
wget -m ftp://ftp.iana.org/assignments/media-types/
cd ftp.iana.org/assignments/media-types
foreach typ (`find * -type d -print`)
  links -dump $typ/index.html | \
    perl -p -e "s|^\s+|$typ/|;" >> mtypes.txt
end
# manually edit mtypes.txt to remove the garbage lines
foreach typ (`cut -d ' ' -f1 mtypes.txt`)
  grep -q -i -F "$typ" mime.types || echo $typ
end
That gave me a list of new registered types that were not already present in mime.types. I still had to go through the list manually, add each type to its location within mime.types, and search for its corresponding file extension within the registry templates. As usual, most of the types either had no file extension (typical for types that are only expected to be used within message envelopes) or non-unique extensions that can’t be added to the configuration file because they would override some other (more common) type.
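For readers who would rather not use csh, the final comparison can be approximated in Python. This is a sketch under stated assumptions, not a port of the script above: it matches whole type names exactly (the grep does a case-insensitive substring match), and it assumes mime.types lists each type as the first field on a line, with not-yet-mapped types written as `# type/subtype`. Both function names are hypothetical.

```python
def known_types(mime_types_text):
    """Collect every type named in a mime.types file, commented-out entries included."""
    known = set()
    for line in mime_types_text.splitlines():
        # Strip a leading "#" so that "# type/subtype" entries count as present.
        fields = line.lstrip("# \t").split()
        if fields and "/" in fields[0]:
            known.add(fields[0].lower())
    return known

def missing_types(registered, mime_types_text):
    """Return the registered types that mime.types does not mention at all."""
    known = known_types(mime_types_text)
    return [t for t in registered if t.lower() not in known]
```

The output is only the starting point described above: each missing type still has to be placed in the file by hand and checked for a usable extension.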
Please, IANA folks, fix your registries so that they can be read by automated processes. Do not tell me that I have to write an RFC to specify how you store the registry files. The existing mess was not determined by an RFC, so you are free to fix it without a new RFC. If you have software generating the current registry, then I will be more than happy to fix it for you if you provide me with the source code. At the very least, include a text/csv export of whatever database you are using to construct the awful index files within the current registry.
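To show how little would be needed on the consuming side, here is a minimal Python sketch that reads such an export. The CSV layout is entirely hypothetical (columns named `type`, `subtype`, and `extensions`, with extensions separated by spaces); IANA publishes nothing of the kind in the registry described here.

```python
import csv
import io

def read_registry(csv_text):
    """Map "type/subtype" to its list of extensions from a hypothetical CSV export."""
    registry = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        name = f"{row['type']}/{row['subtype']}".lower()
        # An empty extensions column yields an empty list.
        registry[name] = row["extensions"].split()
    return registry
```

With a consistent export like this, the whole update would reduce to reading one file and comparing it against mime.types.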
Why am I bothering with all this? Because media types are the only means we have for an HTTP sender to express the intent for processing a given message payload. While some people have claimed that recipients should sniff the data format for type information, the fact is that all data formats correspond to multiple media types. Sniffing a media type is therefore inherently impossible: at best, it can indicate when a data format does not match the indicated media type; at worst, it breaks correct configurations and creates security holes. In any case, sniffing cannot determine the sender’s intent.
The intent can only be expressed by sending the right Content-Type for a given resource. The resource owner needs to configure their resource correctly. Even though Apache provides at least five different ways to set the media type, most authors still rely on the installed file extension mappings for representations that are not dynamically-generated. Hence, most will rely on whatever mime.types file has been installed by their webmaster, even if it hasn’t been updated in ten years.
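The extension mapping that most authors end up relying on can be illustrated with a short Python sketch. It is a simplification rather than a description of how Apache is implemented: it assumes one `type/subtype ext ext ...` entry per line, assumes a later entry replaces an earlier one for the same extension, and looks only at the last extension of a filename. The function names are invented for the example.

```python
def extension_map(mime_types_text):
    """Build an extension -> media type table from mime.types style text."""
    mapping = {}
    for line in mime_types_text.splitlines():
        fields = line.split("#", 1)[0].split()  # ignore comments
        if len(fields) < 2:
            continue  # a type listed without extensions maps nothing
        media_type, extensions = fields[0], fields[1:]
        for ext in extensions:
            # Assumption: the last entry for an extension wins.
            mapping[ext.lower()] = media_type
    return mapping

def content_type_for(filename, mapping, default=None):
    """Look up a media type by the final extension of a filename."""
    _, dot, ext = filename.rpartition(".")
    return mapping.get(ext.lower(), default) if dot else default
```

The last-entry-wins assumption is what makes non-unique extensions a problem: adding a new type that claims .xml would silently change what every existing .xml file is served as.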
How old is your mime.types file?
and IANA, while you’re at it cleaning up the MIME type registry and turning everything into a data source that people can actually use: please create a feed of new MIME types, so that i can subscribe and whenever a new type is registered, i will simply see it showing up in my feed reader. producing such a feed would be trivial if some structured processing is used, and i would even volunteer to do it. i think such a feed would be very useful for anybody having to keep track of what’s going on on the Internet in terms of registered MIME types. thanks, IANA!
btw, the IETF has such a feed for recent RFCs at http://xml.resource.org/public/rfc/bibxml/index.rdf (ironically not using Atom but RSS), and it’s really useful.
“btw, the IETF has such a feed for recent RFCs at http://xml.resource.org/public/rfc/bibxml/index.rdf (ironically not using Atom but RSS), and it’s really useful.”
…but that’s not “the” IETF, but xml.resource.org. Not that I would be surprised if the IETF used RSS for a feed anyway :-)
Good post.
Yeah, it’s really embarrassing when such standards organizations can’t provide better services.
And you start to wonder whether they are from the stone age, with a website that looks like the early days of the Internet. Even the copyright notice on their media type page is not updated:
(c) 1999-2001 The Internet Corporation for Assigned Names and Numbers All rights reserved.
Note that IANA is currently converting registries into XML, see http://www.iana.org/reports/2008/xml-registry-launch.html.
That the media types registry isn’t converted yet probably is partly caused by the state it’s in :-)
I went looking how my distribution handles this (since it apparently doesn’t use the mime.types provided by Apache, but has its own package), and found the following email message:
On Fri Aug 31 10:51:55 2007, someone wrote:
> I have noticed that there doesn’t seem to be any way to extract all
> the current mimetypes and their extensions from the iana website.
> This is particularly difficult for Linux distros in particular when
> they want to update their mime.types file because they have to
> manually sift through the iana website to see what has changed. If
> you provided a file that dynamically displayed all mime types and
> their file extensions from your database, it would smooth over the
> update process for many people. If you currently already have such a
> file, could you send me the link to it?
I’m afraid we don’t provide such a file. We are, however, currently
working on making the IANA-maintained registries available in XML
format.
Best regards,
Amanda Baber
IANA
Great example how RDF would be very helpful and that it’s doomed.
This is a timely post. I just finished an HTML scraper for the IANA MIME indexes that generates an XML Schema file with a single simpleType that can be used in other XML Schema if you need an enumerated list of MIME types: http://gita.grainger.uiuc.edu/imt/
It would definitely be a lot easier if the IANA provided an XML-based registry of some sort.
Thanks, Roy!
I couldn’t have said it better. I recently ran into comparable troubles when executing an action [1] for the W3C Media Fragments group (though with a less elegant solution than yours; I must admit I didn’t even know they have an FTP version of it).
An absolute +1 for dret’s feed proposal (Atom, yes please;)
Once this is in place I volunteer to take care of an RDF/linked data version of it ;)
I’d even go a step further and contemplate bugging the TAG with it. If enough people and institutions shout out, it might (maybe) motivate IANA to react.
Any other proposals to ‘motivate’ them?
Cheers,
Michael
[1] https://www.w3.org/2008/WebVideo/Fragments/wiki/MediaTypeReview
Wouldn’t it be easier to “just” publish it somewhere else in a concerted community effort, making it publicly available and handing it over to IANA as soon as they see that it would actually make their life easier?
I’m surprised about the .xml extensions. That’s probably the most common problem with XML registrations that I deal with as a reviewer on ietf-types, but I think I’ve done a pretty good job (despite having to fight another reviewer on it!) keeping the xml extension out of registrations for at least the past 3/4 years. If you remember any, or spot any more that are using it, please let me know. Thanks.
Hi Mark,
I think the most recent .xml ones were inside RFCs, so I suspect you didn’t get a chance to review them. Of course, the problem isn’t just .xml — the same occurs with .jar (OSGI bundles) and a few other generic extensions.
We do receive some registrations embedded within would-be RFCs. I hope the IESG isn’t forgetting to verify that all new registrations have been reviewed. I’ll look into it.
In addition to a big +1 on the grumbling, I want to thank you for all your efforts in maintaining this key file.
I have regularly been pleasantly surprised that the new-ish types I have had occasion to care about have always been present as part of updated Apache distributions.