Archive for 2009

One of the chores that I do for the Apache HTTP server project, every three months or so, is to slog through the IANA media type registry to see what new media types have been registered and add them to the mime.types configuration file. This is one of the few things I do that is almost all pain for little or no gain. It takes hours to do it right because IANA has gone out of their way to make the registry impossible to process automatically via simple scripts. I don’t even get the pleasure of “changing the world” in some meaningful way, since Apache doesn’t update mime.types automatically when installed to an existing configuration.

BTW, if you are responsible for an existing Apache installation, please copy the current mime.types configuration file and install it manually — your users will thank you later not gripe as much about unsupported media types.

IANA is a quaint off-shoot of the Internet Engineering Taskforce that, much like the IETF, is still stuck in the 1980s. One would think that, given a task like “maintain a registry of all media types” so that Internet software can communicate, would lead to something that is comprehensible by software. Instead, what IANA has provided is a collection of FTP directories containing a subset of private registry templates, each in the original (random) submitted format, and nine separate inconsistently-formated index.html files that actually contain the registered types.

The first thought that any Web developer has when they look at the registry is that it should be laid out as a resource space by type. That is, each directory under “media-types” would be a major type (e.g., application, text, etc.) and then each file within those directories would correspond to exactly one subtype (e.g., html, plain, csv, etc.). Such a design would be easy to process automatically and fits with the organization’s desire to serve everything via both FTP and HTTP. Sadly, that is not the case. Most of the private registrations have some sort of like-named file within the expected directory to contain its registration template, but the names do not always correspond exactly to the subtype and the contents are whatever random text was submitted (rather than some consistent format that could be extracted). What’s worse, however, is that the standardized types do not have any corresponding file; instead, the type’s entry in the index may have some sort of link to the RFC or external specification that defines that type.

grumble

The second thought of any Web developer would be “oh, I’ll just have to process the index files to extract the media type fields.” Good luck. The HTML is not well-formed (even by HTML standards). It uses arbitrarily-created tables to contain the actual registry information. There is no consistency across the files in terms of the number of table columns, nor any column headers to indicate what they mean. There is no mark-up to distinguish the registry cells from other whitespace-arranging layout cells. And the registered types are occasionally wrapped in inconsistently-targeted anchors for links to the aforementioned template files.

grumble GRUMBLE

Okay, so the really stubborn Web developers think that maybe a browser can grok this tag soup and generate the table in some reasonably consistent fashion, which can then be screen-scraped to get the relevant information. Nope. It doesn’t even render the same on different browsers. In any case, the index files don’t contain the relevant information: the most important information (aside from the type name) is the unique filename extension(s) that are supposed to be used for files of that type. For that information, we have to follow the link to the registry template file, or RFC containing one or more template files, and look for the optional form field for indicating extensions. Most of the time, the field is empty or just plain wrong (i.e., almost all XML-based formats suggest that the filename extension is .xml, in spite of the fact that the only reason to supply an extension is so that all files of that extension can be mapped to that specific type).

sigh

And, perhaps the most annoying thing of all: the index files are obviously being generated from some other data source that is not part of the public registry.

Normally, what I am left with is a semi-manual procedure. I keep a mirror of the registry files on my laptop and, each time I need to do an update, I pull down a new mirror and run a diff between the old and new index files. I then manually look-up the registry template for file extensions or, if that fails, do a web search for what the deployed software already does. I then do a larger Web search for documentation that various companies have published about their unregistered file types, since I’ve given up on the idea that companies like Adobe, Microsoft, and Sun will ever register their own types before deploying some half-baked experimental names that we are stuck with forever due to backwards-compatibility concerns.

Unfortunately, yesterday I messed up that normal procedure. I forgot that I had started to do the update a month ago by pulling down a new mirror, but hadn’t made the changes yet. So I blew away my last-update-point before doing the diff.

groan

After reliving all of the above steps, I ended up with a new semi-manual procedure:

wget -m ftp://ftp.iana.org/assignments/media-types/
cd ftp.iana.org/assignments/media-types
foreach typ (`find * -type d -print`)
   links -dump $typ/index.html | \
      perl -p -e "s|^\s+|$typ/|;" >> mtypes.txt
end
# manually edit mtypes.txt to remove the garbage lines
foreach typ (`cut -d ' ' mtypes.txt`)
   grep -q -i -F "$typ" mime.types || echo $typ
end

That gave me a list of new registered types that were not already present in mime.types. I still had to go through the list manually, add each type to its location within mime.types, and search for its corresponding file extension within the registry templates. As usual, most of the types either had no file extension (typical for types that are only expected to be used within message envelopes) or non-unique extensions that can’t be added to the configuration file because they would override some other (more common) type.

Please, IANA folks, fix your registries so that they can be read by automated processes. Do not tell me that I have to write an RFC to specify how you store the registry files. The existing mess was not determined by an RFC, so you are free to fix it without a new RFC. If you have software generating the current registry, then I will be more than happy to fix it for you if you provide me with the source code. At the very least, include a text/csv export of whatever database you are using to construct the awful index files within the current registry.

Why am I bothering with all this? Because media types are the only means we have for an HTTP sender to express the intent for processing a given message payload. While some people have claimed that recipients should sniff the data format for type information, the fact is that all data formats correspond to multiple media types. Sniffing a media type is therefore inherently impossible: at best, it can indicate when a data format does not match the indicated media type; at worst, it breaks correct configurations and creates security holes. In any case, sniffing cannot determine the sender’s intent.

The intent can only be expressed by sending the right Content-Type for a given resource. The resource owner needs to configure their resource correctly. Even though Apache provides at least five different ways to set the media type, most authors still rely on the installed file extension mappings for representations that are not dynamically-generated. Hence, most will rely on whatever mime.types file has been installed by their webmaster, even if it hasn’t been updated in ten years.

How old is your mime.types file?

Tim Bray’s article on RESTful Casuistry revisits an odd meme in the REST debates that I’ve been meaning to discredit for a while.

Some people think that REST suggests not to use POST for updates.  Search my dissertation and you won’t find any mention of CRUD or POST. The only mention of PUT is in regard to HTTP’s lack of write-back caching.  The main reason for my lack of specificity is because the methods defined by HTTP are part of the Web’s architecture definition, not the REST architectural style. Specific method definitions (aside from the retrieval:resource duality of GET) simply don’t matter to the REST architectural style, so it is difficult to have a style discussion about them. The only thing REST requires of methods is that they be uniformly defined for all resources (i.e., so that intermediaries don’t have to know the resource type in order to understand the meaning of the request). As long as the method is being used according to its own definition, REST doesn’t have much to say about it.

For example, it isn’t RESTful to use GET to perform unsafe operations because that would violate the definition of the GET method in HTTP, which would in turn mislead intermediaries and spiders.  It isn’t RESTful to use POST for information retrieval when that information corresponds to a potential resource, because that usage prevents safe reusability and the network-effect of having a URI. But why shouldn’t you use POST to perform an update? Hypertext can tell the client which method to use when the action being taken is unsafe. PUT is necessary when there is no hypertext telling the client what to do, but lacking hypertext isn’t particularly RESTful.

POST only becomes an issue when it is used in a situation for which some other method is ideally suited: e.g., retrieval of information that should be a representation of some resource (GET), complete replacement of a representation (PUT), or any of the other standardized methods that tell intermediaries something more valuable than “this may change something.” The other methods are more valuable to intermediaries because they say something about how failures can be automatically handled and how intermediate caches can optimize their behavior. POST does not have those characteristics, but that doesn’t mean we can live without it. POST serves many useful purposes in HTTP, including the general purpose of “this action isn’t worth standardizing.”

I think the anti-POST meme got started because of all the arguments against tunneling other protocols via HTTP’s POST (e.g., SOAP, RSS, IPP, etc.). Somewhere along the line people started equating the REST arguments of “don’t violate HTTP’s method definitions” and “always use GET for retrieval because that forces the resource to have a URI” with the paper tiger of “POST is bad.”

Please, let’s move on. We don’t need to use PUT for every state change in HTTP. REST has never said that we should.

What matters is that every important resource have a URI, therein allowing representations of that resource to be obtained using GET.  If the deployment state is an important resource, then I would expect it to have states for undeployed, deployment requested, deployed, and undeployment requested. The advantage of those states is that other clients looking at the resource at the same time would be properly informed, which is just good design for UI feedback. However, I doubt that Tim’s application would consider that an important resource on its own, since the deployment state in isolation (separate from the thing being deployed) is not a very interesting or reusable resource.

Personally, I would just use POST for that button. The API can compensate for the use of POST by responding with the statement that the client should refresh its representation of the larger resource state. In other words, I would return a 303 response that redirected back to the VM status, so that the client would know that the state has changed.