open source


On the same day that Liam was born, I received news that one of my two papers published at the ICSE 2000 conference had been given the International Conference on Software Engineering’s Most Influential Paper Award for its impact on software engineering research over the past decade. The paper, A case study of open source software development: the Apache server, is co-authored by Audris Mockus, myself, and James Herbsleb. The MIP is an important award within the academic world; my thanks to the award committee and congrats to Audris and Jim. I wish I could have been there in South Africa for the presentation. This year’s award is shared with a paper by Corbett et al. on Bandera.

Interestingly, my other paper in ICSE 2000 was the first conference paper about REST, co-authored with my adviser, Dick Taylor. That must have caused some debate within the awards committee. As I understand it, the MIP award is based on academic citations of the original paper and any follow-up publication in a journal. Since I encouraged people to read and cite my dissertation directly, rather than the ICSE paper’s summary or its corresponding journal version, I am not surprised that the REST paper is considered less influential. However, it does make me wonder what would have happened if I had never published my dissertation on the Web. Would that paper have been cited more, or would nobody know about REST? shrug. I like the way it turned out.

The next two International Conferences on Software Engineering will be held in Hawaii (ICSE 2011), with Dick as the general chair, and Zürich (ICSE 2012). That is some fine scheduling on the part of the conference organizers! Fortunately, I have a pretty good excuse to attend both.

One of the chores that I do for the Apache HTTP server project, every three months or so, is to slog through the IANA media type registry to see what new media types have been registered and add them to the mime.types configuration file. This is one of the few things I do that is almost all pain for little or no gain. It takes hours to do it right because IANA has gone out of their way to make the registry impossible to process automatically via simple scripts. I don’t even get the pleasure of “changing the world” in some meaningful way, since Apache doesn’t automatically update mime.types when installed over an existing configuration.

BTW, if you are responsible for an existing Apache installation, please copy the current mime.types configuration file and install it manually — your users will thank you later, or at least not gripe as much about unsupported media types.

IANA is a quaint off-shoot of the Internet Engineering Task Force that, much like the IETF, is still stuck in the 1980s. One would think that a task like “maintain a registry of all media types so that Internet software can communicate” would lead to something that is comprehensible by software. Instead, what IANA has provided is a collection of FTP directories containing a subset of the private registry templates, each in the original (random) submitted format, and nine separate inconsistently-formatted index.html files that actually contain the registered types.

The first thought that any Web developer has when they look at the registry is that it should be laid out as a resource space by type. That is, each directory under “media-types” would be a major type (e.g., application, text, etc.) and then each file within those directories would correspond to exactly one subtype (e.g., html, plain, csv, etc.). Such a design would be easy to process automatically and fits with the organization’s desire to serve everything via both FTP and HTTP. Sadly, that is not the case. Most of the private registrations have some sort of like-named file within the expected directory to contain its registration template, but the names do not always correspond exactly to the subtype and the contents are whatever random text was submitted (rather than some consistent format that could be extracted). What’s worse, however, is that the standardized types do not have any corresponding file; instead, the type’s entry in the index may have some sort of link to the RFC or external specification that defines that type.
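To make that concrete, here is a sketch of how trivial processing would be if the registry really were laid out as a resource space. The layout below is hypothetical — it is exactly what IANA does not provide:

```shell
# Build a tiny sample of the hypothetical one-file-per-subtype layout.
mkdir -p /tmp/mt-demo/text /tmp/mt-demo/application
touch /tmp/mt-demo/text/csv /tmp/mt-demo/text/html /tmp/mt-demo/application/pdf

# With that layout, enumerating every registered type is a directory walk:
( cd /tmp/mt-demo && find . -type f | sed 's|^\./||' | sort )
```

Each line of output is itself a complete media type name — no HTML parsing, no screen-scraping, no guessing.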

grumble

The second thought of any Web developer would be “oh, I’ll just have to process the index files to extract the media type fields.” Good luck. The HTML is not well-formed (even by HTML standards). It uses arbitrarily-created tables to contain the actual registry information. There is no consistency across the files in terms of the number of table columns, nor any column headers to indicate what they mean. There is no mark-up to distinguish the registry cells from other whitespace-arranging layout cells. And the registered types are occasionally wrapped in inconsistently-targeted anchors for links to the aforementioned template files.

grumble GRUMBLE

Okay, so the really stubborn Web developers think that maybe a browser can grok this tag soup and generate the table in some reasonably consistent fashion, which can then be screen-scraped to get the relevant information. Nope. It doesn’t even render the same on different browsers. In any case, the index files don’t contain the relevant information: the most important information (aside from the type name) is the unique filename extension(s) that are supposed to be used for files of that type. For that information, we have to follow the link to the registry template file, or RFC containing one or more template files, and look for the optional form field for indicating extensions. Most of the time, the field is empty or just plain wrong (e.g., almost all XML-based formats suggest that the filename extension is .xml, in spite of the fact that the only reason to supply an extension is so that all files of that extension can be mapped to that specific type).
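Why “.xml” for everyone is useless can be shown with a two-line example (the file below is a made-up fragment, not the real mime.types): an extension map can only send a given extension to one type, so every additional XML-based type that claims it is simply shadowed.

```shell
# Two registered types both claiming the "xml" extension (hypothetical fragment).
cat > /tmp/mime-demo.types <<'EOF'
application/xml xml
application/mathml+xml xml
EOF

# A server can only pick one winner per extension; the rest are ambiguous:
awk '$2 == "xml" { print $1 }' /tmp/mime-demo.types
```

Two candidate types, one extension: whichever entry the server honors, the other type can never be selected by extension mapping.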

sigh

And, perhaps the most annoying thing of all: the index files are obviously being generated from some other data source that is not part of the public registry.

Normally, what I am left with is a semi-manual procedure. I keep a mirror of the registry files on my laptop and, each time I need to do an update, I pull down a new mirror and run a diff between the old and new index files. I then manually look up the registry template for file extensions or, if that fails, do a Web search for what the deployed software already does. I then do a larger Web search for documentation that various companies have published about their unregistered file types, since I’ve given up on the idea that companies like Adobe, Microsoft, and Sun will ever register their own types before deploying some half-baked experimental names that we are stuck with forever due to backwards-compatibility concerns.

Unfortunately, yesterday I messed up that normal procedure. I forgot that I had started to do the update a month ago by pulling down a new mirror, but hadn’t made the changes yet. So I blew away my last-update-point before doing the diff.

groan

After reliving all of the above steps, I ended up with a new semi-manual procedure:

wget -m ftp://ftp.iana.org/assignments/media-types/
cd ftp.iana.org/assignments/media-types
foreach typ (`find * -type d -print`)
   links -dump $typ/index.html | \
      perl -p -e "s|^\s+|$typ/|;" >> mtypes.txt
end
# manually edit mtypes.txt to remove the garbage lines
foreach typ (`cut -d ' ' -f 1 mtypes.txt`)
   grep -q -i -F "$typ" mime.types || echo $typ
end

That gave me a list of new registered types that were not already present in mime.types. I still had to go through the list manually, add each type to its location within mime.types, and search for its corresponding file extension within the registry templates. As usual, most of the types either had no file extension (typical for types that are only expected to be used within message envelopes) or non-unique extensions that can’t be added to the configuration file because they would override some other (more common) type.

Please, IANA folks, fix your registries so that they can be read by automated processes. Do not tell me that I have to write an RFC to specify how you store the registry files. The existing mess was not determined by an RFC, so you are free to fix it without a new RFC. If you have software generating the current registry, then I will be more than happy to fix it for you if you provide me with the source code. At the very least, include a text/csv export of whatever database you are using to construct the awful index files within the current registry.
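For the record, even a minimal CSV export would be enough. Assuming a hypothetical two-column file of type,extensions (this format is my invention, not anything IANA publishes), the conversion to mime.types syntax is a one-liner:

```shell
# Hypothetical CSV export: "type,extensions" (extensions space-separated).
cat > /tmp/registry.csv <<'EOF'
text/csv,csv
text/html,html htm
application/pdf,pdf
EOF

# Emit mime.types-style lines, skipping any entry with no extension:
awk -F, 'NF == 2 && $2 != "" { print $1 "\t" $2 }' /tmp/registry.csv
```

That would replace hours of diffing HTML tag soup with a cron job.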

Why am I bothering with all this? Because media types are the only means we have for an HTTP sender to express the intent for processing a given message payload. While some people have claimed that recipients should sniff the data format for type information, the fact is that all data formats correspond to multiple media types. Sniffing a media type is therefore inherently impossible: at best, it can indicate when a data format does not match the indicated media type; at worst, it breaks correct configurations and creates security holes. In any case, sniffing cannot determine the sender’s intent.

The intent can only be expressed by sending the right Content-Type for a given resource. The resource owner needs to configure their resource correctly. Even though Apache provides at least five different ways to set the media type, most authors still rely on the installed file extension mappings for representations that are not dynamically-generated. Hence, most will rely on whatever mime.types file has been installed by their webmaster, even if it hasn’t been updated in ten years.
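For reference, a few of those mechanisms in httpd configuration (the directives are real mod_mime/core directives; the values shown are illustrative only):

```apache
TypesConfig conf/mime.types        # the global extension map discussed above
AddType image/svg+xml .svg         # extension mapping added in the main config
<Directory "/srv/www/blobs">
    ForceType application/octet-stream   # override the type for everything here
</Directory>
```

AddType also works per-directory in .htaccess, which is usually the right tool for authors who cannot touch the server config.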

How old is your mime.types file?

Last week, I resigned from membership in OpenSolaris shortly after midnight on February 14th (Valentine’s Day). I won’t attempt to explain all of the reasons here. What I find more interesting at the moment is the propagation delay in the news.

For traditional media, an event that happened last week would be old news by now. After my message, I received a bunch of warm regrets from folks I know in the community, a collection of thanks from people who were just glad someone did something, and no formal reaction from anyone. No news coverage, no public apologies, and not even the sense that anything would change. There were a few blog mentions from people outside the community (Emily Ratliff on the 14th, rippling to Michael Dolan on the 15th, which in turn rippled to Jim Grisanzio on the 17th), but nobody asked me for further comments. That was last week.

This week, the traditional side of non-traditional media reporting went back to work (on Tuesday, actually, due to the three-day weekend for Presidents’ Day that some companies observe). It started on the 20th with a 9am email message from IDG asking for comment on my resignation, which the reporter had discovered while “looking through some Sun blogs.” That’s six full days after the event (five days after traditional media would have ignored the story as old news). The problem with spin control is that sometimes you spin up a larger storm than the one being controlled.

Unfortunately, I was stuck in the lobby of an Acura dealership waiting for my car, reading the 500 or so email messages I had downloaded just before 9am, and did not get back to him in time, not that I wanted to add anything (I didn’t). Naturally, Sun’s very competent PR team is never caught without something to say, and Terri’s response was polite with only the tiniest amount of spin. [“consultant” … WTF? I get paid for consulting. The only thing Sun provided me is travel expenses for two face-to-face meetings and one keynote talk in Berlin. Day paid all of my other costs. There is a huge difference between being a member of the community (an advisor) and being a consultant, even if the original invitation to join the OpenSolaris CAB came from Sun. Sun certainly didn’t refer to me as a consultant when they bragged about that in the press. If Sun wants to call me a consultant, maybe Day should send a bill for my hourly rate.]

To close the door on that article, IDG turned to an “industry analyst” from Redmonk. Stephen O’Grady, who does happen to be one of Sun’s consultants (in a refreshingly open way), tossed a little water on the embers of my resignation. He later rippled on the larger conversation as well, in a fairly balanced piece. Entirely accurate points on their own, yet entirely missing my point. Sun is running into trouble because it has a problem with honesty and with ethical behavior within a community setting, and you can’t blame that on anyone else (especially not the critics). What is the point of creating the OpenSolaris Community governance if the community isn’t even allowed to decide what is called OpenSolaris? This isn’t an abstract discussion of trademarks. It is the fundamental basis for making technical decisions of any kind for the project.

Sun made a commitment to open development with the OpenSolaris Charter. Sun does have the legal right to control its own trademarks, which is precisely what it has been doing for the past three years by reaping the positive press about open development. Sun is fully capable of changing that decision by amending or dissolving the charter, but instead has chosen to ignore the governance model while at the same time claiming the open development mantra as its own. I cannot support that. Sun was using my name as “proof” that they were listening to the open development community, so I had to go.

The issue is really quite simple: Sun wants to have your cake and eat it too.

In contrast, the MySQL model is open source, not open development. I respect that because they are honest about it, not just because the result is published as open source. There is even a MySQL user community, providing input to the company without any illusion that they are helping to develop the main product.

I am not a free software zealot. For me, open source is a business decision, not a religion. In my opinion, an open development model results in better source code, but that’s just one of many aspects that can improve or reduce software product quality. For example, Day Software developers participate in open development at Apache for almost all of our infrastructure software, which we then use as components within our not-entirely-open-source content management products. We learn from that open development experience, every day, and it influences all of the products that we develop. Each of our developers is a better developer because they participate in open development, and that in turn has encouraged more great open source developers to work for Day. It isn’t just about the code.

In any case, the IDG article showed up in the evening of the 20th, in an InfoWorld blog, and then spread from there to several outlets. Boom! A rock has been heaved into the pond, and more ripples go forth. I’ve had four Google alerts already today and it’s barely past noon. We’ve got bloggers who are blogging about blogs that comment on other blogs that discuss a blog that referenced an email message that I wrote last week. Now, if I could just get them to comment on what I wrote, instead of just commenting on the commentary… sigh.

Oh, right, I have a blog now. I’ll just toss another pebble in …

[Update 1: fixed the spelling of MySQL, pointed to their user community, and explained a bit more about open development at Day.]

[Update 2: clarified that Stephen’s piece is separate from the IDG quote and removed an assumption of how IDG picks its sources.]