systems engineering


As you may have noted, my last post seems to have hit a nerve in various communities, particularly with those who are convinced that REST means HTTP (because, well, that’s what they think it means) and that any attempt by me to describe REST with precision is just another elitist philosophical effort that won’t apply to those practical web developers who are just trying to get their javascript to work on more than one browser.

Apparently, I use words with too many syllables when comparing design trade-offs for network-based applications. I use too many general concepts, like hypertext, to describe REST instead of sticking to a concrete example, like HTML. I am supposed to tell them what they need to do, not how to think of the problem space. A few people even complained that my dissertation is too hard to read. Imagine that!

My dissertation is written to a certain audience: experts in the fields of software engineering and network protocol design. These are folks with long industry careers or graduate degrees, usually Ph.D.s who have spent decades learning about their field, identifying an untrodden path to pursue advanced research, and eventually becoming so familiar with that path that they are (hopefully) able to learn something that nobody else in the world has revealed before. In the process, they become specialists, because it is only through specialization that a human being can become sufficiently knowledgeable to find what has yet to be known in a field as large as computing. It is only by becoming specialists that we can understand each other when we explain what we have learned, and thereby grow the field of knowledge over time.

James Burke described the problem of specialization in the final episode of his first series on BBC, Connections. If you are having a hard time following my work, then I strongly suggest you go find a copy of the old episodes somewhere and watch them, bearing in mind that the series was first broadcast in 1978, when the folks who brought you the Web were at their most impressionable early age. Mr. Burke would appreciate that connection, I think. What he said deserves a bit of transcription on my part:

The other, general thing to be said about how change comes about through innovation, and especially about the rate in which that change occurs, is that: the easier you can communicate, the faster change happens.

I mean, if you look back at the past, in that light, you’ll see that there was a great surge in invention in the European Middle Ages, as soon as they had reestablished safe communication between their cities, after the so-called dark ages. There was another one, in the sixteenth century, when these [books] gave scientists and engineers the opportunity to share their knowledge with each other, thanks to a German goldsmith called Johannes Gutenberg who’d invented printing back in the 1450s. And then, when that developed out there, telecommunications, oh a hundred-odd years ago, then things really started to move.

It was with that second surge, in the sixteenth century, that we moved into the era of specialization: people writing about technical subjects in a way that only other scientists would understand. And, as their knowledge grew, so did their need for specialist words to describe that knowledge. If there is a gulf today, between the man-in-the-street and the scientists and the technologists who change his world every day, that’s where it comes from.

It was inevitable. Everyday language was inadequate. I mean, you’re a doctor. How do you operate on somebody when the best description of his condition you have is “a funny feeling in the stomach?” The medical profession talks mumbo jumbo because it needs to be exact. Or would you rather be dead?

And that’s only a very obvious example. Trouble is, when I’m being cured of something, I don’t care if I don’t understand. But what happens when I do care? When, say, the people we vote for are making decisions that affect our lives deeply, ’cause that is, after all, when we get our say, isn’t it? When we vote? But say the issue relates to a bit of science and technology we don’t understand? Like, how safe is a reactor somebody wants to build? Or, should we make supersonic airplanes? Then, in the absence of knowledge, what is there to appeal to except our emotions? And then the issue becomes “national prestige,” or “good for jobs,” or “defense of our way of life,” or something. And suddenly you’re not voting for the real issue at all.

[James Burke, “Yesterday, Tomorrow and You” (19:02-21:30), 1978]

Still timely after all these years, isn’t it?

As scientists go, I am a generalist: the topics that I care about range from international politics to physics, with most applications of computing somewhere in between. However, when I send out a message to API designers, I expect the audience to be reasonably competent in the field. I have to talk to them as a specialist because I want them to understand, as specialists themselves, exactly what I am trying to convey and not some second-order derivatives. Most of the terms that I use should already be familiar to them (and thus it is a waste of everyone’s time for me to define them). When there is a concern about a particular term, like hypertext, it can be resolved by pointing out the relevant definitions that I use as an expert in the field.

I don’t try to tell them exactly what to do because, quite frankly, I don’t have anywhere near enough knowledge of their specific context to make such a decision. What I can do is tell them what isn’t REST or what doesn’t fit my definitions, because that is something about which I am guaranteed to know more than anyone else on this planet. That’s what happens when you complete a dissertation on a topic.

So, when you find it hard to understand what I have written, please don’t think of it as talking above your head or just too philosophical to be worth your time; I am writing this way because I think the subject deserves a particular form of precision. Instead, take the time to look up the terms. Think of it as an opportunity to learn something new, not because I said so, but because it will do you some personal good to better understand the depth of our field. Not just the details of what I wrote, but the background knowledge implied by all the strange terms that I used to write it.

Others will try to decipher what I have written in ways that are more direct or applicable to some practical concern of today. I probably won’t, because I am too busy grappling with the next topic, preparing for a conference, writing another standard, traveling to some distant place, or just doing the little things that let me feel I have earned my paycheck. I am in a permanent state of not enough time. Fortunately, there are more than enough people who are specialist enough to understand what I have written (even when they disagree with it) and care enough about the subject to explain it to others in more concrete terms, provide consulting if you really need it, or just hang out and metablog. That’s a good thing, because it helps refine my knowledge of the field as well.

We are communicating really, really fast these days. Don’t pretend that you can keep up with this field while waiting for others to explain it to you.

I need to address that teaser I included in my last post on paper tigers and hidden dragons:

People need to understand that general-purpose PubSub is not a solution to scalability problems — it simply moves the problem somewhere else, and usually to a place that is inversely supported by the economics.

I am talking here about the economics of sustainable systems, which I consider to be part of System Dynamics and Systems Engineering. Software Architecture and System Dynamics are deeply intertwined, just as Software Engineering and Systems Engineering are deeply intertwined, but they are not the same thing.

Fortunately, this discussion of economics has nothing to do with the financial economy or the proposed bailout on Wall Street. At least, I don’t think it does, though event-based integration and publish-subscribe middleware are most popular within financial institutions (and rightly so, since it matches their natural data stream of buy/sell events and news notifications that drive financial investing/speculation).

Economists have a well-known (but often misunderstood) notion of the economies of scale, which can be summarized as the effect of increased production on the average cost of production (see also Wikipedia and Investopedia). For most types of business, the average cost of producing some item is expected to go down as the business expands. This is a natural consequence of both the spreading of fixed costs over many more customers and the ability of larger distributors to demand more competitive pricing from their upstream suppliers (e.g., Walmart).

If we look at such a business in terms of its system dynamics, we should see “load” on the system increase at a much lower rate than the increase in customers. In other words, the additional customers should not, on average, create more cost than they generate in revenue once the business passes the initial fixed infrastructure cost break-even point. If they do create more cost, then the business is not sustainable at larger scales. For many businesses, that’s just fine — most of the personal services economy is based on models with a small window between “enough customers to survive” and “too many customers to handle without degrading service.”
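
To make that arithmetic concrete, here is a toy sketch (every number in it is invented for illustration, not taken from any real business): average cost per customer falls as the fixed infrastructure cost is spread over more customers, and the model is only sustainable past break-even because each additional customer brings in more revenue than the marginal cost of serving them.

    # Toy model of economies of scale -- all numbers are invented for illustration.
    FIXED_COST = 100_000.0      # infrastructure cost, paid regardless of volume
    MARGINAL_COST = 2.0         # cost to serve one additional customer
    REVENUE_PER_CUSTOMER = 5.0  # what each customer pays

    def average_cost(customers: int) -> float:
        """Average cost per customer: fixed costs spread over the customer base."""
        return (FIXED_COST + MARGINAL_COST * customers) / customers

    def profit(customers: int) -> float:
        """Positive only once the fixed cost is covered and marginal revenue exceeds marginal cost."""
        return REVENUE_PER_CUSTOMER * customers - (FIXED_COST + MARGINAL_COST * customers)

    for n in (10_000, 33_334, 1_000_000):
        print(f"{n:>9,} customers: avg cost ${average_cost(n):.2f}, profit ${profit(n):,.0f}")
    # 10,000 customers sit below break-even; around 33,334 the fixed cost is
    # covered, and from there every additional customer adds $3 of margin.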

The same is true of software architectures for network-based applications. As the number of consumers increases, the per-consumer cost of the overall system must decrease in order to sustain the system. Relative economies of scale were used by Nikunj Mehta in his dissertation to compare architectural choices: “A system is considered to scale economically if it responds to increased processing requirements with a sub-linear growth in the resources used for processing.” I like that as a quantifiable definition for the architectural property of scalability. However, Nikunj is focused on comparing the relative scalability of particular implementations within an architecture-driven modular framework, not the type of economics that I alluded to in my post.
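
As a rough way to turn that definition into a test, consider the sketch below; the sample measurements and the monotonic-ratio check are my own assumptions about how one might operationalize it, not anything taken from Nikunj’s dissertation.

    # Sketch of the "scales economically" criterion: resource usage should grow
    # sub-linearly with load. The sample measurements below are made up.
    def scales_economically(measurements):
        """measurements: (load, resources) pairs in increasing-load order.
        True if resources-per-unit-of-load never increases as load grows."""
        ratios = [resources / load for load, resources in measurements]
        return all(later <= earlier for earlier, later in zip(ratios, ratios[1:]))

    # Hypothetical system A: doubling the load less than doubles resource usage.
    print(scales_economically([(100, 50), (200, 80), (400, 130)]))   # True
    # Hypothetical system B: resource usage grows faster than the load does.
    print(scales_economically([(100, 50), (200, 120), (400, 300)]))  # False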

It is important to think of system dynamics in terms of economics and not just system throughput, since the impact of growth in non-sustainable systems is often unrelated to the individual data transfers or protocols in use. In fact, some architectures that are more efficient from the standpoint of per-event network usage are also more susceptible to inverse economies of scale. PubSub is one such example.

PubSub (short for publish/subscribe) is a derivative of one of the most analyzed architectural styles in software: event-based integration (EBI). From a purely theoretical perspective, event-based integration is a natural fit for systems that monitor real-time activities, particularly when event sources outnumber the recipients of event notifications (e.g., graphical user interfaces and process control systems). However, EBI does not scale well when recipients greatly outnumber the sources. Researchers have been studying Internet-scale event notification systems for over two decades. Several derivative styles have been proposed to help reduce the issue of scale, including PubSub, EventBus, and flood distribution. There are probably others that I have left out, but they usually boil down to the use of event brokers (intermediaries) and/or a shared data stream (multicast and flood).

PubSub improves EBI scalability by filtering events, either by publisher-driven topics or subscriber-chosen content, and only delivering notifications to the matching subscribers of those topics/content. In theory, if we are careful in choosing the topics such that they partition the events into relatively equal, non-overlapping subsets, and if the consumers of those events are only interested in subscribing to a small subset of those topics, then the corresponding reduction in notification traffic is substantial. Group chat, as in Jabber rooms, is one such example. For most applications, however, such a partitioning does not exist.
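
Here is a minimal sketch of that filtering idea (the Broker class and its method names are mine, not the API of any particular middleware): each event is delivered only to the subscribers of its topic, so the per-event fan-out depends on how well the topics partition the audience, and a single catch-all topic degenerates into plain EBI, where every subscriber receives every event.

    # Minimal topic-based publish/subscribe broker -- an illustration of the
    # style, not any particular middleware's API.
    from collections import defaultdict
    from typing import Callable

    class Broker:
        def __init__(self):
            self._subscribers = defaultdict(list)  # topic -> list of callbacks

        def subscribe(self, topic: str, callback: Callable[[str], None]) -> None:
            self._subscribers[topic].append(callback)

        def publish(self, topic: str, event: str) -> int:
            """Deliver the event only to subscribers of this topic.
            Returns the number of deliveries (the per-event fan-out cost)."""
            for callback in self._subscribers[topic]:
                callback(event)
            return len(self._subscribers[topic])

    broker = Broker()
    broker.subscribe("quotes/ACME", lambda e: print("trader saw:", e))
    broker.subscribe("news/weather", lambda e: print("forecaster saw:", e))
    # Only the one matching subscriber is notified; with a single shared topic,
    # every event would fan out to every subscriber (the unfiltered EBI case).
    broker.publish("quotes/ACME", "ACME bid 42.00")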

EventBus takes a slightly different tack on scalability by forcing traffic onto shared distribution streams, usually IP multicast or flood-distribution channels, and having all subscribers pick their own events off that stream. Unfortunately, multicast does not scale socially (a tragedy of the commons) and rarely succeeds across organizational domains, whereas flooding only succeeds when the signal/noise ratio is high.

Of course, there are many examples of successful EBI systems. USENET news, for example, combined both hierarchical PubSub and flood-distribution to great effect. Adam Rifkin and Rohit Khare did an extensive survey of Internet-scale event notification systems that still stands the test of time. And, as I said at the beginning, EBI is the most popular style of system architecture within the financial industry, where scalability is offset by relatively high subscription costs.

So, what’s the point of this rambling post? Ah, yes, now I remember: the inverse economics of general-purpose PubSub.

People are greedy. They tend to be event-gluttons, wishing to receive far more information than they actually intend to read, and rarely remember to unsubscribe from event streams. When they stop using a service, they tend to remove the software that reports the notifications locally, not the software or subscriptions that cause them to be delivered. If it were not for automatic bounce handling, Apache mailing lists would have melted down years ago. The only way to constrain such users is a tax on either subscriptions or deliveries.

PubSub systems have an imbalance of effort to subscribe versus effort to deliver. In other words, a single user request results in a potentially large number of server obligations. As such, even a benevolent user can produce a disproportionate load on the publisher or broker that is distributing notifications. On the Internet, we don’t have the luxury of designing just for benevolent users, and thus in HTTP systems we call such requests denial-of-service exploits. Any request architecture that allows a malevolent user to create load as a multiple of the effort required to make the request will result in denial-of-service attacks, since such attacks succeed without overwhelming the attacker’s own connectivity.
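
A back-of-the-envelope sketch of that imbalance, with invented numbers, shows how quickly the amplification grows:

    # Back-of-the-envelope amplification: one cheap subscribe request can obligate
    # the broker to an open-ended stream of deliveries. All numbers are invented.
    SUBSCRIBE_COST = 1            # work the subscriber spends, once
    EVENTS_PER_DAY = 10_000       # rate of the subscribed event stream
    DELIVERY_COST_PER_EVENT = 1   # work the broker spends per notification

    broker_cost_per_day = EVENTS_PER_DAY * DELIVERY_COST_PER_EVENT
    amplification = broker_cost_per_day / SUBSCRIBE_COST
    print(f"amplification factor: {amplification:,.0f}x per subscriber per day")
    # 10,000x: a malevolent client that scripts many such subscriptions imposes
    # load far out of proportion to its own effort -- the shape of a DoS attack.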

The only solution to the inverse economics of PubSub that I know of is to match the costs to the revenue stream. In other words, charge consumers for the subscriptions they choose to receive, at a rate corresponding to the cost of delivering the subscribed stream at the subscribed rate and over the subscribed medium. That is exactly why there is no standard mechanism for notifications in HTTP: there is no scalable solution for subscribing to events without also standardizing the creation of user accounts and event filtering mechanisms, neither of which tend to be standardized because user management is an application of its own and event filtering is always domain-specific.
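
Purely as an illustration of matching price to delivery cost (the media and per-notification rates below are invented, and nothing here is a proposal for an actual billing scheme), such a model might look like this:

    # Sketch of subscription pricing that tracks delivery cost: charge by the
    # subscribed rate and the subscribed medium. The per-notification costs are invented.
    COST_PER_NOTIFICATION = {"http-poll": 0.00001, "xmpp": 0.0001, "sms": 0.01}

    def monthly_charge(events_per_month: int, medium: str, margin: float = 1.2) -> float:
        """Price the subscription at delivery cost plus a margin."""
        return events_per_month * COST_PER_NOTIFICATION[medium] * margin

    print(f"${monthly_charge(30_000, 'xmpp'):.2f}")  # a stream delivered over XMPP
    print(f"${monthly_charge(30_000, 'sms'):.2f}")   # the same stream over SMS costs far more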

What does this mean for a system like Jabber? It means that each jabber server (the broker) needs to keep watch on the types of systems being supported via its infrastructure and adjust their subscription model accordingly. Applications like chat will work great, as will most applications wherein the event sources far outnumber the notification recipients. Applications with inverse economics need to pay their own way.

What does this mean for a system like Twitter? Well, that’s an interesting case because it incorporates both subscription-based following of events and RESTful interfaces for event logs. When a user only follows a small set of twitters, or the set of twitters being followed is mostly quiet, then they might be more economically supported by an EBI system than by polling their twitter home stream. However, that won’t be true for people who want to follow many twitters or who are only interested in looking at twits in batch (basically, anyone who polls at a lower frequency than their incoming twit rate). In particular, malevolent followers (those robots that follow everyone, for whatever reason) are much more dangerous to an EBI system.
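
To put rough numbers on that comparison (the rates are invented and per-message size differences are ignored): push costs the origin roughly one delivery per incoming message per follower, while polling costs roughly one request per poll, so polling is the cheaper side whenever the poll rate is lower than the incoming message rate.

    # Rough comparison of server-side work for one follower: push delivers every
    # incoming message, polling costs one request/response per poll.
    # All rates are invented; per-message size differences are ignored.
    def push_ops_per_day(incoming_messages_per_day: float) -> float:
        return incoming_messages_per_day          # one delivery per message

    def poll_ops_per_day(polls_per_day: float) -> float:
        return polls_per_day                      # one request/response per poll

    quiet_stream, busy_stream = 20, 5_000         # messages per day being followed
    polls = 48                                    # follower checks twice an hour
    print("quiet:", "push wins" if push_ops_per_day(quiet_stream) < poll_ops_per_day(polls) else "poll wins")
    print("busy: ", "push wins" if push_ops_per_day(busy_stream) < poll_ops_per_day(polls) else "poll wins")
    # quiet: push wins (20 deliveries vs 48 polls)
    # busy:  poll wins (48 polls vs 5,000 deliveries)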

My guess is that Twitter will eventually move to an ad-supported revenue stream on their Web interfaces, since ad revenue increases at the same rate as RESTful applications need to scale (i.e., RESTful systems have a natural economy of scale that makes them sustainable even when their popularity isn’t anticipated). Likewise, Twitter’s event-based interfaces will move toward a more sustainable subscription model that charges for excessive following and premium services like real-time event delivery via XMPP or SMS. That kind of adjustment is necessary to rebalance the system dynamics.