The Great Graveyard in the Cloud

or, How Code Dies

In physics, we understand that any isolated system, left to itself, will move towards a state of maximal entropy.

For most of us, the idea of entropy is conflated with the idea of disorder. After all, if you don’t tidy your workshop, it will tend towards a disordered state very rapidly.

While this is an incomplete understanding of entropy, a subject which Dr. Sabine Hossenfelder goes into far more eloquently than I can, it will serve to illustrate some of the arguments I’ll make here.

This is the second post inspired by A. Wilcox’ notes on the removal of the Linux/ia64 port from the mainline Linux kernel – and I’d like to talk a little about why code is removed from the kernel to begin with.

Maintenance Burden

hanging on in quiet desperation is the English way…

Any large software project – and in this case I mean any software project which has to be divided into parts, each with one person overall responsible for it – is going to contain the contributions of different people, with different skill sets, and different approaches to problem solving.

You may recall that some pieces of software were open-sourced only after lengthy negotiations.

Some code that shipped with Netscape, for instance, did not ship with Mozilla – because Netscape Communications Corporation didn’t actually own that code. It needed to be removed. A similar thing occurred with StarDivision’s StarOffice, which would later become OpenOffice. There was a brief period of time where StarOffice existed as a commercial product – based on OpenOffice, but with the added components reattached, and commercial support provided. Such was also the case, as I recall, with Netscape 6 and Netscape 7 being based on Mozilla.

Speaking of Netscape – the Mozilla source code that would be released as part of Netscape’s commitment to open-source their software had precious little in common with the Communicator and Navigator releases before it. The Communicator codebase was, in 1998, still using pieces of code first written in 1994. During the 1990s, when technology was moving at a breakneck pace, these older fragments of code, some of which had not seen significant change since they were first written, were ultimately a liability.

When we talk about this problem now, we refer to refactoring a codebase. Refactoring is the process of rewriting existing code in line with modern best practice, while perfectly preserving its external behaviour. If the code is part of a larger project, we may be able to adjust the ways that we communicate with the code – its API, or Application Programming Interface – knowing that we are the only project depending on its specific behaviour.
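As a toy illustration of what "perfectly preserving its external behaviour" means, here is a minimal sketch in Python. The function names and the data layout (a list of price and quantity pairs) are my own inventions for the example and have nothing to do with the Netscape or Mozilla source. Both functions give the same answer for the same input; only the inside has changed.

```python
def total_price_legacy(items):
    # Older style: manual index loop and a hand-rolled accumulator.
    total = 0
    i = 0
    while i < len(items):
        if items[i][1] > 0:
            total = total + items[i][0] * items[i][1]
        i = i + 1
    return total


def total_price_refactored(items):
    # Same external behaviour, expressed as a single generator expression.
    return sum(price * qty for price, qty in items if qty > 0)


def same_behaviour(samples):
    # A refactor is only a refactor if callers cannot tell the difference.
    return all(
        total_price_legacy(s) == total_price_refactored(s) for s in samples
    )
```

A check like `same_behaviour` is the part that matters in practice: without some way of comparing old and new behaviour, a "refactor" is just a rewrite with good intentions.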

With the prospect of a major version increase, Netscape was able to rewrite a large amount of its codebase in line with the best practices of 1998 for release as the Mozilla Application Suite, and then, Netscapes 6 and 7. But it wasn’t quite a refactor – Netscape 6, and Mozilla 1.0, both worked quite differently from Netscape Communicator 4.7.

You’ll notice that I refer to older fragments of code, not the codebase as a whole. In a moment I’ll talk about bit rot – both the physical and spiritual kinds – but first, I’d like to explain that a large enough codebase becomes almost organic in its nature. Using the example of the Mozilla codebase – running the published source code of Mozilla 1.0 through the tool `cloc` tells us that it has over 3.67 million lines of code across over 17,000 files. This count ignores documentation, code commentary, and blank lines. Rewriting a section of code to be cleaner, or more efficient, often takes a back seat to adding features or fixing bugs. When you have many such sections that have wanted attention for many years, a codebase may start to look more like a cancerous mass than a well-engineered machine.

The process of keeping that cancerous mass running and functional is the maintenance burden. The maintenance burden lies in keeping documentation up to date, in fixing bugs that may occur as software is introduced to new environments, and in keeping up with external forces that require the addition of features to a mature codebase.

Sometimes, the correct call is to simply rewrite the entire piece of software. But, all too often, that takes time that project maintainers simply don’t have; so, instead, like a cancer, the software grows and contorts and deforms under load.

Bit Rot

and if your head explodes with dark forebodings too…

Bit rot originally referred to the possibility of time and adverse environmental conditions destroying the physical media used to store digital data. As the sometimes suboptimal storage conditions of paper and vellum challenge archivists and researchers, so now does the fact that we’ve been making magnetic media of increasing density since IBM’s Model 350 in 1956. The mechanism by which flash memory, such as USB sticks and modern solid-state drives, stores data, poses additional challenges to recovery of damaged media.

Just like the somewhat fungible magnetic media that for so long was the mainstay of computer storage, language, too, changes and evolves – shifting to meanings that could not have been imagined when the words were first penned. And from this, we develop a second concept of bit rot.

Bit rot thus also refers to this … cancerous uncontrolled organic lossage that occurs in software projects.

Sometimes, external components on which a software product depends change – one such example would be code that depends on a specific compiler’s features, and becomes defective because a later version of that compiler, for whatever reason, produces incorrect code from it.

Other times, a codebase becomes so large and unwieldy that understanding it becomes difficult; bugs may accumulate that simply cannot be fixed, as the maintainer does not entirely comprehend how the undesired behaviour has come to be.

I’ve started with the example of Mozilla because by comparison to the Linux kernel, it’s small. It’s the Linux kernel that I’m concerned with here – you’ll remember that Mozilla 1.0 has 3.67MLOC over 17,000 files. The same tool, `cloc`, applied to the Linux kernel version 6.6, tells us that it has 26.23MLOC over 70,280 files. That’s seven times as many lines of code and roughly four times the number of files.
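For anyone who wants to check the arithmetic, here is a small Python sketch using the figures quoted above. The Mozilla numbers are the rounded "over 3.67 million" and "over 17,000" counts, so treat the ratios as approximate; the dictionary and function names are mine, purely for illustration.

```python
# Line and file counts as quoted in the text (Mozilla figures are rounded).
MOZILLA_1_0 = {"lines": 3_670_000, "files": 17_000}
LINUX_6_6 = {"lines": 26_230_000, "files": 70_280}


def ratio(larger, smaller, key):
    # How many times bigger one codebase is than the other, for one measure.
    return larger[key] / smaller[key]


line_ratio = ratio(LINUX_6_6, MOZILLA_1_0, "lines")
file_ratio = ratio(LINUX_6_6, MOZILLA_1_0, "files")

# Prints: 7.1x the lines, 4.1x the files
print(f"{line_ratio:.1f}x the lines, {file_ratio:.1f}x the files")
```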

With such a large project, it is by choice and not by chance that the software remains fit for purpose. This requires active effort to find bugs before they can affect users, as well as continually making sure that parts of the software interoperate through well-defined interfaces that remain stable and consistent. These interfaces are defined not only in code but in documentation, to further ensure that breaking changes don’t sneak past.

These active efforts involve automated testing such as linting and fuzzing (respectively, checking for patterns of data manipulation that might one day result in error, and feeding deliberately malformed data into programmes). They also ultimately involve developers with appropriate environments testing the software directly. In the case of Itanium support, this means developers who have both an Itanium system to test with (preferably several, in different configurations) and, ideally, enough understanding of system internals to identify incorrect behaviour and potentially rectify it.
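To give a flavour of what fuzzing looks like, here is a deliberately tiny sketch in Python. It is nothing like the tooling actually used on the kernel; `parse_record` is a hypothetical toy parser I made up for the example, and `fuzz` simply throws random bytes at it. The idea is that a clean rejection of bad input is fine, while any other kind of failure is a bug somebody needs to look at.

```python
import random


def parse_record(data: bytes) -> tuple[int, bytes]:
    """Hypothetical toy parser: first byte is a length, the rest is payload."""
    if not data:
        raise ValueError("empty input")
    length = data[0]
    payload = data[1:1 + length]
    if len(payload) != length:
        raise ValueError("truncated payload")
    return length, payload


def fuzz(target, rounds: int = 1000, seed: int = 0) -> list[bytes]:
    """Feed random byte strings to target; collect inputs that fail badly."""
    rng = random.Random(seed)  # Seeded so that any failure is reproducible.
    crashes = []
    for _ in range(rounds):
        size = rng.randrange(0, 16)
        blob = bytes(rng.randrange(256) for _ in range(size))
        try:
            target(blob)
        except ValueError:
            pass  # Malformed input rejected cleanly: the desired outcome.
        except Exception:
            crashes.append(blob)  # Anything else is worth investigating.
    return crashes
```

Calling `fuzz(parse_record)` should come back with an empty list, whereas a careless target such as `lambda d: d[0]` falls over on the first empty input. Real fuzzers are far cleverer about choosing inputs, but the principle is the same.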

The Unsinkable R.M.S. Itanic

for the want of the price of tea and a slice…

As covered in my previous post, Intel has made some things that were explosively successful on the market, as well as some things that weren’t; Itanium falls firmly into the latter category.

Itanium machines were obscenely priced to begin with, promising performance the likes of which mere mortals had never known. The first generation machines were largely used to get vendors to write or port software for Itanium. Second generation Itaniums, which had actually started to develop something that vaguely looked like performance for the price, were overshadowed by AMD’s Opteron hardware, which was affordable, available, and allowed backward-compatibility with existing x86 software.

The number of Itanium machines that were “out in the wild”, so to speak, was never very high. In 2007, according to Gartner Research, the total number of Itanium servers sold was about 55,000 – across all vendors. Compare this to the roughly 8,400,000 x86 machines sold in the server space alone (around 150 times as many), and it’s easy to understand why one can find an old x86 server more readily than an old Itanium server.

If we’re lucky, there might be two or three million Itanium systems that were sold in total. Some of those will have been sold to clients whose data security policies required that the systems were physically destroyed after their decommissioning. Some will have been sold to clients who were happy to mix them into typical e-waste; they may come up at auction or be sold by companies that specialise in ex-commercial hardware. Some may have been sold to private individuals; workstations like the HP zx6000 did exist, though it’s unclear how many were purchased.

The parts to keep these systems running must come from other systems; they will be expensive. The systems themselves are expensive. The Itanium machines are power-inefficient by comparison to even x86 (the 32-bit version) for general computational tasks; the cost of electricity is kind of a big deal when you’re responsible for testing hardware.

Debian maintainer Dr. John Paul Adrian Glaubitz has some Itanium hardware. You’d hope so, as he’s part of the team that keeps Itanium running on Debian, and was able to get a stay of execution on Itanium’s removal from the Linux kernel by offering to take up maintainership for that architecture. One man alone cannot really fight the accumulation of cruft in a codebase with so many moving parts. Regression testing each commit piecemeal, by compiling and building, is viable with a farm of many computers and many humans co-ordinating that testing; one man taking on such a task, no matter how driven, would be merely tilting at windmills, even if he had, say, sixty-four Itanium boxes.

The number of free software developers worldwide who have Itanium hardware is low. Not all of them understand the Linux kernel. Some may be working on the NetBSD port instead. After the initial burst of investment into making Linux run on the Itanium in 1999, effort dwindled before petering out.

Itanium was nice while it lasted, I’m assured by people who had access to Itanium machines. Compared to so many other architectures which saw far more widespread use, even if it was thirty years ago now (looking at you, Linux/m68k), Itanium is near-impossible to maintain and develop for simply because there’s so little of it out there.

Your favourite architecture probably isn’t next.

long you live and high you fly…

There has been some consternation suggesting that the removal of Itanium from the Linux kernel is opening the floodgates to the mass deprecation of architectures that, in historical terms, didn’t win the race.

This is wrong. The earlier removal that most people will remember is that of support for the venerable Intel 386, the CPU that Linux was initially built for. Intel stopped making the 386 in 2007, many years after it had ceased to be suitable for the average user to do anything on. The 386 was removed over ten years ago, well after people had stopped trying to put Linux onto such old machines as anything more than a curiosity.

The 386 was removed because the 486 which came after it was such a quantum leap that it became the minimum viable processor to programme in a “modern” way – the cost of continuing to support a processor that basically nobody used exceeded the benefit of trying to keep that section of the codebase up with new developments.

This cost was increased by the fact that … quite bluntly, the sort of people who are doing kernel hacking are not the sort of people still running 386s for fun. Usually, anyway. There might be a 386 or two in my house.

In the ARM ecosystem, it’s a little different; as basically every ARM CPU is also its own system, there are sometimes attempts to prune away specific ARM CPUs that have had no maintenance and no users for a number of years. The overall support for ARM remains, however, and adding support for a new CPU, or re-adding an old one should it be required, is considerably less technically difficult.

SPARC has devotees who’ve been able to collect SPARCstations – the older 32-bit and the newer 64-bit both. 32-bit SPARC isn’t a huge maintenance burden on top of understanding 64-bit SPARC.

PowerPC … it has been contemplated that some of the stranger SoC variants such as PowerQUICC might not be tenable to support. Much like ARM, these are cases where the CPU is actually an entire device unto itself; pruning one of them is not the removal of an architecture.

m68k? No. The Amiga nerds aren’t going to let that happen. Linux developers take note: they walk among you. Someone you know, perhaps someone you love, is quietly composing love letters to their Amiga 1200 at night.

As for the rest? As long as there’s working hardware to test on, and people willing to shoulder the load of maintaining the architecture, the future is promising indeed.

If you want a specific bit of hardware supported, you should probably buy some to sponsor a kernel hacker. After all – while kernel hackers aren’t generally working on 386s, and haven’t been for thirty years, they do enjoy a challenge.