This is a post about error handling in the C programming language in general, and in HelenOS in particular.

First, a bit of context. C traditionally doesn’t have very strong type system, especially when it comes to integer values. There is basically no support for defining new integer types – there are char, signed char, unsigned char (yes, those are three distinct types; while char is semantically identical to one of the other two, it’s still separate and not just an alias), short, unsigned short, int, unsigned int, long, unsigned long, long long, unsigned long long, and _Bool. That’s the exhaustive listing for standard types. On several platforms, compilers also define the non-standard __int128, but it’s not universally supported.

Other non-stardard integer types are allowed, but this is not done in practice. All the other types you get from various header files – e.g. int32_t, size_t, wchar_t, etc. – are all just aliases to one of the above, a measly #define int32_t int (although these days typedef is more commonly used, semantically, it makes no difference – typedef creates an alias, not a new type). Enumerated types defined via enum are no different. Although recent compilers come with scores of diagnostics for enum types and their constants, it’s a far cry from strong type checking. C++ gained its enum class some time ago, but C, sadly, doesn’t have anything like that.

It comes as no surprise, then, that most C code doesn’t really distinguish between various numeric types, whether they are enumerations, bitflags, or file descriptors. I jokingly call it the “all-int” situation. See a parameter or a return value typed int? Well great, you learned next to nothing. It could be anything. The designation has no semantic value.

Naturally, this extends to error-handling. In HelenOS code base, the go-to error handling mechanism has been to return negative error codes on failure and positive valid returns on success. Correspondingly, its <errno.h> header defined negative constants, contrary to the C language standard. This led to problems. Interfacing HelenOS libraries with code written for standard environment (typically POSIX) has been more painful than necessary, and where using standardized error codes just doesn’t cut it, domain-specific error codes have been used with mixed results. On several occasions, different kinds of error returns have been mixed improperly, resulting in hidden bugs that only manifest in the rare exceptional conditions.

Towards the solution

The issue with negative error codes is probably the single greatest blocker for a standards-compliant libc in the heart of HelenOS. However, since the code depends on them being negative, just changing the constants would break pretty much everything. Annoyingly, just separating error returns from actual results is not by itself sufficient, because some code would still (improperly) check for negativity, and it wouldn’t help with existing error handling bugs, or with bugs inadvertently introduced during the transition.

My first attempt was to simply rename the constants and keep them negative, reintroducing standard error codes on a case-by-case basis. This turned out to be a spectacularly useless idea. It would create many problems and probably cause more pain than it solved. I still thought the solution would be in splitting the errors into independent, API-specific groups, but had little idea how to turn that into practice. At the very least, I decided it would help to introduce the C11 errno_t type, and see where it goes.

Then, a week ago, Jiří Svoboda started his own efforts of separating error returns from valid results, which at the time duplicated/conflicted-with my own efforts. However, this pointed me back to the idea of adding output parametes instead of working with negative returns by another name, something that I originally dismissed as distruptive. After a short e-mail conversation, I asked Jiří to give me until the end of the week to work on this my way, to which he agreed.

Solving all the issues by the end of the week, in the entire code base? Insane! Well, not quite. And I would have managed if I didn’t make some silly mistakes in the process, but I digress. I was already considering how to utilize compiler diagnostics to detect problems, so when Jiří started separating the error values, I got an idea how to exploit it fully.

s/int/errno_t

The idea is simple. If we mark every error value by a specific type (such as errno_t, because why not?), then we can make the compiler fail-out on every instance of errors getting mixed with non-errors. “But wait,” you say, “didn’t you just explain that C can’t do that?”. Well, sort of. You see, the typing doesn’t necessarily have to make sense or work at runtime, it just needs to typecheck. If the typechecker guarantees that no mixing is happening, you can change the type and constants after the fact and the guarantee still applies (at least until you make new bugs). And C actually does have decent diagnostics for various types, even if not all of them in any single type.

So I started by defining errno_t to be a unique pointer type, and all Exxxx constants to be pointers of that type. This gives us some rather strong guarantees: no assigments from or to other types without explicit casts, no comparison to integers (except for equality with zero, which doesn’t hurt us), no printf as an integer (not strictly a problem, but it’s always nice to see a string representation instead of a random number).

That leaves the issue of actually changing the type of thousands of instances of function parameter/return values and variables. As Jiří pointed out, in HelenOS almost every function that returns int returns an error code. Which is exactly what makes it easy. We can just mechanically rename all int return types, along with a select few variable names (rc, ret, retval, a few others that came up). There are far fewer exceptions than there are errors, so doing the automatic replace and then fixing the problems is much easier than going the other direction (remember, at this point errno_t is type-checked, so there’s no way to miss an errno_t variable that has a non-error number assigned). And the great thing about it is that applying a reverse rename from errno_t to int doesn’t change semantics and gives a nice, manageable diff of actual changes. Naturally, there are a lot of instances where new variables had to be introduces to separate errno errors from other numbers, but faced with the certainties we get in exchange, it’s a rather small price to pay.

It was still a lot more demanding that I anticipated, mostly because I made some mistakes early on that forced me to redo a lot of the work (automatic renames can be tricky to use right), but I still consider it well worth the effort. As of now, I finished userspace, with major changes committed and remaining minor changes (and final gargantuan reverse-automatic-rename patch) pending review. Kernel is still in the works (the uspace part exhausted me), but should be ready in a few days.