Linux maintains bugs: The real reason ifconfig on Linux is deprecated

In my third installment of FreeBSD vs Linux, I will discuss underlying reasons for why Linux moved away from ifconfig(8) to ip(8).

In the past, when people said, “Linux is a kernel, not an operating system”, I knew that was true but I always thought it was a rather pedantic criticism. Of course no one runs just the Linux kernel, you run a distribution of Linux. But after reviewing userland code, I understand the significant drawbacks to developing “just a kernel” in isolation from the rest of the system.

Let's say a userland program wants to request an object from the kernel. The kernel structure might be something like this:

struct foo {
     size_t size;
     char name[20];
     int val;
};

On POSIX systems, a typical way to communicate with the kernel is to open a file descriptor to the appropriate subsystem and send an ioctl(2) with a pointer to where the kernel should store the responding data. FreeBSD might perform this task as follows:

struct foo x;
ioctl(fd, CMD_REQUEST_FOO, &x);

Linux should do the same and, to be fair, it typically does. This manifests as software source that requires the Linux kernel's headers. But because userland tools are maintained independently of the kernel, and sometimes are even explicitly written to be cross-platform, they typically maintain their own copy of data structures and macros independent of the Linux source tree.

So far so good. This might even produce the exact same binary output. But what happens if the kernel structure or behavior changes? This could be due to a bug fix, an added feature or an optimization – either way, the structure may change.
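To make that concrete, here is a minimal sketch of a userland tool carrying a stale copy of struct foo. Everything in it is made up for illustration: the foo_v1/foo_v2 names, the idea that the name field grew from 20 to 32 bytes, and fake_kernel_fill(), which merely stands in for the kernel side of an ioctl. No real kernel interface is being shown.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical old layout that a userland tool copied into its own headers. */
struct foo_v1 {
	size_t	size;
	char	name[20];
	int	val;
};

/* Hypothetical new kernel layout: name grew from 20 to 32 bytes. */
struct foo_v2 {
	size_t	size;
	char	name[32];
	int	val;
};

/* Stand-in for the kernel: fills the caller's buffer using the NEW layout. */
static void
fake_kernel_fill(void *buf)
{
	struct foo_v2 k;

	assert(buf != NULL);
	memset(&k, 0, sizeof(k));
	k.size = sizeof(k);
	strcpy(k.name, "example");
	k.val = 42;
	memcpy(buf, &k, sizeof(k));
}

/* Tool built against the stale header: looks for val at the old offset. */
static int
stale_read_val(const void *buf)
{
	struct foo_v1 u;

	memcpy(&u, buf, sizeof(u));
	return (u.val);
}

/* Tool rebuilt against the current header: finds val where the kernel put it. */
static int
current_read_val(const void *buf)
{
	struct foo_v2 u;

	memcpy(&u, buf, sizeof(u));
	return (u.val);
}
```

Given a buffer of sizeof(struct foo_v2) bytes filled by fake_kernel_fill(), current_read_val() returns 42 while stale_read_val() reads from the middle of the name field instead. Nothing warns the stale tool at compile time or at run time. It just gets the wrong answer.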

On FreeBSD this is not a problem. The kernel and userland tools are updated in tandem. In fact, because both the kernel and userland applications live in the same source tree, they can even share the same header files. For third-party userland applications, FreeBSD provides highly stable libraries that handle all the kernel interactions, such as lib80211(3). It's worth noting that OpenBSD and NetBSD do not have these libraries because the kernel interface itself is highly stable anyway. FreeBSD even provides a COMPAT layer for the rare cases in which an older binary fails to run on modern versions of FreeBSD.

Conversely, on Linux, because the kernel and the rest of the operating system are not developed in tandem, updating or fixing a kernel struct would almost be guaranteed to break a downstream application. The only way to prevent this would be to conduct regular, massively coordinated updates to system utilities whenever the kernel changes, and to properly version applications for specific kernel releases. Quite a herculean endeavor. This also explains why systemtap, one of Linux’s many answers to dtrace(1), does not work on Ubuntu.

Also, Linux can never have an equivalent of lib80211(3) because there is no single standard library set. Even for the standard C library, Linux has Glibc, uClibc, Dietlibc, Bionic and Musl. Rather than guessing the underlying C library implementation or falling into “dependency hell”, applications default to the most low-level implementation of their requested functionality. Some tools, such as ifconfig(8), resort to just reading from the /proc filesystem.

Linux’s solution to this problem was to create a policy of never breaking userland applications. This means userland interfaces to the Linux kernel never change under any circumstances, even if they malfunction and have known bugs. That is worth reiterating. Linux maintains known bugs – and actively refuses to fix them. In fact, if you attempt to fix them, Linus will curse at you, as this email shows.

And this leads back to the topic. Have you ever wondered why nearly every distribution deprecated ifconfig(8), a standard networking tool dating back to classic Unix? When Linux first implemented multiple IPv4 addresses on the same physical interface, it did so by cloning the interface in software and assigning each clone a unique IPv4 address. For example, eth0 could be cloned as eth0:1, eth0:2, etc. From a programmatic perspective, eth0 still only had one IPv4 address. As time passed and developers updated the kernel, it allowed users to assign multiple IPv4 addresses directly to the same interface, bypassing the need for cloning.

But Linux’s API has not changed. It still only returns a single legacy IPv4 address per interface. An interface could have multiple IPv4 addresses but ifconfig(8) will still only report a single address. In other words, as it currently stands, ifconfig(8) lies to you. I do not fully understand why they did not just update ifconfig(8) – random IRC rumors say there was a failed attempt due to ifconfig(8)’s convoluted code-base. But for whatever reason, this led to the completely new tool ip(8).
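For contrast, here is a small sketch of an interface that hands back one entry per address rather than one per interface: getifaddrs(3), which exists on both Linux and the BSDs. This is not how ifconfig(8) or ip(8) is implemented, and count_ipv4_addrs() is my own helper name. It only illustrates that a program walking this list will see every IPv4 address on an interface, not just the first.

```c
#include <sys/types.h>
#include <sys/socket.h>

#include <assert.h>
#include <ifaddrs.h>
#include <netinet/in.h>
#include <stddef.h>
#include <string.h>

/*
 * Count the IPv4 addresses bound to ifname, or to all interfaces when
 * ifname is NULL. Returns -1 if getifaddrs(3) itself fails.
 */
static int
count_ipv4_addrs(const char *ifname)
{
	struct ifaddrs *list, *ifa;
	int count = 0;

	if (getifaddrs(&list) != 0)
		return (-1);

	/* The list holds one node per address, so an interface can repeat. */
	for (ifa = list; ifa != NULL; ifa = ifa->ifa_next) {
		/* Some entries carry no address at all; skip them. */
		if (ifa->ifa_addr == NULL ||
		    ifa->ifa_addr->sa_family != AF_INET)
			continue;
		assert(ifa->ifa_name != NULL);
		if (ifname != NULL && strcmp(ifa->ifa_name, ifname) != 0)
			continue;
		count++;
	}
	freeifaddrs(list);
	return (count);
}
```

Calling count_ipv4_addrs("eth0") on a machine where eth0 carries several IPv4 addresses should return more than one, assuming the addresses were added without per-address labels such as eth0:1.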

By contrast, FreeBSD just updates its ifconfig(8) in tandem with any kernel updates and there are no problems. Simple.

This also explains why Linux has multiple tools for seemingly highly correlated network tasks. Rather than working together to create a consolidated tool, Linux has iw(8), iwconfig(8), brctl(8), etc., whereas FreeBSD just has different drivers for its ifconfig(8) implementation. For the record, I think ip(8)’s syntax is cleaner than ifconfig(8)’s syntax, as the latter is a victim of IPv4 legacy syntax. If both tools worked just fine, it might be worth keeping ifconfig(8) for legacy scripts during a transitional period while making ip(8) the future. That would be perfectly fine, but it would be ideal if both tools just worked, rather than having to abandon one because it is broken.

Written with love on a laptop running OpenBSD 6.3.

Thoughts?

fsync(2) on FreeBSD vs Linux

Even with our modern technology, hard-disk operations tend to be the slowest and most painful part of any modern system. As such, modern operating systems implement a buffering mechanism. In this scheme, when an application calls write(2), rather than immediately performing physical disk operations, the operating system stores the data in a kernel buffer. When the buffer exceeds a certain size or when an application calls the fsync(2) system call, the kernel begins writing to the disk.

This scheme is significantly faster, as perhaps best demonstrated by the massive performance differential between GNU and BSD yes(1), as initially noted here. Note: FreeBSD’s yes(1) has now reached parity with GNU.

So far so good. But what happens when a disk write operation fails? This could be due to a hardware or network failure, but ultimately it is not the fault of the operating system. However, the operating system must properly handle the failure.

On Linux, when an application’s fsync(2) call fails, the kernel returns a disk error. However, it then marks the buffer as clean rather than leaving it “dirty”, and records the failure only in an error flag (EIO). When the application issues another fsync(2) and the disk succeeds, the kernel clears the error flag and reports a successful write to the application. As such, the previously failed data never hit the disk and, if the application has already discarded its own copy, the data is lost.

On FreeBSD, when an application’s fsync(2) call fails, the kernel also reports the error to the application. But unlike Linux, it keeps the “dirty” bit set, so the kernel buffer is not overwritten until the page buffer is cleared, even if a subsequent fsync(2) succeeds. This way, the page data is not lost.

This is another example of the superiority of FreeBSD over Linux. FreeBSD can better survive a disk failure, while Linux’s implementation is fundamentally broken. In the past I have experienced Linux’s ext4 fail into read-only mode to prevent disk corruption. While that might be a fall-back mechanism, it is not a long-term solution. Instead, userland applications have to keep track of whether the kernel was successful or not. Depending on your perspective, this is a stack violation.
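Here is a rough sketch of what “keeping track” looks like from the application side. The helper names open_for_write() and write_durably() are mine, and this is only one defensive pattern, not something taken from any particular project: treat the first fsync(2) failure as final, and never assume that a later successful fsync(2) covers the data that failed earlier.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/* Open (or create and truncate) a file for writing. Returns fd or -1. */
static int
open_for_write(const char *path)
{
	return (open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600));
}

/*
 * Write len bytes and force them to stable storage.
 * Returns 0 on success, -1 with errno set on failure. On failure the
 * caller must still hold its own copy of the data; retrying fsync(2)
 * alone and trusting a later success is exactly the trap described above.
 */
static int
write_durably(int fd, const void *data, size_t len)
{
	const char *p = data;

	assert(data != NULL || len == 0);
	while (len > 0) {
		ssize_t n = write(fd, p, len);

		if (n < 0) {
			if (errno == EINTR)
				continue;	/* interrupted, try again */
			return (-1);
		}
		p += n;			/* handle short writes */
		len -= (size_t)n;
	}
	if (fsync(fd) != 0)
		return (-1);		/* report it; do not retry blindly */
	return (0);
}
```

If write_durably() fails, the safe options are to rewrite the data from the application’s own copy (possibly to a different file or device) or to stop and report the error. The kernel’s buffer can no longer be trusted to hold it.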

Additionally, any long-term solution to change the behavior of the operating system would mean all user-land applications would potentially break. Linus Torvalds has notoriously stated:

Breaking user programs simply isn't acceptable

In fact, he’s repeated this policy in more colorful language here. So you’re stuck with bad behavior.

Now consider if you want to build an operating system that will run for potentially a hundred years and produce zero errors or catch errors and properly perform exception handling. Go with FreeBSD.

It's worth noting that Illumos (Solaris) properly implements fsync(2), whereas OpenBSD and NetBSD also failed on this issue, and I fully expect them to fix the problem.

Linux kernel code vs FreeBSD kernel code

Linux driver code contains some serious garbage. I had heard this refrain before, but I did not realize how bad it was until I looked at it myself. Here is just one example.

Device drivers typically read static memory, usually known as EEPROM or ROM, from the chip to identify the version, hard-coded information, device capabilities, etc. These values are used throughout execution of the driver. Reading the ROM is among the first things done when the device is attached and powered on.

In the case of FreeBSD, after the kernel reads the ROM, it uses a struct pointer with all the variables pre-populated, and points it at the ROM blob data stored in memory. For example:

struct r88e_rom {
	uint8_t		reserved1[16];
	uint8_t		cck_tx_pwr[R88E_GROUP_2G];
	uint8_t		ht40_tx_pwr[R88E_GROUP_2G - 1];
	uint8_t		tx_pwr_diff;
	uint8_t		reserved2[156];
	uint8_t		channel_plan;
	uint8_t		crystalcap;
#define R88E_ROM_CRYSTALCAP_DEF		0x20

	uint8_t		thermal_meter;
	uint8_t		reserved3[6];
	uint8_t		rf_board_opt;
	uint8_t		rf_feature_opt;
	uint8_t		rf_bt_opt;
	uint8_t		version;
	uint8_t		customer_id;
	uint8_t		reserved4[3];
	uint8_t		rf_ant_opt;
	uint8_t		reserved5[6];
	uint16_t	vid;
	uint16_t	pid;
	uint8_t		usb_opt;
	uint8_t		reserved6[2];
	uint8_t		macaddr[IEEE80211_ADDR_LEN];
	uint8_t		reserved7[2];
	uint8_t		string[33];	/* "realtek 802.11n NIC" */
	uint8_t		reserved8[256];
} __packed;

_Static_assert(sizeof(struct r88e_rom) == R88E_EFUSE_MAP_LEN,
    "R88E_EFUSE_MAP_LEN must be equal to sizeof(struct r88e_rom)!");

Notice the assertion at the bottom, which ensures that the ROM struct’s size equals a pre-defined length. The code will fail to compile if this assertion is not valid. Later, the kernel will instantiate a struct pointer and point it to the ROM, stored in the variable buf, as follows:

struct r88e_rom *rom = (struct r88e_rom *)buf;

Now, rom->channel_plan is set to the correct value. Simple.

Unfortunately, this is not how the same code is written on Linux. As mentioned, the Linux driver also begins by reading the ROM blob and storing it in a variable called hwinfo. But rather than creating an equivalent struct pointer, the Linux code uses offset values into the ROM on an as-needed basis. For example, the driver reads the version as follows:

rtlefuse->eeprom_version = *(u16 *)&hwinfo[params[7]];

In this example, params[7] comes from a list of ROM offset values set in the previous calling function. (That alone made tracing difficult.) The rtlefuse->eeprom_version is now the same as FreeBSD’s rom->version. This manual process repeats for every variable in the ROM.
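The two styles can be put side by side in a toy example. The struct demo_rom layout and DEMO_ROM_LEN below are invented and far smaller than any real efuse map. The point is only that the struct overlay lets the compiler compute and check the offsets, while the offset style leaves that bookkeeping to whoever passes the index.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_ROM_LEN	8	/* hypothetical; real ROM maps are much larger */

/* Hypothetical miniature ROM layout. All fields are bytes, so no padding. */
struct demo_rom {
	uint8_t	reserved1[2];
	uint8_t	channel_plan;
	uint8_t	crystalcap;
	uint8_t	version;
	uint8_t	reserved2[3];
};

/* Compilation fails if the struct and the ROM length ever disagree. */
_Static_assert(sizeof(struct demo_rom) == DEMO_ROM_LEN,
    "DEMO_ROM_LEN must be equal to sizeof(struct demo_rom)!");

/* Overlay style: name the field and let the compiler find the offset. */
static uint8_t
version_by_struct(const uint8_t *buf)
{
	const struct demo_rom *rom = (const struct demo_rom *)buf;

	assert(buf != NULL);
	return (rom->version);
}

/* Offset style: the caller must know, and keep in sync, the right index. */
static uint8_t
version_by_offset(const uint8_t *hwinfo, size_t off)
{
	assert(hwinfo != NULL && off < DEMO_ROM_LEN);
	return (hwinfo[off]);
}
```

Both functions return the same byte when the offset passed in happens to equal offsetof(struct demo_rom, version). With the overlay, inserting a field only requires editing the struct. With raw offsets, every index after the insertion point has to be found and fixed by hand.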

That may be merely annoying and cost a negligible bit more CPU power, and it would not be a problem if it were all done in one place. But instead, the driver reads from the hwinfo blob on a seemingly as-needed basis during execution. And because these as-needed reads happen during normal execution, the driver re-reads the same static value from hwinfo every time a simple WiFi function occurs, such as changing the channel.

Okay, but even that might not be too difficult…right? Here’s the real kicker.

Sometimes, the driver works by using incrementing offsets into the ROM blob. For example, consider read_power_value_fromprom (in drivers/net/wireless/realtek/rtlwifi/hw.c). It declares eeaddr as a u32 (uint32_t), then assigns it the offset value EEPROM_TX_PWR_INX. So far so good. But then, rather than using new offsets for every successive value, it increments the eeaddr value in multiple doubly-nested for-loops. Here is a simplified version of the code:

for (rfpath = 0 ; rfpath < MAX_RF_PATH ; rfpath++) {
	/*2.4G default value*/
	for (group = 0 ; group < MAX_CHNL_GROUP_24G; group++) {
		pwrinfo24g->index_cck_base[rfpath][group] =
		  hwinfo[eeaddr++];
		if (pwrinfo24g->index_cck_base[rfpath][group] == 0xFF)
			pwrinfo24g->index_cck_base[rfpath][group] =
			  0x2D;
	}
}

Notice the line hwinfo[eeaddr++]! Merely reading in that variable changes the offset. It's the Heisenberg Uncertainty Principle equivalent of code. This is a cleaned-up version of the 188-line function. The actual function has 6 nested for-loops, some with if-statements, each incrementing the eeaddr parameter as they go along.
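For a single table like the one above, the running cursor is equivalent to computing each offset from the loop indices, which is what a struct with a two-dimensional array member would give you for free. The sketch below shows both on a toy ROM. The DEMO_* constants and function names are invented, and the real function interleaves several tables of different sizes, so this captures only the basic idea.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_RF_PATHS	2	/* hypothetical stand-in for MAX_RF_PATH */
#define DEMO_GROUPS	3	/* hypothetical stand-in for MAX_CHNL_GROUP_24G */
#define DEMO_BASE	4	/* hypothetical stand-in for EEPROM_TX_PWR_INX */
#define DEMO_DEFAULT	0x2D	/* value substituted for an unprogrammed 0xFF */

/* Cursor style: each read moves eeaddr, so order and count must never change. */
static void
read_with_cursor(const uint8_t *hwinfo, uint8_t out[DEMO_RF_PATHS][DEMO_GROUPS])
{
	size_t eeaddr = DEMO_BASE;
	size_t rfpath, group;

	for (rfpath = 0; rfpath < DEMO_RF_PATHS; rfpath++) {
		for (group = 0; group < DEMO_GROUPS; group++) {
			out[rfpath][group] = hwinfo[eeaddr++];
			if (out[rfpath][group] == 0xFF)
				out[rfpath][group] = DEMO_DEFAULT;
		}
	}
}

/* Index style: every offset is derived from the indices, with no hidden state. */
static void
read_with_index(const uint8_t *hwinfo, uint8_t out[DEMO_RF_PATHS][DEMO_GROUPS])
{
	size_t rfpath, group;

	for (rfpath = 0; rfpath < DEMO_RF_PATHS; rfpath++) {
		for (group = 0; group < DEMO_GROUPS; group++) {
			size_t off = DEMO_BASE + rfpath * DEMO_GROUPS + group;
			uint8_t v = hwinfo[off];

			out[rfpath][group] = (v == 0xFF) ? DEMO_DEFAULT : v;
		}
	}
}

/* Returns 1 if both tables hold identical values, 0 otherwise. */
static int
tables_equal(uint8_t a[DEMO_RF_PATHS][DEMO_GROUPS],
    uint8_t b[DEMO_RF_PATHS][DEMO_GROUPS])
{
	assert(a != NULL && b != NULL);
	return (memcmp(a, b, DEMO_RF_PATHS * DEMO_GROUPS) == 0);
}
```

The two readers fill identical tables from the same blob. The difference is that the index version can be reordered, called twice, or have a loop removed without silently shifting every later value.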

Why would anyone do it this way? You are needlessly using up the CPU, making the code difficult to follow and repeatedly reading in static values, and any minor modification, re-ordering or re-structuring will essentially break the entire function.

And perhaps the worst offender is when, 20 functions deep, you are not even working with hwinfo anymore. You are working with a pointer to hwinfo that has been incremented God-knows-where, with its own offsets that are nearly impossible to track down.

In my efforts to port this driver to FreeBSD, I literally resorted to printing out the entire ROM, manually finding the memory, and backing into the equivalent offset. Other bizarre code: I have seen if-conditions that are impossible to reach, misplaced code that should go in the previous function, code that does bits of a task while another function does the entire task, unnecessarily repeated code, etc.

How does this make it into the Linux Kernel?

To be fair, this does not appear to be the fault of Larry Finger, who maintains this driver. This is the fault of Realtek, for vomiting this terrible driver in the first place, providing absolutely zero documentation and refusing to respond to any contact attempts.

I hope my FreeBSD port is cleaner and more performant!