Including optimized-out kernel symbols in dtrace on FreeBSD

Warning: This is a hack that involves modifying the build scripts. tl;dr: modify /usr/src/sys/conf/kern.pre.mk to change all references to -O2 into -O0.

Have you ever had dtrace(1) on FreeBSD fail to list a probe that should exist in the kernel? This is because Clang will optimize out some functions. As a result, ctfconvert(1) will not generate the debugging symbols that dtrace(1) uses to identify probes. I have a quick solution for making those probes visible to dtrace(1).

In my case, I was trying to instrument ieee80211_ioctl_get80211, whose sister function ieee80211_ioctl_set80211 has a dtrace(1) probe in the generic FreeBSD 11 and 12 kernels. Both functions are located in /usr/src/sys/net80211/ieee80211_ioctl.c.

My first attempt was to add the following to /etc/make.conf and recompile the kernel.

CFLAGS+=-O0 -fno-inline-functions

This failed to produce the dtrace(1) probe. Several other attempts failed, and I was getting inconsistent compilation results (is it me, or is ieee80211_ioctl.c compiled with different flags if NO_CLEAN=1 is set?). When I manually compiled the object file by copying its compilation line and adding -O0 -fno-inline-functions, nm(1) on both the object file and the kernel showed that the symbol was present. I installed the kernel, rebooted, and the function was listed as a dtrace(1) probe. Great!
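To illustrate why a symbol can vanish, here is a minimal sketch that is not taken from the FreeBSD tree. The names get_value_inlinable, get_value_kept and caller are hypothetical. At -O2 the compiler is free to inline a small static function into its caller, after which no standalone function remains for a probe to attach to. The noinline attribute, which both Clang and GCC support, asks the compiler to keep the function as a real one. Whether that alone is enough to restore a kernel probe is an assumption that has not been tested here.

```c
/* Hypothetical helper: small and static, so at -O2 the compiler will
 * usually inline it and emit no separate symbol for it. */
static int
get_value_inlinable(int x)
{
	return (x * 2 + 1);
}

/* Same logic, but the noinline attribute (Clang and GCC) asks the
 * compiler to keep it as a separately callable function. */
__attribute__((noinline)) static int
get_value_kept(int x)
{
	return (x * 2 + 1);
}

/* Non-static caller so that both helpers are actually used. */
int
caller(int x)
{
	return (get_value_inlinable(x) + get_value_kept(x));
}
```

Compiling this with cc -O2 -c and running nm(1) on the object file will typically list get_value_kept as a local text symbol and show no entry for get_value_inlinable.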

But as I continued to debug my WiFi driver (oh yeah, I’m very slowly extending rtwn(4)), I found myself rebuilding the kernel several times and frequently rebooting. Why not do this across the entire kernel?

After hacking around, my solution was to edit /usr/src/sys/conf/kern.pre.mk and change every optimization level 2 to optimization level 0. The following is my diff(1) on FreeBSD 12.0-CURRENT.

diff --git a/sys/conf/kern.pre.mk b/sys/conf/kern.pre.mk
index c1bbf0d30bf..9a99f1065aa 100644
--- a/sys/conf/kern.pre.mk
+++ b/sys/conf/kern.pre.mk
@@ -57,14 +57,14 @@ CTFFLAGS+=  -g
 .if ${MACHINE_CPUARCH} == "powerpc"
 _MINUS_O=      -O      # gcc miscompiles some code at -O2
 .else
-_MINUS_O=      -O2
+_MINUS_O=      -O0
 .endif
 .endif
 .if ${MACHINE_CPUARCH} == "amd64"
 .if ${COMPILER_TYPE} == "clang"
-COPTFLAGS?=-O2 -pipe
+COPTFLAGS?=-O0 -pipe
 .else
-COPTFLAGS?=-O2 -frename-registers -pipe
+COPTFLAGS?=-O0 -frename-registers -pipe
 .endif
 .else
 COPTFLAGS?=${_MINUS_O} -pipe

My dtrace -l | wc -l went from 71432 probes to 91420 probes.

A few thoughts:

  • This seems like a hack rather than a long-term solution. The problem is either the hard-coded optimization flags or the inability to override them everywhere from make.conf.
  • Removing optimizations is only something I would do in a non-production kernel, so it's as if I have to choose between optimizations for a production kernel and having dtrace probes. But dtrace explicitly markets itself as not impactful on production.
  • Using the dtrace pony as your featured image on WordPress does not render properly and must be rotated and modified. Blame Bryan Cantrill.

If you have a better solution, please let me know and I will update the article, but this works for me!

Linux maintains bugs: The real reason ifconfig on Linux is deprecated

In my third installment of FreeBSD vs Linux, I will discuss underlying reasons for why Linux moved away from ifconfig(8) to ip(8).

In the past, when people said, “Linux is a kernel, not an operating system”, I knew that was true but I always thought it was a rather pedantic criticism. Of course no one runs just the Linux kernel; you run a distribution of Linux. But after reviewing userland code, I understand the significant drawbacks to developing “just a kernel” in isolation from the rest of the system.

Let's say a userland program wants to request an object from the kernel. The kernel structure might be something like this:

struct foo {
     size_t size;
     char name[20];
     int val;
};

On POSIX systems, a typical way to communicate with the kernel is to open a file descriptor to the appropriate subsystem and send an ioctl(2) with a pointer to where the kernel should store the responding data. FreeBSD might perform this task as follows:

struct foo x;
ioctl(fd, CMD_REQUEST_FOO, &x);

Linux should do the same and, to be fair, it typically does. This manifests as software source that requires the Linux kernel's headers. But because userland tools are maintained independently of the kernel, and are sometimes even explicitly written to be cross-platform, they typically maintain their own copy of data structures and macros independent of the Linux source tree.

So far so good. This might even produce the exact same binary output. But what happens if the kernel structure or behavior changes? This could be due to a bug fix, an added feature or an optimization – either way, the structure may change.
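A minimal sketch of what goes wrong, assuming a hypothetical later kernel in which name grows from 20 to 32 bytes. The names foo_v1, foo_v2, kernel_fill_foo and stale_read_val are invented for illustration, and kernel_fill_foo only stands in for the kernel side of the ioctl(2). A tool compiled against the old layout reads val from the wrong offset.

```c
#include <stddef.h>
#include <string.h>

/* Layout the userland tool was compiled against (the struct above). */
struct foo_v1 {
	size_t size;
	char name[20];
	int val;
};

/* Hypothetical later kernel layout: name grew from 20 to 32 bytes. */
struct foo_v2 {
	size_t size;
	char name[32];
	int val;
};

/* Stand-in for the kernel side: fills the caller's buffer using the
 * new layout and refuses if the caller's buffer is too small. */
int
kernel_fill_foo(void *buf, size_t buflen)
{
	struct foo_v2 k;

	if (buflen < sizeof(k))
		return (-1);
	memset(&k, 0, sizeof(k));
	k.size = sizeof(k);
	strcpy(k.name, "wlan0");
	k.val = 6;
	memcpy(buf, &k, sizeof(k));
	return (0);
}

/* Stale userland: interprets the returned bytes through the old layout. */
int
stale_read_val(void)
{
	unsigned char raw[sizeof(struct foo_v2)];
	struct foo_v1 old;

	if (kernel_fill_foo(raw, sizeof(raw)) != 0)
		return (-1);
	memcpy(&old, raw, sizeof(old));
	/* These bytes come from inside name[], not from the real val. */
	return (old.val);
}
```

The size check in kernel_fill_foo hints at one possible use of a leading size member: it lets one side notice that the other was built against a different layout. That is a design assumption of this sketch, not something the struct above promises.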

On FreeBSD this is not a problem. The kernel and userland tools are updated in tandem. In fact, because both the kernel and the userland applications live in the same source tree, they can even share the same header files. For third-party userland applications, FreeBSD provides highly stable libraries that handle all kernel interactions, such as lib80211(3). It's worth noting that OpenBSD and NetBSD do not have these libraries because the kernel interface itself is highly stable anyway. FreeBSD even provides a COMPAT layer for the rare cases in which an older binary fails to run on modern versions of FreeBSD.

Conversely, on Linux the kernel and the rest of the operating system are not developed in tandem, so updating or fixing a kernel struct would almost certainly break a downstream application. The only way to prevent this would be to conduct regular, massively coordinated updates to system utilities whenever the kernel changes, and to properly version applications for specific kernel releases. Quite a herculean endeavor. This also explains why systemtap, one of Linux’s many answers to dtrace(1), does not work on Ubuntu.

Also, Linux can never have an equivalent of lib80211(3) because there is no single standard library set. Even for the standard C library, Linux has glibc, uClibc, dietlibc, Bionic and musl. Rather than guessing the underlying C library implementation or falling into “dependency hell”, applications default to the most low-level implementation of their requested functionality. Some tools, such as ifconfig(8), resort to just reading from the /proc filesystem.

Linux’s solution to this problem was to create a policy of never breaking userland applications. This means userland interfaces to the Linux kernel never change under any circumstances, even if they malfunction and have known bugs. That is worth reiterating. Linux maintains known bugs and actively refuses to fix them. In fact, if you attempt to fix them, Linus will curse at you, as shown in this email.

And this leads back to the topic. Have you ever wondered why nearly every distribution deprecated ifconfig(8), a standard networking tool dating back to classic Unix? When Linux first implemented multiple IPv4 addresses on the same physical interface, it did so by cloning the interface in software and assigning each clone a unique IPv4 address. For example, eth0 could be cloned as eth0:1, eth0:2, etc. From a programmatic perspective, eth0 still only had one IPv4 address. As time passed and developers updated the kernel, it allowed users to assign multiple IPv4 addresses directly to the same interface, bypassing the need for cloning.

But Linux’s API has not changed. It still only returns a single legacy IPv4 address per interface. An interface could have multiple IPv4 addresses, but ifconfig(8) will still only report a single address. In other words, as it currently stands, ifconfig(8) lies to you. I do not fully understand why they did not just update ifconfig(8); random IRC rumors say there was a failed attempt due to ifconfig(8)’s convoluted code base. But for whatever reason, this led to the completely new tool ip(8).
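For contrast, here is a minimal sketch of an interface that does report every address. getifaddrs(3), which exists on both FreeBSD and Linux, returns one list entry per address, so an interface with several IPv4 addresses simply appears several times. The helper name count_ipv4_addrs is hypothetical, and this is not a claim about how ifconfig(8) or ip(8) is implemented on either system.

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <ifaddrs.h>
#include <string.h>

/* Count the IPv4 addresses configured on one interface.
 * Returns -1 if the address list cannot be obtained. */
int
count_ipv4_addrs(const char *ifname)
{
	struct ifaddrs *list, *ifa;
	int n = 0;

	if (getifaddrs(&list) != 0)
		return (-1);
	for (ifa = list; ifa != NULL; ifa = ifa->ifa_next) {
		/* Some entries carry no address at all; skip them. */
		if (ifa->ifa_addr == NULL)
			continue;
		if (ifa->ifa_addr->sa_family == AF_INET &&
		    strcmp(ifa->ifa_name, ifname) == 0)
			n++;
	}
	freeifaddrs(list);
	return (n);
}
```

A caller would pass an interface name such as "eth0" or "em0" and could then compare the count against what a single-address query reports.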

By contrast, FreeBSD just updates its ifconfig(8) in tandem with any kernel updates, and there are no problems. Simple.

This also explains why Linux has multiple tools for seemingly highly correlated network tasks. Rather than working together to create a consolidated tool, Linux has iw(8), iwconfig(8), brctl(8), etc., whereas FreeBSD just has different drivers for its ifconfig(8) implementation. For the record, I think ip(8)’s syntax is cleaner than ifconfig(8)’s syntax, as the latter is a victim of IPv4 legacy syntax. If both tools worked just fine, it might be worth keeping ifconfig(8) for legacy scripts during a transitional period while making ip(8) the future. That would be perfectly fine, but it would be ideal if both tools just worked, rather than needing to abandon one because it is broken.

Written with love on a laptop running OpenBSD 6.3.

Thoughts?

fsync(2) on FreeBSD vs Linux

Even with our modern technology, hard-disk operations tend to be the slowest and most painful part of any modern system. As such, modern operating systems implement a buffering mechanism. In this scheme, when an application calls write(2), rather than immediately performing physical disk operations, the operating system stores the data in a kernel buffer. When the buffer exceeds a certain size or when an application calls the fsync(2) system call, the kernel begins writing to the disk.

This scheme is significantly faster, as demonstrated perhaps most clearly by the massive performance differential between the GNU and BSD versions of yes(1), as initially noted here. Note: FreeBSD’s yes(1) has now reached parity with GNU.

So far so good. But what happens when a disk write operation fails? This could be due to a hardware or network failure, but ultimately it is not the fault of the operating system. However, the operating system must properly handle the failure.

On Linux, when an application’s fsync(2) call fails, the kernel returns a disk error. However, it then marks the failed buffer pages as clean and merely records the error (the EIO flag). When the application issues another fsync(2) and the disk write succeeds, the kernel clears the error bit and reports a successful write to the application. As such, the previously failed data never hit the disk and, if the application has discarded its own copy, the data is lost.

On FreeBSD, when an application’s fsync(2) call fails, the kernel also reports the error to the application. But unlike Linux, it keeps the “dirty” bit set, so the kernel buffer is not overwritten until the page buffer is cleared, even if a subsequent fsync(2) succeeds. This way, the page data is not lost.

This is another example of the superiority of FreeBSD over Linux. FreeBSD can better survive a disk failure, while Linux’s implementation is fundamentally broken. In the past I have experienced Linux’s ext4 failing into read-only mode to prevent disk corruption. While that might be a fall-back mechanism, it is not a long-term solution. Instead, userland applications have to keep track of whether the kernel was successful or not. Depending on your perspective, this is a layering violation.
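A minimal sketch of what "keeping track" means for an application, assuming a hypothetical helper named write_and_sync. It treats any fsync(2) failure as final: the caller has to keep its own copy of the data and rewrite it, rather than calling fsync(2) again and trusting a later success.

```c
#include <errno.h>
#include <unistd.h>

/* Write the whole buffer, then fsync(2).
 * Returns 0 only if every byte was written and the sync succeeded. */
int
write_and_sync(int fd, const void *buf, size_t len)
{
	const char *p = buf;

	while (len > 0) {
		ssize_t n = write(fd, p, len);

		if (n < 0) {
			if (errno == EINTR)
				continue;	/* interrupted, try again */
			return (-1);
		}
		if (n == 0)
			return (-1);		/* no progress, give up */
		p += n;
		len -= (size_t)n;
	}
	/*
	 * On failure the caller must not assume a later fsync(2) will
	 * flush this data; it should rewrite from its own copy.
	 */
	if (fsync(fd) != 0)
		return (-1);
	return (0);
}
```

The important design choice is in the caller, not in this function: the buffer passed in must stay alive until write_and_sync returns 0.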

Additionally, any long-term solution to change the behavior of the operating system would mean all user-land applications would potentially break. Linus Torvalds has notoriously stated:

Breaking user programs simply isn't acceptable

In fact, he’s repeated this policy in more colorful language here. So you’re stuck with bad behavior.

Now consider if you want to build an operating system that will run for potentially a hundred years and produce zero errors or catch errors and properly perform exception handling. Go with FreeBSD.

It's worth noting that Illumos (Solaris) properly implements fsync(2), whereas OpenBSD and NetBSD also failed on this issue, and I fully expect them to fix the problem.

Tracing ifconfig commands from userspace to device driver

I am currently working on expanding FreeBSD’s rtwn(4) wireless device driver. I have the basics down, such as initialization, powering on and off, loading the firmware, etc., and am now trying to fill in specific ifconfig(8) methods. This requires an in-depth knowledge of how ifconfig(8) commands are ultimately delivered to the driver. I could not find concise documentation that outlines each stage of the process. So I wrote one! 🙂

This article is specific to FreeBSD 12.0-CURRENT, but it should apply to any future version and to other operating systems that utilize net80211(4), such as OpenBSD, NetBSD, DragonFlyBSD and illumos. I hope it serves to help the FreeBSD community continue to develop WiFi and other device drivers. This is not an exhaustive guide, as there is far too much code, but it should provide you with the basic order of operations.

In this example, I will walk through changing the channel on your WiFi card and placing it in monitor mode as follows:

# ifconfig wlan0 channel 6 wlanmode monitor

High Level Summary

FreeBSD’s ifconfig(8) utilizes the lib80211(3) userspace library, which functions as an API to populate kernel data structures and issue the ioctl(2) syscall. The kernel receives the ioctl(2) syscall in a new thread, interprets the structure and routes the command to the appropriate stack. In our case this is net80211(4). The kernel then creates a new queued task and terminates the thread. Later on, a different kernel thread receives the queued task and runs the associated net80211(4) handler, which immediately delivers execution to the device driver.

Let's begin!

Userspace: ifconfig(8) + lib80211(3) library

Starting: ifconfig(8) executable

Early in its execution, ifconfig(8) opens a SOCK_DGRAM socket in /usr/src/sbin/ifconfig/ifconfig.c as follows:

s = socket(AF_LOCAL, SOCK_DGRAM, 0)

This socket functions as the interface for userspace to kernel communication. Rather than tracing from the if-else maze in main() [1], I grepped for the string “channel” and found it in ieee80211_cmd[] defined at the end of /usr/src/sbin/ifconfig/ifieee80211.c. This table enumerates all ieee80211 ifconfig(8) commands. The “channel” command is defined as follows:

DEF_CMD_ARG("channel", set80211channel)

Note the second argument. I looked up DEF_CMD_ARG and found that it is a pre-processor macro that defines which function is run when the user sends ifconfig(8) a command. A quick grep search shows set80211channel is defined in /usr/src/sbin/ifconfig/ifieee80211.c. The parameters are fairly easy to identify: val is the new channel number (1 through 14) and s is the socket we opened earlier. This executes ifconfig(8)’s set80211 function, whose sole purpose is to cleanly transfer execution into the lib80211(3) library.

Userspace: lib80211(3) library

lib80211(3) is an 802.11 wireless network management library used to formally communicate with the kernel. It's worth noting that neither OpenBSD nor NetBSD has this library; they instead opt to communicate directly with the kernel.

As mentioned, ifconfig(8)’s set80211 function calls lib80211_set80211, located in /usr/src/lib/lib80211/lib80211_ioctl.c. The lib80211_set80211 function populates an ieee80211req data structure, used for user-to-kernel ieee80211 communication. In the example below, this is the ireq variable, which contains the WiFi interface name and intended channel. The library then calls ioctl(2), as follows:

ioctl(s, SIOCS80211, &ireq)

This runs the syscall to formally enter kernel-space execution. In essence, ifconfig(8) is nothing more than a fancy ioctl(2) controller. You could write your own interface configuration tool that directly calls the ioctl(2) syscall and get the same result. Now on to the kernel!
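As a rough sketch of the "populate a request structure" step, the following uses a simplified stand-in rather than the real ieee80211req. The struct sketch_80211req, the constant SKETCH_IOC_CHANNEL and the helper sketch_fill_req are all hypothetical; the real structure and constants live in the net80211 headers and carry more fields. It only shows the pattern of zeroing the request, copying in the interface name and setting the type and value before handing a pointer to ioctl(2).

```c
#include <stdint.h>
#include <string.h>

#define SKETCH_IFNAMSIZ		16	/* hypothetical name buffer size */
#define SKETCH_IOC_CHANNEL	1	/* hypothetical request type */

/* Simplified stand-in for the user-to-kernel request structure. */
struct sketch_80211req {
	char	 i_name[SKETCH_IFNAMSIZ];	/* interface name, e.g. wlan0 */
	uint16_t i_type;			/* which setting to change */
	int16_t	 i_val;				/* new value, e.g. channel 6 */
};

/* Fill a request; returns -1 if the interface name does not fit. */
int
sketch_fill_req(struct sketch_80211req *req, const char *ifname,
    uint16_t type, int16_t val)
{
	size_t len = strlen(ifname);

	if (len >= sizeof(req->i_name))
		return (-1);
	memset(req, 0, sizeof(*req));
	memcpy(req->i_name, ifname, len + 1);	/* includes the NUL */
	req->i_type = type;
	req->i_val = val;
	return (0);
}
```

On a real system the filled structure would then be passed by pointer to ioctl(2) on the socket opened earlier.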

The Kernel: Kernel Command Routing to net80211(4)

There are two brief explanations before we proceed.

First, at a high-level the BSD kernel operates like an IP router in that it routes execution through the kernel, populating relevant data values along the way, until the execution reaches its destination handling functions. The following explanation shows how the kernel will identify the syscall type, determine that it is for an interface card, determine the type of interface card and finally queue a task for future execution.

Second, the BSD kernel utilizes a common pattern of using template methods that call a series of function pointers. The exact function pointers are conditionally populated, allowing the code to maintain a consistent structure while the exact implementation may differ. It works very well but can make tracing execution paths difficult if you are just reading the code straight through. When I had trouble, I typically used illumos’s OpenGrok or dtrace(1).

Brief Dtrace Detour

Solaris’s dtrace(1) is a dynamic tracing tool imported to FreeBSD that is used to monitor a kernel or process in real time. It is useful in understanding what the operating system is doing and saves you the trouble of using printf(3)-style debugging. I used dtrace(1) in writing this guide to identify what the kernel was executing, the function arguments, and the stack trace at any given moment.

For example, if I wanted to monitor the ifioctl function, I might run this:

# dtrace -n '
> fbt:kernel:ifioctl:entry {
> self->cmd = args[1];
> stack(10);
> }
> fbt:kernel:ifioctl:return {
> printf("ifioctl(cmd=%x) = %x", self->cmd, arg1);
> exit(0);
> } '

This dtrace(1) one-liner sets up handlers for ifioctl’s entry and return probes. On entry, dtrace(1) records the value of the second argument, cmd, and displays up to 10 frames of the kernel stack. On return, it displays the saved argument and the return value. I used variations of this basic command template throughout my research, especially when I was confused in tracing the code or could not identify a function’s arguments.

Syscall Interception

The first non-assembly function is the amd64-specific syscall handler amd64_syscall that receives a new thread structure and identifies the type as a syscall. In our case it is for an ioctl(2) so amd64_syscall calls sys_ioctl located in /usr/src/sys/kern/sys_generic.c.

On FreeBSD sys_ioctl performs input validation and formats the data it receives. It then calls kern_ioctl which determines what type of file descriptor the ioctl(2) is working with, what the capabilities for the socket are and assigns the function pointer fo_ioctl accordingly. (NetBSD and OpenBSD do not have kern_ioctl. For them sys_ioctl directly calls fo_ioctl.) Our file descriptor corresponds to an interface, so FreeBSD assigns fo_ioctl as a function pointer to ifioctl, which handles interface-layer ioctl(2) calls. This function is located in /usr/src/sys/net/if.c.

Network IOCTL

The function ifioctl is responsible for all sorts of interfaces: Ethernet, WiFi, epair(4), etc. ifioctl starts with a switch-condition based on the cmd argument. This checks if the command can be handled at the generic interface layer without needing to jump into the driver, such as creating a clone interface or updating the MTU. A quick dtrace(1) probe reveals that the cmd argument is SIOCS80211, which fails to meet any switch-conditions, so execution jumps to the bottom. The function continues and calls ifp->if_ioctl, which in the case of WiFi is a function pointer to ieee80211_ioctl, located in /usr/src/sys/net80211/ieee80211_ioctl.c.
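The switch-then-function-pointer pattern can be sketched in a few lines of plain C. This is a simplified model, not kernel code: sketch_ifnet, generic_ioctl, wifi_ioctl, wifi_attach and the command numbers are hypothetical stand-ins for the interface structure, ifioctl, ieee80211_ioctl, the interface setup code and the real ioctl commands.

```c
#include <stddef.h>

/* Hypothetical command numbers, for illustration only. */
enum { CMD_SET_MTU = 1, CMD_SET_80211 = 2 };

struct sketch_ifnet {
	int mtu;
	int channel;
	/* Filled in when the interface is created; differs per type. */
	int (*if_ioctl)(struct sketch_ifnet *, int cmd, int val);
};

/* WiFi-specific handler, standing in for ieee80211_ioctl. */
static int
wifi_ioctl(struct sketch_ifnet *ifp, int cmd, int val)
{
	switch (cmd) {
	case CMD_SET_80211:
		ifp->channel = val;
		return (0);
	default:
		return (-1);
	}
}

/* Generic handler, standing in for ifioctl. */
int
generic_ioctl(struct sketch_ifnet *ifp, int cmd, int val)
{
	switch (cmd) {
	case CMD_SET_MTU:	/* handled without leaving this layer */
		ifp->mtu = val;
		return (0);
	default:
		break;		/* fall out of the switch to the pointer */
	}
	if (ifp->if_ioctl == NULL)
		return (-1);
	return (ifp->if_ioctl(ifp, cmd, val));
}

/* Setup step: choose the implementation behind the pointer. */
void
wifi_attach(struct sketch_ifnet *ifp)
{
	ifp->mtu = 1500;
	ifp->channel = 1;
	ifp->if_ioctl = wifi_ioctl;
}
```

An Ethernet interface would install a different function in if_ioctl at setup time, and generic_ioctl would not need to change.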

WiFi IOCTL

ieee80211_ioctl contains another switch-case. With cmd set to SIOCS80211, execution matches the associated case and calls ieee80211_ioctl_set80211, located in /usr/src/sys/net80211/ieee80211_ioctl.c.

ieee80211_ioctl_set80211 has yet another switch-case with a few dozen conditions [2]. The ireq->i_type was set to IEEE80211_IOC_CHANNEL by lib80211(3), so it will match the associated case and execute ieee80211_ioctl_setchannel. The gist of this function is to determine if the input channel is valid or if the kernel needs to set any other values. It concludes by calling setcurchan, which does two things. First, it determines the validity of the channel and whether any additional values must be set. Second, it runs ieee80211_runtask, which makes the final thread-level call to taskqueue_enqueue.

The Kernel: Task Execution

taskqueue_enqueue is not an ieee80211(9) function, but it's worth a brief review. In a nutshell, the taskqueue(9) framework allows you to defer code execution into the future. For example, if you want to delay execution for 3 seconds, running the kernel equivalent of sleep(3) would cause the entire CPU core to halt for 3 seconds. This is unacceptable. Instead, taskqueue(9) allows you to specify a function that the kernel will execute at a later time.
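The idea of deferring work can be sketched with a tiny single-threaded queue. This is not the taskqueue(9) API; sketch_queue, sketch_enqueue, sketch_run_all and set_channel_task are hypothetical names, and a real kernel queue adds locking and worker threads. It only shows that enqueueing records a function and its argument, and that the function runs later when something drains the queue.

```c
#include <stddef.h>

typedef void (*task_fn)(void *arg);

struct sketch_task {
	task_fn	 fn;
	void	*arg;
};

#define SKETCH_QLEN 8

/* Fixed-size ring buffer of pending tasks (no locking in this sketch). */
struct sketch_queue {
	struct sketch_task tasks[SKETCH_QLEN];
	size_t head;
	size_t count;
};

/* Record a function to run later; returns -1 if the queue is full. */
int
sketch_enqueue(struct sketch_queue *q, task_fn fn, void *arg)
{
	size_t tail;

	if (q->count == SKETCH_QLEN)
		return (-1);
	tail = (q->head + q->count) % SKETCH_QLEN;
	q->tasks[tail].fn = fn;
	q->tasks[tail].arg = arg;
	q->count++;
	return (0);
}

/* Drain the queue; in a kernel this would run in a worker thread.
 * Returns the number of tasks that were executed. */
size_t
sketch_run_all(struct sketch_queue *q)
{
	size_t ran = 0;

	while (q->count > 0) {
		struct sketch_task t = q->tasks[q->head];

		q->head = (q->head + 1) % SKETCH_QLEN;
		q->count--;
		t.fn(t.arg);
		ran++;
	}
	return (ran);
}

/* Example handler: pretends to switch the radio to channel 6. */
void
set_channel_task(void *arg)
{
	int *chan = arg;

	*chan = 6;
}
```

The caller of sketch_enqueue returns immediately, which mirrors how the ioctl(2) thread can finish before the channel has actually changed.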

In our channel change example, the scheduled function is the net80211(4) function update_channel, located in /usr/src/sys/net80211/ieee80211_proto.c. When taskqueue(9) reaches our enqueued task, it will first initiate the update_channel handler to receive the task and immediately hand over execution to the driver code pointed to by ic_set_channel.

To summarize, up to this point the kernel has routed the command to the network stack, which routed it to the WiFi-specific stack, where it was scheduled as a task for future execution. When taskqueue(9) reaches the task, it immediately jumps to the driver-specific code. At last, we have entered the driver!

The Driver

From here on, the code is driver-specific and I will not get into the implementation details, as each device has its own unique channel changing process. I am currently working on rtwn(4), which is located in /usr/src/sys/dev/rtwn. NetBSD and OpenBSD separate USB and PCI drivers, so the same driver is located in /usr/src/sys/dev/usb/if_urtwn.c and /usr/src/sys/dev/pci/if_rtwn.c, respectively.

Operating systems need a standard way to communicate with device drivers. Typically, the driver provides a structure containing a series of function pointers to driver-specific code, and the kernel uses this as an entry point into the driver code. In the case of WiFi, this structure is ieee80211com, located in /usr/src/sys/net80211/ieee80211_var.h. By convention, all BSD-derived systems use the variable name ic to handle ieee80211(9) methods.

In our case, we are changing the channel, so the operating system will call ic->ic_set_channel, which is a pointer to the driver’s channel changing function. For rtwn(4), this is rtwn_set_channel, which itself is a function pointer to r92c_set_chan, r92e_set_chan or r12a_set_chan, depending on which specific device you are using.

The specifics of rtwn(4) are outside of the scope of this article, but it is worth discussing how the driver communicates with the hardware.

The softc structure is a struct that maintains the device’s run-time variables, states, and method implementations. By convention, each driver’s softc instance is called sc. You might wonder why you need yet another set of function pointers when ieee80211com already provides one. This is because ieee80211com’s methods point to command handlers, not necessarily to device routines. A device driver may have its own internal methods that are not part of ieee80211com. Also, the softc structure can handle minor variations between device versions. rtwn(4)’s softc struct is called rtwn_softc and is located in /usr/src/sys/dev/rtwn/if_rtwnvar.h.

How does a driver send data to the hardware? rtwn(4) uses the rtwn_write_[1|2|4] and rtwn_read_[1|2|4] methods to actually send or receive a byte, word or double-word [3]. rtwn_read_1 is a pointer to the sc_read_1 method.

The driver assigns the sc_read class of functions at initialization to either the rtwn_usb_read_* and rtwn_usb_write_* methods or rtwn_pci_read_* and rtwn_pci_write_*. The aforementioned class of functions are abstractions to the PCI and USB buses. In the case of PCI, these function calls will eventually call bus_space_read_* and bus_space_write_*, which are part of the PCI subsystem. In the case of USB, the driver will call usbd_do_request_flags, which is part of the USB subsystem. A well-written driver should abstract these bus-specific layers and provide you with clean read and write methods for various data sizes. As an aside, FreeBSD is long overdue for an SDIO stack and this is a major impediment for the Raspberry Pi, Chromebooks and other embedded devices. But I digress…
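A minimal model of this bus abstraction, with everything simulated in memory. The names sketch_softc, sketch_attach, sketch_write_1 and sketch_read_1 are hypothetical and only mirror the shape described above; the "PCI" and "USB" back ends here just touch an array, with the USB flavor counting requests to show that the two paths really differ.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_NREGS 256

enum sketch_bus { SKETCH_BUS_PCI, SKETCH_BUS_USB };

struct sketch_softc {
	uint8_t	 regs[SKETCH_NREGS];	/* stands in for device registers */
	unsigned usb_requests;		/* counts simulated USB requests */
	int	 (*sc_write_1)(struct sketch_softc *, uint16_t, uint8_t);
	uint8_t	 (*sc_read_1)(struct sketch_softc *, uint16_t);
};

/* "PCI" flavor: registers are reachable directly. */
static int
sketch_pci_write_1(struct sketch_softc *sc, uint16_t addr, uint8_t val)
{
	sc->regs[addr % SKETCH_NREGS] = val;
	return (0);
}

static uint8_t
sketch_pci_read_1(struct sketch_softc *sc, uint16_t addr)
{
	return (sc->regs[addr % SKETCH_NREGS]);
}

/* "USB" flavor: every access is modeled as one request over the bus. */
static int
sketch_usb_write_1(struct sketch_softc *sc, uint16_t addr, uint8_t val)
{
	sc->usb_requests++;
	sc->regs[addr % SKETCH_NREGS] = val;
	return (0);
}

static uint8_t
sketch_usb_read_1(struct sketch_softc *sc, uint16_t addr)
{
	sc->usb_requests++;
	return (sc->regs[addr % SKETCH_NREGS]);
}

/* Bus-independent accessors used by the rest of the driver. */
#define sketch_write_1(sc, addr, val)	((sc)->sc_write_1((sc), (addr), (val)))
#define sketch_read_1(sc, addr)		((sc)->sc_read_1((sc), (addr)))

/* Initialization: pick the back end once, based on the bus type. */
void
sketch_attach(struct sketch_softc *sc, enum sketch_bus bus)
{
	size_t i;

	for (i = 0; i < SKETCH_NREGS; i++)
		sc->regs[i] = 0;
	sc->usb_requests = 0;
	if (bus == SKETCH_BUS_USB) {
		sc->sc_write_1 = sketch_usb_write_1;
		sc->sc_read_1 = sketch_usb_read_1;
	} else {
		sc->sc_write_1 = sketch_pci_write_1;
		sc->sc_read_1 = sketch_pci_read_1;
	}
}
```

Code above this layer only ever calls sketch_write_1 and sketch_read_1, so it does not need to know which bus the device sits on.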

As an example, the driver uses the following line to enable hardware interrupts.

rtwn_write_4(sc, R92C_HIMR, R92C_INT_ENABLE);

This will write the value R92C_INT_ENABLE to the R92C_HIMR device register.

The End

To summarize this long journey, ifconfig(8) opens a socket and passes it to the lib80211(3) library. lib80211(3) sends a userspace-to-kernel command structure to the kernel with an ioctl(2) syscall. The syscall triggers the kernel to run a new kernel thread. From here, the kernel determines that the ioctl(2) command corresponds to a network card, specifies the type as a WiFi card, then identifies the exact command type. The ieee80211(9) code tells taskqueue(9) to create a new task to change the WiFi channel, then terminates. Later on, taskqueue(9) runs the ieee80211(9) task handler, which transfers execution to the driver. The driver communicates with the hardware over the PCI or USB bus to change the WiFi channel.

In conclusion, in my opinion, FreeBSD is technically superior to Linux, but lacks in several critical areas, among which is hardware support. I hope this article serves the FreeBSD community to continue to produce high-quality, faster device drivers.

Thank you


Notes

  1. Linux has a point when they argue that the classic ifconfig(8) is antiquated. Its syntax is inconsistent and this is reflected in the spaghetti-code of if-then conditions.
  2. Note: on my FreeBSD 11.1-RELEASE kernel this function was optimized out, so dtrace(1) probes failed. You should be able to add CFLAGS= -O0 -fno-inline to your /etc/make.conf, but that did not seem to disable the optimization for me. Your mileage may vary.
  3. Let's use rtwn_read_1 for now, but the concepts apply to the others.

[This article was also published in the January/February 2018 edition of the FreeBSD Journal]

Migrating from FreeNAS to FreeBSD

I love FreeNAS. It's awesome, well built and well supported. But as my needs increased, I wanted to use my FreeNAS box for more than the basics. In particular, I was moving towards a single host to run as a:

  1. Family NAS server
  2. Development server
  3. IRC client
  4. VM server
  5. Web server
  6. Email Server
  7. Git Server
  8. Home Firewall
  9. Home IPv6 gateway
  10. IPv6 VPN and Jump box

FreeNAS could easily do all of this. But I found myself using the device for everything but a NAS server. Also, as my experience with FreeBSD reached proficiency, I wanted to jump in the deep end and manually configure a production system from scratch. So I thanked FreeNAS for their contribution, yanked out the USB disks and installed FreeBSD 11.1 on a separate USB disk.

During installation, I was careful not to touch the /dev/ada devices, as that would destroy my precious files. Instead, I installed to the second USB disk, /dev/da1, while the installation medium was /dev/da0. This was obviously a problem, because at reboot the USB disk would become /dev/da0 and the kernel would panic upon not finding a /dev/da1. So I dropped to the terminal and mounted the zroot/ROOT/default volume, which is the / directory, to /tmp/root as follows.

zfs set mountpoint=/tmp/root zroot/ROOT/default
zfs mount zroot/ROOT/default

Then I edited /tmp/root/etc/fstab and changed /dev/da1p2 to /dev/da0p2, unmounted, reset the machine and FreeBSD booted without a glitch.

As mentioned, I plan on using this system fairly heavily going forward so the 8 GB USB disk would definitely not be sufficient. FreeBSD has an amazing feature where it isolates the base system from any user-installed applications or configurations. Rather than using symlink magic, my strategy was to store all application data on my two 4TB NAS disks.

First things first, I imported the pool as follows:

zpool import -f tank

The -f flag was necessary because for whatever reason ZFS thought tank was currently in use. A quick zfs list revealed that FreeNAS had been mounting my disks to /tank. Unfortunately, the /tank directory is not used by default by FreeBSD. Therefore, I moved each ZFS volume under a new /usr/local. First, I created a ZFS volume for tank/usr/local as follows.

zfs create tank/usr/local

Then I renamed the old paths to map to my new intended directory structure, as follows:

zfs rename tank/old/path tank/usr/local/new/path
zfs set mountpoint=/usr/local/new/path tank/usr/local/new/path

This took a bit of time, but after completing these for all partitions, I ran:

zfs mount -a

With that, all ZFS shares were mounted as /usr/local subdirectories. All of my data was successfully migrated over without a single bit of data loss!

From here, I needed to re-create the jails. FreeNAS’s excellent web-based jail GUI allows you to create jails with their own independent network stack. This feature is called VIMAGE and is useful to isolate network services from the host FreeBSD system. VIMAGE is pre-compiled into the FreeNAS kernel. It is on by default on FreeBSD 12.0, but not on 11.x, where it must be compiled in. To do this, you need to download and uncompress the src distribution, edit /usr/src/sys/amd64/conf/GENERIC and add in the following line:

options VIMAGE

Next, compile the kernel and install it as follows.

make -j 5 buildkernel
make installkernel

The -j 5 is because this machine is an i3 with 4 cores – feel free to adjust this depending on the number of cores you have.

With a successful reboot, I was now ready to migrate the jails over. I did so by moving the zfs jails volume to /usr/local/jail, such that my IRC client jail was /usr/local/jail/irc. Now the complicated part: Configuring the jails!

Since a jail using VIMAGE has a completely separate network stack, by default a jail is unable to communicate outside of itself. To allow communication, you have to create an epair(4) pair and pass one side to the jail, as follows:

ifconfig epair create
ifconfig epair0a vnet JAILNAME

In this configuration epair0a would belong to the jail while epair0b would belong to the base FreeBSD host, such that they could communicate. But how to set up connectivity? I had a lot of options to have the jails connect outside, including:

  • Being on the same subnet (192.168.1.0/24)
  • Being on a separate VLAN from the rest of the network (might be the long-term plan)
  • Have a single VLAN, have legacy IPv4 addresses identifiably different for ease, but have a single IPv6 network. I opted for this for now. It's simple and works.

This means creating an if_bridge(4) and attaching the network interface card, in my case an em(4) card, and epairXb. Any frame sent to the bridge is relayed to the relevant epair(4). (Note, this is not a route.) I set my jail IP range as 192.168.100.0/24, just for organizational purposes. I also set the ISP's IP subnet to be 192.168.0.0/16, otherwise it would drop packets from 192.168.100.0/24. I am using TunnelBroker for my IPv6 traffic, as Verizon Fios does not offer IPv6. (As an aside, this may be a good thing, since ISPs typically block ports, whereas TunnelBroker is completely unfiltered.) With that, Boom, network connectivity!

But…I wanted something repeatable per reboot, in the event of a power failure or loss. This meant I needed to go a little further. And here’s the complicated part. It took me about 4 hours to properly configure /etc/jail.conf:

/* Template */
host.hostname = "${name}.my.domain.prefix";

$ip4_route      = "192.168.100.1";
$ip6_route      = "IPV6PREFIX::1";

vnet;
vnet.interface = "epair${if}b";

persist;
allow.mount;
mount.devfs;
allow.sysvipc;

exec.prestart =  "ifconfig epair${if} create up";
exec.prestart += "ifconfig epair${if}a up";
exec.prestart += "ifconfig bridge0 addm epair${if}a up";

#exec.start += "/sbin/ifconfig epair${if}b up";
exec.start += "/sbin/ifconfig epair${if}b inet  ${ip4_addr}/24 up";
exec.start += "/sbin/ifconfig epair${if}b inet6 ${ip6_addr} prefixlen 64 up";

exec.start += "/sbin/route -4 add default ${ip4_route}";
exec.start += "/sbin/route -6 add default ${ip6_route}";

exec.start += "/sbin/ifconfig epair${if}b down";
exec.start += "/sbin/ifconfig epair${if}b up";

exec.start += "/bin/sh /etc/rc";

exec.stop = "/bin/sh /etc/rc.shutdown";
exec.poststop = "ifconfig bridge0 deletem epair${if}a";
exec.poststop += "ifconfig epair${if}a destroy";

irc {
        path = /usr/local/jail/irc;
	$if = "0";
	$ip4_addr 	= "192.168.100.2";
	$ip6_addr 	= "IPV6PREFIX::2";
}

www {
        path = /usr/local/jail/www;
	$if = "1";
	$ip4_addr 	= "192.168.100.3";
	$ip6_addr 	= "IPV6PREFIX::3";
}

In short, upon initialization, this creates a new epair(4) as specified by $if, attaches it to the jail, assigns the relevant IPv4/IPv6 information, and starts the init scripts. Shutdown is a mere detachment from the bridge and destruction of the epair(4). I also needed to assign the legacy IPv4 address to my em(4) interface.

Finally, I added the following sysctl(8) settings to /etc/sysctl.conf:

net.inet.ip.forwarding=1
net.inet6.ip6.forwarding=1

I did a lot of testing (rebooting, restarting the jails, etc.), and every time it worked. From the jails’ perspective, they didn’t even “know” they were migrated from one system to another. I wish I had tested if a FreeNAS plugin survived the migration, but I never used FreeNAS plugins anyways (what is this Plex I keep hearing about?).

Going forward, I plan to:

  • Place the jails on a properly separate VLAN to segment the network
  • Consider using pfSense running in bhyve(8) to function as the jails’ firewall of choice
  • Look into vale(4) to replace if_bridge(4). But I can’t find any documentation on it!
  • Figure out why TunnelBroker is failing on FreeBSD, but works just fine on my Linux Raspberry Pi – likely the fault of the ISP router.

My only regret: not installing HardenedBSD with LibreSSL.

Thoughts?

FreeBSD kernel Makefile variables SRCTOP and SYSDIR

I am currently writing a FreeBSD device driver and find myself lugging around the entire src. As you can imagine, this is quite large, especially if you are using any sort of version tracking system. So following the example here, I extracted out:

/usr/src/sys/modules/rtwn/
/usr/src/sys/dev/rtwn/

into

/home/user/src/rtwn/sys/modules/rtwn/
/home/user/src/rtwn/sys/dev/rtwn/

However, when I ran make(1) in /home/user/src/rtwn/sys/modules/rtwn, I received an error saying:

make: don't know how to make r92c_attach.c. Stop

This error message is extremely non-descriptive of the actual issue. After reviewing the aforementioned functioning Makefiles, I identified that the SRCTOP and SYSDIR were not set correctly.

SRCTOP is the equivalent of /usr/src. If your src directory differs from /usr/src, such as $HOME/src/freebsd12src, you would set SRCTOP to $HOME/src/freebsd12src/.

SYSDIR is similar. Ordinarily it would be /usr/src/sys, but now it might be $HOME/src/freebsd12src/sys/.

This can be resolved two ways:

  1. Command-line override. This is what I am doing:
    make VARIABLE="something"
    For me, that would be:
    make SRCTOP=$HOME/src/freebsd12src/ SYSDIR=$HOME/src/freebsd12src/sys/ -C sys/modules/rtwn load
  2. Permanent method: Edit the Makefile in question, in my case sys/modules/rtwn/Makefile.
    SRCTOP=/home/user/src/freebsd12src/
    SYSDIR=/home/user/src/freebsd12src/sys

And of course, you have to have at least one correct src directory in order to compile a kernel object. This is pretty simple, but it confused me for a while. Hope this helps! Keep writing that BSD code!

Linux kernel code vs FreeBSD kernel code

Linux driver code contains some serious garbage. I heard this refrain, but I did not realize how bad it was until I looked at it myself. Here is just one example.

Device drivers typically read static memory, commonly known as EEPROM or ROM, from the chip to identify the version, hard-coded information, device capabilities, etc. These values are used throughout execution of the driver. Reading the ROM is among the first things that happen when the device is attached and powered on.

In the case of FreeBSD, after the kernel reads the ROM, it uses a struct whose fields are laid out to match the ROM, and points a pointer of that type at the ROM blob stored in memory. For example:

struct r88e_rom {
	uint8_t		reserved1[16];
	uint8_t		cck_tx_pwr[R88E_GROUP_2G];
	uint8_t		ht40_tx_pwr[R88E_GROUP_2G - 1];
	uint8_t		tx_pwr_diff;
	uint8_t		reserved2[156];
	uint8_t		channel_plan;
	uint8_t		crystalcap;
#define R88E_ROM_CRYSTALCAP_DEF		0x20

	uint8_t		thermal_meter;
	uint8_t		reserved3[6];
	uint8_t		rf_board_opt;
	uint8_t		rf_feature_opt;
	uint8_t		rf_bt_opt;
	uint8_t		version;
	uint8_t		customer_id;
	uint8_t		reserved4[3];
	uint8_t		rf_ant_opt;
	uint8_t		reserved5[6];
	uint16_t	vid;
	uint16_t	pid;
	uint8_t		usb_opt;
	uint8_t		reserved6[2];
	uint8_t		macaddr[IEEE80211_ADDR_LEN];
	uint8_t		reserved7[2];
	uint8_t		string[33];	/* "realtek 802.11n NIC" */
	uint8_t		reserved8[256];
} __packed;

_Static_assert(sizeof(struct r88e_rom) == R88E_EFUSE_MAP_LEN,
    "R88E_EFUSE_MAP_LEN must be equal to sizeof(struct r88e_rom)!");

Notice the assertion at the bottom, which ensures that the ROM struct’s size equals a pre-defined length. The code will fail to compile if this assertion is not valid. Later, the kernel will instantiate a struct pointer and point it to the ROM, stored in the variable buf, as follows:

struct r88e_rom *rom = (struct r88e_rom *)buf;

Now, rom->channel_plan is set to the correct value. Simple.
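To show the overlay technique in isolation, here is a minimal userland sketch. Everything in it is hypothetical: the 16-byte demo_rom layout, DEMO_ROM_LEN and demo_rom_view() are made up for illustration and do not describe any real chip or the rtwn(4) code. It uses the GCC/Clang packed attribute in place of the kernel's __packed macro, and sticks to uint8_t fields so the cast raises no alignment concerns.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_ROM_LEN 16	/* hypothetical map length, not a real chip's */

/*
 * Hypothetical ROM layout. Each field sits at its ROM offset purely
 * because of where it is declared, so no offset constants are needed.
 */
struct demo_rom {
	uint8_t reserved1[4];	/* offsets 0..3 */
	uint8_t channel_plan;	/* offset 4 */
	uint8_t crystalcap;	/* offset 5 */
	uint8_t version;	/* offset 6 */
	uint8_t macaddr[6];	/* offsets 7..12 */
	uint8_t reserved2[3];	/* offsets 13..15 */
} __attribute__((packed));

/* Compilation fails if the struct drifts away from the ROM size. */
_Static_assert(sizeof(struct demo_rom) == DEMO_ROM_LEN,
    "DEMO_ROM_LEN must be equal to sizeof(struct demo_rom)!");
_Static_assert(offsetof(struct demo_rom, macaddr) == 7,
    "macaddr must start at ROM offset 7");

/* View a raw ROM buffer through the struct; nothing is copied. */
const struct demo_rom *
demo_rom_view(const uint8_t *buf)
{
	assert(buf != NULL);
	return (const struct demo_rom *)buf;
}
```

Given a DEMO_ROM_LEN-byte buffer buf, demo_rom_view(buf)->channel_plan reads byte 4 of the blob, and the two static assertions catch a mis-sized or mis-ordered struct at compile time rather than at run time.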

Unfortunately, this is not how the same code is written on Linux. As mentioned, the Linux driver also begins by reading the ROM blob, storing it in a buffer called hwinfo. But rather than creating an equivalent struct pointer, the Linux code uses offsets into the ROM on an as-needed basis. For example, the driver reads the version as follows:

rtlefuse->eeprom_version = *(u16 *)&hwinfo[params[7]];

In this example, params[7] comes from a list of ROM offset values set in the calling function. (That alone made tracing difficult.) The rtlefuse->eeprom_version is now the same as FreeBSD’s rom->version. This manual process repeats for every variable in the ROM.
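For contrast, here is a rough sketch of the offset-table style on its own. The table demo_params, its indices and the helper names are all hypothetical stand-ins for the driver's params[] array, and the offsets 6 and 4 are invented. The helper uses memcpy() instead of the pointer cast so that an unaligned offset is harmless. Like the cast, it yields the value in host byte order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_HWINFO_LEN 16	/* hypothetical blob size */

/* Hypothetical indices into the offset table, standing in for params[N]. */
enum {
	DEMO_PARAM_VERSION = 0,
	DEMO_PARAM_CHANNEL_PLAN = 1,
	DEMO_PARAM_COUNT
};

/* Hypothetical offset table; nothing ties these numbers to a layout. */
static const int demo_params[DEMO_PARAM_COUNT] = { 6, 4 };

/* Read a 16-bit value (host byte order) at an arbitrary byte offset. */
uint16_t
demo_read_u16(const uint8_t *hwinfo, size_t len, int offset)
{
	uint16_t value;

	assert(offset >= 0 && (size_t)offset + sizeof(value) <= len);
	memcpy(&value, &hwinfo[offset], sizeof(value));
	return value;
}

/* The reader must know which table slot means "version". */
uint16_t
demo_read_version(const uint8_t *hwinfo, size_t len)
{
	return demo_read_u16(hwinfo, len, demo_params[DEMO_PARAM_VERSION]);
}
```

Unlike the struct overlay, the compiler cannot check here that offset 6 really holds the version, or that two table entries do not overlap.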

While that may be just annoying and require a negligible bit more CPU power, it would not be a problem if it were all done in one place. But instead, the driver reads from the hwinfo blob on a seemingly as-needed basis during execution. And because these as-needed reads happen during normal execution, the driver reads in the same static value from hwinfo every time a simple WiFi function occurs, such as changing the channel.

Okay, but even that might not be too difficult…right? Here’s the real kicker.

Sometimes, the driver works by using incrementing offsets into the ROM blob. For example, look at read_power_value_fromprom (in drivers/net/wireless/realtek/rtlwifi/hw.c). It declares eeaddr as a u32 (uint32_t), then assigns it the offset value EEPROM_TX_PWR_INX. So far so good. But then, rather than using a new offset for every successive value, it increments eeaddr in multiple doubly-nested for-loops. Here is a simplified version of the code:

for (rfpath = 0; rfpath < MAX_RF_PATH; rfpath++) {
	/* 2.4G default value */
	for (group = 0; group < MAX_CHNL_GROUP_24G; group++) {
		pwrinfo24g->index_cck_base[rfpath][group] =
		    hwinfo[eeaddr++];
		if (pwrinfo24g->index_cck_base[rfpath][group] == 0xFF)
			pwrinfo24g->index_cck_base[rfpath][group] =
			    0x2D;
	}
}

Notice the line hwinfo[eeaddr++]! Merely reading in that variable changes the offset. It’s the Heisenberg Uncertainty Principle equivalent of code. This is a cleaned-up version of the 188-line function. The actual function has 6 nested for-loops, some with if-statements, each incrementing eeaddr as they go along.
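To illustrate why the running cursor is fragile, here is a small sketch that reads the same table two ways. The constants (2 paths, 3 groups, base offset 0x10) and the function names are hypothetical, and the table is assumed to be contiguous, which the real ROM layout may not be. Only the 0xFF to 0x2D default comes from the snippet above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_RF_PATHS	2	/* hypothetical, stands in for MAX_RF_PATH */
#define DEMO_GROUPS	3	/* hypothetical, stands in for MAX_CHNL_GROUP_24G */
#define DEMO_TX_PWR_OFF	0x10	/* hypothetical, stands in for EEPROM_TX_PWR_INX */
#define DEMO_DEFAULT	0x2D	/* value substituted when the ROM byte is 0xFF */

/*
 * Cursor style: every read advances eeaddr, so the meaning of each read
 * depends on every read that ran before it.
 */
void
demo_read_cursor(const uint8_t *hwinfo, uint8_t out[DEMO_RF_PATHS][DEMO_GROUPS])
{
	uint32_t eeaddr = DEMO_TX_PWR_OFF;

	for (int rfpath = 0; rfpath < DEMO_RF_PATHS; rfpath++) {
		for (int group = 0; group < DEMO_GROUPS; group++) {
			uint8_t v = hwinfo[eeaddr++];

			out[rfpath][group] = (v == 0xFF) ? DEMO_DEFAULT : v;
		}
	}
}

/*
 * Indexed style: each offset is computed from (rfpath, group) alone, so
 * the loops can be reordered or a single entry read without side effects.
 */
void
demo_read_indexed(const uint8_t *hwinfo, uint8_t out[DEMO_RF_PATHS][DEMO_GROUPS])
{
	for (int rfpath = 0; rfpath < DEMO_RF_PATHS; rfpath++) {
		for (int group = 0; group < DEMO_GROUPS; group++) {
			size_t off = DEMO_TX_PWR_OFF +
			    (size_t)rfpath * DEMO_GROUPS + (size_t)group;
			uint8_t v = hwinfo[off];

			out[rfpath][group] = (v == 0xFF) ? DEMO_DEFAULT : v;
		}
	}
}
```

Both functions fill the same table from the same blob. The difference shows up when the code changes: swapping the two loops in demo_read_indexed() changes nothing, while swapping them in demo_read_cursor() silently assigns every value to the wrong slot.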

Why would anyone do it this way? You are needlessly using up the CPU, making the code difficult to follow, repeatedly reading in static values, and ensuring that any minor modification, re-ordering or re-structuring will essentially break the entire function.

And perhaps the worst offender is when, 20 functions deep, you are not even working with hwinfo anymore. You are working with a pointer to hwinfo that has been incremented God-knows-where, with its own offsets that are near impossible to track down.

In my efforts to port this driver to FreeBSD, I literally resorted to printing out the entire ROM, manually finding the value in memory, and backing into the equivalent offset. Other bizarre code: I have seen if-conditions that are impossible to reach, misplaced code that belongs in the previous function, code that does bits of a task while another function does the entire task, unnecessarily repeated code, etc.

How does this make it into the Linux Kernel?

To be fair, this does not appear to be the fault of Larry Finger, who maintains this driver. This is the fault of Realtek, for vomiting this terrible driver in the first place, providing absolutely zero documentation and refusing to respond to any contact attempts.

I hope my FreeBSD port is cleaner and more performant!

FreeBSD and Linux Remote Dual Booting

The following is a quick and dirty guide on how to set up remote dual booting for FreeBSD (12.0-CURRENT) and Linux (Ubuntu 16.04). Granted, this method is a bit of a hack, but it works and suits my needs.

Why remote dual-booting? I am currently developing a FreeBSD kernel module for a PCIe card. The device is supported on Linux and I am using the Linux implementation as documentation. As such, I find myself frequently rebooting into Linux to look at printk() output, or booting into FreeBSD to test kernel code. This device is located at my house, and I typically work on it during my downtime at work.

Why not use Grub? I would have preferred Grub! But for whatever reason, Grub failed to install on FreeBSD. I do not know why, but even a very minimalistic attempt gave a non-descriptive error message.

efibootmgr? Any change I made with efibootmgr failed to survive a reboot. This is apparently a known problem. Also, this tool only exists on Linux, as FreeBSD does not seem to have an efibootmgr equivalent.

Ugh, so what do I do???

The solution I came up with was to manually swap EFI files on the EFI partition on an as-needed basis.

First, I went into the BIOS and disabled legacy BIOS booting, enabled EFI booting, and disabled secure booting.

Then, I installed Ubuntu. I had to manually create the partition tables, since by default the installer would consume the entire disk. However, manual partitioning does not automatically create the EFI partition, so you must create one yourself. I set mine to 200 MB as the first partition. After installation, I booted up and mounted /dev/sda1. I found that Ubuntu had created /EFI/ubuntu/grubx64.efi and other related files. Great!

Next, I installed FreeBSD. While I was manually setting up the partition tables, FreeBSD auto-created an EFI partition. Since one already existed, I safely deleted the new one and proceeded with the rest of the install. Right before rebooting, I mounted /dev/ada0p1 (sda1 on Linux) as /boot.local/ and /dev/da0p1 as /boot.installer/. I then copied /boot.installer/EFI/BOOT/BOOTX64.EFI to /boot.local/EFI/BOOT/BOOTX64.EFI (I think I had to re-create EFI/BOOT, I’m forgetting off-hand). Then I rebooted.

When I rebooted the machine, Ubuntu still came up. This is because Ubuntu edits the EFI boot order and places ubuntu as the first entry. Ordinarily you should be able to use efibootmgr here to boot into FreeBSD and use the non-existent FreeBSD equivalent to boot back. Lacking that option, I mounted the EFI partition (/dev/sda1) as /boot/efi on Linux, and when I wanted to boot into FreeBSD, I renamed /boot/efi/EFI/ubuntu/grubx64.efi to ubuntu.efi and then copied /boot/efi/EFI/BOOT/BOOTX64.EFI to /boot/efi/EFI/ubuntu/grubx64.efi. When I rebooted, FreeBSD came back up! Then on the FreeBSD side, to boot back into Ubuntu, I mounted the same partition (/dev/ada0p1) to /boot/efi and copied /boot/efi/EFI/ubuntu/ubuntu.efi to /boot/efi/EFI/ubuntu/grubx64.efi.

And that’s it! I can now remotely boot back and forth between the two systems.

Ugly? Yes. But it does the job.

Linux could fix this problem by debugging their efibootmgr utility and FreeBSD could fix this by having an efibootmgr equivalent at all.

Thoughts?