Writing a Kernel


Writing an operating system is an interesting task. It touches many aspects of software engineering that rarely come up elsewhere: process scheduling, highly discontinuous program flow caused by the interrupt-driven nature of most operating systems, and close-to-the-hardware programming in device drivers.

I'm writing an OS again. As with every project, keeping the scope small helps when starting out. To this end I'm only targeting AArch64; more specifically, I am working on a Raspberry Pi Compute Module 3. I'm building a macrokernel because a microkernel has no tangible benefit for the purpose or environment I'm building the system for. Designing parts more thoroughly instead of trying to solve all the problems microkernels bring is a more sensible use of time. Going with the RPi and AArch64 gives me the opportunity to learn more about the latter without having to work with something as utterly broken and senseless as the IBM x86 PC platform. And while the BCM2837 has very bad manufacturer documentation, it is rather well documented by the community and has code written for it that can be used as a reference.


Most operating systems, mine included, are interrupt-driven. Interrupts are (mostly) external impulses that, well, interrupt the current program flow and make the CPU jump to specific code in the kernel, which it finds through a table of handlers (called the interrupt vector table, or IVT for short, on most architectures still around). These interrupts can come from peripherals like the UART, which works independently of the CPU but notifies it when a byte has been received. They can also be triggered by devices that are part of the CPU, for example clock-cycle-counting timers, which are useful for implementing preemptive scheduling. Finally, they can be triggered by software running on the CPU itself; this is, for example, how syscalls to the kernel are usually implemented.
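
To make the dispatch idea concrete, here is a minimal host-side sketch of such a handler table: one function pointer per interrupt source, indexed by a number. The `Source` enum, the `Counters` struct and the handlers are hypothetical names for illustration only; a real AArch64 vector table is a block of code whose base address lives in VBAR_EL1, not a Rust array.

```rust
/// Hypothetical interrupt sources, matching the three kinds named above.
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Source {
    Uart = 0,    // peripheral: a byte arrived
    Timer = 1,   // CPU-internal: the timer expired
    Syscall = 2, // triggered by software
}

/// Hypothetical kernel state that the handlers update.
#[derive(Default, Debug, PartialEq)]
pub struct Counters {
    pub bytes_received: u32,
    pub ticks: u32,
    pub syscalls: u32,
}

type Handler = fn(&mut Counters);

fn on_uart(c: &mut Counters) {
    c.bytes_received += 1;
}

fn on_timer(c: &mut Counters) {
    c.ticks += 1;
}

fn on_syscall(c: &mut Counters) {
    c.syscalls += 1;
}

/// The "vector table": one handler per source, indexed by its number.
const HANDLERS: [Handler; 3] = [on_uart, on_timer, on_syscall];

/// Look up the handler for `source` and run it.
pub fn dispatch(source: Source, state: &mut Counters) {
    HANDLERS[source as usize](state);
}
```

Calling `dispatch(Source::Timer, &mut counters)` bumps only the tick counter, which is the same "jump to the code registered for this event" step the hardware performs.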


CPUs connect devices in one of three ways: by defining additional instructions or extensions to the instruction set architecture (ISA) to communicate with a data port, called port-mapped IO; by mapping the device or peripheral into specific memory areas, called memory-mapped IO; and by offloading parts of the job of preparing data for the devices to external processors attached as coprocessors, or using direct memory access (DMA) to pull/push data from/into the CPU's context. This last mode is usually called channel-based IO, a name from the days of IBM mainframes, which used that mode as the main mechanism of peripheral access. Nowadays it has lost some of its ubiquity, but graphics cards, for example, still work very much on this principle.

On the BCM2837, memory-mapped IO is the most prevalent. ARM has a few extensions in its Cortex-A profile that use named registers for functionality such as the cycle-counting timer, but all functionality specific to the BCM is mapped to blocks of memory.
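
In Rust, memory-mapped registers are usually accessed with volatile reads and writes so the compiler neither reorders nor removes them. The following is a minimal sketch, not the kernel's actual driver code: `Register` is a hypothetical wrapper, and it uses `std::ptr` so it can run on a host (a no_std kernel would use the identical functions from `core::ptr`).

```rust
use std::ptr::{read_volatile, write_volatile}; // core::ptr in a no_std kernel

/// Hypothetical wrapper around one 32-bit memory-mapped register.
pub struct Register {
    addr: *mut u32,
}

impl Register {
    /// Safety: `addr` must be valid for volatile 32-bit reads and writes
    /// for as long as the returned `Register` is used.
    pub unsafe fn new(addr: *mut u32) -> Self {
        Register { addr }
    }

    /// Read the current register value.
    pub fn read(&self) -> u32 {
        unsafe { read_volatile(self.addr) }
    }

    /// Overwrite the whole register.
    pub fn write(&mut self, value: u32) {
        unsafe { write_volatile(self.addr, value) }
    }

    /// Read-modify-write: set the bits in `mask`, leave the rest alone.
    pub fn set_bits(&mut self, mask: u32) {
        let old = self.read();
        self.write(old | mask);
    }
}
```

On the host the "device" can simply be a local `u32`; on the real board the address would come from the peripheral's documented memory block.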

Compiling a Kernel

Compiling a kernel requires a build chain similar to the one for embedded software: you need a compiler that supports the right target triple, a sufficiently scriptable linker, and an implementation of objcopy that can create a raw image from the binary the linker produces.

For the Raspberry Pi the build target is aarch64-unknown-none. For people not well versed in decoding target triples: this means an unknown vendor (because we don't have a software platform we're working on and don't care about any proprietary extensions yet) and a none sys (because, again, we don't have a software platform we're working on and the ABI is instead defined by the language itself).

I will use Rust/Assembly for my kernel and all examples. The reference compiler for Rust, rustc, is built on LLVM; this means it's a cross-compiler by default. I'm using Gentoo as my main system, where it's easy to ensure rustc provides the correct target by setting the LLVM_TARGETS USE flag. With other distros you will have to check using rustc --print target-list. Rust also comes with the lld linker, so the only tooling missing is objcopy.

Having built a cross-compiler, the last thing to do before starting to write code in earnest is compiling the support libraries. While I am not using Rust's std crate, I am still using libcore; libcore is the platform-agnostic, basically dependency-free library that provides a few elementary types for programming. It is possible to write Rust without libcore, but this means doing without &str, char, Option, Result, Iterator combinators, atomic types, panic! and a lot of operator traits. To be able to compile libcore, a few symbols need to be provided.
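
As a general point (an assumption on my side, not necessarily the exact list meant above), the documentation of the core crate names memcpy, memmove, memset and memcmp among the routines it expects the environment to supply, along with a panic handler. Below is a byte-wise sketch of two of them. The names `kernel_memset` and `kernel_memcpy` are hypothetical and chosen so the code can run on a host without clashing with libc; in a kernel they would have to be exported under the real symbol names.

```rust
/// Fill `n` bytes at `dest` with the low byte of `value`, like C's memset.
/// Safety: `dest` must be valid for `n` bytes of writes.
pub unsafe extern "C" fn kernel_memset(dest: *mut u8, value: i32, n: usize) -> *mut u8 {
    let mut i = 0;
    while i < n {
        unsafe { *dest.add(i) = value as u8 };
        i += 1;
    }
    dest
}

/// Copy `n` bytes from `src` to `dest`, like C's memcpy.
/// Safety: both ranges must be valid for `n` bytes and must not overlap.
pub unsafe extern "C" fn kernel_memcpy(dest: *mut u8, src: *const u8, n: usize) -> *mut u8 {
    let mut i = 0;
    while i < n {
        unsafe { *dest.add(i) = *src.add(i) };
        i += 1;
    }
    dest
}
```

One caveat with hand-written versions: an optimizing compiler may recognize such loops and turn them back into calls to memset or memcpy, so real implementations need some care to avoid calling themselves.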

A note about allocation: libcore doesn't know or particularly care about heap allocation; it works entirely with stack allocation. This is very good for now since it frees me from having to write an allocator, but I will very soon want one for a few things. At that point I can implement the core::alloc::GlobalAlloc trait and register the allocator with #[global_allocator] to make the surrounding ecosystem easily aware of its existence.
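
To illustrate what implementing that trait involves, here is a minimal bump allocator sketch. It is not the allocator this kernel will end up with: `BumpAllocator`, `Heap` and `HEAP_SIZE` are hypothetical, memory is never reused, and the imports come from `std` so the code runs on a host (the same items exist under `core::alloc`, `core::cell` and `core::sync::atomic`).

```rust
use std::alloc::{GlobalAlloc, Layout};
use std::cell::UnsafeCell;
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hypothetical heap size for this sketch.
pub const HEAP_SIZE: usize = 4096;

/// Backing storage, aligned so small alignments need no padding at offset 0.
#[repr(align(16))]
struct Heap(UnsafeCell<[u8; HEAP_SIZE]>);

/// Hands out memory by advancing an offset; never frees.
pub struct BumpAllocator {
    heap: Heap,
    next: AtomicUsize, // offset of the first free byte
}

// The offset is only updated atomically, so sharing between cores is fine.
unsafe impl Sync for BumpAllocator {}

impl BumpAllocator {
    pub const fn new() -> Self {
        BumpAllocator {
            heap: Heap(UnsafeCell::new([0; HEAP_SIZE])),
            next: AtomicUsize::new(0),
        }
    }
}

unsafe impl GlobalAlloc for BumpAllocator {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        let base = self.heap.0.get() as *mut u8;
        let mut current = self.next.load(Ordering::Relaxed);
        loop {
            // Round the next free address up to the requested alignment.
            let addr = base as usize + current;
            let aligned = (addr + layout.align() - 1) & !(layout.align() - 1);
            let start = aligned - base as usize;
            let end = match start.checked_add(layout.size()) {
                Some(end) if end <= HEAP_SIZE => end,
                _ => return std::ptr::null_mut(), // out of memory
            };
            // Claim the range; retry if another core got there first.
            match self.next.compare_exchange(current, end, Ordering::Relaxed, Ordering::Relaxed) {
                Ok(_) => return unsafe { base.add(start) },
                Err(observed) => current = observed,
            }
        }
    }

    unsafe fn dealloc(&self, _ptr: *mut u8, _layout: Layout) {
        // A bump allocator never reuses memory.
    }
}
```

In a kernel this would be declared as a `static` and tagged with `#[global_allocator]`; the sketch leaves that out so it can be exercised directly by calling `alloc`.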

Reset and Bootstrapping

The Raspberry Pi firmware runs some initialization and then jumps to our code, which it expects to be loaded at 0x80000. I use this entry point to do some basic setup for my kernel that I don't want to do in Rust.

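.section ".text.boot"       // Placed first by the linker script
.global _start
_start:
    mrs     x1, mpidr_el1   // Read the multiprocessor affinity register
    and     x1, x1, #3      // Mask all but the core ID bits

    ldr     x3, =_start     // Remember our load address; used as stack top later

    cbz     x1, config      // Core 0 skips the halt loop

1:  wfe                     // Halt all other cores
    b       1b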
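
The core selection above is just a mask and a compare. As a host-runnable sketch (the function names `core_id` and `should_boot` are mine, not part of the kernel):

```rust
/// Extract the core number the way `and x1, x1, #3` does: keep the low two bits.
pub fn core_id(mpidr: u64) -> u64 {
    mpidr & 0b11
}

/// Only core 0 continues booting; all other cores are parked in the wfe loop.
pub fn should_boot(mpidr: u64) -> bool {
    core_id(mpidr) == 0
}
```

Two bits are enough here because the BCM2837 has four cores; the upper bits of MPIDR_EL1 carry other information and are simply discarded.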

To ensure that the _start symbol is located at the right offset a small linker script can be used:


SECTIONS
{
    . = 0x80000;

    .text : {
        KEEP(*(.text.boot)) *(.text .text.*)
    }

    .rodata : {
        *(.rodata .rodata.*)
    }

    .data : {
        *(.data .data.*)
    }

    .bss ALIGN(8): {
        __bss_start = .;
        *(.bss .bss.*)
        __bss_end = .;
    }

    /DISCARD/ : { *(.comment) *(.gnu*) *(.note*) *(.eh_frame*) }
}

In the _start routine, core 0 jumps to config, a subroutine that sets up the EL3 registers if the CPU is in EL3 and otherwise jumps directly to the EL2 configuration:

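config:
    mrs x0, CurrentEL   // Exception level is in bits [3:2]
    sub x1, x0, #8      // 8 = 0b1000: we are in EL2
    cbz x1, config_el2  // => skip EL3 configuration

    mov x2, #0x5b1      // NS | SMD | HCE | RW, plus the RES1 bits
    msr scr_el3, x2
    mov x2, #0x3c9      // Return to EL2h with DAIF masked
    msr spsr_el3, x2
    adr x2, config_el2
    msr elr_el3, x2
    eret                // Drop from EL3 to config_el2 in EL2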

This check matters for testing the code with QEMU, which does not implement EL3, whereas the BCM does have it.
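
The magic numbers in this routine are easier to follow when spelled out. The sketch below decodes CurrentEL and rebuilds the SCR_EL3 value from named bits; the constant and function names are my own, and the bit positions are taken from the Arm architecture reference rather than from this kernel's source.

```rust
/// CurrentEL keeps the exception level in bits [3:2], so EL2 reads as 0b1000 = 8.
/// That is why `sub x1, x0, #8` yields zero exactly when we are in EL2.
pub fn exception_level(current_el: u64) -> u8 {
    ((current_el >> 2) & 0b11) as u8
}

// SCR_EL3 bits that make up 0x5b1 (names follow the architecture manual).
pub const SCR_NS: u64 = 1 << 0; // lower levels run in non-secure state
pub const SCR_RES1: u64 = 0b11 << 4; // reserved, must be written as 1
pub const SCR_SMD: u64 = 1 << 7; // SMC instructions disabled
pub const SCR_HCE: u64 = 1 << 8; // HVC instructions enabled
pub const SCR_RW: u64 = 1 << 10; // next lower level is AArch64

/// The value written to scr_el3 in the assembly above.
pub const SCR_VALUE: u64 = SCR_NS | SCR_RES1 | SCR_SMD | SCR_HCE | SCR_RW;
```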

// EL2 specific
config_el2:
    msr sp_el1, x3
    ldr x0, =HCR_VAL
    msr hcr_el2, x0

    ldr x0, =vectors
    msr vbar_el1, x0

    mov x2, #0x3c4      // Return to EL1t with DAIF masked
    msr spsr_el2, x2
    adr x2, el1_start
    msr elr_el2, x2
    eret                // Drop from EL2 to el1_start in EL1

el1_start:
    mov sp, x3          // Stack starts at _start and grows down, away from our code
    bl  reset           // Call the `reset` function
    b   1b              // Failsafe: Halt CPU if we ever reach this instruction
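
The two SPSR values used on the way down, 0x3c9 and 0x3c4, differ only in their mode bits. A small decoder makes that visible; `Spsr` and `decode_spsr` are hypothetical helpers for illustration, and the field positions are those of the AArch64 SPSR_ELx layout.

```rust
/// The parts of an AArch64 SPSR value that matter for the eret above.
#[derive(Debug, PartialEq)]
pub struct Spsr {
    pub el: u8,            // target exception level, M[3:2]
    pub uses_sp_elx: bool, // M[0]: true = "h" (SP_ELx), false = "t" (SP_EL0)
    pub daif_masked: bool, // D, A, I and F all set (bits 9..6)
}

/// Split a raw SPSR value into target level, stack selection and mask state.
pub fn decode_spsr(raw: u64) -> Spsr {
    Spsr {
        el: ((raw >> 2) & 0b11) as u8,
        uses_sp_elx: (raw & 1) == 1,
        daif_masked: ((raw >> 6) & 0b1111) == 0b1111,
    }
}
```

Decoding 0x3c9 gives EL2 on its own stack pointer, and 0x3c4 gives EL1 on SP_EL0, which is why el1_start sets sp itself with `mov sp, x3`; both values keep all interrupts masked.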

In el1_start the first non-assembly function is called: reset.

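#[no_mangle] // Keep the symbol name so `bl reset` in the assembly resolves
pub unsafe fn reset() -> ! {
    extern "C" {
        // Both symbols are defined by the linker script
        static mut __bss_start: u64;
        static mut __bss_end: u64;
    }

    // Zero the .bss section word by word
    r0::zero_bss(&mut __bss_start, &mut __bss_end);

    // start() jumps to crate::main and never returns
    start()
}
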
This function has two jobs: it zeros out the bss section of our image so that it does not contain uninitialized memory, and then it completes the switch to Rust by calling start(), which jumps to crate::main and thus Rust proper.
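
What the zeroing step does can be sketched in a few lines. This is not the r0 crate's code, just a hypothetical stand-in named `zero_words` that walks from a start pointer to an end pointer in 64-bit steps, using `std::ptr` so it can be tried on a host.

```rust
/// Write zero to every 64-bit word in `[start, end)`.
/// Safety: the range must be valid for writes and 8-byte aligned,
/// which is what `.bss ALIGN(8)` in the linker script provides.
pub unsafe fn zero_words(mut start: *mut u64, end: *mut u64) {
    while start < end {
        unsafe {
            // Volatile so the compiler cannot drop or merge the stores.
            std::ptr::write_volatile(start, 0);
            start = start.add(1);
        }
    }
}
```

On the host, a plain array can play the role of the .bss section: passing pointers to its third and seventh element clears exactly the four words in between and leaves the rest untouched.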

At this point it's possible to really start writing kernel code. It's only running on one core for now, but it's a start.