towboot part 6

hardware is hard, actually

You might remember that I wrote a bootloader (see the first post for why). And you may also remember that

I didn’t perform a test on bare metal

Well, now I did. And it didn’t go too well.

Writing low-level stuff in Rust is fun, because usually, if you make a mistake, the code doesn’t compile. And the compiler gives you a detailed explanation for why. That’s nice.

What’s not nice is when you make a mistake in the assembly part. It might compile, but that doesn’t mean that it’ll actually work.

And that’s where QEMU and GDB come into play: By attaching the debugger via loopback network to a running VM, I’m able to break and step through the running code.1 And that’s very useful.

Sadly, this breaks when enabling hardware acceleration for the VM. Attaching a debugger still works, but breakpoints don’t. (Yes, hardware breakpoints exist, but I never used them.) That’s why I basically never enabled KVM. Booting also doesn’t take that much time. But when I finally enabled hardware acceleration for the x86_64 target, things got ugly: QEMU threw a “KVM internal error” at me and didn’t bother to explain why.2

targets

At this point, you might ask yourself: What’s a target? And what targets are there?

I chose the x86 family of platforms as the target for our little bootloader. To be even more precise: the bootloader is built to run on both the i686 platform (32-bit) and on the x86_64 platform (64-bit).

So, we have two targets to test:

  • i686
  • x86_64

Both work on QEMU without hardware acceleration. x86_64 broke when enabling KVM:

(...)
[TRACE]: src/boot/elf.rs@142: Loaded section SectionHeader { sh_name: 17, sh_type: "SHT_STRTAB", sh_flags: 0x0, sh_addr: 0xe573aee, sh_offset: 0x32e3, sh_size: 0x6b, sh_link: 0x0, sh_info: 0x0, sh_addralign: 0x1, sh_entsize: 0x0 } to 0xe573aee
[ INFO]: src/boot/mod.rs@304: kernel is loaded and bootable
[ INFO]: src/boot/mod.rs@312: loaded 0 modules
[ INFO]: src/boot/video.rs@024: setting up the video...
[ WARN]: src/boot/video.rs@034: color depth will be 24-bit, but the kernel wants 32
[DEBUG]: src/boot/video.rs@062: available video modes: [((1024, 768), Bgr), ((640, 480), Bgr), ((800, 480), Bgr), ((800, 600), Bgr), ((832, 624), Bgr), ((960, 640), Bgr), ((1024, 600), Bgr), ((1152, 864), Bgr), ((1152, 870), Bgr), ((1280, 720), Bgr), ((1280, 760), Bgr), ((1280, 768), Bgr), ((1280, 800), Bgr), ((1280, 960), Bgr), ((1280, 1024), Bgr), ((1360, 768), Bgr), ((1366, 768), Bgr), ((1400, 1050), Bgr), ((1440, 900), Bgr), ((1600, 900), Bgr), ((1600, 1200), Bgr), ((1680, 1050), Bgr), ((1920, 1080), Bgr), ((1920, 1200), Bgr), ((1920, 1440), Bgr), ((2000, 2000), Bgr), ((2048, 1536), Bgr), ((2048, 2048), Bgr), ((2560, 1440), Bgr), ((2560, 1600), Bgr)]
[DEBUG]: src/boot/video.rs@080: chose (1024, 768) as the video mode
[ INFO]: src/boot/video.rs@085: set (1024, 768) as the video mode
[DEBUG]: src/boot/video.rs@096: gop mode: ModeInfo { version: 0, hor_res: 1024, ver_res: 768, format: Bgr, mask: PixelBitmask { red: 0, green: 0, blue: 0, reserved: 0 }, stride: 1024 }
[DEBUG]: src/boot/video.rs@131: passing Multiboot(FramebufferTable { addr: 2147483648, pitch: 4096, width: 1024, height: 768, bpp: 32, color_info: Some(Rgb(ColorInfoRgb { red_field_position: 16, red_mask_size: 8, green_field_position: 8, green_mask_size: 8, blue_field_position: 0, blue_mask_size: 8 })) })
[DEBUG]: src/boot/config_tables.rs@019: going through configuration tables...
[DEBUG]: src/boot/config_tables.rs@027: ignoring lzma filesystem
[DEBUG]: src/boot/config_tables.rs@025: ignoring dxe services table
[DEBUG]: src/boot/config_tables.rs@026: ignoring hand-off block list
[DEBUG]: src/boot/config_tables.rs@029: ignoring early memory info
[DEBUG]: src/boot/config_tables.rs@024: ignoring image debug info
[DEBUG]: src/boot/config_tables.rs@028: ignoring early memory info
[DEBUG]: src/boot/config_tables.rs@108: handling SMBIOS table
[DEBUG]: src/boot/config_tables.rs@038: handling ACPI RSDP
[DEBUG]: src/boot/config_tables.rs@038: handling ACPI RSDP
[DEBUG]: src/boot/config_tables.rs@032: ignoring table dcfa911d-26eb-469f-a220-38b7dc461220
[ INFO]:  src/main.rs@109: booting multiboot...
[DEBUG]: src/boot/mod.rs@346: expecting 132 memory areas
[DEBUG]: src/boot/mod.rs@363: passing 732803074 to kernel...
[ INFO]: src/boot/mod.rs@366: exiting boot services...
KVM internal error. Suberror: 1
emulation failure
EAX=2badf5b4 EBX=00000000 ECX=00000080 EDX=00000078
ESI=0e7ead18 EDI=0804993c EBP=2badb002 ESP=0fef92dc
EIP=000b0000 EFL=00010007 [-----PC] CPL=0 II=0 A20=1 SMM=0 HLT=0
ES =0030 00000000 ffffffff 00c09300 DPL=0 DS   [-WA]
CS =0038 00000000 ffffffff 00a09b00 DPL=0 CS64 [-RA]
SS =0030 00000000 ffffffff 00c09300 DPL=0 DS   [-WA]
DS =0030 00000000 ffffffff 00c09300 DPL=0 DS   [-WA]
FS =0030 00000000 ffffffff 00c09300 DPL=0 DS   [-WA]
GS =0030 00000000 ffffffff 00c09300 DPL=0 DS   [-WA]
LDT=0000 00000000 0000ffff 00008200 DPL=0 LDT
TR =0000 00000000 0000ffff 00008b00 DPL=0 TSS32-busy
GDT=     0f9dc000 00000047
IDT=     0f64b018 00000fff
CR0=00010033 CR2=00000000 CR3=0fc01000 CR4=00000648
DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000 
DR6=00000000ffff0ff0 DR7=0000000000000400
EFER=0000000000000900
Code=00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 <ff> ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff

So, did I do something wrong? Or is something wrong with QEMU? An “internal error” sounds, well, internal, idk.3

And the best way to find out (the most interesting one, at least) was to try real hardware.

a short excursion into the forgotten land of x86 tablets

Sadly, there are not that many real-world examples of i686 UEFI devices: x86_64 and UEFI both appeared roughly at the same time.

But there are some devices in this niche4: some Apple devices, some netbooks and some tablets5. They still sell for real money on eBay, though, and I didn’t really want to spend too much on checking that, indeed, towboot works on an obsolete architecture.

So, I asked on Mastodon. And promptly, I was handed two tablets. One was an HP Stream 7 still running Windows 10 fine (but very slowly), so I didn’t dare touch it. The other one was an ionik thingy. It has a broken flash, so it boots directly into a UEFI shell. You may not like it, but this is what peak UEFI application development looks like.

So, I put a current build of towboot, the example kernels and some kernels I pulled from GitHub on a USB flash drive (and connected a hub and a keyboard and power via an OTG adapter) and many of them actually booted! I am still happy about that and it was a really fun experience, but sadly, it wasn’t that useful, because i686 already worked on QEMU with KVM, so yeah.

playing tower defense

Finding an x86_64 UEFI system was way easier. Basically every modern desktop and laptop matches these criteria (unless they’re already ARM-based!).

The laptop I’m currently typing on matches this description, but I didn’t want to break the laptop I’m usually typing on, so I asked the Operating Systems research group at HHU and they gave me an old tower PC to play with.

So, I plugged in my USB flash drive and — it didn’t boot. It hung at the same step where QEMU crashed. Hardware is, well, hard.

about the same log on a real PC, without the KVM error


Okay, but what went wrong? How do I debug this? I can’t just attach GDB to a tower PC.

So, we’re back to virtualisation — or even better: emulation. I tried various other VM software (VirtualBox threw a Guru Meditation, which is a nice nod to Amiga, but not that helpful for me) and ended up using Bochs6, because it emulates7. This gives much more helpful error messages (even though it’s way slower):

06200266346e[CPU0  ] SetCR0(): attempt to leave 64 bit mode directly to legacy mode !

the different modes of a modern x86 CPU

You see, a modern x86 CPU is basically a Matryoshka: it contains many modes to support old software8:

  • Long Mode

    • 64-Bit Mode

    • Compatibility Mode (32-Bit)

  • Legacy Mode

    • Protected Mode (32-Bit or 16-Bit)

    • Virtual-8086 Mode (16-Bit)

    • Real Mode (16-Bit)
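To make the nesting a bit more concrete, here is a small sketch that tells these modes apart by looking at four status bits. The bit names (EFER.LMA, CR0.PE, EFLAGS.VM and CS.L) are the ones the architecture manuals use; the enum and the function are made up for illustration and are not part of towboot.

```rust
/// The operating modes from the list above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum CpuMode {
    Long64,
    Compatibility,
    Protected,
    Virtual8086,
    Real,
}

/// Classify the current mode from four status bits (hypothetical helper).
///
/// * `efer_lma`:  Long Mode is active
/// * `cr0_pe`:    protection is enabled
/// * `eflags_vm`: Virtual-8086 Mode is enabled
/// * `cs_l`:      the current code segment is a 64-bit one
fn cpu_mode(efer_lma: bool, cr0_pe: bool, eflags_vm: bool, cs_l: bool) -> CpuMode {
    match (efer_lma, cr0_pe, eflags_vm, cs_l) {
        // Inside Long Mode, only the code segment decides the sub-mode.
        (true, _, _, true) => CpuMode::Long64,
        (true, _, _, false) => CpuMode::Compatibility,
        // Outside Long Mode, we are in one of the legacy modes.
        (false, false, _, _) => CpuMode::Real,
        (false, true, true, _) => CpuMode::Virtual8086,
        (false, true, false, _) => CpuMode::Protected,
    }
}
```

The interesting part is the first two arms: 64-Bit Mode and Compatibility Mode differ only in a flag of the current code segment.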

Why is any of this relevant? You might remember that Multiboot requires a certain machine state: 32-bit Protected Mode without paging or PAE. You might also remember that bootloaders run in the firmware’s “native” mode.

So, for the i686 target this is basically irrelevant: We just stay in Protected Mode and just disable paging and PAE (if they were enabled). But on x86_64, we have to drop from 64-Bit Mode through Compatibility Mode into Protected Mode. (There’s no direct path.) And as I’ve written before:

Switching from 64-Bit Mode to 32-bit Protected Mode requires switching to Compatibility Mode, first. Luckily, rustc automatically generates the necessary instructions for that when compiling 32-bit inline assembly for a 64-bit target. What we still have to do ourselves is disabling interrupts, paging, PAE and Long Mode (thus switching from Compatibility Mode to 32-Bit Protected Mode). All of this can be done by idempotent instructions, so there is no need to check whether the CPU already is in 32-bit Protected Mode or whether PAE or paging is enabled, yay.

And that’s wrong because I’m an idiot.

rustc doesn’t generate any code to switch to Compatibility Mode. Why should it? And I didn’t check. And surprisingly, it still worked in QEMU.

But 64-bit mode requires paging and when towboot tried to disable it, Bochs threw a “SetCR0(): attempt to leave 64 bit mode directly to legacy mode !”. This is still not correct, but way more helpful than QEMU, because I can now look at the Bochs source code and see what I’ve done wrong:

else if (BX_CPU_THIS_PTR cr0.get_PG() && ! pg) {
    if (BX_CPU_THIS_PTR cpu_mode == BX_MODE_LONG_64) {
        BX_ERROR(("SetCR0(): attempt to leave 64 bit mode directly to legacy mode !"));
        return 0;
    }

cpu_mode is wrong. How does Bochs define that?

BX_MODE_LONG_64 = 4           // EFER.LMA = 1, CR0.PE=1, CS.L=1

Sure, fair enough. Maybe if I first clear the Long Mode flag in the EFER

#if BX_SUPPORT_X86_64
  /* #GP(0) if changing EFER.LME when cr0.pg = 1 */
  if ((BX_CPU_THIS_PTR efer.get_LME() != ((val32 >> 8) & 1)) &&
       BX_CPU_THIS_PTR  cr0.get_PG())
  {
    BX_ERROR(("SetEFER: attempt to change LME when CR0.PG=1"));
    return 0;
  }
#endif
    BX_CPU_THIS_PTR efer.set32((val32 & BX_CPU_THIS_PTR efer_suppmask & ~BX_EFER_LMA_MASK)
        | (BX_CPU_THIS_PTR efer.get32() & BX_EFER_LMA_MASK)); // keep LMA untouched

So, we have to change CS.L. And this actually makes a lot of sense: Having 32-bit code running in Long Mode is called … Compatibility Mode and that’s where we want (have) to go first.

But you can’t just flip a bit in CS. CS is no normal register.9

what’s segmentation anyway?

Once upon a time, there were 16-bit CPUs with 20-bit memory buses.

So, how does that work? If you want to write a simple value, say

mov byte ptr [bx], 0xab

then you don’t write to *bx, you write to *(ds * 16 + bx) (DS is the Data Segment and can simply be set).

Or you can specify the segment register to use (there are DS, ES, FS, GS and SS10):

mov byte ptr es:[bx], 0xab

Code is read from the, well, Code Segment. If you jump like that

jmp 0xabc

you’re actually jumping to cs * 16 + 0xabc. (This is called a near-jump.) But CS can’t be set like any other register — this would invalidate the current instruction pointer, oops. Instead you pass it when jumping:

jmp 0xde:0xabc

(This is called a far-jump.)11

That’s how it was on the 8086 in the seventies.
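The address arithmetic from this section fits into a few lines. This is only a sketch: the function name is made up, and the mask models the 20-bit bus, on which results above 1 MiB simply wrap around.

```rust
/// Compute the physical address the 8086 puts on its 20-bit bus
/// for a segment:offset pair (hypothetical helper for illustration).
fn real_mode_address(segment: u16, offset: u16) -> u32 {
    // The segment is shifted left by four bits (multiplied by 16) ...
    let linear = (segment as u32) * 16 + offset as u32;
    // ... and everything above bit 19 falls off the bus.
    linear & 0xF_FFFF
}
```

So the far jump to 0xde:0xabc from above ends up at 0x189c, and many different segment:offset pairs point to the same byte: 0x0fff:0x0010 and 0x1000:0x0000 both resolve to 0x10000.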

In the eighties, there was a 286 which had a Protected Mode (but just 16 bits!) and this one also had segments, but they worked quite differently: The values in the segment registers are now used as indices into a table of Segment Descriptors (the Global Descriptor Table). And these different segments can have different privileges. That’s early multitasking for you!

The 386’s 32-bit Protected Mode also had this feature, but it also came with paging12 which people seem to prefer.13

The table might look like this (in the very simple case of only one application):

index content
0 NULL14
1 system code
2 system data
3 application code
4 application data
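One detail the table glosses over: the value in a segment register (the selector) is not the bare index. The lower three bits carry a table indicator and a requested privilege level, and the index sits above them. A minimal sketch, with a made-up struct and function name:

```rust
/// The three parts of a Protected Mode segment selector.
#[derive(Debug, PartialEq, Eq)]
struct Selector {
    /// index into the descriptor table
    index: u16,
    /// false: Global Descriptor Table, true: Local Descriptor Table
    local: bool,
    /// requested privilege level (0 to 3)
    rpl: u8,
}

/// Split a raw selector value into its parts (hypothetical helper).
fn decode_selector(raw: u16) -> Selector {
    Selector {
        index: raw >> 3,
        local: (raw & 0b100) != 0,
        rpl: (raw & 0b11) as u8,
    }
}
```

With that, the register dump further up becomes a bit more readable: CS =0038 and DS =0030 would be entries 7 and 6 of the firmware’s GDT, both with privilege level 0.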

More applications may mean more segments. Be aware that all of this memory has to exist — there’s no swapping here!

And each of these segment descriptors contains various information about the corresponding segment:

  • base (Where does the segment start?)
  • limit (Where does it end?)
  • access:
    • present (Is this segment loaded?15)
    • privilege level16
    • type17
    • executable (Is this code or data?)
    • direction (In which direction does the data grow?)
    • conforming (Can the code be called from a different privilege level?)
    • read-write protection18
    • access (Has this segment recently been accessed?)
  • flags:
    • granularity (Is the limit measured in bytes or in 4KiB pages?)19
    • size (Is this a 16-bit or a 32-bit segment?)
    • long-mode (Is this a 16-bit or a 64-bit segment?)20

And don’t be fooled into thinking that these fields are laid out sequentially in memory. It’s far, far worse.
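How much worse? Here is a sketch of how the pieces of a legacy 8-byte descriptor are scattered. The function name and its parameters are made up for illustration; the bit positions are the ones documented in the architecture manuals (the access byte holds the present, privilege, type, executable, direction/conforming, read-write and accessed bits, the flags nibble holds granularity, size and long-mode).

```rust
/// Pack base, limit, access byte and flags nibble into a raw 8-byte
/// segment descriptor (hypothetical helper, not towboot's code).
fn encode_descriptor(base: u32, limit: u32, access: u8, flags: u8) -> u64 {
    let base = base as u64;
    // Only 20 bits of the limit fit into a descriptor.
    let limit = (limit & 0xF_FFFF) as u64;

    (limit & 0xFFFF)                      // bits  0..=15: limit, lower 16 bits
        | ((base & 0xFF_FFFF) << 16)      // bits 16..=39: base, lower 24 bits
        | ((access as u64) << 40)         // bits 40..=47: access byte
        | ((limit >> 16) << 48)           // bits 48..=51: limit, upper 4 bits
        | (((flags & 0xF) as u64) << 52)  // bits 52..=55: flags
        | ((base >> 24) << 56)            // bits 56..=63: base, upper 8 bits
}
```

A flat 32-bit code segment (base 0, limit 0xFFFFF, access 0x9A, flags 0xC) comes out as 0x00CF9A000000FFFF, the matching data segment (access 0x92) as 0x00CF92000000FFFF. Both the base and the limit are split into two parts each.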

And how does the CPU find this table? You have to build the value for the Global Descriptor Table Register (the address of the GDT and its size), put it somewhere in memory and load it with a hearty

lgdt [my_super_cool_gdt]

and then … nothing changes. The new GDT is in effect, but the segments have not changed. You’ll need a far jump to actually apply the code segment.21
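The thing that lgdt reads from memory is tiny. A sketch of the 64-bit variant, assuming the table is just a slice of raw 8-byte descriptors (the struct and the function are made up; the x86 crate used further down brings its own type for this):

```rust
/// What `lgdt` expects to find in memory in 64-bit mode:
/// a 16-bit limit, directly followed by the 64-bit address of the table.
#[repr(C, packed)]
struct GdtPointer {
    /// size of the table in bytes, minus one
    limit: u16,
    /// address of the first descriptor
    base: u64,
}

/// Build the pointer for a table of raw descriptors (hypothetical helper).
fn gdt_pointer(table: &[u64]) -> GdtPointer {
    assert!(!table.is_empty(), "a GDT needs at least the null descriptor");
    GdtPointer {
        limit: (core::mem::size_of_val(table) - 1) as u16,
        base: table.as_ptr() as u64,
    }
}
```

The "minus one" explains the odd-looking value in the register dump further up: GDT= 0f9dc000 00000047 means 0x48 bytes, so the firmware’s table holds nine descriptors.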

But we’re just a smol bootloader, so this table can be simpler than the one I described above: It just needs to contain the null segment, a code segment and a data segment that both span the whole memory. And indeed, this is what the Multiboot specification requires:

CS Must be a 32-bit read/execute code segment with an offset of ‘0’ and a limit of ‘0xFFFFFFFF’. The exact value is undefined.

DS, ES, FS, GS, SS Must be a 32-bit read/write data segment with an offset of ‘0’ and a limit of ‘0xFFFFFFFF’. The exact values are all undefined.

So, let’s do this. Luckily, we don’t have to db this in inline assembly22, there’s a nice crate called x86. It has Java-style Builders for CPU-related data structures which is both extremely cursed and extremely cool (the combination I hope you’re here for).

Initially, I created 16-bit segments and I’m still impressed that Bochs let me use them directly from 64-bit mode.

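use core::arch::asm;
use x86::dtables::DescriptorTablePointer;
use x86::segmentation::{
    BuildDescriptor, CodeSegmentType, DataSegmentType, Descriptor, DescriptorBuilder,
    SegmentDescriptorBuilder,
};

// Flat 32-bit code segment: base 0, maximum limit, 4 KiB granularity.
let code_segment_builder: DescriptorBuilder = SegmentDescriptorBuilder::code_descriptor(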
    0, u32::MAX, CodeSegmentType::ExecuteRead,
);
let code_segment: Descriptor = code_segment_builder
    .present()
    .limit_granularity_4kb()
    .db() // 32 bit
    .finish();
let data_segment_builder: DescriptorBuilder = SegmentDescriptorBuilder::data_descriptor(
    0, u32::MAX, DataSegmentType::ReadWrite,
);
let data_segment: Descriptor = data_segment_builder
    .present()
    .limit_granularity_4kb()
    .db() // 32bit
    .finish();
let gdt_array = [Descriptor::NULL, code_segment, data_segment];
let gdt = DescriptorTablePointer::new_from_slice(&gdt_array);

unsafe {
    x86::irq::disable();
    x86::dtables::lgdt(&gdt);
    // This IDT is invalid (but that's no problem as interrupts are disabled).
    x86::dtables::lidt::<u32>(&DescriptorTablePointer::default());
    asm!(
        "push 0x08", // code segment
        "lea rbx, [2f]",
        "push rbx",
        // This "return" allows us to overwrite CS.
        "retfq",

        // We're now in compatibility mode, yay.
        "2:",
        ".code32",
        "mov eax, 0x10", // data segment
        "mov ds, eax",
        "mov es, eax",
        "mov fs, eax",
        "mov gs, eax",
        "mov ss, eax",
    );
}

(You may ask yourself: “Why are these explicit type hints necessary?” And I don’t know either.)

Getting from Compatibility Mode to Protected Mode is relatively easy: I already had that part in place (via the EFER).
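To recap the order of operations, here is a toy model of the handful of bits involved. It is a sketch, not an emulator: the struct and its methods are made up, the two error cases mirror the Bochs checks quoted above, and PAE is left out entirely.

```rust
/// Toy model of the bits involved in leaving Long Mode.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Cpu {
    cs_l: bool,     // CS.L: the current code segment is a 64-bit one
    cr0_pg: bool,   // CR0.PG: paging is enabled
    efer_lme: bool, // EFER.LME: Long Mode is enabled
    efer_lma: bool, // EFER.LMA: Long Mode is active (maintained by the CPU)
}

impl Cpu {
    /// The state the firmware hands over on x86_64.
    fn long_64() -> Self {
        Cpu { cs_l: true, cr0_pg: true, efer_lme: true, efer_lma: true }
    }

    /// Far jump (or far return) into a different code segment.
    fn load_cs(&mut self, segment_is_64_bit: bool) {
        self.cs_l = segment_is_64_bit;
    }

    /// Clear CR0.PG. Not allowed while executing 64-bit code.
    fn disable_paging(&mut self) -> Result<(), &'static str> {
        if self.efer_lma && self.cs_l {
            return Err("attempt to leave 64 bit mode directly to legacy mode");
        }
        self.cr0_pg = false;
        // Without paging, Long Mode can't stay active.
        self.efer_lma = false;
        Ok(())
    }

    /// Clear EFER.LME. Not allowed while paging is enabled.
    fn clear_lme(&mut self) -> Result<(), &'static str> {
        if self.cr0_pg {
            return Err("attempt to change LME when CR0.PG=1");
        }
        self.efer_lme = false;
        Ok(())
    }
}
```

Starting from Cpu::long_64(), both disable_paging() and clear_lme() fail right away. Only after load_cs(false), which stands in for the far return into the 32-bit code segment, do they succeed, in exactly that order.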

the end

And with all that in place, it boots in QEMU-KVM, Bochs and on real hardware. :)

So, what’s next? Multiboot 2, of course.

the Multiboot 2 test kernel, booted on a real PC



  1. That sounds easier than it is, though. The firmware may place our application anywhere in memory, so I can’t just pass an address to GDB. There’s gotta be some way, though.

    Oh, and of course, attaching a debugger also works for the Rust part (but I’ve never had to). 

  2. Well, it dumped all registers afterwards, but that’s not that helpful. 

  3. Yes, this config_tables stuff is a hint to early Multiboot2 support. 

  4. Funnily enough, most of these devices actually have a 64-bit CPU. But, because the architecture of the firmware and the operating system must match over in Windows land (and because 32-bit Windows is faster, somehow), it actually made sense for a vendor to pair a then-current, low-power 64-bit CPU (e.g. an Atom) with a 32-bit firmware and a 32-bit Windows. These devices will never be able to run Windows 11 and will soon be obsolete, though.

  5. “Wait, aren’t tablets usually ARM-based?” Nowadays, yes. Well, nowadays, even laptops are starting to ship with ARM. But 10 years ago, Windows on ARM was barely usable (support for running x86 applications shipped as late as 2017). And don’t forget that Android has only had a usable tablet UI since 2011.

  6. you see, it puts operating systems into a box 

  7. I don’t want to go on a tangent explaining the difference between emulation and virtualisation, but the important part here is that Bochs is not just running the guest system’s code in an unprivileged process and trapping on hardware access, it interprets the code and keeps its own model of the CPU

  8. This might finally change with x86s, but for now, those are dreams of the future. And when it comes, it might annihilate Multiboot completely. (Yes, it’s possible to boot directly into 64-bit kernels with Multiboot 2, but I have never seen that possibility actually in use.) 

  9. This was way harder to figure out than it might seem. I took a look at how the Limine bootloader does it, but I also needed some time to understand what this was actually doing.

    I still do not know what the Linux kernel or GRUB are doing. 

  10. which is the Stack Segment and is used for push and pop 

  11. This also applies to call and ret

  12. Paging has the added benefit that the applications do not have to be placed sequentially in memory. Their data (and code, obviously) can be interleaved and even swapped out when not needed. But really, that’s not the topic at hand. 

  13. gnumach seems to be the only current (well, sort of) kernel that makes use of 32-bit segmentation — and only for a short time during boot. This really broke my gdb setup. 

  14. idk why there’s a null descriptor 

  15. I’d imagine one could build a very primitive swapping approach with this: simply swap out the whole application.

  16. This is on a spectrum from 0 (kernel) to 3 (applications) — in theory. In reality, only those two levels are used. 

  17. that’s for task switching (don’t ask me what this is) 

  18. but only partially: data is always readable, code is never writable (hello NX!)

    But, because segments may overlap, you’re actually able to write to the memory referenced by a code segment, if you’ve got a writable data segment covering this. 

  19. The granularity is actually really important as you can’t get to the full 32 bits, otherwise. 

  20. Yes, you could create a segment that’s both 32-bit and 64-bit at the same time. I don’t know what would happen. 

  21. I have no idea when the data segments kick in. 

  22. Well, I tried and failed. Those x86 addressing modes are weird. 

