Long Mode

In this article, we’ll expand on our previous work of getting a trivial chunk of C loaded and run by GRUB2.

We’re going to complete the basic boot process and get into a 64-bit C environment.

Concepts

Long Mode

Long Mode is 64-bit mode. The AMD64 spec changes a lot of the behavior of the processor based on whether it’s in Long Mode or not. The opposite of Long Mode is Legacy Mode in which an AMD64/Intel64 processor will run like a Pentium (albeit a fast one).

Descriptors (aka Selectors)

Descriptors are an old way to describe memory. x86 and x86_64 have three descriptor tables, each of which holds a number of descriptors of different types. For now, only the GDT and IDT are interesting.

GDT

In the GDT (Global Descriptor Table), descriptors define code and data “segments” (and others we won’t cover here). The x86 and x86_64 processors have “segment registers” (CS for code and DS through GS for data access) that hold byte offsets into this table. Each of these descriptors roughly contains a start address, size, a type, default operand size (16- or 32-bit), and permissions info.

In 32-bit kernels, a code and a data descriptor are set up for the kernel (privilege 0, supervisor) and also for userspace (privilege 3, user) for a total of 2 code and 2 data descriptors (among others that are irrelevant right now).

In Long Mode, a previously reserved bit is defined as the “Long bit” which tells the processor that the descriptor is 64-bit. In this case, the base and size of the descriptor are ignored and many of the obsolete type values are invalid. 32-bit descriptors can be used when Long Mode is enabled however, which is how 32-bit compatibility mode works (code run with a CS register pointing to a 32-bit code descriptor is in 32-bit compatibility mode).

IDT

The IDT (Interrupt Descriptor Table) includes a bunch of descriptors that tell the processor what to do when it receives an interrupt (which can be anything from a timer going off, to an error, to a hardware notification). This table includes a separate (but similar) type of descriptor called the Interrupt Gate that includes a code segment (CS setting, referencing the GDT) and an address to jump to when an interrupt is received. In Long Mode, this descriptor is extended to allow a 64-bit target address, and the code segment must be a 64-bit one.

Paging

Paging is the mechanism that replaced segmentation (descriptors) to manage memory. It’s extremely powerful and flexible. The core idea of paging is the separation of “virtual” and “physical” address spaces.

Without paging, when you reference memory @ 0x1000 you’re referencing the 0x1000th (4096th) byte of physical memory. If you access memory beyond the end of physical memory, you’ll get an error.

With paging, 0x1000 is a “virtual” address, which means that it can be mapped to any page (4k chunk) of physical memory. 0x1000 “virtual” could be 0x7f8000 “physical”.

Each process in a typical kernel has its own context, which includes its set of mappings from virtual to physical addresses. When one process is running, another process’ memory isn’t reachable (unless by design for something like threading). Separate contexts for each process also means that multiple programs linked at the same (virtual) address can happily run simultaneously because they occupy separate physical memory.

There are other advantages to paging as well. Fine-grained permissions, write monitoring, and page faults allow us a lot of flexibility with how we handle memory, but just for getting to Long Mode we only need to make one context (the kernel context) that maps the kernel’s link address (virtual) to the memory GRUB loaded the kernel at (physical), so that we can enable paging and continue to run.

We’ll briefly get in to the mechanics of paging in this article, but it will come up over and over as memory management is one of the core tasks of the kernel. We’ll encounter paging again when we write our memory manager, and yet again when we start forking processes, and yet again when we deal with IO.

Implementation Note

Why Assembly?

It’s definitely possible to write a kernel in C with a bare minimum of inline assembly for the cases where you need to do something special (like loading the GDT/IDT registers, or switching paging contexts), but this would entail a lot of double-checking of assembly output. A key problem is that addresses are not easy to manipulate in C without a whole lot of pointer casting and other ugliness. This is especially an issue for our initial code because the link address (the virtual address, the one you get when referencing a symbol) is different from our load address (where GRUB puts us, and where addresses need to point before we enable paging).

Why NASM?

For all the assembly in Viridis, I’ve decided to use NASM instead of as, the GNU assembler that GCC uses. The reasoning behind this is that I find NASM syntax to be clearer than as (even with -masm=intel), in addition to it supporting the BITS directive to mix 32- and 64-bit code in the same file (which is important for this chapter).

I’m using the GCC C pre-processor (cpp) on top of the NASM files, which seems like a hack, but it allows us to share headers between NASM and C and avoids having to keep two sets synchronized.

The Road to Long Mode

For the rest of this article, the AMD64 Programmer’s Manual v2 is going to be our guide.

Chapter 14 covers the power-on state of the chip and then the initialization process. Fortunately, GRUB has already gotten us into Protected Mode so we needn’t worry about that part, although the Multiboot Specification section 3.2 mentions that the GDT it set up is probably no longer valid, so we should set up our own in known memory.

According to 14.5 (Long Mode initialization) we need to do the following. I’ve re-ordered them to the order in which we’ll actually do them.

  1. The GDT must be setup with a 64-bit Code Segment
  2. The IDT must be setup with 64-bit Interrupt Gates
  3. PAE paging structures must be in place

At which point, you can move on to 14.6 which describes the mechanics of enabling long mode (set EFER[LME]), and activating it by enabling paging (getting EFER[LMA] set which confirms that Long Mode has actually engaged).

Initial Tweaks

When we last left off, we were loading a simple C “kernel” that did nothing but loop in place forever. This time we’re actually going to do something complex, so there are a handful of miscellaneous tweaks I’ve made to simplify things.

linker.ld

If you remember, last time we set up a linker script to force our GRUB signature to be properly placed in the resulting binary.

This time, I’ve made some additions.

OUTPUT_FORMAT("elf64-x86-64")
ENTRY(entry)
SECTIONS
{
    kernel_start = 0xFFFFFFFF80100000;

    .grub_sig 0x100000 :
    {
        *(.grub_sig)
    }
    .text_early 0x100080 :
    {
        *(.text_early)
    }
    .text 0xFFFFFFFF80102000 : AT(0x102000)
    {
        *(.text)
    }
    .data :
    {
        *(.data)
    }
    .bss :
    {
        *(.bss)
    }
    /DISCARD/ :
    {
        *(.comment)
        *(.eh_frame)
    }
    kernel_end = . ;
    kernel_size = kernel_end - kernel_start;
}

The first thing of note is the addition of .text_early. Before we setup paging, we’re dealing with physical addresses only, so this section will include all of the code that expects us to be using physical addresses directly. This is so, for example, we can use call populate_gdt and the address will be the correct physical address, rather than a currently invalid virtual address.

The second thing of note is that we get three extremely useful linker variables: kernel_start, kernel_end, and kernel_size. We’ll use these when setting up paging to make sure that we’ve included our entire kernel.

Lastly, I’ve changed the ENTRY to “entry” instead of “main”, but that’s purely semantics.

entry

At this point, our basic kernel could be converted to assembly, looking something like this:

BITS 32

/* linker.ld */
EXTERN kernel_size

/* We specially link this at our load address to simplify jmps. */
SECTION .text_early

GLOBAL entry

entry:
    jmp $

Essentially, a bunch of directives and an infinite loop. We’ll add on to this later, but for now let’s talk about how we’ll host our segment descriptors.

Hosting the GDT

After we’re dropped into our code, we want to load our own GDT with three descriptors. One for 32-bit code (that we’ll use after we load the GDT but before we enable paging), one for 64-bit code (that we’ll use after paging) and a data descriptor that will work in either mode.

Structure

The structure of a single GDT descriptor is shown in figures 4-13 (legacy code and data), 4-20 (long code), and 4-21 (long data) in the AMD64 Programmer’s Manual v2. Legacy and long data descriptors are compatible (as 4-21 shows, only the valid bit matters for long data descriptors).

From the head of asm/gdt.asm we can describe one descriptor like this:


...
/*  DW     SEG_LIMIT_LOW
 *  DW     BASE_ADDRESS_LOW
 *  DB     BASE_ADDRESS_MID
 *  DW     FLAGS
 *  DB     BASE_ADDRESS_HIGH
 *
 * Base Address is obviously the 32-bit address starting the segment.
 *
 * Segment Limit is 20-bits of size.
 *
 * FLAGS = 15 14 13 12 11 10 09 08 07 06 05 04 03 02 01 00
 *         G  D  R  A  \         / P  \DPL/ S  \         /
 *               S  V   SEGMENT HI                TYPE
 *               V  L
 *
 * For code descriptors, type is
 *  8 - Execute Only
 *  A - Executable / Readable
 *
 * There's also a concept of "conforming" with the higher bits we don't care
 * about.
 *
 * For data descriptors, type is
 *  0 - Read Only
 *  2 - Read / Write
 *  4 - Expand Down RO
 *  6 - Expand Down RW
 *
 * Each of these types for code and data descriptors has an odd version
 * indicating the accessed bit, which we don't care about either.
 *
 * DPL is a two bit field indicating which privilege level the descriptor is
 * in. 0 is the higher priority, 3 is the low priority. The kernel is in the
 * highest priority.
 *
 * P is the present bit.
 *
 * AVL is available to us, but we don't use it.
 *
 * D/B - Default Operand Size. For Long Mode this must be 0; otherwise it's
 * 1 for a 32-bit segment, 0 for a 16-bit segment.
 *
 * G - granularity, whether segment limit describes the size in bytes (G=0)
 * or 4k pages (G=1).
...

Which leads to some flag definitions.


...
#define FLAG_CODE   0xa //Read/Execute
#define FLAG_DATA   0x2 //Read/Write

#define FLAG_USER   (1 << 4)
#define FLAG_SYSTEM (0 << 4)

#define FLAG_R0     (0 << 5)    // Rings 0 - 3
#define FLAG_R1     (1 << 5)
#define FLAG_R2     (2 << 5)
#define FLAG_R3     (3 << 5)

#define FLAG_P      (1 << 7)    // Present

#define FLAG_32     (1 << 14)   // 1 for 32-bit compat
#define FLAG_4k     (1 << 15)   // 4k page granularity
...

This is the easy, mechanical part of transferring the bitfield diagrams into useful code. Using the flags above we can define the constants that we’ll use for our 32-bit descriptors.


...
#define FLAGS_COMMON_32 (FLAG_USER | FLAG_R0 | FLAG_P | FLAG_32 | FLAG_4k)
#define FLAGS_CODE_32 (FLAG_CODE | FLAGS_COMMON_32)
#define FLAGS_DATA_32 (FLAG_DATA | FLAGS_COMMON_32)
...

In short, we use FLAG_USER to show that these are code and data descriptors (FLAG_SYSTEM would indicate a system type descriptor which is an entirely separate thing). FLAG_R0 because we are ring/privilege 0, the most privileged (kernel). FLAG_P for present, FLAG_32 for 32-bit operations, and FLAG_4k because we’re going to use a 4k size instead of bytes.

For each of these descriptors, we’re going to cover the entire 32-bit address space so base is 0 and the segment limit (4k size) is going to be 0xFFFFF (the maximum).

For our long mode descriptor, we have one additional flag:


/* Long mode descriptors have bit 13 as the "Long bit"
 *
 * A lot of the bits get ignored in long mode, but we'll set them anyway since
 * we're not there yet.
 */

#define FLAG_L          (1 << 13) // 1 for 64-bit
#define FLAGS_CODE_64 (FLAG_USER | FLAG_R0 | FLAG_P | FLAG_L | FLAG_4k | FLAG_CODE)

Basically our 64-bit flags are identical to 32-bit except we’ve added FLAG_L (long mode), and dropped FLAG_32 because they’re mutually exclusive (having both set is currently undefined behavior).

Before we actually define our table, let’s make a macro that will allow us to give our flags, base, and size in simple terms, and then generate the proper masks.


/* 1 = FLAGS, 2 = BASE, 3 = LIMIT */

%macro GDTENTRY 3
    DW  ((%3) & 0xFFFF)
    DW  ((%2) & 0xFFFF)
    DB  (((%2) & 0xFF0000) >> 16)
    DW  ((%1) | (((%3) & 0xF0000) >> 8))
    DB  (((%2) & 0xFF000000) >> 24)
%endmacro

This is a 3-argument NASM macro that will take our constant values and jockey their bits around to create an entry directly in the binary (DW and DB are pseudo-instructions that embed words (16 bits) and bytes directly). If the &, |, >> confuse you, you may want to refresh your memory on bitwise operations.
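
As a sanity check on the bit-twiddling, here is the same encoding written as a C function. This fragment is purely illustrative (it isn’t part of the kernel), but for our 32-bit code descriptor it should produce the classic flat-model value:

#include <stdint.h>

/* Build one 8-byte descriptor the same way the GDTENTRY macro does. */
static uint64_t gdt_entry(uint16_t flags, uint32_t base, uint32_t limit)
{
    return  (uint64_t)(limit & 0xFFFF)                            /* limit 15:0  */
          | ((uint64_t)(base  & 0xFFFF) << 16)                    /* base  15:0  */
          | ((uint64_t)((base >> 16) & 0xFF) << 32)               /* base  23:16 */
          | ((uint64_t)(flags | ((limit & 0xF0000) >> 8)) << 40)  /* flags + limit 19:16 */
          | ((uint64_t)((base >> 24) & 0xFF) << 56);              /* base  31:24 */
}

/* FLAGS_CODE_32 works out to 0xC09A, so:
 *   gdt_entry(0xC09A, 0x0, 0xFFFFF) == 0x00CF9A000000FFFF
 * which is the textbook flat 32-bit ring 0 code descriptor.
 */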

With this macro in hand, we can then actually define the table:


ALIGN 8
GDT:
    /* NULL descriptor */
    GDTENTRY    0, 0x0, 0x0
    /* Code descriptor (Compat) */
CODE_SEL_32 EQU $-GDT
    GDTENTRY    FLAGS_CODE_32, 0x0, 0xFFFFF
    /* Data descriptor (Compat) */
DATA_SEL EQU $-GDT
    GDTENTRY    FLAGS_DATA_32, 0x0, 0xFFFFF
    /* Code descriptor (Long) */
CODE_SEL_64 EQU $-GDT
    GDTENTRY    FLAGS_CODE_64, 0x0, 0xFFFFF
GDTEND:

A bit of dissection. ALIGN 8 makes this 8-byte aligned, for access performance (processors take longer to access byte offsets from unaligned bases). This is irrelevant to the actual operation of the GDT, however.

GDT and GDTEND are labels we’ll use later. CODE_SEL_32 and friends are the calculated byte offsets from the beginning of the GDT. These are the values that we’ll place in our CS and DS-GS segment registers.

The rest of the lines are calls to our entry macro, using the flag combos we already defined and the base / limit we already discussed.

Also included here is the NULL descriptor that is required to be the first descriptor (and thus used if a segment register is loaded with 0x0).
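
For reference, since each entry is 8 bytes, the selector offsets work out to:

NULL descriptor = 0x00
CODE_SEL_32     = 0x08
DATA_SEL        = 0x10
CODE_SEL_64     = 0x18

Keep that last value in mind: it’s the literal 0x18 you’ll see in the far jump that finally switches us onto the 64-bit code descriptor near the end of this article.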

The GDTR

Software can define as many descriptors in the GDT as it likes (within reason), so the GDT can be any size. Since it can also be at any place, the processor would have to store two registers’ worth of information to know the location and limit of the GDT. Well, hardware generally doesn’t use two registers when it can make do with one, so instead of pointing the processor directly at the GDT, we point it at the GDTR, a structure of known size that holds the GDT’s limit and address.

Figures 4-7 and 4-8 in the programmer’s manual describe the GDTR in legacy and long mode. Fortunately, the only difference is the size of the address allowed for the GDT and it’s easy to make the two compatible.


ALIGN 8
GDTR:
DW (GDTEND - GDT - 1)

/* 8 bytes for long-mode, high bytes ignored in legacy */

DQ GDT
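
If it helps to picture it in C, the structure that lgdt consumes is just a packed limit/base pair; something like this sketch (the kernel builds it directly in assembly as shown above):

#include <stdint.h>

struct gdtr {
    uint16_t limit;  /* size of the GDT in bytes, minus one */
    uint64_t base;   /* address of the GDT; only the low 4 bytes matter in legacy mode */
} __attribute__((packed));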

Load it up

To load the address of the GDTR, we make use of the special lgdt instruction, and put it into a basic assembly function to be called from our general code.


GLOBAL populate_gdt
populate_gdt:
    lgdt[GDTR]
    ret

Hosting the IDT

Structure

The IDT is very similar to the GDT, it’s a list of descriptors that we’ll point to with an IDTR (identical to the GDTR). The structure and purpose of each descriptor is a little different however, and it’s described in figure 4-24 of the AMD64 Programmer’s Manual v2.

There is one difference from the GDT: we can’t build this one so easily with a macro. The reason is that each IDT entry describes a code descriptor (from the GDT) and an address to jump to if an interrupt is received (among other permission information). Each address points to an ISR (Interrupt Service Routine) and is only known at link-time, but we have to do the same masking and shifting we did with the GDT, which requires a value at compile-time.

We could resolve this one of two ways. We could make it macro-able by assigning the ISRs to a section that we give a known address in the linker script (so we would know the ISR addresses before link-time). Or, we could just init the entries at run-time. I believe that linker sections would be a good solution except for one thing: even though right now all of our ISRs are two-byte stubs that can be easily indexed and grouped, later we’re going to programmatically change the IDT entries anyway (albeit in C, as drivers are loaded and request interrupts), and later still we might want this boot code to put in actual handler addresses for built-in drivers, which may not be of constant size or grouped into the ISR section.

Then again, perhaps we’ll route all of the IDT entries to a single master interrupt function, in which case the section approach would work perfectly with a little modification to the section sizing.

Regardless, I’ve decided to init the IDT at run-time.


/* idt.asm */
EXTERN CODE_SEL_64

/* The IDT holds a set of descriptors similar to the GDT, called
 * Interrupt Gates. Call and Trap Gates are also valid here.
 */
    
#define FLAG_INTERRUPT  0xe
    
/* These two are common with GDT descriptors */
    
#define FLAG_R0     (0 << 5)    // Rings 0 - 3
#define FLAG_P      (1 << 7)

#define IDT_ENTRIES     256

Here are the relatively few interesting values for the IDT: the 64-bit code descriptor offset we defined in gdt.asm, FLAG_INTERRUPT which is basically the type, as well as the common FLAG_R0 and FLAG_P for Ring 0 (most privileged) and present. There are 256 IDT entries, one per interrupt vector (referenced as “VECTOR” in section 16.2 of the programmer’s manual).
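
It may also help to see the 16-byte Long Mode interrupt gate as a C struct. This is only a sketch (the field names are mine), but it’s the layout that populate_idt fills in at run-time below:

#include <stdint.h>

struct idt_gate {
    uint16_t target_low;     /* target address bits 15:0 */
    uint16_t code_selector;  /* CODE_SEL_64 */
    uint8_t  ist;            /* interrupt stack table index, 0 for now */
    uint8_t  flags;          /* FLAG_P | FLAG_R0 | FLAG_INTERRUPT = 0x8e */
    uint16_t target_mid;     /* target address bits 31:16 */
    uint32_t target_high;    /* target address bits 63:32 */
    uint32_t reserved;       /* must be zero */
} __attribute__((packed));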

Now we define all 256 of our ISRs. Each ISR is a simple infinite loop, which may appear useless, but for right now we’re not capable of actually handling any of them, and our debug environment can give us the current instruction pointer (IP) to tell us where we’re looping if we take an interrupt.


/* Generate 256 interrupt routines. */

/* 1 = ISR # */

%macro ISR 1
isr%1:
    jmp $
%endmacro

ISRS:
%assign i 0
%rep IDT_ENTRIES
ISR i
%assign i (i+1)
%endrep

#define ISR_SIZE (isr1 - isr0)

Now, allocate space for the IDTR.


/* IDTR, just like GDTR */

ALIGN 8
IDTR:
DW (IDTEND - IDT - 1)
DQ IDT

And then allocate space for each of the IDT entries:


%macro IDTENTRY 0
    DD 0xabcdefab
    DD 0xabcdefab
    DD 0xabcdefab
    DD 0xabcdefab
%endmacro

ALIGN 8
IDT:
%assign i 0
%rep IDT_ENTRIES
IDTENTRY
%assign i (i+1)
%endrep
IDTEND:

Init and Load It

NOTE: We are just going ahead and setting up a 64-bit IDT, as you can see from our usage of VIRT_BASE in this chunk as well as the expanded size of the IDT entries. We will have interrupts masked (cli) until after we’re in a 64-bit environment, so this is okay. When we load the GDT, we’ll load the 32-bit address just so we’re self-hosted when we enable Long Mode, but then we return with a fixup function that will convert the IDTR to a 64-bit address afterwards.

Okay, so we’ve created a working IDTR, and a bunch of stub IDT entries. Similar to the GDT, we have to populate the IDT, and load the IDTR. This time it’s a bit more complicated because we’re actually initializing the IDT at runtime instead of compile time.


GLOBAL populate_idt
populate_idt:
    mov eax, IDT
    mov ebx, isr0
    or ebx, (VIRT_BASE & 0xFFFFFFFF)

idt_init_one:
    /* Target Low (word) */
    mov ecx, ebx
    mov word [eax], cx
    add eax, 2

    /* Code Selector (word) */
    mov word[eax], CODE_SEL_64
    add eax, 2

    /* IST (byte) */
    mov byte[eax], 0
    add eax, 1

    /* Flags (byte) */
    mov byte[eax], (FLAG_P|FLAG_R0|FLAG_INTERRUPT)
    add eax, 1

    /* Target High (word) */
    shr ecx, 16
    mov word[eax], cx
    add eax, 2

    /* Long Mode Target High 32 */
    mov dword[eax], (VIRT_BASE >> 32)
    add eax, 4

    mov dword[eax], 0
    add eax, 4

    add ebx, ISR_SIZE

    cmp eax, IDTEND
    jl idt_init_one

    lidt[IDTR]
    ret

This looks a lot more complex than it is. EAX is the pointer we’re writing to. EBX is the current isr address we defined, and ECX is just a holder to shift bits around. After we’ve done that for each ISR, we load the IDTR just like we loaded the GDTR.

At this point, we still can’t take an interrupt (since our table isn’t a valid 32-bit table), but as soon as we switch to 64-bit and update the IDTR, we’ll be prepared to take an interrupt… we just won’t be able to do anything intelligent with it for now.

Our Init At This Point

I’ve thrown a lot of code at you for doing all of this initialization, but we haven’t actually invoked it yet. It’s time to look at head.asm, which is the very earliest code in our kernel.


BITS 32

#include <early.h>

/* gdt.asm */
EXTERN populate_gdt
EXTERN DATA_SEL
EXTERN CODE_SEL_32
EXTERN CODE_SEL_64

/* idt.asm */
EXTERN populate_idt

entry:

    /* Disable interrupts while we move memory descriptors over from
     * whatever is left of GRUB.
     */

    cli

    /* Move the GRUB information pointer into EDI, which we won't
     * use for anything else until we call main()
     */

    mov edi, ebx

    /* Assume that GRUB setup enough stack for us to do a call. */
    /* This only does lgdt */

    call populate_gdt

    /* Setup data segments to new offset in our own GDT */
    mov eax, DATA_SEL
    mov ds, eax
    mov es, eax
    mov fs, eax
    mov gs, eax
    mov ss, eax

    /* Setup our stack, paging code will start to use it. */
    mov esp, (STACK_PAGES_PHYS + S_PAGES * PAGE_SIZE)
    mov ebp, esp

    /* Reload code segment, jumping to host_gdt load (not link) */
    jmp CODE_SEL_32:host_gdt
host_gdt:

    call populate_idt
    jmp $

This gets us far enough that we’re hosting our own IDT and GDT, and all of our segment registers, CS->GS and SS, are properly set to offsets within our new GDT. Trickily, CS can only be changed with a jump that specifies the CS, instead of a register move like all of the others.

Note that we define the stack related stuff in early.h, which we’ll get to later. For now, STACK_PAGES_PHYS is 0, and S_PAGES is 2, for an 8k stack. If you recall your basic C courses, the stack grows downward (i.e. the second thing on the stack will have a lower address than the first) so we init our stack registers, ESP and EBP, to point to the highest address.
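
The relevant early.h definitions presumably look something like this (a sketch based on the values above), which is how the mov esp line works out to an initial stack pointer of 0x2000, growing down toward 0:

#define STACK_PAGES_PHYS    0   /* physical address of the stack pages */
#define S_PAGES             2   /* 2 pages * 4k = 8k of stack */

/* STACK_PAGES_PHYS + S_PAGES * PAGE_SIZE = 0 + 2 * 4096 = 0x2000 */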

We’re doing great, but now it’s time to move on to the tough part. Getting our initial page tables setup.

Initial Paging

Section 5.1 of the AMD64 Programmer’s Manual v2 covers the basics of paging. I’ll try to summarize.

To translate a single virtual address into its physical counterpart, you have to do four table lookups. Each table is a single 4k page containing 1024 entries (32-bit) or 512 entries (64-bit). Each entry contains a physical address, plus some flags, and that address is either the next table’s physical address or, in the last table, the physical address that corresponds to the virtual address you started the lookup with.

The indices into these tables are built directly into the virtual address. You can see how these break down in figure 5-1 of the Programmer’s Manual. These structures allow us to define some useful macros in include/early.h



/* 4k page size */
#define PAGE_SHIFT      12
#define PAGE_SIZE       (1 << PAGE_SHIFT)
#define PAGE_MASK       (PAGE_SIZE - 1)

#define PTE_SHIFT       (PAGE_SHIFT + 9*0)
#define PDE_SHIFT       (PAGE_SHIFT + 9*1)
#define PDPE_SHIFT      (PAGE_SHIFT + 9*2)
#define PML4E_SHIFT     (PAGE_SHIFT + 9*3)

/* Find index based on virtual address */
#define PTE(x)          (((x) >> PTE_SHIFT) & 0x1FF)
#define PDE(x)          (((x) >> PDE_SHIFT) & 0x1FF)
#define PDPE(x)         (((x) >> PDPE_SHIFT) & 0x1FF)
#define PML4E(x)        (((x) >> PML4E_SHIFT) & 0x1FF)

Since we’re currently in assembly, and these are C preprocessor macros, we have to be careful to only insert constants into them that the C preprocessor can resolve immediately. So, even in assembly we can do PTE(CONSTANT_ADDRESS) but PTE(eax) would fail as NASM can’t convert that into the appropriate assembly.
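
To see how the macros fit together, here’s the whole four-level lookup written out as a C sketch. It’s purely illustrative: read_phys() is a made-up helper meaning “read 8 bytes of physical memory”, which is only a sensible thing to do before paging is enabled, and present-bit/NX checks are skipped for brevity.

#include <stdint.h>
#include <early.h>  /* PML4E/PDPE/PDE/PTE, PAGE_MASK */

extern uint64_t read_phys(uint64_t paddr);  /* hypothetical: read one 8-byte entry */

uint64_t translate(uint64_t pml4_phys, uint64_t virt)
{
    /* Each level: index into the table with 9 bits of the virtual address,
     * mask off the flag bits, and use the result as the next table's address.
     */
    uint64_t pdp_phys  = read_phys(pml4_phys + 8 * PML4E(virt)) & ~0xFFFUL;
    uint64_t pd_phys   = read_phys(pdp_phys  + 8 * PDPE(virt))  & ~0xFFFUL;
    uint64_t pt_phys   = read_phys(pd_phys   + 8 * PDE(virt))   & ~0xFFFUL;
    uint64_t page_phys = read_phys(pt_phys   + 8 * PTE(virt))   & ~0xFFFUL;

    return page_phys | (virt & PAGE_MASK);
}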

Some settings for page flags also become useful.


#define PF_P                (1 << 0) /* Present */
#define PF_RW               (1 << 1) /* Read/Write */
#define PF_USER             (1 << 2)
#define PF_WRITETHRU        (1 << 3)
#define PF_DISABLE_CACHE    (1 << 4)

First things first, however, we need to find out how big our kernel is, and how many pages we’ll need for all of its PML4/PDP/PD/PTs.


    /* Calculate size of early structures */

    /* eax = size of kernel, rounded up to page size */
    mov eax, kernel_size
    add eax, PAGE_MASK
    and eax, ~PAGE_MASK

    /* ebx = end of kernel address, rounded up*/
    mov ebx, eax
    add ebx, KERNEL_START

    /* Now we want a count of how many page table pages it will take to map
     * this. Because of our chosen offset, we get half of the first page table
     * (PTE(KERNEL_START) = 256).
     */

    /* We do get a full page directory however, so we can count on the number of
     * PD/PDP/PML4 pages being one a piece because I'm fairly confident that our
     * kernel will always be under 1Gb
     */

    /* ecx = pte index of first kernel page */
    mov ecx, PTE(KERNEL_START)

    /* edx = page structure count, initial PT and  PD/PDP/PML4 already counted. */
    mov edx, 4

count_early:
    sub eax, PAGE_SIZE

    /* If this would be the 512th PTE (starting from 0), it's actually PTE 0 
     * of another page table page. Roughly:
     *
     *  if (pte_idx == 512) {
     *      reserved_pages++;
     *      pte_idx = 0;
     *  }
     *  else {
     *      pte_idx++;
     *  }
     */

    /* if */
    cmp ecx, 512
    jne no_new_pt

    /* then */
    add edx, 1
    mov ecx, 0
    jmp ce_loop_end

    /* else */
no_new_pt:
    add ecx, 1

    /* remaining loop */
ce_loop_end:

    cmp eax, 0
    jne count_early



Not much to add on top of the comments.
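
As a worked example (the numbers are invented): if the kernel rounds up to 600 pages, ECX starts at 256, so the first 256 pages take ECX up to 512 within the initial page table; page 257 triggers a second page table, and it plus the remaining 343 pages fit inside that one, so EDX ends up at 5 (the initial PT, one extra PT, and one page each for the PD, PDP and PML4).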

We’re going to assume that GRUB didn’t put us right next to an unusable memory chunk, and that we can use the pages immediately after the kernel for these structures. Because any stale data in these page structures would be interpreted as valid entries, we zero from the end of the kernel (page aligned) up to EDX pages after it.


    /* Summary:
     * ebx = page aligned kernel end address
     * edx = number of pages tables needed to fully map kernel
     *  + 4 for initial PT and PD/PDP/PML4
     */

    /* ecx = end of kernel paging addresses <- saving this for later*/
    mov ecx, edx
    shl ecx, PAGE_SHIFT
    add ecx, ebx
    add ecx, PAGE_SIZE

    mov eax, ebx
zero_page_mem:
    mov dword [eax], 0x0
    add eax, 4
    cmp eax, ecx
    jne zero_page_mem

The Recursive Page Table

We’re ready to start working on the page tables themselves, with EDX pages of zeroed scratch memory @ EBX.

We’ll use this memory to create each table (PML4 / PDP / PD / PT) for the kernel immediately after it in memory. Here’s an example memory map.


---- Page aligned end of kernel (EBX)
    1 4k page PML4
----------------
    1 4k page PDP
----------------
    1 4k page PD
----------------
    n 4k page PTs
---- End of page structures (ECX)

PML4 - Page Map Level 4
PDP - Page Directory Pointer
PD - Page Directory
PT - Page Table

We’re mapping both the target virtual address (VIRT_BASE | KERNEL_START => KERNEL_START) as well as the identity map (KERNEL_START => KERNEL_START) so that when we first start paging, we’re still executing in a valid address space.

As I mentioned above, translating from a virtual address to a physical address requires a series of lookups. Once we enable paging, however, we’ll only have direct access to the memory that we have recorded into the page tables; there’s no way to read or write physical addresses anymore. Logically, then, the memory the page tables reside in should be mapped into itself so that you can modify them without disabling paging (which would likely be a disaster anyway). Fortunately, the design of the x86_64 architecture, like the design of the i386 before it, uses a recursive data structure, meaning that each table’s entries are compatible. An entry in the first lookup table, the PML4, looks the same as a valid entry in the PDP, PD, and PT structures as well. This means that if you make an entry in the PML4 that is the PML4’s own address, then the PML4 is also a valid PDP that has the PML4 mapped into it (because it’s the same physical memory location), which makes it a PD with the PML4 mapped into it, which makes it a PT with the PML4 mapped into it, which means that the PML4 itself is a leaf node and its physical memory is mapped somewhere in your address space.

I’ll cover the process of finding the virtual addresses of arbitrary page structures in the next article. For now it’s enough to know that because the same recursive effect occurs for each page table structure, mapping the PML4 into itself is the best way to ensure that all of your page tables are accessible at all times.
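
As a preview (the arithmetic is mine; the full derivation is next article's material), this recursive slot at index 510 is where the two magic constants in the cleanup code at the end of this article come from:

PML4 entry for KERNEL_START: indices 510, 510, 510, 510; offset PML4E(KERNEL_START) * 8 = 0
    (510 << 39) | (510 << 30) | (510 << 21) | (510 << 12) = 0xFF7FBFDFE000, sign-extended to 0xFFFFFF7FBFDFE000

PDP entry for KERNEL_START: indices 510, 510, 510, 0; offset PDPE(KERNEL_START) * 8 = 0
    (510 << 39) | (510 << 30) | (510 << 21) | (0 << 12) = 0xFF7FBFC00000, sign-extended to 0xFFFFFF7FBFC00000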

This is the first thing we tackle while we setup the PML4.

PML4

NOTE The following is a lot of basic pointer math in C, and it’s not too bad in assembly either. I’ve grouped the instructions such that EAX is always the address we’re writing to, and EDX is always the value we’re going to write to it.


    /* First, map the PML4 into itself*/

    /* eax = address of PML4[510]*/
    mov eax, ebx
    add eax, (8 * 510)

    /* edx = PML4 address + flags */
    mov edx, ebx
    or  edx, (PF_RW | PF_P)

    mov dword [eax], edx

8 being the size of each PML4 entry, 510 being the index. We chose 510, second to last, because the virtual kernel mapping is going to be in entry 511 so we’re grouping kernel resources at the far end of the address space.

Here we map both the target virtual address (VIRT_BASE | KERNEL_START => KERNEL_START) and the identity map (KERNEL_START => KERNEL_START).


    /* Now, map two PDPs, one where we want our kernel, and one
     * where we'll end up after we start paging but before we jump to our
     * kernel address. Here's the break down of our two addresses:
     *
     * KERNEL_START (0x100000) - where we are running when paging turns on
     * PML4E:  0
     * PDPE:   0
     * PDE:    0
     * PTE:    256
     *
     * VIRT_BASE | KERNEL_START - where we want to run
     * PML4E:  511
     * PDPE:   510
     * PDE:    0
     * PTE:    256
     *
     * We're going to be lazy and merge these together because we're mapping
     * the identical content and because we'll clean up immediately after
     * paging is enabled. Looking at the above it's clear that we should
     * eliminate PML4E[0] and PDPE[0], but leave the identical PDEs and PTEs
     * in place.
     *
     * We can also see that, if we are flexible in the number of PTs, we'd
     * have to have 512 of them before we'd have to allocate another PD. Since
     * 512 PTs can map 1GB of memory, I don't think that's an issue for our
     * kernel, thus we're safe hardcoding 1 PML4/PDP/PD page.
     */

    /* First, the scrap mapping */

    mov eax, ebx
    add eax, 8 * PML4E(KERNEL_START)

    mov edx, ebx
    add edx, PAGE_SIZE
    or  edx, (PF_RW | PF_P)

    mov dword [eax], edx

    /* Now, the real entry */

    mov eax, ebx
    add eax, 8 * PML4E(VIRT_BASE | KERNEL_START)

    mov dword [eax], edx

So now we’ve set up the PML4 @ EBX, which links to itself, as well as to the PDP we’re about to construct in the following page (with the PD in the page after that).

Page Directory Pointer


    /* Onto the PDP */

    mov eax, ebx
    add eax, (8 * PDPE(KERNEL_START)) + PAGE_SIZE

    mov edx, ebx
    add edx, 2*PAGE_SIZE
    or  edx, (PF_RW | PF_P)

    mov dword [eax], edx

    mov eax, ebx
    add eax, (8 * PDPE(VIRT_BASE | KERNEL_START)) + PAGE_SIZE

    mov edx, ebx
    add edx, 2*PAGE_SIZE
    or  edx, (PF_RW | PF_P)

    mov dword [eax], edx

Note the PAGE_SIZE offsets to skip over the PML4 in the pointer (EAX), as well as the 2*PAGE_SIZE offset in the data we’re writing. We’re being lazy, as I mentioned in the comment, by having a single PDP. Technically we’re creating four separate addresses that map the kernel here, but we just have to remember to clean it up when we’re done.
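
To spell those four out (my index arithmetic, not from the original code comments): with PML4 entries 0 and 511 sharing one PDP, and PDP entries 0 and 510 sharing one PD, the kernel is temporarily reachable from four virtual base addresses:

PML4E 0,   PDPE 0   -> 0x0                (the identity map)
PML4E 0,   PDPE 510 -> 0x7F80000000
PML4E 511, PDPE 0   -> 0xFFFFFF8000000000
PML4E 511, PDPE 510 -> 0xFFFFFFFF80000000 (VIRT_BASE, the mapping we actually want)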

Page Directory

Now we come across the first bit of paging initialization where we don’t know exactly how many page table structures we’re setting up at compile time. Fortunately, at run time we’ve already calculated where the last page table ends and kept that value in ECX, so we just write an entry into the Page Directory at EBX + 2*PAGE_SIZE (end of kernel + a page apiece for PML4/PDP) for every page-aligned address from EBX + 3*PAGE_SIZE (end of kernel + a page each for PML4/PDP/PD, aka the start of the Page Tables) up to, but not including, ECX.


    /* Now the PD, which is the same for both */

    mov eax, ebx
    add eax, (8 * PDE(KERNEL_START)) + (2*PAGE_SIZE)

    mov edx, ebx
    add edx, 3*PAGE_SIZE

    /* Remember ecx? It's end address of paging structures, use it to know when
     * we've mapped all the PTs
     */

write_next_pde:
    mov esi, edx
    or  esi, (PF_RW | PF_P)

    mov dword [eax], esi

    add eax, 8
    add edx, PAGE_SIZE
    cmp edx, ecx
    jne write_next_pde

Note that we bring in ESI here, but only so that the comparison between EDX and ECX can be made without having to OR our page flags into ECX and clean it up afterward. It’s just sitting there anyway.

We loop writing consecutive page directory entries from EDX up to ECX. These addresses are now our page tables.

Page Tables

Similar to the page directory setup, we’re not entirely sure how many page tables or entries we’ll need at compile time, but at runtime we know the details. We’ve set the stack at STACK_PAGES_PHYS, and we know exactly how many pages we’re going to use for it regardless of the size of the kernel, so we macro that. Since we’ve chosen STACK_PAGES_START to sit right below the kernel (STACK_PAGES_START = KERNEL_START - (S_PAGES * PAGE_SIZE)), the stack PTEs are adjacent to the kernel’s, and we can just keep EAX incrementing to write the PTEs for the kernel itself from KERNEL_START up to EBX.


    mov eax, ebx
    add eax, (8 * PTE(STACK_PAGES_START)) + (3*PAGE_SIZE)

%assign i 0
%rep S_PAGES
    mov dword [eax], ((STACK_PAGES_PHYS + PAGE_SIZE * i) | PF_RW | PF_P)
    add eax, 8
%assign i (i+1)
%endrep

    /* Whose PTEs are adjacent to the kernel's so we don't need to mess with
     * eax
     */

    mov edx, KERNEL_START

write_next_pte:
    mov esi, edx
    or  esi, (PF_RW | PF_P)

    mov dword [eax], esi

    add eax, 8
    add edx, PAGE_SIZE
    cmp edx, ebx
    jne write_next_pte

Throwing the Switch into Long Mode

Whew. It’s been quite a long post, with a lot to cover. Now that we’ve got our paging structures ready and our GDT/IDT self-hosted, we can finally get into Long Mode.

Section 14.6.1 of the AMD64 Programmer’s Manual v2 describes the process to enable Long Mode, which we’ll follow here.


    /* Enable CR4.PAE (bit 5) */
    mov eax, cr4
    or  eax, (1 << 5)
    mov cr4, eax

    /* Put PML4 address in CR3 */
    mov cr3, ebx

    /* Set EFER.LME (bit 8) to enable Long Mode
     * EFER is a model specific register (MSR).
     * To access MSRs, you place their "address" (not a memory
     * address) into ecx and call rdmsr to read the value into
     * edx:eax. wrmsr will also take edx:eax and write it to the MSR.
     */

    mov ecx, 0xc0000080
    rdmsr
    or  eax, (1 << 8)
    wrmsr

    /* Set CR0.PG (bit 31) to enable paging */
    mov eax, cr0
    or  eax, (1 << 31)
    mov cr0, eax

    /* Get into 64-bit to perform jump to 0xFFFF8...*/
    jmp 0x18:cleanup_32
cleanup_32:

    /* Jump to proper 0xF... address of cleanup_64 */
BITS 64

    mov rax, VIRT_BASE
    add rax, cleanup_64
    jmp rax

cleanup_64:

Not much to add on top of the comments, but note that once again we have to perform a jump in order to switch code selectors to the new 64-bit selector.

Now that we’re in 64-bit paging mode, we have just a few more housekeeping duties before we call main(). The stack registers are now invalid, so we update them before we do anything else. The fixup functions just write the new virtual addresses to the GDTR and IDTR and reload them.

Finally, remove the kernel identity map from the PML4, and the extra PDP entry, which reduces the number of kernel address ranges back down to one, where it should be, and jump into main.


    /* Update our stacks to the new paged mapping, resetting the stack */

    mov rax, (VIRT_BASE + (KERNEL_START - 8))
    mov rsp, rax
    mov rbp, rax

    /* Move GDTR / IDTR addresses to new 0xFFFF8....... */
    call fixup_gdtr
    call fixup_idtr

    /* Now that we are executing in 0xF... space, and our
     * stacks are there too, we can clean up the 0x0 kernel
     * mapping that had to be in place for us to successfully
     * return from setting CR0.PG
     */

    /* Unfortunately I can't just use the C macros directly */

    /* Also note that it's important to do this from the bottom up: on qemu (if
     * not hardware), the PDPE mapping disappears with the PML4E mapping, despite
     * cr3 not being reset.
     */

    /* mov rax, PDPE_ADDR(KERNEL_START) */
    mov rax, 0xffffff7fbfc00000
    mov dword [rax], 0x0

    /* mov rax, PML4E_ADDR(KERNEL_START) */
    mov rax, 0xffffff7fbfdfe000
    mov dword [rax], 0x0

    /* Reset CR3 to update. */
    mov rax, cr3
    mov cr3, rax

    /* Move to rax first to avoid linker relocation truncation. */
    mov rax, main

    call rax
    jmp $   // We should never get here.

The Code

To look at anything not covered explicitly here, like the *DTR fixup functions, where I got those PML4E/PDPE address values in the cleanup, or the build system, you can browse the code. The tag for this article is “long-mode”.

Next Time

The next section will be almost entirely C, thankfully, and will center around writing an industrial grade page allocator similar to Linux’s.
