Linux Kernel demand paging: mapping anonymous memory into a user process's address space.



This article is based on Linux v6.0.12 and the x86 architecture.


When a user process requests anonymous memory (memory not backed by a file) from the Kernel, for example when the C library function malloc issues a brk or mmap system call on its behalf, the Kernel expands the process's address space. A process's address space is defined by vm_area_struct instances, each of which describes a memory region. The address space is expanded by adding new memory regions or by growing existing ones.

When a user process's address space is expanded, the Kernel does not immediately allocate any physical memory for the new virtual addresses. Instead, the Kernel implements demand paging: a page is allocated from physical memory and mapped into the address space only when the user process tries to write to the new virtual address. If a page cannot be allocated because memory is tight, the Kernel reclaims memory, which can include swapping out pages that have not been used recently to an area on disk known as the swap space.
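Demand paging can be observed from user space. The following is a minimal sketch, not Kernel code: it maps anonymous memory with mmap, writes to a few pages, and asks the Kernel through mincore(2) how many pages are resident before and after. The helper names count_resident_pages and demand_paging_demo are made up for this illustration. On an ordinary Linux system I would expect the count to grow by the number of touched pages, though the exact numbers can differ in sandboxes or with unusual memory settings.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

#define MAX_DEMO_PAGES 64

/* Hypothetical helper: asks the Kernel, via mincore(2), how many pages of
 * the page-aligned range [addr, addr + len) are resident in RAM.
 * Returns -1 on error. */
static long count_resident_pages(void *addr, size_t len, size_t page_size)
{
    unsigned char vec[MAX_DEMO_PAGES];
    size_t npages = (len + page_size - 1) / page_size;

    if (npages > MAX_DEMO_PAGES || mincore(addr, len, vec) != 0)
        return -1;

    long resident = 0;
    for (size_t i = 0; i < npages; i++)
        resident += vec[i] & 1;          /* bit 0 set: page is resident */
    return resident;
}

/* Hypothetical demo: maps npages of anonymous memory, writes one byte to
 * the first `touch` pages, and reports residency before and after.
 * Returns 0 on success, -1 on error. */
static int demand_paging_demo(size_t npages, size_t touch,
                              long *before, long *after)
{
    assert(touch <= npages && npages <= MAX_DEMO_PAGES);

    size_t page_size = (size_t)sysconf(_SC_PAGESIZE);
    size_t len = npages * page_size;
    unsigned char *mem = mmap(NULL, len, PROT_READ | PROT_WRITE,
                              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED)
        return -1;

    *before = count_resident_pages(mem, len, page_size);
    for (size_t i = 0; i < touch; i++)
        mem[i * page_size] = 1;          /* first write triggers a page fault */
    *after = count_resident_pages(mem, len, page_size);

    munmap(mem, len);
    return (*before < 0 || *after < 0) ? -1 : 0;
}
```

Calling demand_paging_demo(8, 3, &before, &after) maps eight pages and touches three of them; comparing before and after shows which pages the Kernel actually backed with physical memory.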

In this article, I'll be taking a look at the Linux Kernel code that handles page faults. I want to see how the Kernel allocates page frames and maps them into the address space of user processes.

Code Investigation

Up until Linux v5.8, the entry point in x86 for page faults was the assembly routine page_fault. However, this commit in 2020 did some rearranging.

Figure 1

Now, the C entry point is DEFINE_IDTENTRY_RAW_ERRORCODE(exc_page_fault) in /arch/x86/mm/fault.c (code link).

For the 32-bit x86 architecture, the virtual memory address space has a 3:1 split: the lower 3 GiB are for user space addresses and the top 1 GiB is for Kernel space addresses. The Kernel uses some of this 1 GiB to directly map part of the physical address space. Because this upper portion is reserved for the Kernel in every address space, a Kernel thread can use the page tables of the last running process and avoid TLB and cache flushes.

Figure 2
Image source
Linux Memory Management Documentation.

Within DEFINE_IDTENTRY_RAW_ERRORCODE, we see a call to handle_page_fault(regs, error_code, address). This function checks whether the address that triggered the fault belongs to the Kernel's address space or the user's address space, and then calls do_kernel_addr_fault or do_user_addr_fault.

Figure 3
Code link
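The dispatch can be pictured with a small user-space model. This is a simplified sketch that assumes the 32-bit 3:1 split with Kernel space starting at 0xC0000000; the names USER_KERNEL_SPLIT and classify_fault_address are made up here, and the real Kernel check has extra cases (for example the vsyscall page on x86-64).

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 32-bit x86 with the default 3:1 split, so Kernel space
 * begins at 3 GiB. */
#define USER_KERNEL_SPLIT 0xC0000000u

enum fault_target {
    USER_ADDR_FAULT,     /* would go to do_user_addr_fault */
    KERNEL_ADDR_FAULT    /* would go to do_kernel_addr_fault */
};

/* Hypothetical helper: decides which handler a faulting address belongs
 * to, based only on which side of the split it falls. */
static enum fault_target classify_fault_address(uint32_t address)
{
    return address >= USER_KERNEL_SPLIT ? KERNEL_ADDR_FAULT
                                        : USER_ADDR_FAULT;
}
```

An address such as 0x08048000 falls below the split and is treated as a user address, while 0xC0000000 and above belong to the Kernel.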

I'm interested in the case where a user process attempts to access a valid address in its virtual address space for the first time. Thus, I'm going to skip to do_user_addr_fault (code link). This function begins with a few checks:

  1. Is Kernel mode code trying to execute from user memory?
  2. Are reserved bits set in a page table entry?
  3. If SMAP is enabled, is this an invalid Kernel access to pages in the user address space?
  4. Are we in an interrupt, or otherwise without a user context? If so, the fault is not handled through the normal path.
  5. Is it safe to enable IRQs?
  6. Is this a fault in a vsyscall page that needs emulation?
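Several of these checks look at the error code that the CPU supplies with the page fault. The sketch below decodes the low bits of that error code as the x86 architecture defines them; the struct pf_info and the function decode_pf_error_code are illustrative names, not Kernel symbols.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Low bits of the x86 page fault error code. */
enum {
    PF_PROT  = 1u << 0,  /* 0: page not present, 1: protection violation */
    PF_WRITE = 1u << 1,  /* 0: read access,      1: write access */
    PF_USER  = 1u << 2,  /* 0: Kernel mode,      1: user mode */
    PF_RSVD  = 1u << 3,  /* reserved bit set in a paging structure */
    PF_INSTR = 1u << 4   /* fault was an instruction fetch */
};

/* Illustrative, made-up container for the decoded bits. */
struct pf_info {
    bool page_present;
    bool is_write;
    bool from_user;
    bool reserved_bit;
    bool instr_fetch;
};

static struct pf_info decode_pf_error_code(uint32_t error_code)
{
    struct pf_info info = {
        .page_present = (error_code & PF_PROT) != 0,
        .is_write     = (error_code & PF_WRITE) != 0,
        .from_user    = (error_code & PF_USER) != 0,
        .reserved_bit = (error_code & PF_RSVD) != 0,
        .instr_fetch  = (error_code & PF_INSTR) != 0,
    };
    return info;
}
```

The case this article follows, a user process writing to a page that has not been mapped yet, corresponds to an error code with the write and user bits set and the present bit clear (0x6).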

After performing the above checks and acquiring the necessary locks, the Kernel acquires the memory descriptor, mm_struct, of the task that was executing when the page fault occurred, and looks for the memory region that contains the faulting address. Most of this logic happens in the find_vma function.

Figure 4
Code link

struct vm_area_struct *find_vma(struct mm_struct *mm, unsigned long addr) is an interesting and important function. It's defined in /mm/mmap.c (code link). From "Understanding the Linux Kernel"[3]:
The find_vma() function acts on two parameters, the address mm of a memory descriptor and a linear address addr. It locates the first memory region whose vm_end field is greater than addr and returns the address of its descriptor.

Figure 5
Code link

This function reveals one cool thing about the way these memory region descriptors are stored. In addition to being kept in a sorted linked list, they're also stored in a red-black tree. We get the root of the tree with the statement
rb_node = mm->mm_rb.rb_node;
and then walk down the tree, which amounts to a binary search. In a process with many memory regions, reducing the time complexity of this lookup from linear to logarithmic is very important.
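The search rule itself is easy to try outside the Kernel. The following sketch swaps the red-black tree for a sorted array, which is my simplification and not how the Kernel stores regions, but it applies the same decision at each step: remember a candidate whenever its end is greater than the address, stop if the address lies inside it, otherwise go left; if the end is not greater, go right. The struct region and sketch_find_vma names are made up.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for vm_area_struct: covers [start, end). */
struct region {
    unsigned long start;
    unsigned long end;
};

/* Returns the first region whose end is greater than addr, or NULL.
 * Assumes `regions` is sorted by address and non-overlapping. */
static const struct region *sketch_find_vma(const struct region *regions,
                                            size_t count,
                                            unsigned long addr)
{
    const struct region *found = NULL;
    size_t lo = 0, hi = count;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (regions[mid].end > addr) {
            found = &regions[mid];         /* candidate */
            if (regions[mid].start <= addr)
                break;                     /* addr is inside this region */
            hi = mid;                      /* look for an earlier region */
        } else {
            lo = mid + 1;                  /* region ends before addr */
        }
    }
    return found;
}
```

Note that the returned region does not necessarily contain the address: if the address falls in a gap, the next region after the gap is returned, which is why the caller still has to compare the region's start with the address.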

Let's step back up to figure 4, the code where we call find_vma. If we assume that the address is within a valid memory region of the process, we will enter the conditional if (likely(vma->vm_start <= address)) and goto good_area.

Figure 6

First we check whether there was an access error, e.g. the process tried to write to the address but the memory region doesn't grant write permission. Assuming permissions are good, we call handle_mm_fault(vma, address, flags, regs), which does some more checks and then calls __handle_mm_fault(vma, address, flags). This function is located in /mm/memory.c and is common to all architectures (code link). It is in charge of allocating the page tables and page table entries for the virtual address that triggered the page fault.

Figure 7: The beginning of __handle_mm_fault

The vm_fault instance is used to store information about the page fault and pass it to handle_pte_fault. Five-level paging is now supported in many Intel processors. With five-level paging, a virtual address is split up the following way:

Figure 8
Image Source
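As a rough illustration of that split, the sketch below extracts the five table indices and the page offset from a 64-bit virtual address. It assumes the x86-64 layout with 4 KiB pages, where each level consumes 9 bits and the offset takes the low 12 bits; the struct va_parts and split_virtual_address names are made up.

```c
#include <assert.h>
#include <stdint.h>

/* Made-up container for the pieces of a five-level virtual address. */
struct va_parts {
    unsigned pgd;     /* bits 56..48: index into the Page Global Directory */
    unsigned p4d;     /* bits 47..39: index into the P4 Directory */
    unsigned pud;     /* bits 38..30: index into the Page Upper Directory */
    unsigned pmd;     /* bits 29..21: index into the Page Middle Directory */
    unsigned pte;     /* bits 20..12: index into the page table */
    unsigned offset;  /* bits 11..0:  byte offset inside the page */
};

static struct va_parts split_virtual_address(uint64_t va)
{
    struct va_parts parts = {
        .pgd    = (unsigned)((va >> 48) & 0x1FF),
        .p4d    = (unsigned)((va >> 39) & 0x1FF),
        .pud    = (unsigned)((va >> 30) & 0x1FF),
        .pmd    = (unsigned)((va >> 21) & 0x1FF),
        .pte    = (unsigned)((va >> 12) & 0x1FF),
        .offset = (unsigned)(va & 0xFFF),
    };
    return parts;
}
```

Each 9-bit index selects one of 512 entries in a table, and the entry found at one level points to the table used at the next level.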

Let's look at several key lines of __handle_mm_fault.

pgd = pgd_offset(mm, address);
This function returns the virtual address of the entry in the Page Global Directory that corresponds to the address.

p4d = p4d_alloc(mm, pgd, address);
This function allocates or returns an existing P4 Directory entry for the given address. Since P4 Directories are indexed by Page Global Directory entries, we're required to pass in the Page Global Directory entry that we found above. My assumption is that for architectures without five-level paging, this just returns the Page Global Directory entry.

vmf.pud = pud_alloc(mm, p4d, address);
This function allocates or returns an existing Page Upper Directory entry for the given address. This entry is attached to the vm_fault instance.

vmf.pmd = pmd_alloc(mm, vmf.pud, address);
This function allocates or returns an existing Page Middle Directory entry for the given address. This entry is also attached to the vm_fault instance. This PMD entry allows us to access the last level in our paging scheme: the page table.

Once the upper-level directory entries needed to map the faulting virtual memory address are in place, we call return handle_pte_fault(&vmf) (code link).
Let's look at the following lines from the handle_pte_fault function:

Figure 9

pte_offset_map accepts a pointer to a Page Middle Directory entry and a virtual address. It returns the address of the page table entry that maps the virtual address. Assuming that this is the first time the user process has tried accessing the address, the returned address would point to an empty, or uninitialized, page table entry. Let's look further down in the same function.

Figure 10

Since the page table entry hasn't been initialized, !vmf->pte will return true. Let's assume that the virtual address is part of an anonymous memory region, which means it's not backed by a file, as would be the case after a process attempts to use an address returned by the C stdlib function malloc. We'll enter do_anonymous_page (code link).

One of the first things do_anonymous_page does is make sure that a page table exists for the faulting address.

Figure 11

pte_alloc accepts a memory descriptor and the address of a PMD entry [3]. In this Kernel version, if the PMD entry is still empty, it allocates a new page table and installs it in the PMD entry, and it returns nonzero on failure. The page table entry for our faulting address is still empty at this point, so its address field has not been set yet.

On the Intel x86 architecture, a page table entry has the following format:

Figure 12
Image source

The address field points to a physical memory address. This is the address that the memory management unit translates the virtual address associated with this page table entry to. This field leaves out the least significant 12 bits of the physical address. This is because page frames are aligned to page boundaries, which are at intervals of 4096 bytes (2^12). Thus, the memory management unit can assume that the lower 12 bits of the page frame's address are always 0.
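To make the role of the address field concrete, here is a small sketch of the final translation step. It assumes the classic 32-bit, non-PAE entry format with 4 KiB pages, where bits 31..12 hold the page frame address and bit 0 is the Present flag; the names pte32_present and pte32_translate are made up.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PTE32_PRESENT      (1u << 0)     /* Present flag */
#define PTE32_ADDR_MASK    0xFFFFF000u   /* bits 31..12: page frame address */
#define PAGE_OFFSET_MASK32 0x00000FFFu   /* bits 11..0: offset inside page */

static bool pte32_present(uint32_t pte)
{
    return (pte & PTE32_PRESENT) != 0;
}

/* Combines the page frame address stored in the entry with the low
 * 12 bits of the virtual address to form the physical address. */
static uint32_t pte32_translate(uint32_t pte, uint32_t vaddr)
{
    assert(pte32_present(pte));
    return (pte & PTE32_ADDR_MASK) | (vaddr & PAGE_OFFSET_MASK32);
}
```

Because the frame address always ends in twelve zero bits, the same 12 bits of the entry are free to carry the flag bits.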

Further down in do_anonymous_page, after the page table entry has been allocated, we see the following code:

Figure 13

This is the code that handles page faults on reads. We already know that the faulting virtual address maps to anonymous memory and that this is the first time this virtual address was accessed, which is why the page fault occurred. Thus, I think that the Kernel maps the page table entry to a common "Zero Page" and makes this page write protected. This will cause another Page Fault to occur when this page is written to. This is an example of Copy on Write behavior [1].

Right below the above code is where we handle page faults on writes. It starts with the following block:

Figure 14

The conditional if (unlikely(anon_vma_prepare(vma))) is used to make sure that the memory region described by vma has an anon_vma attached to it. This data structure is used to collect the regions which reference a given anonymous memory region into a doubly linked circular list. I'm not sure which other code path makes use of this list.

The allocation of anon_vma will only need to happen the first time we map a page frame into the anonymous region [3]. Therefore, the common case is that anon_vma already exists, which is why we give the compiler a "heads up" with unlikely.
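The hint is built on a compiler builtin. The sketch below shows macros equivalent to the Kernel's likely and unlikely, under different names to avoid any clash, together with a made-up function prepare_once that follows the same "slow path only the first time" pattern. It relies on __builtin_expect, so it assumes GCC or Clang.

```c
#include <assert.h>

/* Same shape as the Kernel's likely()/unlikely(): the !! normalizes the
 * condition to 0 or 1, and the second argument is the expected value. */
#define sketch_likely(x)   __builtin_expect(!!(x), 1)
#define sketch_unlikely(x) __builtin_expect(!!(x), 0)

/* Hypothetical example: performs one-time setup on the first call only.
 * Returns 1 if setup ran, 0 if it had already been done. */
static int prepare_once(int *prepared)
{
    if (sketch_unlikely(!*prepared)) {
        *prepared = 1;    /* rare path: first call */
        return 1;
    }
    return 0;             /* common path: already prepared */
}
```

The macros do not change the result of the condition; they only tell the compiler which branch to lay out as the fall-through path.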

The macro alloc_zeroed_user_highpage_movable(vma, vmf->address) is defined as alloc_page_vma(GFP_HIGHUSER_MOVABLE | __GFP_ZERO, vma, vaddr). This function will allocate a zeroed out page frame for us. This is a physical page of memory!
The Kernel code has a comment that describes the GFP_HIGHUSER_MOVABLE flag:

"GFP_HIGHUSER_MOVABLE is for userspace allocations that the Kernel does not need direct access to but can use kmap() when access is required."

The kmap() function is used by Kernel code to map high memory pages into the Kernel's virtual address space. Recall that (on 32-bit x86) low memory is the section of the Kernel virtual address space which directly maps the first 896 MB of physical memory, while high memory is physical memory that is not directly mapped into the Kernel's virtual address space, and which must be temporarily mapped into it.
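The direct mapping of low memory is just a fixed offset. The sketch below models it for 32-bit x86, assuming the default 3:1 split (Kernel space starting at 0xC0000000) and an 896 MiB low memory limit; DIRECT_MAP_BASE, LOWMEM_LIMIT and both function names are made up for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumptions: 32-bit x86, default 3:1 split, 896 MiB of low memory. */
#define DIRECT_MAP_BASE 0xC0000000u
#define LOWMEM_LIMIT    (896u << 20)

static bool phys_is_lowmem(uint32_t phys)
{
    return phys < LOWMEM_LIMIT;
}

/* Low memory is reachable by adding a constant to the physical address.
 * High memory has no such fixed Kernel virtual address. */
static uint32_t lowmem_phys_to_virt(uint32_t phys)
{
    assert(phys_is_lowmem(phys));
    return phys + DIRECT_MAP_BASE;
}
```

A page frame above the limit fails the phys_is_lowmem check, which is the situation where the Kernel needs a temporary mapping instead.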

The other flag, __GFP_ZERO, makes sure that we are given zeroed memory.

After the physical memory page is allocated, we create the page table entry value that we will insert into our newly allocated page table entry position:

Figure 15
Code link

The function mk_pte generates a page table entry value that points to the newly allocated page. Refer back to figure 12 for the format of page table entries. Let's take a deeper look at this function.

This function, mk_pte, has to get the physical address of the page in memory that the struct page instance is associated with. Then, it has to add in the page protection and status bits (vma->vm_page_prot) with simple bitwise operations.

Before we look at the source code for this function, let's recall how the Kernel stores struct page instances. In a FLATMEM memory model, there is a global mem_map array that maps the entire physical memory [5]. Each index in the array corresponds to a page frame number (PFN). For example, index 2 contains the struct page instance for page frame number 2. Now, assuming a 32-bit system with 4 GiB of memory and 4096 byte pages, then you will have 2^20 (1,048,576) page frames. Thus, this will be the number of page instances within mem_map array.

If you have a struct page instance, and you know its page frame number (pfn), then you can perform some arithmetic (pfn * 4096 bytes) to get the physical address that the struct page corresponds to.

Let's look at the source code for mk_pte:

Figure 16
Code link

page_to_pfn in a FLATMEM model does some arithmetic to get the page frame number associated with a page:

((unsigned long)((page) - mem_map) + ARCH_PFN_OFFSET)

It subtracts the address of the first element in the mem_map array, which contains the struct page instance for the 0th page frame, from the address of the given page.
ARCH_PFN_OFFSET defines the first page frame number for systems with physical memory starting at addresses different from 0 [5]. In our case, we can assume that this value is zero.

Due to the behavior of C pointer arithmetic, the expression above will return us the index of the page instance within the mem_map array. This index is the page frame number! As seen in figure 16, this page frame number and the page protection flags are then passed to the pfn_pte function. This function generates a page table entry which points to the physical address of the page frame and has the page protection flags set.
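The pointer arithmetic is worth trying in isolation. This is a toy model of the FLATMEM case: toy_page, toy_mem_map and the two conversion functions are made-up stand-ins with far fewer fields and entries than the real thing, and the offset is assumed to be zero.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for struct page; the real structure is much larger. */
struct toy_page {
    unsigned long flags;
    int refcount;
};

#define TOY_NR_PAGES        16
#define TOY_ARCH_PFN_OFFSET 0UL   /* assumption: memory starts at frame 0 */

/* One descriptor per page frame, indexed by page frame number. */
static struct toy_page toy_mem_map[TOY_NR_PAGES];

/* Subtracting two pointers into the same array yields an element count,
 * not a byte count, so the result is the index of the descriptor. */
static unsigned long toy_page_to_pfn(const struct toy_page *page)
{
    return (unsigned long)(page - toy_mem_map) + TOY_ARCH_PFN_OFFSET;
}

static struct toy_page *toy_pfn_to_page(unsigned long pfn)
{
    return toy_mem_map + (pfn - TOY_ARCH_PFN_OFFSET);
}
```

The byte distance between toy_mem_map[2] and toy_mem_map[0] is 2 * sizeof(struct toy_page), but the pointer subtraction divides that out and returns 2.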

Let's look at the source code for pfn_pte:

Figure 17
Code link

In the first line, we calculate the physical address of the page frame. By shifting the page frame number to the left PAGE_SHIFT bits, we are multiplying the page frame number by the page frame size, 4096 bytes. This gives us the physical address of the page frame.
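The shift-and-combine step can be sketched in a few lines. This is a simplified model, not the Kernel's pfn_pte: it assumes 4 KiB pages and leaves out the masking and other adjustments the real x86 code performs. The toy_pte_t type and toy_pfn_pte function are made up.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT 12   /* assumption: 4096-byte pages */

/* Made-up wrapper type, similar in spirit to the Kernel's pte_t. */
typedef struct {
    uint64_t pte;
} toy_pte_t;

/* Builds an entry value: physical address of the frame in the upper
 * bits, protection and status flags in the low 12 bits. */
static toy_pte_t toy_pfn_pte(uint64_t pfn, uint64_t prot)
{
    toy_pte_t entry = { (pfn << TOY_PAGE_SHIFT) | prot };
    return entry;
}
```

For page frame number 2, the shift produces the physical address 0x2000, and the flag bits are then ORed into the twelve zero bits at the bottom.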

Now that we have a page table entry which maps our physical address, we can step back up to do_anonymous_page:

Figure 18
Code link

entry = mk_pte(page, vma->vm_page_prot);
This is what we just covered above. It creates a page table entry for the given page with the proper page protection flags.

entry = pte_sw_mkyoung(entry)
I think that on most architectures, the hardware handles setting a page table entry's Accessed bit. However, some architectures require software to do this manually. For these latter architectures, I think that this function will set the Accessed bit, while for the former architectures, it is a no-op.

entry = pte_mkwrite(pte_mkdirty(entry))
If the memory region is writable (VM_WRITE is set in vma->vm_flags), which is the case for our write fault, we set the Write and Dirty flags on the page table entry to indicate that this page frame is writable and that it has been written to [2] [4].
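These helpers boil down to setting single bits in the entry. The sketch below models them with the x86 flag positions (Present bit 0, Read/Write bit 1, User bit 2, Accessed bit 5, Dirty bit 6); the flag_pte_* names are made up, and the real helpers operate on a pte_t rather than a bare integer.

```c
#include <assert.h>
#include <stdint.h>

/* x86 page table entry flag bits. */
#define FLAG_PTE_PRESENT  (UINT64_C(1) << 0)
#define FLAG_PTE_RW       (UINT64_C(1) << 1)
#define FLAG_PTE_USER     (UINT64_C(1) << 2)
#define FLAG_PTE_ACCESSED (UINT64_C(1) << 5)
#define FLAG_PTE_DIRTY    (UINT64_C(1) << 6)

/* Each helper returns a copy of the entry with one more flag set,
 * which is why the calls can be nested. */
static uint64_t flag_pte_mkwrite(uint64_t pte) { return pte | FLAG_PTE_RW; }
static uint64_t flag_pte_mkdirty(uint64_t pte) { return pte | FLAG_PTE_DIRTY; }
static uint64_t flag_pte_mkyoung(uint64_t pte) { return pte | FLAG_PTE_ACCESSED; }
```

Starting from an entry with only Present and User set, flag_pte_mkwrite(flag_pte_mkdirty(entry)) adds the Read/Write and Dirty bits and leaves the address field untouched.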

If we skip to the end of the do_anonymous_page function, we see the following code:

Figure 19
Code link

Recall that vmf->pte is the page table entry that the faulting address, vmf->address, maps to. Thus, it's safe to assume that
set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry)
sets the actual value of the page table entry to the value of the entry that we created above.

update_mmu_cache(vma, vmf->address, vmf->pte)
This looks like a no-op on x86.

I think that this just frees up memory associated with the in-memory representation of the page.


And there you have it. We traced the page faults from handle_page_fault to the end of do_anonymous_page. Along the way, we saw the Kernel