## S Case studies: x86, xv6, Linux

Diagrams from:

- Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 3: System Programming Guide
- AMD64 Architecture Programmer's Manual, Volume 2: System Programming
- xv6 Commentary



### x86 VM support

- x86 (aka IA-32) supports segmentation & paging in 32-bit protected mode
- x86-64 (aka IA-32e) introduces 64-bit (nominal) mode
  - Segmentation is mostly deprecated in favor of paging
- Support for coexisting normal and "huge" pages



# 32-bit segmentation + paging



**College of Computing** 



### 32-bit segmentation



### **Segmentation registers**

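| Register | Visible Part     | Hidden Part                             |
|----------|------------------|-----------------------------------------|
| CS       | Segment Selector | Base Address, Limit, Access Information |
| SS       | Segment Selector | Base Address, Limit, Access Information |
| DS       | Segment Selector | Base Address, Limit, Access Information |
| ES       | Segment Selector | Base Address, Limit, Access Information |
| FS       | Segment Selector | Base Address, Limit, Access Information |
| GS       | Segment Selector | Base Address, Limit, Access Information |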

### **Segment descriptor format**

High doubleword (byte offset 4):

| 31:24      | 23 | 22  | 21 | 20  | 19:16            | 15 | 14:13 | 12 | 11:8 | 7:0        |
|------------|----|-----|----|-----|------------------|----|-------|----|------|------------|
| Base 31:24 | G  | D/B | L  | AVL | Seg. Limit 19:16 | P  | DPL   | S  | Type | Base 23:16 |

Low doubleword (byte offset 0):

| 31:16              | 15:0                |
|--------------------|---------------------|
| Base Address 15:00 | Segment Limit 15:00 |

- L: 64-bit code segment (IA-32e mode only)
- AVL: Available for use by system software
- BASE: Segment base address
- D/B: Default operation size (0 = 16-bit segment; 1 = 32-bit segment)
- DPL: Descriptor privilege level
- G: Granularity
- LIMIT: Segment limit
- P: Segment present
- S: Descriptor type (0 = system; 1 = code or data)
- TYPE: Segment type

### **ILLINOIS TECH College of Computing**


### 32-bit xv6 segment initialization


```
struct segdesc {
 uint lim_15_0 : 16; // Low bits of segment limit
 uint base_15_0 : 16; // Low bits of segment base address
 uint base_23_16 : 8; // Middle bits of segment base address
 uint type : 4; // Segment type (see STS_ constants)
 uint s : 1; // 0 = system, 1 = application
 uint dpl : 2; // Descriptor Privilege Level
 uint p : 1; // Present
 uint lim_19_16 : 4; // High bits of segment limit
 uint avl : 1; // Unused (available for software use)
 uint rsv1 : 1; // Reserved
 uint db : 1;  // 0 = 16-bit segment, 1 = 32-bit segment
 uint g : 1; 	// Granularity: limit scaled by 4K when set
 uint base_31_24 : 8; // High bits of segment base address
};
```

```
#define SEG(type, base, lim, dpl) (struct segdesc)    \
{ ((lim) >> 12) & 0xffff, (uint)(base) & 0xffff,      \
  ((uint)(base) >> 16) & 0xff, type, 1, dpl, 1,       \
  (uint)(lim) >> 28, 0, 0, 1, 1, (uint)(base) >> 24 }
```
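To see how the `SEG` macro scatters a base and limit across the descriptor fields, the sketch below repeats the structure and macro and adds two helpers that reassemble them. The helper names `seg_base` and `seg_last_offset` are hypothetical (they are not part of xv6), the `STA_*` values are the ones in xv6's `mmu.h`, and the bit-field layout assumes GCC or Clang on x86, where fields are packed starting at the least significant bit.

```c
#include <stdint.h>

typedef unsigned int uint;

/* Same field order as the segdesc structure shown above. */
struct segdesc {
  uint lim_15_0 : 16;
  uint base_15_0 : 16;
  uint base_23_16 : 8;
  uint type : 4;
  uint s : 1;
  uint dpl : 2;
  uint p : 1;
  uint lim_19_16 : 4;
  uint avl : 1;
  uint rsv1 : 1;
  uint db : 1;
  uint g : 1;
  uint base_31_24 : 8;
};

#define SEG(type, base, lim, dpl) (struct segdesc)    \
{ ((lim) >> 12) & 0xffff, (uint)(base) & 0xffff,      \
  ((uint)(base) >> 16) & 0xff, type, 1, dpl, 1,       \
  (uint)(lim) >> 28, 0, 0, 1, 1, (uint)(base) >> 24 }

/* Segment type bits (values from xv6's mmu.h). */
#define STA_X 0x8 /* executable segment */
#define STA_W 0x2 /* writeable (non-executable segments) */
#define STA_R 0x2 /* readable (executable segments) */

/* Hypothetical helper: rebuild the 32-bit base from its three pieces. */
uint32_t seg_base(struct segdesc d)
{
  return d.base_15_0 | (d.base_23_16 << 16) | ((uint32_t)d.base_31_24 << 24);
}

/* Hypothetical helper: last valid byte offset in the segment.
 * With G set, the 20-bit limit counts 4K units. */
uint32_t seg_last_offset(struct segdesc d)
{
  uint32_t lim = d.lim_15_0 | ((uint32_t)d.lim_19_16 << 16);
  return d.g ? (lim << 12) | 0xfff : lim;
}
```

For a flat descriptor such as `SEG(STA_X|STA_R, 0, 0xffffffff, 0)`, `seg_base` returns 0 and `seg_last_offset` returns `0xffffffff`, so the segment covers the whole 4GB linear address space.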




```
void
seginit(void)
{
  struct cpu *c;

  c = &cpus[cpuid()];
  c->gdt[SEG_KCODE] = SEG(STA_X|STA_R, 0, 0xffffffff, 0);
  c->gdt[SEG_KDATA] = SEG(STA_W, 0, 0xffffffff, 0);
  c->gdt[SEG_UCODE] = SEG(STA_X|STA_R, 0, 0xffffffff, DPL_USER);
  c->gdt[SEG_UDATA] = SEG(STA_W, 0, 0xffffffff, DPL_USER);
  lgdt(c->gdt, sizeof(c->gdt));
}
```




### Flat vs. Multi-segment models







### 32-bit 4KB vs. 4MB pages







### 32-bit 4KB xv6 page table walk/alloc



```
static pte_t *
walkpgdir(pde_t *pgdir, const void *va, int alloc)
{
  pde_t *pde;
  pte_t *pgtab;

  pde = &pgdir[PDX(va)];                  // page-directory entry for va
  if(*pde & PTE_P){
    pgtab = (pte_t*)P2V(PTE_ADDR(*pde));  // page table already exists
  } else {
    if(!alloc || (pgtab = (pte_t*)kalloc()) == 0)
      return 0;
    memset(pgtab, 0, PGSIZE);             // every PTE starts out not present
    *pde = V2P(pgtab) | PTE_P | PTE_W | PTE_U;
  }
  return &pgtab[PTX(va)];
}
```
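The walk above indexes the page directory with `PDX(va)` and the page table with `PTX(va)`. As a minimal sketch of that split for 32-bit paging with 4KB pages, the functions below extract the 10-bit directory index, the 10-bit table index, and the 12-bit page offset. The lower-case names `pdx`, `ptx`, `pgoff`, and `va_join` are hypothetical stand-ins written as functions; xv6 itself defines `PDX` and `PTX` as macros.

```c
#include <stdint.h>

#define PTXSHIFT 12 /* offset of the page-table index in a linear address */
#define PDXSHIFT 22 /* offset of the page-directory index in a linear address */

/* Page-directory index: bits 31:22. */
uint32_t pdx(uint32_t va) { return (va >> PDXSHIFT) & 0x3FF; }

/* Page-table index: bits 21:12. */
uint32_t ptx(uint32_t va) { return (va >> PTXSHIFT) & 0x3FF; }

/* Byte offset within the 4KB page: bits 11:0. */
uint32_t pgoff(uint32_t va) { return va & 0xFFF; }

/* Rebuild a linear address from its three components. */
uint32_t va_join(uint32_t d, uint32_t t, uint32_t off)
{
  return (d << PDXSHIFT) | (t << PTXSHIFT) | off;
}
```

For example, `0x80100ABC` splits into directory index `0x200`, table index `0x100`, and offset `0xABC`, and `va_join(0x200, 0x100, 0xABC)` gives back the original address.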







### 32-bit PDE/PTE

### CR3 and paging structure entries

[Figure: formats of CR3 and the paging-structure entries with 32-bit paging: CR3 (address of page directory), PDE mapping a large page, PDE referencing a page table, PDE not present, PTE mapping a 4KB page, PTE not present]

Format of a page-table entry (PTE) that maps a 4-KByte page:

| Bit position(s) | Contents |
|-----------------|----------|
| 0 (P)           | Present; must be 1 to map a 4-KByte page |
| 1 (R/W)         | Read/write; if 0, writes may not be allowed to the 4-KByte page referenced by this entry (see Section 4.6) |
| 2 (U/S)         | User/supervisor; if 0, user-mode accesses are not allowed to the 4-KByte page referenced by this entry (see Section 4.6) |
| 3 (PWT)         | Page-level write-through; indirectly determines the memory type used to access the 4-KByte page referenced by this entry (see Section 4.9) |
| 4 (PCD)         | Page-level cache disable; indirectly determines the memory type used to access the 4-KByte page referenced by this entry (see Section 4.9) |
| 5 (A)           | Accessed; indicates whether software has accessed the 4-KByte page referenced by this entry (see Section 4.8) |
| 6 (D)           | Dirty; indicates whether software has written to the 4-KByte page referenced by this entry (see Section 4.8) |
| 7 (PAT)         | If the PAT is supported, indirectly determines the memory type used to access the 4-KByte page referenced by this entry (see Section 4.9.2); otherwise, reserved (must be 0) |
| 8 (G)           | Global; if CR4.PGE = 1, determines whether the translation is global (see Section 4.10); ignored otherwise |
| 11:9            | Ignored |
| 31:12           | Physical address of the 4-KByte page referenced by this entry |
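As an illustration of the bit layout in the table, here is a small sketch that builds and decodes a 32-bit PTE for a 4KB page. `PTE_P`, `PTE_W`, and `PTE_U` match the flag names xv6 uses; `PTE_A`, `PTE_D`, the `pte_fields` structure, and the two functions are hypothetical names introduced only for this example.

```c
#include <stdbool.h>
#include <stdint.h>

#define PTE_P 0x001u /* bit 0: present */
#define PTE_W 0x002u /* bit 1: read/write */
#define PTE_U 0x004u /* bit 2: user/supervisor */
#define PTE_A 0x020u /* bit 5: accessed (hypothetical name) */
#define PTE_D 0x040u /* bit 6: dirty (hypothetical name) */

/* Hypothetical decoded view of a PTE. */
struct pte_fields {
  bool present, writable, user, accessed, dirty;
  uint32_t frame; /* physical address of the 4KB page, bits 31:12 */
};

/* Combine a page-aligned physical address with permission flags. */
uint32_t pte_make(uint32_t pa, uint32_t perm)
{
  return (pa & ~0xFFFu) | (perm & 0xFFFu) | PTE_P;
}

/* Split a PTE into the fields listed in the table. */
struct pte_fields pte_decode(uint32_t pte)
{
  struct pte_fields f;
  f.present  = (pte & PTE_P) != 0;
  f.writable = (pte & PTE_W) != 0;
  f.user     = (pte & PTE_U) != 0;
  f.accessed = (pte & PTE_A) != 0;
  f.dirty    = (pte & PTE_D) != 0;
  f.frame    = pte & ~0xFFFu;
  return f;
}
```

`pte_make(0x00123000, PTE_W | PTE_U)` produces `0x00123007`: the frame address in the upper 20 bits and the P, R/W, and U/S flags in the low bits.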

### 4KB PTE breakdown



```
static int
mappages(pde_t *pgdir, void *va, uint size, uint pa, int perm)
{
  char *a, *last;
  pte_t *pte;

  a = (char*)PGROUNDDOWN((uint)va);
  last = (char*)PGROUNDDOWN(((uint)va) + size - 1);
  for(;;){
    if((pte = walkpgdir(pgdir, a, 1)) == 0)
      return -1;
    if(*pte & PTE_P)
      panic("remap");
    *pte = pa | perm | PTE_P;
    if(a == last)
      break;
    a += PGSIZE;
    pa += PGSIZE;
  }
  return 0;
}
```






```
#define EXTMEM   0x100000            // Start of extended memory
#define PHYSTOP  0xE000000           // Top physical memory
#define DEVSPACE 0xFE000000          // Other devices are at high addresses

#define KERNBASE 0x80000000          // First kernel virtual address
#define KERNLINK (KERNBASE+EXTMEM)   // Address where kernel is linked

#define PGSIZE 4096                  // bytes mapped by a page
#define PGROUNDUP(sz)  (((sz)+PGSIZE-1) & ~(PGSIZE-1))
#define PGROUNDDOWN(a) (((a)) & ~(PGSIZE-1))

#define V2P(a) (((uint) (a)) - KERNBASE)
#define P2V(a) ((void *)(((char *) (a)) + KERNBASE))
```

```
// This table defines the kernel's mappings, present in every process
static struct kmap {
  void *virt;
  uint phys_start;
  uint phys_end;
  int perm;
} kmap[] = {
 { (void*)KERNBASE, 0,             EXTMEM,    PTE_W}, // I/O space
 { (void*)KERNLINK, V2P(KERNLINK), V2P(data), 0},     // kern text+rodata
 { (void*)data,     V2P(data),     PHYSTOP,   PTE_W}, // kern data+memory
 { (void*)DEVSPACE, DEVSPACE,      0,         PTE_W}, // more devices
};
```

```
// Set up kernel part of a page table.
pde_t*
setupkvm(void)
{
  pde_t *pgdir;
  struct kmap *k;

  pgdir = (pde_t*)kalloc();
  memset(pgdir, 0, PGSIZE);
  if (P2V(PHYSTOP) > (void*)DEVSPACE)
    panic("PHYSTOP too high");
  for(k = kmap; k < &kmap[NELEM(kmap)]; k++)
    mappages(pgdir, k->virt, k->phys_end - k->phys_start,
             (uint)k->phys_start, k->perm);
  return pgdir;
}
```
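The kernel mappings rely on a fixed offset between kernel virtual addresses and physical addresses, and on rounding addresses to page boundaries. The following sketch restates that arithmetic as plain functions over `uint32_t` so it can run on any host. It assumes xv6's `KERNBASE` of `0x80000000`; the function names, including `pages_spanned`, are hypothetical and only mirror what the `V2P`, `P2V`, `PGROUNDUP`, and `PGROUNDDOWN` macros and the `mappages` loop compute.

```c
#include <stdint.h>

#define KERNBASE 0x80000000u /* assumed: first kernel virtual address in xv6 */
#define PGSIZE   4096u       /* bytes mapped by a page */

/* Kernel virtual address to physical address (mirrors V2P). */
uint32_t v2p(uint32_t va) { return va - KERNBASE; }

/* Physical address to kernel virtual address (mirrors P2V). */
uint32_t p2v(uint32_t pa) { return pa + KERNBASE; }

/* Round up to the next page boundary (mirrors PGROUNDUP). */
uint32_t pg_round_up(uint32_t sz) { return (sz + PGSIZE - 1) & ~(PGSIZE - 1); }

/* Round down to the containing page boundary (mirrors PGROUNDDOWN). */
uint32_t pg_round_down(uint32_t a) { return a & ~(PGSIZE - 1); }

/* Number of pages touched by [va, va + size), i.e. how many PTEs the
 * mappages loop would fill in. Requires size > 0. */
uint32_t pages_spanned(uint32_t va, uint32_t size)
{
  uint32_t first = pg_round_down(va);
  uint32_t last = pg_round_down(va + size - 1);
  return (last - first) / PGSIZE + 1;
}
```

Under these assumptions the kernel link address `0x80100000` corresponds to physical address `0x00100000`, and mapping the first megabyte (`pages_spanned(0, 0x100000)`) fills 256 page-table entries.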


```
// Allocate page tables and physical memory to grow process
int
allocuvm(pde_t *pgdir, uint oldsz, uint newsz)
{
  char *mem;
  uint a;

  if(newsz >= KERNBASE)
    return 0;
  if(newsz < oldsz)
    return oldsz;

  a = PGROUNDUP(oldsz);
  for(; a < newsz; a += PGSIZE){
    mem = kalloc();
    memset(mem, 0, PGSIZE);
    mappages(pgdir, (char*)a, PGSIZE, V2P(mem), PTE_W|PTE_U);
  }
  return newsz;
}
```






### Beyond 32-bit address spaces

| Paging mode | PG in CR0 | PAE in CR4 | LME in IA32_EFER | Linear address width | Physical address width | Page sizes       | Supports execute-disable? | Supports PCIDs and protection keys? |
|-------------|-----------|------------|------------------|----------------------|------------------------|------------------|---------------------------|-------------------------------------|
| None        | 0         | N/A        | N/A              | 32                   | 32                     | N/A              | No                        | No                                  |
| 32-bit      | 1         | 0          | 0                | 32                   | Up to 40               | 4 KB, 4 MB       | No                        | No                                  |
| PAE         | 1         | 1          | 0                | 32                   | Up to 52               | 4 KB, 2 MB       | Yes                       | No                                  |
| 4-level     | 1         | 1          | 1                | 48                   | Up to 52               | 4 KB, 2 MB, 1 GB | Yes                       | Yes                                 |
The 4-level mode corresponds to x86-64 long mode, as named in the original AMD specification.
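The table can be read as a decision procedure over three control bits. The sketch below encodes it, assuming the standard bit positions (CR0.PG is bit 31, CR4.PAE is bit 5, IA32_EFER.LME is bit 8). The enum and function names are hypothetical, and the code only classifies register values passed in as arguments; it does not read the real registers.

```c
#include <stdint.h>

#define CR0_PG   (1u << 31) /* paging enable */
#define CR4_PAE  (1u << 5)  /* physical address extension */
#define EFER_LME (1u << 8)  /* long mode enable */

enum paging_mode {
  PAGING_NONE,
  PAGING_32BIT,
  PAGING_PAE,
  PAGING_4LEVEL,
  PAGING_UNLISTED /* combination that does not appear in the table */
};

/* Classify the paging mode from the three control-register values. */
enum paging_mode paging_mode_from_regs(uint32_t cr0, uint32_t cr4, uint64_t efer)
{
  if (!(cr0 & CR0_PG))
    return PAGING_NONE;           /* PAE and LME are N/A when PG = 0 */
  if (!(cr4 & CR4_PAE))
    return (efer & EFER_LME) ? PAGING_UNLISTED : PAGING_32BIT;
  return (efer & EFER_LME) ? PAGING_4LEVEL : PAGING_PAE;
}

/* Linear address width for each mode, as given in the table. */
int linear_addr_width(enum paging_mode m)
{
  return m == PAGING_4LEVEL ? 48 : 32;
}
```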



## x86-64 (aka IA-32e) modes

- Long mode: 48-bit virtual addresses (256TB virtual address spaces)
  - 4 levels of paging structures
  - All but two segment registers are forced to a flat model, and no segment limit checking is performed
    - FS & GS segments can contain non-zero bases (useful for OS)
- Compatibility mode allows for 32-bit code to run unaltered
- Intel has started implementing 5-level paging to support 57-bit virtual addresses (as of Ice Lake)



## Long mode paging

- 48-bit virtual addresses with 4 levels of paging
  - Depending on paging structure entries, supports 4KB, 2MB, 1GB page sizes
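A short sketch of how a 48-bit virtual address is carved up may help here: four 9-bit indices (one per paging level) followed by a page offset whose width depends on the page size. The structure and function names are hypothetical; the shift amounts follow from 4KB pages and 512-entry tables.

```c
#include <stdint.h>

/* Hypothetical decoded view of a 48-bit virtual address. */
struct va48 {
  unsigned pml4; /* bits 47:39, index into the PML4 table */
  unsigned pdpt; /* bits 38:30, index into the page-directory-pointer table */
  unsigned pd;   /* bits 29:21, index into the page directory */
  unsigned pt;   /* bits 20:12, index into the page table */
  uint64_t off_4k; /* offset if the walk ends at a 4KB page (bits 11:0) */
  uint64_t off_2m; /* offset if the PDE maps a 2MB page (bits 20:0) */
  uint64_t off_1g; /* offset if the PDPTE maps a 1GB page (bits 29:0) */
};

struct va48 va48_split(uint64_t va)
{
  struct va48 r;
  r.pml4 = (unsigned)((va >> 39) & 0x1FF);
  r.pdpt = (unsigned)((va >> 30) & 0x1FF);
  r.pd   = (unsigned)((va >> 21) & 0x1FF);
  r.pt   = (unsigned)((va >> 12) & 0x1FF);
  r.off_4k = va & 0xFFFull;
  r.off_2m = va & 0x1FFFFFull;
  r.off_1g = va & 0x3FFFFFFFull;
  return r;
}
```

For the address `0x8080604ABC`, the indices come out as 1, 2, 3, and 4 with a 4KB offset of `0xABC`. If the page directory entry mapped a 2MB page instead, the page-table index would be folded into the offset, giving `0x4ABC`.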



### Long mode 4KB paging





### Long mode 2MB paging






### Long mode 1GB paging





### Access control and metadata

- User/Supervisor and Read/Write flags in paging structure entries can be used to guard access
  - If the U/S flag = 0 (supervisor), the page can only be accessed in supervisor mode (CPL < 3; in practice, the kernel at CPL = 0)
- Accessed and Dirty flags are also useful for kernel swapping policies
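The checks described above can be sketched as a simplified model of a single leaf entry. It assumes write protection applies to supervisor code as well (as when CR0.WP = 1) and ignores execute-disable, SMEP/SMAP, protection keys, and the fact that real hardware combines the flags from every level of the walk. The function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define PTE_P 0x001ull /* present */
#define PTE_W 0x002ull /* read/write */
#define PTE_U 0x004ull /* user/supervisor */
#define PTE_A 0x020ull /* accessed */
#define PTE_D 0x040ull /* dirty */

/* Would this access be allowed by the entry's P, U/S, and R/W flags?
 * user_mode is true for CPL = 3 code. Simplified model, see lead-in. */
bool access_allowed(uint64_t entry, bool user_mode, bool is_write)
{
  if (!(entry & PTE_P))
    return false;               /* not present: page fault */
  if (user_mode && !(entry & PTE_U))
    return false;               /* supervisor-only page */
  if (is_write && !(entry & PTE_W))
    return false;               /* read-only page */
  return true;
}

/* Model of the metadata update on a successful access: Accessed is set
 * on any access, Dirty only on a write. */
uint64_t mark_access(uint64_t entry, bool is_write)
{
  entry |= PTE_A;
  if (is_write)
    entry |= PTE_D;
  return entry;
}
```

A swapping policy could periodically clear the Accessed flags and later prefer to evict pages whose flag is still clear, writing back only those with Dirty set.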

[Figure: formats of CR3 and the paging-structure entries with 4-level paging: CR3 (address of PML4 table), and entries holding the address of a page-directory-pointer table, a 1GB page frame, a page directory, a 2MB page frame, or a page table, together with their flag bits (including XD and protection key)]
Ignored       D <td></td> <td></td> <td>Ignored</td> <td>Rsvd.</td> <td></td> <td></td> <td></td> <td>Reser</td> <td>ved</td> <td>&gt;<br/>A Ign.<br/>T</td> <td>G <u>1</u></td> <td>L D</td> <td>A<br/>C<br/>C</td> <td>P<br/>C<br/>W<br/>D T</td> <td>U R<br/>′S V</td> <td>1</td>                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                |         |                      | Ignored                        | Rsvd.  |                |     |            | Reser                                      | ved                                  | ><br>A Ign.<br>T      | G <u>1</u> | L D                | A<br>C<br>C      | P<br>C<br>W<br>D T | U R<br>′S V      | 1        |
| X       Prot.       Ignored       Rsvd.       Address of 2MB page frame       Reserved       P       Ign.       G       1       D       A       P       P       U       R       1         X       M       Ignored       Rsvd.       Address of page table       Ign.       G       1       D       A       P       P       U       R       1         X       D       Ignored       Rsvd.       Address of page table       Ign.       0       Ign       A       C       W       V       V       1         X       D       Ignored       Rsvd.       Address of page table       Ign.       0       Ign       A       C       W       V       V       1         Ignored       Rsvd.       Address of page table       Ign.       0       Ign       A       C       W       V       V       V       1         X       Prot.       Ignored       Rsvd.       Address of 4KB page frame       Ign.       Ign.       C       P       P       V       V       V       V       V       V       V       V       V       V       V       V       V       V       V       V       V                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     
                                            | X<br>D  | I                    | gnored                         | Rsvd.  |                |     | ŀ          | Address of page dire                       | ddress of page directory             |                       |            |                    |                  |                    | U R<br>′S V      | 1        |
| X       Ignored       Rsvd.       Address of page table       Ign.       Ign       <                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        |         | Ignored              |                                |        |                |     |            |                                            |                                      |                       |            |                    |                  |                    | <u>0</u>         |          |
| A       Ignored       Rsvd.       Address of page table       Ign.       O       G       A       C       W       1         Ignored       Ignored       Ignored       Ignored       Ignored       Ignored       O       Ignored       Ignore                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         |         | -                    | Ignored                        | Rsvd.  |                |     |            |                                            |                                      | ><br>A Ign.<br>T      | G <u>1</u> | LD                 | A C              | P P<br>C W<br>D T  | U R<br>′S /<br>W |          |
| X Prot. Japarod Dovd Dovd Address of 4KB page frame                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           
                                            | X<br>D  | X<br>D Ignored Rsvd. |                                |        |                |     |            | Address of page table                      |                                      |                       |            |                    |                  | P<br>C<br>W<br>D T | U R<br>′S V      | 1        |
| A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  A  =  
A  =  A  =  A  =  A  =  A  =  A  =  A  =  A |         |                      |                                |        |                |     |            |                                            |                                      |                       |            |                    |                  | <u>0</u>           |                  |          |
|                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               
                                            | X<br>D  |                      | Ignored                        | Rsvd.  |                |     | A          | lgn.                                       |                                      |                       | A C        | P<br>C<br>W<br>D T | U R<br>′S V<br>W | 1                  |                  |          |
|                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               
                                            |         |                      |                                |        |                |     |            | Ignored                                    |                                      |                       |            |                    |                  |                    |                  | <u>0</u> |





### Linux VM features

- Page cache and Sharing
- Swap cache
- Copy-on-write optimization
- Page allocation: buddy system
- Kernel internal memory management: slab allocator





# Page cache and Sharing

- When executing a program (or loading shared libraries, etc.), the source file is not immediately loaded, but rather linked (mapped) into the process's virtual address space
  - Page faults cause data to be loaded, a page at a time
  - All file data loaded this way have entries in the page cache, which the kernel consults before going to disk
    - If multiple processes access the same files, the kernel can share cached pages between them (potentially at different virtual addresses)
      - **Dirty bit** needed to check that a shared page hasn't been modified
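
The lazy loading and sharing described above can be sketched as a small simulation. This is a toy model under stated assumptions, not how Linux implements it: `PageCache`, `Process`, `map_file`, the dictionary standing in for the disk, and the page-aligned base addresses are all hypothetical names and simplifications.

```python
PAGE_SIZE = 4096  # bytes per page (assumed 4KB pages)


class PageCache:
    """Toy page cache: maps (file name, page index) to one shared page."""

    def __init__(self, disk):
        self.disk = disk      # hypothetical backing store: file name -> bytes
        self.pages = {}       # (file name, page index) -> bytearray
        self.disk_reads = 0   # how often we had to go to "disk"

    def get_page(self, name, index):
        key = (name, index)
        if key not in self.pages:            # miss: read one page from disk
            start = index * PAGE_SIZE
            data = self.disk[name][start:start + PAGE_SIZE]
            self.pages[key] = bytearray(data)
            self.disk_reads += 1
        return self.pages[key]               # hit: reuse the cached page


class Process:
    """Toy process: files are mapped lazily, pages arrive on first touch."""

    def __init__(self, cache):
        self.cache = cache
        self.mappings = []     # (page-aligned base address, length, file name)
        self.page_table = {}   # virtual page number -> cached page

    def map_file(self, base, name):
        # Only record the mapping; no file data is loaded yet.
        self.mappings.append((base, len(self.cache.disk[name]), name))

    def read(self, vaddr):
        vpn = vaddr // PAGE_SIZE
        if vpn not in self.page_table:       # simulated page fault
            for base, length, name in self.mappings:
                if base <= vaddr < base + length:
                    index = (vaddr - base) // PAGE_SIZE
                    self.page_table[vpn] = self.cache.get_page(name, index)
                    break
            else:
                raise MemoryError("address is not mapped")
        return self.page_table[vpn][vaddr % PAGE_SIZE]
```

Two `Process` objects that map the same file at different base addresses end up pointing at the same cached page object, and the second process's first access does not increase `disk_reads`.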







### Swap cache

- Dirty pages are swapped out (to save their contents) when low on memory
  - Unmodified pages can just be loaded from the page cache!
- Swap cache keeps track of pages that have been written to swap
  - If a page was previously swapped out and wasn't modified after being swapped back in, the kernel can simply discard it (the copy in swap is still valid)
    - Helpful optimization for when system is heavily swapping (thrashing)
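
A minimal sketch of the bookkeeping above, assuming a single process and string page identifiers. `SwapSim` and its method names are hypothetical, and the "swap device" is just a dictionary.

```python
class SwapSim:
    """Toy model of swap with a swap cache."""

    def __init__(self):
        self.swap = {}            # page id -> contents on the swap device
        self.swap_cache = set()   # resident pages whose swap copy is still valid
        self.memory = {}          # page id -> [contents, dirty flag]
        self.swap_writes = 0      # number of writes to the swap device

    def write(self, page, data):
        self.memory[page] = [data, True]   # page is now dirty
        self.swap_cache.discard(page)      # any copy in swap is stale

    def swap_in(self, page):
        self.memory[page] = [self.swap[page], False]
        self.swap_cache.add(page)          # swap copy matches memory

    def evict(self, page):
        data, dirty = self.memory.pop(page)
        if page in self.swap_cache and not dirty:
            self.swap_cache.discard(page)  # valid copy already in swap: discard
            return
        self.swap[page] = data             # must save contents first
        self.swap_writes += 1
```

Evicting a page that was swapped in and never written again costs no swap write in this model, which is the saving the swap cache is meant to provide.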



# Copy-on-write (COW)

- "Clone" operation is quite common (e.g., used when fork-ing a process)
  - But if carried out literally, duplicating the entire memory image is incredibly expensive (and likely unnecessary)
- At clone time, no data is actually copied; simply replicate paging structures and mark pages as read-only
  - Page faults that occur on write accesses trigger copy operation
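
The clone-then-copy-on-fault behavior can be sketched as follows. This is a toy model, not kernel code: `Frame`, `AddressSpace`, and the per-frame reference count are assumptions made for illustration, and the "page fault" is just a branch in `write`.

```python
class Frame:
    """A physical page frame with a reference count."""

    def __init__(self, data):
        self.data = bytearray(data)
        self.refs = 1


class AddressSpace:
    """Toy address space: virtual page number -> [frame, writable flag]."""

    def __init__(self):
        self.table = {}
        self.copies = 0   # number of pages this address space had to copy

    def map_page(self, vpn, data):
        self.table[vpn] = [Frame(data), True]

    def clone(self):
        # Replicate the paging structure only; share every frame read-only.
        child = AddressSpace()
        for vpn, entry in self.table.items():
            frame = entry[0]
            frame.refs += 1
            entry[1] = False                  # parent loses write access too
            child.table[vpn] = [frame, False]
        return child

    def read(self, vpn, offset):
        return self.table[vpn][0].data[offset]

    def write(self, vpn, offset, value):
        entry = self.table[vpn]
        frame, writable = entry
        if not writable:                      # simulated write page fault
            if frame.refs > 1:                # still shared: copy now
                frame.refs -= 1
                frame = Frame(frame.data)
                entry[0] = frame
                self.copies += 1
            entry[1] = True                   # sole owner: restore write access
        frame.data[offset] = value
```

After `child = parent.clone()`, both address spaces reference the same frames. Only the first write to a given page triggers a copy, and pages that are never written stay shared.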



### Page allocation

- Because pages are all the same size, theoretically we have no external fragmentation
  - Can allocate first free page we find and map it into any virtual address space using paging structures

- But recall: large blocks of contiguous pages can be mapped as a single huge page (e.g., 4KB vs. 4MB)
  - Can greatly improve TLB effectiveness!
  - Especially desirable given many levels of paging structures
  - Also needed for I/O device direct memory access (more on this later)
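
To put a number on the TLB benefit, here is a small arithmetic sketch. The 64-entry TLB is an assumed, illustrative size, and `tlb_reach` is a hypothetical helper; real TLBs often have separate entries for different page sizes.

```python
KB = 1024
MB = 1024 * KB


def tlb_reach(entries, page_size):
    """Bytes of address space covered when every TLB entry is in use."""
    return entries * page_size


ENTRIES = 64  # assumed TLB size, for illustration only

small = tlb_reach(ENTRIES, 4 * KB)
huge = tlb_reach(ENTRIES, 4 * MB)
print(f"4KB pages: reach = {small // KB} KB")
print(f"4MB pages: reach = {huge // MB} MB")
print(f"4KB pages needed to cover one 4MB huge page: {(4 * MB) // (4 * KB)}")
```

Under these assumptions, the same 64 entries cover 256 KB with 4KB pages and 256 MB with 4MB pages, and a single huge-page mapping replaces 1024 small-page mappings.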



### Buddy system allocator

- Linux uses a "buddy system" allocator to search for blocks of free pages
  - Idea: maintain separate lists of free page blocks, with sizes = powers of 2
    - E.g., list #0 = 1 page blocks, list #1 = 2 page blocks, list #2 = 4 page blocks, list #3 = 8 page blocks, etc.
  - When allocating a block, keep splitting in half if possible
  - When freeing a block, keep merging (doubling) if possible
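
The split-on-allocate and merge-on-free rules can be sketched in a few lines. This is a simplified model, not the Linux implementation: `BuddyAllocator` is a hypothetical class, blocks are identified by their first page frame number, and Python sets stand in for the free lists.

```python
class BuddyAllocator:
    """Toy buddy allocator over page frames 0 .. 2**max_order - 1."""

    def __init__(self, max_order):
        self.max_order = max_order
        # free_lists[k] holds the start frames of free blocks of 2**k pages
        self.free_lists = [set() for _ in range(max_order + 1)]
        self.free_lists[max_order].add(0)    # one maximal free block

    def alloc(self, order):
        """Return the first frame of a free 2**order-page block, or None."""
        k = order
        while k <= self.max_order and not self.free_lists[k]:
            k += 1                           # smallest list with a free block
        if k > self.max_order:
            return None                      # nothing large enough
        start = min(self.free_lists[k])
        self.free_lists[k].remove(start)
        while k > order:                     # keep splitting in half
            k -= 1
            self.free_lists[k].add(start + (1 << k))  # upper half stays free
        return start

    def free(self, start, order):
        """Return a block, merging with its buddy while the buddy is free."""
        while order < self.max_order:
            buddy = start ^ (1 << order)     # buddy differs in exactly one bit
            if buddy not in self.free_lists[order]:
                break
            self.free_lists[order].remove(buddy)
            start = min(start, buddy)        # merged block starts at lower half
            order += 1
        self.free_lists[order].add(start)
```

With `BuddyAllocator(3)` (8 pages), allocating one page splits the 8-page block into free blocks of 4, 2, and 1 pages. Freeing everything merges the pieces back into a single 8-page block.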



# Buddy system pros/cons

- Pros:
  - Fast allocation: search is easy
  - Able to find contiguous blocks
    - Good for huge pages
    - Can simplify page table updates

- Cons:
  - Small vs. large blocks create external fragmentation
  - 2<sup>n</sup> block sizes can result in significant internal fragmentation
    - Compromise: speed vs. efficiency
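
A short arithmetic sketch of the internal fragmentation caused by power-of-two block sizes. `buddy_block_pages` is a hypothetical helper and the request sizes are arbitrary examples.

```python
def buddy_block_pages(pages_needed):
    """Smallest power-of-two block size (in pages) that fits the request."""
    block = 1
    while block < pages_needed:
        block *= 2
    return block


for need in (1, 3, 33, 64, 65):
    block = buddy_block_pages(need)
    print(f"need {need:3d} pages -> block of {block:3d}, wasted {block - need}")
```

For example, a request for 33 pages occupies a 64-page block and leaves 31 pages unused. In the worst case (one page more than a power of two), just under half of the block is wasted.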





### Kernel internal allocation

- Kernel frequently needs to free/allocate internal data structures
  - e.g., PCB entries, VM structures, file/inode structures
  - Fixed size, similarly initialized
- Buddy allocator is not ideal: too much internal fragmentation!
- Linux uses a **slab allocator** to allocate & free internal data structures



### Slab allocator

- Built on top of the page buddy allocator
- Idea: allocate large blocks using buddy allocator, and carve them up into multiple data structure entries
  - Use the first one available, and leave partially initialized when freed
  - Effectively build dedicated caches for different data types
- Mitigates internal fragmentation due to buddy allocator
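
A minimal sketch of the slab idea, assuming one cache per object type. `SlabCache`, `make_object`, and `objects_per_slab` are hypothetical names, and `_grow` only stands in for requesting a block of pages from the buddy allocator.

```python
class SlabCache:
    """Toy slab cache for one object type."""

    def __init__(self, make_object, objects_per_slab):
        self.make_object = make_object       # constructor, run once per slot
        self.objects_per_slab = objects_per_slab
        self.free_objects = []               # constructed, currently unused
        self.slabs_allocated = 0
        self.constructed = 0

    def _grow(self):
        # Stand-in for getting one large block from the buddy allocator
        # and carving it into equally sized object slots.
        self.slabs_allocated += 1
        for _ in range(self.objects_per_slab):
            self.free_objects.append(self.make_object())
            self.constructed += 1

    def alloc(self):
        if not self.free_objects:
            self._grow()
        return self.free_objects.pop()       # take an available object

    def free(self, obj):
        # The object stays constructed so the next user can reuse it.
        self.free_objects.append(obj)
```

Usage under these assumptions: `pcb_cache = SlabCache(lambda: {"pid": 0, "state": "unused"}, 4)` builds a dedicated cache of PCB-like records. Freed objects are reused without running the constructor again, and a new slab is requested only when the cache runs dry.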

