It's important for an SA to know exactly how memory works and is accessed on a system, from the physical hardware all the way up to a program requesting a few bytes of memory for whatever it needs.
In the spirit of that idea, we'll (somewhat quickly) tackle each "big idea" in memory management today. There's plenty of interesting history and technical details that we'll skim, but the idea is to hammer the main concepts that will actually help you discover, understand, and diagnose problems you encounter.
# What is physical RAM, really?
RAM simply stands for Random Access Memory: so named because, as opposed to SAM (Sequential Access Memory) such as tapes, data inside a RAM chip can be accessed in any order. The **most common** type of RAM in a computer is called Dynamic RAM, or DRAM. I'll continue on assuming this type, but once you start getting to the OS and software level the type of RAM stops mattering much at all.
The more important part to remember for these systems is Double Data Rate (DDR). The original DDR RAM was double the speed / data rate of the previous Single Data Rate (SDR) RAM. Then we had DDR2, which doubled the speed from DDR, DDR3 which doubled again, and now DDR4 and DDR5. RAM is quite fast these days, but as we'll discuss below it's still far, far slower than the CPU.
> [!info]- Lies
> - **DDR**: 200-400 MT/s
> - **DDR2**: 400-1,066 MT/s
> - **DDR3**: 800-2,133 MT/s
> - **DDR4**: 1,600-3,200 MT/s
> - **DDR5**: 3,200-8,400 MT/s
>
> So, not quite double each generation, but close.
Modern DRAM stores each bit as a charge in a tiny capacitor, accessed through a **M**etal-**O**xide-**S**emiconductor **F**ield-**E**ffect **T**ransistor (or **MOSFET**). The physical characteristics of these cells mean that they require constant power (and, in fact, constant refreshing) to keep the bits where they need to be. Once power is removed, the cells rapidly lose their charge and can no longer hold information. This means that this kind of memory (all mainstream RAM, realistically) is volatile, and completely powering off a computer will quickly clear out the bits stored in the physical chips of RAM.

# Memory mapping, simplified
We're going to dive right into how an OS (Operating System) gives programs running under it some bytes located in physical memory. Because of some clever decisions made by clever software developers throughout modern history, the OS doesn't simply hand a program a location of some bytes it can use on a physical stick of RAM. Instead, it does something called memory mapping which turns the RAM's physical address into a virtual memory address that can be arbitrarily shuffled around if needed.
Memory management, as it turns out, is a bit of a complex system full of history and landmines. It's actually very interesting, but I'll be glossing over a lot and lying here and there to get to a few points that are more useful to an SA.
> [!info]- Lies
> The OS does not map physical memory to virtual memory all by itself. The actual address translation is performed by a Memory Management Unit (MMU), which was originally a separate chip, while the memory controller lived on a motherboard chipset (the "northbridge"). Over time, both the MMU and the memory controller were integrated directly into the CPU die, and the remaining chipset duties were consolidated into the Platform Controller Hub (PCH), which replaced the old northbridge/southbridge pair. That integration happened well over a decade ago, so you're unlikely to encounter a system with an off-die memory controller unless you're really into old tech. (A related component, the IOMMU, performs similar translation for devices doing DMA, and is still very much around.)
>
> These days, MMUs perform extra tasks like handling the L1, L2, and L3 caches on the processor, integrating memory protection on-die, and sometimes handling bussing bytes around on the motherboard.
>
> So, memory mapping is still a physical component of the CPU, but the kernel has a lot of control over the process and direct access to that hardware. The MMU performs the actual address translation on every access, but the kernel builds the page tables the MMU consults, turns pages into virtual memory, and decides when to create, move, or evict those pages.
So, there's a part of the CPU die called an MMU. Physical memory is divided into fixed-size chunks called pages, and the kernel keeps tables that map each virtual page to a physical one. Whenever the OS (or a program) touches a virtual address, the MMU walks those tables and translates it to a physical address on the RAM itself.
The process of storing and retrieving data from the RAM sticks is, compared to the processor's built-in memory storage (called L1, L2, and L3 cache), *extremely* [slow](https://www.reddit.com/r/buildapc/comments/bu0zp3/comment/ep6u5ot/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button). The processor is hundreds of times faster than any motherboard signal and the L*N* caches on the processor are very small. The trick for the OS (and its programs), then, is to manage this data in such a way as to utilize physical RAM (and especially disk) as little as possible.
We're not quite done yet, though. The OS now has "pages" of memory, but these are huge blocks of effectively raw bytes in memory that the OS needs to now divvy out to its running processes. And, as mentioned earlier, doing the simple thing of just giving its programs direct access to these pages is both risky and inefficient.
> [!info]- Hugepages
> Those "huge blocks" are typically 4 KiB (4,096 bytes) in size. Through the use of a system called [hugepages](https://kerneltalks.com/services/what-is-huge-pages-in-linux/), however, they can be up to 1 GB (!!) in size.
So, what does the OS do? It creates what's called "virtual memory" which chunks out these pages into various sizes as-needed and hands them out to programs on request. The OS can then 1) protect these virtual memory regions more effectively and 2) move these regions from page to page without worrying about affecting the underlying program. The program, then, just sees a large, contiguous block of memory whereas the bytes in this block are stored all over the physical RAM.

This process of creating virtual memory means we can do neat things like the Copy-on-Write (CoW) memory management system which is standard in the modern Linux kernel.
> [!info]- CoW
> When a process `fork()`s itself, the forked copy of the process is expected to have its own copy of the parent process's memory. The pages end up physically shared until written (as described below), but this is *not* the same concept as "shared memory" in `/dev/shm`.
>
> Older UNIX systems (think: Bell Labs) copied the memory of the parent process into a region the child process could access. Modern kernels will instead point the child process at the parent process's memory pages, and only when the child (or parent) process modifies those bytes is a private copy of them created for the writer.
>
> To clarify: If the parent modifies its own memory, a copy is made and the child sees the **old** bytes. If a child modifies the parent's memory, a copy is made and the child sees the **new** bytes but the parent's memory region remains unaffected.
>
> The end result of both systems is that the child process has "access" to the parent's memory but can only modify that memory for itself. The newer CoW is much, much more efficient, however.
# Swap space
Swap is one of those things that many, many SAs get wrong. There's a lot of old and just plain incorrect information about how Swap or `vm.swappiness` works, and those misunderstandings cause real performance impacts on real machines.
There's an [excellent write-up by Chris Down, one of the developers of cgroup v2, on the common misconceptions of swap](https://chrisdown.name/2018/01/02/in-defence-of-swap.html). It's worth the read and even has a tl;dr, but I'll summarize and simplify here.
As mentioned earlier, pages are handed to the OS by the MMU and the OS can then do with those pages as it pleases. Under the Linux kernel, there are different "types" of pages: pages caching metadata about files, pages holding the code of your running programs, pages for kernel-specific operations, and more. By far the most common page type, however, is the "anonymous" page, which holds memory used by the programs themselves.
Anonymous pages are called anonymous because they aren't associated with any file and thus have no *backing*, which (absent swap) makes that memory *unreclaimable*. But, what does it mean to have a *backing* for a page, and what is *unreclaimable* memory?
Most "types" of pages have a natural *backing*: a copy of their contents lives (or can be written) somewhere on more permanent storage, like the disk. This allows the OS to drop non-critical or not-recently-used pages from RAM (writing them back to disk first if they've been modified) and free up valuable RAM real-estate for things that need faster access. This process of taking pages out of RAM is called *reclaiming*, and those pages are considered *reclaimable*.
> [!info]- Example
> The most obvious result of this reclamation system is hibernation, which means a much faster boot time because most of the OS-specific stuff can be loaded back into RAM quickly instead of being re-initialized. Modern Windows (10 onwards) does a version of this, called "Fast Startup", by default when you power it off.
Now, anonymous pages have no *backing* like most of the other types do. Their contents exist purely in RAM, with no copy anywhere else. If the OS decides that a program hasn't accessed some of its memory in a while, it has no way to put that data onto disk to free up physical RAM for other things.
This is where Swap space comes in. It's a section of disk that the OS can use to *back* anonymous pages. Adding Swap to a system allows the OS to be able to move much more data around and free up valuable RAM for data that needs to move fast. This gives the kernel much more of a fighting chance to make the overall system as fast and efficient as it can.
So, as Chris puts it:
> [!quote] Swap is primarily a mechanism for equality of reclamation, not for emergency "extra memory". Swap is not what makes your application slow – entering overall memory contention is what makes your application slow.
> [!info]- Page faults
> This system of reclamation combines with the idea of the MMU's pages to introduce a new concept: page faults.
>
> When a program touches a virtual address whose page the MMU can't find in its tables (or simply requests more memory, causing the OS to map a page that doesn't exist yet), the MMU raises an interrupt called a "page fault" which tells the OS to do something.
>
> If the fault can be resolved without touching the disk (say, a brand-new page that just needs to be created and zeroed, or a page that's already in memory but not yet mapped for this process), it's called a **minor fault**. If the data has to be read back in from disk (the page was reclaimed to swap, or file contents need loading), it's a **major fault**, and those are the expensive ones.
>
> Hardware interrupts are similar to software interrupts: they're basically communicating information to the kernel/program by biting ankles in a specific way that the kernel or program now needs to interpret before it can do anything else. As you can imagine, it's not the most efficient process on the planet.
>
> Either way, having **many** major page faults indicates a problem, and ignoring your problems often leads to bigger problems.
And, finally, some very quick facts about `vm.swappiness`:
- It's a `sysctl` setting (can be set in `/etc/sysctl.d`)
- It's a number from 0 to 200 (inclusive) - yes, 200
- It *biases* reclamation between anonymous pages and other types of pages
- `vm.swappiness = 50` would give you a ratio of 50:150 (1:3) for anonymous page reclamation and "other" page reclamation, respectively
- A value of 100 would give you an even 50/50 chance of reclaiming anonymous pages and all other (reclaimable) pages
- `vm.swappiness = 0` is a [special case](https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/patch/?id=fe35004fbf9eaf67482b074a2e032abb9c89b1dd) which essentially disables anonymous page reclaiming, **except** it uses a different system to the standard bias system
- `vm.swappiness = 1` is the lowest you can go while still keeping the biasing system
- The default value is usually somewhere around 60
# Software & RAM
When a program is run (ELF for Linux, PE for Windows), parts of the executable (called segments) are loaded into various regions of memory. Machine code, data, etc are all taken from the executable's file on disk and loaded into RAM before the executable actually runs.
When a program requests memory during runtime (eg. via `malloc`), the OS either finds space in an existing page or, on the program's first touch of the new region, takes a page fault and maps a fresh page of RAM for the newly-allocated block of memory.
> [!info]- Virtual Memory
> Remember that the OS only gives the program *virtual memory*, a slice of an existing *page*, which is itself a slice of RAM.
Depending on the type of memory allocation requested, the OS will put the newly-allocated memory in the program's heap or stack.
## Stack
Stack memory is often *very* temporary and is used in function calls. It is LIFO (Last In, First Out) and uses a stack pointer in a relatively small contiguous set of memory to allocate and deallocate very small portions of its memory. This concept of a "stack pointer" is what makes the stack extremely fast, but means it is limited to LIFO.
Basically, think of the stack like a very tall "stack" (aha!) of small sections of memory. The program keeps adding to the top of the stack, expecting to eventually get to the bottom of it by constantly removing the topmost section. Except that, instead of doing all of the work of adding to and removing from the top of this stack, the program keeps a little slider next to the stack that points to the current working location.

The stack is used only to store scoped variables, function parameters, and return addresses (to return to the previous function when done).
Ben Eater has a [fantastic video on how the stack works](https://www.youtube.com/watch?v=xBjQVxVxOxc) if interested.
In the example below, the variable `a` and the parameter `n` are both stored on the stack, as is the return address of the function that called `test`, and so on for every caller all the way back to the program's entry point.
```c
void test(int n) {
    int a = 10;
    return;
}
```
A "stack overflow", then, is an exception raised when the program's stack memory has filled up - the stack pointer has run through the entire stack memory region. Stack is non-expandable memory so it can, eventually, become full. Because each item on the stack contains very little data it can hold a lot of items, but it's not infinite and something like an endless recursive function can demolish it.
## Heap
Each program has access to heap memory, which is both much larger than the stack and is "infinitely" expandable. Well, until you run out of the system's total available memory (physical RAM plus swap). Then, you run into fun things like the kernel's OOM killer, detailed below.
Unlike the stack, the heap is designed to persist the entire lifetime of the program. It's used to hold the actual contents of complex variables (eg. variables holding objects) and anything held in a `malloc`'d region.
In the example below, the pointer variable `data` itself is stored on the stack, while the `malloc`'d region of memory it points to is stored in the program's heap.
```c
#include <stdlib.h>

int main(void) {
    int *data = malloc(sizeof(int) * 50); // Allocate heap
    if (data == NULL) {
        return 1; // Allocation failure
    }
    // data now points to somewhere in the heap
    free(data); // Deallocate heap
    return 0;
}
```
Here are the main differences between the stack and the heap:
| Aspect | Stack | Heap |
| ----------------------- | --------------------------------- | ------------------------------------ |
| Structure | LIFO | Free-forming |
| Lifespan | Function | Program |
| Size limit | Very small | Total system memory |
| Use | Local variables, return addresses | Dynamic (objects, arrays, raw bytes) |
| Allocation/Deallocation | Automatic | Manual |
And, of course, we can't go through allocation and deallocation of memory without mentioning *memory leaks*: memory regions that weren't properly deallocated, so they take up space while nothing points at them anymore, sitting there uselessly until the program exits or is killed. These happen when *heap* memory is allocated (eg. with `malloc`) without being deallocated later (eg. with `free`). Stack memory is allocated and deallocated automatically, so "memory leaks" in the stack aren't really possible; the closest stack problem is the "stack overflow" explained above.
> [!info]- Garbage Collection
> GC (Garbage Collection) is a memory management mechanism where the language runtime (eg. JVM or Java Virtual Machine for Java) automatically deallocates memory that is no longer referenced by the program, preventing that memory region from being "leaked". Java, C#, and Python utilize garbage collection extensively.
>
> This process helps reduce issues like memory leaks by managing heap memory automatically. A common argument is that a GC can introduce latency as it finds and cleans unreferenced objects, though in practice that argument has been outdated for years (see: G1GC, ZGC, etc) and code in GC languages is often faster than similar code in C++ due to a VM's internal management and chunking of memory which is often much faster than requesting new memory from the OS via raw `malloc` calls.
>
> ## Direct memory allocation
> In languages like Java, direct memory allocation (eg. `ByteBuffer.allocateDirect()`) allocates memory outside the traditional VM heap for native I/O operations. These *off-heap* memory allocations are managed manually and largely bypass the Garbage Collector. They're often used for communicating with native libraries, and allocating very large chunks of memory outside the VM can be much faster than a similar operation within it.
>
> The downside, of course, is that these implementations can lead to the same problems as non-GC languages: memory leaks.
## Buffer Overflows
A **buffer overflow** occurs when a program writes more data to a buffer (a contiguous block of memory) than it can hold. The excess data can overwrite adjacent memory, leading to unpredictable behavior and vulnerabilities.
Programs use buffers to handle data like strings or chunked file data. When a program does not check the size of incoming data before writing it to a buffer, an overflow can occur.
Here's a simplified example in C:
```c
#include <stdio.h>
#include <string.h>

void vulnerableFunction(char *input) {
    char buffer[10];       // small buffer
    strcpy(buffer, input); // no bounds checking
    printf("Buffer contents: %s\n", buffer);
}

int main() {
    char largeInput[] = "ThisInputIsWayTooLargeForTheBuffer";
    vulnerableFunction(largeInput);
    return 0;
}
```
In this example, `strcpy` copies data from `input` into `buffer`, but since `buffer` can only hold 10 characters, anything beyond that will overflow into adjacent memory.
But *why* can this lead to exploitation? What does an attacker actually take advantage of, here?
In the above example, `buffer` is allocated on the *stack*, which, as explained above, is the same section of memory that contains the previous (calling) function's return address.
Here's what the stack would look like, in this case:
| Local variables (`buffer`) | |
| -------------------------- | ------------------------------ |
| Padding (overflow area) | <- Overflow corrupts this area |
| Return Address | <- Overwritten |
| Previous Stack Frame | |
| ... | |
An attacker would take advantage of the size of the buffer and any padding around it to inject shellcode (malicious code) directly onto the stack, and then overwrite the return address (again, directly on the stack) to point to this new shellcode.
So, a buffer overflow attack would do something like this:
| Local variables (`buffer`) | |
| -------------------------- | ---------------------- |
| Shellcode (malicious) | |
| Return Address | <- Points to shellcode |
| Previous Stack Frame | |
| ... | |
In some cases this will cause the program to crash, but in others the overflow exploit and shellcode can be crafted carefully enough that the program continues running without interruption.
# OOM killer
So, what happens when a program (or set of programs) has used *all* of the available physical RAM and the system has no more Swap space left?
We'll talk about the Out-Of-Memory (OOM) killer, a system in the Linux kernel designed to, as a last resort, start killing off processes in order to keep the system itself alive.
When a process is created, it is given an OOM score, ranging from 0 to 1000, which is constantly updated by the kernel. You can see this score for each process by simply viewing the `/proc/<id>/oom_score` file associated with the process.
The process with the highest score, then, is killed first. That death is followed by the second-highest, and so on.
For example, a `firefox` process running as PID (process ID) 7159 would have the file `/proc/7159/oom_score` with the file contents `677`, which is the process's OOM score.
> [!info]- oom_score_adj
> You can adjust this score by simply putting your "score adjustment" number into `/proc/<id>/oom_score_adj`
>
> eg. to reduce PID 7159's score of 677 by 200 points (resulting in a total score of 477), you would run `echo -200 > /proc/7159/oom_score_adj`
>
> You can put a range of values from -1000 to 1000 into the `oom_score_adj` file. A resulting `oom_score` of 0 effectively disables OOM killing for that process.
The exact heuristics of this score are a bit more complex, but the basic idea is that the kernel doesn't want to kill important things or things that don't use a lot of memory.

> [!info]- Lies
> It isn't quite as simple as "highest score dies first" but for all intents and purposes that is effectively the case.
>
> From `man procfs`:
> > The badness heuristic assigns a value to each candidate task ranging from 0 (never kill) to 1000 (always kill) to determine which process is targeted. The units are roughly a proportion along that range of allowed memory the process may allocate from, based on an estimation of its current memory and swap use. For example, if a task is using all allowed memory, its badness score will be 1000. If it is using half of its allowed memory, its score will be 500.
> >
> > There is an additional factor included in the badness score: root processes are given 3% extra memory over other tasks.
>
> Note that the man page references a "badness" score, which is the old name for "OOM score".
>
> Essentially the kernel over-allocates memory to all of its processes, and tracks how much memory each process uses. If you have a computer with 4GB physical RAM, it might allocate a total of 5GB to all of its running processes, expecting most processes will not use 100% of their allocated memory. If the system ends up in trouble and the OOM killer is required, it will look at processes nearing or at their total memory allocation first.
>
> Usually, these processes are the "problem children" and are consuming way too much or simply leaking memory, so the system works as intended. Usually.
# NUMA: Non-Uniform Memory Access
# Help! Proxmox shows my VM consuming more RAM than it really is!