Memory Allocator with FIFO Free List

In this lab, we’ll build our own memory allocator that serves as an alternative to the default malloc family of functions. If we take a look at the man pages for malloc et al., we’ll find that we need to implement four functions:

void *malloc(size_t size);
void free(void *ptr);
void *calloc(size_t nmemb, size_t size);
void *realloc(void *ptr, size_t size);
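
Two of these can be layered on top of the others. For instance, here is a rough sketch of calloc built on top of your own malloc; this is illustrative, and not necessarily how the starter code approaches it:

#include <stdint.h>
#include <stdlib.h>
#include <string.h>

void *calloc(size_t nmemb, size_t size)
{
    /* Guard against nmemb * size overflowing before multiplying. */
    if (size != 0 && nmemb > SIZE_MAX / size) {
        return NULL;
    }

    void *ptr = malloc(nmemb * size);
    if (ptr != NULL) {
        memset(ptr, 0, nmemb * size);  /* calloc must zero the memory */
    }
    return ptr;
}

realloc can be built in a similar spirit from malloc, memcpy, and free, though leaving the block unchanged when it is already large enough saves work.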

However, our memory allocator cannot create memory out of thin air: it needs to request memory from the operating system first. To do this, we’ll use the mmap system call. mmap works much like malloc, but requires more arguments to set up. Additionally, its counterpart for releasing memory, munmap, requires that we pass in the size of the block being released. (Recall that free, by contrast, needs only a pointer to the block being freed.)
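
To make this concrete, here is roughly what the mmap call looks like for an anonymous, writable mapping. The request_block helper name is made up for illustration; the exact flags and any size rounding are up to your implementation:

#include <stddef.h>
#include <sys/mman.h>

/* Ask the OS for 'size' bytes of zero-filled, private, anonymous memory. */
void *request_block(size_t size)
{
    void *block = mmap(NULL, size, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return (block == MAP_FAILED) ? NULL : block;
}

/* Releasing the block later requires the size again: munmap(block, size). */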

To deal with these implementation differences, we’ll prefix each allocation with a struct that contains some metadata about the allocation. One such piece of metadata is the allocation size, which we’ll pass to munmap when we want to release the memory. The other piece of information we need to maintain is whether or not the block is free: if we immediately released memory back to the OS every time free was called, our allocator wouldn’t be very efficient. System calls take time to execute, so we want to reduce communication with the OS as much as we can. Given this, we’ll integrate a linked list that tracks all the free blocks that haven’t been released back to the OS yet.

This means our allocations will look something like this:

[Header][Data] --> [Header][Data] --> [Header][Data] --> etc.
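
Here is a minimal sketch of what the header struct might contain. The field names are our own; the starter code’s version may differ:

#include <stdbool.h>
#include <stddef.h>

struct mem_header {
    size_t size;              /* size of the whole mapping; needed for munmap later */
    bool free;                /* whether the block is currently on the free list    */
    struct mem_header *next;  /* next block in the linked list                      */
    char name[32];            /* block name, used for debugging later in the lab    */
};

The pointer handed back to the caller is then the address just past the header, e.g. (void *)(header + 1).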

You are given the basic skeleton of a mostly-functioning memory allocator and must enhance it to support:

  1. A free list (linked list of memory blocks)
    • When an allocation is freed, it is added to the head of the list rather than unmapped immediately
  2. Block reuse capability
    • When servicing a malloc request, you should first check the free list for a viable block before mmap()ing a new one
    • Blocks should be removed from the end (tail) of the list. Since additions happen at the head and removals happen at the tail, we are effectively creating a FIFO cache of free blocks.
  3. Threshold-based memory release
    • If a large number of blocks have been freed, you should release memory back to the OS with munmap. We define a “large number” with a configurable threshold.
    • By default, the threshold is 100 blocks. If the free list contains more than 100 blocks, then you should unmap the oldest allocation on the free list to make room for the most recently freed block.
    • The user may supply their own threshold with an environment variable, ALLOC_THRESH. You can use the getenv function to read its value; if it is set, use that value instead of the default (see the sketch following this list).
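
Putting a few of these pieces together, here is a hedged sketch of the threshold lookup and the head-insert / tail-remove list operations. It assumes the mem_header struct sketched earlier; the starter code’s names and structure may differ:

#include <stdbool.h>
#include <stdlib.h>

#define DEFAULT_THRESH 100

/* struct mem_header as sketched earlier in this writeup. */
static struct mem_header *free_head = NULL;  /* newest freed block */
static size_t free_count = 0;

/* Read ALLOC_THRESH from the environment, falling back to the default. */
static size_t get_threshold(void)
{
    char *env = getenv("ALLOC_THRESH");
    return (env != NULL) ? (size_t) strtoul(env, NULL, 10) : DEFAULT_THRESH;
}

/* Add a freed block at the head of the list (newest first). */
static void free_list_push(struct mem_header *block)
{
    block->free = true;
    block->next = free_head;
    free_head = block;
    free_count++;
}

/* Remove and return the oldest block (the tail of the list), or NULL if empty. */
static struct mem_header *free_list_pop_tail(void)
{
    if (free_head == NULL) {
        return NULL;
    }
    struct mem_header *prev = NULL;
    struct mem_header *curr = free_head;
    while (curr->next != NULL) {
        prev = curr;
        curr = curr->next;
    }
    if (prev != NULL) {
        prev->next = NULL;
    } else {
        free_head = NULL;
    }
    free_count--;
    return curr;
}

With helpers like these, free can push onto the head and then unmap the evicted tail whenever free_count exceeds the threshold, while malloc would search from the tail end of the list for a block large enough to reuse.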

Look through the starter code to familiarize yourself with memory allocation basics. To download it directly to your VM, you can use the wget command, and then extract the gzipped tar file with tar xvf allocator-starter.tar.gz.

To build the allocator, you’ll compile like so:

cc allocator.c -Wall -fPIC -shared -o allocator.so 

And then to actually use it:

LD_PRELOAD=$(pwd)/allocator.so ls

(In this example, we’re running ls with our allocator).

Implementation Hints

Instrumenting Your Code

Use the TRACE macro to evaluate whether your allocator is functioning properly; the events you must trace are:

  • Allocating a new block with mmap
  • Reusing a block from the free list
  • Caching a freed block on the free list
  • Unmapping a block and releasing it back to the OS
  • Resizing a block (or leaving it unchanged) in realloc

You can see an example of how this might look in the test run below.
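
In case you are wondering what such a macro looks like, here is an illustrative version gated on the TRACE_ON compile-time flag; the starter code’s TRACE may be defined differently:

#include <stdio.h>

#ifndef TRACE_ON
#define TRACE_ON 1
#endif

/* Print a tagged message to stderr only when tracing is compiled in. */
#define TRACE(fmt, ...) \
    do { \
        if (TRACE_ON) { \
            fprintf(stderr, "[TRACE] %s(): " fmt "\n", __func__, ##__VA_ARGS__); \
        } \
    } while (0)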

Testing your allocator

You’ll need to add a bit of functionality for debugging purposes: a malloc_name(size_t size, char *name) function that makes named allocations (store the name in the header struct), and a print_memory function that prints the current state of the linked list, including memory locations, names, and sizes. You can use block names to confirm that allocations are ordered properly.
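
As a starting point, here is a rough sketch of these two helpers, assuming the header struct and free_head list from the sketches earlier in this writeup:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Like malloc, but records a human-readable name in the block's header. */
void *malloc_name(size_t size, char *name)
{
    void *ptr = malloc(size);
    if (ptr != NULL) {
        struct mem_header *header = (struct mem_header *) ptr - 1;
        strncpy(header->name, name, sizeof(header->name) - 1);
        header->name[sizeof(header->name) - 1] = '\0';
    }
    return ptr;
}

/* Walk the free list and print each block's address, name, and size. */
void print_memory(void)
{
    puts("-- Free list --");
    for (struct mem_header *curr = free_head; curr != NULL; curr = curr->next) {
        printf("[%p] '%s' %zu bytes\n", (void *) curr, curr->name, curr->size);
    }
}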

Once things are looking reasonable, you should be able to run the find command with your memory allocator on a large directory. For example, run:

ALLOC_THRESH=10 LD_PRELOAD=$(pwd)/allocator.so find /

to execute find over the entire file system tree. You may want to compile your allocator with logging and tracing turned off first so it finishes faster:

cc allocator.c -shared -fPIC -DLOGGER=0 -DTRACE_ON=0 -o allocator.so

Next, you can use this program as a test case. Compile your allocator with -DTRACE_ON=1 and then run it:

$ cc allocator-test.c -o allocator-test
$ ALLOC_THRESH=5 LD_PRELOAD=$(pwd)/allocator.so ./allocator-test
[TRACE] malloc(): Allocated block [0x7f2516018000]: 25 bytes
[TRACE] malloc(): Allocated block [0x7f2516017000]: 27 bytes
[TRACE] malloc(): Allocated block [0x7f2516016000]: 29 bytes
[TRACE] malloc(): Allocated block [0x7f2516015000]: 31 bytes
[TRACE] malloc(): Allocated block [0x7f2516014000]: 33 bytes
[TRACE] malloc(): Allocated block [0x7f2516013000]: 35 bytes
[TRACE] free(): Cached free block [0x7f2516018000]: 25 bytes
[TRACE] free(): Cached free block [0x7f2516017000]: 27 bytes
[TRACE] free(): Cached free block [0x7f2516016000]: 29 bytes
[TRACE] free(): Cached free block [0x7f2516015000]: 31 bytes
[TRACE] free(): Cached free block [0x7f2516014000]: 33 bytes
[TRACE] free(): Unmapped block -- [0x7f2516013000]: 35 bytes
[TRACE] malloc(): Reused block -- [0x7f2516018000]: 25 bytes
[TRACE] malloc(): Reused block -- [0x7f2516016000]: 29 bytes
[TRACE] malloc(): Allocated block [0x7f2516013000]: 36 bytes
[TRACE] realloc(): Unchanged ---- [0x7f2516013000]: 36 bytes
[TRACE] malloc(): Allocated block [0x7f2516012000]: 1024 bytes
[TRACE] free(): Cached free block [0x7f2516013000]: 36 bytes
[TRACE] realloc(): Resized block  [0x7f2516013000]: 36 bytes -> 1024 bytes
[TRACE] malloc(): Allocated block [0x7f2516011000]: 1048 bytes
X is: 99
[TRACE] malloc(): Reused block -- [0x7f2516017000]: 27 bytes
[TRACE] free(): Cached free block [0x7f2516018000]: 25 bytes
[TRACE] free(): Cached free block [0x7f2516016000]: 29 bytes
[TRACE] free(): Unmapped block -- [0x7f2516012000]: 1024 bytes
[TRACE] free(): Unmapped block -- [0x7f2516017000]: 27 bytes

Benchmarking

The whole premise of this assignment is that fewer system calls mean less overhead, and therefore better performance. But you shouldn’t trust a theory without testing it first, so we’ll collect some empirical data on just how effective our free list is.

First, install the time command (we’re not going to use the one that is already built into your shell):

$ sudo pacman -Syu time

Now that we have the time command installed, you can measure performance with the following:

$ ALLOC_THRESH=100 LD_PRELOAD=$(pwd)/allocator.so /usr/bin/time find / > /dev/null

NOTE: compile with logging and tracing turned off first!

You’ll get output that looks something like:

0.96user 1.56system 0:02.58elapsed 97%CPU (0avgtext+0avgdata 83532maxresident)
128inputs+0outputs (0major+200778minor)pagefaults 0swaps

So, this run took 2.58 seconds. You can modify the value of ALLOC_THRESH (including setting it to 0 to effectively disable caching) and measure how long each run takes. We need to collect data for a range of thresholds, so the following shell script is a good starting point. For each threshold, it writes the total elapsed time to a file named after the threshold (such as 010.out), which lets us find the best run time.

#!/usr/bin/env bash

rm -f *.out
echo "Starting benchmark"
for (( i = 0; i <= 40; i += 10 )); do
    export ALLOC_THRESH=$i
    output_file=$(printf "%03d.out" "${i}")
    echo -n "${i}..."
    LD_PRELOAD=$(pwd)/allocator.so /usr/bin/time -q -f "%e" \
        -o "${output_file}" find / &> /dev/null
    sed -i "1s/^/${i}\t/" "${output_file}" # Add threshold to beginning of file
done
echo 'done!'

echo 'Combining outputs into benchmark.out...'
cat *.out > benchmark.out

Put the script in a file, such as run.sh, make it executable with chmod +x run.sh, and run it. You’ll notice that the script only runs through the first few thresholds, so you should modify it to do a more exhaustive set of benchmarks.

After doing this, we can find the absolute best run time with:

sort -n -k 2 benchmark.out | head -n1

But absolute best may not be exactly what we want. What are the tradeoffs here?

Finally, to get fancier, you can plot this in your terminal using termgraph:

$ sudo pacman -Sy python-pip
$ python3 -m pip install termgraph
$ ~/.local/bin/termgraph benchmark.out

0  : ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 2.93 
10 : ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 1.85 
20 : ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 1.61 
30 : ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 1.51 
40 : ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 1.41 
50 : ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 1.37 

What to Turn In

Check the following into your lab repo:

Grading and Submission

To receive 80% credit:

To receive 95% credit:

To receive 100% credit: