Lab 5: Performance Benchmarking and Overhead

In this lab, we’ll determine whether the entire class has been built on a foundation of lies or not. There have been several claims made about the performance implications of system calls, but could they really be that significant? Let’s find out.

To complete this lab, you will need to take performance measurements by retrieving the current UNIX timestamp from the hardware real-time clock (RTC). You’ll create a new user space program called benchmark that executes another program, determines how long it ran for, and counts the number of system calls it issued. You can take inspiration from the previous tracer program (you do not have to complete Lab 4 to do this assignment, but may benefit from reading it if you haven’t already).

The Setup

One of the major performance issues we discussed in Lab 2 was the large number of system calls caused by fgets reading character-by-character. To get an estimate of how much of a problem this is when it comes to execution speed, we can store a large file in our OS file system and then see how fast it can be read and printed out with either fgets or getline. As an initial point of comparison, we can also find out how long it takes to run the cat utility on the file (note that cat does not treat newline characters as a special case).

One “large” file that will work for this experiment is a 24-KB excerpt of H. G. Wells' The Time Machine. Download it to your OS directory with wget or curl and then add it to the file system image by editing your Makefile. Look for the recipe that builds fs.img and use README.md as a point of reference for adding the new file. Hint: the first line of the recipe lists its dependencies, and the second line tells make how to build it. You need to update both lines.

Once you’ve successfully copied your reading material into the file system image, start up your OS and try running cat on the file. It will take some time to read and print out on the console.

Tracking System Call Counts

Based on the previous lab, you should be able to easily add a counter to the process struct that you will increment every time a system call is issued by the process. To make your life easier, kernel/syscall.c is a good place to increment the counter. Just like in the previous lab, you will also need a way to access this information later. However, getting the system call count might not be so easy – you want the count when the process finishes, but then isn’t it already dead, gone, kaput, expired, deceased, departed… no more?
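To make the counting idea concrete, here is a small model of a system call dispatcher that can be compiled on a regular UNIX machine. It is only a sketch: proc_model, syscall_model, syscall_table, and the two toy handlers are hypothetical names invented for this example, and the real dispatcher in kernel/syscall.c will look different. The point is simply that the counter is bumped in the one place every system call passes through.

```c
#include <stdint.h>

#define NSYSCALLS 3   /* hypothetical table size for this model */

/* Stand-in for the process struct; only the fields this model needs. */
struct proc_model {
  uint64_t a0;            /* stand-in for the return-value register */
  uint64_t syscall_count; /* new field: system calls issued so far */
};

/* Two toy handlers so the table has something to dispatch to. */
static uint64_t sys_noop(void)   { return 0; }
static uint64_t sys_answer(void) { return 42; }

/* Slot 0 is left empty, so call number 0 is treated as invalid. */
static uint64_t (*syscall_table[NSYSCALLS])(void) = { 0, sys_noop, sys_answer };

/* Dispatch system call number num on behalf of process p. */
void
syscall_model(struct proc_model *p, int num)
{
  p->syscall_count++;   /* counted before dispatch, so invalid calls count too */
  if (num > 0 && num < NSYSCALLS && syscall_table[num]) {
    p->a0 = syscall_table[num]();
  } else {
    p->a0 = (uint64_t)-1;   /* unknown system call */
  }
}
```

Whether invalid system call numbers should be counted is a design choice; this model counts them because the process still trapped into the kernel.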

To work around this issue, we can ~~blatantly steal~~ borrow from Linux and other UNIX-like operating systems. Take a look at man 2 wait and you’ll find something interesting: there is a version of wait that returns resource utilization statistics! This approach makes sense; we want the statistics when the process is finished, and the best way to get that information is when the parent process is calling wait.

To make this happen, add a wait2 system call that waits for a child process to complete and also returns both its (1) exit status and (2) system call count. Model wait2 after the original wait system call. In fact, you should be able to completely replace the old implementation of wait with a call to your new system call: return wait2(addr, 0);. (Since the second parameter is 0, the system call count does not get returned, making it behave exactly like the old wait).

The new concept you’ll learn here is copying information from kernel space to user space. We previously relied on return values to do this, but now we need to be able to return information to a memory address that exists in user space: when you pass in pointers to memory locations to store the exit status and system call count for a process, the kernel can’t simply access that memory directly. Check out the copyout function in kernel/vm.c – this is what you’ll need to get the information back to user space. Use the original wait’s call to copyout as a model for what you need to do.
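The shape of wait2 can be modeled outside the kernel as well. The sketch below is not kernel code: user_mem is a byte array pretending to be user space, copyout_model is a made-up stand-in for copyout (the real one also takes a page table), and child_model, wait2_model, and wait_model are hypothetical names. It illustrates two ideas from the text: each result is copied out only when its destination address is nonzero, and the old wait becomes a one-line wrapper.

```c
#include <stdint.h>
#include <string.h>

#define USER_MEM_SIZE 64

/* Pretend user-space memory; "addresses" are offsets into this array. */
static unsigned char user_mem[USER_MEM_SIZE];

/* Hypothetical stand-in for copyout: copy len bytes from kernel memory
   (src) to user address dstva. Returns 0 on success, -1 if out of range. */
static int
copyout_model(uint64_t dstva, const void *src, uint64_t len)
{
  if (dstva > USER_MEM_SIZE || len > USER_MEM_SIZE - dstva)
    return -1;
  memcpy(user_mem + dstva, src, len);
  return 0;
}

/* The pieces of a finished child that wait2 reports. */
struct child_model {
  int pid;
  int xstate;         /* exit status */
  int syscall_count;  /* system calls issued by the child */
};

/* Returns the child's pid, or -1 if a copy fails.
   An address of 0 means "the caller does not want this value". */
int
wait2_model(const struct child_model *c, uint64_t status_addr, uint64_t count_addr)
{
  if (status_addr != 0 &&
      copyout_model(status_addr, &c->xstate, sizeof(c->xstate)) < 0)
    return -1;
  if (count_addr != 0 &&
      copyout_model(count_addr, &c->syscall_count, sizeof(c->syscall_count)) < 0)
    return -1;
  return c->pid;
}

/* The old wait, expressed in terms of the new call. */
int
wait_model(const struct child_model *c, uint64_t status_addr)
{
  return wait2_model(c, status_addr, 0);
}
```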

Collecting Performance Measurements

Given that you can already retrieve a UNIX timestamp with nanosecond resolution, this part will be easy. If you want to determine how long something takes, simply record when it started, when it ended, and calculate the difference between the two:

uint64 start = time();
thing_one();
thing_two();
etc();
uint64 end = time();
uint64 elapsed = end - start;

If you converted your timestamp into seconds at the system call level, you will want to refactor it so user space gets the full-resolution timestamp (not converted to seconds in advance).
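If the full-resolution timestamp is in nanoseconds (an assumption here; check what your time system call actually returns), converting an elapsed interval to milliseconds for display is a single integer division. The helper below is a minimal sketch; elapsed_ms and NS_PER_MS are names made up for this example.

```c
#include <stdint.h>

#define NS_PER_MS 1000000ULL   /* nanoseconds in one millisecond */

/* Returns the whole milliseconds between two nanosecond timestamps.
   Assumes both values come from the same clock. */
uint64_t
elapsed_ms(uint64_t start_ns, uint64_t end_ns)
{
  if (end_ns < start_ns)
    return 0;   /* clock went backwards; avoid unsigned wraparound */
  return (end_ns - start_ns) / NS_PER_MS;
}
```

Doing the conversion in user space, at the last moment before printing, keeps the system call itself free of unit decisions.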

Building the Benchmark Utility

benchmark will be loosely inspired by tracer from the previous lab. Have the program take command line arguments that determine what to run, and execute them as a child process. In the parent process, collect the performance measurements (child run duration) and report its system call count.

/benchmark cat time-machine.txt

... gigantic amounts of text print ...

He put down his glass, and walked towards the staircase door.
------------------
Benchmark Complete
Time Elapsed: 4982 ms
System Calls: 100
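The fork, exec, wait, and timing structure described above can be sketched with standard POSIX calls so that it runs on an ordinary UNIX machine. This is an analogue, not the lab solution: it uses clock_gettime and waitpid where your OS would use its own time and wait2 system calls, it cannot report a system call count, and run_and_time and now_ns are hypothetical names.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Current monotonic time in nanoseconds. */
static uint64_t
now_ns(void)
{
  struct timespec ts;
  clock_gettime(CLOCK_MONOTONIC, &ts);
  return (uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec;
}

/* Runs argv as a child process and stores how long it took in *elapsed_ns.
   Returns the child's exit status, or -1 if it could not be run or did
   not exit normally. */
int
run_and_time(char *const argv[], uint64_t *elapsed_ns)
{
  uint64_t start = now_ns();

  pid_t pid = fork();
  if (pid < 0)
    return -1;
  if (pid == 0) {
    execvp(argv[0], argv);
    _exit(127);   /* only reached if exec failed */
  }

  int status = 0;
  if (waitpid(pid, &status, 0) < 0)
    return -1;
  *elapsed_ns = now_ns() - start;

  return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Note that the clock starts before fork, so process creation is included in the measurement. Starting it in the parent right after fork is another reasonable choice; just be consistent across runs.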

Making It Go Fast

Now that you can benchmark programs, it’s time to build a new, better, faster version of fgets. One that doesn’t use as many system calls. Here’s a program called catlines.c that uses fgets to read a file line by line:

#include "kernel/fcntl.h"
#include "kernel/types.h"
#include "kernel/stat.h"
#include "user/user.h"

int
main(int argc, char *argv[])
{
  if (argc <= 1) {
    fprintf(2, "Usage: %s filename\n", argv[0]);
    return 1;
  }

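  int fd = open(argv[1], O_RDONLY);
  if (fd < 0) {
    // Bail out early instead of reading from an invalid descriptor.
    fprintf(2, "%s: cannot open %s\n", argv[0], argv[1]);
    return 1;
  }

  char buf[128];
  int line_count = 0;
  // Read one line at a time until fgets reports end of file or an error.
  while (fgets(fd, buf, sizeof(buf)) > 0) {
    printf("Line %d: %s", line_count++, buf);
  }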

  return 0;
}
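One common way to reduce the system call count of a loop like the one above is to read the file in large chunks and hand out lines from an in-memory buffer. The sketch below is one possible starting point, not the expected solution. The names linebuf, lb_init, lb_gets, and LB_SIZE are made up for illustration, and it includes unistd.h so that it builds on a regular UNIX machine (inside your OS, read would presumably come from user/user.h instead).

```c
#include <unistd.h>   /* read() on a regular UNIX machine */

#define LB_SIZE 512   /* hypothetical chunk size; tune it and re-benchmark */

/* Buffered reader state: one read() refills data, many lines are served from it. */
struct linebuf {
  int fd;               /* file descriptor being read */
  int pos;              /* index of the next unread byte in data */
  int len;              /* number of valid bytes in data */
  char data[LB_SIZE];
};

void
lb_init(struct linebuf *lb, int fd)
{
  lb->fd = fd;
  lb->pos = 0;
  lb->len = 0;
}

/* Copies one line (including its '\n', if any) into out, storing at most
   max-1 bytes plus a terminating NUL. Returns the number of bytes stored;
   0 means end of file or a read error with nothing buffered. */
int
lb_gets(struct linebuf *lb, char *out, int max)
{
  int n = 0;
  while (n < max - 1) {
    if (lb->pos >= lb->len) {
      /* Buffer is empty: one system call fetches up to LB_SIZE bytes. */
      int r = (int)read(lb->fd, lb->data, LB_SIZE);
      if (r <= 0)
        break;
      lb->pos = 0;
      lb->len = r;
    }
    char c = lb->data[lb->pos++];
    out[n++] = c;
    if (c == '\n')
      break;
  }
  if (max > 0)
    out[n] = '\0';
  return n;
}
```

With this shape, the number of read calls depends on the file size divided by LB_SIZE rather than on the number of characters, which is the effect the benchmark should make visible.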

Build a similar program but swap the call to fgets with an optimized function that you design. (You don’t need to add this function to ulib.c – you can leave it in the test program). Benchmark the baseline (catlines.c) and compare with subsequent versions of your optimized program. Be sure that your optimized program is correct, i.e., produces the same output! Keep track of the run times and system call counts in a text file like this (benchmark.txt):

@ Time,Syscalls
base  68.9  382
opt1  22.4  88
opt2  13.6  69
opt3  4.20  32

Then you can produce a simple visualization with termgraph. If termgraph isn’t already installed, run python3 -m pip install termgraph on gojira. Then, use it like this:

$ termgraph benchmark.txt --color {blue,red}

▇ Time  ▇ Syscalls


base    : ▇▇▇▇▇▇▇▇▇ 68.90
          ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 382.00
opt1    : ▇▇ 22.40
          ▇▇▇▇▇▇▇▇▇▇▇ 88.00
opt2    : ▇ 13.60
          ▇▇▇▇▇▇▇▇▇ 69.00
opt3    : ▏ 4.20
          ▇▇▇▇ 32.00

(The colors are not shown in the example above)

This will help you track whether the changes you’ve made are making a difference. It’s okay if not every new version of your program is faster; that’s just part of the process.

Once you’ve built something that’s faster and benchmarked it, you’re done.

Grading and Submission

Once you are finished, check your changes into your OS repo. Then have a member of the course staff take a look at your lab to check it.

To receive 50% credit:

Implement the benchmark utility with the ability to track process run time.

To receive 85% credit:

Complete all previous requirements
Implement the wait2 system call and track the total number of system calls in benchmark

To receive full credit for this lab:

Complete all previous requirements
Produce a new version of catlines.c that is faster than the baseline by replacing fgets with an optimized function that issues fewer system calls
Check in your benchmark results to docs/fgets-bench.txt

To receive 105% credit for this lab:

Complete all previous requirements
Produce the fastest optimized version of fgets. You can post your results on CampusWire if you think they’re particularly good. We may award a couple of winners if it’s warranted.