Project 1: Elastic Array & Disk Usage Analyzer (v1.1)

Starter repository on GitHub:

As storage densities continue to increase, so too will humanity’s ability to find new ways to generate more and more data. Storage space often seems unlimited… until it’s not! In this project, we will design a helpful command line utility for users, developers, and system administrators to analyze how their disk space is being used. Here’s a demonstration of the tool, da:

$ ./da -l 15 -s /usr
  /usr/lib/valgrind/libvex-amd64-linux.a      36.5 MiB    Aug 21 2020
                    /usr/lib/      38.6 MiB    Aug 21 2020
                 /usr/lib/      38.6 MiB    Aug 21 2020
                     /usr/bin/containerd      46.9 MiB    Feb 01 2021
                /usr/lib/      47.2 MiB    Mar 08 2021
             /usr/lib/      47.2 MiB    Mar 08 2021
                       /usr/lib/      52.4 MiB    Sep 09 2020
                    /usr/lib/      52.4 MiB    Sep 09 2020
                /usr/lib/      52.4 MiB    Sep 09 2020
.../lib/docker/cli-plugins/docker-buildx      54.2 MiB    Dec 03 2020
                         /usr/bin/docker      71.0 MiB    Dec 03 2020
              /usr/lib/      83.7 MiB    Mar 08 2021
                  /usr/lib/      83.7 MiB    Mar 08 2021
                     /usr/lib/      83.7 MiB    Mar 08 2021
                        /usr/bin/dockerd      84.4 MiB    Feb 01 2021

In this example, the user requested the top 15 files (-l 15), sorted by size (-s) from the /usr directory. If they’re really trying to save space on this machine, then maybe it’s time to remove docker? :-)

The output columns include the file name, file size in human readable units, and the last time the particular file was accessed.

To get a sense of the functionality we will implement, take a look at the help/usage information:

$ ./da -h
Disk Analyzer (da): analyzes disk space usage
Usage: ./da [-ahs] [-l limit] [directory]

If no directory is specified, the current working directory is used.

    * -a              Sort the files by time of last access (descending)
    * -h              Display help/usage information
    * -l limit        Limit the output to top N files (default=unlimited)
    * -s              Sort the files by size (default, ascending)
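The option string implied by the usage text (`[-ahs] [-l limit] [directory]`) could be handled with getopt(3). The following is only a sketch: the `da_options` struct, the `parse_args` function, and the convention that a limit of 0 means "unlimited" are assumptions made for illustration, not part of the starter code.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical container for the parsed command line. */
struct da_options {
    bool sort_by_time;      /* -a given; otherwise sort by size (-s) */
    bool show_help;         /* -h given */
    long limit;             /* -l N; 0 is treated as unlimited here */
    const char *directory;  /* defaults to the current directory */
};

/* Fills in opts from argv. Returns 0 on success, -1 on a bad option. */
int parse_args(int argc, char *argv[], struct da_options *opts)
{
    assert(opts != NULL);
    opts->sort_by_time = false;
    opts->show_help = false;
    opts->limit = 0;
    opts->directory = ".";

    optind = 1; /* restart scanning in case getopt was used before */
    int c;
    while ((c = getopt(argc, argv, "ahl:s")) != -1) {
        switch (c) {
        case 'a':
            opts->sort_by_time = true;
            break;
        case 'h':
            opts->show_help = true;
            break;
        case 's':
            opts->sort_by_time = false;
            break;
        case 'l': {
            char *end;
            long n = strtol(optarg, &end, 10);
            if (end == optarg || *end != '\0' || n < 0) {
                return -1; /* limit was not a non-negative integer */
            }
            opts->limit = n;
            break;
        }
        default:
            return -1; /* unknown option or missing argument */
        }
    }

    /* The first non-option argument, if any, is the directory. */
    if (optind < argc) {
        opts->directory = argv[optind];
    }
    return 0;
}
```

With this sketch, `./da -l 15 -s /usr` would yield a limit of 15, size ordering, and `/usr` as the directory.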

Your implementation will be split into two parts: (1) building an elastic data structure that can store an unbounded number of elements (memory permitting), and (2) directory traversal and disk usage analysis.

You can think of the elastic array as being somewhat analogous to the ArrayList in Java; it will automatically resize, allow a variety of retrieval operations, and provide utility functionality such as retrieving the number of elements, trimming the amount of heap space used to save memory, and sorting the elements. When you are finished, you’ll have produced a reusable library that may be helpful in future C projects.

The Elastic Array

While C has primitive array types, they must be dimensioned in advance and do not support convenience features like appending to the list or retrieving its size. Our goal for the elist library is to fill this gap in functionality. Your elist should support the following functions:

Array elements will have a fixed size; i.e., the expected size of the elements will be provided to elist_create. This could be something like sizeof(int) or even sizeof(struct my_special_struct), but regardless, all elements will consume the same number of bytes on the heap.

Elements added to the list via add or set will be copied onto the list on the heap; your array should not simply store pointers to the elements. This provides the most flexibility, since the user could maintain an array of pointers if that is the behavior they desire. The add_new function will return a pointer to a new, uninitialized memory block in the list so that the user can populate it with data to simplify usage and avoid extra copies when unnecessary:

struct my_struct *s = malloc(sizeof(struct my_struct));
s->memb1 = 123;
s->memb2 = 456;
elist_add(list, s); // 's' is copied into the list
free(s);            // the list holds its own copy, so the original can be released

// vs.

struct my_struct *s = elist_add_new(list);
s->memb1 = 123;
s->memb2 = 456;

The array will start with an initial capacity, and once full you will double the capacity (RESIZE_MULTIPLIER = 2) and realloc the array’s storage. Removing a list element shifts the entire list; empty gaps are not allowed. The array will not be shrunk unless requested via set_capacity, and if elements exist beyond the requested new capacity then they will be freed.
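To make the growth policy concrete, here is one possible sketch of the core of the library. The struct layout, the exact signatures (including the capacity argument to `elist_create` and the 0/-1 return codes), and the choice to reject a capacity of 0 are assumptions made for this example; your starter code may define things differently.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define RESIZE_MULTIPLIER 2

/* Assumed layout for this sketch; the real struct is up to you. */
struct elist {
    size_t capacity;  /* slots allocated */
    size_t size;      /* slots in use */
    size_t item_sz;   /* bytes per element */
    void *data;       /* contiguous element storage on the heap */
};

struct elist *elist_create(size_t capacity, size_t item_sz)
{
    assert(capacity > 0 && item_sz > 0);
    struct elist *list = malloc(sizeof(*list));
    if (list == NULL) {
        return NULL;
    }
    list->data = malloc(capacity * item_sz);
    if (list->data == NULL) {
        free(list);
        return NULL;
    }
    list->capacity = capacity;
    list->size = 0;
    list->item_sz = item_sz;
    return list;
}

/* Reserves the next slot, doubling the storage first if the list is full. */
void *elist_add_new(struct elist *list)
{
    if (list->size == list->capacity) {
        size_t new_cap = list->capacity * RESIZE_MULTIPLIER;
        void *p = realloc(list->data, new_cap * list->item_sz);
        if (p == NULL) {
            return NULL; /* the old storage is still valid */
        }
        list->data = p;
        list->capacity = new_cap;
    }
    return (char *) list->data + list->size++ * list->item_sz;
}

/* Copies one element into the list. Returns 0 on success, -1 on failure. */
int elist_add(struct elist *list, const void *item)
{
    void *slot = elist_add_new(list);
    if (slot == NULL) {
        return -1;
    }
    memcpy(slot, item, list->item_sz);
    return 0;
}

/* Returns a pointer to element idx, or NULL if idx is out of range. */
void *elist_get(struct elist *list, size_t idx)
{
    if (idx >= list->size) {
        return NULL;
    }
    return (char *) list->data + idx * list->item_sz;
}

/* Resizes the storage; elements beyond the new capacity are dropped. */
int elist_set_capacity(struct elist *list, size_t capacity)
{
    if (capacity == 0) {
        return -1; /* simplification: this sketch keeps at least one slot */
    }
    void *p = realloc(list->data, capacity * list->item_sz);
    if (p == NULL) {
        return -1;
    }
    list->data = p;
    list->capacity = capacity;
    if (list->size > capacity) {
        list->size = capacity;
    }
    return 0;
}

void elist_destroy(struct elist *list)
{
    free(list->data);
    free(list);
}
```

Note that the result of realloc is stored in a temporary first: if the call fails, the original block is untouched and the list remains usable.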

There are several C functions that will allow you to manipulate the memory allocated to the list. Some functions you may be interested in investigating include memcpy, memcmp, memmove, and memset.
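As an illustration of where memmove fits, removing an element from a packed buffer amounts to sliding the tail of the buffer down by one slot. The helper below is hypothetical (the name `remove_at` and its signature are not part of the assignment) and works on a raw buffer so that it stands alone; in your library the same logic would operate on the list's internal storage.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: removes element idx from a packed buffer holding
 * count elements of item_sz bytes each. Returns the new element count.
 */
size_t remove_at(void *base, size_t count, size_t item_sz, size_t idx)
{
    assert(base != NULL && idx < count);
    char *dst = (char *) base + idx * item_sz;
    size_t tail = count - idx - 1; /* elements after the removed one */

    /* Source and destination overlap, so memmove is needed, not memcpy. */
    memmove(dst, dst + item_sz, tail * item_sz);
    return count - 1;
}
```

memcpy has undefined behavior when the regions overlap, which is exactly the situation when shifting elements within a single array.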

To allow sorting functionality, you can use qsort(3). The user will provide a comparator that your sort function passes to qsort.
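A comparator receives pointers to two elements and returns a negative, zero, or positive value. The sketch below assumes a hypothetical `file_entry` struct and sorts a plain array; in your library, the sort function would presumably hand the list's own storage, element count, and element size to qsort in the same way.

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/types.h>
#include <time.h>

/* Hypothetical per-file record used only for this example. */
struct file_entry {
    off_t size;
    time_t atime;
};

/* Orders entries by size, smallest first. */
int compare_size(const void *a, const void *b)
{
    const struct file_entry *fa = a;
    const struct file_entry *fb = b;
    /* Compare rather than subtract: subtraction can overflow the int result. */
    return (fa->size > fb->size) - (fa->size < fb->size);
}

/* Orders entries by access time, most recent first. */
int compare_atime_desc(const void *a, const void *b)
{
    const struct file_entry *fa = a;
    const struct file_entry *fb = b;
    return (fa->atime < fb->atime) - (fa->atime > fb->atime);
}

/* Thin wrapper that forwards the caller's comparator to qsort. */
void sort_entries(struct file_entry *entries, size_t n,
                  int (*cmp)(const void *, const void *))
{
    assert(cmp != NULL);
    qsort(entries, n, sizeof(*entries), cmp);
}
```

Keep in mind that qsort is not guaranteed to be stable, so entries that compare equal may appear in any order.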

The Disk Usage Analyzer

The disk analyzer will traverse the file system recursively, locating all the files under a given directory. During traversal, each file’s full path, size, and last access time will be recorded in our elastic array for further inspection, sorting, and final formatting.

You will most likely want to use opendir and readdir to provide this listing, and stat to retrieve access times and file sizes. It is recommended to store this information in a struct for each file, and place the structs in your elastic array.
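One way to structure the traversal is sketched below. The callback type, the `totals` struct, and the function names are invented for this example; in the project, the callback would be replaced by code that appends a struct to your elastic array. The sketch also uses lstat rather than stat so that symbolic links to directories are not followed, which is a design choice of this example rather than a requirement.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <dirent.h>
#include <limits.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

/* Hypothetical callback type: invoked once per regular file found. */
typedef void (*file_cb)(const char *path, const struct stat *st, void *ctx);

/* Hypothetical accumulator used by the example callback below. */
struct totals {
    long files;
    long long bytes;
};

/* Example callback: counts files and sums their sizes. */
void tally_file(const char *path, const struct stat *st, void *ctx)
{
    (void) path;
    struct totals *t = ctx;
    t->files++;
    t->bytes += st->st_size;
}

/* Visits every regular file under dir. Returns -1 if dir cannot be opened. */
int traverse(const char *dir, file_cb cb, void *ctx)
{
    assert(dir != NULL && cb != NULL);
    DIR *d = opendir(dir);
    if (d == NULL) {
        return -1;
    }

    struct dirent *ent;
    while ((ent = readdir(d)) != NULL) {
        /* Skip the self and parent entries to avoid infinite recursion. */
        if (strcmp(ent->d_name, ".") == 0 || strcmp(ent->d_name, "..") == 0) {
            continue;
        }

        char path[PATH_MAX];
        int n = snprintf(path, sizeof(path), "%s/%s", dir, ent->d_name);
        if (n < 0 || (size_t) n >= sizeof(path)) {
            continue; /* path too long for this sketch; skip it */
        }

        struct stat st;
        if (lstat(path, &st) == -1) {
            continue; /* unreadable entry; skip it */
        }

        if (S_ISDIR(st.st_mode)) {
            traverse(path, cb, ctx); /* unreadable subdirectories are ignored */
        } else if (S_ISREG(st.st_mode)) {
            cb(path, &st, ctx);
        }
    }

    closedir(d);
    return 0;
}
```

The `st_size` and `st_atime` fields of the stat struct supply the file size and last access time that the analyzer records.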


Working right to left, the list output shown above is formatted as follows:

To determine the size of the terminal, you can use the following:

/* Requires <stdio.h> and <sys/ioctl.h> */
unsigned short cols = 80;
struct winsize win_sz;
if (ioctl(fileno(stdout), TIOCGWINSZ, &win_sz) != -1) {
    cols = win_sz.ws_col;
}
LOG("Display columns: %d\n", cols);

Note that since this won’t always work (for a variety of reasons), we default to the standard terminal width of 80 columns.

As part of your client code, you will need to write functions to perform unit conversions (bytes to human-readable units, like MiB, GiB, and so on) and format the date strings as shown in the demo above. For the date conversion, you are allowed to use strftime, and snprintf may help simplify your human_readable_size function. You should support units up to ZiB (zebibyte). Note that we are using units based on powers of 2, so the abbreviations will be KiB, MiB, etc. as opposed to KB, MB, and so on.
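The two conversions might look something like the sketch below. The signatures are assumptions (your starter code may declare `human_readable_size` differently), `format_date` is a hypothetical name, and the one-decimal format and the `"%b %d %Y"` date pattern are inferred from the demo output rather than taken from a specification.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

/* Writes bytes into buf using binary units, e.g. "36.5 MiB". */
void human_readable_size(char *buf, size_t buf_sz, double bytes)
{
    static const char *units[] = {
        "B", "KiB", "MiB", "GiB", "TiB", "PiB", "EiB", "ZiB"
    };
    size_t last = sizeof(units) / sizeof(units[0]) - 1;
    size_t i = 0;

    /* Divide by 1024 until the value fits the unit, stopping at ZiB. */
    while (bytes >= 1024.0 && i < last) {
        bytes /= 1024.0;
        i++;
    }
    snprintf(buf, buf_sz, "%.1f %s", bytes, units[i]);
}

/* Writes a date such as "Aug 21 2020" into buf, in local time. */
void format_date(char *buf, size_t buf_sz, time_t when)
{
    assert(buf != NULL && buf_sz > 0);
    struct tm *tm = localtime(&when);
    if (tm == NULL || strftime(buf, buf_sz, "%b %d %Y", tm) == 0) {
        buf[0] = '\0'; /* conversion failed or buffer too small */
    }
}
```

For example, 38273024 bytes is exactly 36.5 × 1024 × 1024, so the first function produces "36.5 MiB" for that input.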

Learning Objectives

Implementation Restrictions

Restrictions: you may use any standard C library functionality. External libraries are not allowed unless permission is granted in advance. If in doubt, ask first. Your code must compile and run on your VM set up with Arch Linux as described in class; code that fails to do so will receive a grade of 0.

Testing Your Code

Check your code against the provided test cases. We’ll have interactive grading for projects, where you will demonstrate program functionality and walk through your logic.

Submission: submit via GitHub by checking in your code before the project deadline.


Extra Credit