Project 1: Elastic Array & Disk Usage Analyzer (v1.0)

Starter repository on GitHub: https://classroom.github.com/a/8GPH5Zwq

As storage densities continue to increase, so too will humanity’s ability to find new ways to generate more and more data. Storage space often seems unlimited… until it’s not! In this project, we will design a helpful command line utility for users, developers, and system administrators to analyze how their disk space is being used. Here’s a demonstration of the tool, da:

$ ./da -t 15 -s /usr
/usr/lib/valgrind/libvex-amd64-linux.a       9.1 MiB    21 Aug 2020
/usr/lib/libclang.so                        38.6 MiB    21 Aug 2020
/usr/lib/libclang.so.10                     38.6 MiB    21 Aug 2020
/usr/bin/containerd                         46.9 MiB    01 Feb 2021
/usr/lib/libclang-cpp.so                    47.2 MiB    15 Feb 2021
/usr/lib/libclang-cpp.so.10                 47.2 MiB    15 Feb 2021
/usr/lib/libgo.so                           52.4 MiB    09 Sep 2020
/usr/lib/libgo.so.16                        52.4 MiB    09 Sep 2020
/usr/lib/libgo.so.16.0.0                    52.4 MiB    09 Sep 2020
/usr/lib/docker/cli-plugins/docker-buildx   54.2 MiB    03 Dec 2020
/usr/bin/docker                             71.0 MiB    03 Dec 2020
/usr/lib/libLLVM-10.0.1.so                  83.7 MiB    15 Feb 2021
/usr/lib/libLLVM-10.so                      83.7 MiB    15 Feb 2021
/usr/lib/libLLVM.so                         83.7 MiB    15 Feb 2021
/usr/bin/dockerd                            84.4 MiB    01 Feb 2021

In this example, the user requested the top 15 files (-t 15), sorted by size (-s) from the /usr directory. If they’re really trying to save space on this machine, then maybe it’s time to remove docker? :-)

The output columns are the file path, the file size in human-readable units, and the date the file was last accessed.

To get a sense of the functionality we will implement, take a look at the help/usage information:

$ ./da -h
Disk Analyzer (da): analyzes disk space usage
Usage: ./da [-ahs] [-t limit] [directory]

If no directory is specified, the current working directory is used.

Options:
    * -a              Sort the files by time of last access
    * -h              Display help/usage information
    * -s              Sort the files by size (default)
    * -t limit        Limit the output to top N files (default=unlimited)
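
Parsing these flags is typically done with getopt(3). A minimal sketch is shown below; the option struct and variable names are illustrative and not part of the starter code.

/* Sketch: parsing the da options with getopt(3). The struct and
 * variable names here are illustrative, not part of the assignment. */
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

struct da_options {
    bool sort_by_access;     /* -a: sort by access time instead of size */
    long limit;              /* -t: top N files; negative means unlimited */
    const char *directory;   /* positional argument; defaults to "." */
};

int main(int argc, char *argv[])
{
    struct da_options opts = { false, -1, "." };

    int c;
    while ((c = getopt(argc, argv, "ahst:")) != -1) {
        switch (c) {
            case 'a': opts.sort_by_access = true;  break;
            case 's': opts.sort_by_access = false; break;  /* size is the default */
            case 't': opts.limit = strtol(optarg, NULL, 10); break;
            case 'h': /* print the usage text shown above */ return 0;
            default:  return 1;
        }
    }

    if (optind < argc) {
        opts.directory = argv[optind];
    }

    printf("dir=%s, limit=%ld, sort_by_access=%d\n",
           opts.directory, opts.limit, (int) opts.sort_by_access);
    return 0;
}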

Your implementation will be split into two parts: (1) building an elastic data structure that can store an unbounded number of elements (memory permitting), and (2) directory traversal and disk usage analysis.

You can think of the elastic array as being somewhat analogous to the ArrayList in Java; it will automatically resize, allow a variety of retrieval operations, and provide utility functionality such as retrieving the number of elements, trimming the amount of heap space used to save memory, and sorting the elements. When you are finished, you’ll have produced a reusable library that may be helpful in future C projects.

The Elastic Array

While C has primitive array types, they must be dimensioned in advance and do not support convenience features like appending to the list or retrieving its size. Our goal for the elist library is to fill this gap in functionality. Your elist should support operations for creating and destroying a list; adding, setting, retrieving, and removing elements; querying the number of elements; adjusting the capacity; and sorting (a sketch of one possible interface follows the next paragraph).

Array elements will have a fixed size; i.e., the expected size of the elements will be provided to elist_create. This could be something like sizeof(int) or even sizeof(struct my_special_struct), but regardless, all elements will occupy the same number of bytes on the heap.
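
The sketch below shows one plausible shape for such an interface. Only elist_create, elist_add, elist_add_new, set, and set_capacity are named in this handout, so the remaining names and signatures are assumptions rather than requirements.

/* elist.h sketch. Only elist_create, elist_add, elist_add_new, set, and
 * set_capacity are named in this handout; everything else is illustrative. */
#include <stddef.h>
#include <sys/types.h>

struct elist;  /* opaque handle; the layout lives in elist.c */

struct elist *elist_create(size_t init_capacity, size_t item_sz);
void elist_destroy(struct elist *list);

ssize_t elist_add(struct elist *list, void *item);   /* copies *item into the list */
void *elist_add_new(struct elist *list);             /* returns an uninitialized slot */
int elist_set(struct elist *list, size_t idx, void *item);
void *elist_get(struct elist *list, size_t idx);
int elist_remove(struct elist *list, size_t idx);    /* shifts later elements down */

size_t elist_size(struct elist *list);               /* number of elements in use */
int elist_set_capacity(struct elist *list, size_t capacity);
void elist_sort(struct elist *list, int (*comparator)(const void *, const void *));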

Elements added to the list via add or set will be copied into the list’s heap storage; your array should not simply store pointers to the elements. This provides the most flexibility, since a user who wants pointer semantics can simply maintain a list of pointers. To simplify usage and avoid unnecessary copies, the add_new function instead returns a pointer to a new, uninitialized memory block inside the list that the caller populates directly:

// Option 1: build the element elsewhere, then copy it in
struct my_struct *s = malloc(sizeof(struct my_struct));
s->memb1 = 123;
s->memb2 = 456;
elist_add(list, s); // 's' is copied into the list
free(s);            // safe to free: the list holds its own copy

// vs.

// Option 2: write directly into the list's storage (no extra copy)
struct my_struct *s = elist_add_new(list);
s->memb1 = 123;
s->memb2 = 456;

The array will start with an initial capacity; once it is full, double the capacity (RESIZE_MULTIPLIER = 2) and realloc the array’s storage. Removing an element shifts the subsequent elements down to fill the gap; empty gaps are not allowed. The array will not be shrunk unless requested via set_capacity, and any elements beyond the requested new capacity are discarded.
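
To make the resizing policy concrete, here is a minimal sketch of how an append could grow the backing storage when the list is full. The struct layout and field names are assumptions used only for illustration, not a required design.

/* Sketch of the doubling growth policy inside an append. The struct
 * layout and field names are assumptions used only for illustration. */
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>

#define RESIZE_MULTIPLIER 2

struct elist {
    size_t capacity;        /* number of slots currently allocated */
    size_t size;            /* number of slots currently in use */
    size_t item_sz;         /* bytes per element */
    void *element_storage;  /* one contiguous heap block */
};

ssize_t elist_add(struct elist *list, void *item)
{
    if (list->size >= list->capacity) {
        size_t new_capacity = list->capacity * RESIZE_MULTIPLIER;
        void *new_storage = realloc(list->element_storage, new_capacity * list->item_sz);
        if (new_storage == NULL) {
            return -1;  /* allocation failed; the list is left unchanged */
        }
        list->element_storage = new_storage;
        list->capacity = new_capacity;
    }

    /* copy the new element into the next free slot */
    char *dest = (char *) list->element_storage + list->size * list->item_sz;
    memcpy(dest, item, list->item_sz);
    return (ssize_t) list->size++;
}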

To allow sorting functionality, you can use qsort(3). The user will provide a comparator that your sort function passes to qsort.
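
For instance, continuing the illustrative struct from the sketch above, the sort function can forward the user’s comparator straight to qsort(3); the file_info struct and comparator below show what a client might pass in, and are not a prescribed design.

/* Sketch: forwarding a user-supplied comparator to qsort(3).
 * Uses the illustrative struct elist fields from the sketch above. */
#include <stdlib.h>
#include <sys/types.h>
#include <time.h>

void elist_sort(struct elist *list, int (*comparator)(const void *, const void *))
{
    qsort(list->element_storage, list->size, list->item_sz, comparator);
}

/* Example client code: a comparator that orders file entries by size. */
struct file_info {
    char path[4096];
    off_t size;
    time_t access_time;
};

int compare_by_size(const void *a, const void *b)
{
    const struct file_info *fa = a;
    const struct file_info *fb = b;
    if (fa->size < fb->size) return -1;
    if (fa->size > fb->size) return 1;
    return 0;
}

/* usage: elist_sort(list, compare_by_size); */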

The Disk Usage Analyzer

The disk analyzer will traverse the file system recursively, locating all the files under a given directory. During traversal, each file’s full path, size, and last access time will be recorded in our elastic array for further inspection, sorting, and final formatting.

You will most likely want to use opendir(3) and readdir(3) to produce this listing, and stat(2) to retrieve access times and file sizes.
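
A minimal traversal sketch is shown below, reusing the illustrative file_info struct and elist functions from the earlier sketches; error handling is abbreviated, and none of the names are prescribed.

/* Traversal sketch using opendir(3), readdir(3), and stat(2). Reuses the
 * illustrative file_info struct and elist functions sketched earlier. */
#include <dirent.h>
#include <limits.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

void traverse(const char *path, struct elist *list)
{
    DIR *dir = opendir(path);
    if (dir == NULL) {
        perror("opendir");
        return;
    }

    struct dirent *entry;
    while ((entry = readdir(dir)) != NULL) {
        /* skip the '.' and '..' entries to avoid infinite recursion */
        if (strcmp(entry->d_name, ".") == 0 || strcmp(entry->d_name, "..") == 0) {
            continue;
        }

        char full_path[PATH_MAX];
        snprintf(full_path, sizeof(full_path), "%s/%s", path, entry->d_name);

        struct stat sb;
        if (stat(full_path, &sb) == -1) {  /* lstat(2) avoids following symlinks */
            continue;
        }

        if (S_ISDIR(sb.st_mode)) {
            traverse(full_path, list);          /* recurse into subdirectories */
        } else if (S_ISREG(sb.st_mode)) {
            struct file_info *info = elist_add_new(list);
            snprintf(info->path, sizeof(info->path), "%s", full_path);
            info->size = sb.st_size;
            info->access_time = sb.st_atime;
        }
    }

    closedir(dir);
}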

As part of your client code, you will need to write functions to perform unit conversions (bytes to human-readable units, like MiB, GiB, and so on) and format the date strings as shown in the demo above.
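
One possible shape for those helpers is sketched below; the function names and signatures are illustrative, and your exact output should be matched against the demonstration at the top of this document.

/* Sketch of the two formatting helpers; the names are illustrative. */
#include <stdio.h>
#include <time.h>

/* Convert a byte count into a human-readable string, e.g. "46.9 MiB". */
void human_readable_size(double size, char *buf, size_t buf_sz)
{
    const char *units[] = { "B", "KiB", "MiB", "GiB", "TiB", "PiB" };
    size_t unit = 0;
    while (size >= 1024.0 && unit < 5) {
        size /= 1024.0;
        unit++;
    }
    snprintf(buf, buf_sz, "%.1f %s", size, units[unit]);
}

/* Format a timestamp as "01 Feb 2021", matching the demo output. */
void format_date(time_t timestamp, char *buf, size_t buf_sz)
{
    struct tm *tm_info = localtime(&timestamp);
    strftime(buf, buf_sz, "%d %b %Y", tm_info);
}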

Implementation Restrictions

You may use any standard C library functionality. External libraries are not allowed unless permission is granted in advance; if in doubt, ask first. Your code must compile and run on your VM set up with Arch Linux as described in class; submissions that do not will receive a grade of 0.

Testing Your Code

Check your code against the provided test cases. We’ll have interactive grading for projects, where you will demonstrate program functionality and walk through your logic.

Submission: submit via GitHub by checking in your code before the project deadline.

Grading

Extra Credit

Changelog