Project 2: Elastic Array & Disk Usage Analyzer

Starter repository on GitHub: https://classroom.github.com/a/pmHQ83g4

As storage densities continue to increase, so too will humanity’s ability to find new ways to generate more and more data. Storage space often seems unlimited… until it’s not! In this project, we will design a helpful command-line utility for users, developers, and system administrators to analyze how their disk space is being used. Here’s a demonstration of the tool, da:

$ ./da -l 15 -s /usr
  32.4 MiB | Aug 21 2022 | /usr/lib/valgrind/libvex-armv8-linux.a      
  36.5 MiB | Aug 21 2022 | /usr/lib/valgrind/libvex-amd64-linux.a      
  38.6 MiB | Aug 21 2022 | /usr/lib/libclang.so      
  38.6 MiB | Aug 21 2022 | /usr/lib/libclang.so.10      
  46.9 MiB | Feb 01 2023 | /usr/bin/containerd      
  47.2 MiB | Mar 08 2023 | /usr/lib/libclang-cpp.so      
  47.2 MiB | Mar 08 2023 | /usr/lib/libclang-cpp.so.10      
  52.4 MiB | Sep 09 2022 | /usr/lib/libgo.so      
  52.4 MiB | Sep 09 2022 | /usr/lib/libgo.so.16      
  52.4 MiB | Sep 09 2022 | /usr/lib/libgo.so.16.0.0      
  71.0 MiB | Dec 03 2022 | /usr/bin/docker      
  83.7 MiB | Mar 08 2023 | /usr/lib/libLLVM-10.0.1.so      
  83.7 MiB | Mar 08 2023 | /usr/lib/libLLVM-10.so      
  83.7 MiB | Mar 08 2023 | /usr/lib/libLLVM.so      
  84.4 MiB | Feb 01 2023 | /usr/bin/dockerd      

In this example, the user requested the top 15 files (-l 15), sorted by size (-s) from the /usr directory. If they’re really trying to save space on this machine, then maybe it’s time to remove docker? :-)

The output columns are the file size in human-readable units, the last time the file was accessed, and the file's full path.

To get a sense of the functionality we will implement, take a look at the help/usage information:

$ ./da -h
Disk Analyzer (da): analyzes disk space usage
Usage: ./da [-ahs] [-l limit] [directory]

If no directory is specified, the current working directory is used.

Options:
    * -a              Sort the files by time of last access (descending)
    * -h              Display help/usage information
    * -l limit        Limit the output to top N files (default=unlimited)
    * -s              Sort the files by size (default, ascending)

Your implementation will be split into two parts: (1) building an elastic data structure that can store an unbounded number of elements (memory permitting), and (2) directory traversal and disk usage analysis. You will be able to leverage your code from the previous project to help you complete the directory traversal.

You can think of the elastic array as being somewhat analogous to the ArrayList in Java; it will automatically resize, allow a variety of retrieval operations, and provide utility functionality such as retrieving the number of elements, trimming the amount of heap space used to save memory, and sorting the elements. When you are finished, you’ll have produced a reusable library that may be helpful in future C projects.

The Elastic Array

While C has primitive array types, they must be dimensioned in advance and do not support convenience features like appending to the list or retrieving its size. Our goal for the elist library is to fill this gap in functionality. Your elist should support the following functions:

Array elements will have a fixed size; i.e., the expected size of the elements will be provided to elist_create. This could be something like sizeof(int) or even sizeof(struct my_special_struct), but regardless, all elements will consume the same number of bytes on the heap.

Elements added to the list via add or set will be copied onto the list on the heap; your array should not simply store pointers to the elements. This provides the most flexibility, since the user could maintain an array of pointers if that is the behavior they desire. The add_new function will return a pointer to a new, uninitialized memory block in the list so that the user can populate it with data to simplify usage and avoid extra copies when unnecessary:

struct my_struct *s = malloc(sizeof(struct my_struct));
s->memb1 = 123;
s->memb2 = 456;
elist_add(list, s); // 's' is copied into the list
free(s);            // the list holds its own copy, so the original can be freed

// vs.

struct my_struct *s = elist_add_new(list);
s->memb1 = 123;
s->memb2 = 456;

The array will start with an initial capacity, and once full you will double the capacity (RESIZE_MULTIPLIER = 2) and realloc the array’s storage. Removing a list element shifts the entire list; empty gaps are not allowed. The array will not be shrunk unless requested via set_capacity, and if elements exist beyond the requested new capacity then they will be freed.
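
To make the copy-in and doubling behavior concrete, here is a minimal sketch. The struct layout, field names, return types, and the elist_get and elist_destroy helpers are all assumptions made for illustration; follow the signatures in the starter repository where they differ.

```c
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>

#define RESIZE_MULTIPLIER 2

/* Illustrative layout; the field names are assumptions. */
struct elist {
    size_t capacity;  /* number of elements the storage can hold */
    size_t size;      /* number of elements currently stored */
    size_t item_sz;   /* size of each element, in bytes */
    void *storage;    /* one contiguous block of capacity * item_sz bytes */
};

struct elist *elist_create(size_t initial_capacity, size_t item_sz)
{
    if (item_sz == 0) {
        return NULL;
    }
    if (initial_capacity == 0) {
        initial_capacity = 1; /* doubling a capacity of 0 would never grow */
    }
    struct elist *list = malloc(sizeof(*list));
    if (list == NULL) {
        return NULL;
    }
    list->storage = malloc(initial_capacity * item_sz);
    if (list->storage == NULL) {
        free(list);
        return NULL;
    }
    list->capacity = initial_capacity;
    list->size = 0;
    list->item_sz = item_sz;
    return list;
}

/* Reserves the next slot, doubling the storage first if it is full.
 * Returns a pointer to the uninitialized slot, or NULL on failure. */
void *elist_add_new(struct elist *list)
{
    if (list->size >= list->capacity) {
        size_t new_cap = list->capacity * RESIZE_MULTIPLIER;
        void *p = realloc(list->storage, new_cap * list->item_sz);
        if (p == NULL) {
            return NULL; /* the old storage is still valid */
        }
        list->storage = p;
        list->capacity = new_cap;
    }
    void *slot = (char *) list->storage + list->size * list->item_sz;
    list->size++;
    return slot;
}

/* Copies item_sz bytes from item into the list.
 * Returns the index of the new element, or -1 on failure. */
ssize_t elist_add(struct elist *list, const void *item)
{
    void *slot = elist_add_new(list);
    if (slot == NULL) {
        return -1;
    }
    memcpy(slot, item, list->item_sz);
    return (ssize_t) (list->size - 1);
}

/* Returns a pointer to element idx, or NULL if idx is out of range. */
void *elist_get(struct elist *list, size_t idx)
{
    if (idx >= list->size) {
        return NULL;
    }
    return (char *) list->storage + idx * list->item_sz;
}

void elist_destroy(struct elist *list)
{
    free(list->storage);
    free(list);
}
```

Note that a pointer returned by elist_add_new or elist_get can be invalidated by a later add, because realloc may move the storage.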

There are several C functions that will allow you to manipulate the memory allocated to the list. Some functions you may be interested in investigating include memcpy, memcmp, memmove, and memset.
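
As an illustration of where memmove fits, the sketch below removes an element from a packed array and shrinks its storage. It works on a raw buffer rather than on the elist itself, and remove_at and resize_storage are hypothetical helper names, not part of the required API.

```c
#include <stdlib.h>
#include <string.h>

/* Removes element idx from a packed array of *count elements, each item_sz
 * bytes, by sliding everything after it down one slot.
 * Returns 0 on success, -1 if idx is out of range. */
int remove_at(void *base, size_t *count, size_t item_sz, size_t idx)
{
    if (idx >= *count) {
        return -1;
    }
    char *dst = (char *) base + idx * item_sz;
    size_t tail_bytes = (*count - idx - 1) * item_sz;
    /* Source and destination overlap, so memmove is needed, not memcpy. */
    memmove(dst, dst + item_sz, tail_bytes);
    (*count)--;
    return 0;
}

/* Resizes the storage to hold new_cap elements. Elements past the new
 * capacity are discarded. Returns 0 on success, -1 on failure. */
int resize_storage(void **base, size_t *count, size_t *capacity,
                   size_t item_sz, size_t new_cap)
{
    if (new_cap == 0) {
        return -1; /* realloc(ptr, 0) is not portable; this sketch rejects it */
    }
    void *p = realloc(*base, new_cap * item_sz);
    if (p == NULL) {
        return -1; /* the old block is untouched */
    }
    *base = p;
    *capacity = new_cap;
    if (*count > new_cap) {
        *count = new_cap;
    }
    return 0;
}
```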

To allow sorting functionality, you can use qsort(3). The user will provide a comparator that your sort function passes to qsort.
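
For reference, a qsort comparator receives pointers to two elements and returns a negative, zero, or positive value. The sketch below assumes a hypothetical struct file_entry; sort_entries stands in for whatever your sort function ends up looking like.

```c
#include <stdlib.h>
#include <sys/types.h>
#include <time.h>

/* Hypothetical record; the fields are illustrative. */
struct file_entry {
    off_t size;
    time_t atime;
};

/* Ascending by size. Comparing (rather than subtracting) avoids overflow
 * when the difference does not fit in an int. */
int compare_size(const void *a, const void *b)
{
    const struct file_entry *fa = a;
    const struct file_entry *fb = b;
    return (fa->size > fb->size) - (fa->size < fb->size);
}

/* Descending by access time, so the most recently accessed file is first. */
int compare_atime_desc(const void *a, const void *b)
{
    const struct file_entry *fa = a;
    const struct file_entry *fb = b;
    return (fb->atime > fa->atime) - (fb->atime < fa->atime);
}

/* Thin wrapper: the caller chooses the ordering by passing a comparator. */
void sort_entries(struct file_entry *entries, size_t n,
                  int (*cmp)(const void *, const void *))
{
    qsort(entries, n, sizeof(*entries), cmp);
}
```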

The Disk Usage Analyzer

The disk analyzer will traverse the file system recursively, locating all the files under a given directory. During traversal, each file’s full path, size, and last access time will be recorded in our elastic array for further inspection, sorting, and final formatting.

You will most likely want to use opendir and readdir to provide this listing, and stat to retrieve access times and file sizes. It is recommended to store this information in a struct for each file, and place the structs in your elastic array.

Output Formatting

Working left to right, the list output shown above is formatted as follows:

Note that you can pass sizes in as part of your format strings, e.g.:

printf("%10s | %11s\n", var1, var2);

would print var1 and var2 right-aligned in 10-character and 11-character columns, respectively. Another fun fact: you can pass a variable width to printf like so: printf("%*s\n", 10, str); prints str in a 10-character column, with the width taken from the argument list.

To make the output more readable for human beings, write functions to perform unit conversions (bytes to human-readable units, like MiB, GiB, and so on) and format the date strings as shown in the demo above. For the date conversion, you are allowed to use strftime, and snprintf may help simplify your human_readable_size function. You should support units up to ZiB (zebibyte). Note that we are using units based on powers of 2, not SI units, so the abbreviations will be KiB, MiB, etc. as opposed to KB, MB, and so on.
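
Here is one possible shape for the two conversion helpers. The signatures, the format_date name, and the "%b %d %Y" format string are assumptions chosen to match the demo output; the starter code may specify something different.

```c
#include <stdio.h>
#include <time.h>

/* Writes size (in bytes) to buf using binary units, e.g. "1.5 KiB".
 * decimals controls the number of digits after the decimal point. */
void human_readable_size(char *buf, size_t buf_sz, double size,
                         unsigned int decimals)
{
    static const char *units[] = {
        "B", "KiB", "MiB", "GiB", "TiB", "PiB", "EiB", "ZiB"
    };
    size_t n_units = sizeof(units) / sizeof(units[0]);

    /* Divide by 1024 until the value fits the unit or we run out of units. */
    size_t i = 0;
    while (size >= 1024.0 && i < n_units - 1) {
        size /= 1024.0;
        i++;
    }
    snprintf(buf, buf_sz, "%.*f %s", (int) decimals, size, units[i]);
}

/* Formats t in local time as "Mon DD YYYY", e.g. "Aug 21 2022".
 * Returns the number of characters written, or 0 on failure. */
size_t format_date(char *buf, size_t buf_sz, time_t t)
{
    struct tm tm;
    if (buf_sz == 0) {
        return 0;
    }
    if (localtime_r(&t, &tm) == NULL) {
        buf[0] = '\0';
        return 0;
    }
    return strftime(buf, buf_sz, "%b %d %Y", &tm);
}
```

Taking the size as a double keeps the division simple and leaves room for values larger than a 64-bit integer can hold.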

Learning Objectives

Implementation Restrictions

Restrictions: you may use any standard C library functionality. External libraries are not allowed unless permission is granted in advance. If in doubt, ask first. Your code must compile and run on your VM set up with Arch Linux as described in class; code that fails to do so will receive a grade of 0.

Testing Your Code

Check your code against the provided test cases. We’ll have interactive grading for projects, where you will demonstrate program functionality and walk through your logic.

Submission: submit via GitHub by checking in your code before the project deadline.

Grading

Check your code against the provided test cases. You should make sure your code runs on your Arch Linux VM.

Your grade is based on:

Changelog