Project 1: Parallel File Search Tool (v 1.1)
Starter repository on GitHub: https://classroom.github.com/a/LRRalqkO
Our journey through the operating system starts in userland (user space), outside the kernel. In this project, we’ll implement a Unix utility that recursively searches for matching words in text files. If you’ve ever used the grep command from a shell, our program will be somewhat similar, except:
- It operates recursively by default (traversing into subdirectories)
- Only entire words are matched, not partial words (i.e., searching for ‘the’ does not match ‘theme’)
- The line number where the matching search term was found is printed
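A good approximation of these features with grep would be using the -Rnw flags, passing each search term with -e (otherwise grep treats everything after the first term as a file name), like:
grep -Rnw -e term1 -e term2 -e term3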
Our version of the tool will make use of multiple threads running in parallel, so we’ll call it prep. To give you an idea of how your program will work, here’s a quick example:
# Searches for hello in all the files located in /etc. Note that case is
# ignored, and the line number where the match was found is also included.
# Line numbers start at 1, not 0.
$ ./prep -d /etc HELLO
/etc/services:1118:hello-port 652/tcp
/etc/services:1119:hello-port 652/udp
/etc/services:2919:hello 1789/tcp
/etc/services:2920:hello 1789/udp
/etc/services:5007:aimpp-hello 2846/tcp
/etc/services:5008:aimpp-hello 2846/udp

# With the -e flag, the match is case-sensitive. No results are returned:
$ ./prep -d /etc -e HELLO

# Here we find a name in three different files.
# Each file will be searched by a different thread:
$ ./prep -d /usr/share manoj
/usr/share/locale/or/LC_MESSAGES/cracklib.mo:8:Last-Translator: Manoj Kumar Giri <email@example.com>
/usr/share/locale/or/LC_MESSAGES/Linux-PAM.mo:37:Last-Translator: Manoj Kumar Giri <firstname.lastname@example.org>
/usr/share/locale/or/LC_MESSAGES/glib20.mo:197:Last-Translator: Manoj Kumar Giri <email@example.com>

# We can specify multiple search terms, of course:
$ ./prep -d /usr/share whitman nutella kapow stranger
/usr/share/cracklib/cracklib-small:47267:stranger
/usr/share/cracklib/cracklib-small:53793:whitman
/usr/share/perl5/core_perl/pod/perlpacktut.pod:596:An even stranger template code is C<%>E<lt>I<number>E<gt>. First, because
/usr/share/perl5/core_perl/pod/perlcall.pod:1462:eventually consume all the available memory in your system--kapow!

# By default, prep will search the current working directory (CWD).
# The full path is always printed.
$ ./prep main
/home/matthew/P1-Solution/prep.c:141:int main(int argc, char *argv)

# We can 'cd' somewhere else and then run prep from there.
# This run also limits the number of threads to 2.
$ cd /etc
$ ~/P1-Solution/prep -e -t2 absolutely
/etc/lvm/lvm.conf:1538: # you are absolutely sure about what you are doing!
/etc/lvm/lvm.conf:1622: # by hand unless you are absolutely sure you know what you are doing!
Note that the output format is:
/absolute/path/to/file:line-number:the entire line the word was found in
An absolute path starts from the root directory: /. You can tell whether a path is absolute or relative by looking at the first character: if it’s /, the path is absolute. Otherwise, it’s relative (e.g., ./blah).
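One possible way to produce the absolute paths required in the output is the POSIX realpath function. The sketch below is only an illustration: the helper names is_absolute and to_absolute are hypothetical, and it assumes a system (such as Linux with glibc) where realpath accepts NULL as its second argument and allocates the result.

```c
#define _XOPEN_SOURCE 700
#include <stdlib.h>

/* Returns 1 if path starts with '/', 0 otherwise. (Hypothetical helper.) */
int is_absolute(const char *path)
{
    return path != NULL && path[0] == '/';
}

/*
 * Resolves path (relative or absolute) to an absolute path.
 * Returns a malloc'd string that the caller must free, or NULL on error.
 * (Hypothetical helper; relies on realpath allocating when passed NULL.)
 */
char *to_absolute(const char *path)
{
    return realpath(path, NULL);
}
```

Note that realpath also resolves symbolic links and "." or ".." components, so the printed path may differ from what the user typed. Resolving the start directory once and appending entry names during traversal is another reasonable design.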
If multiple matches are present on a single line, only print it once. You should also remove punctuation when you are searching for words; the punctuation removed in the examples above is:
Along with spaces.
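The whole-word matching described above could be sketched roughly as follows. This is not the reference solution: the function name line_has_word is hypothetical, and the DELIMS set is an assumption chosen to be consistent with the example output (for instance, '-' must split "hello-port"), not the punctuation list from this specification.

```c
#define _XOPEN_SOURCE 700
#include <stdlib.h>
#include <string.h>
#include <strings.h>

/* ASSUMED delimiter set: whitespace plus common punctuation. */
#define DELIMS " \t\r\n.,;:!?-\"'()[]{}<>/"

/*
 * Returns 1 if term occurs as a whole word in line, 0 if it does not,
 * and -1 if memory could not be allocated. When exact_case is 0 the
 * comparison ignores case. (Hypothetical helper.)
 */
int line_has_word(const char *line, const char *term, int exact_case)
{
    char *copy = strdup(line);          /* strtok_r modifies its input */
    if (copy == NULL) {
        return -1;
    }

    int found = 0;
    char *save = NULL;
    for (char *tok = strtok_r(copy, DELIMS, &save);
         tok != NULL;
         tok = strtok_r(NULL, DELIMS, &save)) {
        int cmp = exact_case ? strcmp(tok, term) : strcasecmp(tok, term);
        if (cmp == 0) {
            found = 1;                  /* stop at the first match */
            break;
        }
    }

    free(copy);
    return found;
}
```

Because the function stops at the first matching word, a caller that prints the line whenever it returns 1 prints each line at most once per term. strtok_r is used rather than strtok because several threads will be tokenizing lines at the same time.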
Since this is a parallel search, your implementation should detect the number of cores on the machine and use that number as the default upper bound on the threads the program runs at once. For each file that you find (recursively), launch a thread that looks for occurrences of the search term(s) specified. If every thread slot is already in use, wait until a running thread finishes before starting another. A POSIX semaphore (sem_t, from <semaphore.h>), used together with pthreads, is a good way to accomplish this.
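The thread-limiting idea can be sketched with a counting semaphore: the main thread takes a slot before each pthread_create, and each worker gives its slot back when it finishes. Everything below is illustrative. The names slots, worker, and run_limited are hypothetical, the worker only sleeps instead of searching a file, and the peak counter exists purely so the limit can be observed.

```c
#define _XOPEN_SOURCE 700
#include <pthread.h>
#include <semaphore.h>
#include <time.h>

static sem_t slots;                                   /* free thread slots */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static int active = 0;                                /* workers running now */
static int peak = 0;                                  /* highest value of active */

/* Stand-in worker: a real one would search the file passed through arg. */
static void *worker(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&lock);
    if (++active > peak) {
        peak = active;
    }
    pthread_mutex_unlock(&lock);

    struct timespec ts = { 0, 1000000 };              /* pretend to work: 1 ms */
    nanosleep(&ts, NULL);

    pthread_mutex_lock(&lock);
    active--;
    pthread_mutex_unlock(&lock);

    sem_post(&slots);                                 /* give the slot back */
    return NULL;
}

/*
 * Starts jobs workers, never more than max_threads at once.
 * Returns the peak number of simultaneous workers, or -1 on error.
 */
int run_limited(int jobs, int max_threads)
{
    if (max_threads < 1 || sem_init(&slots, 0, (unsigned)max_threads) != 0) {
        return -1;
    }
    active = 0;
    peak = 0;

    pthread_attr_t attr;
    pthread_attr_init(&attr);
    pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED);

    for (int i = 0; i < jobs; i++) {
        sem_wait(&slots);                             /* blocks while all slots are taken */
        pthread_t tid;
        if (pthread_create(&tid, &attr, worker, NULL) != 0) {
            sem_post(&slots);                         /* the slot was never used */
        }
    }

    /* Once the main thread holds every slot, all workers have finished. */
    for (int i = 0; i < max_threads; i++) {
        sem_wait(&slots);
    }

    pthread_attr_destroy(&attr);
    sem_destroy(&slots);
    return peak;
}
```

Depending on the toolchain, this may need to be built with the -pthread flag. Detached threads are used here so that no pthread_join bookkeeping is needed; keeping the pthread_t values and joining them is an equally valid design.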
In this assignment, you will get experience working with:
- readdir functions for listing directory contents
- stat for getting file information
- Argument parsing
- Detecting active CPU cores on a machine
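For the last item, one common approach on Linux is sysconf with _SC_NPROCESSORS_ONLN. This is a sketch under the assumption that the target system provides that constant (glibc does, but it is not required by POSIX); the function name default_thread_limit is hypothetical.

```c
#include <unistd.h>

/*
 * Returns the number of CPU cores currently online, or 1 if the value
 * cannot be determined. (Hypothetical helper; _SC_NPROCESSORS_ONLN is a
 * widely available extension rather than a POSIX requirement.)
 */
int default_thread_limit(void)
{
    long n = sysconf(_SC_NPROCESSORS_ONLN);
    return n < 1 ? 1 : (int)n;
}
```

Falling back to 1 keeps the program usable (as a sequential search) even when the core count is unavailable.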
There are a few other features you need to implement. We’ll let the program do the talking by printing usage information (-h option):
$ ./prep -h
Usage: ./prep [-eh] [-d directory] [-t threads] search_term1 search_term2 ... search_termN

Options:
    * -d directory    specify start directory (default: CWD)
    * -e              print exact case matches only
    * -h              show usage information
    * -t threads      set maximum threads (default: num CPUs)

# Note that ANY time the user passes in -h, you'll ignore the other options:
$ ./prep -e -t 4 -d / -h
(displays help, and exits)
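The options shown in the usage text could be parsed with getopt along these lines. This is a sketch, not the required design: struct options and parse_args are hypothetical names, and the choice to keep scanning after an invalid option (so that a later -h is still seen) is an assumption about how to honor the "ANY time" rule.

```c
#define _XOPEN_SOURCE 700
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical container for the parsed command line. */
struct options {
    const char *dir;     /* start directory, NULL means CWD */
    int exact;           /* 1 if -e was given */
    int help;            /* 1 if -h was given */
    int threads;         /* 0 means "use the default" */
    int first_term;      /* index in argv of the first search term */
};

/*
 * Fills opts from argv. Returns 0 on success, -1 if an invalid option or
 * thread count was seen. opts->help is set whenever -h appears, so the
 * caller can check it first and ignore everything else.
 */
int parse_args(int argc, char *argv[], struct options *opts)
{
    opts->dir = NULL;
    opts->exact = 0;
    opts->help = 0;
    opts->threads = 0;

    int invalid = 0;
    int c;
    optind = 1;                      /* restart scanning on repeated calls */
    while ((c = getopt(argc, argv, "d:eht:")) != -1) {
        switch (c) {
        case 'd':
            opts->dir = optarg;
            break;
        case 'e':
            opts->exact = 1;
            break;
        case 'h':
            opts->help = 1;
            break;
        case 't':
            opts->threads = atoi(optarg);
            if (opts->threads < 1) {
                invalid = 1;         /* non-numeric or non-positive count */
            }
            break;
        default:                     /* getopt already printed a message */
            invalid = 1;
            break;
        }
    }

    opts->first_term = optind;       /* remaining arguments are search terms */
    return invalid ? -1 : 0;
}
```

A caller would typically print the usage text and exit if opts.help is set or parse_args returns -1, then treat argv[opts.first_term] through argv[argc - 1] as the search terms.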
Testing Your Code
You should make sure your code runs on the Raspberry Pi. We’ll have interactive grading for projects, where you will demonstrate program functionality and walk through your logic.
Our recommendation is to start with the directory listing. Next, implement the word search functionality. Finally, parallelize your logic using pthreads.
Submission: submit via GitHub by checking in your code before the project deadline. You must include a makefile with your project. As part of the testing process, we will check out your code and run make to build it.
- 4 pts - Recursive directory listing logic
- 4 pts - Opening and splitting files into lines and words
- 2 pts - Locating search terms and reporting (printing) the results
- 1 pts - Exact case match
- 1 pts - Configurable start directory
- 3 pts - Parallelizing the search
- 2 pts - Limiting the maximum number of threads started
- 1 pts - Usage function and detecting invalid options
- 1 pts - Makefile
- 2 pts - Proper cleanup: directory entries, memory, etc. No memory leaks.
- 3 pts - Code formatting, organization, and documentation:
- Each function should have a description of its inputs, outputs, and purpose (unless it’s so trivial that no explanation is required).
- Complicated code segments should include comments to describe functionality.
- No dead, leftover, or unnecessary code.
- 1 pts - Lab Checkpoint: Demonstrate that you can print a recursive directory listing using readdir within two weeks.
- 1 pts - Colorize the output (like grep does), highlighting words that match.
- 1 pts - Binary file detection. Without this modification, prep will print binary file data, which can break your terminal. Detect binary files and print a message, e.g., ‘Binary file somefilename.bin matches’ instead of the matching line.
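For the binary file detection item, one simple heuristic is to treat a file as binary if a NUL byte appears near its beginning. The sketch below assumes that heuristic, which is not prescribed by this document; the function name looks_binary and the 4096-byte sample size are likewise hypothetical choices.

```c
#include <stdio.h>
#include <string.h>

/*
 * Heuristic check (an assumption, not part of the specification): a file
 * is considered binary if its first 4096 bytes contain a NUL byte.
 * Returns 1 for binary, 0 for text, -1 if the file cannot be opened.
 */
int looks_binary(const char *path)
{
    FILE *fp = fopen(path, "rb");
    if (fp == NULL) {
        return -1;
    }

    unsigned char buf[4096];
    size_t n = fread(buf, 1, sizeof buf, fp);   /* n may be 0 for empty files */
    fclose(fp);

    return memchr(buf, '\0', n) != NULL;
}
```

A search thread could call this before reading lines and, when it returns 1, print the "Binary file ... matches" message on a match instead of the matching line.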
Restrictions: you may use any standard C library functionality. External libraries are not allowed unless permission is granted in advance. Your code must compile and run on your Raspberry Pi set up with Arch Linux as described in class – code that fails to do so will receive a grade of 0.
- Added output format info and punctuation removal (9/11)
- Initial project specification posted (8/29)