Logs, Passwords, and MapReduce Jobs
Create repository on GitHub: https://classroom.github.com/a/eF4HyWxB
In this lab, you’ll write three MapReduce jobs:
- MapReduce implementation of the Log Analyzer lab
- Pwned password detector
- Analysis job on a dataset of your choosing
Log Analyzer 2.0
For this part of the lab, we’ll compare our previous single-machine Log Analyzer with a distributed MapReduce version. To simplify this, you only need to provide:
- The number of unique domains (i.e., site.com)
- Top 10 websites, based on the number of times their domain appears in the logs.
There is a good chance you will need two MapReduce jobs for this: one to count how many times each domain appears, and another to sort by count (rather than by domain). You can chain the two jobs if you wish.
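As a rough illustration of the two-job idea, here is a minimal sketch in plain Python that simulates the map, shuffle, and reduce phases in a single process. It is not tied to any particular MapReduce framework, and every name in it (`run_job`, `map_domain`, `top_domains`, and so on) is hypothetical. It also assumes each input record is already a full URL with a scheme such as `http://`; your log format will likely need its own parsing step first.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def run_job(records, mapper, reducer):
    """Simulate one MapReduce job locally: map, group by key, reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    output = []
    for key in sorted(groups):  # stand-in for the framework's sort/shuffle
        output.extend(reducer(key, groups[key]))
    return output


def map_domain(url):
    """Job 1 mapper: emit (domain, 1). Assumes the URL includes a scheme."""
    host = urlsplit(url).hostname
    if host:
        yield host, 1


def reduce_sum(domain, counts):
    """Job 1 reducer: total the occurrences of one domain."""
    yield domain, sum(counts)


def map_by_count(pair):
    """Job 2 mapper: re-key by negated count so the largest counts sort first."""
    domain, count = pair
    yield -count, domain


def reduce_emit(neg_count, domains):
    """Job 2 reducer: restore the count and emit (domain, count) pairs."""
    for domain in sorted(domains):
        yield domain, -neg_count


def top_domains(urls, n=10):
    """Chain both jobs and keep the first n results."""
    counts = run_job(urls, map_domain, reduce_sum)
    return run_job(counts, map_by_count, reduce_emit)[:n]
```

The number of unique domains is simply the number of output records from the first job (`len(run_job(urls, map_domain, reduce_sum))`). On a real cluster the second job's output is spread across reducers, so getting a single globally sorted list may require one reducer or a final merge step.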
Add the final results to your README.md file. You should also provide a run time comparison between the two approaches so we can evaluate whether using MapReduce was worth it or not.
Pwned Password Detector
You may have noticed one common feature in all the distributed systems we’ve studied this semester: security. Or to be more specific, the lack thereof. For many of these systems, security is an afterthought — or even worse — not necessary because the system is supposed to run in a “trusted environment” (whatever that means…). Given this general disinterest in security measures, it’s not that surprising that we live in a world where data breaches are commonplace.
Since big companies aren’t worried about security, we really have two options: (1) give up, or (2) take matters into our own hands (with most people choosing the first option, probably). For example, you can visit haveibeenpwned.com to find out if the password that you keep reusing has been published in a data breach. However, how do we trust the authors of that site? It’s probably best to trust no one. (And maybe go live in a cabin in the mountains, with no electric devices.)
For this part of the assignment, use the pwned password database here (roughly 35 GB):
orion03:/bigdata/datasets/pwnedpasswords.txt
This file contains a list of SHA-1 hashes of known passwords that have been leaked in data breaches. The format looks like:
000000005AD76BD555C1D6D771DE417A4B87E4B4:10
00000000A8DAE4228F821FB418F59826079BF368:4
00000000DD7F2A1C68A35673713783CA390C9E93:873
00000001E225B908BAC31C56DB04D892E47536E0:6
00000006BAB7FC3113AA73DE3589630FC08218E7:3
Create a MapReduce job that takes a target password as a command line input, performs a SHA-1 hash on it, and then reports (1) whether or not the password is in the database, and (2) how many breaches it has been in (that’s the number that comes after the SHA-1 hash, separated by a colon `:` character).
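The hashing and matching logic might look something like the sketch below, written in plain Python and independent of any MapReduce framework. The function names (`sha1_hex`, `map_match`, `check_password`) are hypothetical, and the sketch assumes passwords are UTF-8 encoded before hashing and that the database stores hashes as upper-case hex, as in the sample lines above.

```python
import hashlib


def sha1_hex(password):
    """Return the upper-case hex SHA-1 digest (assumes UTF-8 encoding)."""
    return hashlib.sha1(password.encode("utf-8")).hexdigest().upper()


def map_match(line, target_hash):
    """Mapper: emit (hash, count) only when the line's hash equals the target."""
    digest, _, count = line.strip().partition(":")
    if digest == target_hash and count.isdigit():
        yield digest, int(count)


def check_password(password, lines):
    """Run the mapper over every line and reduce the matches to (found, count)."""
    target = sha1_hex(password)
    counts = [count for line in lines for _, count in map_match(line, target)]
    return bool(counts), sum(counts)
```

Because the target hash is computed once before any lines are read, a real job would typically pass it to the mappers through the job configuration rather than rehashing the password for every record.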
Update your README.md file with the password you use for your bank account, and whether or not it appears in the database. I’m not joking. Security is not a joke, and I need that bank account password!
Choose Your Own Adventure
For your final MapReduce job, you should find an interesting dataset and perform analysis on it. The dataset does not necessarily have to qualify as “big data” (but it’s a great thing if it does), and your analysis does not have to be more complicated or deeper than the other jobs in this lab.
You should explain what your dataset is, what you hope to learn from it, and whether the results you got support your hypothesis. Add this discussion to your README.md file.
Submission
Check your code and analysis into your repository. Don’t forget that you have three bits of information to add to the README.md file:
- Log Analyzer results and performance comparison
- Whether or not any passwords you use (or found by experimentation) are in the pwned password database
- Dataset info and analysis for your self-directed MapReduce job