Lab 1: Log Analyzer
Create repository on GitHub: https://classroom.github.com/a/zaGWaxqe
One day, two USF students named Harry Mage and Sally Flin decide to create a new web search engine based on a secret algorithm they invented. Naturally, any great search engine is backed by an equally great web crawler – a bot that collects every page it can find on the Internet for indexing and analysis. After building their web crawler and setting it loose on the Internet, they decide to use the logs generated by the bot to create a dashboard with some general information about their web index.
The log files look something like this:
2023-01-24 21:04 129.82.45.181 https://siteA.com/something.html
2023-01-24 22:11 99.47.169.73 https://siteA.com/wiki/
2023-01-24 23:55 129.82.45.181 https://siteB.net/things/
2023-01-24 23:58 129.82.45.181 https://siteB.net/hat124.html
2023-01-24 21:04 129.84.45.181 https://siteB.net/hat16.html
2023-01-24 22:11 99.47.169.73 https://siteB.net/hat2.html
2023-01-24 23:55 129.82.45.181 https://siteA.com/faq.html
2023-01-24 23:58 129.82.45.181 https://siteC.org/test.html
2023-01-24 20:14 129.82.45.181 https://siteC.org/x.php
2023-01-24 20:26 129.82.45.181 https://segv.cloud/
2023-01-24 20:45 129.82.45.181 https://testsite.com/
2023-01-24 20:46 129.82.45.183 https://siteB.com/hat3.html
2023-01-24 21:04 129.82.45.181 https://siteF.net/secret-page.aspx
2023-01-24 22:11 99.47.169.73 https://siteF.net/not-secret.php
2023-01-24 23:55 129.82.45.189 https://siteB.com/hat9.html
2023-01-24 23:58 129.82.45.181 https://siteB.com/hat124.html
In other words, each log entry contains four fields: the date the page was last indexed, the time it was indexed, the IP address of the bot that indexed it, and the page URL.
Each field is separated by whitespace – a tab character (\t) or one or more spaces.
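Since fields are separated by arbitrary whitespace, Go's strings.Fields handles the splitting for you. The sketch below shows one possible way to pull the four fields out of a line; the function and variable names here are illustrative, not part of the lab requirements.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// parseLine splits one log entry into its four whitespace-separated
// fields: date, time, bot IP, and page URL.
func parseLine(line string) (date, clock, ip, rawURL string, ok bool) {
	fields := strings.Fields(line) // splits on tabs and runs of spaces
	if len(fields) != 4 {
		return "", "", "", "", false
	}
	return fields[0], fields[1], fields[2], fields[3], true
}

func main() {
	line := "2023-01-24 21:04 129.82.45.181 https://siteA.com/something.html"
	_, _, ip, raw, ok := parseLine(line)
	if !ok {
		panic("malformed log line")
	}
	u, err := url.Parse(raw)
	if err != nil {
		panic(err)
	}
	fmt.Println(ip, u.Host) // u.Host is the domain, e.g. siteA.com
}
```

Note that net/url's Parse gives you the domain (the Host field) without any manual string slicing, which you'll need for the unique-domain and top-website counts.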
Your Job
Write a Go program that takes one or more log files as its command line arguments and produces a report that contains:
- Number of unique URLs in the logs
- Number of unique domains (i.e., site.com)
- Top 10 websites, based on the number of times their domain appears in the logs
- Top 5 busiest crawler bots, based on how often their IP appears in the logs
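Both "top 10 websites" and "top 5 crawlers" boil down to ranking the keys of a count map. One simple approach, sketched below with made-up data (the helper name and tie-breaking rule are choices, not requirements), is to collect the keys and sort them by count:

```go
package main

import (
	"fmt"
	"sort"
)

// topN returns the n keys with the highest counts, ties broken
// alphabetically so the output is deterministic.
func topN(counts map[string]int, n int) []string {
	keys := make([]string, 0, len(counts))
	for k := range counts {
		keys = append(keys, k)
	}
	sort.Slice(keys, func(i, j int) bool {
		if counts[keys[i]] != counts[keys[j]] {
			return counts[keys[i]] > counts[keys[j]]
		}
		return keys[i] < keys[j]
	})
	if len(keys) > n {
		keys = keys[:n]
	}
	return keys
}

func main() {
	// Hypothetical tallies built while reading the logs.
	domains := map[string]int{"siteB.net": 4, "siteA.com": 3, "siteC.org": 2}
	fmt.Println(topN(domains, 2))
}
```

The same map-of-counts idea covers the unique-URL and unique-domain totals: the number of unique items is just len of the corresponding map.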
Here’s a demo run of the program. Your output formatting doesn’t have to match exactly; it just needs to include the same information shown below:
$ go run lab1.go log1.txt log2.txt
Reading log1.txt...
Reading log2.txt...
* Unique URLs: 182
* Unique Domains: 14
* Top 10 Websites:
- Site1.com
- Site2.com
...
- Site10.net
* Top 5 crawlers:
- 129.82.45.180
- 129.82.11.60
- 138.202.169.10
- 99.47.169.73
- 99.47.169.74
Completed in 5.8s.
Don’t worry too much about building the most efficient or polished program. The main goal of this assignment is to help you acclimate to writing Go code and to think about the problems you might face when dealing with large datasets.
Thought Experiment
After completing the functionality described above, reflect on your approach. Can you think of any weaknesses? Would your program be able to handle very large log files – for instance, 100 GB or 1 TB? If not, how would you handle that situation?
Submission
Commit lab1.go to your GitHub repository and don’t forget to edit its README.md with your answer to the question above.