Lab 1: Log Analyzer
Create repository on GitHub: https://classroom.github.com/a/zaGWaxqe
One day, two USF students named Harry Mage and Sally Flin decide to create a new web search engine based on a secret algorithm they invented. Naturally, any great search engine is backed by an equally great web crawler – a bot that collects every page it can find on the Internet for indexing and analysis. After building their web crawler and setting it loose on the Internet, they decide to use the logs generated by the bot to create a dashboard with some general information about their web index.
The log files look something like this:
2023-01-24 21:04 129.82.45.181 https://siteA.com/something.html
2023-01-24 22:11 99.47.169.73 https://siteA.com/wiki/
2023-01-24 23:55 129.82.45.181 https://siteB.net/things/
2023-01-24 23:58 129.82.45.181 https://siteB.net/hat124.html
2023-01-24 21:04 129.84.45.181 https://siteB.net/hat16.html
2023-01-24 22:11 99.47.169.73 https://siteB.net/hat2.html
2023-01-24 23:55 129.82.45.181 https://siteA.com/faq.html
2023-01-24 23:58 129.82.45.181 https://siteC.org/test.html
2023-01-24 20:14 129.82.45.181 https://siteC.org/x.php
2023-01-24 20:26 129.82.45.181 https://segv.cloud/
2023-01-24 20:45 129.82.45.181 https://testsite.com/
2023-01-24 20:46 129.82.45.183 https://siteB.com/hat3.html
2023-01-24 21:04 129.82.45.181 https://siteF.net/secret-page.aspx
2023-01-24 22:11 99.47.169.73 https://siteF.net/not-secret.php
2023-01-24 23:55 129.82.45.189 https://siteB.com/hat9.html
2023-01-24 23:58 129.82.45.181 https://siteB.com/hat124.html
In other words, each log entry contains four fields: the date the page was last indexed, the time it was indexed, the IP address of the bot that indexed it, and the page URL.
Each field is separated by whitespace – a tab character (\t) or one or more spaces.
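Since fields are separated by arbitrary whitespace, Go's strings.Fields handles the splitting for you. The sketch below shows one possible way to pull the four fields out of a line; the function and variable names here are illustrative, not part of the lab requirements.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// parseLine splits one log entry into its four whitespace-separated
// fields: date, time, bot IP, and page URL.
func parseLine(line string) (date, clock, ip, rawURL string, ok bool) {
	fields := strings.Fields(line) // splits on tabs and runs of spaces
	if len(fields) != 4 {
		return "", "", "", "", false
	}
	return fields[0], fields[1], fields[2], fields[3], true
}

func main() {
	line := "2023-01-24 21:04 129.82.45.181 https://siteA.com/something.html"
	_, _, ip, raw, ok := parseLine(line)
	if !ok {
		panic("malformed log line")
	}
	u, err := url.Parse(raw)
	if err != nil {
		panic(err)
	}
	fmt.Println(ip, u.Host) // u.Host is the domain, e.g. siteA.com
}
```

Note that net/url's Parse gives you the domain (the Host field) without any manual string slicing, which you'll need for the unique-domain and top-website counts.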
Your Job
Write a Go program that takes one or more log files as its command line arguments and produces a report that contains:
- Number of unique URLs in the logs
- Number of unique domains (i.e., site.com)
- Top 10 websites, based on the number of times their domain appears in the logs
- Top 5 busiest crawler bots, based on how often their IP appears in the logs
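Both "top 10 websites" and "top 5 crawlers" boil down to ranking the keys of a count map. One simple approach, sketched below with made-up data (the helper name and tie-breaking rule are choices, not requirements), is to collect the keys and sort them by count:

```go
package main

import (
	"fmt"
	"sort"
)

// topN returns the n keys with the highest counts, ties broken
// alphabetically so the output is deterministic.
func topN(counts map[string]int, n int) []string {
	keys := make([]string, 0, len(counts))
	for k := range counts {
		keys = append(keys, k)
	}
	sort.Slice(keys, func(i, j int) bool {
		if counts[keys[i]] != counts[keys[j]] {
			return counts[keys[i]] > counts[keys[j]]
		}
		return keys[i] < keys[j]
	})
	if len(keys) > n {
		keys = keys[:n]
	}
	return keys
}

func main() {
	// Hypothetical tallies built while reading the logs.
	domains := map[string]int{"siteB.net": 4, "siteA.com": 3, "siteC.org": 2}
	fmt.Println(topN(domains, 2))
}
```

The same map-of-counts idea covers the unique-URL and unique-domain totals: the number of unique items is just len of the corresponding map.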
Here’s a demo run of the program. Your output formatting doesn’t have to match exactly; it just needs to include the same information shown below:
$ go run lab1.go log1.txt log2.txt
Reading log1.txt...
Reading log2.txt...
* Unique URLs: 182
* Unique Domains: 14
* Top 10 Websites:
- Site1.com
- Site2.com
...
- Site10.net
* Top 5 crawlers:
- 129.82.45.180
- 129.82.11.60
- 138.202.169.10
- 99.47.169.73
- 99.47.169.74
Completed in 5.8s.
Don’t worry too much about building the most efficient or polished program. The main goal of this assignment is to help you acclimate to writing Go code and to think about the problems you might face when dealing with large datasets.
Thought Experiment
After completing the functionality described above, reflect on your approach. Can you think of any weaknesses? Would your program be able to handle very large log files – for instance, 100 GB or 1 TB? If not, how would you handle that situation?
Submission
Commit lab1.go to your GitHub repository and don’t forget to edit its README.md with your answer to the question above.