Project 4: Cloud Storage Service (v1.0)
Starter repository on GitHub: https://classroom.github.com/a/SgbJsTcw
In this project, you will build your own cloud storage system similar to Dropbox or Google Drive, with resilient, replicated backend servers and a command line client application. Specific features of the system include:
- Storage: you should be able to `put` any type of file (text, images, binaries, and so on). Given enough disk space (and time to transfer the data), you should be able to support arbitrarily large file sizes.
- Retrieval: beyond usual file retrievals with `get`, you should be able to `search` and `list` the files in the system.
- Scalability: concurrent storage and retrieval operations should be supported, as well as handling multiple clients.
- Replication: backend servers will ensure that all files are replicated for fault tolerance. If a backend server goes down, you should be able to contact a replica for the file.
- Resiliency: your cloud storage system should be resilient to disk or memory failures. That means that if a file is stored on a disk and gets corrupted, you should be able to detect the corruption, retrieve a replica, and repair the file.
You will need to use the Go standard library and socket programming to complete this project. However, external libraries may be allowed if they do not implement core functionality (ask first!).
NOTE: you are allowed to work in teams of 2 for this project if you’d like.
Components
Your cloud storage system will have two components:
- Storage Server: handles storage/retrieval/search operations. Will replicate files to another storage server instance.
  - You should have at least 2 storage servers. When a storage server starts up, you should provide it with a list of `hostname:port` pairs of the other storage server(s).
- Client: can send requests to any of the storage servers. You will likely want to supply the storage server’s hostname and port as command line options.
Storage/Retrieval Operations
Both the server and client in your system will support a variety of messages that influence behavior. You may design your own protocol to implement these operations:
- `put fileName`
- `get fileName`
- `delete fileName`
- `search string` (note that this string could be blank to search for all files, implementing the `list` functionality. You only need to do basic substring matching.)

If a client tries to `put` a file that already exists, you can reject the operation. Instead, they should delete the existing file first. Or, if you’re feeling adventurous, you could add an `overwrite` operation that automatically does a `delete` followed by a `put`.
To ensure the system is trustworthy, each of these operations should be acknowledged as either successful or a failure. Users are generally willing to retry storage operations if/when they fail, so it’s better to be explicit about failures.
Handling Replication and Failures
When a client stores a file, the server should ensure that it has been replicated before acknowledging the storage operation as successful. To do this, the first server will contact the replica server and ask it to store the file as well. You must support at least 2 storage servers in this project (meaning every file is stored twice for redundancy), but you are welcome to support higher replication levels if you wish. The current best practice at companies like Google or Amazon is to store 3-5 replicas of every file.
If a replica server goes down, then your only option is to reject `put` operations. This may seem unintuitive – why not store the file and wait for the other server to come back up later? At that point, you could synchronize files between the two machines and continue operating normally. However, “synchronize the files” is something that seems relatively simple on the surface but quickly descends into a multitude of edge cases.
That said, it is completely reasonable to allow `get` operations when the system is in a degraded state (one storage server has gone down).
Detecting and Repairing File Corruption
When storing a file, your server should also store its checksum (you can use any hash algorithm you’d like). For example, if you have `my_file.txt` you may also store `my_file.txt.checksum` in a separate directory. When retrieving a file, you’ll read the file, checksum it, and verify that the checksum matches the original checksum stored on disk. If the file is corrupted, contact a backup server to repair the file.
Tips and Resources
- Log events as they happen in your system! Distributed systems are difficult to debug, so every log message helps.
- You can test your system on a local machine – simply have the storage servers run on different ports and use `localhost` as the hostname. However, it is crucial to test your system in a real-world environment (i.e., storage servers on separate machines and a separate client machine).
Testing and Grading
Since certain people (*cough* Matthew *cough*) are terribly slow and probably unable to implement a robust test suite before the semester ends, we’ll do testing and grading somewhat differently for this assignment.
You’ll set up your system and then perform the following tasks to demonstrate functionality. Point values are shown by each test.
- [2] Store several files.
- [2] Test a search string (should only match files that contain the string).
- [1] List files.
- [3] Retrieve files from both storage servers. Their `sha1sum` should match the original file that was stored.
  - It is crucial that you test the `sha1sum` of the file upon retrieval. Many file formats, such as JPEG or even text files, tend to be fairly resilient to corruption and appear to work even though their contents have changed slightly.
- [1] Kill one of the storage servers (picked randomly), and try to store a file. The operation should be rejected.
- [1] Retrieve a file from a working storage server. This should be allowed even in a degraded state.
- [3] Corrupt a file on one of the storage servers by modifying it, then attempt to retrieve the corrupted file. The corruption should be detected, the file repaired, and the correct data returned to the client.
For the remaining 2 points, you should include a README with detailed instructions on how to set up and use your system. Since this project is less prescriptive, make sure to discuss your design and the logic behind the decisions you made. Creativity is encouraged!
NOTE testing will be performed on our VMs. Make sure your project works on the test environment before turning the project in!
Test Dataset
Find the test dataset that will be used for grading here: p4-dataset.tar.gz. You can download it to your VM with `wget`. Here are the files' names, sizes, and sha1 checksums:
$ wget 'https://www.cs.usfca.edu/~mmalensek/cs521/assignments/p4-dataset.tar.gz'
$ tar xf p4-dataset.tar.gz
$ cd p4-dataset
$ for i in *.bin; do stat --printf '%s ' $i; sha1sum $i; done | column -t
3 ca919eca39b3ed092622b8ae0875ddd0d637254e test1.bin
12601476 34a308cf63ae2f20bd061733f3e0c1db6577332f test2.bin
57690244 28b047c55ed0b68df52cf931d973e76aade87545 test3.bin
127944836 4a6d2c9b72e511436b2cf8c075c0a395f4be8de9 test4.bin
719332421 c8166f20e8bdc7d79fb6c7ae36dc170d98abee85 test5.bin
1164986500 2be1550d2d44c578efc1297e17a9652633353a7f test6.bin
You should make sure that the files that are stored (with `put`) match these SHA-1 checksums when they are retrieved (with `get`). Even if the files look the same, you need to be sure by fingerprinting them with a hash function.
Changelog
- Initial project specification posted (5/5)
- Added test dataset info (5/12)