Project 4: Cloud Storage Service (v1.0)
Starter repository on GitHub: https://classroom.github.com/a/SgbJsTcw
In this project, you will build your own cloud storage system similar to Dropbox or Google Drive, with resilient, replicated backend servers and a command line client application. Specific features of the system include:
- Storage: you should be able to `put` any type of file (text, images, binaries, and so on). Given enough disk space (and time to transfer the data), you should be able to support arbitrarily large file sizes.
- Retrieval: beyond usual file retrievals with `get`, you should be able to `search` and `list` the files in the system.
- Scalability: concurrent storage and retrieval operations should be supported, as well as handling multiple clients.
- Replication: backend servers will ensure that all files are replicated for fault tolerance. If a backend server goes down, you should be able to contact a replica for the file.
- Resiliency: your cloud storage system should be resilient to disk or memory failures. That means that if a file is stored on a disk and gets corrupted, you should be able to detect the corruption, retrieve a replica, and repair the file.
You will need to use the Go standard library and socket programming to complete this project. However, external libraries may be allowed if they do not implement core functionality (ask first!).
NOTE: you are allowed to work in teams of 2 for this project if you’d like.
Components
Your cloud storage system will have two components:
- Storage Server: handles storage/retrieval/search operations. Will replicate files to another storage server instance.
  - You should have at least 2 storage servers. When a storage server starts up, you should provide it with a list of `hostname:port` pairs of the other storage server(s).
- Client: can send requests to any of the storage servers. You will likely want to supply the storage server’s hostname and port as command line options.
Storage/Retrieval Operations
Both the server and client in your system will support a variety of messages that influence behavior. You may design your own protocol to implement these operations:
- `put fileName`
- `get fileName`
- `delete fileName`
- `search string` (note that this string could be blank to search for all files, implementing the `list` functionality. You only need to do basic substring matching.)

If a client tries to `put` a file that already exists, you can reject the operation. Instead, they should delete the existing file first. Or, if you’re feeling adventurous, you could add an `overwrite` operation that automatically does a `delete` followed by a `put`.
To ensure the system is trustworthy, each of these operations should be acknowledged as either successful or a failure. Users are generally willing to retry storage operations if/when they fail, so it’s better to be explicit about failures.
Handling Replication and Failures
When a client stores a file, the server should ensure that it has been replicated before acknowledging the storage operation as successful. To do this, the first server will contact the replica server and ask it to store the file as well. You must support at least 2 storage servers in this project (meaning every file is stored twice for redundancy), but you are welcome to support higher replication levels if you wish. The current best practice at companies like Google or Amazon is to store 3-5 replicas of every file.
If a replica server goes down, then your only option is to reject `put` operations. This may seem unintuitive – why not store the file and wait for the other server to come back up later? At that point, you could synchronize files between the two machines and continue operating normally. However, “synchronize the files” is something that seems relatively simple on the surface but quickly descends into a multitude of edge cases.
That said, it is completely reasonable to allow `get` operations when the system is in a degraded state (one storage server has gone down).
Detecting and Repairing File Corruption
When storing a file, your server should also store its checksum (you can use any hash algorithm you’d like). For example, if you have `my_file.txt` you may also store `my_file.txt.checksum` in a separate directory. When retrieving a file, you’ll read the file, checksum it, and verify that the checksum matches the original checksum stored on disk. If the file is corrupted, contact a backup server to repair the file.
Tips and Resources
- Log events as they happen in your system! Distributed systems are difficult to debug, so every log message helps.
- You can test your system on a local machine – simply have the storage servers run on different ports and use `localhost` as the hostname. However, it is crucial to test your system in a real-world environment (i.e., storage servers on separate machines and a separate client machine).
Testing and Grading
Since certain people (*cough* Matthew *cough*) are terribly slow and probably unable to implement a robust test suite before the semester ends, we’ll do testing and grading somewhat differently for this assignment.
You’ll set up your system and then perform the following tasks to demonstrate functionality. Point values are shown by each test.
- [2] Store several files.
- [2] Test a search string (should only match files that contain the string).
- [1] List files.
- [3] Retrieve files from both storage servers. Their `sha1sum` should match the original file that was stored.
  - It is crucial that you test the `sha1sum` of the file upon retrieval. Many file formats, such as JPEG or even text files, tend to be fairly resilient to corruption and appear to work even though their contents have changed slightly.
- [1] Kill one of the storage servers (picked randomly), and try to store a file. The operation should be rejected.
- [1] Retrieve a file from a working storage server. This should be allowed even in a degraded state.
- [3] Corrupt a file on one of the storage servers by modifying it, then attempt to retrieve the corrupted file. The corruption should be detected, the file repaired, and the correct data returned to the client.
For the remaining 2 points, you should include a README with detailed instructions on how to set up and use your system. Since this project is less prescriptive, make sure to discuss your design and the logic behind the decisions you made. Creativity is encouraged!
NOTE testing will be performed on our VMs. Make sure your project works on the test environment before turning the project in!
Test Dataset
Find the test dataset that will be used for grading here: p4-dataset.tar.gz. You can download it to your VM with `wget`. Here are the files' names, sizes, and sha1 checksums:
$ wget 'https://www.cs.usfca.edu/~mmalensek/cs521/assignments/p4-dataset.tar.gz'
$ tar xf p4-dataset.tar.gz
$ cd p4-dataset
$ for i in *.bin; do stat --printf '%s ' $i; sha1sum $i; done | column -t
3 ca919eca39b3ed092622b8ae0875ddd0d637254e test1.bin
12601476 34a308cf63ae2f20bd061733f3e0c1db6577332f test2.bin
57690244 28b047c55ed0b68df52cf931d973e76aade87545 test3.bin
127944836 4a6d2c9b72e511436b2cf8c075c0a395f4be8de9 test4.bin
719332421 c8166f20e8bdc7d79fb6c7ae36dc170d98abee85 test5.bin
1164986500 2be1550d2d44c578efc1297e17a9652633353a7f test6.bin
You should make sure that the files that are stored (with `put`) match these SHA-1 checksums when they are retrieved (with `get`). Even if the files look the same, you need to be sure by fingerprinting them with a hash function.
Changelog
- Initial project specification posted (5/5)
- Added test dataset info (5/12)