Research Papers

Storage Systems

Petabyte-Scale Row-Level Operations in Data Lakehouses
Amazon Aurora: Design Considerations for High Throughput Cloud-Native Relational Databases
The Hadoop Distributed File System
Megastore: Providing Scalable, Highly Available Storage for Interactive Services
PolarFS: An Ultra-low Latency and Failure Resilient Distributed File System for Shared Storage Cloud Database
PolarDB-MP: A Multi-Primary Cloud-Native Database via Disaggregated Shared Memory
Fast key-value stores: An idea whose time has come and gone
Chord: A Scalable Peer-to-peer Lookup Service for Internet Applications
Dynamo: Amazon’s Highly Available Key-value Store
Spanner: Google’s Globally-Distributed Database
Bigtable: A Distributed Storage System for Structured Data
IPFS - Content Addressed, Versioned, P2P File System
Chardonnay: Fast and General Datacenter Transactions for On-Disk Databases

Computational Frameworks

Exoshuffle: An Extensible Shuffle Architecture
MapReduce: Simplified Data Processing on Large Clusters
Dremel: Interactive Analysis of Web-Scale Datasets
Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing
Spark: Cluster Computing with Working Sets
Spark SQL: Relational Data Processing in Spark
The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing
Twister: A Runtime for Iterative MapReduce
WarpFlow: Exploring Petabytes of Space-Time Data
Big Data normalization for massively parallel processing databases

Cluster Management

ZooKeeper: Wait-free coordination for Internet-scale systems
The Chubby lock service for loosely-coupled distributed systems
Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center
Large-scale cluster management at Google with Borg

Streaming Data and Data Representation

Storm @Twitter
Apache Flink: Stream and Batch Processing in a Single Engine
Scaling Big Data Mining Infrastructure: The Twitter Experience
Thrift: Scalable Cross-Language Services Implementation

Algorithms

Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs
Algorithmic Nuggets in Content Delivery
Cuckoo Filter: Practically Better Than Bloom
Less Hashing, Same Performance: Building a Better Bloom Filter
Automatically Generating Interesting Facts from Wikipedia Tables
The PageRank Citation Ranking: Bringing Order to the Web
Random Sampling with a Reservoir

Machine Learning

Efficient Memory Management for Large Language Model Serving with PagedAttention
Shade: Enable Fundamental Cacheability for Distributed Deep Learning Training
SageDB: A Learned Database System
TensorFlow: A system for large-scale machine learning
Highly accurate protein structure prediction with AlphaFold