# CS595 - Introduction to Distributed Computing

**Lecturer**: [Boris Glavic](http://www.cs.iit.edu/~glavic/)
**Semester**: Fall 2021
# 1. Introduction

## Distributed Computing (What / Why)
## What is distributed computing?

* Executing programs on a **cluster** of **machines** requiring **communication** among the machines of the cluster
## Why do we need distributed computing?

* **Big data**
  * Size of data we want to process is larger than the storage capacity of a single machine
* **Expensive computation**
  * Computations that would take too long on a single machine
## Why? Examples

- **Data size**
  - Store the Facebook network graph with 100s of millions of users
- **Expensive computation**
  - Compute the PageRank of every webpage on the internet
## What is horizontal scaling?

- **Scaling**: how is the runtime of a computational task affected by the number of nodes we are running on?
- Ideally:
  - *linear scaling*: the task runs $n$ times faster on $n$ nodes than on $1$ node (see the sketch below)
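To make this concrete: with $T_1$ the runtime on one node and $T_n$ the runtime on $n$ nodes, the speedup is $S(n) = T_1 / T_n$ and the parallel efficiency is $S(n) / n$; linear scaling means an efficiency of $1$. A minimal sketch (the function names and the runtime measurements below are illustrative, not from the lecture):

```python
def speedup(t1, tn):
    """How many times faster the n-node run is than the 1-node run."""
    return t1 / tn

def efficiency(t1, tn, n):
    """Speedup divided by node count; 1.0 means perfect linear scaling."""
    return speedup(t1, tn) / n

# Hypothetical measurements: 100s on 1 node, 12s on 10 nodes.
print(speedup(100.0, 12.0))         # ~8.33x speedup (sub-linear)
print(efficiency(100.0, 12.0, 10))  # ~0.83, i.e., 83% of linear scaling
```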
# 1. Introduction

## Challenges of Distributed Computing
## Challenges of distributed computing

* **Load-balancing**: *How to evenly distribute data and work to the machines in the cluster?*
* **Fault tolerance**: *How to deal with failures?*
* **Consistency**: *How to synchronize information across nodes?*
* **Scalable algorithms**: *How to design an algorithm that scales?*
## Load balancing

- Consider a computation distributed across $n$ nodes
- How long does the computation take to finish?
  - We have to wait for all nodes to finish!
- Want to distribute work evenly to equalize the task execution times across nodes
  - this is called **load balancing** (see the sketch below)
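Since the job finishes only when the slowest node does, one common strategy is a greedy longest-processing-time assignment. A minimal sketch (the task costs and the specific heuristic are illustrative, not from the lecture):

```python
import heapq

def greedy_balance(task_costs, n_nodes):
    """Assign each task (largest first) to the currently least-loaded node."""
    loads = [0.0] * n_nodes
    heap = [(0.0, i) for i in range(n_nodes)]  # (current load, node id)
    heapq.heapify(heap)
    for cost in sorted(task_costs, reverse=True):
        load, node = heapq.heappop(heap)       # least-loaded node so far
        loads[node] = load + cost
        heapq.heappush(heap, (loads[node], node))
    return loads

loads = greedy_balance([5, 3, 8, 2, 7, 4, 6, 1], n_nodes=3)
print(loads, "job runtime (makespan):", max(loads))  # job ends with the slowest node
```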
## The problem with *stragglers*

- Random events (OS noise, intermittent failures) cause nodes to slow down
- Events that significantly affect performance are rare
- In large clusters the chance that some node will be significantly slower than the others is high!
## Straggler Example

- $1000$ node cluster
- each node has a $1\%$ chance of taking twice as long as expected
- There is a $1 - (0.99^{1000}) \simeq 99.99\%$ chance that at least one node will be slow
- Runtime is pretty much guaranteed to be double what one may expect!
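The calculation from the slide, spelled out:

```python
# Probability that at least one of n independent nodes is a straggler,
# with the slide's numbers: p = 0.01 per node, n = 1000 nodes.
p, n = 0.01, 1000
print(1 - (1 - p) ** n)  # ~0.99996, i.e., a ~99.99% chance of at least one slow node
```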
## Moving Data vs. Compute

- To run a task $t$ over a piece of data $d$ we have two options:
  - Move $d$ to the node on which we want to execute $t$ (**data movement**)
  - Execute $t$ on a node that currently stores $d$ (**bringing compute to data**)
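Data movement is often the dominant cost, which is why big data systems prefer bringing compute to data. A back-of-the-envelope calculation (all numbers hypothetical):

```python
# Shipping 1 TB over a 1 Gbit/s (~125 MB/s) link before computing remotely:
data_bytes = 1e12
bandwidth = 125e6  # bytes per second
print(data_bytes / bandwidth / 3600, "hours")  # ~2.2 hours spent just moving the data
```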
## Fault tolerance

- Types of failures in distributed systems:
  - **node failures**
  - **network failures**
## Node failures

- **intermittent**
  - *e.g., restart*
  - data on stable media is not lost, but the state of the computation is
- **permanent**
  - *e.g., hard drive fails permanently*
  - all data and computational state on the node is lost
## Network failures

- network partition
  - *e.g., a switch fails*
  - some nodes cannot communicate with each other
  - => no synchronization across the partition is possible
## Failures are the norm!

- Same problem as with stragglers
  - each node has a low chance of failure
  - the chance that some node in the cluster will fail during a computation is high
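For illustration, with hypothetical numbers: if each node independently fails on a given day with probability $p = 0.001$, a $1000$-node cluster experiences at least one failure that day with probability $1 - 0.999^{1000} \simeq 63\%$.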
## Fault tolerance

- Build systems that can tolerate / recover from failures
- **Fault tolerance techniques**
  - **Failure detection**
    - *e.g., heartbeat mechanism* (a sketch follows below)
  - **Redundancy**
    - *e.g., replication: each piece of data is stored on more than one node*
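A minimal sketch of a heartbeat-based failure detector, assuming each node periodically reports in (class and method names are our own, not from any particular system):

```python
import time

class HeartbeatDetector:
    """Nodes periodically call record(); any node not heard from within
    `timeout` seconds is *suspected* to have failed (it may just be slow)."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.last_seen = {}  # node id -> timestamp of last heartbeat

    def record(self, node):
        self.last_seen[node] = time.monotonic()

    def suspected(self):
        now = time.monotonic()
        return [n for n, t in self.last_seen.items() if now - t > self.timeout]

detector = HeartbeatDetector(timeout=5.0)
detector.record("node-1")
# ... nodes keep calling record(); nodes silent for >5s show up in ...
print(detector.suspected())
```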
## Consistency

- Computations may require some **global state**
  - *e.g., how many items are in stock for an online retailer*
- Nodes need to agree on the global state (**consistency**)
- **Consensus protocols** achieve this even under failures
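Full consensus protocols (e.g., Paxos or Raft) are beyond this introduction, but a core building block is the *majority quorum*: any two majorities of $n$ nodes overlap in at least one node, so two conflicting decisions can never both gather a majority. A minimal sketch:

```python
def is_quorum(acks, n):
    """True iff a strict majority of the n nodes acknowledged.
    Any two majorities intersect, so conflicting values cannot both win."""
    return acks > n // 2

# With n = 5 replicas, 3 acknowledgements form a quorum; 2 do not.
print(is_quorum(3, 5), is_quorum(2, 5))  # True False
```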
## CAP

- **Consistency**: all nodes agree on global state
- **Availability**: the system is available at all times
- **Partition tolerance**: the system can handle network partitions
## CAP Theorem

- The CAP theorem states:
  - No system can guarantee all three properties (C, A, and P) at the same time
  - It is possible to build systems that guarantee any combination of two of these properties, e.g., CA
## CAP design space

- **CA**: *consistent* and *available*
  - the system is always available and consistent, but cannot tolerate network partitions
- **CP**: *consistent* and *partition-tolerant*
  - the system is always consistent and can tolerate network partitions, but may be unavailable when network partitions occur
- **AP**: *available* and *partition-tolerant*
  - the system is always available and can tolerate network partitions, but may be inconsistent when network partitions occur
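A toy sketch of the CP-vs-AP choice during a partition (the `Replica` class and its behavior are made up for illustration, not a real replication protocol):

```python
class Replica:
    def __init__(self, mode):
        assert mode in ("CP", "AP")
        self.mode = mode
        self.value = None
        self.partitioned = False  # True if cut off from the other replicas

    def write(self, value):
        if self.partitioned and self.mode == "CP":
            # CP: refuse the write rather than risk divergence -> gives up availability
            raise RuntimeError("unavailable during partition")
        # AP: accept the write; replicas may now diverge -> gives up consistency
        self.value = value

r = Replica("AP")
r.partitioned = True
r.write(42)  # succeeds locally; other replicas won't see it until the partition heals
```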
## Scalable algorithm design

- **Easy example**
  - take a large array of integers and add a constant to each element (a sketch follows below)
- **Harder example**
  - compute the transitive closure of a graph
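The easy example is *embarrassingly parallel*: the input can be split into chunks that are processed with no communication at all. A minimal single-machine sketch using Python's `multiprocessing` (the chunking scheme and worker count are arbitrary choices for illustration):

```python
from multiprocessing import Pool

def add_constant(chunk):
    # Each chunk is processed independently -- no communication between workers.
    return [x + 10 for x in chunk]

if __name__ == "__main__":
    data = list(range(1_000_000))
    chunks = [data[i::4] for i in range(4)]  # split the array across 4 workers
    with Pool(4) as pool:
        results = pool.map(add_constant, chunks)
```

Transitive closure, in contrast, is iterative: each round derives new edges from the previous round's result, so nodes must exchange data between rounds, which makes the algorithm much harder to scale.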
# 1. Introduction

## How to build distributed systems
## Building distributed systems is hard

- **Many pitfalls**
  - Maintaining distributed state (consistency)
  - Fault tolerance
  - Load balancing
- **Requires a lot of background**
  - OS and networking
  - Parallel algorithm design & programming
  - Distributed systems theory (consensus protocols)
## Distributed systems are hard to debug

- Non-determinism because of parallelism
  - *e.g., race conditions*
- Failures
- Large amount of state
- Access to machines over a network
# 1. Introduction

## Why are big data systems successful?
## Big Computations and Data

- **Data exceeds single node capacity**
  - Social network graphs (billions of nodes)
  - Google's index of the web (petabytes of data)
  - Scientific applications (particle accelerators produce GBs of data / sec)
- **Expensive computations**
  - Find influential nodes in a large social network
  - Analyze the click logs of all Amazon customers
## Transparency

- **Hide the complexities of distributed computation from the user**
- **Examples**
  - **Distributed file systems**
    - Access the storage of a cluster of nodes as a single logical file system
  - **Dataflow engines**
    - The user expresses a computation as a simple program (see the sketch below)
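For example, in a dataflow engine such as Apache Spark, a distributed word count is just a few lines: the engine transparently handles partitioning, scheduling, and recovery from failures. A sketch (the file paths are hypothetical):

```python
from pyspark import SparkContext

sc = SparkContext("local[4]", "wordcount")
counts = (sc.textFile("hdfs:///data/docs/*.txt")   # hypothetical input path
            .flatMap(lambda line: line.split())    # one record per word
            .map(lambda word: (word, 1))
            .reduceByKey(lambda a, b: a + b))      # sum the counts per word
counts.saveAsTextFile("hdfs:///data/wordcounts")   # hypothetical output path
```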
# 1. Introduction

## Outlook

- **2. Distributed Storage & NoSQL Databases**
- **3. Distributed Batch Processing**
- **4. High-level Dataflow & Query Languages**
- **5. Distributed Transaction Processing, Consensus, and Consistency**
- **6. Distributed Stream Processing & Publish-Subscribe**