# CS595 - Storage - Distributed File Systems **Lecturer**: [Boris Glavic](http://www.cs.iit.edu/~glavic/) **Semester**: fall 2021
# 2. Distributed Storage ## Distributed File Systems
## Distributed File System - Store files in a cluster of machines - Manage **meta-data** (file-system structure, permissions, ...) and **data** (file content) - **Operations:** - Create / delete files - Directory operations - Read file (sequentially / random access) - Write file (append / random access)
## Requirements - Support files larger than the storage of a single machine - **Fault Tolerance** - **Data loss**: Do not lose data when a node in the cluster fails - **Availability**: data should be accessible even under - network failures - node failures
## Requirements cont. - **Load balancing** - Distribute file system operations across the cluster - Balance operations across the cluster - **Transparency** - clients do not need to decide on data distribution - clients do not need to handle fault tolerance
## Fault tolerance - Data loss - If data is stored on only one node, then data loss cannot be prevented - => Each piece of data has to be stored on multiple nodes (**replication**) or at least some additional information has to be stored on other nodes to enable recovery of lost data (e.g., erasure coding)
## Replication - Each piece of data is replicated across $m$ nodes - How to choose number of replicas? - How to keep replicas in sync (consistency)? - How to detect missing replicas and compensate for that?
## How to choose the number of replicas? - Larger $m$ - More wasted storage - Lower chance of data loss - Smaller $m$ - Less wasted storage - Higher chance of data loss - Sample data point: - 2-3-way replication sufficient for 99.9% reliability
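As a rough back-of-the-envelope illustration (assuming, for this example only, that nodes fail independently with probability $p$ during some time window and that no re-replication happens within that window), data is lost only if all $m$ replicas fail:

$$P(\text{loss}) = p^m, \qquad \text{e.g., } p = 0.01:\; m = 1 \Rightarrow 10^{-2},\quad m = 2 \Rightarrow 10^{-4},\quad m = 3 \Rightarrow 10^{-6}$$

This is why moving from one to two or three replicas buys a lot of reliability, while every additional replica beyond that mostly adds storage cost.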
## Characteristics of Replication - **Read performance** - If all replicas are kept in sync, then we can read from multiple replicas in parallel - $m$-way replication improves read throughput by up to a factor of $m$
## Characteristics of Replication cont. - **Write performance** - Have to write to all replicas - In addition to the synchronization overhead (consistency), this causes $m$-times the write load - **Storage requirements** - increased by a factor of $m$
## Fault tolerance - Network issues - Replication can help too - Need to be aware of network infrastructure - Do not place all replicas on nodes that are connected to the same switch
## Data Placement - How to balance data distribution across a cluster? - At a file level? - High computational complexity (NP-hard) - Split files into blocks - Distribute individual blocks - What is a good block size?
## Data Placement cont. - How to determine which block goes where? - e.g., a hash function - if the number of blocks is large enough, this almost guarantees an even distribution for a good choice of hash function
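A minimal sketch of this idea in Java (a toy modulo-hash placement, not the placement policy of any particular system; the node count and block ids are made up):

```java
// Toy hash-based placement: map each block id to a node by hashing it.
// With many blocks and a well-behaved hash function the blocks spread
// roughly evenly over the nodes.
public class HashPlacement {
    static int pickNode(String blockId, int numNodes) {
        // floorMod avoids negative indices for negative hash codes
        return Math.floorMod(blockId.hashCode(), numNodes);
    }

    public static void main(String[] args) {
        int numNodes = 5;
        int[] counts = new int[numNodes];
        for (int b = 0; b < 100_000; b++) {
            counts[pickNode("block-" + b, numNodes)]++;
        }
        for (int n = 0; n < numNodes; n++) {
            System.out.println("node " + n + ": " + counts[n] + " blocks");
        }
    }
}
```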
## Meta-data Management - **Dedicated nodes for meta-data management** - A subset of $m < n$ nodes in the cluster handles metadata management - **Truly distributed meta-data management** - All nodes participate in meta-data management
## Consistency - Allow fully parallel access to files for random reads and writes - Readers and writers need to synchronize for all operations - or a weaker consistency model, e.g., eventual consistency, has to be applied - Essentially the same problems as in transaction processing - => Flexible, but requires complicated and expensive strong consistency
## Consistency cont. - Limit file operations and/or restrict concurrent access - e.g., append-only or no modification after creation - only one writer per file at a time
# 2. Distributed Storage ## Hadoop Distributed Filesystem (HDFS)
## HDFS - Open-source distributed file system - Modeled after the Google File System (GFS) - Written in Java - Optimized for storage of large files
## File System Structure - An HDFS file system is made of **inodes** (directories or files) which have associated **metadata** (e.g., permissions) - Files consist of one or more **blocks** - The block size is much larger than on single node file systems (e.g., 128MB) - Some blocks may be smaller than the block size - *200MB file: 128MB + 72MB block* - *4KB file: 4KB block*
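A small sketch of the block arithmetic from the slide (the 128MB block size is just the common default mentioned above):

```java
// Split a file size into block sizes; the last block may be smaller than the
// block size (e.g., a 200MB file becomes a 128MB block and a 72MB block).
public class BlockSplit {
    static final long MB = 1024L * 1024;
    static final long BLOCK_SIZE = 128 * MB;

    static long[] blockSizes(long fileSize) {
        int numBlocks = (int) ((fileSize + BLOCK_SIZE - 1) / BLOCK_SIZE); // ceiling division
        long[] sizes = new long[numBlocks];
        for (int i = 0; i < numBlocks; i++) {
            sizes[i] = Math.min(BLOCK_SIZE, fileSize - (long) i * BLOCK_SIZE);
        }
        return sizes;
    }

    public static void main(String[] args) {
        for (long s : blockSizes(200 * MB)) {
            System.out.println(s / MB + " MB"); // prints 128 MB, then 72 MB
        }
    }
}
```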
## Architecture - **Name node** - node in the cluster responsible for storing filesystem meta-data - directory structure - inode metadata (permissions, ...) - which blocks belong to which files - handles client requests for FS metadata - single name node per cluster (possibly with hot or cold stand-bys)
## Architecture - **Data node** - stores file content (blocks and block metadata) - clients communicate directly with data nodes for reading/writing - All nodes in the cluster except for name nodes (and potentially some other exceptions discussed later) are data nodes
*Figure: the name node stores the file-to-block mapping (File1: B1, B2; File2: B3); the three data nodes store the block replicas, with each of Block 1, Block 2, and Block 3 replicated on two data nodes.*
## Fault tolerance - data loss - Each block is replicated to multiple data nodes (2-3 is typical) - Name node tracks which data nodes store which blocks - If a replica of block `b` is lost (e.g., node failure) then the name node instructs a data node storing `b` to send the block to another data node to restore the desired number of replicas
## Example - Restoring Replication Count - For this example assume that data node 3 has failed and we are using 2-way replication - blocks 2 and 3 each have to be replicated once more so that there are again 2 replicas of every block
*Figure: Data Node 3, which held replicas of Block 2 and Block 3, has failed; Data Node 1 transfers its copy of Block 2 to Data Node 2 and Data Node 2 transfers its copy of Block 3 to Data Node 1, restoring 2-way replication.*
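A simplified sketch of the bookkeeping behind this example (illustrative only, not the actual name node logic; the block-to-node map and node names are assumptions of the sketch):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// After a node failure, find under-replicated blocks and ask a surviving
// replica holder to copy each affected block to another data node.
public class ReReplication {
    static void restoreReplication(Map<String, List<String>> blockToNodes,
                                   String failedNode, int targetReplicas) {
        for (Map.Entry<String, List<String>> e : blockToNodes.entrySet()) {
            // copy so that we do not mutate the caller's replica lists
            List<String> survivors = new ArrayList<>(e.getValue());
            survivors.remove(failedNode); // replicas on the failed node are lost
            if (!survivors.isEmpty() && survivors.size() < targetReplicas) {
                // in a real system the target node would be chosen by the
                // placement policy; here we only print the decision
                System.out.println("ask " + survivors.get(0) + " to copy block "
                        + e.getKey() + " to a data node that does not hold it yet");
            }
        }
    }
}
```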
## Fault tolerance - node failure - data nodes send **heartbeat** messages to the name node every 3 seconds - piggyback storage utilization and load stats - If the name node has not received a heartbeat from a data node for a certain amount of time, it schedules the creation of new replicas of the blocks stored on this node
*Figure: Data Nodes 1, 2, and 3 each send heartbeat messages to the name node.*
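A simplified sketch of heartbeat-based failure detection (illustrative only, not the actual name node code; the 10-minute timeout is an assumption for the example, not necessarily HDFS's configured value):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Mark data nodes as dead if no heartbeat arrived within the timeout; the
// name node would then re-replicate the blocks stored on those nodes.
public class HeartbeatMonitor {
    private static final long TIMEOUT_MS = 10 * 60 * 1000; // assumed timeout for illustration
    private final Map<String, Long> lastHeartbeat = new ConcurrentHashMap<>();

    // called whenever a heartbeat message from a data node arrives
    public void onHeartbeat(String dataNodeId) {
        lastHeartbeat.put(dataNodeId, System.currentTimeMillis());
    }

    // called periodically on the name node
    public void checkLiveness() {
        long now = System.currentTimeMillis();
        for (Map.Entry<String, Long> e : lastHeartbeat.entrySet()) {
            if (now - e.getValue() > TIMEOUT_MS) {
                System.out.println("data node " + e.getKey() + " presumed dead,"
                        + " schedule re-replication of its blocks");
            }
        }
    }
}
```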
## Fault tolerance - network failures - HDFS's block replica placement strategy is network topology aware - Replicas of blocks are stored on separate racks to avoid loss of availability when a switch connecting the nodes in a rack to the cluster is down
*Figure: an interconnect links the switches of Rack 1 (Nodes 1 and 2) and Rack 2 (Nodes 3 and 4); each node reaches the rest of the cluster only through its rack's switch.*
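A simplified sketch of topology-aware placement (only the core idea, not HDFS's exact default policy; the node-to-rack map is assumed to come from cluster configuration):

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Choose replica targets so that the replicas span more than one rack,
// falling back to additional nodes if there are fewer racks than replicas.
public class RackAwarePlacement {
    static List<String> chooseReplicaNodes(Map<String, String> nodeToRack, int replicas) {
        List<String> chosen = new ArrayList<>();
        Set<String> usedRacks = new HashSet<>();
        // first pass: prefer nodes on racks that do not hold a replica yet
        for (Map.Entry<String, String> e : nodeToRack.entrySet()) {
            if (chosen.size() == replicas) break;
            if (usedRacks.add(e.getValue())) chosen.add(e.getKey());
        }
        // second pass: fill up with remaining nodes if needed
        for (String node : nodeToRack.keySet()) {
            if (chosen.size() == replicas) break;
            if (!chosen.contains(node)) chosen.add(node);
        }
        return chosen;
    }
}
```

For reference, HDFS's default policy is roughly: the first replica goes to the writing client's node (if it is a data node), the second to a node in a different rack, and the third to a different node in that second rack.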
## Fault tolerance - data corruption - when a client writes or reads a block, it computes a **checksum** - data nodes store the checksums for blocks and send them to clients reading the block - if the checksum computed over the data that was read differs from the stored checksum, then this replica is considered corrupt and dropped
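A minimal sketch of checksum-based corruption detection using CRC32 (HDFS checksums small chunks within a block; this simplified version checksums a whole block at once):

```java
import java.util.zip.CRC32;

// Compute a checksum when a block is written; on read, recompute it and
// compare against the stored value to detect silent data corruption.
public class ChecksumCheck {
    static long checksum(byte[] blockData) {
        CRC32 crc = new CRC32();
        crc.update(blockData);
        return crc.getValue();
    }

    // a mismatch means this replica is corrupt; the reader would discard it
    // and read the block from another replica instead
    static boolean isCorrupt(byte[] blockData, long storedChecksum) {
        return checksum(blockData) != storedChecksum;
    }
}
```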
## Reading - To read a file, the client sends a request to the name node - the name node sends the client - the list of blocks for the file - the location of all replicas for each block - the client then contacts data nodes directly to read blocks
*Figure: the client requests File1's metadata from the name node, receives the list of blocks (B1, B2) and the replica locations, and then reads Block 1 and Block 2 directly from the data nodes that hold them.*
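From an application's point of view this lookup is hidden behind the HDFS client library; a minimal read sketch (the name node address, port, and file path are placeholders):

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Open a file in HDFS and read it sequentially; the client library asks the
// name node for the block locations and fetches the blocks from data nodes.
public class HdfsRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        try (FileSystem fs = FileSystem.get(new URI("hdfs://namenode:8020"), conf);
             FSDataInputStream in = fs.open(new Path("/data/file1"))) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) > 0) {
                // process n bytes of file content here
            }
        }
    }
}
```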
## Writing & Consistency - Writing is append only in HDFS - At any time, only one client can write a file - The name node maintains locks for each file and declines write requests by clients that want to write a file that is currently locked
## Writing a Block - A client holding a lock for a file contacts the name node for each block it is writing to get the list of data nodes that will store replicas of the block - A pipeline is established between the client and the data nodes that will store replicas - Data is sent in smaller chunks - from the client to the first data node - from the first data node to the second data node - ...
*Figure: write pipeline sequence diagram - the client first sets up the pipeline with Data Node 1 and Data Node 2, then streams the data chunk by chunk (client to Data Node 1, Data Node 1 to Data Node 2), and finally closes the pipeline.*
## Concurrent Writing and Reading - blocks written to a file are hidden from readers until either - the client closes the file - the client issues a flush operation - in either case the name node updates the file metadata to include the new blocks
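A minimal write sketch using the HDFS client API, which hides the pipeline setup and chunked streaming described above (the name node address, port, and path are placeholders); `hflush()` corresponds to the flush operation that makes already written data visible to readers:

```java
import java.net.URI;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Create a file and write data to it; the client library obtains target data
// nodes from the name node and streams the data through the replica pipeline.
public class HdfsWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        try (FileSystem fs = FileSystem.get(new URI("hdfs://namenode:8020"), conf);
             FSDataOutputStream out = fs.create(new Path("/data/logfile"))) {
            out.write("first record\n".getBytes(StandardCharsets.UTF_8));
            out.hflush(); // make the data written so far visible to readers
            out.write("second record\n".getBytes(StandardCharsets.UTF_8));
        } // closing the file publishes the remaining data and releases the write lock
    }
}
```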
## Metadata operations - the name node keeps the file system metadata in memory - periodically a snapshot is written to disk - changes to metadata are ... - written to a journal (write-ahead log, WAL) and applied to the in-memory copy of the metadata - the on-disk snapshot is not updated directly - taking a snapshot - apply the journal to the previous snapshot and write the result to disk
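A toy sketch of the journal-plus-snapshot idea (illustrative only, not the actual name node implementation; the file names are placeholders):

```java
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Every metadata change is appended to a write-ahead log before the in-memory
// state is updated; a checkpoint writes the current state to disk so that the
// journal can be truncated. Recovery = load the snapshot, then replay the journal.
public class TinyJournal {
    private final List<String> inMemoryMetadata = new ArrayList<>();
    private PrintWriter journal;

    TinyJournal() throws IOException {
        this.journal = new PrintWriter(new FileWriter("journal.log", true), true); // auto-flush
    }

    // apply a metadata change: make it durable in the journal, then update memory
    void apply(String change) {
        journal.println(change);
        inMemoryMetadata.add(change);
    }

    // checkpoint: write the in-memory state to disk and start a fresh journal
    void checkpoint() throws IOException {
        Files.write(Paths.get("snapshot.txt"), inMemoryMetadata);
        journal.close();
        journal = new PrintWriter(new FileWriter("journal.log", false), true); // truncate
    }
}
```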
## Name Node Stand-by - the WAL makes it possible to keep a stand-by name node - to recover from name node failures - The name node sends journal entries to the stand-by, which applies the journal to its in-memory snapshot - The stand-by can also be used to reduce load on the name node by offloading the creation of snapshots
## HDFS Summary - **fault tolerance**: replication - detection of failures through heartbeats - replica placement is topology-aware - **metadata management**: - single node (name node) - **consistency**: - prevent concurrent writes - changes to a file are atomic (from a reader's perspective)
## Advantages & Disadvantages - **Advantages** - well suited for batch processing - decent read performance - fault tolerant
## Advantages & Disadvantages - **Disadvantages** - Name node is a bottleneck for metadata operations - Inflexible write operations - relatively poor write performance - No concurrent writes - No semantic data placement (more on this later)