Distributed Systems and Storage¶
Types of Non-Local Storage¶
SANs - Storage Area Networks
NAS - Network Attached Storage
NFS/CIFS - Network File System / Common Internet File System
Cloud Storage / Sync Services
Distributed File Systems
Parallel File Systems
How Do We Evaluate Storage Systems?¶
Latency - How long does a single operation on the smallest fundamental unit take to complete?
Throughput - How many bytes of data per second can we read or write?
Parallel scaling - When several requests are issued at the same, or nearly the same, time, how are latency and throughput affected?
Resilience to failure of storage components - How many, and which types of, storage hardware failures can be tolerated without system availability being interrupted, and how is performance affected?
Resilience to failure of network components - If the network is partitioned or fails, does the storage system remain consistent, and how are partially complete operations handled?
Semantics - What capabilities are offered to clients? Can existing files be re-written? Are there folders? Is random access possible? How are files locked and shared? Are there any transactional semantics?
Location Transparency - Can clients use the same names and operations regardless of where the data is physically stored?
Storage Devices¶
Types of permanent devices:
Magnetic - hard disk, tape, floppy disk
Optical - CD/DVD/Blu-Ray, Laser Disc, Paper, Punch Cards, Photo Film
Solid State - CMOS, NAND-based flash, battery-backed dynamic memory
Types of transient devices:
RAM, Processor Caches
Storage and Failure¶
An excellent paper about hard disk failure rates at Google is available here:
https://www.usenix.org/legacy/event/fast07/tech/full_papers/pinheiro/pinheiro.pdf
At Google, the annualized failure rates (AFR) of disks of a given age are approximately:
3 months - 2.5%
6 months - 2%
1 year - 2%
2 years - 8%
3 years - 8.5%
Maximizing Availability - RAID¶
To prevent loss of availability of data, RAID (Redundant Array of Inexpensive Disks) stores data redundantly across multiple disks.
Common RAID levels are:
RAID 0 - stripes data across disks. Increases capacity and throughput, but provides no redundancy: losing any disk loses data. 2 or more disks are needed.
RAID 1 - creates an exact copy (mirror) of the data on two or more disks.
RAID 5/6 - stripes data across disks, dedicating the equivalent of one disk (RAID 5) or two disks (RAID 6) to parity. Any one (RAID 5) or any two (RAID 6) of the N disks can fail, and the data of the lost disk(s) can be recovered from the survivors (see the parity sketch below). RAID 5 needs 3 or more disks; RAID 6 needs 4 or more.
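Why parity recovery works: the parity block is the bytewise XOR of the data blocks in a stripe, so XORing the survivors with the parity reproduces any single lost block. A minimal sketch in Python (helper names are illustrative; all blocks are assumed equal-sized):

```python
from functools import reduce

def parity_block(blocks):
    """XOR all blocks together, byte by byte, to produce the parity block."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def recover_block(surviving_blocks, parity):
    """XOR the parity with the surviving data blocks to rebuild the lost one."""
    return parity_block(surviving_blocks + [parity])

data = [b"AAAA", b"BBBB", b"CCCC"]        # one stripe across three data disks
p = parity_block(data)                    # stored on the parity disk
assert recover_block([data[0], data[2]], p) == data[1]   # disk 1 lost, rebuilt
```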
RAID¶
RAID has three common implementation approaches:
Complete hardware implementation - a disk controller or expansion card implements RAID. Several disks are connected to this controller, and the array is presented to the operating system as a single storage device. These often carry reliability guarantees.
Partial hardware implementation - same as the complete hardware implementation, except parity calculations and buffering are delegated to the host CPU and memory. These often lack reliability guarantees.
Software implementation - The operating system itself manages several disks and presents to the file-system layer a single storage device.
RAID 0¶
RAID 1¶
RAID 5¶
Local Storage¶
At the base of almost any distributed storage system are the mechanisms of local storage.
The problems presented by local storage are simpler than, and are composed into, their distributed counterparts.
Implementing Files and Folders¶
How files and folders are implemented in a storage medium can greatly depend upon the physical characteristics and capabilities of that medium.
For example, on tape-drives, CD/DVD/Blu-Ray, or write-once media, files and folders are stored contiguously with no fragmentation. All of the information about the filesystem can be held in a TOC (Table Of Contents).
For filesystems with files that have a finite lifetime, such as on flash media, hard disks, SSDs, and others, the layout of files and folders must be maintained in a more complex way.
Among these more advanced methods are linked lists and i-nodes.
To manage free space, structures such as bitmaps and linked lists are common choices.
Inodes¶
Inodes are the fundamental structures of a UNIX filesystem.
Inodes have the following attributes:
File Ownership - user, group
File Mode - rwx bits for each of user, group, and others
Last access and modified timestamps
File size in bytes
Device id
Pointers to blocks on the storage device for the file or folder’s contents
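As a rough sketch of these attributes (field names are illustrative; a real on-disk inode such as ext2's struct ext2_inode uses fixed-width integers and carries more fields):

```python
from dataclasses import dataclass, field

@dataclass
class Inode:
    uid: int                # file ownership: user
    gid: int                # file ownership: group
    mode: int               # file type + rwx bits, e.g. 0o100644 for a regular file
    atime: int              # last access timestamp
    mtime: int              # last modification timestamp
    size: int               # file size in bytes
    dev: int                # device id
    block_ptrs: list = field(default_factory=list)  # pointers to data blocks
```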
Inodes - Indirect Blocks¶
Using indirect, double-indirect, and even triple-indirect blocks is a very successful implementation strategy.
This approach is used by ext2 / ext3 / ext4 in Linux.
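To see how far this scales, assume the classic ext2 arrangement of 12 direct pointers plus one single, one double, and one triple indirect pointer, with 4 KB blocks and 4-byte block pointers (so 1024 pointers fit in one block); ignoring other limits, the maximum file size works out to roughly 4 TiB:

```python
BLOCK = 4096                     # block size in bytes
PTRS = BLOCK // 4                # 4-byte pointers -> 1024 per indirect block

max_file = (12 * BLOCK           # direct blocks
            + PTRS * BLOCK       # single indirect
            + PTRS ** 2 * BLOCK  # double indirect
            + PTRS ** 3 * BLOCK) # triple indirect
print(max_file / 2 ** 40)        # ~4.004 TiB
```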
Block Caches¶
To improve the performance of a filesystem, and to make disk scheduling algorithms more realizable, most operating systems implement some kind of block cache.
The block cache allows for read-ahead and write-behind. It also allows for lower latency I/O operations.
With a block cache, the write() system call for instance only needs to complete modifications to the cache before returning. The operating system can complete the operation on disk in a background thread.
Without this cache, the system call would not be able to return until the write had been committed to disk.
Important parameters of any block cache are:
The size of the cache in physical memory
The delay before committing 'dirty' items in the cache to disk
The larger the cache, the better the filesystem will likely perform, but this can come at the cost of available memory for programs.
The larger the delay before writing items to the disk, the better the disk allocation and scheduling decisions the operating system can make.
The shorter the delay before writing to disk, the greater the guarantee in the presence of failure that modifications will be persisted to disk.
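A toy write-behind block cache illustrating these parameters; the `device` object with `read(block_no)` / `write(block_no, data)` methods is hypothetical:

```python
import threading, time

class BlockCache:
    """Minimal write-behind cache: writes complete in memory, and a
    background thread flushes dirty blocks to the device after a delay."""
    def __init__(self, device, flush_delay=5.0):
        self.device, self.flush_delay = device, flush_delay
        self.cache, self.dirty = {}, set()
        self.lock = threading.Lock()
        threading.Thread(target=self._flusher, daemon=True).start()

    def write(self, block_no, data):
        with self.lock:                  # returns as soon as cache is updated
            self.cache[block_no] = data
            self.dirty.add(block_no)

    def read(self, block_no):
        with self.lock:
            if block_no in self.cache:   # cache hit avoids the device
                return self.cache[block_no]
        data = self.device.read(block_no)
        with self.lock:
            self.cache[block_no] = data
        return data

    def _flusher(self):
        while True:
            time.sleep(self.flush_delay)
            with self.lock:
                to_flush = [(b, self.cache[b]) for b in self.dirty]
                self.dirty.clear()
            for block_no, data in to_flush:   # commit in the background
                self.device.write(block_no, data)
```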
Folders and Path Traversal¶
In all but the most simple filesystems, there is a concept of a folder and a path.
In UNIX operating systems, folder entries are held within inodes that have the filetype in the mode set to type directory.
The contents of the inode then are a list of filenames and pointers to the inodes of those files and/or folders.
Resolving a path involves starting at the root folder and descending through each folder in turn, until reaching the file or finding a component of the path to be invalid.
Note that when traversing a path, the traversal may cross into different filesystems (at mount points).
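A sketch of the traversal loop, assuming hypothetical `inode.is_directory()` and `lookup(dir_inode, name)` helpers (mount-point crossing omitted):

```python
def resolve(path, root_inode, lookup):
    # walk each component of an absolute path, starting at the root
    inode = root_inode
    for name in path.strip("/").split("/"):
        if not name:
            continue                      # handles "//" and the root path itself
        if not inode.is_directory():
            raise NotADirectoryError(name)
        child = lookup(inode, name)       # scan the folder's entries for name
        if child is None:
            raise FileNotFoundError(name)
        inode = child
    return inode
```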
Virtual Filesystems / VFS¶
Aside from files and folders there are other things like named pipes, domain sockets, symbolic and hard links that need to be handled by the filesystem.
Rather than have the semantics of these implemented in each filesystem implementation, many OS architectures include a virtual filesystem or VFS.
The VFS stands between the OS kernel and the filesystem implementation.
Virtual Filesystems / VFS¶
The VFS can help adapt foreign filesystems (such as VFAT) by defining a contract that these implementations conform to.
The VFS can also help reduce code duplication between FS implementations by providing common structures and handling shared behavior:
Path traversal
Handling named pipes, domain sockets, etc…
Managing file handles and file locking
Structures and functions for the block cache.
Structures and functions for accessing storage devices
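The "contract" can be pictured as an abstract interface that each concrete filesystem fills in; the method names below are illustrative (Linux's real equivalents are C function-pointer tables such as struct inode_operations and struct file_operations):

```python
from abc import ABC, abstractmethod

class Filesystem(ABC):
    # each concrete filesystem (ext4, VFAT, NFS, ...) supplies these;
    # the VFS implements path traversal once, calling lookup() per component
    @abstractmethod
    def lookup(self, dir_inode, name): ...
    @abstractmethod
    def read(self, inode, offset, length): ...
    @abstractmethod
    def write(self, inode, offset, data): ...
```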
Virtual Filesystems and Stacking¶
In some VFS implementations it is possible to stack filesystems on top of each other.
A great example of this in Linux is UMSDOS: the base VFAT filesystem has no support for users, groups, security, or extended attributes. By creating special files on VFAT and then hiding them, UMSDOS adapts VFAT into a UNIX-like filesystem.
Another great example of this is UnionFS. It allows two filesystems to be transparently overlaid.
Distributed Filesystems¶
- Flat file service
implements operations on the contents of files
UFIDs (Unique File IDs) are used to refer to files
a new UFID is assigned when a file is created
- Directory service
provides mapping between text names and UFIDs
functions to create, update, … directories
- Client module
runs on client computer
provides APIs to access files
holds information about network location of file server and directory server
sometimes caching at client
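A sketch of how the service interfaces might divide the work (all names illustrative):

```python
class FlatFileService:
    """Operates only on file contents, addressed by UFID."""
    def create(self): ...                     # returns a newly assigned UFID
    def read(self, ufid, offset, length): ...
    def write(self, ufid, offset, data): ...

class DirectoryService:
    """Maps text names to UFIDs; directories are themselves files
    stored via the flat file service."""
    def lookup(self, dir_ufid, name): ...     # -> UFID
    def add_name(self, dir_ufid, name, ufid): ...
    def remove_name(self, dir_ufid, name): ...
```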
File Service Model¶
- Upload/download model
read/write file operations
entire file transferred to client
requires space on client
Products like SkyDrive and Dropbox work like this
- Remote Access Interface
large number of operations
seek, changing file attributes, read/write part of a file, …
does not require (as much) space on the client
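The two models differ in what crosses the network; a sketch with a hypothetical `client` stub:

```python
# Upload/download model: the whole file moves, work happens locally.
def edit_upload_download(client, name):
    data = client.download(name)            # entire file over the network
    data = data.replace(b"old", b"new")     # local processing
    client.upload(name, data)               # entire file back again

# Remote access model: each operation is shipped to the server.
def edit_remote_access(client, name):
    fh = client.open(name)
    chunk = client.read(fh, offset=0, length=4096)  # only the bytes needed
    client.write(fh, offset=0, data=chunk.replace(b"old", b"new"))
    client.close(fh)
```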
Directory Service¶
- Key issue for distributed file system
whether all clients have the same VIEW of the directory hierarchy
Naming Transparency¶
- Location Transparency
path names give no hint as to where the files are located
e.g., /server1/dir1/dir2/X indicates X located on server1 but NOT where server1 is located
Problems? If X needs to be moved to another server (e.g., due to space) - say, server2 - programs with the path string built in will no longer work!
- Location Independence
files can be moved without changing their names
- Three common approaches to file and directory naming
Machine + path naming, such as /machine/path or machine:path (location dependent)
Mounting remote file systems onto the local file hierarchy (location dependent)
A single name space that looks the same on all machines (location independent)
File Sharing Semantics¶
When files are shared (and one or more clients write), what are the semantics?
UNIX Semantics¶
A read always returns the result of the most recent write (the system enforces an absolute time ordering)
- UNIX semantics can be achieved by
read/write going to server
no caching of files
sequential processing by server
BUT in distributed systems, this may perform poorly!
- How to improve performance?
requires caching
- Modify the semantics?
“changes to an open file are initially visible only to the process that modified the file. When the file is closed, the changes become visible to others”
called Session Semantics
More Semantics¶
- Q: What is the result of multiple (simultaneous) updates of a cached file?
final result depends on who closed last!
one of the results wins, but which one cannot be specified (easier to implement)
- Immutable Files
can only create and read files
can replace existing file atomically
to modify a file, create new one and replace
what if two try to replace the same file?
what if one is reading while another tries to replace?
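The replace operation can be built on an atomic rename, which also answers the two questions above: of two racing replacers, the last rename wins; a concurrent reader keeps seeing the version it opened. A sketch using POSIX rename semantics via Python's os.replace:

```python
import os, tempfile

def replace_file(path, new_contents):
    # write the new version to a temporary file in the same directory...
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "wb") as f:
        f.write(new_contents)
    # ...then atomically swap it into place: a reader sees either the old
    # file or the new one, never a mixture
    os.replace(tmp, path)
```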
Distributed File System Implementation¶
- Need to understand file usage (so that)
implement common operations well
achieve efficiency
Satyanarayanan (CMU) studied file usage patterns on UNIX
System Structure¶
- How should the system be organized?
are clients and servers different?
how are file and directory services organized?
- caching/no caching
server
client
how are updates handled?
sharing semantics?
stateful versus stateless
Directory Service¶
- Separate directory service
(+) the functions are unrelated (e.g., one can implement a DOS directory server and a UNIX directory server that both use the same file server)
(+) simpler
(-) requires going to the directory server to map symbolic names onto binary names
(-) requires more communication per lookup
Stateless versus Stateful¶
- Stateless advantages
Fault tolerance
No OPEN/CLOSE calls needed
No server space wasted on tables
No limits on number of open files
No problem if client crashes
- For example
each request is self-contained
if the server crashes - no information is lost
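What a self-contained request looks like in practice; the field names are NFSv2-like but illustrative:

```python
# Every request carries everything the server needs; no server-side
# session state is referenced.
read_request = {
    "fhandle": b"...opaque durable file id...",  # valid across server reboots
    "offset": 8192,                              # where in the file to read
    "count": 4096,                               # how many bytes to read
}
# If the server crashes and restarts, the client simply resends the same
# request; because the operation is idempotent, the retry is harmless.
```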
Caching¶
- One of the most important design considerations
impacts performance
If caching – how should it be done?
Caching - Server¶
- Server Disk
(+) most space
(+) one copy of each file
(+) no consistency problem
- (-) performance
each access requires disk access -> server memory -> network -> client memory
- Server Memory
keep MRU files in server’s memory
If request satisfied from cache ==> no disk transfer BUT still network transfer
- Unit of caching?
whole files: (+) high-speed transfer, (-) too much memory
blocks: (+) better use of space
- What to replace when cache full?
LRU (a minimal sketch follows)
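A minimal LRU cache for blocks, using an ordered dictionary as the recency list (illustrative, not thread-safe):

```python
from collections import OrderedDict

class LRUCache:
    """Block-granularity cache with least-recently-used eviction."""
    def __init__(self, capacity):
        self.capacity, self.blocks = capacity, OrderedDict()

    def get(self, key):
        if key not in self.blocks:
            return None                       # miss: caller fetches from disk
        self.blocks.move_to_end(key)          # mark as most recently used
        return self.blocks[key]

    def put(self, key, data):
        self.blocks[key] = data
        self.blocks.move_to_end(key)
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)   # evict least recently used
```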
Caching - Client¶
- Disk
slower
more space
- Memory
less space
faster
Where to cache?
- User Address Space
cache managed by the system-call library
the library keeps the most heavily used files
when the process exits - cached files are written back to the server
(+) simple
(+) low overhead
(-) only effective if files are repeatedly used (by the same process)
- Kernel
(-) the kernel is involved in all cases (even for a cache hit)
(+) the cache survives beyond the process (e.g., a two-pass compiler: the file from the first pass is still available in the cache)
- Separate user-level cache manager
(+) keeps the kernel free of file system code
(+) more flexible
(-) little control over memory space allocation
e.g., virtual memory may page out the cache, so even a cache hit can cause a disk operation
Client - Cache Consistency¶
- client caching introduces inconsistency
one or more writers and multiple readers
- Write-thru
similar to write-thru between a processor cache and memory
when a block is modified, it is immediately sent to the server (and also kept in the cache)
- problem
a client on machine 1 reads a file
and modifies it (server updated)
a client on machine 2 reads and modifies the file (server updated)
another client on machine 1 reads the file
and gets its local copy (which is stale)
- solution: validate before use
the cache manager checks with the server before providing a cached file to the client
if the local copy is up to date, provide it to the client
else get the file from the server
the RPC for the check is not as expensive as a full file access
- Performance problems
reads are fine
each write generates network traffic (very expensive)
compromise: writes are collected for a period (say, 30 seconds) and then sent to the server in a batch
this eliminates the writing of many short-lived scratch files entirely (files that would otherwise be written)
Note: delayed writes change the semantics
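A sketch of the check-before-use validation described above; `server.get_mtime` and `server.fetch` are hypothetical RPCs, the first much cheaper than the second:

```python
class CachedFile:
    def __init__(self, data, mtime):
        self.data, self.mtime = data, mtime

def open_cached(server, cache, name):
    entry = cache.get(name)
    if entry is not None and server.get_mtime(name) == entry.mtime:
        return entry                      # cheap RPC confirmed it is current
    data, mtime = server.fetch(name)      # full transfer only if stale/missing
    cache[name] = CachedFile(data, mtime)
    return cache[name]
```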
Client - Cache Consistency - Other Options¶
- Write-on-Close
session semantics
wait (say, 30 seconds) after the close to see if the file is deleted
in that case the write is eliminated entirely
- Centralized Control
the file server keeps track of who has each file open and in what mode
on a new read request, the server checks whether the file is already open for read or write
if only readers - grant the request
if a writer - deny the request
when a file is closed - inform the others
Many variations possible
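One such variation, sketched below: the server tracks open modes, granting concurrent readers while excluding writers (all names illustrative):

```python
class CentralizedControl:
    def __init__(self):
        self.open_files = {}        # name -> (mode, count)

    def request_open(self, name, mode):
        state = self.open_files.get(name)
        if state is None:                           # nobody has it open
            self.open_files[name] = (mode, 1)
            return True
        held_mode, count = state
        if held_mode == "read" and mode == "read":  # readers can share
            self.open_files[name] = ("read", count + 1)
            return True
        return False                                # a writer is involved: deny

    def close(self, name):
        mode, count = self.open_files[name]
        if count == 1:
            del self.open_files[name]   # last closer: others can now open
        else:
            self.open_files[name] = (mode, count - 1)
```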
Replication¶
- Multiple copies of files for
increased reliability so no data is lost
increased availability when one server is down
improved performance through division of load
Replication - Update Protocols¶
- send the update to each replica in sequence
problem - if the updating process crashes in the middle ==> inconsistent copies
- Primary Copy Replication
one server is designated as the primary
the primary is updated first (changes made locally by the primary server)
the primary server then updates the secondary copies
reads can be done from any copy
- to guard against primary copy failure
updates are first stored on stable storage
BUT if the primary copy is down - no update can be made!!
Replication - Voting Algorithm¶
Requires clients to acquire the permission of multiple servers before reading/writing a file
With the file replicated on N servers, a client must contact a majority, N/2 + 1 servers, to perform an update
The file is changed and a new version number is assigned
- To read - the client contacts N/2 + 1 servers
and will always see the latest version number
Replication - General Quorum Algorithm¶
Number of replicas - N
Read quorum - Nr
Write quorum - Nw
Constraints: Nr + Nw > N (so every read quorum overlaps every write quorum); typically also Nw > N/2, so two concurrent writes cannot both succeed on disjoint quorums
A read/write requires the participation of the corresponding quorum
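A sketch of the quorum rule with version numbers (replicas are hypothetical objects with .version and .value):

```python
def quorum_read(replicas, nr):
    # any Nr replicas suffice: the overlap guarantee means at least one of
    # them holds the latest version, so the highest version number wins
    contacted = replicas[:nr]
    return max(contacted, key=lambda r: r.version).value

def quorum_write(replicas, nw, new_value):
    contacted = replicas[:nw]
    version = max(r.version for r in contacted) + 1  # assign a new version
    for r in contacted:
        r.version, r.value = version, new_value

# Example: N = 5 with Nr = 2 and Nw = 4 satisfies Nr + Nw > N, so every
# read quorum intersects every write quorum in at least one replica.
```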
Case Study - SUN NFS¶
- NFS - Network File System
designed to allow an arbitrary collection of clients and servers to share a common file system
- Design Goals
heterogeneity
access transparency
local and remote accesses look the same - e.g., normal UNIX system calls
failure transparency
stateless
idempotent
performance transparency
client caching
server caching
- Location Transparency
client establishes file name space by adding remote file systems to local name space
file system exported by servers (node holding it)
file system remote-mounted by client
- Not Supported in Design Goals
- Replication transparency
separate service for replication (NIS)
- Concurrency
Naïve locking
- Scalability
limited
originally designed to support 5-10 clients
SUN NFS - Implementation¶
- VFS Layer
maintains table with one entry for each open file
entry called v-node (indicates whether local or remote)
v-node points to an i-node (for local files) or an r-node (for remote files)
- Typical Operation
- Read
locate v-node
determine local or remote
transfers normally occur in 8 KB blocks
automatic prefetching (read-ahead) of next block
- Write
writes are not immediately sent to the server
8 KB of data is collected before writing
- Caching
server caches (to reduce disk accesses)
- client maintains caches
for file attributes (i-nodes)
for file data
- cache block consistency problems
each cache block has an associated timer
the entry is discarded when the timer expires
when a file is opened, the server is consulted for the file's last modification time
UNIX semantics are therefore not completely enforced
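A sketch of the timer-based cache logic just described, and of why UNIX semantics are only approximated (a remote write can go unseen until a timer expires); `server.getattr` and `server.read` are hypothetical stand-ins for the NFS GETATTR and READ procedures:

```python
import time

ATTR_TIMEOUT = 3.0   # seconds of blind trust in a cached entry

class CacheEntry:
    def __init__(self, data, mtime):
        self.data, self.mtime = data, mtime
        self.fetched_at = time.monotonic()

def read_cached(server, cache, fhandle):
    entry = cache.get(fhandle)
    if entry and time.monotonic() - entry.fetched_at < ATTR_TIMEOUT:
        return entry.data                    # fresh timer: no RPC at all
    mtime = server.getattr(fhandle)          # timer expired: revalidate
    if entry and entry.mtime == mtime:
        entry.fetched_at = time.monotonic()  # unchanged: reset the timer
        return entry.data
    data = server.read(fhandle)              # changed or missing: refetch
    cache[fhandle] = CacheEntry(data, mtime)
    return data
```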