Distributed Systems and Storage

Types of Non-Local Storage

  1. SANs - Storage Area Networks

  2. NASs - Network Area Storage

  3. NFS/CIFS - Network File System / Common Internet File System

  4. Cloud Storage / Sync Services

  5. Distributed File Systems

  6. Parallel File Systems

How Do We Evaluate Storage Systems?

  • Latency - How long does a single operation of the smallest fundamental unit take to complete.

  • Throughput - How many bytes of data per second can we read or write

  • Parallel scaling - When several requests are issued at the same, or nearly the same time, how are latency and throughput affected.

  • Resilience to failure of storage components - How many and which types of storage hardware failures can be tolerated without the system availability being interrupted and how is performance affected.

  • Resilience to failure of network components - If a network is partitioned or fails, does the storage system remain consistent and how are partially complete operations handled

  • Semantics - What capabilities are offered to clients? Can existing files be re-written? Are there folders? Is random access possible? How are files locked and shared? Are there any transactional semantics?

  • Location Transparency

Storage Devices

  • Types of permanent devices:

    • Magnetic - hard disk, tape, floppy disk

    • Optical - CD/DVD/Blu-Ray, Laser Disc, Paper, Punch Cards, Photo Film

    • Solid State - CMOS, NAND based flash, battery backed dynamic memory

  • Types of transient devices:

    • RAM, Processor Caches

Storage and Failure

Maximizing Availability - RAID

  • To prevent the loss of availability of data, the use of RAID (Redundant Array of Inexpensive Disks) allows for redundant copies of data to be stored.

  • Common RAID levels are:

    • RAID 0 - splits data across disks. Increases disk space and provides no redundancy. 2 or more disks are needed.

    • RAID 1 - creates an exact copy of data on two or more disks.

    • RAID 5/6 - splits data across disks. Uses one or more disks for parity. This allows 1-K out of N disks to fail and allow the data of any lost disk to be recovered. 3 or more disks are needed.


  • RAID has three common implementation approaches:

    • Complete hardware implementation - a disk controller or expansion card implements RAID. Several disks are connected to this controller and it is presented to the operating system as a single storage device. Often have reliability guarantees.

    • Partial hardware implementation - Same as the complete hardware implementation, except parity calculations, and buffering are delegated to the host CPU and memory. Don’t often have reliability guarantees.

    • Software implementation - The operating system itself manages several disks and presents to the file-system layer a single storage device.

RAID - 0



RAID - 1



RAID - 5



Local Storage

  • At the basis of almost any distributed system are the factors involved in local storage systems.

  • The problems presented in local storage are simpler and less composed than in their distributed counter-parts.

Implementing Files and Folders

  • How files and folders are implemented in a storage medium can greatly depend upon the physical characteristics and capabilities of that medium.

  • For example, on tape-drives, CD/DVD/Blu-Ray, or write-once media, files and folders are stored contiguously with no fragmentation. All of the information about the filesystem can be held in a TOC (Table Of Contents).

  • For filesystems with files that have a finite lifetime, such as on flash media, hard disks, SSDs, and others, the layout of files and folders must be maintained in a more complex way.

  • Among these more advanced methods are linked lists and i-nodes.

  • To manage free-space, objects like bit-maps and linked lists are possibilities.


  • inodes are the fundamental structures of a UNIX filesystem

  • inodes have the following attributes:

    • File Ownership - user, group

    • File Mode - rwx bits for each of user, group, and others

    • Last access and modified timestamps

    • File size in bytes

    • Device id

    • Pointers to blocks on the storage device for the file or folder’s contents

Inodes - Indirect Blocks

  • The strategy of using indirect, double indirect, and even triple indirect blocks is a very successful implementation strategy

  • This approach is used by ext2 / ext3 / ext4 in Linux.


Block Caches

  • To improve the performance of a filesystem, and to make disk scheduling algorithms more realizable, most operating systems implement some kind of block cache.

  • The block cache allows for read-ahead and write-behind. It also allows for lower latency I/O operations.

  • With a block cache, the write() system call for instance only needs to complete modifications to the cache before returning. The operating system can complete the operation on disk in a background thread.

  • Without this cache, the system call would not be able to return until the write had been committed to disk.

  • Important parameters of any block cache are:

    • The size of the cache in physical memory

    • The delay before committing ’dirty’ items in the cache to disk

  • The larger the cache, the better the filesystem will likely perform, but this can come at the cost of available memory for programs.

  • The larger the delay before writing items to the disk, the better the disk allocation and scheduling decisions the operating system can make.

  • The shorter the delay before writing to disk, the greater the guarantee in the presence of failure that modifications will be persisted to disk.

Folders and Path Traversal

  • In all but the most simple filesystems, there is a concept of a folder and a path.

  • In UNIX operating systems, folder entries are held within inodes that have the filetype in the mode set to type directory.

  • The contents of the inode then are a list of filenames and pointers to the inodes of those files and/or folders.

  • Resolving paths involve accessing a root folder, and accessing each folder recursively until reaching a file or finding the folder to be invalid.

  • An example of path traversal. When traversing paths, the path may cross into different filesystems.


Virtual Filesystems / VFS

  • Aside from files and folders there are other things like named pipes, domain sockets, symbolic and hard links that need to be handled by the filesystem.

  • Rather than have the semantics of these implemented in each filesystem implementation, many OS architectures include a virtual filesystem or VFS.

  • The VFS stands between the OS kernel and the filesystem implementation.

Virtual Filesystems / VFS

  • The VFS can help adapt both foreign filesystems (such as VFAT) by producing a contract that these implementations can adapt to.

  • The VFS can also help reduce code duplication between FS implementations by providing common structures and handling shared behavior:

    • Path traversal

    • Handling named pipes, domain sockets, etc…

    • Managing file handles and file locking

    • Structures and functions for the block cache.

    • Structures and functions for accessing storage devices

Virtual Filesystems and Stacking

  • In some VFS implementations it is possible to stack filesystems on top of each other.

  • A great example of this in Linux is UMSDOS: the base VFAT filesystem does not have support for users, groups, security or extended attributes. By creating special files on VFAT and then hiding them, UMSDOS can adapt VFAT to be a UNIX-like filesystem

  • Another great example of this is UnionFS. It allows two filesystems to be transparently overlaid.

Distributed Filesystems

  • Flat file service
    • implements operations on the contents of file

    • UFID (Unique File Ids) used to refer to files

    • new UFID assigned when file created

  • Directory service
    • provides mapping between text names and UFIDs

    • Functions to create, update.. directories

  • Client module
    • runs on client computer

    • provides APIs to access files

    • holds information about network location of file server and directory server

    • sometimes caching at client

File Service Model

  • Upload/download model
    • read/write file operations

    • entire file transferred to client

    • requires space on client

    • Products like SkyDrive and DropBox work like this

  • Remote Access Interface
    • large number of operations
      • seek, changing file attributes, read/write part of file…

      • does not require space (as much) on client

Directory Service

  • Key issue for distributed file system
    • whether all clients have the same VIEW of the directory hierarchy

Naming Transparency

  • Location Transparency
    • path names give no hint as to where the files are located

    • e.g., /server1/dir1/dir2/X indicates X located on server1 but NOT where server1 is located

    • Problems? If X needs to be moved to another server (e.g., due to space) - say server2 - programs with strings built in will not work!

  • Location Independence
    • files can be moved without changing their names

  • Three common approaches to file and directory naming
    • Machine + path naming, such as /machine/path or machine:path (location dependent)

    • Mounting remote file systems onto the local file hierarchy (location dependent)

    • A single name space that looks the same on all machines (location independent)

File Sharing Semantics

  • When files are shared (and one or more write) what are the semantics?

UNIX Semantics

  • A read is always provided with the last write (system enforces absolute time ordering)

  • UNIX semantic can be achieved by
    • read/write going to server

    • no caching of files

    • sequential processing by server

    • BUT in distributed systems, this may perform poorly!

  • How to improve performance?
    • requires caching

  • Modify Semantics?
    • “changes to an open file are initially visible only to the process that modified the file. When file closes, changed become visible to others”

    • Called Session Semantics

More Semantics

  • Q What is the result of multiple (simultaneous) updates of cached file?
    1. final result depends on who closed last!

    2. one of the results, but which one it is can not be specified (easier to implement)

  • Immutable Files
    • can only create and read files

    • can replace existing file atomically

    • to modify a file, create new one and replace

    • what if two try to replace the same file?

    • what if one is reading while another tries to replace?

Distributed File System Implementation

  • Need to understand file usage (so that)
    • implement common operations well

    • achieve efficiency

  • Satyanarayan (CMU) of file usage pattern on UNIX

System Structure

  • How should the system be organized?
    • are clients and server different?

    • how are file and directory services organized?

    • caching/no caching
      • server

      • client

    • how are updates handled?

    • sharing semantics?

    • stateful versus stateless

Directory Service

  • Separate
    • (-) requires going to directory servers to map symbolic names onto binary names

    • (+) functions are unrelated (e.g., implement DOS directory server and UNIX server- both use same file server

    • (+) simpler

    • requires more communication

  • Lookup

Stateless versus Stateful

  • Stateless advantages
    • Fault tolerance

    • No OPEN/CLOSE calls needed

    • No server space wasted on tables

    • No limits on number of open files

    • No problem if client crashes

  • For example,
    • each request self contained

    • if server crashes - no information lost


  • One of the most important design considerations
    • impacts performance

    • If caching – how should it be done?

Caching - Server

  • Server Disk
    • (+) most space

    • (+) one copy of each file

    • (+) no consistency problem

    • (-) performance
      • each access requires disk access –> server memory –> network –> client memory

  • Server Memory
    • keep MRU files in server’s memory

    • If request satisfied from cache ==> no disk transfer BUT still network transfer

      1. Unit of caching? Whole files
        • (+) high speed transfer

        • (-) too much memory

    • Blocks + better use of space

      1. What to replace when cache full?
        • LRU

Caching - Client

  • Client Caching

  • Disk
    • slower

    • more space

  • Memory
    • less space

    • faster

  • Where to cache?

  • User Address Space
    • cache managed by system call library

    • library keeps most heavily used files

    • when process exits - written back to server

    • (+) simple

    • (+) low overhead

    • (-) effective if file repeatedly used

  • Kernel
    • (-) kernel needed in all cases )even for a cache hit)

    • (+) cache survives beyond process ( e.g., two pass compiler - file from first pass available in cache)

    • (+) kernel free of file system

    • (+) more flexible

    • little control over memory space allocation
      • e.g., virtual memory may result in disk operation even if cache hit

Client - Cache Consistency

  • client caching introduces inconsistency
    • one or more writers and multiple readers

  • Write-thru
    • similar to between processor cache and memory

    • when a block modified - immediately sent to server (also kept in cache)

  • problem
    • client on machine 1 reds file
      • modify file (server updated)

    • client on machine 2 reads and modifies files
      • server updated

    • another client on machine 1 reads file
      • gets local copy (which is stale)

  • solution: write-thru
    • cache manager checks with the server before providing file to client

    • If local copy upto-date
      • provide to client

    • Else get from server

    • RPC for check is not as expensive as file access

  • Performance problems
    • read is fine

    • each write generates network traffic (very expensive)

    • compromise - periodic updates (say 30 sec) of writes

    • collected and sent to server

    • eliminates writing of many scratch files completely (which otherwise would be written)

  • Note- semantics have changed for delayed writes

Client - Cache Consistency - Other Options

  • Write-on-Close
    • session semantics

    • wait (delay - say 30 sec) after close to see if file deleted
      • in that case write eliminated

  • Centralized Control
    • File server keeps track of who has file and in what mode

    • if new request for read - server checks to see if file opened read/write

    • if read - grant request

    • if write - deny request

    • when file closed - inform others

    • Many variations possible


  • Multiple copies of files for
    • increased reliability so no data is lost

    • increased availability when one server is down

    • improved performance through division of load

Replication - Update Protocols

  • send update to each file in sequence
    • problem - if process updating crashes in the middle ==> inconsistent copies

  • Primary Copy Replication
    • one server designed as primary

    • primary updated (changes made locally by primary server)

    • primary server updates secondary copies

    • reads can be done from any copy

    • to guard against primary copy failure
      • updates first stored on stable storage

    • But if primary copy down - No update can be made!!

Replication - Voting Algorithm

  • Requires clients to acquire permission of multiple servers before reading/writing file

  • File replicated on N servers - to update client needs to contact majority , N/2 + 1 servers

  • File changed and new version no assigned

  • To read - client contacts N/2 + 1 servers
    • will always get the latest version

Replication - General Quorum Algorithm

  • No of replicas - N

  • Read Quorum - Nr

  • Write Quorum - Nw

  • Constraints Nr + Nw > N

  • Read/write requires participation of the corresponding quorum

Case Study - SUN NFS

  • NFS - Network File System
    • designed to allow an arbitrary collection of clients and servers to share a common file system

  • Design Goals
    • heterogeneity

    • access transparency

    • local and remote accesses look the same - e.g., normal UNIX system calls

    • failure transparency

    • stateless

    • idempotent

    • performance transparency

    • client caching

    • server caching

  • Location Transparency
    • client establishes file name space by adding remote file systems to local name space

    • file system exported by servers (node holding it)

    • file system remote-mounted by client

  • Not Supported in Design Goals
    • Replication transparency
      • separate service for replication (NIS)

    • Concurrency
      • Naïve locking

    • Scalability
      • limited

      • originally designed to support 5-10 clients

SUN NFS - Implementation

  • VFS Layer
    • maintains table with one entry for each open file

    • entry called v-node (indicates whether local or remote)

    • v-node points to I-node (for local files) and r-node (for remote files)

  • Typical Operation
    • Read
      • locate v-node

      • determine local or remote

      • transfer occurs in 8K (normally) byte blocks

      • automatic prefetching (read-ahead) of next block

    • Write
      • writes not immediately written to server

      • 8K bytes collected before writing

  • Caching
    • server caches (to reduce disk accesses)

    • client maintains cache for
      • for file attributes (I-nodes)

      • for file data

    • cache block consistency problems
      • with each cache block is a timer

      • entry discarded when timer expired

      • when file opened- server checked for last modification of file

    • UNIX semantics not completely enforced