Distributed Program Construction
Fall 2002
Lecture 2: Intro to Design of Distributed Systems
COMP 413
Design Goal: Transparency
Hide distribution, provide
single-image view
- Two levels: user, and application (harder)
Kinds: (ISO-ODP)
- Access - identical operations for accessing
local and remote objects
- Location - enables access to objects without
knowledge of their location
- Migration - objects can move at will
- without changing their names
- without affecting the operation of running
applications (harder)
- Replication - objects can be replicated (to increase
reliability and performance) without the applications
knowing about the replicas
- Concurrency - multiple applications can share
resources automatically
Design Goal: Reliability
Availability - Fraction of the time that the system is available
- In case of failure, another process/machine can take
over
- Ideal situation - Boolean OR of individual availabilities
(the system is up if any one machine is up)
- Worst situation - Boolean AND (the system is up only if every
machine is up, which is inferior to a single machine)
- Techniques:
- Reduce the number of critical components that must function simultaneously
- Redundancy (sw and hw). But complicates consistency.
- Avoid single point of failure
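As a rough illustration of the OR and AND cases above, the following sketch computes system availability from per-machine availabilities. It assumes machines fail independently, which the slide does not state; the function names and the 0.95 figure are made up for the example.

```python
from math import prod


def availability_any(avails):
    """OR case: the system is up if at least one machine is up.

    Assumes independent failures (an assumption of this sketch).
    """
    return 1.0 - prod(1.0 - a for a in avails)


def availability_all(avails):
    """AND case: the system is up only if every machine is up."""
    return prod(avails)


machines = [0.95, 0.95, 0.95]  # hypothetical per-machine availabilities
print(f"any one of 3 suffices: {availability_any(machines):.6f}")
print(f"all 3 required:        {availability_all(machines):.6f}")
```

With these numbers the OR configuration comes out well above a single machine, while the AND configuration falls below it, which is why a design that needs every component working at once can be worse than not distributing at all.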
Fault Tolerance - Recover properly from failures.
- Minimize loss of data and state
- Minimize impact on running applications
- Retain the system consistency
- In a distributed system, fault tolerance is more
complex due to:
- Partial failure (distributed state)
- Both network and host failure
- Complex failure modes (Byzantine)
Design Goal: Performance
- Major problem: network communication is slow (compared
to intra-machine communication) due to protocol handling, routing (on
WAN) etc.
- But to gain performance through parallelism, we need to
send messages
- Design considerations:
- Granularity of sub-computation
- Interaction between components (favour coarse-grained
parallelism)
- Overlap computation and communication
- Another problem: reliability conflicts with performance
- e.g., send extra message to a "reserve" server
- Important guideline: identify and avoid bottlenecks.
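One way to picture "overlap computation and communication" is a simple prefetch loop: while the program works on chunk i, the transfer of chunk i+1 is already in flight. The sketch below is only an illustration; `fetch` and `compute` are hypothetical stand-ins, and the sleep merely simulates network latency.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch(chunk_id):
    """Hypothetical remote fetch; the sleep stands in for network latency."""
    time.sleep(0.01)
    return list(range(chunk_id * 4, chunk_id * 4 + 4))


def compute(chunk):
    """Hypothetical local computation on one chunk."""
    return sum(x * x for x in chunk)


def process_overlapped(num_chunks):
    """Fetch chunk i+1 in the background while computing on chunk i."""
    total = 0
    if num_chunks == 0:
        return total
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(fetch, 0)
        for i in range(num_chunks):
            chunk = pending.result()  # wait for the chunk already in flight
            if i + 1 < num_chunks:
                pending = pool.submit(fetch, i + 1)  # start the next transfer
            total += compute(chunk)  # runs while that transfer proceeds
    return total


print(process_overlapped(3))
```

The same structure also shows the granularity trade-off: if each chunk is tiny, the per-message cost dominates and the overlap buys little.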
Design Goal: Scalability
- Build a system that can scale up to many machines/users,
with minimal changes to system.
- Guiding principle: Avoid centralized services,
data, and algorithms
- Centralized service:
- performance bottleneck - processor and network capacities
- large communication overhead, even for "local interaction"
- single-point-of-failure
- Centralized data has similar problems
- Symmetric systems (i.e., any node can play any role)
tend to be most scalable
- Centralized algorithms that require complete information
introduce large communication overhead
- Desirable characteristics of distributed algorithms:
- partial information in each machine
- Some or most decisions are based on locally available
information
- Failure of a single machine does not kill the algorithm
- There is no assumption about a global clock (synchronization)
- Related scalability - allow operation over WAN
Design Goal: Security
Problems: Faced with an intelligent, malicious attacker,
- maintain integrity of state
- maintain privacy of data
- maintain availability of services (arguably hardest)
- prevent unauthorized use of services
Guiding principles:
- design for security from the start
- separate security policies from mechanisms
- strong authentication and encryption
- tight resource management
- small trusted computing base
- rely on diversity (voting)
Case study: Distributed File Systems
General idea:
- provide file system services
(i.e., storage, access, structure, protection) on multiple machines, transparently
(as on a single machine)
- allow sharing of multiple (heterogeneous)
local file systems
- Distinction between file service
(specification, interface) and file server (a process that implements
part of the file service)
Issues:
- Naming, and directory services
- location transparency -
servers can move without affecting applications
- location independence - objects
can move without changing their names
- Performance enhancements via
file caching
- problem: cache consistency (e.g.,
dirty read, lost update)
- solutions (expensive):
- Write through, delayed write, session semantics
- Fault tolerance via file
replication
- problem: replica consistency
- solutions: primary copy, voting
- Semantics of file sharing
- Complicated with multiple caches/replicas
- Stateless vs. Stateful servers
- How to exploit file usage patterns
Two examples: NFS and AFS
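The "voting" solution to replica consistency listed under Issues can be sketched as a quorum scheme with version numbers. This is a minimal illustration under assumptions that are not in the slides: n replicas, a read quorum r and write quorum w with r + w > n and 2w > n, and a caller that says which replicas are reachable. The class and parameter names are hypothetical.

```python
class Replica:
    """One copy of the file, tagged with a version number."""

    def __init__(self):
        self.version = 0
        self.data = b""


class VotedFile:
    def __init__(self, n, r, w):
        # Every read quorum must meet every write quorum, and any two
        # write quorums must meet, so version numbers only move forward.
        if r + w <= n or 2 * w <= n:
            raise ValueError("need r + w > n and 2w > n")
        self.replicas = [Replica() for _ in range(n)]
        self.r, self.w = r, w

    def write(self, data, reachable):
        """Write to w of the replicas whose indices are in `reachable`."""
        if len(reachable) < self.w:
            raise RuntimeError("write quorum not available")
        quorum = reachable[:self.w]
        version = 1 + max(self.replicas[i].version for i in quorum)
        for i in quorum:
            self.replicas[i].version = version
            self.replicas[i].data = data

    def read(self, reachable):
        """Poll r replicas and return the copy with the highest version."""
        if len(reachable) < self.r:
            raise RuntimeError("read quorum not available")
        quorum = reachable[:self.r]
        newest = max((self.replicas[i] for i in quorum),
                     key=lambda rep: rep.version)
        return newest.data


f = VotedFile(n=3, r=2, w=2)
f.write(b"v1", reachable=[0, 1])
print(f.read(reachable=[1, 2]))  # replica 2 is stale, replica 1 is current
```

Because the quorums overlap, a reader always contacts at least one replica that saw the latest write, at the cost of contacting several replicas on every operation.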
Sun's Network File System (NFS)
- Extension of Unix file systems
(transparent to Unix applications)
- supports its standard system
call interface
- But has an open specification
of file service, and has been implemented by different systems (e.g., Windows).
- Client-server architecture
with 2 elements: directory mount, and file access
- Mount:
- NFS servers export directories
that they are willing to share with clients
- NFS clients mount remote
directories on a locally-defined path in the file system (location transparency)
- different clients may mount the
same remote directory onto different locations.
- once mounted, access to remote
files is identical to local files (access transparency)
- Mount protocol:
- request: mount(remote-directory)
- reply: file-handle (file-system
type, disk, i-node number of directory, security-info)
- An alternative: automount
- mount on demand, choose from a list (fault tolerance)
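The effect of mounting can be pictured as a client-side table that maps a local path prefix to a server and an exported directory. The sketch below is not the NFS mount protocol itself, only an illustration of the name mapping; the class, server name, and paths are hypothetical.

```python
import posixpath


class MountTable:
    """Per-client table: local mount point -> (server, exported directory)."""

    def __init__(self):
        self.mounts = {}

    def mount(self, local_path, server, remote_dir):
        self.mounts[posixpath.normpath(local_path)] = (server, remote_dir)

    def resolve(self, path):
        """Return (server, remote_path), or (None, path) for a local file."""
        path = posixpath.normpath(path)
        # The longest matching mount point wins.
        for mp in sorted(self.mounts, key=len, reverse=True):
            if path == mp or path.startswith(mp.rstrip("/") + "/"):
                server, remote_dir = self.mounts[mp]
                rest = path[len(mp):].lstrip("/")
                remote = posixpath.join(remote_dir, rest) if rest else remote_dir
                return server, remote
        return None, path


# Two clients mount the same export at different local paths.
client_a, client_b = MountTable(), MountTable()
client_a.mount("/mnt/proj", "srv1", "/export/proj")
client_b.mount("/home/shared", "srv1", "/export/proj")
print(client_a.resolve("/mnt/proj/a/b.txt"))
print(client_b.resolve("/home/shared/a/b.txt"))
```

Both lookups end at the same remote file, which is the sense in which different clients may see the same directory under different local names.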
NFS stateless File Access Protocol
- No OPEN and no CLOSE (server
does not know which files are opened by clients).
- Each read/write contains full
information (including position in file)
- Main advantage: Fault tolerance
(server recovery)
- other advantage: scalability
(no tables kept in server)
- Drawback: information associated
with open files does not exist
- Example: authenticated access:
- early "solution": send user and
group id with the message
- real solution: use cryptography
for authentication
- requires use of external authentication
service (e.g., NIS)
- Example: File locking - requires
external mechanism (lockd)
- forwards remote fcntl locking
requests to remote lock managers
- generates local file locking
operations in response to requests from remote lock managers
- fault tolerance - when rebooted,
recovers locks by contacting remote lock-managers to reclaim their locks
(done by statd)
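To make the stateless idea concrete, here is a small sketch in which every request names the file handle, offset, and length, and the client alone tracks the current file position. It is an illustration only, not the NFS wire protocol; the class names and the string handle "fh1" are hypothetical.

```python
class StatelessServer:
    """Holds file contents but no per-client or per-open state."""

    def __init__(self, files):
        self.files = files  # file handle -> bytearray

    def read(self, handle, offset, count):
        return bytes(self.files[handle][offset:offset + count])

    def write(self, handle, offset, data):
        buf = self.files[handle]
        if len(buf) < offset:
            buf.extend(b"\0" * (offset - len(buf)))  # pad a gap with zeros
        buf[offset:offset + len(data)] = data
        return len(data)


class Client:
    """The client, not the server, remembers the file position."""

    def __init__(self, server, handle):
        self.server, self.handle, self.pos = server, handle, 0

    def read(self, count):
        data = self.server.read(self.handle, self.pos, count)
        self.pos += len(data)
        return data


disk = {"fh1": bytearray(b"hello world")}
client = Client(StatelessServer(disk), "fh1")
print(client.read(5))
# Simulate a server crash and restart: a new server object, same disk.
client.server = StatelessServer(disk)
print(client.read(6))
```

Because each request is self-describing, the restarted server can answer the second read without any recovery step, and repeating a write with the same offset leaves the file in the same state.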
NFS Implementation (in SUN)
- Inside the Unix kernel
- Virtual file system layer shields
programs from NFS
- v-node points to i-node or r-node (remote node)
- mount (user-mode) - contacts remote server,
gets file handle, and calls MOUNT syscall
- kernel constructs v-node and requests NFS client to
create an r-node, pointed to by v-node
- open - NFS client gets a file handle from server,
constructs an r-node, and kernel creates a v-node
- File translation is expensive
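The role of the virtual file system layer can be sketched as a v-node that forwards each operation to whichever node it points at, so the calling program never sees the difference between local and remote files. This is a simplified user-level illustration, not kernel code; all class names and the `FakeServer` helper are hypothetical.

```python
class INode:
    """Local file: the bytes object stands in for on-disk data."""

    def __init__(self, data):
        self.data = data

    def read(self, offset, count):
        return self.data[offset:offset + count]


class FakeServer:
    """Hypothetical stand-in for a remote NFS server."""

    def __init__(self, files):
        self.files = files  # file handle -> bytes

    def read(self, handle, offset, count):
        return self.files[handle][offset:offset + count]


class RNode:
    """Remote file: keeps the server's file handle and forwards requests."""

    def __init__(self, server, handle):
        self.server, self.handle = server, handle

    def read(self, offset, count):
        return self.server.read(self.handle, offset, count)


class VNode:
    """What the rest of the system sees; points to an i-node or an r-node."""

    def __init__(self, node):
        self.node = node

    def read(self, offset, count):
        return self.node.read(offset, count)


local = VNode(INode(b"local data"))
remote = VNode(RNode(FakeServer({"fh": b"remote data"}), "fh"))
for vnode in (local, remote):
    print(vnode.read(0, 6))  # same call, regardless of where the file lives
```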
Caching in NFS
Caching is used extensively to improve performance:
- Server caching is similar to Unix (using buffer cache),
except the server performs write-through (due to partial failure)
- Client caching saves network traffic but introduces
cache coherency problems
- Solution (cache validation):
- Blocks come with a last-modification timestamp
- Validation check per file: if modification time is
newer, invalidate all cached blocks of the file.
- Validation check on open and on each block
request from server
- After the check, blocks are assumed valid for 3 seconds
(30 for directories) - any access after that triggers a validation check.
- Write - modified pages are flushed on close
or client sync
- Still - Unix's one-copy update semantics are not provided!
- Rely on low probability of simultaneous updates and
small percentage of updates (about 5%)
- Other performance enhancements:
- transfers in large chunks (8K)
- read ahead and delayed write
- But no support for replication
NFS has limited scalability, operational mostly on LAN (due
to heavy network traffic)
For more information, see the NFS specification.
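The timestamp-based cache validation described above can be sketched as follows. To keep it short, the sketch caches whole files rather than blocks and takes the current time as an explicit argument; the 3-second freshness interval is the figure from the slide, while the class and method names are hypothetical.

```python
FRESH_FILE_SECS = 3  # freshness interval for files (30 for directories)


class Server:
    def __init__(self):
        self.files = {}  # name -> (modification time, data)

    def getattr(self, name):
        return self.files[name][0]

    def read(self, name):
        return self.files[name]

    def write(self, name, data, now):
        self.files[name] = (now, data)


class CachingClient:
    def __init__(self, server):
        self.server = server
        self.cache = {}  # name -> (mtime, data, time of last validation)

    def read(self, name, now):
        entry = self.cache.get(name)
        if entry is not None:
            mtime, data, checked = entry
            if now - checked < FRESH_FILE_SECS:
                return data  # assumed valid, no server contact
            if self.server.getattr(name) <= mtime:
                self.cache[name] = (mtime, data, now)  # revalidated
                return data
        mtime, data = self.server.read(name)  # miss, or server copy is newer
        self.cache[name] = (mtime, data, now)
        return data


server = Server()
server.write("f", b"v1", now=0)
client = CachingClient(server)
print(client.read("f", now=0))   # fetched from the server
server.write("f", b"v2", now=1)  # another client updates the file
print(client.read("f", now=2))   # still inside the freshness window
print(client.read("f", now=4))   # window expired, validation finds the update
```

The middle read returns stale data, which is a small instance of why one-copy update semantics are not guaranteed.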
Andrew File System (AFS)
Major goal - support for large-scale information sharing
(many users, large networks). Allow users to access their files from any workstation
connected to an AFS.
Approach - In order to minimize network traffic, localize
access to files by whole file caching
- open - download entire file to client, store
in local (disk) cache, and return file descriptor of local file
- close - If updated, upload file to the server
- read and write are unchanged: they use the
standard syscalls on the local copy
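The open/close behaviour above can be sketched with an in-memory server and a client-side cache manager. This is a toy model, not AFS code: the class names are hypothetical, the "descriptor" is just the file name, and a dictionary stands in for the local disk cache.

```python
class FileServer:
    def __init__(self):
        self.files = {}  # name -> bytes

    def fetch(self, name):
        return self.files.get(name, b"")

    def store(self, name, data):
        self.files[name] = data


class CacheManager:
    def __init__(self, server):
        self.server = server
        self.cache = {}     # name -> bytearray (stands in for the disk cache)
        self.dirty = set()  # names modified since they were opened

    def open(self, name):
        if name not in self.cache:
            # Download the entire file on first open.
            self.cache[name] = bytearray(self.server.fetch(name))
        return name  # stands in for a local file descriptor

    def read(self, fd):
        return bytes(self.cache[fd])  # purely local

    def write(self, fd, data):
        self.cache[fd] += data  # purely local
        self.dirty.add(fd)

    def close(self, fd):
        if fd in self.dirty:
            # Upload the whole file only if it was updated.
            self.server.store(fd, bytes(self.cache[fd]))
            self.dirty.discard(fd)


server = FileServer()
server.store("f1", b"abc")
cm = CacheManager(server)
fd = cm.open("f1")
cm.write(fd, b"def")
print(server.fetch("f1"))  # server copy unchanged until close
cm.close(fd)
print(server.fetch("f1"))
```

Reads and writes never touch the network here; only open and close do, which is the traffic pattern the approach is aiming for.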
Architecture:
- Client machines run a program called venus
(cache manager)
- Server machines run only vice -
a multithreaded file server
- Machines are grouped into clusters - each cluster
has a single Vice.
- Both vice and venus are user-level processes, running
under a slightly modified Unix kernel
- shared file sub-system is rooted at /afs
- "standard" path names can be used via symbolic links
- Files are grouped into Volumes for ease of location
and movement
- Each file has a 96-bit file identifier (fid),
used between venus and vice
- Venus maps pathname to fids
- Vice accesses files only via fids (new syscalls)
Session semantics: coherency addressed only at the level
of open/close, not while files are open.
If file F1 is opened by user u1 on machine M1:
- u2 running on M1 gets unix semantics
- u3 running on M2 gets to see original version of F1
from the Vice - fresh copy can be seen only after close by u1
When open is issued, how does Venus know if a local copy
is valid?
- Alternative 1 (NFS-like): ask Vice. Drawback - network
traffic for every open
- Alternative 2 (AFS):
- Vice records who has a copy of F1. Venus keeps a "valid
bit" for each copy.
- If the local copy is changed, it is sent to Vice, which
sends an "invalidate" message to every Venus with a copy of F1.
- Next time F1 is opened, venus downloads a new copy
- Main advantage over NFS: network traffic only when
file is updated
- Drawback: server needs to keep state (per file/machine,
not per opened file)
- Fault tolerance support:
- After failure, cached files are marked as invalid
- After T minutes with no communication with the server, an open
is preceded by a validation check
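Alternative 2 can be sketched as a server that remembers which cache managers hold a copy and invalidates them when a new version is stored. This is a toy single-process model; the method names, the `fetches` counter, and passing the new contents to `close` are simplifications made for the example, and the fault-tolerance timeout is left out.

```python
class Vice:
    def __init__(self):
        self.files = {}    # name -> bytes
        self.holders = {}  # name -> set of Venus instances holding a copy

    def fetch(self, name, venus):
        self.holders.setdefault(name, set()).add(venus)
        return self.files[name]

    def store(self, name, data, writer):
        self.files[name] = data
        for venus in self.holders.get(name, set()) - {writer}:
            venus.invalidate(name)  # the "invalidate" message
        self.holders[name] = {writer}


class Venus:
    def __init__(self, vice):
        self.vice = vice
        self.cache = {}   # name -> (data, valid bit)
        self.fetches = 0  # counts downloads, for illustration only

    def invalidate(self, name):
        if name in self.cache:
            data, _ = self.cache[name]
            self.cache[name] = (data, False)

    def open(self, name):
        entry = self.cache.get(name)
        if entry is None or not entry[1]:
            self.fetches += 1
            self.cache[name] = (self.vice.fetch(name, self), True)
        return self.cache[name][0]

    def close(self, name, new_data=None):
        if new_data is not None:  # file was updated while open
            self.cache[name] = (new_data, True)
            self.vice.store(name, new_data, self)


vice = Vice()
vice.files["F1"] = b"old"
m1, m2 = Venus(vice), Venus(vice)
m1.open("F1")
m2.open("F1")
m2.open("F1")                # valid bit set: no network traffic
m1.close("F1", b"new")       # Vice invalidates the copy held by m2
print(m2.open("F1"), m2.fetches)
```

The server-side `holders` table is exactly the per-file, per-machine state named as the drawback above.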
xFS: Serverless Distributed Network
File Service
- No servers -- clients cooperate to provide storage
and file services
- Use collective disk and memory of client machines
to provide disk storage and file cache
- Each client is responsible for handling requests for
a subset of the files
- Disk blocks striped across client disks for
high bandwidth
- Striped blocks include redundant information to tolerate
failure of client machines
- Complex design
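Striping with redundant information can be illustrated with a single stripe that carries an XOR parity block, so the contents of any one lost disk can be rebuilt from the others. XOR parity is an assumption chosen for this sketch; the slides do not say which redundancy scheme xFS uses, and the function names are hypothetical.

```python
def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)


def make_stripe(data, num_disks, block_size):
    """Split data into num_disks - 1 data blocks plus one parity block."""
    k = num_disks - 1
    if len(data) > k * block_size:
        raise ValueError("data does not fit in one stripe")
    padded = data.ljust(k * block_size, b"\0")
    blocks = [padded[i * block_size:(i + 1) * block_size] for i in range(k)]
    return blocks + [xor_blocks(blocks)]


def recover(stripe, lost):
    """Rebuild the block on disk `lost` from the surviving blocks."""
    survivors = [b for i, b in enumerate(stripe) if i != lost]
    return xor_blocks(survivors)


stripe = make_stripe(b"abcdefgh", num_disks=3, block_size=4)
print(recover(stripe, lost=0))  # rebuilt from the other data block and parity
```

Each block of the stripe would live on a different client's disk, so reads can proceed in parallel and the failure of one client does not lose data.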
Summary: Design Principles for DFS
- Favor client processing over server processing (in
the extreme, no servers)
- Cache whenever possible
- Exploit usage properties (e.g., cache temporary (unshared)
files)
- Minimize global knowledge and change
- Trust fewest possible entities
- Batch work where possible (e.g., few transfers of
large chunks)