Distributed Program Construction
Fall 1999
Lecture 9: Distributed Algorithms
Reading:
-
A. Tanenbaum: Distributed Operating Systems. Prentice
Hall, Englewood Cliffs, N.J ., 1995.
-
L. Lamport: “Time , Clocks, and the Ordering of Events
in a Distributed System. ”CACM, 21(7):558–565, July 1978.
COMP 413
Clock Synchronization
-
Physical clocks
-
Logical clocks
-
Vector clock
COMP 413
Problem: Sometimes we simply need the exact
time, not just an ordering.
Solution: Universal Coordinated Time (UTC):
-
Based on the number of transitions per second of the cesium
133 atom (pretty accurate).
-
At present, the real time is taken as the average of some
50 cesium clocks around the world.
-
Introduces a leap second from time to time to compensate
that days are getting longer.
UTC is broadcasted through shortwave radio and satellite.
Satellites can give an accuracy of about +-0.5 ms.
Question: Does this solve all our problems? Don't
we now have some global timing mechanism?
COMP 413
Problem: Suppose we have a distributed system with
a UTC-receiver on some node ==> we still have to distribute its time to
each machine.
Basic principle:
-
Every machine has a timer that generates an interrupt H
times per second.
-
There is a clock in machine p that ticks on each timer
interrupt. Denote the value of that clock by Cp (t), where t
is UTC time.
-
Ideally, we have for each machine p, Cp (t) = t,
or, in other words , dC/dt = 1.
In practice: 1-p <= dC/dt <= 1+p
Goal: Never let two clocks in any system differ
by more than d time units => synchronize at least every d/(2p)
seconds.
COMP 413
Clock Synchronization Principles
Principle I: Every machine
asks a time server for the accurate time at least once every d/(2p)
seconds.
Okay, but you need an accurate measure
of round trip delay, including interrupt handling and processing incoming
messages.
Principle II: Let the time
server scan all machines periodically, calculate an average, and inform
each machine how it should adjust its time relative to its
present time.
Okay , you’ll probably get every
machine in sync. Note: you don’t even need to propagate UTC time (why not?)
Fundamental problem: You’ll
have to take into account that setting the time back is never allowed
=>
smooth adjustments.
COMP 413
The Happened-Before Relationship
Problem: We first need to introduce a notion
of ordering before we can order anything. The happened-before relation
on the set of events in a distributed system is the smallest relation satisfying:
-
If a and b are two events in the same process,
and a comes before b, then a -> b.
-
If a denotes the sending of a message and b
the receipt of that message, then a -> b.
-
If a -> b and b -> c, then a -> c.
Note: This defines a partial ordering of events in
a system with concurrently executing processes.
COMP 413
Problem: How do we maintain a global view of the system's
behavior that is consistent with the happened-before relationship?
Solution: attach a timestamp C(e) to each
event e, satisfying the following properties:
-
P1: If a and b are two events in the
same process, and a -> b, then C(a) < C(b).
-
P2: If a corresponds to sending a message m,
and b to the receipt of that message, then C(a) < C(b).
Problem: How to attach a timestamp to an event when
there is no global clock? => maintain a consistent set of
logical clocks, one for each node.
COMP 413
Each node Ni maintains a local counter Ci,
and adjusts this counter according to the following rules:
-
(1) Between any two successive events within Ni, increment
Ci
by one.
-
(2) When a message m is sent by Ni, the message
receives a timestamp Tm <- Ci.
-
(3) When a message m is received by node Nj,
Cj
is
adjusted as Cj <- max{Cj, Tm+1}.
Property P1 is satisfied by (1); property P2
by (2) and (3).
COMP 413
COMP 413
Total Ordering With Logical Clocks
Problem: It can happened that two event happen
at the same (logical) time. Avoid this by attaching a node number to an
event:
Pi timestamps each event e
with Ci(e).i
Then: Ci(a).i before Cj(b).j iff:
-
Ci(a) < Cj(a), or
-
Ci(a) = Cj(b) and i < j
COMP 413
Extension to Multicasting: Vector Clocks
-
Each node Ni has an array VT(Ni)[1..n], where
VT(Ni)[k]
denotes the number of messages from Nk that have been delivered
at Ni.
-
When Ni sends a message m, it add 1
to VT(Ni)[i], and send VT(Ni) along with m. Result:
upon arrival, each other node knows Ni's vector timestamp.
-
When a node Nj receives m from Ni, with vector
timestamp VT(m), it delays delivery until:
-
VT(m)[i] = VT(Nj)[i] + 1
-
VT(m)[k] <= VT(Nj)[k] for
k <> i
-
When a node Nj delivers a message from Ni with
vector time VT(m) it adjusts each VT(Nj)[k] to max{VT(Nj)[k],VT(m)[k]}
Result is causally ordered multicast delivery!
COMP 413
Mutual Exclusion
Problem: A number of nodes in a distributed system
want exclusive access to some resource.
Basic solutions:
-
via centralized server
-
completely distributed, with no topology imposed
-
completely distributed, making use of a (logical) ring
COMP 413
Mutual Exclusion: Lamport
-
Node Ni sends request (totally
ordered) timestamped
REQi to all others. The request is put into
a local queue Qi
-
Any incoming message at Nj is
queued in Qj according to its timestamp. When a request from Ni
is received, Nj sends timestamped acknowledgment ACKj
-
Ni gets access to the resource
when (1) REQi is at the head of Qi, and (2) all acknowledgment
have been received.
-
Ni releases resource by sending
timestamped RELi to others, and removes its own request from local
queue
-
When release RELi arrives at
node Nj, Nj removes request REQi from its local queue
Qj
COMP 413
Mutual Exclusion: Ricart &
Agrawala
Principle: The same as Lamport,
except that acknowledgments aren't sent. Instead, replies (i.e, grants)
are sent only when
-
the receiving node has no interest
in the shared resource, or
-
the receiving node is waiting for the
shared resource, but has lower priority (as indicated by timestamps)
In all other cases, reply is deferred
until
the resource is released
-
if the receiving node holds the resource
-
if the receiving node is waiting for
the resource and has higher priority (as indicated by timestamps)
COMP 413
Election Algorithms
Principle: An algorithm requires that some process
acts as a coordinator. The question is how to select this special process
dynamically.
Note: In many systems the coordinator is chosen
by hand (e.g. file servers). This leads to centralized solutions => single
point of failure.
Question: If a coordinator is chosen dynamically,
to what extent can we speak about a centralized or distributed solution?
Question: Is a fully distributed solution, i.e
. one without a coordinator, always more robust than any centralized/coordinated
solution?
COMP 413
Election by Bullying
Principle: Each process has an associated priority
(weight). The process with the highest priority should always be elected
as the coordinator.
Issue: How do we find the heaviest process?
-
Any process can just start an election, by sending an election
message to all other processes (assuming you don’t know the weights of
the others).
-
If a process Pheavy receives an election message from
a lighter process Plight, it takes over the election and sends a
message to Plight. Plight is out of the race.
-
If a process doesn’t get a take-over message back, it wins,
and sends a victory message to all other processes
COMP 413
Election by Bullying
Question: We’re assuming something very important here
– what?
COMP 413
Election in a Ring
Principle: Process priority is obtained by organizing
processes into a (logical) ring. Process with the highest priority should
be elected as coordinator.
-
Any process can start an election by sending an election
message containing its id to its successor. If a successor is down, the
message is passed on to the next successor.
-
If a message is passed on, the sender adds its id to the
list. When it gets back to the initiator, every node had a chance to make
its presence known.
-
The initiator sends a coordinator message around the ring
containing a list of all living process ids. The one with the highest priority
is elected as coordinator.
Question: Does it matter if two processes initiate
an election?
Question: What happens if a process crashes during
the election?
COMP 413
Distributed Transactions
-
The transaction model
-
Classification of transactions
-
Concurrency control
COMP 413
Transactions
Essential: All READ and WRITE operations are executed,
i.e . their effects are made permanent, at the execution of END_TRANSACTION.
Observation: Transactions form an atomic
operation.
COMP 413
ACID Properties
Model: A transaction is a collection of operations on
the state of an object (database, object composition, etc.) that satisfies
the following properties:
Atomicity: All operations either succeed, or all
of them fail. When the transaction fails, the state of the object will
remain unaffected by the transaction.
Consistency: A transaction establishes a valid
state transition. This does not exclude the possibility of invalid, intermediate
states during the transaction’ s execution.
Isolation: Concurrent transactions do not interfere
with each other. It appears to each transaction T that other transactions
occur either before T, or after T, but never both.
Durability: After the execution of a transaction,
its effects are made permanent: changes to the state survive failures.
COMP 413
Transaction Classification
Flat transactions: The most familiar one: a sequence
of operations that satisfies the ACID properties.
Flat transactions with savepoints: A refinement
that allows a rollback to a user-defined intermediate state of the transaction,
rather than a complete abort.
Chained transactions: A sequence of transactions
by which resources that are no longer needed are returned, but without
giving up others.
COMP 413
Transaction Classification
Nested transactions: A hierarchy of transactions
that allows (1) concurrent processing of subtransactions, and (2) recovery
per subtransaction.
Distributed transactions: A (flat) transaction
that is executed in a distributed environment => often implemented as a
two-level nested transaction with one subtransaction per node.
Multi-level transactions: Like a nested transaction,
but with full commit facilities per subtransactions . Aborts are compensated
by reverse subtransactions that undo state changes.
COMP 413
Flat Transactions: Limitations
Problem: Flat transactions constitute a very simple
and clean model for dealing with a sequence of operations that satisfies
the ACID properties. However, after a series of successful operations all
changes should be undone in the case of failure. Sometimes unnecessary:
Trip planning. Plan a intercontinental
trip where all flights have been reserved, but filling in the last part
requires some “experimentation.” The first reservations are known to be
in order, but cannot yet be committed.
Bulk updates. When updating bank accounts for monthly
interests we have to lock the entire database (every account should be
updated exactly once: it is a transaction over the entire database.)
Better: each update is immediately committed. However,
in the case of failure, we’ll have to be able to continue where we left
off.
COMP 413
Transactions: Rollback
Two approaches:
-
private workspace (shadow blocks, copy-on-write)
-
writeahead (intentions) log
COMP 413
Transactions: Concurrency
Control
Problem: Increase efficiency by allowing several
transactions to execute at the same time.
Constraint: Effect should be the same as if the
transactions were executed in some serial order.
Solution: Apply two-phase locking techniques
Question: Does it actually make sense to allow
concurrent transactions on a single server?
COMP 413
Transaction Processing Model
Consider a collection E of transactions T1,.....,Tn.
Goal is to conduct a serializable execution of E:
-
Transactions in E are possibly concurrently executed
according to some schedule S.
-
Schedule S is equivalent to some totally ordered execution
of T1,.....,Tn.
Note: Because we’re not concerned with the computations
of each transaction, a transaction can be modeled as a log of read
and write operations.
Two operations OPER(Ti,x) and OPER(Tj,x)
on the same data item x, and from a set of logs may conflict at
a data manager:
read-write conflict (rw): One is a read operation
while the other is a write operation on x.
write-write conflict (ww): Both are write operations
on x.
COMP 413
Basic Scheduling Theorem
Note: The important thing is that we process conflicting
reads and writes in certain relative orders. This is what concurrency control
is all about.
Note: It turns out that read-write and write-write
conflicts can be synchronized independently, as long as we stick to a total
ordering of transactions that is consistent with both types of conflicts.
COMP 413
Synchronization Techniques
Two-phase locking: Before reading or writing a
data item, a lock must be obtained. After a lock is given up, the transaction
is not allowed to acquire any
more locks.
Timestamp ordering: Operations in a transaction
are timestamped, and data managers are forced to handle operations in timestamp
order.
Optimistic control: Don’t prevent things from going
wrong, but correct the situation if conflicts actually did happen. Basic
assumption: you can pull it off
in most cases.
COMP 413
Two-phase Locking
-
Clients do only READ and WRITE operations within
transactions.
-
Locks are granted and released only by scheduler.
-
Locking policy is to avoid conflicts between operations.
COMP 413
Two-phase Locking
-
Rule 1: When client submits OPER(Ti,x), scheduler
tests whether it conflicts with an operation OPER(Tj,x) from some
other client. If no conflict then grant LOCK(Ti,x), otherwise delay
execution of OPER(Ti,x).
Conflicting operations are executed in the same order
as that locks are granted.
-
Rule 2: If LOCK(Ti,x) has been granted, do
not release the lock until OPER(Ti,x) has been executed by data
manager.
Guarantees LOCK -> OPER -> RELEASE order.
-
Rule 3: If RELEASE(Ti,x) has taken place, no
more locks for Ti may be granted.
Guarantees that all pairs of conflicting operations
of two transactions are done in the same order.
COMP 413
Two-phase Locking: Problems
Problem 1: System can deadlock. How? Practical
solutions:
-
put a timeout on locks and abort transaction on expiration.
-
insist that locks are aquired in some canonical order
Problem 2: When should the scheduler actually
release a lock:
-
(1) when operation has been executed
-
(2) when it knows that no more locks will be requested
No good way of testing condition (2) unless transaction has
been committed or aborted.
Moreover: Assume the following execution sequence
takes place: RELEASE(Ti,x) -> LOCK(Tj,x) -> ABORT(Ti,x).
Consequence: scheduler will have to abort Tj
as well (cascaded aborts).
Solution: Release all locks only at commit/abort
time (strict two-phase locking).
COMP 413
Timestamp Ordering
Basic idea:
-
Transaction manager assigns a unique timestamp TS(Ti)
to each transaction Ti.
-
Each operation OPER(Ti,x) submitted by the transaction
manager to the scheduler is timestamped TS(OPER(Ti,x)) <- TS(Ti).
Scheduler adheres to following rules:
-
If OPER(Ti,x) and OPER(Tj,x) conflict then
data manager processes OPER(Ti,x) before OPER(Tj,x) iff TS(OPER(Ti,x))
< TS(OPER(Tj,x)).
-
If TS(OPER(Ti,x)) < TS(OPER(Tj,x)) and Tj,x
has
already committed at the time OPER(Ti,x) is submitted to the scheduler,
then Ti must abort.
Note: rather aggressive for if a single OPER(Ti,x)
is rejected, Ti will have to be aborted.
COMP 413
Timestamp Ordering
-
Suppose: TS(OPER(Ti,x)) < TS(OPER(Tj,x)), but
that OPER(Tj,x) has already been processed by the data manager.
-
Then: the scheduler should reject OPER(Ti,x),
as it came in too late.
-
Suppose: TS(OPER(Ti,x)) < TS(OPER(Tj,x)), and
that OPER(Ti,x) has already been processed by the data manager.
-
Then: the scheduler would submit OPER(Tj,x) to
data manager.
-
Refinement: hold back OPER(Tj,x) until Ti
commits or aborts.
Question: Why would we do this?
COMP 413
Multiversion Timestamp Ordering
Basic idea: Introduce finer granularity by maintaining
versions of data items, rather than just recording times when it was read
or written:
-
Each data item has a set of read timestamps, and a set of
versions:
(write timestamps, data value)-pairs.
-
Reads are never rejected: a READ(TS(Ti),x) returns
the version with largest timestamp smaller than TS(Ti,x).
-
An operation WRITE(TS(Ti),x) is accepted if there
has been no read operation between TS(Ti) and the smallest write
timestamp larger than TS(Ti). In that case, a new version
is created.
COMP 413
Conservative Timestamp Ordering
Basic idea: Operations that may possibly have to
be rejected, are delayed until it is certain that they can be accepted:
-
Just buffer READ(Ti,x) until there is a WRITE(Ti,x)
from every local transaction manager, such that the READ’s timestamp
is smaller than the minimal write timestamp.
-
Just buffer WRITE(Ti,x) until there is a READ(Ti,x)
from
every local transaction manager, such that the WRITE’s timestamp
is smaller than the minimal read timestamp.
Note: We’re assuming ordered message delivery, and
“idle” transaction managers occasionally tell the others that they’ re
not sending any requests.
Observation: This policy is
really conservative.
COMP 413
Optimistic Concurrency Control
Observation: (1) Maintaining locks is costly; (2)
in practice not many conflicts.
Alternative: Go ahead immediately with all operations,
use tentative writes everywhere (shadow copies), and solve conflicts later
on.
Phases: allow operations tentatively -> validate
effects -> make updates permanent.
Validation: Check two basic rules for each pair
of active transactions Ti and Tj:
-
Rule 1: Ti must not read or write data that
has been written by Tj.
-
Rule 2: Tj must not read or write data that
has been written by Ti.
If one of the rules doesn’t hold: abort transaction.
COMP 413
Distributed Transactions
How to achieve atomic commit among multiple transaction
managers in a distributed system?
Two-phase commit protocol:
-
one manager acts as coordinator
|
Coordinator
|
Subordinate(s)
|
| Write "prepare" to log |
|
| Send "prepare" message |
|
|
Write "ready" to log |
|
Send "ready" message |
| Collect all replies (assuming everyone is ready) |
|
| Write log record (transaction committed) |
|
| Send "commit" message |
|
|
Write "commit" to log |
|
Commit |
|
Send "finished" message |
COMP 413