Advanced Distributed

Contents

Slide 2

Target Settings

Process ‘group’-based systems
Clouds/Datacenters
Replicated servers
Distributed databases
Crash-stop/Fail-stop process failures


Slide 3

Group Membership Service

[Figure: each Application Process pi runs a Membership Protocol that maintains a (Group) Membership List over Unreliable Communication; the application queries this list (e.g., for gossip, overlays, DHTs), and the protocol tracks joins, leaves, and failures of members.]

Slide 4

Two sub-protocols

[Figure: Application Process pi with its Group Membership List, running over Unreliable Communication.]

Almost-Complete list (focus of this talk):
Gossip-style, SWIM, Virtual synchrony, …
Or Partial-random list (other papers):
SCAMP, T-MAN, Cyclon, …

Slide 5

Large Group: Scalability Is a Goal

[Figure: a Process Group of 1000's of member processes ("Members"); one of them is us (pi).]

Slide 6

Group Membership Protocol

Crash-stop Failures only

[Figure: the group membership protocol running across the group, with one member pj highlighted.]

Slide 7

I. pj crashes

Nothing we can do about it!
A frequent occurrence
Common case rather than exception
Frequency goes up at least linearly with the size of the datacenter

Slide 8

II. Distributed Failure Detectors: Desirable Properties

Completeness = each failure is detected
Accuracy = there is no mistaken detection
Speed
Time to first detection of a failure
Scale
Equal Load on each member
Network Message Load

Slide 9

Distributed Failure Detectors: Properties

Completeness
Accuracy
Speed
Time to first detection of a failure
Scale
Equal Load on each member
Network Message Load

Slide 10

What Real Failure Detectors Prefer

Completeness (Guaranteed)
Accuracy (Partial/Probabilistic guarantee)
Speed
Time to first detection of a failure
Scale
Equal Load on each member
Network Message Load

Slide 11

Failure Detector Properties

Completeness (Guaranteed)
Accuracy (Partial/Probabilistic guarantee)
Speed
Time to first detection of a failure, i.e., time until some process detects the failure
Scale
Equal Load on each member
Network Message Load

Slide 12

Failure Detector Properties

Completeness (Guaranteed)
Accuracy (Partial/Probabilistic guarantee)
Speed
Time to first detection of a failure, i.e., time until some process detects the failure
Scale
Equal Load on each member, no bottlenecks/single failure point
Network Message Load

Slide 13

Failure Detector Properties

Completeness
Accuracy
Both in spite of arbitrary simultaneous process failures
Speed
Time to first detection of a failure
Scale
Equal Load on each member
Network Message Load

Slide 14

Centralized Heartbeating


[Figure: every process pi periodically sends a "pi, Heartbeat Seq. l++" message to a central process pj.]

Heartbeats sent periodically
If heartbeat not received from pi within timeout, mark pi as failed
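A minimal sketch of the central monitor's bookkeeping (names and the timeout value are illustrative assumptions, not from the slides):

```python
import time

HEARTBEAT_TIMEOUT = 5.0  # seconds; illustrative value

last_heartbeat = {}  # member id -> local time of the most recent heartbeat

def on_heartbeat(sender_id, seq):
    # pj records when it last heard from each pi (seq is not needed for the timeout check).
    last_heartbeat[sender_id] = time.time()

def failed_members():
    # Any member whose last heartbeat is older than the timeout is marked failed.
    now = time.time()
    return [p for p, t in last_heartbeat.items() if now - t > HEARTBEAT_TIMEOUT]
```

The single monitor pj is, of course, the bottleneck and single point of failure that the decentralized schemes on the following slides try to avoid.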

Slide 15

Ring Heartbeating

[Figure: processes arranged in a ring; each process pi sends its "pi, Heartbeat Seq. l++" message to its ring neighbor(s), e.g., pj.]

Slide 16

All-to-All Heartbeating

[Figure: every process pi sends its "pi, Heartbeat Seq. l++" message to all other processes, including pj.]

Slide 17

Gossip-style Heartbeating

[Figure: each process pi periodically sends an array of "Heartbeat Seq. l" entries for a member subset to a few gossip targets.]

Slide 18

Gossip-Style Failure Detection

Protocol:
Nodes periodically gossip their membership list
On receipt, the local membership list is updated

Current time: 70 at node 2 (asynchronous clocks)

[Figure: four nodes (1, 2, 3, 4) exchanging membership lists; each list entry has an Address, a Heartbeat Counter, and a Time (local) column.]

Fig and animation by: Dongyun Jin and Thuy Ngyuen
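A minimal sketch of one node's state and merge rule in this gossip-style protocol (layout and names are my assumptions; the slides only specify the Address / Heartbeat Counter / local Time columns):

```python
import random
import time

# membership: address -> [heartbeat counter, local time the counter last increased]
membership = {"self": [0, time.time()]}

def gossip_once(send, peers):
    # Called every gossip period: bump our own heartbeat counter,
    # then ship the whole list to a randomly chosen peer.
    membership["self"][0] += 1
    membership["self"][1] = time.time()
    send(random.choice(peers), dict(membership))

def on_receive(remote_list):
    # Merge rule: keep the higher heartbeat counter, and record the *local*
    # time at which we observed it increase (clocks are asynchronous).
    for addr, (counter, _) in remote_list.items():
        if addr not in membership or counter > membership[addr][0]:
            membership[addr] = [counter, time.time()]
```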

Slide 19

Gossip-Style Failure Detection

If the heartbeat has not increased for more than Tfail seconds, the member is considered failed
And after Tcleanup seconds, it will delete the member from the list
Why two different timeouts?

Slide 20

Gossip-Style Failure Detection

What if an entry pointing to a failed node is deleted right after Tfail seconds?
The entry could then be re-inserted by stale gossip from a node that has not yet timed it out
Fix: remember the entry for another Tfail seconds before deleting it

Current time: 75 at node 2

[Figure: the same four nodes (1, 2, 3, 4) as before.]
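Continuing the sketch above, the two timeouts could be checked like this (the Tfail/Tcleanup values are illustrative; only the two-phase structure comes from the slides):

```python
import time

T_FAIL = 10.0     # seconds without a heartbeat increase before marking a member failed
T_CLEANUP = 10.0  # extra seconds to *remember* the failed entry before deleting it

failed_since = {}  # address -> local time at which we marked it failed

def sweep(membership):
    now = time.time()
    for addr, (counter, last_increase) in list(membership.items()):
        if addr in failed_since:
            # Keep the entry around (and refuse to resurrect it from gossip)
            # for Tcleanup, so stale copies elsewhere die out first.
            if now - failed_since[addr] > T_CLEANUP:
                del membership[addr]
                del failed_since[addr]
        elif now - last_increase > T_FAIL:
            failed_since[addr] = now  # considered failed; stop gossiping this entry
```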

Slide 21

Multi-level Gossiping

Network topology is hierarchical
Random gossip target selection => core routers face O(N) load (Why?)
Fix: select the gossip target in subnet i, which contains ni nodes, with probability 1/ni
Router load = O(1)
Dissemination time = O(log(N))
Why?
What about latency for multi-level topologies?
[Gupta et al, TPDS 06]

[Figure: two subnets of N/2 nodes each, joined by a Router.]
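A hedged sketch of one way to read the selection rule (my interpretation; it is the reading that makes the router load O(1), since each subnet then sends about one cross-subnet gossip per round in expectation):

```python
import random

def pick_gossip_target(my_subnet_members, other_subnet_members):
    # Node in subnet i with ni members: cross the core router with
    # probability 1/ni, otherwise gossip within the local subnet.
    ni = len(my_subnet_members)
    if random.random() < 1.0 / ni:
        return random.choice(other_subnet_members)  # goes through the router
    return random.choice(my_subnet_members)         # stays inside the subnet
```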

Slide 22

Analysis/Discussion

What happens if gossip period Tgossip is decreased?
A single heartbeat takes O(log(N)) time to propagate. So N heartbeats take:
O(log(N)) time to propagate, if the bandwidth allowed per node is O(N)
O(N.log(N)) time to propagate, if the bandwidth allowed per node is only O(1)
What about O(k) bandwidth?
What happens to Pmistake (false positive rate) as Tfail, Tcleanup are increased?
Tradeoff: False positive rate vs. detection time
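A back-of-the-envelope sketch of the two propagation claims above (my reasoning, using the standard epidemic-doubling argument):

```latex
% Each gossip period roughly doubles the number of nodes that have seen a given heartbeat:
\mathrm{informed}(r) \approx 2^{r}
  \;\Rightarrow\; r \approx \log_2 N \ \text{periods for one heartbeat.}
% For all N concurrent heartbeats:
T_{\text{all}} =
\begin{cases}
O(\log N), & \text{per-node bandwidth } O(N) \ \text{(all heartbeats ride together)}\\
O(N \log N), & \text{per-node bandwidth } O(1) \ \text{(heartbeats effectively serialized)}
\end{cases}
```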

Slide 23

Simulations

As # members increases, the detection time increases

As the requirement is loosened, the detection time decreases

As # failed members increases, the detection time increases significantly

The algorithm is resilient to message loss

Slide 24

Failure Detector Properties …

Completeness
Accuracy
Speed
Time to first detection of a failure
Scale
Equal Load on each member
Network Message Load

Slide 25

…Are application-defined Requirements

Completeness: guarantee always
Accuracy: probability PM(T)
Speed
Time to first detection of a failure: T time units
Scale
Equal Load on each member
Network Message Load

Slide 26

…Are application-defined Requirements

Completeness: guarantee always
Accuracy: probability PM(T)
Speed
Time to first detection of a failure: T time units
Scale
Equal Load on each member
Network Message Load: N*L (compare this across protocols)

Slide 27

All-to-All Heartbeating

[Figure: every process pi sends its "pi, Heartbeat Seq. l++" message to all other processes.]

Every T units
L = N/T

Slide 28

Gossip-style Heartbeating

[Figure: each process pi gossips an array of "Heartbeat Seq. l" entries for a member subset.]

Every tg units (= gossip period), send an O(N)-size gossip message
T = logN * tg
L = N/tg = N*logN/T
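A quick worked comparison with the all-to-all load two slides back (my numbers, plugged into the slide's own formulas):

```latex
% Example: N = 1024 members, same detection time T.
\text{All-to-all: } L = \frac{N}{T} = \frac{1024}{T}
\qquad
\text{Gossip: } L = \frac{N \log_2 N}{T} = \frac{1024 \cdot 10}{T}
```

So for the same detection time T, gossip-style heartbeating pays roughly a log N factor more load per member.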

Slide 29

Worst case load L*
as a function of T, PM(T), N
Independent message loss probability pml
(proof in PODC 01 paper)

What's the Best/Optimal we can do?
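A rough intuition (mine, not the PODC'01 proof) for why the optimal load can be made independent of N: to mistakenly suspect an alive process within time T, every probe sent to it in that window must be lost.

```latex
% With independent per-message loss probability p_ml, k probes give mistake probability p_ml^k:
p_{ml}^{\,k} \le P_M(T)
  \;\Rightarrow\; k \ge \frac{\log P_M(T)}{\log p_{ml}},
\qquad
L^{*} \approx \frac{1}{T}\cdot\frac{\log P_M(T)}{\log p_{ml}},
% which depends on T, P_M(T), and p_ml, but not on the group size N.
```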

Slide 30

Heartbeating

Optimal L is independent of N (!)
All-to-all and gossip-based: sub-optimal
L = O(N/T)
They try to achieve simultaneous detection at all processes
They fail to distinguish the Failure Detection and Dissemination components

Key:
Separate the two components
Use a non-heartbeat-based Failure Detection component

Slide 31

SWIM Failure Detector Protocol

[Figure: one SWIM protocol period at pi: a direct ping to a randomly chosen pj; if no ack arrives, ping-req messages go via K other members, whose acks are relayed back to pi.]
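A minimal sketch of that protocol period (my naming and timeout handling; the real protocol is in the SWIM paper):

```python
import random

K = 3  # number of members asked to ping indirectly; illustrative value

def protocol_period(members, ping, ping_req):
    # 1. Pick a random member pj and ping it directly.
    pj = random.choice(members)
    if ping(pj):                        # ack received within the ping timeout
        return ("alive", pj)
    # 2. No ack: ask K other random members to ping pj on our behalf.
    others = [m for m in members if m != pj]
    helpers = random.sample(others, min(K, len(others)))
    if any(ping_req(h, pj) for h in helpers):
        return ("alive", pj)            # an indirect ack came back in time
    # 3. Still nothing by the end of the period: report pj to the group.
    return ("suspect", pj)
```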

Slide 32

SWIM versus Heartbeating

For fixed false positive rate and message loss rate:

[Chart: axes are Process Load and First Detection Time, each either Constant or O(N). SWIM sits at constant process load and constant first-detection time; heartbeating protocols occupy the other cells, paying either O(N) load for constant detection time or O(N) detection time for constant load.]

Slide 33

SWIM Failure Detector


Slide 34

Accuracy, Load
PM(T) is exponential in -K. It also depends on pml (and pf)
See the paper for results at up to 15% loss rates

Slide 35

Detection Time

Prob. of being pinged in T' = 1 - (1 - 1/N)^(N-1) ≈ 1 - e^(-1)
E[T] = T' * e/(e-1)

Completeness: Any alive member detects failure
Eventually
By using a trick: within worst case O(N) protocol periods

Slide 36

III. Dissemination

HOW?

Slide 37

Dissemination Options

Multicast (Hardware / IP)
unreliable
multiple simultaneous multicasts
Point-to-point (TCP / UDP)
expensive
Zero extra messages: Piggyback on Failure Detector messages
Infection-style Dissemination

Slide 38

Infection-style Dissemination

[Figure: membership updates are piggybacked on the failure-detector messages that pi already exchanges with pj and with the K random processes.]

Slide 39

Infection-style Dissemination

Epidemic-style dissemination
After O(log(N)) protocol periods, only a vanishingly small number of processes would not have heard about an update
Maintain a buffer of recently joined/evicted processes
Piggyback from this buffer
Prefer recent updates
Buffer elements are garbage collected after a while
After O(log(N)) protocol periods; this defines weak consistency
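A minimal sketch of the piggyback buffer (field names and constants are my own; the slides only specify the policy: prefer recent updates and garbage-collect after enough protocol periods):

```python
MAX_PIGGYBACK = 6   # how many updates fit into one ping/ack message; illustrative
GC_AFTER = 20       # roughly O(log N) protocol periods for the target group size

buffer = []  # list of [times_piggybacked, update]

def add_update(update):
    buffer.append([0, update])

def piggyback_payload():
    # Prefer recent updates: the least-gossiped entries go out first.
    buffer.sort(key=lambda entry: entry[0])
    chosen = buffer[:MAX_PIGGYBACK]
    for entry in chosen:
        entry[0] += 1
    # Garbage-collect entries that have already been disseminated long enough.
    buffer[:] = [entry for entry in buffer if entry[0] < GC_AFTER]
    return [update for _, update in chosen]
```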

Slide 40

Suspicion Mechanism

False detections, due to
Perturbed processes
Packet losses, e.g., from congestion
Indirect pinging may not solve the problem
e.g., correlated message losses near pinged host
Key: suspect a process before declaring it as failed in the group

Slide 41

Suspicion Mechanism

pi :: state machine for the pj view element

States: Alive, Suspected, Failed

Alive -> Suspected: FD:: pi ping failed, then Dissmn::(Suspect pj); or on receiving Dissmn (Suspect pj)
Suspected -> Alive: FD:: pi ping success, then Dissmn::(Alive pj); or on receiving Dissmn (Alive pj)
Suspected -> Failed: suspicion time out; or on receiving Dissmn (Failed pj)
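A minimal sketch of those transitions (event names are my own; incarnation numbers, which the next slide adds, are omitted here):

```python
from enum import Enum

class State(Enum):
    ALIVE = 1
    SUSPECTED = 2
    FAILED = 3

def transition(state, event):
    # pi's view element for pj, following the transitions listed above.
    if state == State.FAILED or event == "recv_dissmn_failed":
        return State.FAILED                                   # Failed is terminal
    if event in ("local_ping_failed", "recv_dissmn_suspect"):
        return State.SUSPECTED
    if event in ("local_ping_success", "recv_dissmn_alive"):
        return State.ALIVE
    if state == State.SUSPECTED and event == "suspicion_timeout":
        return State.FAILED
    return state
```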

Slide 42

Suspicion Mechanism

Distinguish multiple suspicions of a process
Per-process incarnation number
Inc # for pi can be incremented only by pi
e.g., when it receives a (Suspect, pi) message
Somewhat similar to DSDV
Higher inc# notifications over-ride lower inc#'s
Within an inc#: (Suspect, inc #) > (Alive, inc #)
Nothing overrides a (Failed, inc #)
See paper
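A minimal sketch of the override rules as a predicate (my structuring; the message kinds and tuple layout are assumptions):

```python
# A disseminated notification about pj is (kind, incarnation),
# with kind one of "alive", "suspect", "failed".
def overrides(new, current):
    new_kind, new_inc = new
    cur_kind, cur_inc = current
    if cur_kind == "failed":
        return False                   # nothing overrides a (Failed, inc #)
    if new_kind == "failed":
        return True
    if new_inc != cur_inc:
        return new_inc > cur_inc       # higher inc# over-rides lower inc#
    return new_kind == "suspect" and cur_kind == "alive"  # (Suspect, i) > (Alive, i)
```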

Slide 43

Time-bounded Completeness

Key: select each membership element once as a ping target in a traversal
Round-robin pinging
Random permutation of the list after each traversal
Each failure is detected in worst case 2N-1 (local) protocol periods
Preserves FD properties
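A minimal sketch of that target-selection rule (my structuring; the random insertion point for new members is an assumption beyond what this slide states):

```python
import random

class PingTargets:
    def __init__(self, members):
        self.order = list(members)
        random.shuffle(self.order)
        self.index = 0

    def next_target(self):
        # Round-robin traversal: every member is pinged exactly once per
        # traversal, then the list is randomly permuted again.
        if self.index >= len(self.order):
            random.shuffle(self.order)
            self.index = 0
        target = self.order[self.index]
        self.index += 1
        return target

    def add_member(self, member):
        # Insert new members at a random position in the current traversal.
        self.order.insert(random.randint(0, len(self.order)), member)
```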

Slide 44

Results from an Implementation

Current implementation
Win2K, uses Winsock 2
Uses only UDP messaging
900 semicolons of code (including testing)
Experimental platform
Galaxy cluster: diverse collection of commodity PCs
100 Mbps Ethernet
Default protocol settings
Protocol period = 2 s; K = 1; G.C. and Suspicion timeouts = 3*ceil[log(N+1)]
No partial membership lists observed in experiments

Slide 45

Per-process Send and Receive Loads
are independent of group size


Slide 46

Time to First Detection of a process failure

[Plot: measured time to first detection of a process failure, shown for T1 alone and for T1+T2+T3.]

Slide 47

Time to First Detection of a process failure: apparently uncorrelated to group size

[Plot: measured detection time, for T1 alone and for T1+T2+T3, across group sizes.]

Slide 48

Membership Update Dissemination Time
is low at high group sizes

[Plot: the dissemination-time component T2, shown alongside the total T1+T2+T3.]

Slide 49

Excess time taken by
Suspicion Mechanism

[Plot: the suspicion-mechanism component T3, shown alongside the total T1+T2+T3.]

Slide 50

Benefit of Suspicion Mechanism:
Per-process 10% synthetic packet loss


Slide 51

More discussion points

It turns out that with a partial membership list that is uniformly random, gossiping retains the same properties as with complete membership lists
Why? (Think of the equation)
Partial membership protocols
SCAMP, Cyclon, TMAN, …
Gossip-style failure detection underlies
Astrolabe
Amazon EC2/S3 (rumored!)
SWIM used in
CoralCDN/Oasis anycast service: http://oasis.coralcdn.org
Mike Freedman used the suspicion mechanism to blackmark frequently-failing nodes

Slide 52

Reminder – Due this Sunday April 3rd at 11.59 PM

Project Midterm Report due, 11.59 pm [12pt font, single-sided, 8 + 1 page Business Plan max]
Wiki Term Paper - Second Draft Due (Individual)
Reviews – you only have to submit reviews for 15 sessions (any 15 sessions) from 2/10 to 4/28. Keep track of your count! Take a breather!