Advanced Distributed

Contents

Slide 2

Target Settings

Process ‘group’-based systems
Clouds/Datacenters
Replicated servers
Distributed databases
Crash-stop/Fail-stop process failures


Slide 3

Group Membership Service

[Diagram: an Application Process pi issues application queries (e.g., gossip, overlays, DHTs, etc.) against a Group Membership List; the Membership Protocol maintains that list over unreliable communication, tracking joins, leaves, and failures of members.]

Slide 4

Two sub-protocols

[Diagram: Application Process pi maintains a Group Membership List over unreliable communication.]

Almost-Complete list (focus of this talk)
Gossip-style, SWIM, Virtual synchrony, …
Or Partial-random list (other papers)
SCAMP, T-MAN, Cyclon, …

Slide 5

Large Group: Scalability Is a Goal

[Diagram: a Process Group of 1000's of processes ("Members"); one of them is us (pi).]

Slide 6

Group Membership Protocol

Crash-stop Failures only

[Diagram: the group membership protocol running among the group members; pj is one member.]

Slide 7

I. pj crashes

Nothing we can do about it!
A frequent occurrence
Common case rather than exception
Frequency goes up at least linearly with size of datacenter

Slide 8

II. Distributed Failure Detectors: Desirable Properties

Completeness = each failure is detected
Accuracy = there is no mistaken detection
Speed
Time to first detection of a failure
Scale
Equal Load on each member
Network Message Load

Slide 9

Distributed Failure Detectors: Properties

Completeness
Accuracy
Speed
Time to first detection of a failure
Scale
Equal Load on each member
Network Message Load

Slide 10

What Real Failure Detectors Prefer

Completeness (guaranteed)
Accuracy (partial/probabilistic guarantee)
Speed
Time to first detection of a failure
Scale
Equal Load on each member
Network Message Load

Slide 11

Failure Detector Properties

Completeness (guaranteed)
Accuracy (partial/probabilistic guarantee)
Speed
Time to first detection of a failure (time until some process detects the failure)
Scale
Equal Load on each member
Network Message Load

Slide 12

Failure Detector Properties

Completeness (guaranteed)
Accuracy (partial/probabilistic guarantee)
Speed
Time to first detection of a failure (time until some process detects the failure)
Scale
Equal Load on each member (no bottlenecks/single failure point)
Network Message Load

Slide 13

Failure Detector Properties

Completeness (in spite of arbitrary simultaneous process failures)
Accuracy
Speed
Time to first detection of a failure
Scale
Equal Load on each member
Network Message Load

Slide 14

Centralized Heartbeating


[Diagram: each member pi sends heartbeat messages ("pi, Heartbeat Seq. l++") to a central member pj.]

Heartbeats sent periodically
If heartbeat not received from pi within timeout, mark pi as failed
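A minimal sketch of the central member's side of this scheme (class and method names are illustrative, not from the slides): each pi periodically sends a (pi, seq) heartbeat, and the monitor marks pi failed if nothing fresh arrives within the timeout.

```python
import time

class CentralHeartbeatMonitor:
    """Central member pj: marks pi failed if no heartbeat within `timeout` seconds."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.last_seen = {}   # member id -> (last seq, local receive time)

    def on_heartbeat(self, member_id, seq):
        # Accept only fresher sequence numbers (pi increments seq on every send).
        last = self.last_seen.get(member_id)
        if last is None or seq > last[0]:
            self.last_seen[member_id] = (seq, time.monotonic())

    def failed_members(self):
        now = time.monotonic()
        return [m for m, (_, t) in self.last_seen.items()
                if now - t > self.timeout]
```

The obvious cost, per the scale properties above, is that the central member is both a load bottleneck and a single failure point.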

Slide 15

Ring Heartbeating

[Diagram: members arranged in a ring; each member pi sends its heartbeats ("pi, Heartbeat Seq. l++") to its ring neighbor(s) pj.]

Slide 16

All-to-All Heartbeating

[Diagram: every member pi sends its heartbeats ("pi, Heartbeat Seq. l++") to every other member pj.]

Slide 17

Gossip-style Heartbeating

[Diagram: each member pi periodically gossips an array of heartbeat sequence numbers (Heartbeat Seq. l) for a member subset to a few gossip targets.]

Slide 18

Gossip-Style Failure Detection

Protocol:
Nodes periodically gossip their membership list
On receipt, the local membership list is updated

[Diagram: four nodes (1, 2, 3, 4) exchanging membership lists; each list is a table with columns Address, Heartbeat Counter, and Time (local). Current time: 70 at node 2 (asynchronous clocks).]

Fig and animation by: Dongyun Jin and Thuy Ngyuen
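A sketch of the receive-side merge this protocol implies (data layout is illustrative): each entry carries a heartbeat counter and a local last-update time; on receiving a gossiped list, the node keeps the higher heartbeat counter and stamps it with its own clock, since clocks are asynchronous and remote timestamps cannot be compared.

```python
import time

# local membership table: address -> {"heartbeat": int, "local_time": float}

def merge_gossip(local_list, received_list):
    """Update the local membership list from a gossiped one (gossip-style heartbeating)."""
    now = time.monotonic()
    for addr, entry in received_list.items():
        mine = local_list.get(addr)
        if mine is None or entry["heartbeat"] > mine["heartbeat"]:
            # A higher heartbeat counter is fresher information about addr;
            # record it with *our* local time, not the sender's.
            local_list[addr] = {"heartbeat": entry["heartbeat"], "local_time": now}
```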

Slide 19

Gossip-Style Failure Detection

If the heartbeat has not increased for more than Tfail seconds, the member is considered failed
And after Tcleanup seconds, it will delete the member from the list
Why two different timeouts?

Slide 20

Gossip-Style Failure Detection

What if an entry pointing to a failed node is deleted right after Tfail seconds?
Fix: remember for another Tfail

[Diagram: nodes 1, 2, 3, 4; current time: 75 at node 2.]
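A sketch of the two-timeout rule from these two slides (parameter names are assumptions). Exactly when Tcleanup starts counting is the point of the "why two timeouts" question; this sketch counts it from the moment the entry is marked failed, so a stale copy gossiped back by another node cannot resurrect the member.

```python
import time

def sweep_membership(local_list, failed, t_fail, t_cleanup):
    """Mark members failed after Tfail of silence; forget them only after a further Tcleanup.

    local_list: addr -> {"heartbeat": int, "local_time": float}
    failed    : set of addresses currently considered failed (no longer gossiped as alive)
    """
    now = time.monotonic()
    for addr in list(local_list):
        stale_for = now - local_list[addr]["local_time"]
        if stale_for > t_fail:
            failed.add(addr)          # considered failed, but the entry is remembered
        if stale_for > t_fail + t_cleanup:
            del local_list[addr]      # old enough that a re-gossiped stale copy is harmless
            failed.discard(addr)
```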

Slide 21

Multi-level Gossiping

Network topology is hierarchical
Random gossip target selection => core routers face O(N) load (Why?)
Fix: Select gossip target in subnet i, which contains ni nodes, with probability 1/ni
Router load = O(1)
Dissemination time = O(log(N))
Why?
What about latency for multi-level topologies?
[Gupta et al, TPDS 06]

[Diagram: a core router connecting two subnets, each with N/2 nodes.]
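A sketch of the stated fix, under one reading of the rule above (names are illustrative): a node in subnet i, which contains n_i nodes, leaves its own subnet with probability 1/n_i and otherwise gossips locally, so the expected per-round traffic through the subnet's router is n_i * (1/n_i) = O(1).

```python
import random

def pick_gossip_target(self_addr, my_subnet, subnets):
    """Choose a gossip target so expected per-round core-router load is O(1) (sketch).

    subnets: subnet_id -> list of member addresses; my_subnet contains self_addr.
    """
    n_i = len(subnets[my_subnet])
    others = [s for s in subnets if s != my_subnet]
    if others and random.random() < 1.0 / n_i:
        # Cross-router gossip: expected n_i * (1/n_i) = 1 such message per round.
        return random.choice(subnets[random.choice(others)])
    # Local gossip: stays behind this subnet's router.
    local = [m for m in subnets[my_subnet] if m != self_addr]
    return random.choice(local)
```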

Slide 22

Analysis/Discussion

What happens if gossip period Tgossip is decreased?
A single heartbeat takes O(log(N)) time to propagate. So N heartbeats take:
O(log(N)) time to propagate, if bandwidth allowed per node is O(N)
O(N.log(N)) time to propagate, if bandwidth allowed per node is only O(1)
What about O(k) bandwidth?
What happens to Pmistake (false positive rate) as Tfail, Tcleanup is increased?
Tradeoff: False positive rate vs. detection time

Slide 23

Simulations

As # members increases, the detection time increases
As requirement is loosened, the detection time decreases
As # failed members increases, the detection time increases significantly
The algorithm is resilient to message loss

Slide 24

Failure Detector Properties …

Completeness
Accuracy
Speed
Time to first detection of a failure
Scale
Equal Load on each member
Network Message Load

Slide 25

…Are application-defined Requirements

Completeness (guarantee always)
Accuracy (probability PM(T))
Speed
Time to first detection of a failure (T time units)
Scale
Equal Load on each member
Network Message Load

Slide 26

…Are application-defined Requirements

Completeness (guarantee always)
Accuracy (probability PM(T))
Speed
Time to first detection of a failure (T time units)
Scale
Equal Load on each member
Network Message Load (N*L: compare this across protocols)

Slide 27

All-to-All Heartbeating

[Diagram: pi sends "pi, Heartbeat Seq. l++" to all other members.]

Every T units
L = N/T

Slide 28

Gossip-style Heartbeating

[Diagram: pi gossips an array of heartbeat sequence numbers (Heartbeat Seq. l) for a member subset.]

Every tg units (= gossip period), send O(N) gossip message
T = logN * tg
L = N/tg = N*logN/T

Slide 29

Worst case load L* as a function of T, PM(T), N
Independent message loss probability pml
(proof in PODC 01 paper)

What's the Best/Optimal we can do?

Slide 30

Heartbeating

Optimal L is independent of N (!)
All-to-all and gossip-based: sub-optimal
L=O(N/T)
try to achieve simultaneous detection at all processes
fail to distinguish Failure Detection and Dissemination components

Key:
Separate the two components
Use a non heartbeat-based Failure Detection Component

Slide 31

SWIM Failure Detector Protocol

[Diagram: the SWIM failure detector probing a member pj during one protocol period.]
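The probing diagram itself did not survive the export; as a hedged sketch of the scheme the SWIM paper describes (function names and transport are illustrative): in each protocol period pi pings one randomly chosen target pj, and if no ack arrives it asks K other members to ping pj on its behalf (ping-req) before handing pj to the suspicion mechanism discussed on later slides.

```python
import random

def swim_protocol_period(pi, members, k, send_ping, send_ping_req, suspect):
    """One protocol period of SWIM-style probing at process pi (illustrative sketch).

    send_ping(target) -> bool            : direct probe; True if an ack arrived in time
    send_ping_req(helper, target) -> bool: ask `helper` to probe `target` on pi's behalf
    suspect(target)                      : hand the target to the suspicion mechanism
    """
    candidates = [m for m in members if m != pi]
    if not candidates:
        return
    pj = random.choice(candidates)                     # this period's probe target

    if send_ping(pj):                                  # direct ack: pj is alive
        return

    helpers = random.sample([m for m in candidates if m != pj],
                            min(k, len(candidates) - 1))
    if any(send_ping_req(h, pj) for h in helpers):     # indirect ack via some helper
        return

    suspect(pj)                                        # no ack at all this period
```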

Slide 32

SWIM versus Heartbeating

For Fixed: False Positive Rate, Message Loss Rate

                          First Detection Time
                          Constant        O(N)
Process Load  Constant    SWIM            Heartbeating
              O(N)        Heartbeating

Slide 33

SWIM Failure Detector


Slide 34

Accuracy, Load

PM(T) is exponential in -K. Also depends on pml (and pf)
See paper for up to 15% loss rates

Slide 35

Detection Time

Prob. of being pinged in T' = (see the reconstruction below)
E[T] = (see the reconstruction below)

Completeness: Any alive member detects failure
Eventually
By using a trick: within worst case O(N) protocol periods
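The right-hand sides of the two expressions above are not in this export. As a reconstruction of the standard forms from the SWIM paper (not verbatim from the slide, where T' is the protocol period):

```latex
% Probability that a given member is chosen as a ping target by at least
% one other member within one protocol period T':
\Pr[\text{pinged in } T'] \;=\; 1 - \left(1 - \tfrac{1}{N}\right)^{N-1}
\;\xrightarrow[N \to \infty]{}\; 1 - e^{-1}

% Expected time to first detection of a failure:
\mathbb{E}[T] \;=\; T' \cdot \frac{e}{e - 1}
```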

Slide 36

III. Dissemination

HOW ?


Slide 37

Dissemination Options

Multicast (Hardware / IP)
unreliable
multiple simultaneous multicasts
Point-to-point (TCP / UDP)
expensive
Zero extra messages: Piggyback on Failure Detector messages
Infection-style Dissemination

Slide 38

Infection-style Dissemination

[Diagram: updates about pj are carried on failure-detector messages to K random processes and spread onward infection-style.]

Slide 39

Infection-style Dissemination

Epidemic style dissemination
After protocol periods, processes would not have heard about an update
Maintain a buffer of recently joined/evicted processes
Piggyback from this buffer (see the sketch below)
Prefer recent updates
Buffer elements are garbage collected after a while
After protocol periods; this defines weak consistency
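A sketch of the piggyback buffer these bullets describe (field names and the garbage-collection threshold are assumptions, since the slide leaves the exact number of protocol periods unspecified): recent updates are kept, the least-gossiped ones are piggybacked first, and entries are dropped after they have been piggybacked enough times.

```python
class PiggybackBuffer:
    """Buffer of recent membership updates, piggybacked on FD messages (sketch)."""

    def __init__(self, gc_after):
        self.gc_after = gc_after   # piggyback count after which an update is dropped
        self.updates = []          # list of {"member", "status", "sent_count"}

    def add(self, member, status):
        self.updates.append({"member": member, "status": status, "sent_count": 0})

    def take(self, max_items):
        # Prefer updates gossiped the fewest times (i.e., the most recent ones).
        self.updates.sort(key=lambda u: u["sent_count"])
        chosen = self.updates[:max_items]
        for u in chosen:
            u["sent_count"] += 1
        # Garbage-collect updates that have been piggybacked enough times.
        self.updates = [u for u in self.updates if u["sent_count"] < self.gc_after]
        return [(u["member"], u["status"]) for u in chosen]
```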

Slide 40

Suspicion Mechanism

False detections, due to
Perturbed processes
Packet losses, e.g., from congestion
Indirect pinging may not solve the problem
e.g., correlated message losses near pinged host
Key: suspect a process before declaring it as failed in the group

Slide 41

Suspicion Mechanism

pi :: State Machine for pj view element

[State diagram: states Alive, Suspected, Failed. Transition labels: Dissmn (Suspect pj); Dissmn (Alive pj); Dissmn (Failed pj); FD:: pi ping failed => Dissmn::(Suspect pj); Time out; FD:: pi ping success => Dissmn::(Alive pj).]

Slide 42

Suspicion Mechanism

Distinguish multiple suspicions of a process
Per-process incarnation number
Inc # for pi can be incremented only by pi
e.g., when it receives a (Suspect, pi) message
Somewhat similar to DSDV
Higher inc# notifications over-ride lower inc#'s
Within an inc#: (Suspect, inc #) > (Alive, inc #)
Nothing overrides a (Failed, inc #)
See paper
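A sketch of the override rules on this slide; the numeric ranking is just one way to encode the ordering that Failed beats everything, higher incarnation numbers beat lower ones, and Suspect beats Alive within the same incarnation.

```python
# Precedence within the same incarnation number: Suspect overrides Alive,
# and Failed overrides both (and is never overridden).
_PRECEDENCE = {"Alive": 0, "Suspect": 1, "Failed": 2}

def overrides(new, old):
    """Return True if notification `new` = (status, inc) should replace `old`."""
    new_status, new_inc = new
    old_status, old_inc = old
    if old_status == "Failed":
        return False                          # nothing overrides a (Failed, inc #)
    if new_status == "Failed":
        return True
    if new_inc != old_inc:
        return new_inc > old_inc              # higher inc# wins
    return _PRECEDENCE[new_status] > _PRECEDENCE[old_status]   # Suspect > Alive
```

For example, overrides(("Suspect", 3), ("Alive", 3)) is True while overrides(("Alive", 3), ("Suspect", 3)) is False, so only pi itself, by incrementing its incarnation number, can refute a suspicion about it.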

Slide 43

Time-bounded Completeness

Key: select each membership element once as a ping target in a traversal
Round-robin pinging
Random permutation of list after each traversal
Each failure is detected in worst case 2N-1 (local) protocol periods
Preserves FD properties
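A sketch of this target-selection rule (names are illustrative): ping targets are taken round-robin from the local list, and the list is randomly permuted only after a full traversal, so every member is selected exactly once per traversal, which is what yields the time-bounded completeness described above.

```python
import random

class RoundRobinPinger:
    """Pick ping targets round-robin, reshuffling only after each full traversal."""

    def __init__(self, members):
        self.members = list(members)
        random.shuffle(self.members)
        self.index = 0

    def next_target(self):
        if self.index >= len(self.members):
            # Finished a traversal: every member was selected exactly once.
            random.shuffle(self.members)
            self.index = 0
        target = self.members[self.index]
        self.index += 1
        return target
```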

Slide 44

Results from an Implementation

Current implementation
Win2K, uses Winsock 2
Uses only UDP messaging
900 semicolons of code (including testing)
Experimental platform
Galaxy cluster: diverse collection of commodity PCs
100 Mbps Ethernet
Default protocol settings
Protocol period=2 s; K=1; G.C. and Suspicion timeouts=3*ceil[log(N+1)]
No partial membership lists observed in experiments

Slide 45

Per-process Send and Receive Loads
are independent of group size


Slide 46

Time to First Detection of a process failure

[Plot: T1 plotted alongside the total T1+T2+T3.]

Slide 47

Time to First Detection of a process failure
apparently uncorrelated to group size

[Plot: T1 plotted alongside the total T1+T2+T3.]

Slide 48

Membership Update Dissemination Time
is low at high group sizes

[Plot: T2, shown within the total T1+T2+T3.]

Slide 49

Excess time taken by
Suspicion Mechanism

[Plot: T3, shown within the total T1+T2+T3.]

Slide 50

Benefit of Suspicion Mechanism:
Per-process 10% synthetic packet loss


Slide 51

More discussion points

It turns out that with a partial membership list that is uniformly random, gossiping retains same properties as with complete membership lists
Why? (Think of the equation)
Partial membership protocols
SCAMP, Cyclon, TMAN, …
Gossip-style failure detection underlies
Astrolabe
Amazon EC2/S3 (rumored!)
SWIM used in
CoralCDN/Oasis anycast service: http://oasis.coralcdn.org
Mike Freedman used suspicion mechanism to blackmark frequently-failing nodes

Slide 52

Reminder – Due this Sunday April 3rd at 11.59 PM

Project Midterm Report due, 11.59 pm [12pt font, single-sided, 8 + 1 page Business Plan max]
Wiki Term Paper - Second Draft Due (Individual)
Reviews – you only have to submit reviews for 15 sessions (any 15 sessions) from 2/10 to 4/28. Keep track of your count! Take a breather!