Term
What are some challenges in designing a distributed applicaiton? |
|
Definition
- Time management
- Process communication
- Security
- Scalability
- Failure handling
|
|
|
Term
Describe how time management can be a challenge in designing a Distributed applicaiton and explain how this can be addressed |
|
Definition
Some distributed algorithms depend on clock synchronization, however, full synchronization is not possible and hence presents one challenge of distributed system.
Can be addressed with algorithms like Christian's algorithm (Poll central time server) or the Berkeley's algorithm (An average of all clock times). Other solutions include logical time such as lamport clocks, and enhancing them by using Vector clocks. |
|
|
Term
What are the who time management algorithms? |
|
Definition
There are two synchronization methods that attempt to achieve near-full synchronization: Christian’s algorithm (Poll a central time server and factor in transmission time) and Berkeley’s algorithm (A central process polls all other processes and averages their clocks. Returns to each process the amount by which it must alter its clock). There is a possibility for error in both circumstances so neither can guarantee full synchronization.
|
|
|
Term
What is the logical clock solution for time management issues in DS? |
|
Definition
The ordering of events is another solution through the use of logical time and logical clocks. The happened before relation builds on these constructs, and aids the issue of ordering events and knowing which events preceded and followed a certain event. Lamport clocks are one example of logical clocks. They don’t use time in the sense of hours/minutes/seconds - but rather a counter which indicates how many tasks have been executed. This can be useful when it is vital that an event knows which events have occurred before it. The user of vector clocks, where each process stores a count of how many events have occurred in each process can provide a happened before relationship between two events that lamport’s logical clocks cannot. |
|
|
Term
Describe how Process communication can be a challenge in designing a Distributed applicaiton and explain how this can be addressed |
|
Definition
There are many instances in the life of a distributed system where processes must communicate to achieve some goal i.e Elect a new leader, agree on a value (Consesnsus). Process may also need to communicate to ensure mutual exclusion is upheld, and deal with ommission, time and arbitrary failures which may affect inter-process communication.
Solutions: Bully & ring election (For leaders).
Central server, Ring based (both token orientated) and Maekawa’s algorithm. (Mutual exclusion).
|
|
|
Term
Describe how Scalability can be a challenge in designing a Distributed applicaiton and explain how this can be addressed? |
|
Definition
Distributed systems must remain effective when the number of users and resources increase significantly. The quantity of resources needed should be O(number of users). The algorithm performance should also scale well with the number of users and resources to ensure there is no loss in quality of service when there are large number of users on the system.
Solutions: Distributed algorithms and data should be decentralised to avoid performance bottlenecks. |
|
|
Term
Describe how Failure Handling can be a challenge in designing a Distributed applicaiton and explain how this can be addressed? |
|
Definition
Failure handling is a very important part of distributed systems. A distributed system should be able to construct a reliable service over unreliable networks.
Omission failures are a type of failure where a process fails to perform the intended actions. If a process crashes, a system can use timeouts to detect the failure (only applicable in synchronous systems). If the process drops messages due to lack of buffer space or network transmission errors, the application can use message sequence numbers to detect this failure.
Timing failures are present in synchronous systems. Time limits can be set on process execution time and message delivery time.
Arbitrary (Byzantine) failures are wrong process/channel behaviour. It includes arbitrary communication failures: corrupted messages/delivery of non-existing messages or duplicate messages - e.g. process steps are ommited/additional steps taken. Solutions include checksums/message sequence numbers to detect.
Failures can be masked by hiding (detection and recovery). For example, hiding communication omission failures can be done through a protocol which retransmits lost messages. Hiding process crashes is done by replacing a crashed process and restoring it’s memory. Failures can also be masked by converting to a more acceptable type of failure, for example, a system can use checksums to convert an arbitrary communication failure(corrupted messages) to an omission failure.
|
|
|
Term
Describe the types of failures that can occur in distributed systems |
|
Definition
There are two different places where failures can occur - either on the server or on the client; hence there are client failures and server failures. These can be detected through the use of timeouts - where after a certain time, a process is deemed to have failed. These are implemented by failure detectors.
The two types of failure detectors are Suspected and Guaranteed. Suspected failure detectors only notify other processes Suspected/Unsuspected, while Guaranteed failure detectors are much more complex, and produce Guaranteed/Unsuspected reports. On the server side the damage of failures must be reduced by ensuring that there is stable storage. This prevents the loss of any transactions, modifications, or data written to the server.
Types of Failures:
-
Omission Failures - where a process/channel fails to perform intended actions. For example, if a process crashes, timeouts can be used to detect them in a synchronous system.
-
Timing Failures - where a process fails to meet limits set on the process execution time, or message delivery time.
- Arbitrary (Byzantine) failures - where the system fails in arbitrary ways, for example, by processing requests incorrectly, corrupting their local state, or producing incorrect or inconsistent outputs. Process steps may have been omitted, additional steps may have been taken - it is difficult to detect what has happened. Arbitrary communication failures include corrupted messages, delivery of non-existent messages or duplicate messages. Checksums and message sequence numbers can be used to detect these failures.
|
|
|
Term
Define the happened before relation on events in a distributed system and describe how logical clocks can be used to capture this relation numerically |
|
Definition
Time is needed for data consistency. Logical clocks are used to deduce if an event in a given process occurred before/after/concurrently with an event in another process.
The happened before relationship a -> b indicates that a happened before b in either the same process, two different processes (where a is the sending of a message and b is the receipt of the message in another process), or that a -> c and c -> b.
a -> b implies that the timestamp of a is less that the timestamp of b.
a || b indicates that a happened concurrently with b, the timestamps of a and b can be any value, but they are no in the same ‘chain of events’ so cannot be ordered.
Lamport clocks processes each put timestamps on their messages.
-
If a process receives a timestamped message from another process, which is higher than it’s current timestamp, it increments the higher time stamp and delivers the message
-
If a process receives a timestamped message from another process, which is lower than it’s current timestamp, it increments its current timestamp and delivers the message
These numbers are not in any way related to physical time -- that is why they are called logical clocks. These are generally implemented using counters, which increase each time an event occurs. |
|
|
Term
Describe the distributed mutual exclusion problem and list two correctness properties a distributed mutual exclusion algorithm must satisfy |
|
Definition
Distributed mutual exclusion algorithms are used to ensure data consistency and are used to prevent process interference. These are often build on top of the message passing paradigm; in some algorithms using multicast, using a token held by a central server, or a token which is passed around in a ring (Possession of token allows entry to critical region). Once a process has been satisfied some condition, it can then enter a ‘critical region’ where it is guaranteed to be the only process allowed to access the critical region.
Two correctness properties that a distributed mutual exclusion algorithm must satisfy are Safety- which ensures at most one process can be inside a critical region at any given point, and Liveness which ensures that a request to enter the critical region is eventually granted. A third possible is Maintains Happened Before Ordering so access ordering honors the ordering of requests
|
|
|
Term
In the central server algorithm for mutual exclusion, describe a situation in which requests are not processed in the happened-before order |
|
Definition
We have two processes A and B. Processes A wants to enter the critical region for a shared resource between A and B. Process Asends a request , ra, to the server asking for the token. Immediately after it has done this it sends a message m to process B. On receipt of m, B sends a request rb for the token to the server. To satisfy happened-before order, ra should be granted before rb. However, due to the vagaries of message propagation delay (e.g. process Acould be on the other side of the world), rb arrives at the server before ra and they are serviced in the wrong order. |
|
|
Term
Show how logical clocks can be used in conjunction with multicast in order to implement a distributed mutual exclusion algorithm |
|
Definition
To use a combination of multicast and logical clocks, each process would have to keep a state - Released, Held or Wanted, as well as a timestamp. If a process wished to enter the critical region, it would multicast the rest of the group.
If all the other processes replied with a Released state message, the requesting process would be able to enter.
If another process has it’s state set to Wanted, the process with the lower timestamp would be granted permission to go inside the critical region.
The process with its state set to HELD will reply when it exits the CR.
One instance of this algorithm is Ricart and Agrawala’s. This method achieves correctness, but performance may be slow at times - as the requesting process must wait for all other processes to respond before making a decision
|
|
|
Term
State the leader election problem and list two correctness properties a leader election algorithm must satisfy |
|
Definition
The leader election problem is the process of designating a singleprocess as the organizer of some task distributed among several nodes. The algorithms used should ensure that each process agrees on a choice of leader and that the algorithms themselves can withstand process failures. Multiple elections may be held at the same time as there is no coordination amongst the processes. The process with the largest ID becomes the coordinator.
Two correctness properties these algorithms must satisfy are:
Safety - Each participant process must have elected = ⊥ (no decision yet made), or elected = P, where P is the non-crashed process with the largest identifier
Liveness - All processes participate, and eventually change their elected variable to something other than ⊥, or crash (i.e all processes eventually pick a leader, or crash) |
|
|
Term
Describe the ring based electin algorithm and the assumptions it makes in order to ensure correctness. How does this algorithm prevent unnecessary concurrent elections? |
|
Definition
Processes are arranged in a logical ring (not necessarily the same as their physical/logical ordering). Initially all the processes are set to nonparticipants. This is important, as "Participant" and "nonparticipant" states are used so that when multiple processes start an election at roughly the same time, only a single winner will be announced.
Any process can call an election; it marks itself as a participant, places it’s ID in an election message, and sends the message to it’s neighbour.
Upon receiving an election message, if the current process ID is less than the one in the message, it simply forwards the message on. If it is greater, the current process ID is placed in the message and forwarded on. If the ID is it’s own, a coordinator has been found. It stops forwarding the current message, and instead sends out an elected message with it’s ID enclosed. The current process marks itself as a participant.
If the current process is already marked as a participant when the election message is received , it discards the election message as multiple elections have been started.
Upon receiving an elected message, if the ID is it’s own - the process will stop forwarding the elected message, else it will continue forwarding the message and set its own elected variable to the process ID within the elected message.
This algorithm makes assumptions that no processes crash, and that there is reliable communication (all messages eventually delivered), as well as that there are only a finite number of processes, which can each only take a finite amount of time inside the critical region
|
|
|
Term
Give an example execution of the bully election algorithm to illustrate how this algorithm supports dealing with process failure |
|
Definition
Any process can initiate an election if it has just recovered from a failure, or if it notices that a coordinator has failed (Through timeout). It sends an Election message to all processes with higher IDs (must be known beforehand).
If it does not receive any response message, it becomes the coordinator and sends a coordinator message to all processes with lower IDs.
If it receives a response message, then it waits until it receives a coordinator message from that process. The higher processes will then have a similar election amongst themselves and the result will propagate down.
This algorithm deals with process failure through the use of timeouts. If the process who called the election has not had a response within a certain time, it will assume that the other processes have failed. If a coordinator is elected which then fails, failure detectors are used as means to detect this - in which case another election would be called.
|
|
|
Term
Describe the differences between basic and reliable multicast, in terms of the assumptions they make and the guarantees they offer |
|
Definition
Assumptions of both:
Guarantees of both:
Guarantees of basic:
Guarantees of reliable:
Differences:
|
|
|
Term
Reliable multicast means that even if the sender crashes, as long as it has sent at least one message, the message will still be received by the group.
How does this differ from basic which only sends one message as well? |
|
Definition
|
|
Term
Describe the difference between FIFO ordered, totally ordered and casually ordered multcast in terms of the guarantees they offer |
|
Definition
Reliable multicast does not ensure the order of message on delivery, hence ordered multicast is required.
FIFO ordering: If a correct process issues multicast(g,m) and then multicast(g,m’), then every correct process that delivers m’ will have already delivered m.
Causal ordering: If multicast(g,m) -> multicast(g,m’) then any correct process that delivers m’ will have already delivered m.
Total ordering: If a correct process delivers message m before m’ (independent of the senders), then any other correct process that delivers m’ will have already delivered m.
|
|
|
Term
Describe the consensus problem in distributed systems, and describe the types of failures algorithms must cope with |
|
Definition
The consensus problem is one where processes must agree on an outcome after a value is proposed. Communication is assumed to be reliable, as well as the fact that processes communicate via message passing. The assumption is that the only failures are:
- Crash: processes simply stop sending values. These are difficult to detect in an asynchronous systems
- Arbitrary failures: bugs, or deliberate failures such as byzantine failures, where processes purposefully lie and send different values to different processes.
|
|
|
Term
List the correctness properties which a distributed consensus algorithm must satisfy |
|
Definition
|
|