optimization: Leader log sampled handshake #150
Comments
Some concerns we may need to consider:
@joshuazh-x There is no need for the old leader to continue replication if it learns there is a new leader. Any message duplication in this proposal is already possible today (in cases leadership change races with replication, and/or there are connection issues).
I do not expect this to happen in normal operation, because the old leader will be notified about the existence of the new leader, and step down. The only difference is that, with this proposal, the last few append messages that the old leader has sent may have been [partially] accepted into the follower's log rather than outright rejected.
When it learns the new term. For example, this will happen when
Same thing will happen as today. Any leader will try to probe unreachable followers. The old leader will stop doing so and step down when it learns about the new leader, or things like
The moment it learns about the new leader.
Background: #144
At the moment, a `raft` node only accepts `MsgApp` log appends from the latest leader it knows about, i.e. when `MsgApp.Term == raft.Term`. This restriction could be relaxed, which can reduce the message turnaround during the times when the leader changes.

The safety requirement is that we don't accept entries that are not in the `raft.Term` leader log. If we can deduce that an entry is in the leader's log (before / other than by getting a `MsgApp` directly from this leader), we can always safely accept it.

One way to achieve this:
- When a node votes for a candidate, it learns the `(term, index)` of the last entry of the new leader's log. If the election wins, the new leader will not overwrite entries up to this `index`, and will append new entries strictly after it.
- If the node then receives a `MsgApp` (from any leader) that contains this entry, we have the guarantee that all entries <= `index` in this append are contained in the leader's log. It is safe to accept them.

A more general way to achieve this is:
- The candidate attaches not only its last entry `(index, term)`, but also a sample of `K` other `(index, term)` in its log. Specifically, it would be wise to attach the "fork" points of the last `K` terms.
- When a node receives a `MsgApp`, it can deduce from this sample the overlap between this append message and the leader's log. The overlapping part can be safely accepted regardless of who sent it.

The practical `K` would be 2 or 3, because leader changes are typically not frequent. The last 2 or 3 term changes cover a significant section of the log.

This sampling technique is equivalent to the fork point search that the leader does in the `StateProbe` state to establish the longest common prefix with the follower's log before transitioning it to the optimistic `StateReplicate` state.

This gives significant benefits:
- By sampling `K` fork points rather than just the latest one, we increase chances of finding an overlap immediately, and reduce message turnaround.
- […] `MsgApp` in the `StateProbe`, and will typically be able to transition straight to `StateReplicate`.
- Followers can accept `MsgApp.Entries` (that arrived just slightly late) from a recent leader who is stepping down.

This technique will minimize cluster disruption / slowdown during election, and reduce tail replication/commit latency in some cases.
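To make the acceptance rule concrete, here is a minimal Go sketch of the overlap check described above. It is not code from the `raft` package: the `entryID` type and the `safePrefixLen` function are hypothetical names invented for illustration, and the sketch assumes the sample of `(index, term)` points is already known to the receiving node. It relies only on the rule stated in the proposal: if an append contains a sampled entry, every entry at or below that index is in the leader's log.

```go
package main

import "fmt"

// entryID identifies a log entry by its index and term.
// Hypothetical type for this sketch; not part of the raft package API.
type entryID struct {
	index uint64
	term  uint64
}

// safePrefixLen returns how many leading entries of an append are known to be
// in the leader's log, given a sample of entry IDs taken from that log.
// If the append contains a sampled entry, all entries up to and including it
// are treated as safe; entries after the highest match are not.
func safePrefixLen(sample []entryID, entries []entryID) int {
	known := make(map[entryID]bool, len(sample))
	for _, id := range sample {
		known[id] = true
	}
	// Scan from the end so that the highest matching entry wins.
	for i := len(entries) - 1; i >= 0; i-- {
		if known[entries[i]] {
			return i + 1
		}
	}
	return 0
}

func main() {
	// Assumed sample from the new leader's log: fork points of its last 3 terms.
	sample := []entryID{{5, 2}, {8, 3}, {10, 4}}
	// A late append from the previous (term 3) leader covering indices 6..9.
	late := []entryID{{6, 3}, {7, 3}, {8, 3}, {9, 3}}
	fmt.Println(safePrefixLen(sample, late)) // prints 3
}
```

In this example the entries at indices 6 to 8 can be accepted, while index 9 cannot be confirmed from the sample and would still need the usual handling. A real implementation would also have to account for the append's preceding entry and for the receiver's own log, which this sketch leaves out.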