Fix relay failover mechanism #1391
base: development
Conversation
…d of context deadline
I think maybe a more direct behavior could resolve a lot of the complexity in the PR:

```go
func (p *Peer) send(ctx context.Context, request *types.Envelope) error {
	sendCtx, cancel := context.WithDeadline(ctx, ...)
	defer cancel()

	select {
	case con := <-p.healthyConns:
		err := con.send(sendCtx, bytes)
		if err != nil {
			p.penalize(con)
			return err // Fail fast, now the caller can decide to retry
		}
		p.healthyConns <- con
		return nil
	case <-sendCtx.Done():
		return fmt.Errorf("something something: %w", sendCtx.Err())
	}
}

func (p *Peer) penalize(con *InnerConnection) {
	time.AfterFunc(p.cooldown, func() {
		select {
		case p.healthyConns <- con:
		default: // log that the peer is shut down
		}
	})
}
```
Thanks for the feedback. I see the logic, but with only a few connections available I'm concerned about scenarios such as having a single available connection, or temporarily losing network connectivity and penalizing all connections. In those cases, exclusion could lead to unnecessary downtime, since no connection would be left to attempt a send until the cooldown expires. The primary focus of this PR is to improve the reliability of the failover mechanism, and IMO deprioritizing failed connections (rather than excluding them) is more resilient, since it keeps all options available for retries/failover. I'm not particularly attached to the idea of scoring the connections in this specific scope (it was already there, so I kept it). Open to further thoughts!
rmb-sdk-go/peer/relay_retry_set.go
Outdated
```go
// T must be a pointer type. Pointer value comparison is used.
func (s *CooldownRelaySet[T]) MarkFailure(relay T, now time.Time) {
	for i := range s.Relays {
		if reflect.ValueOf(s.Relays[i].Relay).Pointer() == reflect.ValueOf(relay).Pointer() {
```
I'm not a fan; I believe this could be simplified a lot by keeping the penalized connections in a penalized list and the healthy ones in a healthy list, and shuffling over both: trying first from the healthy list, then from the penalized one.
I simplified the code by using *InnerConnection directly and removing the reflection-based comparison to improve readability. The code should now be safer and more straightforward to understand, while the functionality remains intact.

The current implementation is simpler than maintaining two separate lists. By using atomic operations on the LastErrorAt field, we've simplified the code while making it more robust. Instead of juggling relays between healthy and penalized lists, which would require careful synchronization with a mutex, we now have a single source of truth for each relay's state. The atomic operations give lock-free thread safety without the overhead of locks during normal operation, and we avoid constantly reallocating slices.

The beauty of this approach is its simplicity: we can mark a relay as failed or successful with a single atomic store, and the Sorted() method organizes the relays based on their current effective state. IMO this makes the code easier to reason about while being more performant and simpler than the two-list approach.
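To make the described approach concrete, here is a minimal, self-contained sketch of a cooldown-based relay set built around an atomic last-failure timestamp. The names (InnerConnection, LastErrorAt, MarkFailure, MarkSuccess, Sorted) follow the discussion above, but the concrete fields and signatures in the PR may differ:

```go
package peer

import (
	"sort"
	"sync/atomic"
	"time"
)

// InnerConnection is a stand-in for the SDK's relay connection type.
type InnerConnection struct {
	url string
}

// cooldownRelay wraps a connection with the time of its last failure,
// stored atomically so readers never need a mutex.
type cooldownRelay struct {
	Conn        *InnerConnection
	LastErrorAt atomic.Int64 // unix nanoseconds of the last failure, 0 if healthy
}

// CooldownRelaySet keeps every relay available and only deprioritizes the
// ones that failed recently, instead of excluding them.
type CooldownRelaySet struct {
	Relays   []*cooldownRelay
	Cooldown time.Duration // e.g. the value set via WithRelayCooldown
}

// MarkFailure records the failure time with a single atomic store.
func (s *CooldownRelaySet) MarkFailure(conn *InnerConnection, now time.Time) {
	for _, r := range s.Relays {
		if r.Conn == conn {
			r.LastErrorAt.Store(now.UnixNano())
		}
	}
}

// MarkSuccess clears the failure mark so the relay counts as healthy again.
func (s *CooldownRelaySet) MarkSuccess(conn *InnerConnection) {
	for _, r := range s.Relays {
		if r.Conn == conn {
			r.LastErrorAt.Store(0)
		}
	}
}

// Sorted returns all relays, healthy ones (no failure within the cooldown
// window) first; relays still cooling down come last but are never dropped.
func (s *CooldownRelaySet) Sorted(now time.Time) []*InnerConnection {
	type entry struct {
		conn    *InnerConnection
		cooling bool
	}
	entries := make([]entry, 0, len(s.Relays))
	for _, r := range s.Relays {
		last := r.LastErrorAt.Load()
		cooling := last != 0 && now.Sub(time.Unix(0, last)) < s.Cooldown
		entries = append(entries, entry{conn: r.Conn, cooling: cooling})
	}
	sort.SliceStable(entries, func(i, j int) bool {
		return !entries[i].cooling && entries[j].cooling
	})
	out := make([]*InnerConnection, 0, len(entries))
	for _, e := range entries {
		out = append(out, e.conn)
	}
	return out
}
```

Because relays are only reordered, never removed, a single surviving relay or a brief network-wide outage still leaves every connection available for the next attempt.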
```go
// WithRelayCooldown sets the cooldown duration for relay failover and retry logic.
// If not set, defaults to 10 seconds.
```
I think 10 seconds is too much, no?
Not at all. The cooldown only affects the order in which relays are tried.
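For illustration, here is a hedged sketch of how a functional option like WithRelayCooldown typically plugs into a config carrying the documented 10-second default; the actual option and config types in rmb-sdk-go may well differ:

```go
package peer

import "time"

// peerCfg is a hypothetical config struct used only for this sketch.
type peerCfg struct {
	relayCooldown time.Duration
}

// PeerOpt is the usual functional-option shape.
type PeerOpt func(*peerCfg)

// WithRelayCooldown overrides the default cooldown. As discussed above, the
// cooldown only changes the order in which relays are tried; it never takes
// a relay out of rotation.
func WithRelayCooldown(d time.Duration) PeerOpt {
	return func(c *peerCfg) { c.relayCooldown = d }
}

// newCfg applies the options over the documented 10-second default.
func newCfg(opts ...PeerOpt) *peerCfg {
	cfg := &peerCfg{relayCooldown: 10 * time.Second}
	for _, o := range opts {
		o(cfg)
	}
	return cfg
}
```

Under this sketch, a shorter cooldown only shortens how long a failed relay stays deprioritized; it never drops it from the candidate list.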
```go
}
return nil
time.Sleep(100 * time.Millisecond)
```
Here too, I fear this affects RMB responses.
This is a very brief pause after exhausting all the relays, to prevent a busy loop. If it were removed and the connections fail immediately (e.g., the machine is not yet connected to the internet), we wouldn't leave any breathing room for the CPU to perform other tasks.
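As an illustration of that reasoning, here is a small, self-contained sketch (not the PR's actual code) of a retry loop that pauses briefly only after a full pass over all relays has failed, while still honoring context cancellation:

```go
package peer

import (
	"context"
	"fmt"
	"time"
)

// trySend walks the (already ordered) relays and retries whole passes until
// one send succeeds or the context ends; sendVia stands in for the real send.
func trySend(ctx context.Context, relays []string, sendVia func(relay string) error) error {
	for {
		for _, r := range relays { // relays come pre-ordered, healthy first
			if err := sendVia(r); err == nil {
				return nil
			}
		}
		// Every relay failed in this pass (e.g. no network yet). Pause briefly
		// so the loop does not peg a CPU core, but still react to cancellation.
		select {
		case <-ctx.Done():
			return fmt.Errorf("all relays failed: %w", ctx.Err())
		case <-time.After(100 * time.Millisecond):
		}
	}
}
```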
…adability and generic type for direct InnerConnection usage
Description
The original WeightSlice implementation had a fundamental flaw: it used static weighted-random selection that didn't adapt to relay failures, causing "sticky" relay behavior. It also didn't behave as expected; see this issue. The cooldown-based relay failover mechanism proposed in this PR is fair, configurable, and thread-safe. It is an improvement over the original weighted-random approach and will ensure reliable relay failover in production environments.
Also, some optimizations and validation checks were introduced:
Related Issues: