Conversation

@sameh-farouk (Member) commented Jul 23, 2025

Description

The original WeightSlice implementation had a fundamental flaw: it used static weighted-random selection that didn't adapt to relay failures, causing "sticky" relay behavior. It also didn't behave as expected; see this issue. The cooldown-based relay failover mechanism proposed in this PR is fair, configurable, and thread-safe. It is an improvement over the original weighted-random approach and ensures reliable relay failover in production environments.

  • Replaced WeightSlice with CooldownRelaySet, which provides fair, adaptive relay ordering
  • Failed relays are penalized with a cooldown period during which they are deprioritized but not excluded
  • All relays get fair chances based on their recent performance
  • Fair retries continue until the envelope expires (a rough sketch of the resulting send loop follows this list)
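
To make the flow concrete, here is a minimal sketch of how a cooldown-aware relay set could drive the send path. The method names (Sorted, MarkSuccess, MarkFailure), the 5-second per-attempt timeout, and the 100 ms pause mirror what is discussed in this PR, but the types and wiring are illustrative assumptions, not the exact implementation:

import (
    "context"
    "fmt"
    "time"
)

// Minimal stand-ins for illustration; the real types live in the peer package.
type relay interface {
    send(ctx context.Context, data []byte) error
}

type relaySet interface {
    Sorted(now time.Time) []relay // relays outside their cooldown first, cooling-down ones last
    MarkSuccess(r relay, now time.Time)
    MarkFailure(r relay, now time.Time)
}

// sendWithFailover keeps cycling through every relay, in cooldown-aware order,
// until one accepts the payload or the envelope-derived deadline passes.
func sendWithFailover(ctx context.Context, set relaySet, data []byte, expiresAt time.Time) error {
    ctx, cancel := context.WithDeadline(ctx, expiresAt)
    defer cancel()

    for {
        for _, r := range set.Sorted(time.Now()) {
            attemptCtx, cancelAttempt := context.WithTimeout(ctx, 5*time.Second) // per-relay attempt timeout
            err := r.send(attemptCtx, data)
            cancelAttempt()
            if err == nil {
                set.MarkSuccess(r, time.Now())
                return nil
            }
            set.MarkFailure(r, time.Now()) // deprioritized on the next pass, never excluded
        }
        select {
        case <-ctx.Done():
            return fmt.Errorf("envelope expired before any relay accepted it: %w", ctx.Err())
        case <-time.After(100 * time.Millisecond): // brief pause after exhausting all relays
        }
    }
}

The key property is that a failure only changes ordering; every relay is visited again on the next pass until the envelope's deadline is reached.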

Some optimizations and validation were also introduced:

  • Bumped the send timeout from 2 to 5 seconds (per relay attempt). The ideal value is always environment-specific, but 5 seconds is a more conservative and safer starting point than 2 seconds: it favors stability and resilience under real-world server conditions, and the cost of a 3-second slower reaction to a truly dead connection is acceptable for the vast majority of applications.
  • Validate the expiration of incoming messages and stop processing messages that have already expired. This saves resources and is considered good security practice (see the sketch after this list).
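
Below is a rough sketch of that expiration check, assuming the envelope carries Unix-second Timestamp and Expiration fields; the field names and the receive-path wiring are assumptions for illustration, not the PR's exact code:

import (
    "fmt"
    "time"
)

// Illustrative envelope fields; the real types.Envelope layout may differ.
type incoming struct {
    UID        string
    Timestamp  uint64 // Unix seconds at which the message was created
    Expiration uint64 // time-to-live in seconds; 0 means no expiration
}

// isExpired reports whether the message's time-to-live has already elapsed.
func isExpired(msg incoming, now time.Time) bool {
    if msg.Expiration == 0 {
        return false
    }
    deadline := time.Unix(int64(msg.Timestamp+msg.Expiration), 0)
    return now.After(deadline)
}

// handleIncoming drops expired envelopes before any decryption, signature
// verification, or handler dispatch is attempted.
func handleIncoming(msg incoming) error {
    if isExpired(msg, time.Now()) {
        return fmt.Errorf("dropping expired envelope %s", msg.UID)
    }
    // ... continue normal processing ...
    return nil
}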

Related Issues:

@xmonader (Contributor)

I think maybe a more direct behavior could resolve lots of the complexity in the PR

func (p *Peer) send(ctx context.Context, request *types.Envelope) error {

    sendCtx, cancel := context.WithDeadline(ctx, ...) // deadline derived from the envelope expiration
    defer cancel()

    select {
    case con := <-p.healthyConns:
        err := con.send(sendCtx, bytes) // bytes: the serialized envelope
        if err != nil {
            p.penalize(con)
            return err // fail fast; the caller can decide to retry
        }
        p.healthyConns <- con // return the connection to the healthy pool
        return nil
    case <-sendCtx.Done():
        return fmt.Errorf("no healthy connection available before the deadline: %w", sendCtx.Err())
    }
}

func (p *Peer) penalize(con *InnerConnection) {
    // Re-admit the connection to the healthy pool only after the cooldown elapses.
    time.AfterFunc(p.cooldown, func() {
        select {
        case p.healthyConns <- con:
        default: // log that the peer is shut down and drop the connection
        }
    })
}

func (p *Peer) penalize(con *InnerConnection) {
    time.AfterFunc(p.cooldown, func() {
        select {
        case p.healthyConns <- con:
        default: // log that the peer is shutdown
        }
    })
}

@sameh-farouk (Member, Author) commented Jul 27, 2025

I think maybe a more direct behavior could resolve lots of the complexity in the PR

Thanks for the feedback. I see the logic, but with only a few connections available, I'm concerned about scenarios such as having a single available connection, or temporarily losing network connectivity and thereby penalizing all connections. In such scenarios, exclusion could lead to unnecessary downtime, since no connection would be left to attempt a send operation until the cooldown expires.

The primary focus of this PR is to improve the reliability of the failover mechanism and IMO, deprioritizing failed connections (not excluding) would be more resilient, as it keeps all options available for retries/failover.

I'm not particularly attached to the idea of scoring the connections in this specific scope (it was already there, so I kept it).
Personally, if simplification is a top priority here, I'd prefer a simple shuffle-and-loop approach over both the original random selection and your suggestion. What really matters to me is ensuring that all connections are retried.

Open to further thoughts!

// T must be a pointer type. Pointer value comparison is used.
func (s *CooldownRelaySet[T]) MarkFailure(relay T, now time.Time) {
    for i := range s.Relays {
        if reflect.ValueOf(s.Relays[i].Relay).Pointer() == reflect.ValueOf(relay).Pointer() {
Contributor:

I'm not a fan; I believe this could be simplified a lot by keeping the penalized relays in a penalized list and the healthy ones in a healthy list, shuffling both, and trying the healthy ones first, then the penalized ones.

Member Author:

I simplified the code by using *InnerConnection directly, and removing reflection-based comparison to improve readability. The code should now be safer and more straightforward to understand, while functionality remains intact.

The current implementation is simpler than maintaining two separate lists. By using atomic operations on the LastErrorAt field, we've simplified the code while making it more robust.

Instead of juggling relays between healthy and penalized lists, which would require careful synchronization using a mutex, we now have a single source of truth for each relay's state. The atomic operations ensure (lock-free) thread safety without the overhead of locks during normal operation, and we avoid constantly reallocating slices.

The beauty of this approach is in its simplicity. We can mark a relay as failed or successful with a single atomic store operation, and the Sorted() method efficiently organizes the relays based on their current effective state. IMO, this makes the code easier to reason about while being more performant and simpler than the two-list approach.
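
For readers following the thread, here is a minimal sketch of that single-list approach: *InnerConnection is used directly, each relay carries one atomic failure timestamp, and Sorted() orders rather than excludes. The names follow the discussion above (MarkFailure, Sorted, LastErrorAt), but the exact fields and layout are illustrative assumptions:

import (
    "sync/atomic"
    "time"
)

// InnerConnection stands in here for the peer package's connection type (placeholder).
type InnerConnection struct {
    url string
}

// relayEntry pairs a connection with a single atomic failure timestamp; this is
// the single source of truth that replaces separate healthy/penalized lists.
type relayEntry struct {
    Conn        *InnerConnection
    lastErrorAt atomic.Int64 // Unix nanoseconds of the last failure; 0 means no recent failure
}

type CooldownRelaySet struct {
    Relays   []*relayEntry
    Cooldown time.Duration
}

// MarkFailure records the failure time with one atomic store; no mutex is needed.
func (s *CooldownRelaySet) MarkFailure(conn *InnerConnection, now time.Time) {
    for _, e := range s.Relays {
        if e.Conn == conn { // plain pointer comparison, no reflection
            e.lastErrorAt.Store(now.UnixNano())
            return
        }
    }
}

// MarkSuccess clears the failure timestamp so the relay is preferred again immediately.
func (s *CooldownRelaySet) MarkSuccess(conn *InnerConnection, _ time.Time) {
    for _, e := range s.Relays {
        if e.Conn == conn {
            e.lastErrorAt.Store(0)
            return
        }
    }
}

// Sorted returns every relay: those outside their cooldown window first, the
// cooling-down ones last. Nothing is ever excluded from the result.
func (s *CooldownRelaySet) Sorted(now time.Time) []*InnerConnection {
    healthy := make([]*InnerConnection, 0, len(s.Relays))
    cooling := make([]*InnerConnection, 0, len(s.Relays))
    for _, e := range s.Relays {
        last := e.lastErrorAt.Load()
        if last != 0 && now.Sub(time.Unix(0, last)) < s.Cooldown {
            cooling = append(cooling, e.Conn)
        } else {
            healthy = append(healthy, e.Conn)
        }
    }
    return append(healthy, cooling...)
}

Because the only mutable state per relay is one atomic integer, MarkFailure and MarkSuccess are lock-free, and the relay list itself is never rebuilt or shuffled between separate lists.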

}

// WithRelayCooldown sets the cooldown duration for relay failover and retry logic.
// If not set, defaults to 10 seconds.
Collaborator:

I think 10 seconds is too much, no?

Member Author:

Not at all. The cooldown only affects the order in which relays are tried.

}
return nil
time.Sleep(100 * time.Millisecond)
Collaborator:

Here too, I fear this affects RMB responses.

Member Author:

This is a very brief pause after exhausting all the relays, to prevent a busy loop. Without it, if the connections fail immediately (e.g., the machine is not yet connected to the internet), we wouldn't leave any breathing space for the CPU to perform other tasks.
