Conversation

@sameh-farouk (Member) commented Jul 23, 2025

Description

The original WeightSlice implementation had a fundamental flaw: it used static weighted-random selection that didn't adapt to relay failures, causing "sticky" relay behavior. It also didn't behave as expected; see this issue. The cooldown-based relay failover mechanism proposed in this PR is fair, configurable, and thread-safe. It is an improvement over the original weighted-random approach and ensures reliable relay failover in production environments.

  • Replaced WeightSlice with CooldownRelaySet, which provides fair, adaptive relay ordering
  • Failed relays are penalized with a cooldown period during which they are deprioritized but not excluded
  • All relays get fair chances based on their recent performance
  • Fair retries continue until the envelope expires (a rough sketch of the resulting send loop follows this list)
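
To make the flow concrete, here is a minimal sketch of how a cooldown-aware relay set could drive the send path. The method names (Sorted, MarkSuccess, MarkFailure), the 5-second per-attempt timeout, and the 100 ms pause mirror what is discussed in this PR, but the types and wiring are illustrative assumptions, not the exact implementation:

import (
    "context"
    "fmt"
    "time"
)

// Minimal stand-ins for illustration; the real types live in the peer package.
type relay interface {
    send(ctx context.Context, data []byte) error
}

type relaySet interface {
    Sorted(now time.Time) []relay // relays outside their cooldown first, cooling-down ones last
    MarkSuccess(r relay, now time.Time)
    MarkFailure(r relay, now time.Time)
}

// sendWithFailover keeps cycling through every relay, in cooldown-aware order,
// until one accepts the payload or the envelope-derived deadline passes.
func sendWithFailover(ctx context.Context, set relaySet, data []byte, expiresAt time.Time) error {
    ctx, cancel := context.WithDeadline(ctx, expiresAt)
    defer cancel()

    for {
        for _, r := range set.Sorted(time.Now()) {
            attemptCtx, cancelAttempt := context.WithTimeout(ctx, 5*time.Second) // per-relay attempt timeout
            err := r.send(attemptCtx, data)
            cancelAttempt()
            if err == nil {
                set.MarkSuccess(r, time.Now())
                return nil
            }
            set.MarkFailure(r, time.Now()) // deprioritized on the next pass, never excluded
        }
        select {
        case <-ctx.Done():
            return fmt.Errorf("envelope expired before any relay accepted it: %w", ctx.Err())
        case <-time.After(100 * time.Millisecond): // brief pause after exhausting all relays
        }
    }
}

The key property is that a failure only changes ordering; every relay is visited again on the next pass until the envelope's deadline is reached.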

Some optimizations and validation were also introduced:

  • Bumped the send timeout from 2 to 5 seconds (per relay attempt). The ideal value is always environment-specific, but 5 seconds is a more conservative and safer starting point than 2 seconds: it favors stability and resilience under real-world server conditions, and the cost of a 3-second slower reaction to a truly dead connection is acceptable for the vast majority of applications.
  • Validate the expiration of incoming messages and stop processing messages that have already expired. This saves resources and is considered good security practice (see the sketch after this list).
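
Below is a rough sketch of that expiration check, assuming the envelope carries Unix-second Timestamp and Expiration fields; the field names and the receive-path wiring are assumptions for illustration, not the PR's exact code:

import (
    "fmt"
    "time"
)

// Illustrative envelope fields; the real types.Envelope layout may differ.
type incoming struct {
    UID        string
    Timestamp  uint64 // Unix seconds at which the message was created
    Expiration uint64 // time-to-live in seconds; 0 means no expiration
}

// isExpired reports whether the message's time-to-live has already elapsed.
func isExpired(msg incoming, now time.Time) bool {
    if msg.Expiration == 0 {
        return false
    }
    deadline := time.Unix(int64(msg.Timestamp+msg.Expiration), 0)
    return now.After(deadline)
}

// handleIncoming drops expired envelopes before any decryption, signature
// verification, or handler dispatch is attempted.
func handleIncoming(msg incoming) error {
    if isExpired(msg, time.Now()) {
        return fmt.Errorf("dropping expired envelope %s", msg.UID)
    }
    // ... continue normal processing ...
    return nil
}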

Related Issues:

@xmonader (Contributor)

I think maybe a more direct behavior could resolve lots of the complexity in the PR

func (p *Peer) send(ctx context.Context, request *types.Envelope) error {

    sendCtx, cancel := context.WithDeadline(ctx, ...) // deadline derived from the envelope expiration
    defer cancel()

    select {
    case con := <-p.healthyConns:
        err := con.send(sendCtx, bytes) // bytes: the serialized envelope
        if err != nil {
            p.penalize(con)
            return err // fail fast; the caller can decide to retry
        }
        p.healthyConns <- con // return the connection to the healthy pool
        return nil
    case <-sendCtx.Done():
        return fmt.Errorf("no healthy connection available before the deadline: %w", sendCtx.Err())
    }
}

func (p *Peer) penalize(con *InnerConnection) {
    // Re-admit the connection to the healthy pool only after the cooldown elapses.
    time.AfterFunc(p.cooldown, func() {
        select {
        case p.healthyConns <- con:
        default: // log that the peer is shut down and drop the connection
        }
    })
}

func (p *Peer) penalize(con *InnerConnection) {
    time.AfterFunc(p.cooldown, func() {
        select {
        case p.healthyConns <- con:
        default: // log that the peer is shutdown
        }
    })
}

@sameh-farouk (Member, Author) commented Jul 27, 2025

I think maybe a more direct behavior could resolve lots of the complexity in the PR

Thanks for the feedback. I see the logic, but with only a few connections available, I'm concerned about scenarios such as having a single available connection, or temporarily losing network connectivity and thereby penalizing all connections. In such scenarios, exclusion could lead to unnecessary downtime, since no connection would be left to attempt a send operation until the cooldown expires.

The primary focus of this PR is to improve the reliability of the failover mechanism and IMO, deprioritizing failed connections (not excluding) would be more resilient, as it keeps all options available for retries/failover.

I'm not particularly attached to the idea of scoring the connections in this specific scope (it was already there, so I kept it).
Personally, if simplification is a top priority here, I'd prefer a simple shuffle-and-loop approach over both the original random selection and your suggestion. What really matters to me is ensuring that all connections are retried.

Open to further thoughts!

// T must be a pointer type. Pointer value comparison is used.
func (s *CooldownRelaySet[T]) MarkFailure(relay T, now time.Time) {
    for i := range s.Relays {
        if reflect.ValueOf(s.Relays[i].Relay).Pointer() == reflect.ValueOf(relay).Pointer() {
Contributor:

I'm not a fan; I believe this could be simplified a lot by keeping the penalized relays in a penalized list and the healthy ones in a healthy list, shuffling both, and trying the healthy ones first, then the penalized ones.

Member Author:

I simplified the code by using *InnerConnection directly, and removing reflection-based comparison to improve readability. The code should now be safer and more straightforward to understand, while functionality remains intact.

The current implementation is simpler than maintaining two separate lists. By using atomic operations on the LastErrorAt field, we've simplified the code while making it more robust.

Instead of juggling relays between healthy and penalized lists, which would require careful synchronization using a mutex, we now have a single source of truth for each relay's state. The atomic operations ensure (lock-free) thread safety without the overhead of locks during normal operation, and we avoid constantly reallocating slices.

The beauty of this approach is in its simplicity. We can mark a relay as failed or successful with a single atomic store operation, and the Sorted() method efficiently organizes the relays based on their current effective state. IMO, this makes the code easier to reason about while being more performant and simpler than the two-list approach.
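
For readers following the thread, here is a minimal sketch of that single-list approach: *InnerConnection is used directly, each relay carries one atomic failure timestamp, and Sorted() orders rather than excludes. The names follow the discussion above (MarkFailure, Sorted, LastErrorAt), but the exact fields and layout are illustrative assumptions:

import (
    "sync/atomic"
    "time"
)

// InnerConnection stands in here for the peer package's connection type (placeholder).
type InnerConnection struct {
    url string
}

// relayEntry pairs a connection with a single atomic failure timestamp; this is
// the single source of truth that replaces separate healthy/penalized lists.
type relayEntry struct {
    Conn        *InnerConnection
    lastErrorAt atomic.Int64 // Unix nanoseconds of the last failure; 0 means no recent failure
}

type CooldownRelaySet struct {
    Relays   []*relayEntry
    Cooldown time.Duration
}

// MarkFailure records the failure time with one atomic store; no mutex is needed.
func (s *CooldownRelaySet) MarkFailure(conn *InnerConnection, now time.Time) {
    for _, e := range s.Relays {
        if e.Conn == conn { // plain pointer comparison, no reflection
            e.lastErrorAt.Store(now.UnixNano())
            return
        }
    }
}

// MarkSuccess clears the failure timestamp so the relay is preferred again immediately.
func (s *CooldownRelaySet) MarkSuccess(conn *InnerConnection, _ time.Time) {
    for _, e := range s.Relays {
        if e.Conn == conn {
            e.lastErrorAt.Store(0)
            return
        }
    }
}

// Sorted returns every relay: those outside their cooldown window first, the
// cooling-down ones last. Nothing is ever excluded from the result.
func (s *CooldownRelaySet) Sorted(now time.Time) []*InnerConnection {
    healthy := make([]*InnerConnection, 0, len(s.Relays))
    cooling := make([]*InnerConnection, 0, len(s.Relays))
    for _, e := range s.Relays {
        last := e.lastErrorAt.Load()
        if last != 0 && now.Sub(time.Unix(0, last)) < s.Cooldown {
            cooling = append(cooling, e.Conn)
        } else {
            healthy = append(healthy, e.Conn)
        }
    }
    return append(healthy, cooling...)
}

Because the only mutable state per relay is one atomic integer, MarkFailure and MarkSuccess are lock-free, and the relay list itself is never rebuilt or shuffled between separate lists.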

}

// WithRelayCooldown sets the cooldown duration for relay failover and retry logic.
// If not set, defaults to 10 seconds.
Collaborator:

I think 10 seconds is too much, no?

Member Author:

Not at all. The cooldown only affects the order in which relays are tried.

}
return nil
time.Sleep(100 * time.Millisecond)
Collaborator:

Here too, I fear this affects RMB responses.

Member Author:

This is a very brief pause after exhausting all the relays, to prevent a busy loop. Without it, if the connections fail immediately (e.g., the machine is not yet connected to the internet), we wouldn't leave any breathing space for the CPU to perform other tasks.
