🐞 [Bug]: Peer can deadlock, stall, or time out under burst load when a handler is slow or waits for responses #1407

@sameh-farouk

What happened?

Description
Under bursty traffic, the peer can stall or deadlock when a handler sends many requests and synchronously waits for their responses. While the handler runs, the synchronous Peer.process() loop stops draining reader, so InnerConnection.loop() blocks when forwarding inbound frames.
The problem is not limited to handlers that wait for their own responses. A handler that is merely slow does not deadlock, but it can still cause periodic stalls, reconnects, and lost messages.

Details
1. Peer.process() calls d.handler(...) inline. While the handler is blocked waiting on its responses,
process() is not draining d.reader.
2. InnerConnection.loop() attempts output <- data (where output is Peer.reader) and blocks because
the reader channel is unbuffered.
3. The response the handler is waiting for gets stuck behind the blocked send, creating a stall/deadlock.
4. Even if the handler is slow rather than waiting for its own responses, process() is not draining reader while the handler runs. The connection loop blocks, which pauses ping/pong and increases latency, causing messages to time out and
risking reconnects if the block lasts long enough.

Root Cause

  • Tight coupling: Peer.process() executes handlers inline, so it stops draining the reader channel while the handler runs.
  • Backpressure: InnerConnection.loop() forwards inbound frames into reader. When process() is in a handler, the channel is not drained, causing the connection loop to block on its send.
  • The problem manifests at higher concurrency (message rate) when implicit buffers (socket, websocket, internal channel) are exhausted.
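
The last point, that buffering only postpones the block, can be illustrated with a small sketch. The function name sendsBeforeBlocking is hypothetical, and a plain Go channel stands in for the socket, websocket, and internal buffers mentioned above.

```go
package main

import "fmt"

// sendsBeforeBlocking counts how many frames of a burst a channel with the
// given capacity accepts when nothing is draining it.
func sendsBeforeBlocking(capacity, burst int) int {
	ch := make(chan int, capacity)
	accepted := 0
	for i := 0; i < burst; i++ {
		select {
		case ch <- i:
			accepted++
		default:
			// A blocking send would stall here until a reader drains ch.
			return accepted
		}
	}
	return accepted
}

func main() {
	fmt.Println(sendsBeforeBlocking(0, 100)) // prints 0
	fmt.Println(sendsBeforeBlocking(8, 100)) // prints 8
	fmt.Println(sendsBeforeBlocking(8, 5))   // prints 5
}
```

A burst smaller than the available capacity passes without any visible problem, which is consistent with the issue appearing only at higher message rates.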

Proposed Fixes

  • Decouple handler execution from process(): Run d.handler(...) in a goroutine or via a bounded worker pool.
  • Bound handler concurrency: Use a semaphore to limit in-flight work to a safe number, preventing sustained backpressure.
  • Buffer the peer’s inbound channel: Change reader to a buffered channel (size to be tuned).

Which network/s did you face the problem on?

Dev

Twin ID/s

No response

Version

No response

Node ID/s

No response

Farm ID/s

No response

Contract ID/s

No response

Relevant log output

NA

Metadata

Assignees

Labels

rmb-sdk (belongs to rmb), type_bug (Something isn't working)

Type

Projects

Status

Pending Review

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests
