Streams: faster response to potential network partitions

### Is your feature request related to a problem? Please describe.

Streams rely on erlang monitors for failure detection which in turn rely on the `net_ticktime` and `net_intensity` settings. This means it could take around 1 minute to detect that a writer is on a node that is no longer responding (either by being partitioned or because it had been force killed and the other nodes did not detect yet that the TCP connection has gone down).

Ra uses the `aten` library for failure detection and thus can fail over more quickly and regain availability. Ra can handle false positives very well where streams have a less light weight election process (the stream cluster needs to be restarted) and thus we still want to ensure we try very hard to act on false positives. That said it should be possible to do better than 1 minute by default. It is likely we can never reliably go as low as Ra which normally completes a new election within 5-10 seconds when a leader is partitioned off.

### Describe the solution you'd like

When the stream coordinator becomes leader it spawns a companion process which will register for node notifications from the `aten` application. When it receives a node down it can, like Ra, wait one aten poll interval and check if the node that is assumed to be down is still in `nodes()`. If it is and aten still thinks it is down it is likely it has been partitioned off (or hard killed). In this case this process can query the stream coordinator for all streams that have leaders (writers) on this node.

After receiving the results the process can check again if `aten` thinks the node is down and issue a restart stream command against the stream coordinator for each stream with a leader on the down node to force a new stream election.

If at any point the process detects that the local stream coordinator member is no longer the leader it should cease operations and hibernate until notified.

the stream coordinator will activate the local process by sending a message to it from the `state_enter(leader, ...` callback.

### Describe alternatives you've considered

_No response_

### Additional context

[_No response_](https://github.com/rabbitmq/rabbitmq-stream-java-client/discussions/395)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Streams: faster response to potential network partitions #9275

Is your feature request related to a problem? Please describe.

Describe the solution you'd like

Describe alternatives you've considered

Additional context

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Streams: faster response to potential network partitions #9275

Description

Is your feature request related to a problem? Please describe.

Describe the solution you'd like

Describe alternatives you've considered

Additional context

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions