Conversation

@Roasbeef (Member)

I got a bit nerd sniped and entered optimizoooor mode.

In this PR, we eliminate all the allocations from the WriteMessage+Flush methods.

To achieve this, we mainly ensure that we re-use static buffers, and that none of those buffers escape to the heap. I also added some documentation that describes the set of tools/techniques I used, so future devs/contributors can follow a similar optimizooooor loop.

At the start of this commit series, here's the benchmark output:

goos: darwin
goarch: arm64
pkg: github.com/lightningnetwork/lnd/brontide
cpu: Apple M4 Max
BenchmarkWriteMessage-16    	   22736	     50667 ns/op	   73788 B/op	       5 allocs/op
BenchmarkWriteMessage-16    	   23217	     50463 ns/op	   73788 B/op	       5 allocs/op
BenchmarkWriteMessage-16    	   24241	     49941 ns/op	   73788 B/op	       5 allocs/op
BenchmarkWriteMessage-16    	   23574	     51021 ns/op	   73788 B/op	       5 allocs/op
BenchmarkWriteMessage-16    	   23784	     49926 ns/op	   73788 B/op	       5 allocs/op
BenchmarkWriteMessage-16    	   24230	     50538 ns/op	   73788 B/op	       5 allocs/op
BenchmarkWriteMessage-16    	   24058	     49971 ns/op	   73788 B/op	       5 allocs/op
BenchmarkWriteMessage-16    	   23762	     50224 ns/op	   73788 B/op	       5 allocs/op
BenchmarkWriteMessage-16    	   24266	     53034 ns/op	   73788 B/op	       5 allocs/op
BenchmarkWriteMessage-16    	   23317	     50045 ns/op	   73788 B/op	       5 allocs/op
PASS
ok  	github.com/lightningnetwork/lnd/brontide	17.433s

At the end of the series, here's where we stand:

goos: darwin
goarch: arm64
pkg: github.com/lightningnetwork/lnd/brontide
cpu: Apple M4 Max
BenchmarkWriteMessage-16    	   25840	     46376 ns/op	       2 B/op	       0 allocs/op
BenchmarkWriteMessage-16    	   25646	     46672 ns/op	       2 B/op	       0 allocs/op
BenchmarkWriteMessage-16    	   25874	     46391 ns/op	       2 B/op	       0 allocs/op
BenchmarkWriteMessage-16    	   25512	     46427 ns/op	       2 B/op	       0 allocs/op
BenchmarkWriteMessage-16    	   25760	     46309 ns/op	       2 B/op	       0 allocs/op
BenchmarkWriteMessage-16    	   25789	     46520 ns/op	       2 B/op	       0 allocs/op
BenchmarkWriteMessage-16    	   25602	     46619 ns/op	       2 B/op	       0 allocs/op
BenchmarkWriteMessage-16    	   25766	     46464 ns/op	       2 B/op	       0 allocs/op
BenchmarkWriteMessage-16    	   25820	     46487 ns/op	       2 B/op	       0 allocs/op
BenchmarkWriteMessage-16    	   25634	     46553 ns/op	       2 B/op	       0 allocs/op
PASS
ok  	github.com/lightningnetwork/lnd/brontide	16.907s

Comparing them, we see all allocations have been eliminated:

goos: darwin
goarch: arm64
pkg: github.com/lightningnetwork/lnd/brontide
cpu: Apple M4 Max
                │   old.txt   │             final.txt              │
                │   sec/op    │   sec/op     vs base               │
WriteMessage-16   50.34µ ± 1%   46.48µ ± 0%  -7.68% (p=0.000 n=10)

                │    old.txt     │              final.txt              │
                │      B/op      │    B/op     vs base                 │
WriteMessage-16   73788.000 ± 0%   2.000 ± 0%  -100.00% (p=0.000 n=10)

                │  old.txt   │              final.txt              │
                │ allocs/op  │ allocs/op   vs base                 │
WriteMessage-16   5.000 ± 0%   0.000 ± 0%  -100.00% (p=0.000 n=10)

@Roasbeef added the golang/build system (Related to the go language and compiler), performance, docs, and no-changelog labels on Aug 30, 2025
@gemini-code-assist (bot) left a comment

Summary of Changes

Hello @Roasbeef, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request delivers a significant performance optimization by refactoring the message writing and flushing mechanisms in the brontide package. The core objective was to remove all dynamic memory allocations from these high-frequency operations, leading to reduced garbage collection overhead and more predictable latency, thereby enhancing the overall efficiency and responsiveness of the system.

Highlights

  • Zero Allocations: Eliminated all memory allocations from the WriteMessage and Flush methods within the brontide package, a critical hot path for message handling.
  • Performance Improvement: Achieved a 7.68% speed improvement in WriteMessage operations, as validated by comprehensive benchmarks.
  • Buffer Reuse: Implemented static buffer reuse for nonces, encrypted headers, and message bodies to prevent heap allocations and improve efficiency.
  • Documentation: Added a new detailed guide, docs/benchmark_perf_loop.md, explaining the Go performance optimization loop, including benchmarking, profiling, escape analysis, and zero-allocation strategies.

@gemini-code-assist (bot) left a comment

Code Review

This pull request is an excellent optimization effort that successfully eliminates all allocations from the WriteMessage and Flush methods. The use of pre-allocated buffers on the Machine and cipherState structs is a great application of common Go performance patterns. The new benchmark correctly validates the improvements. The new documentation file is a fantastic and detailed guide for future performance work. I have a few suggestions for the code and documentation to improve clarity, correctness, and adherence to the project's style guide. My feedback includes a point about a missing function comment as required by the style guide (lines 15-16).

@Roasbeef force-pushed the brontide-static-alloac branch from e8a612f to e829840 on August 30, 2025 14:21

In this commit, we add a benchmark for the WriteMessage method. This is
the first step in an upcoming optimizooor series of commits.
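
For illustration, the rough shape of such a benchmark looks like the sketch below. The setup helper is hypothetical, and the `WriteMessage([]byte) error` / `Flush(io.Writer) (int, error)` method shapes are assumptions here, not copied from the actual test:

```go
package brontide

import (
	"io"
	"testing"
)

// newBenchMachine is a hypothetical helper that returns a Machine with a
// completed handshake; the real wiring in the test may differ.
func newBenchMachine(b *testing.B) *Machine {
	b.Helper()
	// ... handshake setup elided ...
	return &Machine{}
}

// BenchmarkWriteMessage exercises the encrypt+flush hot path with the
// largest possible payload, reporting B/op and allocs/op so benchstat can
// compare runs across commits.
func BenchmarkWriteMessage(b *testing.B) {
	m := newBenchMachine(b)

	// 65535 bytes: the largest message body brontide will frame.
	payload := make([]byte, 65535)

	b.ReportAllocs()
	b.ResetTimer()

	for i := 0; i < b.N; i++ {
		if err := m.WriteMessage(payload); err != nil {
			b.Fatal(err)
		}
		if _, err := m.Flush(io.Discard); err != nil {
			b.Fatal(err)
		}
	}
}
```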

The baseline is:
```
goos: darwin
goarch: arm64
pkg: github.com/lightningnetwork/lnd/brontide
cpu: Apple M4 Max
BenchmarkWriteMessage-16    	   22736	     50667 ns/op	   73788 B/op	       5 allocs/op
BenchmarkWriteMessage-16    	   23217	     50463 ns/op	   73788 B/op	       5 allocs/op
BenchmarkWriteMessage-16    	   24241	     49941 ns/op	   73788 B/op	       5 allocs/op
BenchmarkWriteMessage-16    	   23574	     51021 ns/op	   73788 B/op	       5 allocs/op
BenchmarkWriteMessage-16    	   23784	     49926 ns/op	   73788 B/op	       5 allocs/op
BenchmarkWriteMessage-16    	   24230	     50538 ns/op	   73788 B/op	       5 allocs/op
BenchmarkWriteMessage-16    	   24058	     49971 ns/op	   73788 B/op	       5 allocs/op
BenchmarkWriteMessage-16    	   23762	     50224 ns/op	   73788 B/op	       5 allocs/op
BenchmarkWriteMessage-16    	   24266	     53034 ns/op	   73788 B/op	       5 allocs/op
BenchmarkWriteMessage-16    	   23317	     50045 ns/op	   73788 B/op	       5 allocs/op
PASS
ok  	github.com/lightningnetwork/lnd/brontide	17.433s
```
In this commit, we create fixed-size arrays for `nextHeaderSend` and
`nextBodySend`. This enables us to reuse these buffers, as we can pass
them into Encrypt/Seal directly, which avoids allocations for each write
that we do.
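
Conceptually, the change looks something like the sketch below (a rough sketch only; the exact sizes, field placement, and call sites in the real patch may differ). The key idea is that `Seal` appends into `buf[:0]`, and since the backing array lives on the struct with enough capacity, no per-write allocation is needed:

```go
package brontide

import "crypto/cipher"

const (
	lengthHeaderSize = 2     // 2-byte packet length prefix.
	macSize          = 16    // poly1305 tag appended by the AEAD.
	maxMsgSize       = 65535 // largest message body we'll frame.
)

// Machine (sketch): the per-write ciphertext buffers are fixed-size arrays
// embedded in the struct, allocated exactly once with the Machine and then
// reused for every WriteMessage call.
type Machine struct {
	nextHeaderSend [lengthHeaderSize + macSize]byte
	nextBodySend   [maxMsgSize + macSize]byte
}

// encryptBody appends the ciphertext into nextBodySend[:0]. The array
// always has enough capacity, so Seal never grows the slice and nothing
// new escapes to the heap on this path.
func (m *Machine) encryptBody(aead cipher.AEAD, nonce, body []byte) []byte {
	return aead.Seal(m.nextBodySend[:0], nonce, body, nil)
}
```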

After this patch, we see the benchmarks improve:
```
goos: darwin
goarch: arm64
pkg: github.com/lightningnetwork/lnd/brontide
cpu: Apple M4 Max
BenchmarkWriteMessage-16    	   25198	     46795 ns/op	      34 B/op	       3 allocs/op
BenchmarkWriteMessage-16    	   26331	     46186 ns/op	      34 B/op	       3 allocs/op
BenchmarkWriteMessage-16    	   25608	     46308 ns/op	      34 B/op	       3 allocs/op
BenchmarkWriteMessage-16    	   25849	     46302 ns/op	      34 B/op	       3 allocs/op
BenchmarkWriteMessage-16    	   26318	     46285 ns/op	      34 B/op	       3 allocs/op
BenchmarkWriteMessage-16    	   25622	     46147 ns/op	      34 B/op	       3 allocs/op
BenchmarkWriteMessage-16    	   26287	     46506 ns/op	      34 B/op	       3 allocs/op
BenchmarkWriteMessage-16    	   26155	     46827 ns/op	      34 B/op	       3 allocs/op
BenchmarkWriteMessage-16    	   25256	     46910 ns/op	      34 B/op	       3 allocs/op
BenchmarkWriteMessage-16    	   25318	     46876 ns/op	      34 B/op	       3 allocs/op
PASS
ok  	github.com/lightningnetwork/lnd/brontide	17.092s
```

```
goos: darwin
goarch: arm64
pkg: github.com/lightningnetwork/lnd/brontide
cpu: Apple M4 Max
                │   old.txt   │              new.txt               │
                │   sec/op    │   sec/op     vs base               │
WriteMessage-16   50.34µ ± 1%   46.41µ ± 1%  -7.82% (p=0.000 n=10)

                │    old.txt    │              new.txt               │
                │     B/op      │    B/op     vs base                │
WriteMessage-16   73788.00 ± 0%   34.00 ± 0%  -99.95% (p=0.000 n=10)

                │  old.txt   │              new.txt               │
                │ allocs/op  │ allocs/op   vs base                │
WriteMessage-16   5.000 ± 0%   3.000 ± 0%  -40.00% (p=0.000 n=10)
```
…achine

In this commit, we use a fixed sized buffer for the nonce when we
read/write messages. This was actually escaping to the heap. We can
avoid this by statically allocating it alongside the struct itself.
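
The idea, as a sketch (field names and layout are illustrative): keep the 12-byte nonce as an array field, so the slice handed to the AEAD is backed by the struct itself rather than a fresh heap allocation on every call:

```go
package brontide

import "encoding/binary"

type cipherState struct {
	// nonce counts messages sent/received under the current key.
	nonce uint64

	// nonceBuf is reused for every encrypt/decrypt call. A local
	// `var nonce [12]byte` escapes to the heap because its slice is
	// passed into the AEAD interface; as a struct field it's allocated
	// once with the cipherState.
	nonceBuf [12]byte
}

// nextNonce encodes the 64-bit counter into the reusable buffer using the
// BOLT-8 layout: 4 zero bytes followed by the little-endian counter.
func (c *cipherState) nextNonce() []byte {
	binary.LittleEndian.PutUint64(c.nonceBuf[4:], c.nonce)
	c.nonce++
	return c.nonceBuf[:]
}
```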

The benchmark state at this point:

```
goos: darwin
goarch: arm64
pkg: github.com/lightningnetwork/lnd/brontide
cpu: Apple M4 Max
BenchmarkWriteMessage-16    	   25264	     47012 ns/op	       4 B/op	       1 allocs/op
BenchmarkWriteMessage-16    	   23542	     46809 ns/op	       4 B/op	       1 allocs/op
BenchmarkWriteMessage-16    	   25989	     47256 ns/op	       4 B/op	       1 allocs/op
BenchmarkWriteMessage-16    	   25542	     46388 ns/op	       4 B/op	       1 allocs/op
BenchmarkWriteMessage-16    	   26083	     46612 ns/op	       4 B/op	       1 allocs/op
BenchmarkWriteMessage-16    	   25860	     46367 ns/op	       4 B/op	       1 allocs/op
BenchmarkWriteMessage-16    	   24967	     46748 ns/op	       4 B/op	       1 allocs/op
BenchmarkWriteMessage-16    	   26088	     46485 ns/op	       4 B/op	       1 allocs/op
BenchmarkWriteMessage-16    	   25561	     46425 ns/op	       4 B/op	       1 allocs/op
BenchmarkWriteMessage-16    	   25474	     47249 ns/op	       4 B/op	       1 allocs/op
PASS
ok  	github.com/lightningnetwork/lnd/brontide	16.911s
```

```
goos: darwin
goarch: arm64
pkg: github.com/lightningnetwork/lnd/brontide
cpu: Apple M4 Max
                │   old.txt   │              new2.txt              │
                │   sec/op    │   sec/op     vs base               │
WriteMessage-16   50.34µ ± 1%   46.68µ ± 1%  -7.28% (p=0.000 n=10)

                │    old.txt     │              new2.txt              │
                │      B/op      │    B/op     vs base                │
WriteMessage-16   73788.000 ± 0%   4.000 ± 0%  -99.99% (p=0.000 n=10)

                │  old.txt   │              new2.txt              │
                │ allocs/op  │ allocs/op   vs base                │
WriteMessage-16   5.000 ± 0%   1.000 ± 0%  -80.00% (p=0.000 n=10)
```
In this commit, we eliminate the final allocation that takes place when
we write out messages. Once again this was escaping to the heap, so we
make it an attribute on the Machine struct, which allows pure static
allocation.

```
goos: darwin
goarch: arm64
pkg: github.com/lightningnetwork/lnd/brontide
cpu: Apple M4 Max
BenchmarkWriteMessage-16    	   25840	     46376 ns/op	       2 B/op	       0 allocs/op
BenchmarkWriteMessage-16    	   25646	     46672 ns/op	       2 B/op	       0 allocs/op
BenchmarkWriteMessage-16    	   25874	     46391 ns/op	       2 B/op	       0 allocs/op
BenchmarkWriteMessage-16    	   25512	     46427 ns/op	       2 B/op	       0 allocs/op
BenchmarkWriteMessage-16    	   25760	     46309 ns/op	       2 B/op	       0 allocs/op
BenchmarkWriteMessage-16    	   25789	     46520 ns/op	       2 B/op	       0 allocs/op
BenchmarkWriteMessage-16    	   25602	     46619 ns/op	       2 B/op	       0 allocs/op
BenchmarkWriteMessage-16    	   25766	     46464 ns/op	       2 B/op	       0 allocs/op
BenchmarkWriteMessage-16    	   25820	     46487 ns/op	       2 B/op	       0 allocs/op
BenchmarkWriteMessage-16    	   25634	     46553 ns/op	       2 B/op	       0 allocs/op
PASS
ok  	github.com/lightningnetwork/lnd/brontide	16.907s
```

With this, we now have eliminated 100% of the allocations!

```
goos: darwin
goarch: arm64
pkg: github.com/lightningnetwork/lnd/brontide
cpu: Apple M4 Max
                │   old.txt   │             final.txt              │
                │   sec/op    │   sec/op     vs base               │
WriteMessage-16   50.34µ ± 1%   46.48µ ± 0%  -7.68% (p=0.000 n=10)

                │    old.txt     │              final.txt              │
                │      B/op      │    B/op     vs base                 │
WriteMessage-16   73788.000 ± 0%   2.000 ± 0%  -100.00% (p=0.000 n=10)

                │  old.txt   │              final.txt              │
                │ allocs/op  │ allocs/op   vs base                 │
WriteMessage-16   5.000 ± 0%   0.000 ± 0%  -100.00% (p=0.000 n=10)
```
In this commit, we add some docs that explain how to use tools like
heap escape analysis and memory profiling to fully eliminate allocations
in a sample program.

This guide is meant to help devs/contributors use Go's excellent perf
tools to zoom in on an optimization problem.
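
As a condensed example of the loop the guide describes (the functions below are made up for illustration, not taken from the doc or this PR):

```go
package main

// Build with escape analysis diagnostics enabled:
//
//	go build -gcflags='-m' .
//
// and measure allocations with:
//
//	go test -bench=. -benchmem -memprofile=mem.out
//
// The compiler output ("... escapes to heap", "moved to heap: ...") points
// straight at the allocations that a benchmark's allocs/op column counts.

// frameAlloc allocates a fresh buffer on every call; -m reports the make()
// as escaping because the result is returned to the caller.
func frameAlloc(msg []byte) []byte {
	buf := make([]byte, 0, len(msg))
	return append(buf, msg...)
}

// frameReuse writes into a caller-provided buffer instead, so steady-state
// calls perform zero allocations as long as dst has enough capacity.
func frameReuse(dst, msg []byte) []byte {
	return append(dst[:0], msg...)
}

func main() {
	var scratch [64]byte
	_ = frameAlloc([]byte("allocates every call"))
	_ = frameReuse(scratch[:], []byte("reuses the caller's array"))
}
```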
@Roasbeef force-pushed the brontide-static-alloac branch from e829840 to 21c21b7 on August 30, 2025 14:51
@yyforyongyu added this to the v0.20.0 milestone Sep 1, 2025

@yyforyongyu (Member) commented Sep 1, 2025

/gemini review

@yyforyongyu (Member) left a comment

Impressive results from a simple tweak! Just need to fix the CI, otherwise LGTM🙏

@@ -0,0 +1,378 @@
# The Go Performance Optimization Loop: From Benchmarks to Zero Allocations

❤️

@morehouse (Collaborator)

So the net effect is:

  • We save 4 microseconds per message sent.
  • We preallocate 64 KB for every peer.

I'm not sure this change is a good tradeoff... Network RTTs are already orders of magnitude higher than the performance improvement, even within the same datacenter. A 4 us improvement is lost in the noise. Meanwhile we increase typical memory usage by 64 KB per peer.

@saubyk requested a review from gijswijs September 2, 2025 17:06
@Roasbeef (Member, Author) commented Sep 2, 2025

So the net effect is:
We save 4 microseconds per message sent.

This is more about entirely eliminating allocations when we write messages, the minor (~7%) speed up isn't the main target at all.

We preallocate 64 KB for every peer.

Correct that we pre-allocate 64 KB per peer, but this memory allocation is now bounded, and doesn't need to later be marked by the GC to be freed (cycles saved traversing the object graph). Memory utilization will now be smoothed out, dampening the peaks you'd see in a typical graph of allocated memory over time for GC'd languages. A node with 1k active connections does indeed see a static increase of 64 MB, but that doesn't tell the whole story.

Before, we'd allocate 5x each time we went to write out a message to the peer (individual buffers for the nonce, header, body, and packet size -- so more objects for the GC to chase).

Rather than focusing on that single snapshot of allocation, we should consider the rate of allocation over time. Each time a single peer sends us a 65 KB message, we're generating garbage that later needs to be cleaned up, which adds overhead for the GC. The overhead we save isn't necessarily showing up in this profile, as it's a short-lived benchmark. On the macro level, we increase the overall performance of the entire daemon, as less time is spent on GC cycles (stop-the-world).

@saubyk added this to lnd v0.20 Sep 2, 2025
@saubyk moved this to In progress in lnd v0.20 Sep 2, 2025
@saubyk moved this from In progress to In review in lnd v0.20 Sep 2, 2025
@morehouse (Collaborator)

Correct that we pre-allocate 64 KB per peer, but this memory allocation is now bounded, and doesn't need to later be marked by the GC to be freed (cycles saved traversing the object graph). Memory utilization will now be smoothed out, dampening the peaks you'd see in a typical graph of allocated memory over time for GC'd languages. A node with 1k active connections does indeed see a static increase of 64 MB, but that doesn't tell the whole story.

Sure, but dampening the peaks by raising the water level isn't always good; e.g., that's why we don't preallocate 10 GB at LND startup.

These allocations are typically very short lived -- they only need to exist long enough to send the outgoing message. They're also typically quite small (orders of magnitude smaller than the 64 KB worst case). And most peers don't constantly send messages. In these conditions, preallocating for the worst case isn't necessarily a win.

Rather than focusing on that single snapshot of allocation, we should consider the rate of allocation over time. If a single peer sends us a 65 KB message, each time we're generating garbage that later needs to be cleaned up, which adds overhead for the GC. The overhead we save isn't necessarily showing up in this profile, as it's a short lived benchmark. On the macro level, we increase the overall performance of the entire daemon, as less time is spent on GC cycles (stop-the-world).

Can we write a better benchmark then? Or get a before-and-after full node CPU profile under typical conditions and also under heavy load? Let's use data to drive the optimizations rather than guessing.

In this commit, we address a suuuper old TODO to reset the brontide
state. We were allocating a read buf which we now set to nil, and we
free up the original brontide reference.
@gijswijs (Collaborator) left a comment

They're also typically quite small (orders of magnitude smaller than the 64 KB worst case). And most peers don't constantly send messages. In these conditions, preallocating for the worst case isn't necessarily a win.

You're not wrong, but the stated goal is zero (heap) allocs, so then you will have to pre-allocate the max message size.

- The percentage change (vs base column) showing the magnitude of improvement

- The p-value (p=0.000) indicating statistical significance - values below 0.05
suggest real improvements rather than noise

The Bayesian in me disagrees, but let's keep a frequentist frame of reference for the statistics.

This is a great guide by the way.

@morehouse (Collaborator)

You're not wrong, but the stated goal is zero (heap) allocs, so then you will have to pre-allocate the max message size.

Ultimately the goal is improved performance. Eliminating heap allocs is one way to pursue that goal, but this is likely a case where there's no performance benefit 99.9% of the time, since GC is already efficient for the small, infrequent, short-lived allocations being eliminated.

There may actually be a performance regression when not under DoS attack, since we're now creating large long-lived allocations for every peer instead of small short-lived ones. Every time GC occurs, those large allocations need to be scanned in their entirety, causing GC pauses to last longer.

And regardless of the performance improvement or regression, there's a guaranteed cost in memory overhead that this change introduces. This overhead needs to be weighed against whatever potential performance benefit there may be.

This is why we need good performance data to drive the optimization cycle. Guessing often produces incorrect results, while increasing code complexity.

@yyforyongyu (Member) left a comment

This PR aims at adding a worst-case bound, i.e. 65 KB per peer, for the encryption of wire msgs. I think it's a good starting point given this is how we handle writing plaintext. However, I overlooked that we use a pool instead of static upfront allocations:

lnd/peer/brontide.go, lines 2637 to 2652 in 9d74ec4:

```go
	// Otherwise, this is a new message. We'll acquire a write buffer to
	// serialize the message and buffer the ciphertext on the connection.
	err := p.cfg.WritePool.Submit(func(buf *bytes.Buffer) error {
		// Using a buffer allocated by the write pool, encode the
		// message directly into the buffer.
		_, writeErr := lnwire.WriteMessage(buf, msg, 0)
		if writeErr != nil {
			return writeErr
		}

		// Finally, write the message itself in a single swoop. This
		// will buffer the ciphertext on the underlying connection. We
		// will defer flushing the message until the write pool has been
		// released.
		return noiseConn.WriteMessage(buf.Bytes())
	})
```

When encoding the plaintext, we take a buffer from the write pool and return it once done. The size of the buffer is 65 KB, but this is very different from static upfront allocations - if we have 10 peers (out of 1000) writing msgs at the same time, this PR would need 65 MB of memory, while the write pool only needs 650 KB.

Given we already have this write pool, which allocates at most 100 objects, memory is bounded at 6.5 MB; and it has its own dynamic GC, which releases a buffer when it hasn't been used for 30s. I think we can use it instead, though the current WritePool needs more methods than a single Submit - specifically, it needs a Take and a Return, which are already built into bufferPool; we just need to expose them in WritePool. Then the writeMessage shown above will (see the sketch after this list):

  1. take a buffer from the pool when writing the msg, which includes encoding and encryption.
  2. return the buffer when it's flushed.
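
As a sketch of that flow only (Take and Return don't exist on WritePool today; p, msg, and noiseConn are as in the snippet quoted above, and the Flush call shape is an assumption):

```go
// 1. Borrow a buffer for the whole encode+encrypt+flush lifetime.
buf := p.cfg.WritePool.Take()
// 2. Hand it back only once the ciphertext has been flushed.
defer p.cfg.WritePool.Return(buf)

// Encode the plaintext wire message into the pooled buffer.
if _, err := lnwire.WriteMessage(buf, msg, 0); err != nil {
	return err
}

// Encrypt, stage the ciphertext on the connection, and flush it before the
// deferred Return makes the buffer available to the next writer.
if err := noiseConn.WriteMessage(buf.Bytes()); err != nil {
	return err
}
if _, err := noiseConn.Flush(); err != nil {
	return err
}
```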

The other aspect I didn't realize was the increased GC pressure - on its own it seems like we could freeze a node by increasing the number of peers, but that number is already restricted.

@Roasbeef (Member, Author) commented Sep 6, 2025

Plan to run this across a few diff nodes before landing.

@morehouse (Collaborator) left a comment

NACK.

The current benchmark suggests a maximum 7% performance improvement in extremely pathological cases (constant streaming of 64KB messages to the peer). The benchmark is extremely unrealistic, even under a DoS attack, because in practice Flush actually waits for at least the network RTT before returning, which takes ~1000x more time than the benchmark currently takes into account. This suggests the actual performance improvement under DoS is basically 0%.

@Roasbeef mentioned that there's other benefit not captured by the benchmark, since GC likely doesn't happen during the short-lived benchmark. ISTM the benchmark already captures GC behavior, since it sends data at a rate higher than 1 GB/s based on the benchmark data in the OP. Even running the benchmark for just 1 second likely causes GC to run multiple times.

So zero benefit, plus ~30x increase in per-Brontide memory is a loss. Until I see data suggesting otherwise, I can't get on board with this.

Finally, the latest commit introduced an extremely trivial DoS vector. Oops.

Comment on lines 236 to +240:

```go
func (c *Conn) Close() error {
	// TODO(roasbeef): reset brontide state?
	// Clear the state we created to be able to handle this connection.
	c.noise = nil
	c.readBuf = bytes.Buffer{}
```

This is a null deref waiting to happen. It is common for Brontide goroutines to continue calling Conn methods after the connection has been closed.

@Roasbeef (Member, Author) commented Sep 10, 2025

Currently disconnect is guarded by an atomic variable, so callers to it will only succeed once.

If I look at the other callers into p.conn, I see one on the String method, but not many others.

In either case, we can have a similar CAS check on calls into the connection after it's been closed. Thanks for pointing this out; I hadn't gotten around to testing this on an actual node yet.
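
A rough sketch of that kind of guard, assuming an atomic flag on the Conn (names, fields, and the sentinel error below are hypothetical):

```go
package brontide

import (
	"bytes"
	"errors"
	"sync/atomic"
)

// ErrConnClosed is a hypothetical sentinel for calls made after Close.
var ErrConnClosed = errors.New("brontide: connection closed")

// Machine is a stand-in here for the real noise state machine.
type Machine struct{}

// Conn (sketch): one atomic flag makes the teardown in Close one-shot and
// lets every other method bail out instead of touching freed state.
type Conn struct {
	closed atomic.Bool

	noise   *Machine
	readBuf bytes.Buffer
}

// Close releases the noise state exactly once; repeat calls are no-ops.
func (c *Conn) Close() error {
	if !c.closed.CompareAndSwap(false, true) {
		return nil
	}

	c.noise = nil
	c.readBuf = bytes.Buffer{}

	return nil
}

// useNoise is an illustrative stand-in for any method that still needs the
// noise state: check the flag first rather than risking a nil deref.
func (c *Conn) useNoise() error {
	if c.closed.Load() {
		return ErrConnClosed
	}

	// ... safe to use c.noise here ...
	return nil
}
```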

@Roasbeef (Member, Author) commented Sep 11, 2025

Given we already have this write pool, which allocates at most 100 objects, memory is bounded at 6.5 MB; and it has its own dynamic GC, which releases a buffer when it hasn't been used for 30s. I think we can use it instead, though the current WritePool needs more methods than a single Submit - specifically, it needs a Take and a Return, which are already built into bufferPool; we just need to expose them in WritePool.

I think this makes a lot of sense. I'll take a stab at this implementation. I think the smaller static buffers can still be kept as part of the brontide.Machine struct.

I've been running a modified version (a few different iterations) against some live nodes, and there's a clear difference between this and the current state of master. I think your idea is even better though, as it reuses a system we already have to reuse buffers for serializing messages, and we can extend it to cover the ciphertexts as well.

@Roasbeef (Member, Author)

Thanks everyone for all the feedback.

I've run some benchmarks with 3 variants of this PR:

  1. One that fixes the issue Morehouse pointed out above, using an atomic pointer to swap out the brontide.Machine pointer: master...Roasbeef:lnd:brontide-fix-nil.
  2. One that uses yy's suggestion to just hook into the existing buffer pool: master...Roasbeef:lnd:brontide-buf-pool.
  3. A variant on the above, that modifies the pool (temp patch here, has wider implications) to use 2x the number of restricted connections: master...Roasbeef:lnd:brontide-buf-fixed-pool.

PR as is

With that extra commit, when run with a stress test program (thousands of peers), the PR holds up fairly well. Compared to the others, it allocates more when calling brontide.NewMachine.

Here's a flame graph after running the stress test program for a few minutes:
[Screenshot 2025-09-11 at 3 55 50 PM]

In this case, a lot of the overhead is actually creating the peer struct.

PR with buf pools

When adding the buf pools, the issue is that even though we create 100 workers, the pool will continue to allocate new buffers. We use a very small timeout of just 1 ms, so most of the time it falls through to creating a new set of buffers. So although buffers can be re-used, we still see a lot of allocations, since in practice we're basically allocating entirely new buffers each time.

You can see this clearly in the flamegraph:

[Screenshot 2025-09-11 at 4 01 47 PM]

PR with fixed buf pools

For this variant, I increased the returnQueue size to 2x the number of restricted connections. Then I modified the Take() logic to block until a buffer becomes available, rather than creating a new one.

This performs better than without these changes, as we're now forcing a smaller working set. There are tradeoffs, though: write throughput with many peers may be hampered, since we'll now block waiting for buffers instead of creating them as needed.
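
Mechanically, a blocking Take over a fixed set of buffers boils down to a buffered channel; the sketch below uses hypothetical sizes and is not the actual pool patch:

```go
package main

// bufferPool sketch: a fixed number of max-size buffers circulate through a
// channel. Take blocks until one is returned, which caps the working set at
// poolSize buffers but can stall writers when every buffer is in use.
type bufferPool struct {
	bufs chan []byte
}

func newBufferPool(poolSize, bufSize int) *bufferPool {
	p := &bufferPool{bufs: make(chan []byte, poolSize)}
	for i := 0; i < poolSize; i++ {
		p.bufs <- make([]byte, bufSize)
	}
	return p
}

// Take blocks until a buffer is available instead of falling back to a
// fresh allocation after a short timeout.
func (p *bufferPool) Take() []byte { return <-p.bufs }

// Return hands the buffer back for the next writer.
func (p *bufferPool) Return(b []byte) { p.bufs <- b }

func main() {
	// e.g. 2x the restricted connection count, 65 KB + MAC per buffer.
	pool := newBufferPool(200, 65535+16)

	buf := pool.Take()
	defer pool.Return(buf)
	_ = buf
}
```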

[Screenshot 2025-09-11 at 4 42 55 PM]

Here's a slop report generated from the raw profile files: https://gist.github.com/Roasbeef/095e3d91639809731253253eabe8a26f

Thoughts?

@yyforyongyu (Member) left a comment

Thanks for the updates! I prefer the last one, brontide-buf-fixed-pool, as it eliminates memory allocations (reducing GC pressure and improving latency), improves performance under load, and gives more predictable memory allocation. One concern is leaking buffers from the pool when peers disconnect abruptly. Also, the 24-hour timeout seems too aggressive and could be replaced with a much shorter value, like seconds.

@morehouse (Collaborator)

@Roasbeef: Thanks for collecting more data. I don't think the data addresses any of my expressed concerns though. Key information needed to properly optimize:

  • What is the "stress test program"?
  • How do the variants handle regular traffic (no stress test)?
  • A control group (i.e. master branch) is needed.
  • CPU profiles of each variant are critical to determine if the CPU-memory tradeoff is good.
  • Ideally provide the SVGs for the web visualizations of all CPU and heap profiles, so we can study them ourselves.

I expressed concerns that this PR increases memory usage vs master. So we need memory usage data for master to properly evaluate that.

I also expressed concerns that this PR will perform poorly under regular traffic, potentially even increasing CPU usage. We need CPU profiles to evaluate that.

We especially need CPU data for time spent during GC. For example, if the master branch only spends 5% of its CPU time in GC, then eliminating allocations across all of LND could only increase performance by ~5%! If that's the case, increasing code complexity to eliminate a small fraction of allocations was silly from the start, and all the discussion on this PR has wasted everyone's time.


@yyforyongyu: I am skeptical that using the buffer pool will provide good memory or CPU benefit in practice. Why create our own garbage collection scheme instead of relying on Go's optimized garbage collector? Implementing a GC is tricky, which @Roasbeef's "PR with buf pools" clearly demonstrates -- it increases memory usage due to slow memory release.

The "PR with fixed buf pools" makes a memory-time tradeoff --- it imposes a strong memory limit by introducing blocking behavior. We really need data about how much additional latency is added by this change to decide if the tradeoff is worthwhile. Since previously there was a maximum 1ms pause and now it blocks forever, this could have a major impact on node responsiveness to peer messages.

However, the 24-hour timeout seems too aggressive and can be replaced with a much short value like seconds.

Even a few seconds of blocking would cause a huge increase in send latency, something like 100x. If this happens frequently, it could be a big problem for users.


There's another variant that I think would be worth benchmarking. It would be the simplest variant considered so far, and I think it will provide the same or better memory usage in practice as any of the other variants.

0001-brontide-free-pending-message-after-flushing.patch

The main idea is to work with the garbage collector rather than trying to avoid it. We just drop the reference to the pending message as soon as we no longer need it, so that we no longer prevent the garbage collector from reclaiming that buffer. This is a big improvement over master, which currently keeps a reference until a new message needs to be written out or until the entire Brontide is garbage collected.

This should provide memory benefit both under DoS and with regular usage. No new CPU overhead.
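
Roughly, the patch amounts to something like the sketch below (against master's slice-based buffers; it skips the real Flush's partial-write bookkeeping and exact field names):

```go
package brontide

import "io"

// Machine sketch with master's slice-based pending-send buffers.
type Machine struct {
	nextHeaderSend []byte
	nextBodySend   []byte
}

// Flush writes out the staged ciphertext and then drops the references, so
// the GC can reclaim the (up to ~64 KB) body buffer as soon as the message
// is on the wire, instead of it staying pinned until the next WriteMessage
// or until the whole Machine is collected.
func (m *Machine) Flush(w io.Writer) (int, error) {
	var total int

	if len(m.nextHeaderSend) > 0 {
		n, err := w.Write(m.nextHeaderSend)
		total += n
		if err != nil {
			return total, err
		}
		m.nextHeaderSend = nil
	}

	if len(m.nextBodySend) > 0 {
		n, err := w.Write(m.nextBodySend)
		total += n
		if err != nil {
			return total, err
		}
		m.nextBodySend = nil
	}

	return total, nil
}
```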

@Roasbeef (Member, Author) commented Sep 16, 2025

What is the "stress test program"?

It just simulates the case of a routing node with maxed-out restricted connections, so a case where each new connection hits the limit, fails, and maybe eventually succeeds once a new slot frees up.

I still have the profiles lying around, so I can upload them somewhere.

I also expressed concerns that this PR will perform poorly under regular traffic, potentially even increasing CPU usage. We need CPU profiles to evaluate that.

I haven't observed that at all in my tests, at least not for non-GC CPU utilization. What mechanism in the PR do you think would lead to additional active CPU usage?

We especially need CPU data for time spent during GC. For example, if the master branch only spends 5% of its CPU time in GC, then eliminating allocations across all of LND could only increase performance by ~5%!

I think for something like this, we'd need to pin down a given node profile. As an example, a leaf node with only private channels has a very different access pattern from a very large routing node with many forwarding attempts, mobile clients attempting to connect to it, etc.

The amount of time spent in GC also varies across the lifetime of a node. E.g., downloading the channel graph for the very first time generates a lot more garbage than idle operation.

Just to set a baseline here: we're not after some grand optimization across all of lnd, we're just focused on allocations due to p2p processing. In any case, if someone has a patch that increases perf by 5%, I think we'd certainly take a look at it.

If that's the case, increasing code complexity to eliminate a small fraction of allocations was silly from the start, and all the discussion on this PR has wasted everyone's time.

Are you referring to this PR as is? There's virtually no code complexity increase; it just moves some allocations around. For the other two PRs, the added code diff is also pretty minor (the largest diff being the pointer swapping in one of the branches). Do you have explicit concerns with either of them? Granted, the buffer change shifts around some of the callers a bit.

I wouldn't say the discussion in this PR wasted anyone's time; it's just one of the many open PRs that we all review daily/weekly to continually work towards improving lnd.

Why create our own garbage collection scheme instead of relying on Go's optimized garbage collector?

Pooling medium- to short-lived objects for re-use is a pretty standard pattern, even outside GC'd languages. Go even provides sync.Pool for this very use case. Object re-use means fewer allocations.
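
For reference, the standard-library flavor of that pattern is only a few lines (generic example, not lnd code):

```go
package main

import (
	"bytes"
	"sync"
)

// bufPool hands out *bytes.Buffer values for reuse. The runtime is free to
// drop pooled objects during GC, so this bounds steady-state allocation
// without pinning memory forever.
var bufPool = sync.Pool{
	New: func() any { return new(bytes.Buffer) },
}

// writeFrame borrows a buffer, uses it to assemble a frame, and copies the
// result out before the buffer goes back to the pool for the next caller.
func writeFrame(payload []byte) []byte {
	buf := bufPool.Get().(*bytes.Buffer)
	defer func() {
		buf.Reset()
		bufPool.Put(buf)
	}()

	buf.Write(payload)

	out := make([]byte, buf.Len())
	copy(out, buf.Bytes())
	return out
}

func main() {
	_ = writeFrame([]byte("hello"))
}
```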

I think the question we're attempting to answer is: is it worthwhile to try to clamp down on the total amount and the rate of allocations to service peers?

If we're able to do that, trading off some latency (10s of ms is fine, since we're dealing with round-trip delays anyway), then we reduce the memory utilization of the daemon overall, as we're clamping the amount of live garbage we generate due to p2p processing.

@Roasbeef (Member, Author)

Here're the profiles from the initial tests:

noise-buf-allocs-20250911-154557.pb.gz
noise-buf-heap-20250911-154547.pb.gz
noise-buffer-nil-allocs-20250915-180131.pb.gz
noise-buffer-nil-allocs-20250915-180840.pb.gz
noise-buffer-nil-heap-20250915-180124.pb.gz
noise-buffer-nil-heap-20250915-180803.pb.gz
noise-fixed-buf-allocs-20250911-162421.pb.gz
noise-fixed-buf-heap-20250911-162413.pb.gz
noise-nil-allocs-20250910-183907.pb.gz
noise-nil-heap-20250910-183849.pb.gz

noise-buffer-nil is the patch posted above. It performs on par with noise-nil-allocs, which is this PR, plus the pointer hot swap. I agree nil-ing those buffers is simpler than the atomic pointer hot swap that noise-nil-allocs implements. The main difference between those two is that noise-buffer-nil ends up allocating GBs of memory in the encryption library we use, as it allocates a new buffer for each write.

[Screenshot 2025-09-15 at 6 12 22 PM]

Here's an updated version of that slop report: https://gist.github.com/Roasbeef/095e3d91639809731253253eabe8a26f/revisions.
