Conversation

@Roasbeef (Member)

I got a bit nerd sniped and entered optimizoooor mode.

In this PR, we eliminate all the allocations from the WriteMessage+Flush methods.

To achieve this, we mainly ensure that we re-use static buffers, and that none of those buffers escape to the heap. I also added some documentation that describes the set of tools/techniques I used, so future devs/contributors can follow a similar optimizooooor loop.

At the start of this commit series, here's the benchmark output:

goos: darwin
goarch: arm64
pkg: github.com/lightningnetwork/lnd/brontide
cpu: Apple M4 Max
BenchmarkWriteMessage-16    	   22736	     50667 ns/op	   73788 B/op	       5 allocs/op
BenchmarkWriteMessage-16    	   23217	     50463 ns/op	   73788 B/op	       5 allocs/op
BenchmarkWriteMessage-16    	   24241	     49941 ns/op	   73788 B/op	       5 allocs/op
BenchmarkWriteMessage-16    	   23574	     51021 ns/op	   73788 B/op	       5 allocs/op
BenchmarkWriteMessage-16    	   23784	     49926 ns/op	   73788 B/op	       5 allocs/op
BenchmarkWriteMessage-16    	   24230	     50538 ns/op	   73788 B/op	       5 allocs/op
BenchmarkWriteMessage-16    	   24058	     49971 ns/op	   73788 B/op	       5 allocs/op
BenchmarkWriteMessage-16    	   23762	     50224 ns/op	   73788 B/op	       5 allocs/op
BenchmarkWriteMessage-16    	   24266	     53034 ns/op	   73788 B/op	       5 allocs/op
BenchmarkWriteMessage-16    	   23317	     50045 ns/op	   73788 B/op	       5 allocs/op
PASS
ok  	github.com/lightningnetwork/lnd/brontide	17.433s

At the end of the series, here's where we stand:

goos: darwin
goarch: arm64
pkg: github.com/lightningnetwork/lnd/brontide
cpu: Apple M4 Max
BenchmarkWriteMessage-16    	   25840	     46376 ns/op	       2 B/op	       0 allocs/op
BenchmarkWriteMessage-16    	   25646	     46672 ns/op	       2 B/op	       0 allocs/op
BenchmarkWriteMessage-16    	   25874	     46391 ns/op	       2 B/op	       0 allocs/op
BenchmarkWriteMessage-16    	   25512	     46427 ns/op	       2 B/op	       0 allocs/op
BenchmarkWriteMessage-16    	   25760	     46309 ns/op	       2 B/op	       0 allocs/op
BenchmarkWriteMessage-16    	   25789	     46520 ns/op	       2 B/op	       0 allocs/op
BenchmarkWriteMessage-16    	   25602	     46619 ns/op	       2 B/op	       0 allocs/op
BenchmarkWriteMessage-16    	   25766	     46464 ns/op	       2 B/op	       0 allocs/op
BenchmarkWriteMessage-16    	   25820	     46487 ns/op	       2 B/op	       0 allocs/op
BenchmarkWriteMessage-16    	   25634	     46553 ns/op	       2 B/op	       0 allocs/op
PASS
ok  	github.com/lightningnetwork/lnd/brontide	16.907s

Comparing them, we see all allocations have been eliminated:

goos: darwin
goarch: arm64
pkg: github.com/lightningnetwork/lnd/brontide
cpu: Apple M4 Max
                │   old.txt   │             final.txt              │
                │   sec/op    │   sec/op     vs base               │
WriteMessage-16   50.34µ ± 1%   46.48µ ± 0%  -7.68% (p=0.000 n=10)

                │    old.txt     │              final.txt              │
                │      B/op      │    B/op     vs base                 │
WriteMessage-16   73788.000 ± 0%   2.000 ± 0%  -100.00% (p=0.000 n=10)

                │  old.txt   │              final.txt              │
                │ allocs/op  │ allocs/op   vs base                 │
WriteMessage-16   5.000 ± 0%   0.000 ± 0%  -100.00% (p=0.000 n=10)

@Roasbeef added the golang/build system (Related to the go language and compiler), performance, docs, and no-changelog labels on Aug 30, 2025
@gemini-code-assist (bot) left a comment

Summary of Changes

Hello @Roasbeef, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request delivers a significant performance optimization by refactoring the message writing and flushing mechanisms in the brontide package. The core objective was to remove all dynamic memory allocations from these high-frequency operations, leading to reduced garbage collection overhead and more predictable latency, thereby enhancing the overall efficiency and responsiveness of the system.

Highlights

  • Zero Allocations: Eliminated all memory allocations from the WriteMessage and Flush methods within the brontide package, a critical hot path for message handling.
  • Performance Improvement: Achieved a 7.68% speed improvement in WriteMessage operations, as validated by comprehensive benchmarks.
  • Buffer Reuse: Implemented static buffer reuse for nonces, encrypted headers, and message bodies to prevent heap allocations and improve efficiency.
  • Documentation: Added a new detailed guide, docs/benchmark_perf_loop.md, explaining the Go performance optimization loop, including benchmarking, profiling, escape analysis, and zero-allocation strategies.

@gemini-code-assist (bot) left a comment

Code Review

This pull request is an excellent optimization effort that successfully eliminates all allocations from the WriteMessage and Flush methods. The use of pre-allocated buffers on the Machine and cipherState structs is a great application of common Go performance patterns. The new benchmark correctly validates the improvements. The new documentation file is a fantastic and detailed guide for future performance work. I have a few suggestions for the code and documentation to improve clarity, correctness, and adherence to the project's style guide. My feedback includes a point about a missing function comment as required by the style guide (lines 15-16).

@Roasbeef force-pushed the brontide-static-alloac branch from e8a612f to e829840 on August 30, 2025 14:21

In this commit, we add a benchmark for the WriteMessage method. This is
the first step in an upcoming optimizooor series of commits.
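
For illustration, the rough shape of such a benchmark looks like the sketch below. The setup helper is hypothetical, and the `WriteMessage([]byte) error` / `Flush(io.Writer) (int, error)` method shapes are assumptions here, not copied from the actual test:

```go
package brontide

import (
	"io"
	"testing"
)

// newBenchMachine is a hypothetical helper that returns a Machine with a
// completed handshake; the real wiring in the test may differ.
func newBenchMachine(b *testing.B) *Machine {
	b.Helper()
	// ... handshake setup elided ...
	return &Machine{}
}

// BenchmarkWriteMessage exercises the encrypt+flush hot path with the
// largest possible payload, reporting B/op and allocs/op so benchstat can
// compare runs across commits.
func BenchmarkWriteMessage(b *testing.B) {
	m := newBenchMachine(b)

	// 65535 bytes: the largest message body brontide will frame.
	payload := make([]byte, 65535)

	b.ReportAllocs()
	b.ResetTimer()

	for i := 0; i < b.N; i++ {
		if err := m.WriteMessage(payload); err != nil {
			b.Fatal(err)
		}
		if _, err := m.Flush(io.Discard); err != nil {
			b.Fatal(err)
		}
	}
}
```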

The baseline is:
```
goos: darwin
goarch: arm64
pkg: github.com/lightningnetwork/lnd/brontide
cpu: Apple M4 Max
BenchmarkWriteMessage-16    	   22736	     50667 ns/op	   73788 B/op	       5 allocs/op
BenchmarkWriteMessage-16    	   23217	     50463 ns/op	   73788 B/op	       5 allocs/op
BenchmarkWriteMessage-16    	   24241	     49941 ns/op	   73788 B/op	       5 allocs/op
BenchmarkWriteMessage-16    	   23574	     51021 ns/op	   73788 B/op	       5 allocs/op
BenchmarkWriteMessage-16    	   23784	     49926 ns/op	   73788 B/op	       5 allocs/op
BenchmarkWriteMessage-16    	   24230	     50538 ns/op	   73788 B/op	       5 allocs/op
BenchmarkWriteMessage-16    	   24058	     49971 ns/op	   73788 B/op	       5 allocs/op
BenchmarkWriteMessage-16    	   23762	     50224 ns/op	   73788 B/op	       5 allocs/op
BenchmarkWriteMessage-16    	   24266	     53034 ns/op	   73788 B/op	       5 allocs/op
BenchmarkWriteMessage-16    	   23317	     50045 ns/op	   73788 B/op	       5 allocs/op
PASS
ok  	github.com/lightningnetwork/lnd/brontide	17.433s
```
In this commit, we create fixed-size arrays for `nextHeaderSend` and
`nextBodySend`. This enables us to reuse these buffers, as we can pass
them into Encrypt/Seal directly, which avoids allocations for each write
that we do.
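
Conceptually, the change looks something like the sketch below (a rough sketch only; the exact sizes, field placement, and call sites in the real patch may differ). The key idea is that `Seal` appends into `buf[:0]`, and since the backing array lives on the struct with enough capacity, no per-write allocation is needed:

```go
package brontide

import "crypto/cipher"

const (
	lengthHeaderSize = 2     // 2-byte packet length prefix.
	macSize          = 16    // poly1305 tag appended by the AEAD.
	maxMsgSize       = 65535 // largest message body we'll frame.
)

// Machine (sketch): the per-write ciphertext buffers are fixed-size arrays
// embedded in the struct, allocated exactly once with the Machine and then
// reused for every WriteMessage call.
type Machine struct {
	nextHeaderSend [lengthHeaderSize + macSize]byte
	nextBodySend   [maxMsgSize + macSize]byte
}

// encryptBody appends the ciphertext into nextBodySend[:0]. The array
// always has enough capacity, so Seal never grows the slice and nothing
// new escapes to the heap on this path.
func (m *Machine) encryptBody(aead cipher.AEAD, nonce, body []byte) []byte {
	return aead.Seal(m.nextBodySend[:0], nonce, body, nil)
}
```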

After this patch, we see the benchmarks improve:
```
goos: darwin
goarch: arm64
pkg: github.com/lightningnetwork/lnd/brontide
cpu: Apple M4 Max
BenchmarkWriteMessage-16    	   25198	     46795 ns/op	      34 B/op	       3 allocs/op
BenchmarkWriteMessage-16    	   26331	     46186 ns/op	      34 B/op	       3 allocs/op
BenchmarkWriteMessage-16    	   25608	     46308 ns/op	      34 B/op	       3 allocs/op
BenchmarkWriteMessage-16    	   25849	     46302 ns/op	      34 B/op	       3 allocs/op
BenchmarkWriteMessage-16    	   26318	     46285 ns/op	      34 B/op	       3 allocs/op
BenchmarkWriteMessage-16    	   25622	     46147 ns/op	      34 B/op	       3 allocs/op
BenchmarkWriteMessage-16    	   26287	     46506 ns/op	      34 B/op	       3 allocs/op
BenchmarkWriteMessage-16    	   26155	     46827 ns/op	      34 B/op	       3 allocs/op
BenchmarkWriteMessage-16    	   25256	     46910 ns/op	      34 B/op	       3 allocs/op
BenchmarkWriteMessage-16    	   25318	     46876 ns/op	      34 B/op	       3 allocs/op
PASS
ok  	github.com/lightningnetwork/lnd/brontide	17.092s
```

```
goos: darwin
goarch: arm64
pkg: github.com/lightningnetwork/lnd/brontide
cpu: Apple M4 Max
                │   old.txt   │              new.txt               │
                │   sec/op    │   sec/op     vs base               │
WriteMessage-16   50.34µ ± 1%   46.41µ ± 1%  -7.82% (p=0.000 n=10)

                │    old.txt    │              new.txt               │
                │     B/op      │    B/op     vs base                │
WriteMessage-16   73788.00 ± 0%   34.00 ± 0%  -99.95% (p=0.000 n=10)

                │  old.txt   │              new.txt               │
                │ allocs/op  │ allocs/op   vs base                │
WriteMessage-16   5.000 ± 0%   3.000 ± 0%  -40.00% (p=0.000 n=10)
```
…achine

In this commit, we use a fixed sized buffer for the nonce when we
read/write messages. This was actually escaping to the heap. We can
avoid this by statically allocating it alongside the struct itself.
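
The idea, as a sketch (field names and layout are illustrative): keep the 12-byte nonce as an array field, so the slice handed to the AEAD is backed by the struct itself rather than a fresh heap allocation on every call:

```go
package brontide

import "encoding/binary"

type cipherState struct {
	// nonce counts messages sent/received under the current key.
	nonce uint64

	// nonceBuf is reused for every encrypt/decrypt call. A local
	// `var nonce [12]byte` escapes to the heap because its slice is
	// passed into the AEAD interface; as a struct field it's allocated
	// once with the cipherState.
	nonceBuf [12]byte
}

// nextNonce encodes the 64-bit counter into the reusable buffer using the
// BOLT-8 layout: 4 zero bytes followed by the little-endian counter.
func (c *cipherState) nextNonce() []byte {
	binary.LittleEndian.PutUint64(c.nonceBuf[4:], c.nonce)
	c.nonce++
	return c.nonceBuf[:]
}
```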

The benchmark state at this point:

```
goos: darwin
goarch: arm64
pkg: github.com/lightningnetwork/lnd/brontide
cpu: Apple M4 Max
BenchmarkWriteMessage-16    	   25264	     47012 ns/op	       4 B/op	       1 allocs/op
BenchmarkWriteMessage-16    	   23542	     46809 ns/op	       4 B/op	       1 allocs/op
BenchmarkWriteMessage-16    	   25989	     47256 ns/op	       4 B/op	       1 allocs/op
BenchmarkWriteMessage-16    	   25542	     46388 ns/op	       4 B/op	       1 allocs/op
BenchmarkWriteMessage-16    	   26083	     46612 ns/op	       4 B/op	       1 allocs/op
BenchmarkWriteMessage-16    	   25860	     46367 ns/op	       4 B/op	       1 allocs/op
BenchmarkWriteMessage-16    	   24967	     46748 ns/op	       4 B/op	       1 allocs/op
BenchmarkWriteMessage-16    	   26088	     46485 ns/op	       4 B/op	       1 allocs/op
BenchmarkWriteMessage-16    	   25561	     46425 ns/op	       4 B/op	       1 allocs/op
BenchmarkWriteMessage-16    	   25474	     47249 ns/op	       4 B/op	       1 allocs/op
PASS
ok  	github.com/lightningnetwork/lnd/brontide	16.911s
```

```
goos: darwin
goarch: arm64
pkg: github.com/lightningnetwork/lnd/brontide
cpu: Apple M4 Max
                │   old.txt   │              new2.txt              │
                │   sec/op    │   sec/op     vs base               │
WriteMessage-16   50.34µ ± 1%   46.68µ ± 1%  -7.28% (p=0.000 n=10)

                │    old.txt     │              new2.txt              │
                │      B/op      │    B/op     vs base                │
WriteMessage-16   73788.000 ± 0%   4.000 ± 0%  -99.99% (p=0.000 n=10)

                │  old.txt   │              new2.txt              │
                │ allocs/op  │ allocs/op   vs base                │
WriteMessage-16   5.000 ± 0%   1.000 ± 0%  -80.00% (p=0.000 n=10)
```
In this commit, we eliminate the final allocation that takes place when
we write out messages. Once again this was escaping to the heap, so we
make it an attribute on the Machine struct, which allows pure static
allocation.

```
goos: darwin
goarch: arm64
pkg: github.com/lightningnetwork/lnd/brontide
cpu: Apple M4 Max
BenchmarkWriteMessage-16    	   25840	     46376 ns/op	       2 B/op	       0 allocs/op
BenchmarkWriteMessage-16    	   25646	     46672 ns/op	       2 B/op	       0 allocs/op
BenchmarkWriteMessage-16    	   25874	     46391 ns/op	       2 B/op	       0 allocs/op
BenchmarkWriteMessage-16    	   25512	     46427 ns/op	       2 B/op	       0 allocs/op
BenchmarkWriteMessage-16    	   25760	     46309 ns/op	       2 B/op	       0 allocs/op
BenchmarkWriteMessage-16    	   25789	     46520 ns/op	       2 B/op	       0 allocs/op
BenchmarkWriteMessage-16    	   25602	     46619 ns/op	       2 B/op	       0 allocs/op
BenchmarkWriteMessage-16    	   25766	     46464 ns/op	       2 B/op	       0 allocs/op
BenchmarkWriteMessage-16    	   25820	     46487 ns/op	       2 B/op	       0 allocs/op
BenchmarkWriteMessage-16    	   25634	     46553 ns/op	       2 B/op	       0 allocs/op
PASS
ok  	github.com/lightningnetwork/lnd/brontide	16.907s
```

With this, we now have eliminated 100% of the allocations!

```
goos: darwin
goarch: arm64
pkg: github.com/lightningnetwork/lnd/brontide
cpu: Apple M4 Max
                │   old.txt   │             final.txt              │
                │   sec/op    │   sec/op     vs base               │
WriteMessage-16   50.34µ ± 1%   46.48µ ± 0%  -7.68% (p=0.000 n=10)

                │    old.txt     │              final.txt              │
                │      B/op      │    B/op     vs base                 │
WriteMessage-16   73788.000 ± 0%   2.000 ± 0%  -100.00% (p=0.000 n=10)

                │  old.txt   │              final.txt              │
                │ allocs/op  │ allocs/op   vs base                 │
WriteMessage-16   5.000 ± 0%   0.000 ± 0%  -100.00% (p=0.000 n=10)
```
In this commit, we add some docs that explain how to use tools like
heap escape analysis and memory profiling to fully eliminate allocations
in a sample program.

This guide is meant to help devs/contributors use Go's excellent perf
tools to zoom in on an optimization problem.
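
As a condensed example of the loop the guide describes (the functions below are made up for illustration, not taken from the doc or this PR):

```go
package main

// Build with escape analysis diagnostics enabled:
//
//	go build -gcflags='-m' .
//
// and measure allocations with:
//
//	go test -bench=. -benchmem -memprofile=mem.out
//
// The compiler output ("... escapes to heap", "moved to heap: ...") points
// straight at the allocations that a benchmark's allocs/op column counts.

// frameAlloc allocates a fresh buffer on every call; -m reports the make()
// as escaping because the result is returned to the caller.
func frameAlloc(msg []byte) []byte {
	buf := make([]byte, 0, len(msg))
	return append(buf, msg...)
}

// frameReuse writes into a caller-provided buffer instead, so steady-state
// calls perform zero allocations as long as dst has enough capacity.
func frameReuse(dst, msg []byte) []byte {
	return append(dst[:0], msg...)
}

func main() {
	var scratch [64]byte
	_ = frameAlloc([]byte("allocates every call"))
	_ = frameReuse(scratch[:], []byte("reuses the caller's array"))
}
```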
@Roasbeef force-pushed the brontide-static-alloac branch from e829840 to 21c21b7 on August 30, 2025 14:51
@yyforyongyu added this to the v0.20.0 milestone Sep 1, 2025

@yyforyongyu (Member) commented Sep 1, 2025

/gemini review

@yyforyongyu (Member) left a comment

Impressive results from a simple tweak! Just need to fix the CI, otherwise LGTM🙏

@@ -0,0 +1,378 @@
# The Go Performance Optimization Loop: From Benchmarks to Zero Allocations

❤️

@morehouse (Collaborator)

So the net effect is:

  • We save 4 microseconds per message sent.
  • We preallocate 64 KB for every peer.

I'm not sure this change is a good tradeoff... Network RTTs are already orders of magnitude higher than the performance improvement, even within the same datacenter. A 4 us improvement is lost in the noise. Meanwhile we increase typical memory usage by 64 KB per peer.

@saubyk requested a review from gijswijs September 2, 2025 17:06
@Roasbeef (Member, Author) commented Sep 2, 2025

So the net effect is:
We save 4 microseconds per message sent.

This is more about entirely eliminating allocations when we write messages, the minor (~7%) speed up isn't the main target at all.

We preallocate 64 KB for every peer.

Correct that we pre-allocate 64 KB per peer, but this memory allocation is now bounded, and doesn't need to later be marked by the GC to be freed (cycles saved traversing the object graph). Memory utilization will now be smoothed out, dampening the peaks you'd see in a typical graph of allocated memory over time for GC'd languages. A node with 1k active connections does indeed see a static increase of 64 MB, but that doesn't tell the whole story.

Before, we'd allocate 5x each time we went to write out a message to the peer (individual buffers for the nonce, header, body, and packet size -- so more objects for the GC to chase).

Rather than focusing on that single snapshot of allocation, we should consider the rate of allocation over time. Each time a single peer sends us a 65 KB message, we're generating garbage that later needs to be cleaned up, which adds overhead for the GC. The overhead we save isn't necessarily showing up in this profile, as it's a short-lived benchmark. On the macro level, we increase the overall performance of the entire daemon, as less time is spent on GC cycles (stop-the-world).

@saubyk added this to lnd v0.20 Sep 2, 2025
@saubyk moved this to In progress in lnd v0.20 Sep 2, 2025
@saubyk moved this from In progress to In review in lnd v0.20 Sep 2, 2025
@morehouse (Collaborator)

Correct that we pre-allocate 64 KB per peer, but this memory allocation is now bounded, and doesn't need to later be marked by the GC to be freed (cycles saved traversing the object graph). Memory utilization will now be smoothed out, dampening the peaks you'd see in a typical graph of allocated memory over time for GC'd languages. A node with 1k active connections does indeed see a static increase of 64 MB, but that doesn't tell the whole story.

Sure, but dampening the peaks by raising the water level isn't always good; e.g., that's why we don't preallocate 10 GB at LND startup.

These allocations are typically very short lived -- they only need to exist long enough to send the outgoing message. They're also typically quite small (orders of magnitude smaller than the 64 KB worst case). And most peers don't constantly send messages. In these conditions, preallocating for the worst case isn't necessarily a win.

Rather than focusing on that single snapshot of allocation, we should consider the rate of allocation over time. If a single peer sends us a 65 KB message, each time we're generating garbage that later needs to be cleaned up, which adds overhead for the GC. The overhead we save isn't necessarily showing up in this profile, as it's a short lived benchmark. On the macro level, we increase the overall performance of the entire daemon, as less time is spent on GC cycles (stop-the-world).

Can we write a better benchmark then? Or get a before-and-after full node CPU profile under typical conditions and also under heavy load? Let's use data to drive the optimizations rather than guessing.

In this commit, we address a suuuper old TODO to reset the brontide
state. We were allocating a read buf which we now set to nil, and we
free up the original brontide reference.
@gijswijs (Collaborator) left a comment

They're also typically quite small (orders of magnitude smaller than the 64 KB worst case). And most peers don't constantly send messages. In these conditions, preallocating for the worst case isn't necessarily a win.

You're not wrong, but the stated goal is zero (heap) allocs, so then you will have to pre-allocate the max message size.

- The percentage change (vs base column) showing the magnitude of improvement

- The p-value (p=0.000) indicating statistical significance - values below 0.05
suggest real improvements rather than noise

The Bayesian in me disagrees, but let's keep a frequentist frame of reference for the statistics.

This is a great guide by the way.

@morehouse (Collaborator)

You're not wrong, but the stated goal is zero (heap) allocs, so then you will have to pre-allocate the max message size.

Ultimately the goal is improved performance. Eliminating heap allocs is one way to pursue that goal, but this is likely a case where there's no performance benefit 99.9% of the time, since GC is already efficient for the small, infrequent, short-lived allocations being eliminated.

There may actually be a performance regression when not under DoS attack, since we're now creating large long-lived allocations for every peer instead of small short-lived ones. Every time GC occurs, those large allocations need to be scanned in their entirety, causing GC pauses to last longer.

And regardless of the performance improvement or regression, there's a guaranteed cost in memory overhead that this change introduces. This overhead needs to be weighed against whatever potential performance benefit there may be.

This is why we need good performance data to drive the optimization cycle. Guessing often produces incorrect results, while increasing code complexity.

@yyforyongyu (Member) left a comment

This PR aims at adding a worst-case bound, i.e. 65 KB per peer, for the encryption of wire msgs. I think it's a good starting point given this is how we handle writing plaintext. However, I overlooked that we use a pool instead of static upfront allocations:

lnd/peer/brontide.go, lines 2637 to 2652 in 9d74ec4:

```go
	// Otherwise, this is a new message. We'll acquire a write buffer to
	// serialize the message and buffer the ciphertext on the connection.
	err := p.cfg.WritePool.Submit(func(buf *bytes.Buffer) error {
		// Using a buffer allocated by the write pool, encode the
		// message directly into the buffer.
		_, writeErr := lnwire.WriteMessage(buf, msg, 0)
		if writeErr != nil {
			return writeErr
		}

		// Finally, write the message itself in a single swoop. This
		// will buffer the ciphertext on the underlying connection. We
		// will defer flushing the message until the write pool has been
		// released.
		return noiseConn.WriteMessage(buf.Bytes())
	})
```

When encoding the plaintext, we take a buffer from the write pool and return it once done. The size of the buffer is 65 KB, but this is very different from static upfront allocations - if we have 10 peers (out of 1000) writing msgs at the same time, this PR would need 65 MB of memory, while the write pool only needs 650 KB.

Given we already have this write pool, which allocates at most 100 objects, memory is bounded at 6.5 MB; and it has its own dynamic GC, which releases a buffer when it hasn't been used for 30s. I think we can use it instead, though the current WritePool needs more methods than a single Submit - specifically, it needs a Take and a Return, which are already built into bufferPool; we just need to expose them in WritePool. Then the writeMessage shown above will (see the sketch after this list):

  1. take a buffer from the pool when writing the msg, which includes encoding and encryption.
  2. return the buffer when it's flushed.
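
As a sketch of that flow only (Take and Return don't exist on WritePool today; p, msg, and noiseConn are as in the snippet quoted above, and the Flush call shape is an assumption):

```go
// 1. Borrow a buffer for the whole encode+encrypt+flush lifetime.
buf := p.cfg.WritePool.Take()
// 2. Hand it back only once the ciphertext has been flushed.
defer p.cfg.WritePool.Return(buf)

// Encode the plaintext wire message into the pooled buffer.
if _, err := lnwire.WriteMessage(buf, msg, 0); err != nil {
	return err
}

// Encrypt, stage the ciphertext on the connection, and flush it before the
// deferred Return makes the buffer available to the next writer.
if err := noiseConn.WriteMessage(buf.Bytes()); err != nil {
	return err
}
if _, err := noiseConn.Flush(); err != nil {
	return err
}
```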

The other aspect I didn't realize was the increased GC pressure - on its own it seems like we could freeze a node by increasing the number of peers, but that number is already restricted.

@Roasbeef (Member, Author) commented Sep 6, 2025

Plan to run this across a few diff nodes before landing.

@morehouse (Collaborator) left a comment

NACK.

The current benchmark suggests a maximum 7% performance improvement in extremely pathological cases (constant streaming of 64KB messages to the peer). The benchmark is extremely unrealistic, even under a DoS attack, because in practice Flush actually waits for at least the network RTT before returning, which takes ~1000x more time than the benchmark currently takes into account. This suggests the actual performance improvement under DoS is basically 0%.

@Roasbeef mentioned that there's other benefit not captured by the benchmark, since GC likely doesn't happen during the short-lived benchmark. ISTM the benchmark already captures GC behavior, since it sends data at a rate higher than 1 GB/s based on the benchmark data in the OP. Even running the benchmark for just 1 second likely causes GC to run multiple times.

So zero benefit, plus ~30x increase in per-Brontide memory is a loss. Until I see data suggesting otherwise, I can't get on board with this.

Finally, the latest commit introduced an extremely trivial DoS vector. Oops.

Comment on lines 236 to +240:

```go
func (c *Conn) Close() error {
	// TODO(roasbeef): reset brontide state?
	// Clear the state we created to be able to handle this connection.
	c.noise = nil
	c.readBuf = bytes.Buffer{}
```

This is a null deref waiting to happen. It is common for Brontide goroutines to continue calling Conn methods after the connection has been closed.

@Roasbeef (Member, Author) commented Sep 10, 2025

Currently disconnect is guarded by an atomic variable, so callers to it will only succeed once.

If I look at the other callers into p.conn, I see one on the String method, but not many others.

In either case, we can have a similar CAS check on calls into the connection after it's been closed. Thanks for pointing this out; I hadn't gotten around to testing this on an actual node yet.
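
A rough sketch of that kind of guard, assuming an atomic flag on the Conn (names, fields, and the sentinel error below are hypothetical):

```go
package brontide

import (
	"bytes"
	"errors"
	"sync/atomic"
)

// ErrConnClosed is a hypothetical sentinel for calls made after Close.
var ErrConnClosed = errors.New("brontide: connection closed")

// Machine is a stand-in here for the real noise state machine.
type Machine struct{}

// Conn (sketch): one atomic flag makes the teardown in Close one-shot and
// lets every other method bail out instead of touching freed state.
type Conn struct {
	closed atomic.Bool

	noise   *Machine
	readBuf bytes.Buffer
}

// Close releases the noise state exactly once; repeat calls are no-ops.
func (c *Conn) Close() error {
	if !c.closed.CompareAndSwap(false, true) {
		return nil
	}

	c.noise = nil
	c.readBuf = bytes.Buffer{}

	return nil
}

// useNoise is an illustrative stand-in for any method that still needs the
// noise state: check the flag first rather than risking a nil deref.
func (c *Conn) useNoise() error {
	if c.closed.Load() {
		return ErrConnClosed
	}

	// ... safe to use c.noise here ...
	return nil
}
```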

@Roasbeef (Member, Author) commented Sep 11, 2025

Given we already have this write pool, which allocates at most 100 objects, memory is bounded at 6.5 MB; and it has its own dynamic GC, which releases a buffer when it hasn't been used for 30s. I think we can use it instead, though the current WritePool needs more methods than a single Submit - specifically, it needs a Take and a Return, which are already built into bufferPool; we just need to expose them in WritePool.

I think this makes a lot of sense. I'll take a stab at this implementation. I think the smaller static buffers can still be kept as part of the brontide.Machine struct.

I've been running a modified version (a few different iterations) against some live nodes, and there's a clear difference between this and the current state of master. I think your idea is even better though, as it reuses a system we already have to reuse buffers for serializing messages, and we can extend it to cover the ciphertexts as well.

@Roasbeef (Member, Author)

Thanks everyone for all the feedback.

I've run some benchmarks with 3 variants of this PR:

  1. One that fixes the issue Morehouse pointed out above, using an atomic pointer to swap out the brontide.Machine pointer: master...Roasbeef:lnd:brontide-fix-nil.
  2. One that uses yy's suggestion to just hook into the existing buffer pool: master...Roasbeef:lnd:brontide-buf-pool.
  3. A variant on the above, that modifies the pool (temp patch here, has wider implications) to use 2x the number of restricted connections: master...Roasbeef:lnd:brontide-buf-fixed-pool.

PR as is

With that extra commit, when run with a stress test program (thousands of peers), the PR holds up fairly well. Compared to the others, it allocates more when calling brontide.NewMachine.

Here's a flame graph after running the stress test program for a few minutes:
[Screenshot 2025-09-11 at 3 55 50 PM]

In this case, a lot of the overhead is actually creating the peer struct.

PR with buf pools

When adding the buf pools, the issue is that even though we create 100 workers, the pool will continue to allocate new buffers. We use a very small timeout of just 1 ms, so most of the time it falls through to creating a new set of buffers. So although buffers can be re-used, we still see a lot of allocations, since in practice we're basically allocating entirely new buffers each time.

You can see this clearly in the flamegraph:

[Screenshot 2025-09-11 at 4 01 47 PM]

PR with fixed buf pools

For this variant, I increased the returnQueue size to 2x the number of restricted connections. Then I modified the Take() logic to block until a buffer becomes available, rather than creating a new one.

This performs better than without these changes, as we're now forcing a smaller working set. There are tradeoffs, though: write throughput with many peers may be hampered, since we'll now block waiting for buffers instead of creating them as needed.
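
Mechanically, a blocking Take over a fixed set of buffers boils down to a buffered channel; the sketch below uses hypothetical sizes and is not the actual pool patch:

```go
package main

// bufferPool sketch: a fixed number of max-size buffers circulate through a
// channel. Take blocks until one is returned, which caps the working set at
// poolSize buffers but can stall writers when every buffer is in use.
type bufferPool struct {
	bufs chan []byte
}

func newBufferPool(poolSize, bufSize int) *bufferPool {
	p := &bufferPool{bufs: make(chan []byte, poolSize)}
	for i := 0; i < poolSize; i++ {
		p.bufs <- make([]byte, bufSize)
	}
	return p
}

// Take blocks until a buffer is available instead of falling back to a
// fresh allocation after a short timeout.
func (p *bufferPool) Take() []byte { return <-p.bufs }

// Return hands the buffer back for the next writer.
func (p *bufferPool) Return(b []byte) { p.bufs <- b }

func main() {
	// e.g. 2x the restricted connection count, 65 KB + MAC per buffer.
	pool := newBufferPool(200, 65535+16)

	buf := pool.Take()
	defer pool.Return(buf)
	_ = buf
}
```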

[Screenshot 2025-09-11 at 4 42 55 PM]

Here's a slop report generated from the raw profile files: https://gist.github.com/Roasbeef/095e3d91639809731253253eabe8a26f

Thoughts?

@yyforyongyu (Member) left a comment

Thanks for the updates! I prefer the last one, brontide-buf-fixed-pool, as it eliminates memory allocations (reducing GC pressure and improving latency), improves performance under load, and gives more predictable memory allocation. One concern is leaking buffers from the pool when peers disconnect abruptly. Also, the 24-hour timeout seems too aggressive and could be replaced with a much shorter value, like seconds.

@morehouse (Collaborator)

@Roasbeef: Thanks for collecting more data. I don't think the data addresses any of my expressed concerns though. Key information needed to properly optimize:

  • What is the "stress test program"?
  • How do the variants handle regular traffic (no stress test)?
  • A control group (i.e. master branch) is needed.
  • CPU profiles of each variant are critical to determine if the CPU-memory tradeoff is good.
  • Ideally provide the SVGs for the web visualizations of all CPU and heap profiles, so we can study them ourselves.

I expressed concerns that this PR increases memory usage vs master. So we need memory usage data for master to properly evaluate that.

I also expressed concerns that this PR will perform poorly under regular traffic, potentially even increasing CPU usage. We need CPU profiles to evaluate that.

We especially need CPU data for time spent during GC. For example, if the master branch only spends 5% of its CPU time in GC, then eliminating allocations across all of LND could only increase performance by ~5%! If that's the case, increasing code complexity to eliminate a small fraction of allocations was silly from the start, and all the discussion on this PR has wasted everyone's time.


@yyforyongyu: I am skeptical that using the buffer pool will provide good memory or CPU benefit in practice. Why create our own garbage collection scheme instead of relying on Go's optimized garbage collector? Implementing a GC is tricky, which @Roasbeef's "PR with buf pools" clearly demonstrates -- it increases memory usage due to slow memory release.

The "PR with fixed buf pools" makes a memory-time tradeoff --- it imposes a strong memory limit by introducing blocking behavior. We really need data about how much additional latency is added by this change to decide if the tradeoff is worthwhile. Since previously there was a maximum 1ms pause and now it blocks forever, this could have a major impact on node responsiveness to peer messages.

However, the 24-hour timeout seems too aggressive and can be replaced with a much short value like seconds.

Even a few seconds of blocking would cause a huge increase in send latency, something like 100x. If this happens frequently, it could be a big problem for users.


There's another variant that I think would be worth benchmarking. It would be the simplest variant considered so far, and I think it will provide the same or better memory usage in practice as any of the other variants.

0001-brontide-free-pending-message-after-flushing.patch

The main idea is to work with the garbage collector rather than trying to avoid it. We just drop the reference to the pending message as soon as we no longer need it, so that we no longer prevent the garbage collector from reclaiming that buffer. This is a big improvement over master, which currently keeps a reference until a new message needs to be written out or until the entire Brontide is garbage collected.

This should provide memory benefit both under DoS and with regular usage. No new CPU overhead.
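
Roughly, the patch amounts to something like the sketch below (against master's slice-based buffers; it skips the real Flush's partial-write bookkeeping and exact field names):

```go
package brontide

import "io"

// Machine sketch with master's slice-based pending-send buffers.
type Machine struct {
	nextHeaderSend []byte
	nextBodySend   []byte
}

// Flush writes out the staged ciphertext and then drops the references, so
// the GC can reclaim the (up to ~64 KB) body buffer as soon as the message
// is on the wire, instead of it staying pinned until the next WriteMessage
// or until the whole Machine is collected.
func (m *Machine) Flush(w io.Writer) (int, error) {
	var total int

	if len(m.nextHeaderSend) > 0 {
		n, err := w.Write(m.nextHeaderSend)
		total += n
		if err != nil {
			return total, err
		}
		m.nextHeaderSend = nil
	}

	if len(m.nextBodySend) > 0 {
		n, err := w.Write(m.nextBodySend)
		total += n
		if err != nil {
			return total, err
		}
		m.nextBodySend = nil
	}

	return total, nil
}
```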

@Roasbeef (Member, Author) commented Sep 16, 2025

What is the "stress test program"?

It just simulates the case of a routing node with maxed-out restricted connections, so a case where each new connection hits the limit, fails, and maybe eventually succeeds once a new slot frees up.

I still have the profiles lying around, so I can upload them somewhere.

I also expressed concerns that this PR will perform poorly under regular traffic, potentially even increasing CPU usage. We need CPU profiles to evaluate that.

I haven't observed that at all in my tests, at least not for non-GC CPU utilization. What mechanism in the PR do you think would lead to additional active CPU usage?

We especially need CPU data for time spent during GC. For example, if the master branch only spends 5% of its CPU time in GC, then eliminating allocations across all of LND could only increase performance by ~5%!

I think for something like this, we'd need to pin down a given node profile. As an example, a leaf node with only private channels has a very different access pattern from a very large routing node with many forwarding attempts, mobile clients attempting to connect to it, etc.

The amount of time spent in GC also varies across the lifetime of a node. E.g., downloading the channel graph for the very first time generates a lot more garbage than idle operation.

Just to set a baseline here: we're not after some grand optimization across all of lnd, we're just focused on allocations due to p2p processing. In any case, if someone has a patch that increases perf by 5%, I think we'd certainly take a look at it.

If that's the case, increasing code complexity to eliminate a small fraction of allocations was silly from the start, and all the discussion on this PR has wasted everyone's time.

Are you referring to this PR as is? There's virtually no code complexity increase; it just moves some allocations around. For the other two PRs, the added code diff is also pretty minor (the largest diff being the pointer swapping in one of the branches). Do you have explicit concerns with either of them? Granted, the buffer change shifts around some of the callers a bit.

I wouldn't say the discussion in this PR wasted anyone's time; it's just one of the many open PRs that we all review daily/weekly to continually work towards improving lnd.

Why create our own garbage collection scheme instead of relying on Go's optimized garbage collector?

Pooling medium- to short-lived objects for re-use is a pretty standard pattern, even outside GC'd languages. Go even provides sync.Pool for this very use case. Object re-use means fewer allocations.
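
For reference, the standard-library flavor of that pattern is only a few lines (generic example, not lnd code):

```go
package main

import (
	"bytes"
	"sync"
)

// bufPool hands out *bytes.Buffer values for reuse. The runtime is free to
// drop pooled objects during GC, so this bounds steady-state allocation
// without pinning memory forever.
var bufPool = sync.Pool{
	New: func() any { return new(bytes.Buffer) },
}

// writeFrame borrows a buffer, uses it to assemble a frame, and copies the
// result out before the buffer goes back to the pool for the next caller.
func writeFrame(payload []byte) []byte {
	buf := bufPool.Get().(*bytes.Buffer)
	defer func() {
		buf.Reset()
		bufPool.Put(buf)
	}()

	buf.Write(payload)

	out := make([]byte, buf.Len())
	copy(out, buf.Bytes())
	return out
}

func main() {
	_ = writeFrame([]byte("hello"))
}
```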

I think the question we're attempting to answer is: is it worthwhile to try to clamp down on the total amount and the rate of allocations to service peers?

If we're able to do that, trading off some latency (10s of ms is fine, since we're dealing with round-trip delays anyway), then we reduce the memory utilization of the daemon overall, as we're clamping the amount of live garbage we generate due to p2p processing.

@Roasbeef (Member, Author)

Here're the profiles from the initial tests:

noise-buf-allocs-20250911-154557.pb.gz
noise-buf-heap-20250911-154547.pb.gz
noise-buffer-nil-allocs-20250915-180131.pb.gz
noise-buffer-nil-allocs-20250915-180840.pb.gz
noise-buffer-nil-heap-20250915-180124.pb.gz
noise-buffer-nil-heap-20250915-180803.pb.gz
noise-fixed-buf-allocs-20250911-162421.pb.gz
noise-fixed-buf-heap-20250911-162413.pb.gz
noise-nil-allocs-20250910-183907.pb.gz
noise-nil-heap-20250910-183849.pb.gz

noise-buffer-nil is the patch posted above. It performs on par with noise-nil-allocs, which is this PR, plus the pointer hot swap. I agree nil-ing those buffers is simpler than the atomic pointer hot swap that noise-nil-allocs implements. The main difference between those two is that noise-buffer-nil ends up allocating GBs of memory in the encryption library we use, as it allocates a new buffer for each write.

[Screenshot 2025-09-15 at 6 12 22 PM]

Here's an updated version of that slop report: https://gist.github.com/Roasbeef/095e3d91639809731253253eabe8a26f/revisions.
