brontide: eliminate all allocations from WriteMessage+Flush #10183
Conversation
Summary of Changes
Hello @Roasbeef, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request delivers a significant performance optimization by refactoring the message writing and flushing mechanisms in the `brontide` package. The core objective was to remove all dynamic memory allocations from these high-frequency operations, leading to reduced garbage collection overhead and more predictable latency, thereby enhancing the overall efficiency and responsiveness of the system.
Highlights
- Zero Allocations: Eliminated all memory allocations from the `WriteMessage` and `Flush` methods within the `brontide` package, a critical hot path for message handling.
- Performance Improvement: Achieved a 7.68% speed improvement in `WriteMessage` operations, as validated by comprehensive benchmarks.
- Buffer Reuse: Implemented static buffer reuse for nonces, encrypted headers, and message bodies to prevent heap allocations and improve efficiency.
- Documentation: Added a new detailed guide, `docs/benchmark_perf_loop.md`, explaining the Go performance optimization loop, including benchmarking, profiling, escape analysis, and zero-allocation strategies.
Code Review
This pull request is an excellent optimization effort that successfully eliminates all allocations from the `WriteMessage` and `Flush` methods. The use of pre-allocated buffers on the `Machine` and `cipherState` structs is a great application of common Go performance patterns. The new benchmark correctly validates the improvements. The new documentation file is a fantastic and detailed guide for future performance work. I have a few suggestions for the code and documentation to improve clarity, correctness, and adherence to the project's style guide. My feedback includes a point about a missing function comment as required by the style guide (lines 15-16).
Force-pushed from e8a612f to e829840
In this commit, we add a benchmark for the WriteMessage method. This is the first step in an upcoming optimizooor series of commits. The baseline is:
```
goos: darwin
goarch: arm64
pkg: github.com/lightningnetwork/lnd/brontide
cpu: Apple M4 Max
BenchmarkWriteMessage-16    22736    50667 ns/op    73788 B/op    5 allocs/op
BenchmarkWriteMessage-16    23217    50463 ns/op    73788 B/op    5 allocs/op
BenchmarkWriteMessage-16    24241    49941 ns/op    73788 B/op    5 allocs/op
BenchmarkWriteMessage-16    23574    51021 ns/op    73788 B/op    5 allocs/op
BenchmarkWriteMessage-16    23784    49926 ns/op    73788 B/op    5 allocs/op
BenchmarkWriteMessage-16    24230    50538 ns/op    73788 B/op    5 allocs/op
BenchmarkWriteMessage-16    24058    49971 ns/op    73788 B/op    5 allocs/op
BenchmarkWriteMessage-16    23762    50224 ns/op    73788 B/op    5 allocs/op
BenchmarkWriteMessage-16    24266    53034 ns/op    73788 B/op    5 allocs/op
BenchmarkWriteMessage-16    23317    50045 ns/op    73788 B/op    5 allocs/op
PASS
ok      github.com/lightningnetwork/lnd/brontide    17.433s
```
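For readers who want to reproduce this kind of baseline, here is a minimal sketch of a Go benchmark that reports allocations; it would live in a `_test.go` file, and `writeMessage` below is a hypothetical stand-in helper, not the actual brontide method.

```go
package bench

import (
	"bytes"
	"testing"
)

// writeMessage is a hypothetical stand-in for the method under test.
func writeMessage(buf *bytes.Buffer, msg []byte) {
	buf.Reset()
	buf.Write(msg)
}

// BenchmarkWriteMessage reports ns/op, B/op, and allocs/op -- the three
// columns shown in the benchmark output above.
func BenchmarkWriteMessage(b *testing.B) {
	var buf bytes.Buffer
	msg := make([]byte, 65535) // max lightning message size

	b.ReportAllocs()
	b.ResetTimer()

	for i := 0; i < b.N; i++ {
		writeMessage(&buf, msg)
	}
}
```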
In this commit, we create fixed-size arrays for `nextHeaderSend` and `nextBodySend`. This enables us to reuse these buffers, as we can pass them into Encrypt/Seal directly, which avoids allocations for each write that we do. After this patch, we see the benchmarks improve:
```
goos: darwin
goarch: arm64
pkg: github.com/lightningnetwork/lnd/brontide
cpu: Apple M4 Max
BenchmarkWriteMessage-16    25198    46795 ns/op    34 B/op    3 allocs/op
BenchmarkWriteMessage-16    26331    46186 ns/op    34 B/op    3 allocs/op
BenchmarkWriteMessage-16    25608    46308 ns/op    34 B/op    3 allocs/op
BenchmarkWriteMessage-16    25849    46302 ns/op    34 B/op    3 allocs/op
BenchmarkWriteMessage-16    26318    46285 ns/op    34 B/op    3 allocs/op
BenchmarkWriteMessage-16    25622    46147 ns/op    34 B/op    3 allocs/op
BenchmarkWriteMessage-16    26287    46506 ns/op    34 B/op    3 allocs/op
BenchmarkWriteMessage-16    26155    46827 ns/op    34 B/op    3 allocs/op
BenchmarkWriteMessage-16    25256    46910 ns/op    34 B/op    3 allocs/op
BenchmarkWriteMessage-16    25318    46876 ns/op    34 B/op    3 allocs/op
PASS
ok      github.com/lightningnetwork/lnd/brontide    17.092s
```
```
goos: darwin
goarch: arm64
pkg: github.com/lightningnetwork/lnd/brontide
cpu: Apple M4 Max
                 │   old.txt   │            new.txt            │
                 │   sec/op    │   sec/op     vs base          │
WriteMessage-16    50.34µ ± 1%   46.41µ ± 1%  -7.82% (p=0.000 n=10)

                 │    old.txt    │            new.txt            │
                 │     B/op      │    B/op      vs base          │
WriteMessage-16    73788.00 ± 0%   34.00 ± 0%  -99.95% (p=0.000 n=10)

                 │  old.txt   │            new.txt            │
                 │ allocs/op  │ allocs/op    vs base          │
WriteMessage-16    5.000 ± 0%   3.000 ± 0%  -40.00% (p=0.000 n=10)
```
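As a rough illustration of the pattern (not the actual brontide code), passing a slice of a pre-allocated array as the Seal destination lets the AEAD append in place instead of allocating per call; the struct, field names, and sizes below are assumptions for the sketch.

```go
package main

import (
	"crypto/cipher"
	"crypto/rand"
	"fmt"

	"golang.org/x/crypto/chacha20poly1305"
)

// sendState is illustrative only: the header and body ciphertext
// buffers are fixed-size arrays, and Seal appends into them via
// buf[:0], so no per-message slice is allocated.
type sendState struct {
	aead     cipher.AEAD
	nonce    [chacha20poly1305.NonceSize]byte
	hdrSend  [2 + 16]byte     // length prefix + MAC
	bodySend [65535 + 16]byte // max payload + MAC
}

func (s *sendState) sealBody(plaintext []byte) []byte {
	// Seal appends to its first argument; handing it bodySend[:0]
	// reuses the array that already lives on the struct.
	return s.aead.Seal(s.bodySend[:0], s.nonce[:], plaintext, nil)
}

func main() {
	key := make([]byte, chacha20poly1305.KeySize)
	if _, err := rand.Read(key); err != nil {
		panic(err)
	}
	aead, err := chacha20poly1305.New(key)
	if err != nil {
		panic(err)
	}

	s := &sendState{aead: aead}
	ct := s.sealBody([]byte("ping"))
	fmt.Println(len(ct)) // 4 bytes of plaintext + 16-byte tag = 20
}
```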
In this commit, we use a fixed-size buffer for the nonce when we read/write messages. This was actually escaping to the heap. We can avoid this by statically allocating it alongside the struct itself. The benchmark state at this point:
```
goos: darwin
goarch: arm64
pkg: github.com/lightningnetwork/lnd/brontide
cpu: Apple M4 Max
BenchmarkWriteMessage-16    25264    47012 ns/op    4 B/op    1 allocs/op
BenchmarkWriteMessage-16    23542    46809 ns/op    4 B/op    1 allocs/op
BenchmarkWriteMessage-16    25989    47256 ns/op    4 B/op    1 allocs/op
BenchmarkWriteMessage-16    25542    46388 ns/op    4 B/op    1 allocs/op
BenchmarkWriteMessage-16    26083    46612 ns/op    4 B/op    1 allocs/op
BenchmarkWriteMessage-16    25860    46367 ns/op    4 B/op    1 allocs/op
BenchmarkWriteMessage-16    24967    46748 ns/op    4 B/op    1 allocs/op
BenchmarkWriteMessage-16    26088    46485 ns/op    4 B/op    1 allocs/op
BenchmarkWriteMessage-16    25561    46425 ns/op    4 B/op    1 allocs/op
BenchmarkWriteMessage-16    25474    47249 ns/op    4 B/op    1 allocs/op
PASS
ok      github.com/lightningnetwork/lnd/brontide    16.911s
```
```
goos: darwin
goarch: arm64
pkg: github.com/lightningnetwork/lnd/brontide
cpu: Apple M4 Max
                 │   old.txt   │           new2.txt            │
                 │   sec/op    │   sec/op     vs base          │
WriteMessage-16    50.34µ ± 1%   46.68µ ± 1%  -7.28% (p=0.000 n=10)

                 │    old.txt     │            new2.txt           │
                 │      B/op      │    B/op      vs base          │
WriteMessage-16    73788.000 ± 0%   4.000 ± 0%  -99.99% (p=0.000 n=10)

                 │  old.txt   │           new2.txt            │
                 │ allocs/op  │ allocs/op    vs base          │
WriteMessage-16    5.000 ± 0%   1.000 ± 0%  -80.00% (p=0.000 n=10)
```
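A minimal sketch of the nonce idea, using illustrative names rather than the real brontide fields: the 12-byte nonce is a fixed-size array on the struct, so building the per-message nonce no longer calls `make([]byte, 12)` and nothing escapes to the heap.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// cipherStateSketch is illustrative only.
type cipherStateSketch struct {
	nonceBuf [12]byte
	count    uint64 // per-direction message counter
}

func (c *cipherStateSketch) nextNonce() []byte {
	// 4 zero bytes followed by the 64-bit little-endian counter,
	// matching the transport's nonce layout; no allocation per call.
	binary.LittleEndian.PutUint64(c.nonceBuf[4:], c.count)
	c.count++
	return c.nonceBuf[:]
}

func main() {
	var c cipherStateSketch
	fmt.Printf("%x\n", c.nextNonce()) // 000000000000000000000000
	fmt.Printf("%x\n", c.nextNonce()) // 000000000100000000000000
}
```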
In this commit, we eliminate the final allocation that takes place when we write out messages. Once again this was escaping to the heap, so we make it an attribute on the Machine struct, which allows pure static allocation.
```
goos: darwin
goarch: arm64
pkg: github.com/lightningnetwork/lnd/brontide
cpu: Apple M4 Max
BenchmarkWriteMessage-16    25840    46376 ns/op    2 B/op    0 allocs/op
BenchmarkWriteMessage-16    25646    46672 ns/op    2 B/op    0 allocs/op
BenchmarkWriteMessage-16    25874    46391 ns/op    2 B/op    0 allocs/op
BenchmarkWriteMessage-16    25512    46427 ns/op    2 B/op    0 allocs/op
BenchmarkWriteMessage-16    25760    46309 ns/op    2 B/op    0 allocs/op
BenchmarkWriteMessage-16    25789    46520 ns/op    2 B/op    0 allocs/op
BenchmarkWriteMessage-16    25602    46619 ns/op    2 B/op    0 allocs/op
BenchmarkWriteMessage-16    25766    46464 ns/op    2 B/op    0 allocs/op
BenchmarkWriteMessage-16    25820    46487 ns/op    2 B/op    0 allocs/op
BenchmarkWriteMessage-16    25634    46553 ns/op    2 B/op    0 allocs/op
PASS
ok      github.com/lightningnetwork/lnd/brontide    16.907s
```
With this, we now have eliminated 100% of the allocations!
```
goos: darwin
goarch: arm64
pkg: github.com/lightningnetwork/lnd/brontide
cpu: Apple M4 Max
                 │   old.txt   │           final.txt           │
                 │   sec/op    │   sec/op     vs base          │
WriteMessage-16    50.34µ ± 1%   46.48µ ± 0%  -7.68% (p=0.000 n=10)

                 │    old.txt     │            final.txt           │
                 │      B/op      │    B/op      vs base           │
WriteMessage-16    73788.000 ± 0%   2.000 ± 0%  -100.00% (p=0.000 n=10)

                 │  old.txt   │           final.txt            │
                 │ allocs/op  │ allocs/op    vs base           │
WriteMessage-16    5.000 ± 0%   0.000 ± 0%  -100.00% (p=0.000 n=10)
```
In this commit, we add some docs that explain how to use tools like heap escape analysis and memory profiling to fully eliminate allocations in a sample program. This guide is meant to help devs/contributors use Go's excellent perf tools to zoom in on an optimization problem.
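As a quick complement to that guide, here is a hedged snippet showing how `testing.AllocsPerRun` can verify a code path is allocation-free while iterating; the buffered write is only an example workload, not the brontide code itself.

```go
package main

import (
	"bytes"
	"fmt"
	"testing"
)

func main() {
	var buf bytes.Buffer
	msg := make([]byte, 1024)

	// AllocsPerRun reports the average number of heap allocations per
	// call of the closure; a 0.0 here is a quick sanity check between
	// full benchmark runs.
	allocs := testing.AllocsPerRun(100, func() {
		buf.Reset()
		// After the warm-up call grows the buffer once, Reset+Write
		// keeps reusing the same backing array.
		buf.Write(msg)
	})
	fmt.Printf("allocs/op: %.1f\n", allocs)
}
```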
Force-pushed from e829840 to 21c21b7
/gemini review
Impressive results from a simple tweak! Just need to fix the CI, otherwise LGTM🙏
@@ -0,0 +1,378 @@
# The Go Performance Optimization Loop: From Benchmarks to Zero Allocations
❤️
So the net effect is roughly 4 µs saved per WriteMessage in exchange for ~64 KB of extra memory per peer.
I'm not sure this change is a good tradeoff... Network RTTs are already orders of magnitude higher than the performance improvement, even within the same datacenter. A 4 µs improvement is lost in the noise. Meanwhile we increase typical memory usage by 64 KB per peer.
This is more about entirely eliminating allocations when we write messages; the minor (~7%) speed up isn't the main target at all.
Correct that we pre-allocate 64 KB per peer, but this memory allocation is now bounded, and doesn't need to later be marked by the GC to be freed (cycles saved traversing the object graph). Memory utilization will now be smoothed out, dampening the peaks you'd see in a typical graph of allocated memory over time for GC'd languages. A node with 1k active connections does indeed see a static increase of 64 MB, but that doesn't tell the whole story. Before, we'd allocate 5x each time we went to write out a message to the peer (individual buffers for the nonce, header, body, and packet size -- so more objects for the GC to chase). Rather than focusing on that single snapshot of allocation, we should consider the rate of allocation over time. If a single peer sends us a 65 KB message, each time we're generating garbage that later needs to be cleaned up, which adds overhead for the GC. The overhead we save isn't necessarily showing up in this profile, as it's a short-lived benchmark. On the macro level, we increase the overall performance of the entire daemon, as less time is spent on GC cycles (stop-the-world).
Sure, but dampening the peaks by raising the water level isn't always good -- e.g., that's why we don't preallocate 10 GB at LND startup. These allocations are typically very short lived -- they only need to exist long enough to send the outgoing message. They're also typically quite small (orders of magnitude smaller than the 64 KB worst case). And most peers don't constantly send messages. In these conditions, preallocating for the worst case isn't necessarily a win.
Can we write a better benchmark then? Or get a before-and-after full node CPU profile under typical conditions and also under heavy load? Let's use data to drive the optimizations rather than guessing.
In this commit, we address a suuuper old TODO to reset the brontide state. We were allocating a read buf which we now set to nil, and we free up the original brontide reference.
They're also typically quite small (orders of magnitude smaller than the 64 KB worst case). And most peers don't constantly send messages. In these conditions, preallocating for the worst case isn't necessarily a win.
You're not wrong, but the stated goal is zero (heap) allocs, so then you will have to pre-allocate the max message size.
- The percentage change (vs base column) showing the magnitude of improvement
- The p-value (p=0.000) indicating statistical significance - values below 0.05 suggest real improvements rather than noise
The Bayesian in me disagrees, but let's keep a frequentist frame of reference for the statistics.
This is a great guide by the way.
Ultimately the goal is improved performance. Eliminating heap allocs is one way to affect that goal, but this is likely a case where there's no performance benefit 99.9% of the time, since GC is already efficient for the small, infrequent, short-lived allocations that are being eliminated. There may actually be a performance regression when not under DoS attack, since we're now creating large long-lived allocations for every peer instead of small short-lived ones. Every time GC occurs, those large allocations need to be scanned in their entirety, causing GC pauses to last longer. And regardless of the performance improvement or regression, there's a guaranteed cost in memory overhead that this change introduces. This overhead needs to be weighed against whatever potential performance benefit there may be. This is why we need good performance data to drive the optimization cycle. Guessing often produces incorrect results, while increasing code complexity.
This PR aims at adding a worst-case bound, that is 65KB per peer, for the encryption of wire msgs. I think it's a good starting point given this is how we handle writing plaintext; however, I overlooked that we use a pool instead of static upfront allocations:
Lines 2637 to 2652 in 9d74ec4
// Otherwise, this is a new message. We'll acquire a write buffer to
// serialize the message and buffer the ciphertext on the connection.
err := p.cfg.WritePool.Submit(func(buf *bytes.Buffer) error {
	// Using a buffer allocated by the write pool, encode the
	// message directly into the buffer.
	_, writeErr := lnwire.WriteMessage(buf, msg, 0)
	if writeErr != nil {
		return writeErr
	}

	// Finally, write the message itself in a single swoop. This
	// will buffer the ciphertext on the underlying connection. We
	// will defer flushing the message until the write pool has been
	// released.
	return noiseConn.WriteMessage(buf.Bytes())
})
When encoding the plaintext, we will take a buffer from the write pool, and return it once done. The size of the buffer is 65KB, but it's very different from static upfront allocations - if we have 10 peers writing msgs at the same time out of 1000 peers, based on this PR we'd need 65 MB of memory, while we only need 650 KB when using the write pool.
Given we already have this write pool, which can allocate 100 objects at max, so the memory is bounded at 6.5 MB, and it has its own dynamic GC, which releases a buffer when it's not used for 30s, I think we can instead use it. However, the current `WritePool` needs to have more methods than a single `Submit` - specifically, it needs to have a `Take` and a `Return`, which have already been built in `bufferPool`; we just need to expose them in `WritePool`. Then the `writeMessage` shown above will (a rough sketch follows the list below):
- take a buffer from the pool when writing the msg, which includes encoding and encryption.
- return the buffer when it's flushed.
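A minimal sketch of what such a Take/Return pair over a bounded pool could look like -- this is not lnd's actual WritePool or bufferPool API, just the shape of the idea with a channel-backed pool:

```go
package main

import (
	"bytes"
	"fmt"
)

// fixedPool is a hedged sketch: n pre-allocated buffers handed out via
// Take and recycled via Return, bounding memory at n*size.
type fixedPool struct {
	bufs chan *bytes.Buffer
}

func newFixedPool(n, size int) *fixedPool {
	p := &fixedPool{bufs: make(chan *bytes.Buffer, n)}
	for i := 0; i < n; i++ {
		p.bufs <- bytes.NewBuffer(make([]byte, 0, size))
	}
	return p
}

// Take blocks until a buffer is available, so total memory stays
// bounded no matter how many peers write concurrently.
func (p *fixedPool) Take() *bytes.Buffer { return <-p.bufs }

// Return hands the buffer back once the message has been flushed.
func (p *fixedPool) Return(b *bytes.Buffer) {
	b.Reset()
	p.bufs <- b
}

func main() {
	pool := newFixedPool(100, 65535)

	buf := pool.Take()
	buf.WriteString("serialized wire message") // encode + encrypt here
	// ... write the ciphertext to the connection, then flush ...
	pool.Return(buf)

	fmt.Println("buffer returned; pool memory stays bounded")
}
```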
The other aspect I didn't realize was the increased GC pressure - on its own it seems like we could now just freeze a node by increasing the num of peers, but that num is restricted already.
Plan to run this across a few diff nodes before landing.
NACK.
The current benchmark suggests a maximum 7% performance improvement in extremely pathological cases (constant streaming of 64KB messages to the peer). The benchmark is extremely unrealistic, even under a DoS attack, because in practice `Flush` actually waits for at least the network RTT before returning, which takes ~1000x more time than the benchmark currently takes into account. This suggests the actual performance improvement under DoS is basically 0%.
@Roasbeef mentioned that there's other benefit not captured by the benchmark, since GC likely doesn't happen during the short-lived benchmark. ISTM the benchmark already captures GC behavior, since it sends data at a rate higher than 1 GB/s based on the benchmark data in the OP. Even running the benchmark for just 1 second likely causes GC to run multiple times.
So zero benefit, plus ~30x increase in per-Brontide memory is a loss. Until I see data suggesting otherwise, I can't get on board with this.
Finally, the latest commit introduced an extremely trivial DoS vector. Oops.
func (c *Conn) Close() error {
	// TODO(roasbeef): reset brontide state?

	// Clear the state we created to be able to handle this connection.
	c.noise = nil
	c.readBuf = bytes.Buffer{}
This is a null deref waiting to happen. It is common for Brontide goroutines to continue calling `Conn` methods after the connection has been closed.
Currently disconnect is guarded by an atomic variable, so callers to it will only succeed once.
If I look at the other callers into `p.conn`, I see one on the String method, but not many others.
In either case, we can have a similar CAS check on calls into the connection after it's been closed. Thanks for pointing this out, I hadn't got around to testing this yet on an actual node.
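A minimal sketch of that kind of CAS guard, using a hypothetical wrapper type rather than the real brontide.Conn:

```go
package main

import (
	"errors"
	"fmt"
	"sync/atomic"
)

var errConnClosed = errors.New("connection already closed")

// guardedConn is illustrative only: Close flips an atomic flag exactly
// once, and other methods check it before touching state that Close
// releases.
type guardedConn struct {
	closed atomic.Bool
	noise  []byte // stand-in for the noise state cleared on Close
}

func (c *guardedConn) Close() error {
	// Only the first Close performs the teardown.
	if !c.closed.CompareAndSwap(false, true) {
		return nil
	}
	c.noise = nil
	return nil
}

func (c *guardedConn) WriteMessage(msg []byte) error {
	// Calls after Close return an error instead of hitting nil state.
	// (A full implementation would also need a lock to close the
	// window between this check and the use of c.noise.)
	if c.closed.Load() {
		return errConnClosed
	}
	c.noise = append(c.noise[:0], msg...)
	return nil
}

func main() {
	c := &guardedConn{noise: make([]byte, 0, 64)}
	_ = c.WriteMessage([]byte("ping"))
	_ = c.Close()
	fmt.Println(c.WriteMessage([]byte("pong"))) // connection already closed
}
```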
I think this makes a lot of sense. I'll take a stab at this implementation. I think the smaller static buffers can still be kept as part of the struct itself.
I've been running a modified version (a few diff iterations) against some live nodes and there's a clear difference between this and the current state of master. I think your idea is even better though, as it reuses a system we already have to reuse buffers for serializing messages, and we can extend it to use it for the ciphertexts as well.
Thanks everyone for all the feedback. I've run some benchmarks with 3 variants of this PR:

PR as is
With that extra commit, when run with a stress test program (thousands of peers), the PR holds up fairly well. Compared to the others, it allocates more when calling. Here's a flame graph after running the stress test program for a few minutes. In this case, a lot of the overhead is actually creating the peer struct.

PR with buf pools
When adding the buf pools, the issue is that even though we create 100 workers, the pool will continue to allocate new buffers. We use a very small timeout of just 1 ms, so most of the time it drops down to create a new set of buffers. So although the buffers are re-used, we have a lot of allocations, as each time we're basically allocating entirely new buffers. You can see this clearly in the flamegraph.

PR with fixed buf pools
For this variant, I increased the pool's release timeout to 24 hours. This performs better than w/o these changes, as now we're forcing a smaller working set. This has some tradeoffs, as write x-put with many peers may be hampered, as now we'll block for new buffers instead of creating them as needed.

Here's a slop report generated from the raw profile files: https://gist.github.com/Roasbeef/095e3d91639809731253253eabe8a26f

Thoughts?
Thanks for the updates! I prefer the last one, brontide-buf-fixed-pool, as it eliminates memory allocations, reducing GC pressure, improving latency, and improving performance under load through more predictable memory allocation. One remaining concern is leaking buffers from the pool when peers disconnect abruptly. Also, the 24-hour timeout seems too aggressive and could be replaced with a much shorter value, on the order of seconds.
@Roasbeef: Thanks for collecting more data. I don't think the data addresses any of my expressed concerns though. Key information needed to properly optimize:
I expressed concerns that this PR increases memory usage vs master. So we need memory usage data for master to properly evaluate that.

I also expressed concerns that this PR will perform poorly under regular traffic, potentially even increasing CPU usage. We need CPU profiles to evaluate that. We especially need CPU data for time spent during GC. For example, if the master branch only spends 5% of its CPU time in GC, then eliminating allocations across all of LND could only increase performance by ~5%! If that's the case, increasing code complexity to eliminate a small fraction of allocations was silly from the start, and all the discussion on this PR has wasted everyone's time.

@yyforyongyu: I am skeptical that using the buffer pool will provide good memory or CPU benefit in practice. Why create our own garbage collection scheme instead of relying on Go's optimized garbage collector? Implementing a GC is tricky, which @Roasbeef's "PR with buf pools" clearly demonstrates -- it increases memory usage due to slow memory release.

The "PR with fixed buf pools" makes a memory-time tradeoff -- it imposes a strong memory limit by introducing blocking behavior. We really need data about how much additional latency is added by this change to decide if the tradeoff is worthwhile. Since previously there was a maximum 1ms pause and now it blocks forever, this could have a major impact on node responsiveness to peer messages. Even a few seconds of blocking would cause a huge increase in send latency, something like 100x. If this happens frequently, it could be a big problem for users.

There's another variant that I think would be worth benchmarking. It would be the simplest variant considered so far, and I think it will provide the same or better memory usage in practice as any of the other variants.

0001-brontide-free-pending-message-after-flushing.patch

The main idea is to work with the garbage collector rather than trying to avoid it. We just drop the reference to the pending message as soon as we no longer need it, so that we no longer prevent the garbage collector from reclaiming that buffer. This is a big improvement over master, which currently keeps a reference until a new message needs to be written out or until the entire Brontide is garbage collected. This should provide memory benefit both under DoS and with regular usage. No new CPU overhead.
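For illustration, a hedged sketch of the "free after flush" idea -- simplified, ignoring partial writes, and not the contents of the actual patch or the real brontide.Machine:

```go
package main

import (
	"bytes"
	"fmt"
	"io"
)

// machineSketch only shows the shape of dropping the pending
// ciphertext reference once it has been written out.
type machineSketch struct {
	nextBodySend []byte // pending ciphertext awaiting Flush
}

func (m *machineSketch) Flush(w io.Writer) (int, error) {
	if m.nextBodySend == nil {
		return 0, nil
	}
	n, err := w.Write(m.nextBodySend)
	if err != nil {
		return n, err
	}
	// Drop the reference immediately so the GC can reclaim the buffer
	// now, rather than when the next message overwrites it (or when
	// the whole machine is collected).
	m.nextBodySend = nil
	return n, nil
}

func main() {
	m := &machineSketch{nextBodySend: []byte("ciphertext")}
	var sink bytes.Buffer
	n, _ := m.Flush(&sink)
	fmt.Println(n, m.nextBodySend == nil) // 10 true
}
```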
It just simulates the case of a routing node with maxed out restricted connections, so a case where each new connection hits the limit, failing, and maybe eventually succeeding once a new slot frees up. I still have the profiles lying around so can upload them somewhere.
I haven't observed that at all in my tests, at least not non-GC cpu utilization. What mechanism in the PR do you think would lead to additional active CPU usage?
I think for something like this, we'd need to pin down a given node profile. As an example, a leaf node with only private channels has a very different access pattern from a very large routing node with many forwarding attempts, mobile clients attempting to connect to it, etc. The amount of time spent in GC also varies across the lifetime of a node. E.g., downloading the channel graph for the very first time generates a lot more garbage than idle operation. Just to set a base here: we're not after some grand optimization across all of `lnd`.
Are you referring to this PR as is? There's virtually no code complexity increase, it just moves around some allocations. For the other two PRs, the added code diff is also pretty minor (the largest diff being the pointer swapping in one of the branches). Do you have explicit concerns with either of them? Granted, the buffer change shifts around some of the callers a bit. I wouldn't say the discussion in this PR wasted anyone's time; it's just another PR of the many open PRs that we all review daily/weekly to continually work towards improving `lnd`.
Pooling medium to short lived objects for re-use is a pretty standard pattern, even outside GC languages. Go even provides the `sync.Pool` type in the standard library for exactly this purpose.
I think the question we're attempting to answer is: is it worthwhile to try to clamp down on the total amount and the rate of allocations to service peers? If we're able to do that, making a tradeoff of latency (10s of ms is fine, we're dealing with round trip delays anyway), then we're able to reduce the memory utilization of the daemon overall, as we're clamping the amount of live garbage we generate due to p2p processing.
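For reference, a small sketch of standard-library pooling with `sync.Pool` (not the pool implementation used in lnd): buffers that go unused are eventually released back to the GC, so there is no hand-rolled release timer.

```go
package main

import (
	"bytes"
	"fmt"
	"sync"
)

// bufPool hands out 64 KB-capacity buffers and recycles them across
// callers; idle buffers are reclaimed by the runtime over time.
var bufPool = sync.Pool{
	New: func() any { return bytes.NewBuffer(make([]byte, 0, 65535)) },
}

func withBuffer(f func(*bytes.Buffer)) {
	buf := bufPool.Get().(*bytes.Buffer)
	buf.Reset()
	defer bufPool.Put(buf)
	f(buf)
}

func main() {
	withBuffer(func(b *bytes.Buffer) {
		b.WriteString("serialized message") // encode + encrypt here
		fmt.Println(b.Len())
	})
}
```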
Here are the profiles from the initial tests: noise-buf-allocs-20250911-154557.pb.gz
Here's an updated version of that slop report: https://gist.github.com/Roasbeef/095e3d91639809731253253eabe8a26f/revisions
I got a bit nerd sniped and entered optimizoooor mode.
In this PR, we eliminate all the allocations from the `WriteMessage`+`Flush` methods. To achieve this, we mainly ensure that we re-use static buffers, and that none of those buffers escape to the heap. I also added some documentation that describes the set of tools/techniques I used, so future devs/contributors can follow a similar optimizooooor loop.
At the start of this commit series, here's the benchmark output:
At the end of the series, here's where we stand:
Comparing them, we see all allocations have been eliminated: