bug: Fixes bug causing a hang on Windows #2808
Merged
+160
−66
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Release Summary:
Fixed bug that caused a hang on Windows platforms.
Resolved issues:
resolves #2220
Background:
s2n-quic has a bug that causes a hang when receiving data on s2n-quic in Windows. The root cause of this issue is that the consumer of the shared ring buffer resets each payload to be its original length without syncing this change in between primary and secondary memory regions. This causes some payloads to remain very small, making it so that the Producer can only make tiny writes to the un-synced payloads. In Windows eventually this causes a hang.
Notably, this issue was not present on unix as the msghdr struct accesses payload lengths through pointers, meaning that primary and secondary memory regions do not need to be synced for payload length changes.
Another interesting point is that this problem could be seen in our integration tests, even though it did not cause them to fail. I opened #2805 to see if we can fix this.
Description of changes:
I changed the consumer to sync only the payload lengths for each entry that the producer touched. The reason why I did not go with the solution in this PR #2786 is that I am worried about adding another copy call to a hot codepath. I also considered #2806, but that feels like a larger change than this fix.
Huge credit to BiagioFesta for helping with the investigation of this bug as well as helping with a fix. This PR is mostly code pulled from PR #2786.
Call-outs:
Note that the MTU amount of data sent had to be increased as we are not dropping packets in these tests anymore. As a result, the test speeds up and we don't have enough RTTs to finish the MTU search.
Testing:
Includes a new test for the consumer modifying the data. I also no longer see DecryptionFailed events or truncated packets in the MTU tests after this change.
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.