@ciarams87 ciarams87 commented Sep 15, 2025

Proposed changes

Problem: During ServiceAccountToken rotation, nginx-gateway-fabric sometimes deadlocked because the Subscribe method incorrectly intercepted initial configuration operations after the rotation and incorrectly signaled broadcast completion.

Solution: Implemented a pendingRequest tracking mechanism to distinguish between initial setup operations and broadcast operations:

  • Initial config operations: Do not signal ResponseCh (prevents spurious broadcast signals)
  • Broadcast operations: Only signal ResponseCh when there's a pending broadcast request
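
A minimal sketch of the idea behind the two rules above, not the code from this PR. The names broadcastTracker, beginBroadcast, onApplied, and responseCh are hypothetical and chosen only for illustration; the sketch assumes a single flag guarded by a mutex is enough to tell a pending broadcast apart from an initial config operation.

```go
package main

import "sync"

// broadcastTracker is a hypothetical type for illustration only; it is not
// the type used in nginx-gateway-fabric.
type broadcastTracker struct {
	mu         sync.Mutex
	pending    bool          // true while a broadcast is waiting for its response
	responseCh chan struct{} // buffered (size 1) so a signal never blocks the sender
}

func newBroadcastTracker() *broadcastTracker {
	return &broadcastTracker{responseCh: make(chan struct{}, 1)}
}

// beginBroadcast records that a broadcast request is now in flight.
func (b *broadcastTracker) beginBroadcast() {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.pending = true
}

// onApplied is called when a config apply finishes. It signals responseCh
// only when a broadcast is pending and reports whether it tried to signal.
func (b *broadcastTracker) onApplied() bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	if !b.pending {
		// Initial config operation: no broadcaster is waiting, so stay silent.
		return false
	}
	b.pending = false
	select {
	case b.responseCh <- struct{}{}:
	default:
		// An earlier signal is still unread; drop this one instead of blocking.
	}
	return true
}
```

With this shape, calling onApplied before any beginBroadcast leaves responseCh empty, so a reconnecting subscriber that is still applying its initial configuration cannot produce a spurious broadcast-completion signal.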

Also added logging to make debugging easier.

Testing: Updated unit tests; existing functional and conformance tests pass. Reran the longevity test (with the corresponding Agent fixes), which now works.

Partially closes #3626

Checklist

Before creating a PR, run through this checklist and mark each as complete.

  • I have read the CONTRIBUTING doc
  • I have added tests that prove my fix is effective or that my feature works
  • I have checked that all unit tests pass after adding my changes
  • I have updated necessary documentation
  • I have rebased my branch onto main
  • I will ensure my PR is targeting the main branch and pulling from my branch from my own fork

Release notes

If this PR introduces a change that affects users and needs to be mentioned in the release notes,
please add a brief note that summarizes the change.

Fixed a bug where the Subscribe method incorrectly intercepted initial configuration operations after a ServiceAccountToken rotation and incorrectly signaled broadcast completion.

@github-actions github-actions bot added the bug Something isn't working label Sep 15, 2025

codecov bot commented Sep 15, 2025

Codecov Report

❌ Patch coverage is 56.75676% with 16 lines in your changes missing coverage. Please review.
✅ Project coverage is 86.78%. Comparing base (5c3fc1b) to head (7fabec5).
⚠️ Report is 1 commit behind head on main.

Files with missing lines | Patch % | Lines
internal/controller/nginx/agent/file.go | 25.00% | 8 Missing and 1 partial ⚠️
internal/controller/nginx/agent/command.go | 58.82% | 6 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3905      +/-   ##
==========================================
- Coverage   86.86%   86.78%   -0.09%     
==========================================
  Files         128      128              
  Lines       16519    16547      +28     
  Branches       62       62              
==========================================
+ Hits        14350    14361      +11     
- Misses       1990     2005      +15     
- Partials      179      181       +2     


@ciarams87 ciarams87 changed the title from "Improve connection reset handling" to "Improve connection reset handling during ServiceAccountToken rotation" Sep 15, 2025
@ciarams87 ciarams87 force-pushed the fix/improve-conn-reset-handling branch from f8f59ec to 7429e91 Compare September 15, 2025 16:10
@ciarams87 ciarams87 marked this pull request as ready for review September 15, 2025 16:10
@ciarams87 ciarams87 requested a review from a team as a code owner September 15, 2025 16:10
@ciarams87 ciarams87 force-pushed the fix/improve-conn-reset-handling branch from 7429e91 to 47e712c Compare September 16, 2025 08:18
@ciarams87 ciarams87 force-pushed the fix/improve-conn-reset-handling branch 2 times, most recently from 352806c to 5d7fef2 Compare September 17, 2025 15:09

@bjee19 bjee19 left a comment

Changes help a bunch in clarity, thanks!

@ciarams87 ciarams87 force-pushed the fix/improve-conn-reset-handling branch from 5d7fef2 to 7fabec5 Compare September 17, 2025 16:43
@ciarams87 ciarams87 enabled auto-merge (squash) September 17, 2025 17:42
@ciarams87 ciarams87 merged commit 24e0cb6 into main Sep 17, 2025
64 of 66 checks passed
@ciarams87 ciarams87 deleted the fix/improve-conn-reset-handling branch September 17, 2025 17:43
@github-project-automation github-project-automation bot moved this from 🆕 New to ✅ Done in NGINX Gateway Fabric Sep 17, 2025
ciarams87 added a commit that referenced this pull request Sep 17, 2025
…#3905)

Problem: During ServiceAccountToken rotation, nginx-gateway-fabric sometimes deadlocked because the Subscribe method incorrectly intercepted initial configuration operations after the rotation and incorrectly signaled broadcast completion.

Solution: Implemented a pendingRequest tracking mechanism to distinguish between initial setup operations and broadcast operations.
ciarams87 added a commit that referenced this pull request Sep 17, 2025
…#3905) (#3932)

Problem: During ServiceAccountToken rotation, nginx-gateway-fabric sometimes deadlocked because the Subscribe method incorrectly intercepted initial configuration operations after the rotation and incorrectly signaled broadcast completion.

Solution: Implemented a pendingRequest tracking mechanism to distinguish between initial setup operations and broadcast operations.
Labels
bug Something isn't working release-notes
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

Data plane does not sync upstream servers IPs
3 participants