-
Notifications
You must be signed in to change notification settings - Fork 6.4k
fix: Improve conversion webhook error isolation (fixes #20828) #23425
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
fix: Improve conversion webhook error isolation (fixes #20828) #23425
Conversation
❌ Preview Environment undeployed from BunnyshellAvailable commands (reply to this comment):
|
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## master #23425 +/- ##
=========================================
Coverage ? 57.83%
=========================================
Files ? 349
Lines ? 60359
Branches ? 0
=========================================
Hits ? 34910
Misses ? 22601
Partials ? 2848 ☔ View full report in Codecov by Sentry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As discussed in Contributor meeting. This should document that is the experience when the cache is incomplete and we decide not to fail and proceed with an uncompleted cache.
- What are the impact on the user experience?
- How can we document to or surface it to operator? Failure metrics + logs?
- If there is an impact on an Application, how can we surface the reason why the application is in a "partially functional" state? Application Warning conditions?
16a7503
to
79348a5
Compare
79348a5
to
23aa005
Compare
45dd1c9
to
5bafcf7
Compare
@agaudreault here is a google drive link to videos with broken and fixed examples. scripts I used can be found in https://github.com/jcogilvie/conversion-webhook-repro |
eb13003
to
d0b3b80
Compare
Signed-off-by: Jonathan Ogilvie <[email protected]>
Signed-off-by: Jonathan Ogilvie <[email protected]> # Conflicts: # controller/appcontroller_test.go
Signed-off-by: Jonathan Ogilvie <[email protected]>
…ge; provide degraded app status on conversion errors Signed-off-by: Jonathan Ogilvie <[email protected]>
Signed-off-by: Jonathan Ogilvie <[email protected]>
Signed-off-by: Jonathan Ogilvie <[email protected]> # Conflicts: # test/e2e/conversion_webhook_failure_isolation_test.go
Signed-off-by: Jonathan Ogilvie <[email protected]>
Signed-off-by: Jonathan Ogilvie <[email protected]>
Signed-off-by: Jonathan Ogilvie <[email protected]>
Signed-off-by: Jonathan Ogilvie <[email protected]>
Signed-off-by: Jonathan Ogilvie <[email protected]>
Signed-off-by: Jonathan Ogilvie <[email protected]>
Signed-off-by: Jonathan Ogilvie <[email protected]>
…n tests Signed-off-by: Jonathan Ogilvie <[email protected]>
da89e79
to
5978362
Compare
@@ -299,6 +299,7 @@ require ( | |||
) | |||
|
|||
replace ( | |||
github.com/argoproj/gitops-engine => github.com/Sanyaku/gitops-engine v0.0.0-20250827211909-3b71e335c677 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
obviously this will have to be removed prior to merge, and after my upstream PR into gitops-engine merges.
Fixes #20828
Requires argoproj/gitops-engine#772
Currently when fetching the cluster cache, the unavailability of any single conversion webhook can halt the process and result in all applications going
Unknown
. This PR adds a test to reproduce the scenario and adds (admittedly ham-handed) error detection for this case to allow cache processing to continue in a degraded state, alleviating the issue.Checklist:
Summary by way of claude.
Cluster Cache Taint Management and Conversion Webhook Error Isolation
Overview
This PR introduces a comprehensive error isolation system for ArgoCD that intelligently handles cluster-level failures, particularly conversion webhook errors, while maintaining application availability for unaffected resources.
Key Features
1. Cluster Cache Taint Management System
2. Intelligent Error Classification
3. Enhanced Observability
Technical Implementation
Core Components
controller/cache/cache.go
: Implements taint manager and partial cache logiccontroller/state.go
: Integrates tainted GVK information into application state managementserver/cluster/cluster.go
: Exposes tainted GVK information via cluster APIutil/errors/webhook.go
: Specialized conversion webhook error detectionAPI Changes
TaintedGVKs []string
added to cluster status for API exposureUI Enhancements
Testing Coverage
Benefits
Configuration
Files Modified: 50 files (+4,254 lines, -824 lines)
Backward Compatibility: Fully maintained
UI Components: Enhanced cluster management with degraded state visualization
This implementation addresses critical production scenarios where conversion webhooks or other cluster-level failures should not impact unrelated application deployments, significantly improving ArgoCD's resilience in complex Kubernetes environments.