Skip to content

Conversation

jcogilvie
Copy link

@jcogilvie jcogilvie commented Jun 16, 2025

Fixes #20828
Requires argoproj/gitops-engine#772

Currently when fetching the cluster cache, the unavailability of any single conversion webhook can halt the process and result in all applications going Unknown. This PR adds a test to reproduce the scenario and adds (admittedly ham-handed) error detection for this case to allow cache processing to continue in a degraded state, alleviating the issue.

Checklist:

  • Either (a) I've created an enhancement proposal and discussed it with the community, (b) this is a bug fix, or (c) this does not need to be in the release notes.
  • The title of the PR states what changed and the related issues number (used for the release note).
  • The title of the PR conforms to the Toolchain Guide
  • I've included "Closes [ISSUE #]" or "Fixes [ISSUE #]" in the description to automatically close the associated issue.
  • I've updated both the CLI and UI to expose my feature, or I plan to submit a second PR with them.
    • n/a
  • Does this PR require documentation updates?
    • no
  • I've updated documentation as required by this PR.
  • I have signed off all my commits as required by DCO
  • I have written unit and/or e2e tests for my change. PRs without these are unlikely to be merged.
  • My build is green (troubleshooting builds).
  • My new feature complies with the feature status guidelines.
  • I have added a brief description of why this PR is necessary and/or what this PR solves.
  • Optional. My organization is added to USERS.md.
  • Optional. For bug fixes, I've indicated what older releases this fix should be cherry-picked into (this may or may not happen depending on risk/complexity).

Summary by way of claude.

Cluster Cache Taint Management and Conversion Webhook Error Isolation

Overview

This PR introduces a comprehensive error isolation system for ArgoCD that intelligently handles cluster-level failures, particularly conversion webhook errors, while maintaining application availability for unaffected resources.

Key Features

1. Cluster Cache Taint Management System

  • Tainted GVK Tracking: Implements a thread-safe taint manager that tracks failed Group-Version-Kinds (GVKs) per cluster with configurable TTL (default: 10 minutes)
  • Partial Cache Strategy: When specific GVKs fail (e.g., due to conversion webhook errors), the system continues serving cached data for working resources while isolating failures
  • Automatic Recovery: Tainted GVKs automatically expire and recover when the underlying issues are resolved

2. Intelligent Error Classification

  • Two-Stage Validation: Combines tainted GVK presence with error pattern matching for precise failure isolation
  • Conversion Webhook Detection: Specialized error patterns detect conversion webhook failures and handle them gracefully
  • Graceful Degradation: Applications using unaffected resources remain healthy while problematic resources are properly isolated

3. Enhanced Observability

  • Cluster Status Integration: Cluster management UI now displays degraded status when tainted GVKs are present
  • Application Health Propagation: Applications affected by tainted GVKs receive appropriate degraded health status
  • Detailed Error Context: Enhanced error reporting provides clear visibility into conversion webhook and other cluster-level failures

Technical Implementation

Core Components

  • controller/cache/cache.go: Implements taint manager and partial cache logic
  • controller/state.go: Integrates tainted GVK information into application state management
  • server/cluster/cluster.go: Exposes tainted GVK information via cluster API
  • util/errors/webhook.go: Specialized conversion webhook error detection

API Changes

  • New field: TaintedGVKs []string added to cluster status for API exposure
  • Backward compatible: All changes maintain existing API compatibility

UI Enhancements

  • Cluster Details Page: Displays tainted GVKs and degraded status
  • Visual Indicators: Clear UI feedback for cluster health issues
  • E2E Test Coverage: Comprehensive browser automation tests for UI components

Testing Coverage

  • 2 new E2E tests covering cluster degraded state and conversion webhook failure isolation scenarios
  • Comprehensive unit test coverage across controller, cache, and cluster management components
  • UI automation tests for cluster details page and tainted GVK visualization

Benefits

  1. Improved Resilience: Applications using working resources remain operational during partial cluster failures
  2. Better UX: Clear visibility into cluster health issues through UI and API
  3. Automatic Recovery: System self-heals when underlying issues are resolved
  4. Production Safety: Isolated failures prevent cascade effects across unrelated applications

Configuration

  • TTL Configurable: Taint expiration time can be adjusted per deployment needs
  • Pattern Matching: Conversion webhook error patterns can be extended for new scenarios
  • Feature Flags: Can be disabled if needed for rollback scenarios

Files Modified: 50 files (+4,254 lines, -824 lines)
Backward Compatibility: Fully maintained
UI Components: Enhanced cluster management with degraded state visualization

This implementation addresses critical production scenarios where conversion webhooks or other cluster-level failures should not impact unrelated application deployments, significantly improving ArgoCD's resilience in complex Kubernetes environments.

@jcogilvie jcogilvie requested a review from a team as a code owner June 16, 2025 17:29
Copy link

bunnyshell bot commented Jun 16, 2025

❌ Preview Environment undeployed from Bunnyshell

Available commands (reply to this comment):

  • 🚀 /bns:deploy to deploy the environment

Copy link

codecov bot commented Jun 16, 2025

Codecov Report

❌ Patch coverage is 75.71429% with 119 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (master@88a32d6). Learn more about missing BASE report.
⚠️ Report is 1 commits behind head on master.

Files with missing lines Patch % Lines
controller/appcontroller.go 70.07% 34 Missing and 7 partials ⚠️
controller/state.go 53.33% 34 Missing and 1 partial ⚠️
util/errors/webhook.go 0.00% 16 Missing ⚠️
server/cluster/cluster.go 76.00% 12 Missing ⚠️
controller/cache/cache.go 94.87% 7 Missing and 1 partial ⚠️
controller/clusterinfoupdater.go 89.18% 3 Missing and 1 partial ⚠️
controller/sync.go 66.66% 1 Missing and 2 partials ⚠️
Additional details and impacted files
@@            Coverage Diff            @@
##             master   #23425   +/-   ##
=========================================
  Coverage          ?   57.83%           
=========================================
  Files             ?      349           
  Lines             ?    60359           
  Branches          ?        0           
=========================================
  Hits              ?    34910           
  Misses            ?    22601           
  Partials          ?     2848           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Member

@agaudreault agaudreault left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed in Contributor meeting. This should document that is the experience when the cache is incomplete and we decide not to fail and proceed with an uncompleted cache.

  • What are the impact on the user experience?
  • How can we document to or surface it to operator? Failure metrics + logs?
  • If there is an impact on an Application, how can we surface the reason why the application is in a "partially functional" state? Application Warning conditions?

@jcogilvie jcogilvie force-pushed the conversion-webhook-failure-isolation branch from 16a7503 to 79348a5 Compare August 16, 2025 13:44
@jcogilvie jcogilvie force-pushed the conversion-webhook-failure-isolation branch from 79348a5 to 23aa005 Compare August 28, 2025 20:17
@jcogilvie jcogilvie requested a review from a team as a code owner August 28, 2025 22:32
@jcogilvie jcogilvie force-pushed the conversion-webhook-failure-isolation branch 2 times, most recently from 45dd1c9 to 5bafcf7 Compare August 29, 2025 16:02
@jcogilvie
Copy link
Author

@agaudreault here is a google drive link to videos with broken and fixed examples.

scripts I used can be found in https://github.com/jcogilvie/conversion-webhook-repro

Signed-off-by: Jonathan Ogilvie <[email protected]>

# Conflicts:
#	controller/appcontroller_test.go
Signed-off-by: Jonathan Ogilvie <[email protected]>
…ge; provide degraded app status on conversion errors

Signed-off-by: Jonathan Ogilvie <[email protected]>
Signed-off-by: Jonathan Ogilvie <[email protected]>

# Conflicts:
#	test/e2e/conversion_webhook_failure_isolation_test.go
Signed-off-by: Jonathan Ogilvie <[email protected]>
Signed-off-by: Jonathan Ogilvie <[email protected]>
Signed-off-by: Jonathan Ogilvie <[email protected]>
Signed-off-by: Jonathan Ogilvie <[email protected]>
Signed-off-by: Jonathan Ogilvie <[email protected]>
@jcogilvie jcogilvie force-pushed the conversion-webhook-failure-isolation branch from da89e79 to 5978362 Compare September 2, 2025 21:50
@@ -299,6 +299,7 @@ require (
)

replace (
github.com/argoproj/gitops-engine => github.com/Sanyaku/gitops-engine v0.0.0-20250827211909-3b71e335c677
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

obviously this will have to be removed prior to merge, and after my upstream PR into gitops-engine merges.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

If any conversion webhook on any CRD isn't available, all apps on the cluster go to an "unknown" state.
2 participants