Skip to content

Conversation

pkoutsovasilis
Copy link
Contributor

@pkoutsovasilis pkoutsovasilis commented Sep 11, 2025

What does this PR do?

Ensures Elastic Agent gracefully shuts down the EDOT (collector) subprocess when the agent receives a terminating signal (e.g., SIGTERM) - i.e., when the OTelManager context is cancelled - instead of immediately killing it.

Key changes:

  • Replace collectorHandle.Stop(ctx context.Context) with Stop(waitTime time.Duration) and honor it in both embedded and subprocess execution modes.
  • Set waitTimeForStop to 30s (aligned with Beats subprocess defaults).
  • Add warnings when the supervised collector fails to stop gracefully or times out.
  • Make updateCh channel buffered to a size of 1 and drain before write (pattern used elsewhere) to avoid shutdown/reconfig delays.
  • Unit-Tests:
    • Extend test harness to allow configurable shutdown delays via TEST_SUPERVISED_COLLECTOR_DELAY.
    • Validate both outcomes:
      • Delay > 30s ⇒ subprocess is force-killed.
      • Delay < 30s ⇒ subprocess exits cleanly within the timeout.

Why is it important?

Previously, on shutdown the agent could terminate in a way that kills EDOT immediately, risking incomplete cleanup of telemetry pipelines and leaving resources in a bad state. Waiting (up to 30s) for a graceful EDOT exit improves cleanup, stability, and hybrid agent resilience.

Checklist

  • I have read and understood the pull request guidelines of this project.
  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in ./changelog/fragments using the changelog tool
  • I have added an integration test or an E2E test

Disruptive User Impact

No disruptive impact expected. The only behavior change is during shutdown: the agent now waits up to 30s for EDOT to exit gracefully before force-killing it. This improves stability and predictability for users..

How to test this PR locally

mage unitTest

Related issues

N/A

@pkoutsovasilis pkoutsovasilis self-assigned this Sep 11, 2025
@pkoutsovasilis pkoutsovasilis added Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team skip-changelog backport-8.19 Automated backport to the 8.19 branch labels Sep 11, 2025
@pkoutsovasilis pkoutsovasilis force-pushed the fix/gracefully_shutdown_edot_if_agent_shutdown branch from 0194d5f to e114ea5 Compare September 11, 2025 13:27
@pkoutsovasilis pkoutsovasilis force-pushed the fix/gracefully_shutdown_edot_if_agent_shutdown branch from e114ea5 to 455fd59 Compare September 11, 2025 15:32
@pkoutsovasilis pkoutsovasilis marked this pull request as ready for review September 11, 2025 20:50
@pkoutsovasilis pkoutsovasilis requested a review from a team as a code owner September 11, 2025 20:50
@elasticmachine
Copy link
Collaborator

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

Copy link
Member

@pchila pchila left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple of nitpicks and a question on the select statement

Copy link

@elasticmachine
Copy link
Collaborator

💚 Build Succeeded

History

cc @pkoutsovasilis

Copy link
Member

@pchila pchila left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Contributor

@swiatekm swiatekm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, some minor nitpicks and a non-blocking question.

Comment on lines +50 to +52
if shutdownDelay > 0 {
<-time.After(shutdownDelay)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't we move this check above the conditional instruction? It's identical in both branches.

@@ -402,6 +405,15 @@ func (m *OTelManager) Update(cfg *confmap.Conf, components []component.Component
collectorCfg: cfg,
components: components,
}

// we care only about the latest config update
select {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the caller waiting for 30 seconds, blocked on the update channel, while the collector process exits, what we're trying to avoid here? I don't mind the solution, but whenever I see this pattern, I wonder if there's an architecturally more elegant way of dealing with this. Or, if it isn't simpler to have a larger buffered channel and return a serious error if it's full. WDYT? Mostly thinking out loud, this shouldn't be a blocker for merging.

Comment on lines +505 to +509
err := os.Setenv("TEST_SUPERVISED_COLLECTOR_DELAY", delayDuration.String())
require.NoError(t, err, "failed to set TEST_SUPERVISED_COLLECTOR_DELAY env var")
t.Cleanup(func() {
_ = os.Unsetenv("TEST_SUPERVISED_COLLECTOR_DELAY")
})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think using t.SetEnv would give the same effect with less code.

Comment on lines +562 to +566
err := os.Setenv("TEST_SUPERVISED_COLLECTOR_DELAY", delayDuration.String())
require.NoError(t, err, "failed to set TEST_SUPERVISED_COLLECTOR_DELAY env var")
t.Cleanup(func() {
_ = os.Unsetenv("TEST_SUPERVISED_COLLECTOR_DELAY")
})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See above.

@@ -34,6 +34,8 @@ type ExecutionMode string
const (
SubprocessExecutionMode ExecutionMode = "subprocess"
EmbeddedExecutionMode ExecutionMode = "embedded"
// waitTimeForStop is the time to wait for the collector to stop before killing it.
waitTimeForStop = 30 * time.Second
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We probably want this to be configurable in case it turns out to be too long in some circumstances. Not urgent or blocking, we can do this later.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-8.19 Automated backport to the 8.19 branch skip-changelog Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants