-
Notifications
You must be signed in to change notification settings - Fork 188
fix: ensure EDOT subprocess shuts down gracefully on agent termination #9886
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
fix: ensure EDOT subprocess shuts down gracefully on agent termination #9886
Conversation
0194d5f
to
e114ea5
Compare
e114ea5
to
455fd59
Compare
Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A couple of nitpicks and a question on the select statement
|
💚 Build Succeeded
History
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, some minor nitpicks and a non-blocking question.
if shutdownDelay > 0 { | ||
<-time.After(shutdownDelay) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can't we move this check above the conditional instruction? It's identical in both branches.
@@ -402,6 +405,15 @@ func (m *OTelManager) Update(cfg *confmap.Conf, components []component.Component | |||
collectorCfg: cfg, | |||
components: components, | |||
} | |||
|
|||
// we care only about the latest config update | |||
select { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the caller waiting for 30 seconds, blocked on the update channel, while the collector process exits, what we're trying to avoid here? I don't mind the solution, but whenever I see this pattern, I wonder if there's an architecturally more elegant way of dealing with this. Or, if it isn't simpler to have a larger buffered channel and return a serious error if it's full. WDYT? Mostly thinking out loud, this shouldn't be a blocker for merging.
err := os.Setenv("TEST_SUPERVISED_COLLECTOR_DELAY", delayDuration.String()) | ||
require.NoError(t, err, "failed to set TEST_SUPERVISED_COLLECTOR_DELAY env var") | ||
t.Cleanup(func() { | ||
_ = os.Unsetenv("TEST_SUPERVISED_COLLECTOR_DELAY") | ||
}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think using t.SetEnv would give the same effect with less code.
err := os.Setenv("TEST_SUPERVISED_COLLECTOR_DELAY", delayDuration.String()) | ||
require.NoError(t, err, "failed to set TEST_SUPERVISED_COLLECTOR_DELAY env var") | ||
t.Cleanup(func() { | ||
_ = os.Unsetenv("TEST_SUPERVISED_COLLECTOR_DELAY") | ||
}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See above.
@@ -34,6 +34,8 @@ type ExecutionMode string | |||
const ( | |||
SubprocessExecutionMode ExecutionMode = "subprocess" | |||
EmbeddedExecutionMode ExecutionMode = "embedded" | |||
// waitTimeForStop is the time to wait for the collector to stop before killing it. | |||
waitTimeForStop = 30 * time.Second |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We probably want this to be configurable in case it turns out to be too long in some circumstances. Not urgent or blocking, we can do this later.
What does this PR do?
Ensures Elastic Agent gracefully shuts down the EDOT (collector) subprocess when the agent receives a terminating signal (e.g.,
SIGTERM
) - i.e., when theOTelManager
context is cancelled - instead of immediately killing it.Key changes:
collectorHandle.Stop(ctx context.Context)
withStop(waitTime time.Duration)
and honor it in both embedded and subprocess execution modes.waitTimeForStop
to 30s (aligned with Beats subprocess defaults).updateCh
channel buffered to a size of 1 and drain before write (pattern used elsewhere) to avoid shutdown/reconfig delays.TEST_SUPERVISED_COLLECTOR_DELAY
.Why is it important?
Previously, on shutdown the agent could terminate in a way that kills EDOT immediately, risking incomplete cleanup of telemetry pipelines and leaving resources in a bad state. Waiting (up to 30s) for a graceful EDOT exit improves cleanup, stability, and hybrid agent resilience.
Checklist
./changelog/fragments
using the changelog toolDisruptive User Impact
No disruptive impact expected. The only behavior change is during shutdown: the agent now waits up to 30s for EDOT to exit gracefully before force-killing it. This improves stability and predictability for users..
How to test this PR locally
Related issues
N/A