[AdaptiveSampling] Improve sampling rule propagation + improvements #1186
+398
−164
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Problems
Propagators
Customers using different environments are provided different default propagators, e.g.
tracecontext,baggage,xray
vsbaggage,xray,tracecontext
, resulting in differing behaviour between. This is especially true because the AwsXrayPropagator is currently bugged and does not allow for tracestate propagation between services if it after the regulartracecontext
propagator in the list of propagators. This bug exists in Java as well as in other languages.Inaccurate anomaly statistics
Anomalies are currently always counted so long as the rule propagated from an upstream service or the otherwise matched sampling rule have boost enabled. However, this assumes that the matched rule is relevant to the current call chain. Instead, we want to check if the call came from a service that is already instrumented by checking if the provided parentContext is valid.
Changes
Propagators
Inaccurate anomaly statistics
shouldSample
to check if the upstream context is valid in case we don't want to propagate the matched sampling rule, concluding that the current service is not the root serviceGeneral refactors and improvements
AnomalyDetectionResult isAnomaly(ReadableSpan span, SpanData spanData)
which returns an object, e.g.result
, that providesresult.shouldBoostSampling()
andresult. shouldCaptureAnomalySpan()
void updateTraceUsageCache(String traceId, boolean isSpanCaptured, boolean isCountedAsAnomalyForBoost)
anomalyConditions: []
instead of needinganomalyConditions: [{}]
- more intuitive(anomalyConditions != null && !anomalyConditions.isEmpty())
to(anomalyConditions != null)
Testing
Deployed to demo environment and tested the following scenarios where each scenario is a list of services called in series instrumented with the listed SDK version in Java
tracecontext,baggage,xray
-> new SDK: Validated third service still provided boost statistics for the first service despite a broken xray propagator in between, relying on baggage instead of trace statebaggage,xray,tracecontext
-> new SDK: Validated third service provided boost statistics for the first service on the happy case where trace state is propagated correctlyBy submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.