Skip to content

Commit acb9ec3

Browse files
MadLittleModsanoadragon453
authored andcommitted
Fix run_coroutine_in_background(...) incorrectly handling logcontext (#18964)
Regressed in #18900 (comment) (see conversation there for more context) ### How is this a regression? > To give this an update with more hindsight; this logic *was* redundant with the early return and it is safe to remove this complexity :white_check_mark: > > It seems like this actually has to do with completed vs incomplete deferreds... > > To explain how things previously worked *without* the early-return shortcut: > > With the normal case of **incomplete awaitable**, we store the `calling_context` and the `f` function is called and runs until it yields to the reactor. Because `f` follows the logcontext rules, it sets the `sentinel` logcontext. Then in `run_in_background(...)`, we restore the `calling_context`, store the current `ctx` (which is `sentinel`) and return. When the deferred completes, we restore `ctx` (which is `sentinel`) before yielding to the reactor again (all good :white_check_mark:) > > With the other case where we see a **completed awaitable**, we store the `calling_context` and the `f` function is called and runs to completion (no logcontext change). *This is where the shortcut would kick in but I'm going to continue explaining as if we commented out the shortcut.* -- Then in `run_in_background(...)`, we restore the `calling_context`, store the current `ctx` (which is same as the `calling_context`). Because the deferred is already completed, our extra callback is called immediately and we restore `ctx` (which is same as the `calling_context`). Since we never yield to the reactor, the `calling_context` is perfect as that's what we want again (all good :white_check_mark:) > > --- > > But this also means that our early-return shortcut is no longer just an optimization and is *necessary* to act correctly in the **completed awaitable** case as we want to return with the `calling_context` and not reset to the `sentinel` context. I've updated the comment in #18964 to explain the necessity as it's currently just described as an optimization. > > But because we made the same change to `run_coroutine_in_background(...)` which didn't have the same early-return shortcut, we regressed the correct behavior ❌ . This is being fixed in #18964 > > > *-- @MadLittleMods, #18900 (comment) ### How did we find this problem? Spawning from @wrjlewis [seeing](https://matrix.to/#/!SGNQGPGUwtcPBUotTL:matrix.org/$h3TxxPVlqC6BTL07dbrsz6PmaUoZxLiXnSTEY-QYDtA?via=jki.re&via=matrix.org&via=element.io) `Starting metrics collection 'typing.get_new_events' from sentinel context: metrics will be lost` in the logs: <details> <summary>More logs</summary> ``` synapse.http.request_metrics - 222 - ERROR - sentinel - Trying to stop RequestMetrics in the sentinel context. 2025-09-23 14:43:19,712 - synapse.util.metrics - 212 - WARNING - sentinel - Starting metrics collection 'typing.get_new_events' from sentinel context: metrics will be lost 2025-09-23 14:43:19,713 - synapse.rest.client.sync - 851 - INFO - sentinel - Client has disconnected; not serializing response. 2025-09-23 14:43:19,713 - synapse.http.server - 825 - WARNING - sentinel - Not sending response to request <XForwardedForRequest at 0x7f23e8111ed0 method='POST' uri='/_matrix/client/unstable/org.matrix.simplified_msc3575/sync?pos=281963%2Fs929324_147053_10_2652457_147960_2013_25554_4709564_0_164_2&timeout=30000' clientproto='HTTP/1.1' site='8008'>, already dis connected. 2025-09-23 14:43:19,713 - synapse.access.http.8008 - 515 - INFO - sentinel - 92.40.194.87 - 8008 - {@me:wi11.co.uk} Processed request: 30.005sec/-8.041sec (0.001sec, 0.000sec) (0.000sec/0.002sec/2) 0B 200! "POST /_matrix/client/unstable/org.matrix.simplified_msc3575/ ``` </details> From the logs there, we can see things relating to `typing.get_new_events` and `/_matrix/client/unstable/org.matrix.simplified_msc3575/sync` which led me to trying out Sliding Sync with the typing extension enabled and allowed me to reproduce the problem locally. Sliding Sync is a unique scenario as it's the only place we use `gather_optional_coroutines(...)` -> `run_coroutine_in_background(...)` (introduced in #17884) to exhibit this behavior. ### Testing strategy 1. Configure Synapse to enable [MSC4186](matrix-org/matrix-spec-proposals#4186): Simplified Sliding Sync which is actually under [MSC3575](matrix-org/matrix-spec-proposals#3575) ```yaml experimental_features: msc3575_enabled: true ``` 1. Start synapse: `poetry run synapse_homeserver --config-path homeserver.yaml` 1. Make a Sliding Sync request with one of the extensions enabled ```http POST http://localhost:8008/_matrix/client/unstable/org.matrix.simplified_msc3575/sync { "lists": {}, "room_subscriptions": { "!FlgJYGQKAIvAscfBhq:my.synapse.linux.server": { "required_state": [], "timeline_limit": 1 } }, "extensions": { "typing": { "enabled": true } } } ``` 1. Open your homeserver logs and notice warnings about `Starting ... from sentinel context: metrics will be lost`
1 parent 9c4ba13 commit acb9ec3

File tree

3 files changed

+181
-87
lines changed

3 files changed

+181
-87
lines changed

changelog.d/18964.misc

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1 @@
1+
Fix `run_coroutine_in_background(...)` incorrectly handling logcontext.

synapse/logging/context.py

Lines changed: 36 additions & 47 deletions
Original file line numberDiff line numberDiff line change
@@ -802,13 +802,15 @@ def run_in_background(
802802
deferred returned by the function completes.
803803
804804
To explain how the log contexts work here:
805-
- When `run_in_background` is called, the current context is stored ("original"),
806-
we kick off the background task in the current context, and we restore that
807-
original context before returning
808-
- When the background task finishes, we don't want to leak our context into the
809-
reactor which would erroneously get attached to the next operation picked up by
810-
the event loop. We add a callback to the deferred which will clear the logging
811-
context after it finishes and yields control back to the reactor.
805+
- When `run_in_background` is called, the calling logcontext is stored
806+
("original"), we kick off the background task in the current context, and we
807+
restore that original context before returning.
808+
- For a completed deferred, that's the end of the story.
809+
- For an incomplete deferred, when the background task finishes, we don't want to
810+
leak our context into the reactor which would erroneously get attached to the
811+
next operation picked up by the event loop. We add a callback to the deferred
812+
which will clear the logging context after it finishes and yields control back to
813+
the reactor.
812814
813815
Useful for wrapping functions that return a deferred or coroutine, which you don't
814816
yield or await on (for instance because you want to pass it to
@@ -857,22 +859,36 @@ def run_in_background(
857859

858860
# The deferred has already completed
859861
if d.called and not d.paused:
860-
# The function should have maintained the logcontext, so we can
861-
# optimise out the messing about
862+
# If the function messes with logcontexts, we can assume it follows the Synapse
863+
# logcontext rules (Rules for functions returning awaitables: "If the awaitable
864+
# is already complete, the function returns with the same logcontext it started
865+
# with."). If it function doesn't touch logcontexts at all, we can also assume
866+
# the logcontext is unchanged.
867+
#
868+
# Either way, the function should have maintained the calling logcontext, so we
869+
# can avoid messing with it further. Additionally, if the deferred has already
870+
# completed, then it would be a mistake to then add a deferred callback (below)
871+
# to reset the logcontext to the sentinel logcontext as that would run
872+
# immediately (remember our goal is to maintain the calling logcontext when we
873+
# return).
862874
return d
863875

864-
# The function may have reset the context before returning, so we need to restore it
865-
# now.
876+
# Since the function we called may follow the Synapse logcontext rules (Rules for
877+
# functions returning awaitables: "If the awaitable is incomplete, the function
878+
# clears the logcontext before returning"), the function may have reset the
879+
# logcontext before returning, so we need to restore the calling logcontext now
880+
# before we return ourselves.
866881
#
867882
# Our goal is to have the caller logcontext unchanged after firing off the
868883
# background task and returning.
869884
set_current_context(calling_context)
870885

871-
# The original logcontext will be restored when the deferred completes, but
872-
# there is nothing waiting for it, so it will get leaked into the reactor (which
873-
# would then get picked up by the next thing the reactor does). We therefore
874-
# need to reset the logcontext here (set the `sentinel` logcontext) before
875-
# yielding control back to the reactor.
886+
# If the function we called is playing nice and following the Synapse logcontext
887+
# rules, it will restore original calling logcontext when the deferred completes;
888+
# but there is nothing waiting for it, so it will get leaked into the reactor (which
889+
# would then get picked up by the next thing the reactor does). We therefore need to
890+
# reset the logcontext here (set the `sentinel` logcontext) before yielding control
891+
# back to the reactor.
876892
#
877893
# (If this feels asymmetric, consider it this way: we are
878894
# effectively forking a new thread of execution. We are
@@ -894,10 +910,9 @@ def run_coroutine_in_background(
894910
Useful for wrapping coroutines that you don't yield or await on (for
895911
instance because you want to pass it to deferred.gatherResults()).
896912
897-
This is a special case of `run_in_background` where we can accept a
898-
coroutine directly rather than a function. We can do this because coroutines
899-
do not run until called, and so calling an async function without awaiting
900-
cannot change the log contexts.
913+
This is a special case of `run_in_background` where we can accept a coroutine
914+
directly rather than a function. We can do this because coroutines do not continue
915+
running once they have yielded.
901916
902917
This is an ergonomic helper so we can do this:
903918
```python
@@ -908,33 +923,7 @@ def run_coroutine_in_background(
908923
run_in_background(lambda: func1(arg1))
909924
```
910925
"""
911-
calling_context = current_context()
912-
913-
# Wrap the coroutine in a deferred, which will have the side effect of executing the
914-
# coroutine in the background.
915-
d = defer.ensureDeferred(coroutine)
916-
917-
# The function may have reset the context before returning, so we need to restore it
918-
# now.
919-
#
920-
# Our goal is to have the caller logcontext unchanged after firing off the
921-
# background task and returning.
922-
set_current_context(calling_context)
923-
924-
# The original logcontext will be restored when the deferred completes, but
925-
# there is nothing waiting for it, so it will get leaked into the reactor (which
926-
# would then get picked up by the next thing the reactor does). We therefore
927-
# need to reset the logcontext here (set the `sentinel` logcontext) before
928-
# yielding control back to the reactor.
929-
#
930-
# (If this feels asymmetric, consider it this way: we are
931-
# effectively forking a new thread of execution. We are
932-
# probably currently within a ``with LoggingContext()`` block,
933-
# which is supposed to have a single entry and exit point. But
934-
# by spawning off another deferred, we are effectively
935-
# adding a new exit point.)
936-
d.addBoth(_set_context_cb, SENTINEL_CONTEXT)
937-
return d
926+
return run_in_background(lambda: coroutine)
938927

939928

940929
T = TypeVar("T")

tests/util/test_logcontext.py

Lines changed: 144 additions & 40 deletions
Original file line numberDiff line numberDiff line change
@@ -22,7 +22,6 @@
2222
import logging
2323
from typing import Callable, Generator, cast
2424

25-
import twisted.python.failure
2625
from twisted.internet import defer, reactor as _reactor
2726

2827
from synapse.logging.context import (
@@ -33,6 +32,7 @@
3332
current_context,
3433
make_deferred_yieldable,
3534
nested_logging_context,
35+
run_coroutine_in_background,
3636
run_in_background,
3737
)
3838
from synapse.types import ISynapseReactor
@@ -249,88 +249,192 @@ async def competing_callback() -> None:
249249
# Back to the sentinel context
250250
self._check_test_key("sentinel")
251251

252-
def _test_run_in_background(self, function: Callable[[], object]) -> defer.Deferred:
253-
sentinel_context = current_context()
252+
async def _test_run_in_background(self, function: Callable[[], object]) -> None:
253+
clock = Clock(reactor)
254+
255+
# Sanity check that we start in the sentinel context
256+
self._check_test_key("sentinel")
254257

255-
callback_completed = False
258+
callback_finished = False
256259

257260
with LoggingContext("foo"):
258-
# fire off function, but don't wait on it.
259-
d2 = run_in_background(function)
261+
# Fire off the function, but don't wait on it.
262+
deferred = run_in_background(function)
263+
self._check_test_key("foo")
260264

261-
def cb(res: object) -> object:
262-
nonlocal callback_completed
263-
callback_completed = True
264-
return res
265+
def callback(result: object) -> object:
266+
nonlocal callback_finished
267+
callback_finished = True
268+
# Pass through the result
269+
return result
265270

266-
d2.addCallback(cb)
271+
# We `addBoth` because when exceptions happen, we still want to mark the
272+
# callback as finished so that the test can complete and we see the
273+
# underlying error.
274+
deferred.addBoth(callback)
267275

268276
self._check_test_key("foo")
269277

270-
# now wait for the function under test to have run, and check that
271-
# the logcontext is left in a sane state.
272-
d2 = defer.Deferred()
273-
274-
def check_logcontext() -> None:
275-
if not callback_completed:
276-
reactor.callLater(0.01, check_logcontext)
277-
return
278+
# Now wait for the function under test to have run, and check that
279+
# the logcontext is left in a sane state.
280+
while not callback_finished:
281+
await clock.sleep(0)
282+
self._check_test_key("foo")
278283

279-
# make sure that the context was reset before it got thrown back
280-
# into the reactor
281-
try:
282-
self.assertIs(current_context(), sentinel_context)
283-
d2.callback(None)
284-
except BaseException:
285-
d2.errback(twisted.python.failure.Failure())
286-
287-
reactor.callLater(0.01, check_logcontext)
284+
self.assertTrue(
285+
callback_finished,
286+
"Callback never finished which means the test probably didn't wait long enough",
287+
)
288288

289-
# test is done once d2 finishes
290-
return d2
289+
# Back to the sentinel context
290+
self._check_test_key("sentinel")
291291

292292
@logcontext_clean
293-
def test_run_in_background_with_blocking_fn(self) -> defer.Deferred:
293+
async def test_run_in_background_with_blocking_fn(self) -> None:
294294
async def blocking_function() -> None:
295295
await Clock(reactor).sleep(0)
296296

297-
return self._test_run_in_background(blocking_function)
297+
await self._test_run_in_background(blocking_function)
298298

299299
@logcontext_clean
300-
def test_run_in_background_with_non_blocking_fn(self) -> defer.Deferred:
300+
async def test_run_in_background_with_non_blocking_fn(self) -> None:
301301
@defer.inlineCallbacks
302302
def nonblocking_function() -> Generator["defer.Deferred[object]", object, None]:
303303
with PreserveLoggingContext():
304304
yield defer.succeed(None)
305305

306-
return self._test_run_in_background(nonblocking_function)
306+
await self._test_run_in_background(nonblocking_function)
307307

308308
@logcontext_clean
309-
def test_run_in_background_with_chained_deferred(self) -> defer.Deferred:
309+
async def test_run_in_background_with_chained_deferred(self) -> None:
310310
# a function which returns a deferred which looks like it has been
311311
# called, but is actually paused
312312
def testfunc() -> defer.Deferred:
313313
return make_deferred_yieldable(_chained_deferred_function())
314314

315-
return self._test_run_in_background(testfunc)
315+
await self._test_run_in_background(testfunc)
316316

317317
@logcontext_clean
318-
def test_run_in_background_with_coroutine(self) -> defer.Deferred:
318+
async def test_run_in_background_with_coroutine(self) -> None:
319+
"""
320+
Test `run_in_background` with a coroutine that yields control back to the
321+
reactor.
322+
323+
This will stress the logic around incomplete deferreds in `run_in_background`.
324+
"""
325+
319326
async def testfunc() -> None:
320327
self._check_test_key("foo")
321328
d = defer.ensureDeferred(Clock(reactor).sleep(0))
322329
self.assertIs(current_context(), SENTINEL_CONTEXT)
323330
await d
324331
self._check_test_key("foo")
325332

326-
return self._test_run_in_background(testfunc)
333+
await self._test_run_in_background(testfunc)
327334

328335
@logcontext_clean
329-
def test_run_in_background_with_nonblocking_coroutine(self) -> defer.Deferred:
336+
async def test_run_in_background_with_nonblocking_coroutine(self) -> None:
337+
"""
338+
Test `run_in_background` with a "nonblocking" coroutine (never yields control
339+
back to the reactor).
340+
341+
This will stress the logic around completed deferreds in `run_in_background`.
342+
"""
343+
330344
async def testfunc() -> None:
331345
self._check_test_key("foo")
332346

333-
return self._test_run_in_background(testfunc)
347+
await self._test_run_in_background(testfunc)
348+
349+
@logcontext_clean
350+
async def test_run_coroutine_in_background(self) -> None:
351+
"""
352+
Test `run_coroutine_in_background` with a coroutine that yields control back to the
353+
reactor.
354+
355+
This will stress the logic around incomplete deferreds in `run_coroutine_in_background`.
356+
"""
357+
clock = Clock(reactor)
358+
359+
# Sanity check that we start in the sentinel context
360+
self._check_test_key("sentinel")
361+
362+
callback_finished = False
363+
364+
async def competing_callback() -> None:
365+
nonlocal callback_finished
366+
try:
367+
# The callback should have the same logcontext as the caller
368+
self._check_test_key("foo")
369+
370+
with LoggingContext("competing"):
371+
await clock.sleep(0)
372+
self._check_test_key("competing")
373+
374+
self._check_test_key("foo")
375+
finally:
376+
# When exceptions happen, we still want to mark the callback as finished
377+
# so that the test can complete and we see the underlying error.
378+
callback_finished = True
379+
380+
with LoggingContext("foo"):
381+
run_coroutine_in_background(competing_callback())
382+
self._check_test_key("foo")
383+
await clock.sleep(0)
384+
self._check_test_key("foo")
385+
386+
self.assertTrue(
387+
callback_finished,
388+
"Callback never finished which means the test probably didn't wait long enough",
389+
)
390+
391+
# Back to the sentinel context
392+
self._check_test_key("sentinel")
393+
394+
@logcontext_clean
395+
async def test_run_coroutine_in_background_with_nonblocking_coroutine(self) -> None:
396+
"""
397+
Test `run_coroutine_in_background` with a "nonblocking" coroutine (never yields control
398+
back to the reactor).
399+
400+
This will stress the logic around completed deferreds in `run_coroutine_in_background`.
401+
"""
402+
# Sanity check that we start in the sentinel context
403+
self._check_test_key("sentinel")
404+
405+
callback_finished = False
406+
407+
async def competing_callback() -> None:
408+
nonlocal callback_finished
409+
try:
410+
# The callback should have the same logcontext as the caller
411+
self._check_test_key("foo")
412+
413+
with LoggingContext("competing"):
414+
# We `await` here but there is nothing to wait for here since the
415+
# deferred is already complete so we should immediately continue
416+
# executing in the same context.
417+
await defer.succeed(None)
418+
419+
self._check_test_key("competing")
420+
421+
self._check_test_key("foo")
422+
finally:
423+
# When exceptions happen, we still want to mark the callback as finished
424+
# so that the test can complete and we see the underlying error.
425+
callback_finished = True
426+
427+
with LoggingContext("foo"):
428+
run_coroutine_in_background(competing_callback())
429+
self._check_test_key("foo")
430+
431+
self.assertTrue(
432+
callback_finished,
433+
"Callback never finished which means the test probably didn't wait long enough",
434+
)
435+
436+
# Back to the sentinel context
437+
self._check_test_key("sentinel")
334438

335439
@logcontext_clean
336440
@defer.inlineCallbacks

0 commit comments

Comments
 (0)