Skip to content

Conversation

edolstra
Copy link
Member

@edolstra edolstra commented Sep 7, 2025

Motivation

In the multithreaded evaluator, it's possible for multiple threads to wait on the same thunk. If evaluation of the thunk results in an exception, the waiting threads shouldn't try to re-force the thunk. Instead, they should rethrow the same exception, without duplicating any work.

Therefore, there is now a new value type tFailed that stores an std::exception_ptr. If evaluation of a thunk/app results in an exception, forceValue() overwrites the value with a tFailed. If forceValue() encounters a tFailed, it rethrows the exception. So you normally never need to check for failed values (since forcing them causes a rethrow).

Context

Extracted from DeterminateSystems#125.


Add 👍 to pull requests you find important.

The Nix maintainer team uses a GitHub project board to schedule and track reviews.

@edolstra edolstra requested a review from roberth as a code owner September 7, 2025 09:52
@github-actions github-actions bot added the c api Nix as a C library with a stable interface label Sep 7, 2025
@github-actions github-actions bot added with-tests Issues related to testing. PRs with tests have some priority documentation labels Sep 7, 2025
@roberth roberth self-assigned this Sep 9, 2025
@roberth
Copy link
Member

roberth commented Sep 9, 2025

Things done:

  • Restore the tests to their original master expected output, which was better
  • Move cold path code to non-inline methods
  • Breaking change and resolution regarding recoverability
    • Add recoverability to C API for NixOps4 and potentially other consumers
    • Release note
  • Add test

@edolstra
Copy link
Member Author

edolstra commented Sep 9, 2025

I'm not so keen on adding retry-able thunks to this PR, since parallel eval was written under the assumption that thunks can only move to an evaluated state, never back to being a thunk. (See https://github.com/DeterminateSystems/nix-src/blob/3ccbe8465909bc37e222c8a2c2e047142252a083/src/libexpr/include/nix/expr/value.hh#L440-L457.)

Making thunks retry-able would require a rethink not just with respect to how values are safely updated, but also the desired semantics (e.g. if a bunch of threads are waiting for a thunk that fails with a retry-able error, should those threads immediately retry?).

@roberth
Copy link
Member

roberth commented Sep 9, 2025

The change away from the current retryable behavior is a real blocker for me. I've tested it and it sends NixOps4 into an infinite loop, as unfortunately expected.

Is that assumption really required for correct operation? i.e. what part of the synchronization breaks?

It might be ok for applications that need retrying to disable multithreading, although ideally of course we shouldn't have to work around it like that.

if a bunch of threads are waiting for a thunk that fails with a retry-able error, should those threads immediately retry?

Depends on the granularity of blocking. If it's fine grained, we don't have to worry about work duplication as much, and the remaining failing bit of work is cheap.
I think the trade-off is like this:

  • wait -> read(fail) -> retry is probably easier to reason about for the application, as an evaluation that overlaps in time will never produce a stale outcome (however, taking that into account with an extra retry seems easy enough)
  • wait -> read(fail) -> fail will have better performance, especially when the amount of work to reach the error is somewhat significant and it's unlikely that it will succeed right after (as is the case for NixOps4)

Another thing that can help with performance is to save the function bodies as thunks into the recoveryValue while unwinding the stack. We can still assume purity, so that would be ok to do, semantically, but it's also a form of incremental mutation. Just a bit different.

Anyway, either choice seems workable for now.

edolstra and others added 5 commits September 10, 2025 12:49
In the multithreaded evaluator, it's possible for multiple threads to
wait on the same thunk. If evaluation of the thunk results in an
exception, the waiting threads shouldn't try to re-force the thunk.
Instead, they should rethrow the same exception, without duplicating
any work.

Therefore, there is now a new value type `tFailed` that stores an
std::exception_ptr. If evaluation of a thunk/app results in an
exception, `forceValue()` overwrites the value with a `tFailed`. If
`forceValue()` encounters a `tFailed`, it rethrows the exception. So
you normally never need to check for failed values (since forcing them
causes a rethrow).
This uses `gc_cleanup` to call the exception_ptr destructor when the
`Failed` is collected.
I don't know exactly how bad it is to deallocate `std::exception_ptr`
without destruction, but I guess it ranges from small leak to UB and crash.
This provides an explicit API for call-fail-retry-succeed evaluation
flows, such as currently used in NixOps4.

An alternative design would simply reset the `Value` to the original
thunk instead of `tFailed` under the condition of catching a
`RecoverableEvalError`.
That is somewhat simpler, but I believe the presence of `tFailed` is
beneficial for possible use in the repl; being able to show the error
sooner, without re-evaluation.

The hasPos method and isEmpty function are required in order to avoid
an include loop.
Copy link
Member

@roberth roberth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have a few suggestions to improve the output and/or keep it more stable.
I shouldn't approve my own code (the test and the recoverability that's required for NixOps4), so I'd appreciate review by @edolstra and/or others.

case nThunk:
return WA("a", "thunk");
case nFailed:
return WA("a", "failure");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the term failed is useful and distinct internally, but normally, for users, we used to call them errors.

Suggested change
return WA("a", "failure");
return WA("an", "error");

break;

case nFailed:
doc.writeEmptyElement("failed");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's not break this code. The XML stuff is rather legacy anyway.

Suggested change
doc.writeEmptyElement("failed");
// Historically, a tried and then ignored value (e.g. through tryEval) was
// reverted to the original thunk.
doc.writeEmptyElement("unevaluated");

}
break;
case nFailed:
str << "«failed»";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
str << "«failed»";
// Historically, a tried and then ignored value (e.g. through tryEval) was
// reverted to the original thunk.
str << "<CODE>";


void printFailed(Value & v)
{
output << "«failed»";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure where this ends up.

tBool,
tNull,
tFloat,
tFailed,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@xokdvium does this look ok?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thought: this could be a thunk instead, pointing to a pseudo-Expr that throws the error.
I don't think that change is required, but it would free up this scarce InternalType entry.

Copy link
Member

@roberth roberth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This messes up the traces when there's reuse.

Reuse can occur when

  • evaluating the same binding in nix repl
  • referencing the same thing after builtins.tryEval
  • evaluating multiple attributes in the same application, with recovery, like nix flake show, Hydra

Example, nix repl --show-trace:

nix-repl> builtins.toJSON { foo = nixosConfigurations.feb.config.services.postgresql.dataDir; }

error:
       … while calling the 'toJSON' builtin
         at «string»:1:1:
            1| builtins.toJSON { foo = nixosConfigurations.feb.config.services.postgresql.dataDir; }
             | ^

[...]

       error: The option `nixos.configurations.feb.services.postgresql.dataDir' was accessed but has no value defined. Try setting the option.
       
nix-repl> nixosConfigurations.feb.config.services.postgresql.dataDir                            
error:
       … while calling the 'toJSON' builtin
         at «string»:1:1:
            1| builtins.toJSON { foo = nixosConfigurations.feb.config.services.postgresql.dataDir; }
             | ^

Note that the second nix repl prompt did not have the toJSON call. Instead the old trace was stored and mutated to add context that does not apply anymore.

We'll need the exception system to stop mutating traces. A singly linked immutable list of trace items seems most suitable.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
c api Nix as a C library with a stable interface documentation with-tests Issues related to testing. PRs with tests have some priority
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants