Introduce a "failed" value type #13930

edolstra · 2025-09-07T09:52:18Z

Motivation

In the multithreaded evaluator, it's possible for multiple threads to wait on the same thunk. If evaluation of the thunk results in an exception, the waiting threads shouldn't try to re-force the thunk. Instead, they should rethrow the same exception, without duplicating any work.

Therefore, there is now a new value type tFailed that stores an std::exception_ptr. If evaluation of a thunk/app results in an exception, forceValue() overwrites the value with a tFailed. If forceValue() encounters a tFailed, it rethrows the exception. So you normally never need to check for failed values (since forcing them causes a rethrow).

Context

Extracted from DeterminateSystems#125.

Add 👍 to pull requests you find important.

The Nix maintainer team uses a GitHub project board to schedule and track reviews.

roberth · 2025-09-09T12:26:29Z

Things done:

Restore the tests to their original master expected output, which was better
Move cold path code to non-inline methods
Breaking change and resolution regarding recoverability
- Add recoverability to C API for NixOps4 and potentially other consumers
- Release note
Add test

edolstra · 2025-09-09T12:40:32Z

I'm not so keen on adding retry-able thunks to this PR, since parallel eval was written under the assumption that thunks can only move to an evaluated state, never back to being a thunk. (See https://github.com/DeterminateSystems/nix-src/blob/3ccbe8465909bc37e222c8a2c2e047142252a083/src/libexpr/include/nix/expr/value.hh#L440-L457.)

Making thunks retry-able would require a rethink not just with respect to how values are safely updated, but also the desired semantics (e.g. if a bunch of threads are waiting for a thunk that fails with a retry-able error, should those threads immediately retry?).

roberth · 2025-09-09T13:44:35Z

The change away from the current retryable behavior is a real blocker for me. I've tested it and it sends NixOps4 into an infinite loop, as unfortunately expected.

Is that assumption really required for correct operation? i.e. what part of the synchronization breaks?

It might be ok for applications that need retrying to disable multithreading, although ideally of course we shouldn't have to work around it like that.

if a bunch of threads are waiting for a thunk that fails with a retry-able error, should those threads immediately retry?

Depends on the granularity of blocking. If it's fine grained, we don't have to worry about work duplication as much, and the remaining failing bit of work is cheap.
I think the trade-off is like this:

wait -> read(fail) -> retry is probably easier to reason about for the application, as an evaluation that overlaps in time will never produce a stale outcome (however, taking that into account with an extra retry seems easy enough)
wait -> read(fail) -> fail will have better performance, especially when the amount of work to reach the error is somewhat significant and it's unlikely that it will succeed right after (as is the case for NixOps4)

Another thing that can help with performance is to save the function bodies as thunks into the recoveryValue while unwinding the stack. We can still assume purity, so that would be ok to do, semantically, but it's also a form of incremental mutation. Just a bit different.

Anyway, either choice seems workable for now.

In the multithreaded evaluator, it's possible for multiple threads to wait on the same thunk. If evaluation of the thunk results in an exception, the waiting threads shouldn't try to re-force the thunk. Instead, they should rethrow the same exception, without duplicating any work. Therefore, there is now a new value type `tFailed` that stores an std::exception_ptr. If evaluation of a thunk/app results in an exception, `forceValue()` overwrites the value with a `tFailed`. If `forceValue()` encounters a `tFailed`, it rethrows the exception. So you normally never need to check for failed values (since forcing them causes a rethrow).

This uses `gc_cleanup` to call the exception_ptr destructor when the `Failed` is collected. I don't know exactly how bad it is to deallocate `std::exception_ptr` without destruction, but I guess it ranges from small leak to UB and crash.

This provides an explicit API for call-fail-retry-succeed evaluation flows, such as currently used in NixOps4. An alternative design would simply reset the `Value` to the original thunk instead of `tFailed` under the condition of catching a `RecoverableEvalError`. That is somewhat simpler, but I believe the presence of `tFailed` is beneficial for possible use in the repl; being able to show the error sooner, without re-evaluation. The hasPos method and isEmpty function are required in order to avoid an include loop.

roberth

Have a few suggestions to improve the output and/or keep it more stable.
I shouldn't approve my own code (the test and the recoverability that's required for NixOps4), so I'd appreciate review by @edolstra and/or others.

roberth · 2025-09-10T10:15:59Z

src/libexpr/eval.cc

    case nThunk:
        return WA("a", "thunk");
+    case nFailed:
+        return WA("a", "failure");


I think the term failed is useful and distinct internally, but normally, for users, we used to call them errors.

Suggested change

return WA("a", "failure");

return WA("an", "error");

roberth · 2025-09-10T11:12:25Z

src/libexpr/value-to-xml.cc

+        break;
+
+    case nFailed:
+        doc.writeEmptyElement("failed");


Let's not break this code. The XML stuff is rather legacy anyway.

Suggested change

doc.writeEmptyElement("failed");

// Historically, a tried and then ignored value (e.g. through tryEval) was

// reverted to the original thunk.

doc.writeEmptyElement("unevaluated");

roberth · 2025-09-10T11:14:03Z

src/libexpr/print-ambiguous.cc

        }
        break;
+    case nFailed:
+        str << "«failed»";


Suggested change

str << "«failed»";

// Historically, a tried and then ignored value (e.g. through tryEval) was

// reverted to the original thunk.

str << "<CODE>";

roberth · 2025-09-10T11:16:42Z

src/libexpr/print.cc


+    void printFailed(Value & v)
+    {
+        output << "«failed»";


Not sure where this ends up.

roberth · 2025-09-10T11:20:54Z

src/libexpr/include/nix/expr/value.hh

    tBool,
    tNull,
    tFloat,
+    tFailed,


@xokdvium does this look ok?

Thought: this could be a thunk instead, pointing to a pseudo-Expr that throws the error.
I don't think that change is required, but it would free up this scarce InternalType entry.

roberth

This messes up the traces when there's reuse.

Reuse can occur when

evaluating the same binding in nix repl
referencing the same thing after builtins.tryEval
evaluating multiple attributes in the same application, with recovery, like nix flake show, Hydra

Example, nix repl --show-trace:

nix-repl> builtins.toJSON { foo = nixosConfigurations.feb.config.services.postgresql.dataDir; }

error:
       … while calling the 'toJSON' builtin
         at «string»:1:1:
            1| builtins.toJSON { foo = nixosConfigurations.feb.config.services.postgresql.dataDir; }
             | ^

[...]

       error: The option `nixos.configurations.feb.services.postgresql.dataDir' was accessed but has no value defined. Try setting the option.
       
nix-repl> nixosConfigurations.feb.config.services.postgresql.dataDir                            
error:
       … while calling the 'toJSON' builtin
         at «string»:1:1:
            1| builtins.toJSON { foo = nixosConfigurations.feb.config.services.postgresql.dataDir; }
             | ^

Note that the second nix repl prompt did not have the toJSON call. Instead the old trace was stored and mutated to add context that does not apply anymore.

We'll need the exception system to stop mutating traces. A singly linked immutable list of trace items seems most suitable.

edolstra requested a review from roberth as a code owner September 7, 2025 09:52

github-actions bot added the c api Nix as a C library with a stable interface label Sep 7, 2025

edolstra force-pushed the failed-values branch from 48d268b to 6155a84 Compare September 7, 2025 10:21

github-actions bot added with-tests Issues related to testing. PRs with tests have some priority documentation labels Sep 7, 2025

roberth self-assigned this Sep 9, 2025

roberth force-pushed the failed-values branch from f635170 to 51cee6d Compare September 9, 2025 11:48

roberth force-pushed the failed-values branch from 51cee6d to 1ba3ad7 Compare September 9, 2025 17:26

edolstra and others added 5 commits September 10, 2025 12:49

Add tests/f/lang/eval-okay-tryeval-failed-thunk-reeval

6de4db1

libexpr: Call destructor on exception_ptr

9aa3cc9

This uses `gc_cleanup` to call the exception_ptr destructor when the `Failed` is collected. I don't know exactly how bad it is to deallocate `std::exception_ptr` without destruction, but I guess it ranges from small leak to UB and crash.

Add rl-next/c-api-recoverable-errors

110a9ef

roberth force-pushed the failed-values branch from 1ba3ad7 to 110a9ef Compare September 10, 2025 11:09

roberth reviewed Sep 10, 2025

View reviewed changes

roberth requested changes Sep 10, 2025

View reviewed changes

roberth removed their assignment Sep 10, 2025

This was referenced Sep 18, 2025

Revert "Merge pull request #13938 from NixOS/import-thunk" #14013

Merged

Ensure that files are parsed/evaluated only once (2nd attempt) #14040

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

Introduce a "failed" value type #13930

Introduce a "failed" value type #13930

Uh oh!

edolstra commented Sep 7, 2025

Uh oh!

roberth commented Sep 9, 2025 •

edited

Loading

Uh oh!

edolstra commented Sep 9, 2025

Uh oh!

roberth commented Sep 9, 2025

Uh oh!

roberth left a comment

Uh oh!

roberth Sep 10, 2025

Uh oh!

roberth Sep 10, 2025

Uh oh!

roberth Sep 10, 2025

Uh oh!

roberth Sep 10, 2025

Uh oh!

roberth Sep 10, 2025

Uh oh!

roberth Sep 10, 2025

Uh oh!

roberth left a comment

Uh oh!

Uh oh!

-        doc.writeEmptyElement("failed");
+        // Historically, a tried and then ignored value (e.g. through tryEval) was
+        // reverted to the original thunk.
+        doc.writeEmptyElement("unevaluated");

Uh oh!

Introduce a "failed" value type #13930

Are you sure you want to change the base?

Introduce a "failed" value type #13930

Uh oh!

Conversation

edolstra commented Sep 7, 2025

Motivation

Context

Uh oh!

roberth commented Sep 9, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

edolstra commented Sep 9, 2025

Uh oh!

roberth commented Sep 9, 2025

Uh oh!

roberth left a comment

Choose a reason for hiding this comment

Uh oh!

roberth Sep 10, 2025

Choose a reason for hiding this comment

Uh oh!

roberth Sep 10, 2025

Choose a reason for hiding this comment

Uh oh!

roberth Sep 10, 2025

Choose a reason for hiding this comment

Uh oh!

roberth Sep 10, 2025

Choose a reason for hiding this comment

Uh oh!

roberth Sep 10, 2025

Choose a reason for hiding this comment

Uh oh!

roberth Sep 10, 2025

Choose a reason for hiding this comment

Uh oh!

roberth left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

roberth commented Sep 9, 2025 •

edited

Loading