Skip to content

Conversation

MichaReiser
Copy link
Member

@MichaReiser MichaReiser commented Aug 14, 2025

Summary

The ProjectFilesWalker discovers all the files to check in a ty project by spawning multiple threads that walk the directory tree.
The walker's output is a Vec or Set of all the Files. The way this works today is that we first collect all file paths (multi threaded)
into a single Vec, and the main thread then interns the paths to their File's.

The problem with this approach is that system_path_to_file is fairly expensive when called for many files because it does a stat call internally.

The ideal solution would move the system_path_to_file call to the walker threads. That would also eliminate the shared Vec that collects all paths. Unfortunately, this isn't possible because &dyn Db isn't Send nor does &dyn Db implement Clone (Salsa does have a fork_db feature but this isn't publicly exposed as of today). We also can't add a fork method to ty_project::Db that returns a cloned Self because the Db trait must be dyn compatible.

The solution I've taken here is to move the stat files into the walker threads, which fixes most of the waterfall effect and is much simpler than exposing fork_db properly in salsa.

I implemented the ideal solution thanks to @ibraheemdev's suggestion to add a dyn_clone!

Fixes astral-sh/ty#970

Test Plan

Before
Screenshot 2025-08-14 at 14 04 46

After

Screenshot 2025-08-14 at 14 04 59

@MichaReiser MichaReiser added performance Potential performance improvement ty Multi-file analysis & type inference labels Aug 14, 2025
Copy link
Contributor

github-actions bot commented Aug 14, 2025

Diagnostic diff on typing conformance tests

Changes were detected when running ty on typing conformance tests
--- old-output.txt	2025-08-14 18:13:06.319161224 +0000
+++ new-output.txt	2025-08-14 18:13:06.387161856 +0000
@@ -1,5 +1,5 @@
 WARN ty is pre-release software and not ready for production use. Expect to encounter bugs, missing features, and fatal errors.
-fatal[panic] Panicked at /home/runner/.cargo/git/checkouts/salsa-e6f3bb7c2a062968/918d35d/src/function/execute.rs:215:25 when checking `/home/runner/work/ruff/ruff/typing/conformance/tests/aliases_typealiastype.py`: `infer_definition_types(Id(409cd)): execute: too many cycle iterations`
+fatal[panic] Panicked at /home/runner/.cargo/git/checkouts/salsa-e6f3bb7c2a062968/918d35d/src/function/execute.rs:215:25 when checking `/home/runner/work/ruff/ruff/typing/conformance/tests/aliases_typealiastype.py`: `infer_definition_types(Id(ac97)): execute: too many cycle iterations`
 _directives_deprecated_library.py:15:31: error[invalid-return-type] Function always implicitly returns `None`, which is not assignable to return type `int`
 _directives_deprecated_library.py:30:26: error[invalid-return-type] Function always implicitly returns `None`, which is not assignable to return type `str`
 _directives_deprecated_library.py:36:41: error[invalid-return-type] Function always implicitly returns `None`, which is not assignable to return type `Self@__add__`

Copy link
Contributor

github-actions bot commented Aug 14, 2025

mypy_primer results

No ecosystem changes detected ✅
No memory usage changes detected ✅

@MichaReiser MichaReiser marked this pull request as ready for review August 14, 2025 12:05
@MichaReiser MichaReiser requested review from BurntSushi and ibraheemdev and removed request for dcreager, carljm and sharkdp August 14, 2025 12:05
@MichaReiser MichaReiser changed the title [ty] Fix waterfall when discovering project files [ty] Speedup project file discovery Aug 14, 2025
Copy link
Member

@BurntSushi BurntSushi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! This makes sense to me.

///
/// This function differs from [`system_path_to_file`] in that it doesn't add a dependency
/// on `File.status` and always returns a `File`, even
/// if `path` doesn't point to a file (e.g. if `path` is a directory).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How come? If metadata lets you cheaply query the file type, can you still return a Result here?

I mention this because it seems like we are otherwise careful not to allow a File to be backed by a directory.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah we could. I had a versio where I did just that.

I considered it to be fine because Files::try_system also doesn't return a Result but the file directly (even if the path maps to a directory). This where I consider the distinction between system_path_to_file which makes sure it adds a query dependency on status (so that your query re-runs if the status changes) and protects you from getting a File that points to a directory or doesn't exist.

Either way, I'm going to revert all of this thanks to @ibraheemdev's suggestion

paths.push(entry.into_path());
// If this returns `Err`, then the file was deleted between now and when the walk callback was called.
// We can ignore this.
if let Ok(metadata) = system.path_metadata(entry.path()) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was wondering why you can't use ignore::DirEntry::file_type here, and I think it's because you also need the file permissions? Because directory entries (on Unix) will include the FileType free of charge. But not permissions.

Copy link
Member Author

@MichaReiser MichaReiser Aug 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I thought about using the walk dir entry directly, but we, unfortunately, also need the permissions and, more importantly, the last modified time

Copy link
Member

@ibraheemdev ibraheemdev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We also can't add a fork method to ty_project::Db that returns a cloned Self because the Db trait must be dyn compatible.

The usual workaround for this is to have a dyn_clone method that returns a Box<dyn Db>, but the solution here also seems perfectly fine.

@MichaReiser
Copy link
Member Author

The usual workaround for this is to have a dyn_clone method that returns a Box, but the solution here also seems perfectly fine.

Oh nice, that's exactly what I was looking for.

@MichaReiser MichaReiser merged commit ce938fe into main Aug 14, 2025
38 checks passed
@MichaReiser MichaReiser deleted the micha/fix-walk-files-waterfall branch August 14, 2025 18:38
dcreager added a commit that referenced this pull request Aug 14, 2025
* main:
  [ty] Add diagnostics for invalid `await` expressions (#19711)
  [ty] Synthesize read-only properties for all declared members on `NamedTuple` classes (#19899)
  [ty] Remove use of `ClassBase::try_from_type` from `super()` machinery (#19902)
  [ty] Speedup project file discovery  (#19913)
  [`pyflakes`] Add secondary annotation showing previous definition (`F811`) (#19900)
  Bump 0.12.9 (#19917)
  [ty] support `kw_only=True` for `dataclass()` and `field()` (#19677)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance Potential performance improvement ty Multi-file analysis & type inference
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Waterfall effect in ProjectFilesWalker::collect_vec
3 participants