-
-
Notifications
You must be signed in to change notification settings - Fork 10.3k
[CI] Speed up model unit tests in CI #24253
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Signed-off-by: Andrew Feldman <[email protected]>
Signed-off-by: Andrew Feldman <[email protected]>
Signed-off-by: Andrew Feldman <[email protected]>
Signed-off-by: Andrew Feldman <[email protected]>
Signed-off-by: Andrew Feldman <[email protected]>
Signed-off-by: Andrew Feldman <[email protected]>
Signed-off-by: Andrew Feldman <[email protected]>
This pull request has merge conflicts that must be resolved before it can be |
Signed-off-by: Andrew Feldman <[email protected]>
Signed-off-by: Andrew Feldman <[email protected]>
Signed-off-by: Andrew Feldman <[email protected]>
Signed-off-by: Andrew Feldman <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the intention to still run these tests on each PR (unlike the Extended tests which are nightly only) but in parallel with the faster model tests?
Yes, that is correct @DarkLight1337 |
Signed-off-by: Andrew Feldman <[email protected]>
More specifically @DarkLight1337 the goal is to speed up the regressions that run for each PR in the typical case, by (1) recategorizing certain slow model tests so that they only run in under specific atypical conditions (i.e. modifications to model code), and (2) sharding model tests to exploit parallelism at the granularity of CI test scheduling Also realize I forgot to mention, the PR is addressing this issue |
Signed-off-by: Andrew Feldman <[email protected]>
Signed-off-by: Andrew Feldman <[email protected]>
Signed-off-by: Andrew Feldman <[email protected]>
I worry about aggressively sharding fast tests due to the fixed cost we pay to start an instance. Last time I checked it was 3-5min init. Has that time gone down? |
Signed-off-by: Andrew Feldman <[email protected]>
Signed-off-by: Andrew Feldman <[email protected]>
Signed-off-by: Andrew Feldman <[email protected]>
Per our DMs, I will limit the number of parallelized tests and lower the parallelism to 2x. At least until init time is reduced. |
Signed-off-by: Andrew Feldman <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @afeldman-nm
Note that the PR to fix the issue is WIP, see #24750 |
Signed-off-by: Andrew Feldman <[email protected]>
Signed-off-by: Andrew Feldman <[email protected]>
Signed-off-by: Andrew Feldman <[email protected]>
Signed-off-by: Andrew Feldman <[email protected]> Signed-off-by: bbartels <[email protected]>
Signed-off-by: Andrew Feldman <[email protected]> Signed-off-by: bruceszchen <[email protected]>
Signed-off-by: Andrew Feldman <[email protected]> Signed-off-by: bruceszchen <[email protected]>
Purpose
Addresses #23670 (not sure as of yet if there will be additional PRs to fix this issue)
This PR speeds up the regression tests which run for each PR (as opposed to for example the nightly tests, which are not impacted by this PR)
No impact on
buildkite/fastcheck/pr
unit testsImpact on
buildkite/ci/pr
unit tests: lower the typical execution time of the following regression test categories in CI-Basic Models
main
Language Models [standard] - these tests are grouped together by the
core_model
pytest markslow_test
pytest markcore_model
tests asslow_test
core_model
andslow_test
pytest marks), which only runs in the special case where model source is modified. Shard these extra tests 4x to lower test timeLanguage models [hybrid]
Trigger conditions for unit tests
vllm/
modification -> Basic Models Tests, Language Models [standard] Tests, Language Models [hybrid] TestsAny vLLM model source code modification -> Extra Model Initialization Tests, Language Modes [standard][extra] (in addition to all of the tests which run when
vllm/
is modified)Test Plan
buildkite/ci/pr
tests are enabled on this PR with/ready
, we will see that the new unit tests are triggered as expectedTest Result
For test runtimes, see summary about and details here https://buildkite.com/vllm/fastcheck/builds/41168#01993638-97b7-47b5-bdaf-2c0f70e0c16c
TODO: validate that tests are triggered as expected
Note: there appear to be two regression tests failing, which I am pretty sure are not caused by this PR: