Conversation

bbartels
Contributor

@bbartels bbartels commented Sep 12, 2025

Purpose

Test Plan

Test Result


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

@bbartels bbartels requested a review from hmellor as a code owner September 12, 2025 15:09
@mergify mergify bot added documentation Improvements or additions to documentation ci/build labels Sep 12, 2025
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request refactors the installation of the nixl dependency by removing the custom source compilation script (install_nixl.sh) and adding nixl as a PyPI dependency in requirements/common.txt. This simplifies the build process. However, I have a critical concern regarding whether the nixl pip package can fully replace the complex installation logic from the removed script, which included handling dependencies like UCX and gdrcopy with hardware-specific optimizations. My review includes a detailed comment on this potential issue.
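
For context, the change under discussion boils down to a single requirements entry roughly like the following; the exact version pin is an assumption, with nixl>=0.5.1 being the constraint mentioned later in this review.

# requirements/common.txt (sketch; the exact pin in the PR may differ)
nixl >= 0.5.1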

@cjackal
Contributor

cjackal commented Sep 12, 2025

I think Gemini is right in the sense that install_nixl.sh installs libgdrapi, which is a library nixl needs in order to use gdrcopy. We may need to bring the gdrcopy install step back from install_nixl.sh.

@bbartels
Contributor Author

@cjackal Do you think it would suffice if we just install the prebuilt .debs? https://developer.download.nvidia.com/compute/redist/gdrcopy/CUDA%2012.8/ubuntu22_04/x64/
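
For context, pulling the prebuilt gdrcopy packages from that listing would look roughly like the sketch below; the exact package filename and version are assumptions and have to be taken from the directory listing itself.

# Hypothetical sketch: install the prebuilt libgdrapi runtime from the NVIDIA redist listing above.
# The filename/version is a placeholder; check the listing for the actual artifact.
GDRCOPY_REPO="https://developer.download.nvidia.com/compute/redist/gdrcopy/CUDA%2012.8/ubuntu22_04/x64"
wget "${GDRCOPY_REPO}/libgdrapi_<version>_amd64.Ubuntu22_04.deb"
apt-get install -y ./libgdrapi_*_amd64.Ubuntu22_04.deb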


@cjackal
Contributor

cjackal commented Sep 12, 2025

@cjackal Do you think it would suffice if we just install the prebuilt .debs? https://developer.download.nvidia.com/compute/redist/gdrcopy/CUDA%2012.8/ubuntu22_04/x64/

I hadn't tested it, but it was likely to work since gdrcopy is not very sensitive to release versions. Yes, it works: I have now tested the nixl P/D tutorial code.

But in terms of maintenance cost, installing a pre-compiled .deb package that depends on a specific CUDA version is not ideal, I think. Say vLLM HEAD uses CUDA runtime 12.9 by default and 12.6 as an optional variant in CI, but the pre-compiled gdrcopy packages in the NVIDIA compute repo have neither a CUDA 12.9 nor a CUDA 12.6 build. A viewpoint from a vLLM maintainer may be needed, cc @DarkLight1337

@DarkLight1337
Member

cc @youkaichao @simon-mo

@pytorch-bot pytorch-bot bot removed the ci/build label Sep 15, 2025
@mergify mergify bot added the ci/build label Sep 15, 2025

pytorch-bot bot commented Sep 15, 2025

No ciflow labels are configured for this repo.
For information on how to enable CIFlow bot see this wiki

@hmellor hmellor requested a review from NickLucche September 15, 2025 08:30
Collaborator

@NickLucche NickLucche left a comment


LGTM, only left a few minor comments.

@@ -105,6 +105,7 @@ RUN echo 'tzdata tzdata/Areas select America' | debconf-set-selections \
&& curl -sS ${GET_PIP_URL} | python${PYTHON_VERSION} \
&& python3 --version && python3 -m pip --version


Collaborator


nit: is it clearer with the extra lines?

@@ -191,11 +190,9 @@ For production deployments requiring strict SLA guarantees for time-to-first-tok

### Setup Steps

1. **Install KV Connector**: Install NIXL using the [installation script](gh-file:tools/install_nixl.sh)
Collaborator


do we mention anywhere else that one should install the kv_connectors.txt requirements?

Contributor Author


Nope, I'll add that :)

Contributor Author


Done :)
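
For reference, the documented step presumably amounts to installing the connector requirements on top of vLLM, roughly like this; the requirements/ path is an assumption based on the layout of common.txt mentioned above.

# Hypothetical sketch: pull in the optional KV-connector dependencies (nixl, lmcache, ...).
uv pip install -r requirements/kv_connectors.txt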

Member


Perhaps we can replace kv_connectors.txt entirely by including these as an extra in setup.py?

setup(
    ...
    extras_require={
        ...
        "kv-connector": ["lmcache", "nixl>=0.5.1"],
        ...
    },
)

Then you just install vLLM using uv pip install vllm[kv-connector] or uv pip install -e .[kv-connector]?

Contributor Author


Sure, can do. The only issue is that you still need to manually install libgdrapi.

Collaborator

@NickLucche NickLucche Sep 15, 2025


I am expecting this list to grow, but that is an option.
Also, we may need to split these into separate extras: e.g., it will be common to avoid nixl when using nccl, but still want lmcache as an additional connector for offloading.
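
Illustratively, with split extras a user could opt into only the connector they need; the extra names below are hypothetical.

# Hypothetical extras, assuming the split described above:
uv pip install "vllm[nixl]"      # NIXL-based P/D transfer only
uv pip install "vllm[lmcache]"   # lmcache offloading connector only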

Member


For now let's stick with the txt files, as that's what we have for everything else. In the future it would be nice to use the extras_require feature properly for all cases like this.


@mergify mergify bot added multi-modality Related to multi-modality (#4194) new-model Requests to new models performance Performance-related issues qwen Related to Qwen models speculative-decoding v1 tpu Related to Google TPUs labels Sep 15, 2025
@hmellor
Member

hmellor commented Sep 15, 2025

Looks like the git history is broken and everyone got pinged. @bbartels could you please make a new PR?

@hmellor hmellor closed this Sep 15, 2025
@bbartels
Contributor Author

Yep, sorry, I tried to do the sign-off and royally failed.
