[Core] Shared memory based object store for Multimodal data caching and IPC #20452
Conversation
Summary of Changes
Hello @dongluw, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request introduces a significant architectural enhancement by implementing a shared memory-based object store for multimodal data caching and inter-process communication (IPC). This new mechanism allows for more efficient handling and transfer of large multimodal inputs, such as images, across different processes in a distributed VLLM setup. By leveraging shared memory, the system can avoid costly data serialization and deserialization, and memory copying, thereby improving performance and reducing latency for multimodal model inference. The change also provides configurability to switch between shared memory and LRU caching, with shared memory becoming the default.
Highlights
- Shared Memory Object Store: Implemented `SingleWriterShmRingBuffer` and `SingleWriterShmObjectStorage` to provide an efficient, shared-memory based key-value store. This enables one process to write data (e.g., preprocessed multimodal inputs) and multiple processes to read it without costly data copying, significantly improving IPC performance.
- Configurable Multimodal Cache: Introduced a new configuration option, `mm_preprocessor_cache_type`, within `MultiModalConfig`. This allows users to select between the new shared memory (SHM) cache and the existing LRU cache for multimodal preprocessor outputs, with SHM now being the default choice.
- Distributed Multimodal Data Handling: Integrated the shared memory object store into the distributed vLLM executor. The main process (P0) now writes preprocessed multimodal inputs to shared memory, and worker processes (P1+) retrieve these inputs directly from SHM, reducing inter-process communication overhead and latency for multimodal model inference.
- New Environment Variables: Added `VLLM_OBJECT_STORAGE_MAX_OBJECT_SIZE_MB`, `VLLM_OBJECT_STORAGE_SHM_BUFFER_SIZE_MB`, and `VLLM_OBJECT_STORAGE_SHM_BUFFER_NAME` to `vllm/envs.py`. These environment variables provide fine-grained control over the shared memory object store's maximum object size, total buffer size, and unique name (see the sizing sketch after this list).
- Comprehensive Testing: Included new test files (`test_shm_object_overhead.py`, `test_shm_buffer.py`, `test_shm_storage.py`) to thoroughly validate the functionality, robustness, and performance of the shared memory ring buffer and object storage implementations.
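As a rough illustration of how these knobs could be set before launching vLLM: the values below are arbitrary, and the real defaults and semantics live in `vllm/envs.py` as added by this PR.

```python
import os

# Illustrative values only; see vllm/envs.py in this PR for the actual defaults.
os.environ["VLLM_OBJECT_STORAGE_SHM_BUFFER_NAME"] = "vllm_mm_object_store"  # segment name (example value, assumed)
os.environ["VLLM_OBJECT_STORAGE_SHM_BUFFER_SIZE_MB"] = "512"  # total shared-memory ring-buffer size
os.environ["VLLM_OBJECT_STORAGE_MAX_OBJECT_SIZE_MB"] = "128"  # cap on a single stored object
```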
This pull request has merge conflicts that must be resolved before it can be merged.
Code Review
The pull request introduces a shared memory-based object store for caching multimodal data, which is a significant and valuable addition for improving performance in multi-process environments. Several critical and high-severity issues related to correctness, configuration defaults, and design choices were identified and must be addressed.
Can you merge from main to fix pre-commit?
The bc lint check complains that it's not able to see the three files newly added by this PR; not sure how to fix it. @DarkLight1337 any thoughts?
Let me just force-merge this
@DarkLight1337 this broke pre-commit on main
Working on a fix
Opened #24754
Essential Elements of an Effective PR Description Checklist
- (Optional) Documentation update, such as `supported_models.md` and `examples` for a new model

Purpose
Improves multimodal data IPC and caching (CLOSE #19702), which increases image token prefill efficiency, especially in high-cache-hit-rate, high-TP cases.
A benchmark on `CohereLabs/command-a-vision-07-2025` shows an 11.5% token prefill throughput improvement without image cache hits and a 70% improvement with 100% image cache hits; the gain can vary depending on the model, TP size, number of images, image size, and text length.

Major changes
1. Shared Memory Ring Buffer

Introduced a shared memory–based ring buffer supporting `allocate` and `access` operations, designed for single-writer (broadcast) scenarios.

2. Object Store on Ring Buffer

Built an object store on top of the shared memory ring buffer, supporting `put` and `get` operations (a simplified sketch follows).
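For readers unfamiliar with the pattern, here is a deliberately simplified sketch of a single-writer shared-memory object store built on Python's standard `multiprocessing.shared_memory`. The class and method names echo the description above, but this is not the PR's `SingleWriterShmObjectStorage`; details such as reader reference counting, wrap-around safety, and where the key index lives are omitted.

```python
from multiprocessing import shared_memory


class ShmObjectStore:
    """Toy single-writer object store: one process writes, others attach and read."""

    def __init__(self, name: str, size: int = 0, create: bool = False):
        if create:
            self.shm = shared_memory.SharedMemory(name=name, create=True, size=size)
        else:
            self.shm = shared_memory.SharedMemory(name=name)
        self.write_pos = 0  # only meaningful in the writer process
        self.index = {}     # key -> (offset, length); process-local in this toy version

    def put(self, key: str, payload: bytes) -> tuple:
        """Writer: append payload to the ring buffer and return its (offset, length) handle."""
        if self.write_pos + len(payload) > self.shm.size:
            self.write_pos = 0  # naive FIFO wrap-around; the real buffer must not clobber live readers
        offset = self.write_pos
        self.shm.buf[offset:offset + len(payload)] = payload
        self.write_pos += len(payload)
        self.index[key] = (offset, len(payload))
        return offset, len(payload)

    def get(self, handle: tuple) -> bytes:
        """Reader: copy the bytes out; a zero-copy consumer would keep the memoryview instead."""
        offset, length = handle
        return bytes(self.shm.buf[offset:offset + length])

    def close(self) -> None:
        self.shm.close()


# Writer process:
#   store = ShmObjectStore("vllm_mm_store", size=64 * 1024 * 1024, create=True)
#   handle = store.put("req-0/image-0", b"...serialized multimodal payload...")
# Reader process (attaches by the same name, uses the broadcast handle):
#   store = ShmObjectStore("vllm_mm_store")
#   data = store.get(handle)
```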
3. Improved IPC for TP > 1

`MultiModalKwargs` is now written to the object store before broadcasting in the engine process and retrieved by worker processes, improving IPC efficiency:

3.1 Reduces socket usage in MQ overflow cases by offloading mm data to the object store.
3.2 Replaces pickle with `MsgpackDecoder(MultiModalKwargs)` to reduce serialization overhead and memory copies.
4. Shared Memory Object Store as MM Cache

Added an option to use the shared memory object store instead of the LRU cache to improve multimodal cache performance (see the flow sketch after the sub-points below):
4.1 In TP > 1 setups, `MultiModalKwargs` is written once by p0 and read directly by worker processes, avoiding redundant transmission through the engine process.

4.2 IPC for `MultiModalKwargs` between p0 and the engine process now uses shared memory instead of sockets.

4.3 Reduces memory footprint by avoiding duplicate storage of `MultiModalKwargs` in p0 and the engine process.
5. Trade-offs
5.1 The object store is not a full replacement for MQ, as it requires the reader to know the address provided by the writer, so we still need MQ to broadcast the address.
5.2 The current ring buffer is FIFO-based, which may not be ideal compared to LRU in certain scenarios.
TODO
Test Plan
Correctness verification
mm_processor_cache_type="shm",
toload_command_a_vision
inexamples/offline_inference/vision_language_multi_image.py
python3 examples/offline_inference/vision_language_multi_image.py --model-type command_a_vision
Model generates sensible outputs.
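The loader body is not reproduced here; as a hedged sketch, the change amounts to adding the new cache type to the engine arguments the loader already builds (all fields other than `mm_processor_cache_type` are illustrative placeholders, not taken from the PR):

```python
from vllm.engine.arg_utils import EngineArgs

# Hypothetical fragment of load_command_a_vision: mm_processor_cache_type is the
# new piece from this PR; the other arguments are illustrative placeholders.
engine_args = EngineArgs(
    model="CohereLabs/command-a-vision-07-2025",
    tensor_parallel_size=4,             # the results below report TP4 for this model
    limit_mm_per_prompt={"image": 2},   # placeholder multi-image limit
    mm_processor_cache_type="shm",      # select the shared-memory cache instead of the LRU cache
)
```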
Online check:
- Start the server
- Add `print(f"Request {i} {outputs[i].generated_text}")` under the `calculate_metrics` method in `benchmarks/benchmark_serving.py`
- Run the benchmark script with 128 output tokens, twice (or use `vllm bench`)
- Check that the outputs of the two runs are (almost) identical
Efficiency verification
- Run the benchmark script with 1 output token, twice
- Compare the same metrics against the main branch; we should see a speedup, especially in the second run
Test Result
online correctness check
Tested `CohereLabs/command-a-vision-07-2025` (TP4) and `CohereLabs/aya-vision-8b` (TP1).

Outputs for both runs (with and without cache hits) are almost identical.
Example outputs
1st run
2nd run, high cache hit %
Benchmark results
Improves image token prefill throughput, especially in high-cache-hit-rate, high-TP cases.

For command-a-vision-07-2025 TP4:

Without image cache hits:
- Total token throughput: 581.34 -> 648.22 tok/s, 11.5% improvement
- Mean TTFT: 3898.98 -> 3491.15 ms, 10.5% less

With 100% image cache hits:
- Total token throughput: 2894.03 -> 4917.57 tok/s, 69.9% improvement
- Mean TTFT: 790.18 -> 470.60 ms, 40.5% less
For TP1, the shared memory cache is not optimal, since there is no IPC overhead to avoid.
`CohereLabs/command-a-vision-07-2025` TP4

Results with this PR
Results on main
(Optional) Documentation Update