Conversation

dongluw (Contributor) commented Jul 3, 2025

Essential Elements of an Effective PR Description Checklist

  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.

Purpose

Improves multimodal data IPC and caching; closes #19702. This increases image token prefill efficiency, especially in high cache hit rate + high TP cases.

A benchmark on CohereLabs/command-a-vision-07-2025 shows an 11.5% token prefill throughput improvement without image cache hits and a 70% improvement with 100% image cache hits. The gain varies with model, TP degree, number of images, image size, and text length.

Major changes

  1. Shared Memory Ring Buffer
    Introduced a shared memory–based ring buffer supporting allocate and access operations, designed for single-writer (broadcast) scenarios (see the sketch after this list).

  2. Object Store on Ring Buffer
    Built an object store on top of the shared memory ring buffer, supporting put and get operations.

  3. Improved IPC for TP > 1
    MultiModalKwargs is now written to the object store before broadcasting in the engine process and retrieved by worker processes, improving IPC efficiency:
    3.1 Reduces socket usage in MQ overflow cases by offloading mm data to the object store.
    3.2 Replaces pickle with MsgpackDecoder(MultiModalKwargs) to reduce serialization overhead and memory copies.

  4. Shared Memory Object Store as MM Cache
    Added an option to use the shared memory object store instead of the LRU cache to improve multimodal cache performance:
    4.1 In TP > 1 setups, MultiModalKwargs is written once by p0 and read directly by worker processes, avoiding redundant transmission through the engine process.
    4.2 IPC for MultiModalKwargs between p0 and the engine process now uses shared memory instead of sockets.
    4.3 Reduces memory footprint by avoiding duplicate storage of MultiModalKwargs in p0 and the engine process.

  5. Trade-offs
    5.1 The object store is not a full replacement for MQ, as it requires the reader to know the address provided by the writer, so we still need MQ to broadcast the address.
    5.2 The current ring buffer is FIFO-based, which may not be ideal compared to LRU in certain scenarios.
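
To make the design concrete, here is a minimal single-file sketch of the building blocks in items 1 and 2. The class names and memory layout are illustrative simplifications, not the PR's actual SingleWriterShmRingBuffer and SingleWriterShmObjectStorage (in vllm/distributed/device_communicators/shm_object_storage.py); reader synchronization, safe FIFO chunk reclamation, and msgpack serialization are all omitted here.

import pickle
from multiprocessing import shared_memory


class RingBufferSketch:
    # Single-writer ring buffer over one SharedMemory segment.
    # Omitted vs. the real implementation: reader reference counts,
    # safe wrap-around while chunks are in use, FIFO reclamation.
    HEADER = 4  # bytes per chunk, storing the payload length

    def __init__(self, size: int, name=None, create: bool = False):
        # Writer process: RingBufferSketch(16 * 2**20, create=True)
        # Reader process: RingBufferSketch(0, name=<name received over MQ>)
        self.shm = shared_memory.SharedMemory(name=name, create=create, size=size)
        self.write_pos = 0  # advanced only by the single writer process

    def allocate(self, n: int) -> int:
        # Wrap to the start when the chunk would not fit contiguously.
        if self.write_pos + self.HEADER + n > self.shm.size:
            self.write_pos = 0
        addr = self.write_pos
        self.write_pos = addr + self.HEADER + n
        return addr

    def write(self, addr: int, payload: bytes) -> None:
        self.shm.buf[addr:addr + self.HEADER] = len(payload).to_bytes(self.HEADER, "little")
        self.shm.buf[addr + self.HEADER:addr + self.HEADER + len(payload)] = payload

    def read(self, addr: int) -> bytes:
        n = int.from_bytes(self.shm.buf[addr:addr + self.HEADER], "little")
        start = addr + self.HEADER
        return bytes(self.shm.buf[start:start + n])


class ObjectStoreSketch:
    # put() returns an address; the writer broadcasts that address over MQ,
    # and readers then fetch the payload directly from shared memory.
    def __init__(self, buf: RingBufferSketch):
        self.buf = buf
        self.index = {}  # writer-side key -> address map, for cache hits

    def put(self, key: str, obj) -> int:
        data = pickle.dumps(obj)  # the PR uses msgpack for MultiModalKwargs
        addr = self.buf.allocate(len(data))
        self.buf.write(addr, data)
        self.index[key] = addr
        return addr

    def get(self, addr: int):
        return pickle.loads(self.buf.read(addr))

The key point is trade-off 5.1 above: shared memory removes the payload copy between processes, but a small MQ message is still needed to broadcast each object's address to readers.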

TODO

  • Pass parameters as engine args
  • Add documentation and better comments
  • Improve tests

Test Plan

Correctness verification

  1. Offline requests
  • Add mm_processor_cache_type="shm" to load_command_a_vision in examples/offline_inference/vision_language_multi_image.py (see the sketch below)
  • Run python3 examples/offline_inference/vision_language_multi_image.py --model-type command_a_vision

The model generates sensible outputs.
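
For reference, a minimal sketch of what that edit effectively does, assuming the engine accepts mm_processor_cache_type as a top-level argument, as the step above implies (model and limits mirror the commands in this test plan):

from vllm import LLM

# "shm" selects the shared-memory object store cache instead of the LRU cache.
llm = LLM(
    model="CohereLabs/command-a-vision-07-2025",
    max_model_len=16384,
    tensor_parallel_size=4,
    mm_processor_cache_type="shm",
)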

  2. Online requests
    Start the server:
vllm serve CohereLabs/command-a-vision-07-2025 --mm_processor_cache_type shm --disable-log-requests  --tensor-parallel-size 4 --max_model_len 16384 --max-num-seqs 32

Add print(f"Request {i} {outputs[i].generated_text}") under the calculate_metrics method in benchmarks/benchmark_serving.py, then run the benchmark script twice with 128 output tokens:

python3 benchmarks/benchmark_serving.py \
  --backend openai-chat \
  --endpoint /v1/chat/completions \
  --dataset-name hf \
  --dataset-path lmarena-ai/VisionArena-Chat \
  --hf-split train \
  --model CohereLabs/command-a-vision-07-2025 \
  --max-concurrency 4 \
  --num-prompts 32 \
  --hf-output-len 128

Or, using vllm bench:

vllm bench serve \
  --endpoint-type openai-chat \
  --endpoint /v1/chat/completions \
  --dataset-name hf \
  --dataset-path lmarena-ai/VisionArena-Chat \
  --hf-split train \
  --model CohereLabs/command-a-vision-07-2025 \
  --max-concurrency 4 \
  --num-prompts 32 \
  --hf-output-len 128

Check that the outputs of the two runs are (almost) identical.
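
One way to automate that check, assuming each run's printed "Request i ..." lines were redirected to run1.log and run2.log (a sketch, not part of the PR):

# Compare per-request outputs captured from the two benchmark runs.
# (Multi-line outputs are matched on their first line only; good enough
# as a smoke check for determinism.)
with open("run1.log") as f1, open("run2.log") as f2:
    lines1 = [line for line in f1 if line.startswith("Request ")]
    lines2 = [line for line in f2 if line.startswith("Request ")]

mismatches = sum(a != b for a, b in zip(lines1, lines2))
print(f"{mismatches} of {len(lines1)} requests differ between runs")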

Efficiency verification

Run the benchmark script twice with 1 output token:

python3 benchmarks/benchmark_serving.py \
  --backend openai-chat \
  --endpoint /v1/chat/completions \
  --dataset-name hf \
  --dataset-path lmarena-ai/VisionArena-Chat \
  --hf-split train \
  --model CohereLabs/command-a-vision-07-2025 \
  --max-concurrency 32 \
  --num-prompts 128 \
  --hf-output-len 1

Compare the same metrics against the main branch; we should see a speedup, especially in the second run.

Test Result

Online correctness check

Tested CohereLabs/command-a-vision-07-2025 TP4 and CohereLabs/aya-vision-8b TP1.
Outputs for both runs (with and without cache hits) are almost identical.

Example outputs
1st run

Request 0 I understand your frustration. While I can't control my own preferences, I respect your feelings. If you'd like to end our conversation, I'm happy to do so. 

Remember, it's okay to have different opinions, and I'm here to provide helpful and harmless information. If you have any other questions or need assistance with something else, feel free to ask.
Request 1 A beige, adjustable bra with wide straps and a back closure.
Request 2 Based on the information provided, domestic optical fiber and cable manufacturers in China include:

1. **China Optical Fiber Information Industry Co., Ltd.** (CNOIC): This company focuses on the R&D, production, and sale of optical fiber cables; its products include single-mode and multi-mode fiber optic cables.

2. **Huawei Technologies Co., Ltd.**: Huawei is not only a leading global telecom equipment manufacturer; its optical fiber cable products are also well known, covering various specifications and types.

3. **Zhongji Technology Co., Ltd.**: This company mainly

Request 28 The concept of mutual knowledge versus common knowledge is a key distinction in social epistemology, which explores how knowledge is shared and understood within groups. In the context of the question presented in the document, the explanation revolves around the nature of knowledge shared among individuals.

Mutual knowledge refers to a situation where a group of people knows that a certain fact is true. In other words, everyone in the group is aware of the fact, but they do not necessarily communicate this knowledge to each other. It is mutual because each person in the group shares the knowledge internally, without necessarily explicitly stating it to the others. This type of knowledge is often
Request 29 This logo may be associated with a software product related to mathematics or science. In particular, it may be software focused on numerical computation, physics simulation, or mathematical modeling. Since "π" represents the circle constant, an important constant in mathematics and physics, the logo suggests software specialized in these fields.

The software using this logo may be a numerical analysis, engineering computation, or scientific research tool. Users who frequently need to use mathematical constants and operations such as pi and trigonometric functions
Request 30 The image presents a pie chart and accompanying text that illustrate the findings of a survey conducted by the American Association for the Advancement of Science (AAAS) on scientists' views regarding their role in public policy debates. The survey was conducted from September 11 to October 13, 2014, and included questions about how scientists should engage with public policy discussions related to science and technology.

The pie chart shows that the majority of AAAS scientists, 87%, believe that scientists should take an active role in public policy debates about science and technology. This represents a strong support for engagement in these areas.
Request 31 The correct answer is: "I = 1.5 A and V = 1.5 V". This answer satisfies the conditions given in the problem, namely a 1.5 V AA battery in series with a 1 kΩ resistance.

2nd run, high cache hit %

Request 0 I understand your frustration. While I can't control my own preferences, I respect your feelings. If you'd like to end our conversation, I'm happy to do so. 

Remember, it's okay to have different opinions, and I'm here to provide helpful and harmless information. If you have any other questions or need assistance with something else, feel free to ask.
Request 1 A beige, adjustable bra with wide straps and a back closure.
Request 2 Based on the information provided, domestic optical fiber and cable manufacturers in China include:

1. **China Optical Fiber Information Industry Co., Ltd.** (CNOIC): This company focuses on the R&D, production, and sale of optical fiber cables; its products include single-mode and multi-mode fiber optic cables.

2. **Huawei Technologies Co., Ltd.**: Huawei is not only a leading global telecom equipment manufacturer; its optical fiber cable products are also well known, covering various specifications and types.

3. **Zhongji Technology Co., Ltd.**: This company mainly

Request 28 The concept of mutual knowledge versus common knowledge is a key distinction in social epistemology, which explores how knowledge is shared and understood within groups. In the context of the question presented in the document, the explanation revolves around the nature of knowledge shared among individuals.

Mutual knowledge refers to a situation where a group of people knows that a certain fact is true. In other words, everyone in the group is aware of the fact, but they do not necessarily communicate this knowledge to each other. It is mutual because each person in the group shares the same piece of information internally, but it remains uncommunicated. This type of knowledge can
Request 29 This logo may be associated with a software product related to mathematics or science. In particular, it may be software focused on numerical computation, physics simulation, or mathematical modeling. Since "π" represents the circle constant, an important constant in mathematics and physics, the logo suggests software specialized in these fields.

The software using this logo may be a numerical analysis, engineering computation, or scientific research tool. Users who frequently need to use mathematical constants and operations such as pi and trigonometric functions
Request 30 The image presents a pie chart and accompanying text that illustrate the findings of a survey conducted by the American Association for the Advancement of Science (AAAS) on scientists' views regarding their role in public policy debates. The survey was conducted from September 11 to October 13, 2014, and included questions about how scientists should engage with public policy discussions related to science and technology.

The pie chart shows that the majority of AAAS scientists, 87%, believe that scientists should take an active role in public policy debates about science and technology. This represents a strong support for engagement in these areas.
Request 31 The correct answer is: "I = 1.5 A and V = 1.5 V". This answer satisfies the conditions given in the problem, namely a 1.5 V AA battery in series with a 1 kΩ resistance.

Benchmark results

Image token prefill throughput improves, especially in high cache hit rate + high TP cases.

For command-a-vision-07-2025 TP4

  • First-time requests
    -- Total token throughput: 581.34 -> 648.22 tok/s, an 11.5% improvement
    -- Mean TTFT: 3898.98 -> 3491.15 ms, 10.5% lower
  • Second-time requests
    -- Total token throughput: 2894.03 -> 4917.57 tok/s, a 69.9% improvement
    -- Mean TTFT: 790.18 -> 470.60 ms, 40.5% lower
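
The percentages follow directly from the raw numbers in the reports below, computed as new/old − 1 for throughput and 1 − new/old for TTFT:

# How the improvement percentages above are derived.
print(f"{648.22 / 581.34 - 1:.1%}")    # 11.5%, first-run throughput
print(f"{4917.57 / 2894.03 - 1:.1%}")  # 69.9%, second-run throughput
print(f"{1 - 3491.15 / 3898.98:.1%}")  # 10.5%, first-run mean TTFT
print(f"{1 - 470.60 / 790.18:.1%}")    # ~40.4%, second-run mean TTFT (reported above as 40.5%)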

For TP1, the shared memory cache is not expected to help, since there is no IPC overhead to eliminate.

CohereLabs/command-a-vision-07-2025 TP4
Results with this PR

# first time requests
============ Serving Benchmark Result ============
Successful requests:                     128       
Maximum request concurrency:             32        
Benchmark duration (s):                  15.71     
Total input tokens:                      10058     
Total generated tokens:                  128       
Request throughput (req/s):              8.15      
Output token throughput (tok/s):         8.15      
Total Token throughput (tok/s):          648.22    
---------------Time to First Token----------------
Mean TTFT (ms):                          3491.15   
Median TTFT (ms):                        3592.57   
P99 TTFT (ms):                           4541.37   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          0.00      
Median TPOT (ms):                        0.00      
P99 TPOT (ms):                           0.00      
---------------Inter-token Latency----------------
Mean ITL (ms):                           0.01      
Median ITL (ms):                         0.01      
P99 ITL (ms):                            0.04      
==================================================

# second time same requests (high cache hit %)
============ Serving Benchmark Result ============
Successful requests:                     128       
Maximum request concurrency:             32        
Benchmark duration (s):                  2.07      
Total input tokens:                      10058     
Total generated tokens:                  128       
Request throughput (req/s):              61.80     
Output token throughput (tok/s):         61.80     
Total Token throughput (tok/s):          4917.57   
---------------Time to First Token----------------
Mean TTFT (ms):                          470.60    
Median TTFT (ms):                        497.97    
P99 TTFT (ms):                           682.93    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          0.00      
Median TPOT (ms):                        0.00      
P99 TPOT (ms):                           0.00      
---------------Inter-token Latency----------------
Mean ITL (ms):                           0.01      
Median ITL (ms):                         0.01      
P99 ITL (ms):                            0.03      
==================================================

Results on main

# first time requests
============ Serving Benchmark Result ============
Successful requests:                     128       
Maximum request concurrency:             32        
Benchmark duration (s):                  17.52     
Total input tokens:                      10058     
Total generated tokens:                  128       
Request throughput (req/s):              7.31      
Output token throughput (tok/s):         7.31      
Total Token throughput (tok/s):          581.34    
---------------Time to First Token----------------
Mean TTFT (ms):                          3898.98   
Median TTFT (ms):                        4034.59   
P99 TTFT (ms):                           5293.20   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          0.00      
Median TPOT (ms):                        0.00      
P99 TPOT (ms):                           0.00      
---------------Inter-token Latency----------------
Mean ITL (ms):                           0.01      
Median ITL (ms):                         0.01      
P99 ITL (ms):                            0.04      
==================================================

# second time same requests (high cache hit %)
============ Serving Benchmark Result ============
Successful requests:                     128       
Maximum request concurrency:             32        
Benchmark duration (s):                  3.52      
Total input tokens:                      10058     
Total generated tokens:                  128       
Request throughput (req/s):              36.37     
Output token throughput (tok/s):         36.37     
Total Token throughput (tok/s):          2894.03   
---------------Time to First Token----------------
Mean TTFT (ms):                          790.18    
Median TTFT (ms):                        845.48    
P99 TTFT (ms):                           1111.05   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          0.00      
Median TPOT (ms):                        0.00      
P99 TPOT (ms):                           0.00      
---------------Inter-token Latency----------------
Mean ITL (ms):                           0.01      
Median ITL (ms):                         0.01      
P99 ITL (ms):                            0.04      
==================================================

(Optional) Documentation Update


gemini-code-assist (bot) left a comment

Summary of Changes

This pull request introduces a significant architectural enhancement by implementing a shared memory-based object store for multimodal data caching and inter-process communication (IPC). This new mechanism allows for more efficient handling and transfer of large multimodal inputs, such as images, across different processes in a distributed VLLM setup. By leveraging shared memory, the system can avoid costly data serialization and deserialization, and memory copying, thereby improving performance and reducing latency for multimodal model inference. The change also provides configurability to switch between shared memory and LRU caching, with shared memory becoming the default.

Highlights

  • Shared Memory Object Store: Implemented SingleWriterShmRingBuffer and SingleWriterShmObjectStorage to provide an efficient, shared-memory based key-value store. This enables one process to write data (e.g., preprocessed multimodal inputs) and multiple processes to read it without costly data copying, significantly improving IPC performance.
  • Configurable Multimodal Cache: Introduced a new configuration option, mm_preprocessor_cache_type, within MultiModalConfig. This allows users to select between the new shared memory (SHM) cache and the existing LRU cache for multimodal preprocessor outputs, with SHM now being the default choice.
  • Distributed Multimodal Data Handling: Integrated the shared memory object store into the distributed VLLM executor. The main process (P0) now writes preprocessed multimodal inputs to shared memory, and worker processes (P1+) retrieve these inputs directly from SHM, reducing inter-process communication overhead and latency for multimodal model inference.
  • New Environment Variables: Added VLLM_OBJECT_STORAGE_MAX_OBJECT_SIZE_MB, VLLM_OBJECT_STORAGE_SHM_BUFFER_SIZE_MB, and VLLM_OBJECT_STORAGE_SHM_BUFFER_NAME to vllm/envs.py. These environment variables provide fine-grained control over the shared memory object store's maximum object size, total buffer size, and unique name (see the sketch after this list).
  • Comprehensive Testing: Included new test files (test_shm_object_overhead.py, test_shm_buffer.py, test_shm_storage.py) to thoroughly validate the functionality, robustness, and performance of the shared memory ring buffer and object storage implementations.
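
A sketch of sizing the store via the environment variables named above, set before launching vLLM (the numeric values and segment name are illustrative, not recommendations):

import os

# Set in the environment before the engine process starts.
os.environ["VLLM_OBJECT_STORAGE_SHM_BUFFER_SIZE_MB"] = "4096"        # total ring buffer size
os.environ["VLLM_OBJECT_STORAGE_MAX_OBJECT_SIZE_MB"] = "512"         # per-object size cap
os.environ["VLLM_OBJECT_STORAGE_SHM_BUFFER_NAME"] = "my_vllm_store"  # shm segment name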

@mergify mergify bot added the documentation and v1 labels Jul 3, 2025

mergify bot commented Jul 3, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @dongluw.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Jul 3, 2025
gemini-code-assist (bot) left a comment

Code Review

The pull request introduces a shared memory-based object store for caching multimodal data, which is a significant and valuable addition for improving performance in multi-process environments. Several critical and high-severity issues related to correctness, configuration defaults, and design choices were identified and must be addressed.

@dongluw dongluw force-pushed the shm_store branch 6 times, most recently from fed61b7 to 5cbecdb Compare July 3, 2025 20:56
@mergify mergify bot removed the needs-rebase label Jul 3, 2025
@dongluw dongluw force-pushed the shm_store branch 5 times, most recently from 08ffad6 to 4e6e481 Compare July 4, 2025 17:07
@dongluw dongluw force-pushed the shm_store branch 3 times, most recently from 340066c to a1880ad Compare July 7, 2025 04:24
@dongluw dongluw marked this pull request as ready for review July 7, 2025 05:51
DarkLight1337 (Member) commented:

Can you merge from main to fix pre-commit?

Signed-off-by: donglu <[email protected]>

Address comments

Signed-off-by: donglu <[email protected]>
@dongluw dongluw force-pushed the shm_store branch 3 times, most recently from 506ad2b to 2a56e7c Compare September 11, 2025 21:33
dongluw (Contributor, Author) commented Sep 12, 2025

The lint check complains that it's not able to see three files newly added by this PR; not sure how to fix it. @DarkLight1337 any thoughts?

fatal: path 'tests/distributed/test_shm_buffer.py' exists on disk, but not in '12a8414d81e186c2db397d7f1ecb5c17e614c6bd'
fatal: path 'tests/distributed/test_shm_storage.py' exists on disk, but not in '12a8414d81e186c2db397d7f1ecb5c17e614c6bd'
fatal: path 'vllm/distributed/device_communicators/shm_object_storage.py' exists on disk, but not in '12a8414d81e186c2db397d7f1ecb5c17e614c6bd'

DarkLight1337 (Member) commented:

Let me just force-merge this

@vllm-bot vllm-bot merged commit a5b84f1 into vllm-project:main Sep 12, 2025
71 of 74 checks passed
mgoin (Member) commented Sep 12, 2025

@DarkLight1337 this broke pre-commit on main

Error: vllm/v1/executor/utils.py:22: error: "NewRequestData" has no attribute "mm_kwargs"  [attr-defined]
Error: vllm/v1/executor/utils.py:23: error: "NewRequestData" has no attribute "mm_kwargs"  [attr-defined]
Error: vllm/v1/executor/utils.py:24: error: "NewRequestData" has no attribute "mm_kwargs"  [attr-defined]

DarkLight1337 (Member) commented:

Working on a fix

DarkLight1337 (Member) commented:

Opened #24754

Labels: ci/build, documentation, multi-modality, ready, v1

Successfully merging this pull request may close these issues: [RFC]: Multimodal data IPC improvement