[feat] Add ONNX, OV support for SparseEncoder; refactor ONNX/OV (#3475)

* Add ONNX, OV support for SparseEncoder; refactor ONNX/OV for SentenceTransformer and CrossEncoder
  Also add tests for SparseEncoder and CrossEncoder ONNX/OV
  Move backend code to a separate directory
* Allow optimization/quantization of SparseEncoder ONNX/OV models
* Undo accidentally pushed changes
* Revert accidental addition
* Remove double logger
* Fix ValueError: openvino instead of onnx
* Add benchmarks and documentation for SparseEncoder ONNX/OV
* Fix docstring: model_args -> model_kwargs
docs/cross_encoder/usage/efficiency.rst (26 additions, 20 deletions)
@@ -166,7 +166,7 @@ Optimizing ONNX Models

  ONNX models can be optimized using `Optimum <https://huggingface.co/docs/optimum/index>`_, allowing for speedups on CPUs and GPUs alike. To do this, you can use the :func:`~sentence_transformers.backend.export_optimized_onnx_model` function, which saves the optimized model in a directory or model repository that you specify. It expects:

- - ``model``: a Sentence Transformer or Cross Encoder model loaded with the ONNX backend.
+ - ``model``: a Sentence Transformer, Sparse Encoder, or Cross Encoder model loaded with the ONNX backend.
  - ``optimization_config``: ``"O1"``, ``"O2"``, ``"O3"``, or ``"O4"`` representing optimization levels from :class:`~optimum.onnxruntime.AutoOptimizationConfig`, or an :class:`~optimum.onnxruntime.OptimizationConfig` instance.
  - ``model_name_or_path``: a path to save the optimized model file, or the repository name if you want to push it to the Hugging Face Hub.
  - ``push_to_hub``: (Optional) a boolean to push the optimized model to the Hugging Face Hub.
@@ -183,9 +183,9 @@ See this example for exporting a model with :doc:`optimization level 3 <optimum:

      model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L6-v2", backend="onnx")
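For reference, a minimal sketch of the optimization flow described above, using the parameter names listed in the diff; the output directory name is a placeholder chosen for illustration:

    from sentence_transformers import CrossEncoder
    from sentence_transformers.backend import export_optimized_onnx_model

    # Load the reranker with the ONNX backend, then apply optimization level O3.
    model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L6-v2", backend="onnx")
    export_optimized_onnx_model(
        model,
        optimization_config="O3",
        model_name_or_path="ms-marco-MiniLM-L6-v2-onnx-O3",  # placeholder output directory
    )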
  ONNX models can be quantized to int8 precision using `Optimum <https://huggingface.co/docs/optimum/index>`_, allowing for faster inference on CPUs. To do this, you can use the :func:`~sentence_transformers.backend.export_dynamic_quantized_onnx_model` function, which saves the quantized model in a directory or model repository that you specify. Dynamic quantization, unlike static quantization, does not require a calibration dataset. It expects:

- - ``model``: a Sentence Transformer or Cross Encoder model loaded with the ONNX backend.
+ - ``model``: a Sentence Transformer, Sparse Encoder, or Cross Encoder model loaded with the ONNX backend.
  - ``quantization_config``: ``"arm64"``, ``"avx2"``, ``"avx512"``, or ``"avx512_vnni"`` representing quantization configurations from :class:`~optimum.onnxruntime.AutoQuantizationConfig`, or an :class:`~optimum.onnxruntime.QuantizationConfig` instance.
  - ``model_name_or_path``: a path to save the quantized model file, or the repository name if you want to push it to the Hugging Face Hub.
  - ``push_to_hub``: (Optional) a boolean to push the quantized model to the Hugging Face Hub.
@@ -257,9 +259,9 @@ See this example for quantizing a model to ``int8`` with :doc:`avx512_vnni <opti

      model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L6-v2", backend="onnx")
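A minimal sketch of the dynamic quantization flow, again using the parameter names from the diff; the output directory name is a placeholder:

    from sentence_transformers import CrossEncoder
    from sentence_transformers.backend import export_dynamic_quantized_onnx_model

    # Dynamic int8 quantization does not need a calibration dataset.
    model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L6-v2", backend="onnx")
    export_dynamic_quantized_onnx_model(
        model,
        quantization_config="avx512_vnni",
        model_name_or_path="ms-marco-MiniLM-L6-v2-onnx-int8",  # placeholder output directory
    )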
@@ -524,25 +530,25 @@ The following images show the benchmark results for the different backends on GP

        <code>onnx</code>: ONNX with float32 precision, via <code>backend="onnx"</code>.
      </li>
      <li>
-       <code>onnx-O1</code>: ONNX with float32 precision and O1 optimization, via <code>export_optimized_onnx_model(..., "O1", ...)</code> and <code>backend="onnx"</code>.
+       <code>onnx-O1</code>: ONNX with float32 precision and O1 optimization, via <code>export_optimized_onnx_model(..., optimization_config="O1", ...)</code> and <code>backend="onnx"</code>.
      </li>
      <li>
-       <code>onnx-O2</code>: ONNX with float32 precision and O2 optimization, via <code>export_optimized_onnx_model(..., "O2", ...)</code> and <code>backend="onnx"</code>.
+       <code>onnx-O2</code>: ONNX with float32 precision and O2 optimization, via <code>export_optimized_onnx_model(..., optimization_config="O2", ...)</code> and <code>backend="onnx"</code>.
      </li>
      <li>
-       <code>onnx-O3</code>: ONNX with float32 precision and O3 optimization, via <code>export_optimized_onnx_model(..., "O3", ...)</code> and <code>backend="onnx"</code>.
+       <code>onnx-O3</code>: ONNX with float32 precision and O3 optimization, via <code>export_optimized_onnx_model(..., optimization_config="O3", ...)</code> and <code>backend="onnx"</code>.
      </li>
      <li>
-       <code>onnx-O4</code>: ONNX with float16 precision and O4 optimization, via <code>export_optimized_onnx_model(..., "O4", ...)</code> and <code>backend="onnx"</code>.
+       <code>onnx-O4</code>: ONNX with float16 precision and O4 optimization, via <code>export_optimized_onnx_model(..., optimization_config="O4", ...)</code> and <code>backend="onnx"</code>.
      </li>
      <li>
-       <code>onnx-qint8</code>: ONNX quantized to int8 with "avx512_vnni", via <code>export_dynamic_quantized_onnx_model(..., "avx512_vnni", ...)</code> and <code>backend="onnx"</code>. The different quantization configurations resulted in roughly equivalent speedups.
+       <code>onnx-qint8</code>: ONNX quantized to int8 with "avx512_vnni", via <code>export_dynamic_quantized_onnx_model(..., quantization_config="avx512_vnni", ...)</code> and <code>backend="onnx"</code>. The different quantization configurations resulted in roughly equivalent speedups.
      </li>
      <li>
        <code>openvino</code>: OpenVINO, via <code>backend="openvino"</code>.
      </li>
      <li>
-       <code>openvino-qint8</code>: OpenVINO quantized to int8 via <code>export_static_quantized_openvino_model(..., OVQuantizationConfig(), ...)</code> and <code>backend="openvino"</code>.
+       <code>openvino-qint8</code>: OpenVINO quantized to int8 via <code>export_static_quantized_openvino_model(..., quantization_config=OVQuantizationConfig(), ...)</code> and <code>backend="openvino"</code>.
      </li>
    </ul>
  </li>
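The <code>openvino-qint8</code> entry above corresponds to static OpenVINO quantization. A minimal sketch of that call, assuming the default <code>OVQuantizationConfig()</code> shown in the list; the output directory name is a placeholder:

    from optimum.intel import OVQuantizationConfig
    from sentence_transformers import CrossEncoder
    from sentence_transformers.backend import export_static_quantized_openvino_model

    # Static int8 quantization operates on a model loaded with the OpenVINO backend.
    model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L6-v2", backend="openvino")
    export_static_quantized_openvino_model(
        model,
        quantization_config=OVQuantizationConfig(),  # default quantization settings
        model_name_or_path="ms-marco-MiniLM-L6-v2-openvino-int8",  # placeholder output directory
    )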
@@ -577,7 +583,7 @@ Based on the benchmarks, this flowchart should help you decide which backend to

      }}%%
      graph TD
      A("What is your hardware?") -->|GPU| B("Are you using a small<br>batch size?")
-     A -->|CPU| C("Are you open to<br>quantization?")
+     A -->|CPU| C("Are minor performance<br>degradations acceptable?")
docs/sentence_transformer/usage/efficiency.rst (25 additions, 19 deletions)
@@ -134,7 +134,7 @@ Optimizing ONNX Models

  ONNX models can be optimized using `Optimum <https://huggingface.co/docs/optimum/index>`_, allowing for speedups on CPUs and GPUs alike. To do this, you can use the :func:`~sentence_transformers.backend.export_optimized_onnx_model` function, which saves the optimized model in a directory or model repository that you specify. It expects:

- - ``model``: a Sentence Transformer or Cross Encoder model loaded with the ONNX backend.
+ - ``model``: a Sentence Transformer, Sparse Encoder, or Cross Encoder model loaded with the ONNX backend.
  - ``optimization_config``: ``"O1"``, ``"O2"``, ``"O3"``, or ``"O4"`` representing optimization levels from :class:`~optimum.onnxruntime.AutoOptimizationConfig`, or an :class:`~optimum.onnxruntime.OptimizationConfig` instance.
  - ``model_name_or_path``: a path to save the optimized model file, or the repository name if you want to push it to the Hugging Face Hub.
  - ``push_to_hub``: (Optional) a boolean to push the optimized model to the Hugging Face Hub.
@@ -151,9 +151,9 @@ See this example for exporting a model with :doc:`optimization level 3 <optimum:

      model = SentenceTransformer("all-MiniLM-L6-v2", backend="onnx")
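Since this PR extends the same function to SparseEncoder, a minimal sketch of the equivalent call for a sparse model; the model id is only an example of a SPLADE-style sparse encoder, and the output directory name is a placeholder:

    from sentence_transformers import SparseEncoder
    from sentence_transformers.backend import export_optimized_onnx_model

    # SparseEncoder accepts the same backend argument as SentenceTransformer and CrossEncoder.
    model = SparseEncoder("naver/splade-cocondenser-ensembledistil", backend="onnx")
    export_optimized_onnx_model(
        model,
        optimization_config="O3",
        model_name_or_path="splade-onnx-O3",  # placeholder output directory
    )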
  ONNX models can be quantized to int8 precision using `Optimum <https://huggingface.co/docs/optimum/index>`_, allowing for faster inference on CPUs. To do this, you can use the :func:`~sentence_transformers.backend.export_dynamic_quantized_onnx_model` function, which saves the quantized model in a directory or model repository that you specify. Dynamic quantization, unlike static quantization, does not require a calibration dataset. It expects:

- - ``model``: a Sentence Transformer or Cross Encoder model loaded with the ONNX backend.
+ - ``model``: a Sentence Transformer, Sparse Encoder, or Cross Encoder model loaded with the ONNX backend.
  - ``quantization_config``: ``"arm64"``, ``"avx2"``, ``"avx512"``, or ``"avx512_vnni"`` representing quantization configurations from :class:`~optimum.onnxruntime.AutoQuantizationConfig`, or an :class:`~optimum.onnxruntime.QuantizationConfig` instance.
  - ``model_name_or_path``: a path to save the quantized model file, or the repository name if you want to push it to the Hugging Face Hub.
  - ``push_to_hub``: (Optional) a boolean to push the quantized model to the Hugging Face Hub.
@@ -225,9 +227,9 @@ See this example for quantizing a model to ``int8`` with :doc:`avx512_vnni <opti

      model = SentenceTransformer("all-MiniLM-L6-v2", backend="onnx")
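Once an optimized or quantized model has been saved, it is typically reloaded by pointing the backend at the exported ONNX file. A minimal sketch, assuming the export produced a file such as onnx/model_O3.onnx (the directory path and exact file name below are placeholders that depend on the export step):

    from sentence_transformers import SentenceTransformer

    # The exact file name depends on the suffix produced during export
    # (e.g. model_O3.onnx or model_qint8_avx512_vnni.onnx).
    model = SentenceTransformer(
        "path/to/exported-model",  # placeholder local directory or Hub repository
        backend="onnx",
        model_kwargs={"file_name": "onnx/model_O3.onnx"},
    )
    embeddings = model.encode(["An example sentence", "Another example sentence"])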
@@ -487,25 +493,25 @@ The following images show the benchmark results for the different backends on GP

        <code>onnx</code>: ONNX with float32 precision, via <code>backend="onnx"</code>.
      </li>
      <li>
-       <code>onnx-O1</code>: ONNX with float32 precision and O1 optimization, via <code>export_optimized_onnx_model(..., "O1", ...)</code> and <code>backend="onnx"</code>.
+       <code>onnx-O1</code>: ONNX with float32 precision and O1 optimization, via <code>export_optimized_onnx_model(..., optimization_config="O1", ...)</code> and <code>backend="onnx"</code>.
      </li>
      <li>
-       <code>onnx-O2</code>: ONNX with float32 precision and O2 optimization, via <code>export_optimized_onnx_model(..., "O2", ...)</code> and <code>backend="onnx"</code>.
+       <code>onnx-O2</code>: ONNX with float32 precision and O2 optimization, via <code>export_optimized_onnx_model(..., optimization_config="O2", ...)</code> and <code>backend="onnx"</code>.
      </li>
      <li>
-       <code>onnx-O3</code>: ONNX with float32 precision and O3 optimization, via <code>export_optimized_onnx_model(..., "O3", ...)</code> and <code>backend="onnx"</code>.
+       <code>onnx-O3</code>: ONNX with float32 precision and O3 optimization, via <code>export_optimized_onnx_model(..., optimization_config="O3", ...)</code> and <code>backend="onnx"</code>.
      </li>
      <li>
-       <code>onnx-O4</code>: ONNX with float16 precision and O4 optimization, via <code>export_optimized_onnx_model(..., "O4", ...)</code> and <code>backend="onnx"</code>.
+       <code>onnx-O4</code>: ONNX with float16 precision and O4 optimization, via <code>export_optimized_onnx_model(..., optimization_config="O4", ...)</code> and <code>backend="onnx"</code>.
      </li>
      <li>
-       <code>onnx-qint8</code>: ONNX quantized to int8 with "avx512_vnni", via <code>export_dynamic_quantized_onnx_model(..., "avx512_vnni", ...)</code> and <code>backend="onnx"</code>. The different quantization configurations resulted in roughly equivalent speedups.
+       <code>onnx-qint8</code>: ONNX quantized to int8 with "avx512_vnni", via <code>export_dynamic_quantized_onnx_model(..., quantization_config="avx512_vnni", ...)</code> and <code>backend="onnx"</code>. The different quantization configurations resulted in roughly equivalent speedups.
      </li>
      <li>
        <code>openvino</code>: OpenVINO, via <code>backend="openvino"</code>.
      </li>
      <li>
-       <code>openvino-qint8</code>: OpenVINO quantized to int8 via <code>export_static_quantized_openvino_model(..., OVQuantizationConfig(), ...)</code> and <code>backend="openvino"</code>.
+       <code>openvino-qint8</code>: OpenVINO quantized to int8 via <code>export_static_quantized_openvino_model(..., quantization_config=OVQuantizationConfig(), ...)</code> and <code>backend="openvino"</code>.
      </li>
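Finally, a minimal usage sketch for the OpenVINO backend that this PR enables for SparseEncoder; the model id is only an example of a SPLADE-style sparse encoder:

    from sentence_transformers import SparseEncoder

    # Load a sparse encoder with the OpenVINO backend and compute sparse embeddings.
    model = SparseEncoder("naver/splade-cocondenser-ensembledistil", backend="openvino")
    embeddings = model.encode(["OpenVINO inference for sparse embeddings"])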