Commit 99dda77

[feat] Add ONNX, OV support for SparseEncoder; refactor ONNX/OV (#3475)
* Add ONNX, OV support for SparseEncoder; refactor ONNX/OV ... for SentenceTransformer and CrossEncoder. Also add tests for SparseEncoder and CrossEncoder ONNX/OV. Move backend code to a separate directory
* Allow optimization/quantization of SparseEncoder ONNX/OV models
* Undo accidentally pushed changes
* Revert accidental addition
* Remove double logger
* Fix ValueError: openvino instead of onnx
* Add benchmarks and documentation for SparseEncoder ONNX/OV
* Fix docstring: model_args -> model_kwargs
1 parent 69406a3 commit 99dda77

File tree

18 files changed: +2013 −956 lines changed

docs/cross_encoder/usage/efficiency.rst

Lines changed: 26 additions & 20 deletions
@@ -166,7 +166,7 @@ Optimizing ONNX Models
 
 ONNX models can be optimized using `Optimum <https://huggingface.co/docs/optimum/index>`_, allowing for speedups on CPUs and GPUs alike. To do this, you can use the :func:`~sentence_transformers.backend.export_optimized_onnx_model` function, which saves the optimized model in a directory or model repository that you specify. It expects:
 
-- ``model``: a Sentence Transformer or Cross Encoder model loaded with the ONNX backend.
+- ``model``: a Sentence Transformer, Sparse Encoder, or Cross Encoder model loaded with the ONNX backend.
 - ``optimization_config``: ``"O1"``, ``"O2"``, ``"O3"``, or ``"O4"`` representing optimization levels from :class:`~optimum.onnxruntime.AutoOptimizationConfig`, or an :class:`~optimum.onnxruntime.OptimizationConfig` instance.
 - ``model_name_or_path``: a path to save the optimized model file, or the repository name if you want to push it to the Hugging Face Hub.
 - ``push_to_hub``: (Optional) a boolean to push the optimized model to the Hugging Face Hub.
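
For reference, a minimal sketch of the Sparse Encoder path that this parameter list now covers (the SPLADE model id and output path below are illustrative, not taken from this diff):

    from sentence_transformers import SparseEncoder, export_optimized_onnx_model

    # Load a sparse encoder with the ONNX backend, as newly supported by this commit
    model = SparseEncoder("naver/splade-cocondenser-ensembledistil", backend="onnx")

    # Export an O3-optimized ONNX model to a local directory
    export_optimized_onnx_model(
        model=model,
        optimization_config="O3",
        model_name_or_path="path/to/my/splade-onnx-O3",
    )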
@@ -183,9 +183,9 @@ See this example for exporting a model with :doc:`optimization level 3 <optimum:
 
 model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L6-v2", backend="onnx")
 export_optimized_onnx_model(
-    model,
-    "O3",
-    "cross-encoder/ms-marco-MiniLM-L6-v2",
+    model=model,
+    optimization_config="O3",
+    model_name_or_path="cross-encoder/ms-marco-MiniLM-L6-v2",
     push_to_hub=True,
     create_pr=True,
 )
@@ -219,7 +219,9 @@ See this example for exporting a model with :doc:`optimization level 3 <optimum:
 from sentence_transformers import CrossEncoder, export_optimized_onnx_model
 
 model = CrossEncoder("path/to/my/mpnet-legal-finetuned", backend="onnx")
-export_optimized_onnx_model(model, "O3", "path/to/my/mpnet-legal-finetuned")
+export_optimized_onnx_model(
+    model=model, optimization_config="O3", model_name_or_path="path/to/my/mpnet-legal-finetuned"
+)
 
 After optimizing::
 
@@ -238,7 +240,7 @@ Quantizing ONNX Models
 
 ONNX models can be quantized to int8 precision using `Optimum <https://huggingface.co/docs/optimum/index>`_, allowing for faster inference on CPUs. To do this, you can use the :func:`~sentence_transformers.backend.export_dynamic_quantized_onnx_model` function, which saves the quantized model in a directory or model repository that you specify. Dynamic quantization, unlike static quantization, does not require a calibration dataset. It expects:
 
-- ``model``: a Sentence Transformer or Cross Encoder model loaded with the ONNX backend.
+- ``model``: a Sentence Transformer, Sparse Encoder, or Cross Encoder model loaded with the ONNX backend.
 - ``quantization_config``: ``"arm64"``, ``"avx2"``, ``"avx512"``, or ``"avx512_vnni"`` representing quantization configurations from :class:`~optimum.onnxruntime.AutoQuantizationConfig`, or an :class:`~optimum.onnxruntime.QuantizationConfig` instance.
 - ``model_name_or_path``: a path to save the quantized model file, or the repository name if you want to push it to the Hugging Face Hub.
 - ``push_to_hub``: (Optional) a boolean to push the quantized model to the Hugging Face Hub.
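
Analogously, a minimal sketch of dynamic int8 quantization for the newly supported Sparse Encoder case (model id and output path are illustrative):

    from sentence_transformers import SparseEncoder, export_dynamic_quantized_onnx_model

    # Load a sparse encoder with the ONNX backend, then quantize it to int8 for faster CPU inference
    model = SparseEncoder("naver/splade-cocondenser-ensembledistil", backend="onnx")
    export_dynamic_quantized_onnx_model(
        model=model,
        quantization_config="avx512_vnni",
        model_name_or_path="path/to/my/splade-onnx-int8",
    )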
@@ -257,9 +259,9 @@ See this example for quantizing a model to ``int8`` with :doc:`avx512_vnni <opti
 
 model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L6-v2", backend="onnx")
 export_dynamic_quantized_onnx_model(
-    model,
-    "avx512_vnni",
-    "sentence-transformers/cross-encoder/ms-marco-MiniLM-L6-v2",
+    model=model,
+    quantization_config="avx512_vnni",
+    model_name_or_path="sentence-transformers/cross-encoder/ms-marco-MiniLM-L6-v2",
     push_to_hub=True,
     create_pr=True,
 )
@@ -293,7 +295,9 @@ See this example for quantizing a model to ``int8`` with :doc:`avx512_vnni <opti
 from sentence_transformers import CrossEncoder, export_dynamic_quantized_onnx_model
 
 model = CrossEncoder("path/to/my/mpnet-legal-finetuned", backend="onnx")
-export_dynamic_quantized_onnx_model(model, "O3", "path/to/my/mpnet-legal-finetuned")
+export_dynamic_quantized_onnx_model(
+    model=model, quantization_config="avx512_vnni", model_name_or_path="path/to/my/mpnet-legal-finetuned"
+)
 
 After quantizing::
 
@@ -374,7 +378,7 @@ To do this, you can use the :func:`~sentence_transformers.backend.export_static_
 which saves the quantized model in a directory or model repository that you specify.
 Post-Training Static Quantization expects:
 
-- ``model``: a Sentence Transformer or Cross Encoder model loaded with the OpenVINO backend.
+- ``model``: a Sentence Transformer, Sparse Encoder, or Cross Encoder model loaded with the OpenVINO backend.
 - ``quantization_config``: (Optional) The quantization configuration. This parameter accepts either:
   ``None`` for the default 8-bit quantization, a dictionary representing quantization configurations, or
   an :class:`~optimum.intel.OVQuantizationConfig` instance.
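
A minimal sketch of the Sparse Encoder case with the default 8-bit static quantization (model id and output path are illustrative, not taken from this diff):

    from sentence_transformers import SparseEncoder, export_static_quantized_openvino_model

    # Load a sparse encoder with the OpenVINO backend, as newly supported by this commit
    model = SparseEncoder("naver/splade-cocondenser-ensembledistil", backend="openvino")

    # quantization_config=None falls back to the default int8 post-training static quantization
    export_static_quantized_openvino_model(
        model=model,
        quantization_config=None,
        model_name_or_path="path/to/my/splade-openvino-int8",
    )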
@@ -397,7 +401,7 @@ See this example for quantizing a model to ``int8`` with `static quantization <h
 
 model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L6-v2", backend="openvino")
 export_static_quantized_openvino_model(
-    model,
+    model=model,
     quantization_config=None,
     model_name_or_path="cross-encoder/ms-marco-MiniLM-L6-v2",
     push_to_hub=True,
@@ -435,7 +439,9 @@ See this example for quantizing a model to ``int8`` with `static quantization <h
 
 model = CrossEncoder("path/to/my/mpnet-legal-finetuned", backend="openvino")
 quantization_config = OVQuantizationConfig()
-export_static_quantized_openvino_model(model, quantization_config, "path/to/my/mpnet-legal-finetuned")
+export_static_quantized_openvino_model(
+    model=model, quantization_config=quantization_config, model_name_or_path="path/to/my/mpnet-legal-finetuned"
+)
 
 After quantizing::
 
@@ -524,25 +530,25 @@ The following images show the benchmark results for the different backends on GP
 <code>onnx</code>: ONNX with float32 precision, via <code>backend="onnx"</code>.
 </li>
 <li>
-<code>onnx-O1</code>: ONNX with float32 precision and O1 optimization, via <code>export_optimized_onnx_model(..., "O1", ...)</code> and <code>backend="onnx"</code>.
+<code>onnx-O1</code>: ONNX with float32 precision and O1 optimization, via <code>export_optimized_onnx_model(..., optimization_config="O1", ...)</code> and <code>backend="onnx"</code>.
 </li>
 <li>
-<code>onnx-O2</code>: ONNX with float32 precision and O2 optimization, via <code>export_optimized_onnx_model(..., "O2", ...)</code> and <code>backend="onnx"</code>.
+<code>onnx-O2</code>: ONNX with float32 precision and O2 optimization, via <code>export_optimized_onnx_model(..., optimization_config="O2", ...)</code> and <code>backend="onnx"</code>.
 </li>
 <li>
-<code>onnx-O3</code>: ONNX with float32 precision and O3 optimization, via <code>export_optimized_onnx_model(..., "O3", ...)</code> and <code>backend="onnx"</code>.
+<code>onnx-O3</code>: ONNX with float32 precision and O3 optimization, via <code>export_optimized_onnx_model(..., optimization_config="O3", ...)</code> and <code>backend="onnx"</code>.
 </li>
 <li>
-<code>onnx-O4</code>: ONNX with float16 precision and O4 optimization, via <code>export_optimized_onnx_model(..., "O4", ...)</code> and <code>backend="onnx"</code>.
+<code>onnx-O4</code>: ONNX with float16 precision and O4 optimization, via <code>export_optimized_onnx_model(..., optimization_config="O4", ...)</code> and <code>backend="onnx"</code>.
 </li>
 <li>
-<code>onnx-qint8</code>: ONNX quantized to int8 with "avx512_vnni", via <code>export_dynamic_quantized_onnx_model(..., "avx512_vnni", ...)</code> and <code>backend="onnx"</code>. The different quantization configurations resulted in roughly equivalent speedups.
+<code>onnx-qint8</code>: ONNX quantized to int8 with "avx512_vnni", via <code>export_dynamic_quantized_onnx_model(..., quantization_config="avx512_vnni", ...)</code> and <code>backend="onnx"</code>. The different quantization configurations resulted in roughly equivalent speedups.
 </li>
 <li>
 <code>openvino</code>: OpenVINO, via <code>backend="openvino"</code>.
 </li>
 <li>
-<code>openvino-qint8</code>: OpenVINO quantized to int8 via <code>export_static_quantized_openvino_model(..., OVQuantizationConfig(), ...)</code> and <code>backend="openvino"</code>.
+<code>openvino-qint8</code>: OpenVINO quantized to int8 via <code>export_static_quantized_openvino_model(..., quantization_config=OVQuantizationConfig(), ...)</code> and <code>backend="openvino"</code>.
 </li>
 </ul>
 </li>
@@ -577,7 +583,7 @@ Based on the benchmarks, this flowchart should help you decide which backend to
 }}%%
 graph TD
 A("What is your hardware?") -->|GPU| B("Are you using a small<br>batch size?")
-A -->|CPU| C("Are you open to<br>quantization?")
+A -->|CPU| C("Are minor performance<br>degradations acceptable?")
 B -->|yes| D[onnx-O4]
 B -->|no| F[float16]
 C -->|yes| G[openvino-qint8]
[Two image files changed (53 KB and 50.2 KB); binary diffs not shown]

docs/sentence_transformer/usage/efficiency.rst

Lines changed: 25 additions & 19 deletions
@@ -134,7 +134,7 @@ Optimizing ONNX Models
 
 ONNX models can be optimized using `Optimum <https://huggingface.co/docs/optimum/index>`_, allowing for speedups on CPUs and GPUs alike. To do this, you can use the :func:`~sentence_transformers.backend.export_optimized_onnx_model` function, which saves the optimized model in a directory or model repository that you specify. It expects:
 
-- ``model``: a Sentence Transformer or Cross Encoder model loaded with the ONNX backend.
+- ``model``: a Sentence Transformer, Sparse Encoder, or Cross Encoder model loaded with the ONNX backend.
 - ``optimization_config``: ``"O1"``, ``"O2"``, ``"O3"``, or ``"O4"`` representing optimization levels from :class:`~optimum.onnxruntime.AutoOptimizationConfig`, or an :class:`~optimum.onnxruntime.OptimizationConfig` instance.
 - ``model_name_or_path``: a path to save the optimized model file, or the repository name if you want to push it to the Hugging Face Hub.
 - ``push_to_hub``: (Optional) a boolean to push the optimized model to the Hugging Face Hub.
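
Besides the ``"O1"``-``"O4"`` preset strings, an explicit ``optimum.onnxruntime.OptimizationConfig`` instance can be passed. A minimal sketch of that variant (the chosen options and output path are illustrative):

    from optimum.onnxruntime import OptimizationConfig
    from sentence_transformers import SentenceTransformer, export_optimized_onnx_model

    model = SentenceTransformer("all-MiniLM-L6-v2", backend="onnx")

    # Hand-built optimization config instead of a preset string; the options are illustrative
    optimization_config = OptimizationConfig(optimization_level=2)
    export_optimized_onnx_model(
        model=model,
        optimization_config=optimization_config,
        model_name_or_path="path/to/my/all-MiniLM-L6-v2-onnx",
    )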
@@ -151,9 +151,9 @@ See this example for exporting a model with :doc:`optimization level 3 <optimum:
 
 model = SentenceTransformer("all-MiniLM-L6-v2", backend="onnx")
 export_optimized_onnx_model(
-    model,
-    "O3",
-    "sentence-transformers/all-MiniLM-L6-v2",
+    model=model,
+    optimization_config="O3",
+    model_name_or_path="sentence-transformers/all-MiniLM-L6-v2",
     push_to_hub=True,
     create_pr=True,
 )
@@ -187,7 +187,9 @@ See this example for exporting a model with :doc:`optimization level 3 <optimum:
 from sentence_transformers import SentenceTransformer, export_optimized_onnx_model
 
 model = SentenceTransformer("path/to/my/mpnet-legal-finetuned", backend="onnx")
-export_optimized_onnx_model(model, "O3", "path/to/my/mpnet-legal-finetuned")
+export_optimized_onnx_model(
+    model=model, optimization_config="O3", model_name_or_path="path/to/my/mpnet-legal-finetuned"
+)
 
 After optimizing::
 
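The "After optimizing::" block that follows in the documentation loads the exported file back in. A minimal sketch of that pattern, assuming Optimum's default output filename for an O3 export (``onnx/model_O3.onnx``), which may differ in your setup:

    from sentence_transformers import SentenceTransformer

    # The file_name below is an assumption based on the default naming convention;
    # adjust it to the file actually written by export_optimized_onnx_model
    model = SentenceTransformer(
        "path/to/my/mpnet-legal-finetuned",
        backend="onnx",
        model_kwargs={"file_name": "onnx/model_O3.onnx"},
    )
    embeddings = model.encode(["A clause about liability", "A summary of the ruling"])
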
@@ -206,7 +208,7 @@ Quantizing ONNX Models
 
 ONNX models can be quantized to int8 precision using `Optimum <https://huggingface.co/docs/optimum/index>`_, allowing for faster inference on CPUs. To do this, you can use the :func:`~sentence_transformers.backend.export_dynamic_quantized_onnx_model` function, which saves the quantized model in a directory or model repository that you specify. Dynamic quantization, unlike static quantization, does not require a calibration dataset. It expects:
 
-- ``model``: a Sentence Transformer or Cross Encoder model loaded with the ONNX backend.
+- ``model``: a Sentence Transformer, Sparse Encoder, or Cross Encoder model loaded with the ONNX backend.
 - ``quantization_config``: ``"arm64"``, ``"avx2"``, ``"avx512"``, or ``"avx512_vnni"`` representing quantization configurations from :class:`~optimum.onnxruntime.AutoQuantizationConfig`, or an :class:`~optimum.onnxruntime.QuantizationConfig` instance.
 - ``model_name_or_path``: a path to save the quantized model file, or the repository name if you want to push it to the Hugging Face Hub.
 - ``push_to_hub``: (Optional) a boolean to push the quantized model to the Hugging Face Hub.
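
Similarly, instead of a preset string, a ``QuantizationConfig`` built via ``optimum.onnxruntime.AutoQuantizationConfig`` can be passed. A minimal sketch (the keyword arguments and output path are illustrative):

    from optimum.onnxruntime import AutoQuantizationConfig
    from sentence_transformers import SentenceTransformer, export_dynamic_quantized_onnx_model

    model = SentenceTransformer("all-MiniLM-L6-v2", backend="onnx")

    # Build the avx512_vnni configuration explicitly rather than passing the string
    quantization_config = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=False)
    export_dynamic_quantized_onnx_model(
        model=model,
        quantization_config=quantization_config,
        model_name_or_path="path/to/my/all-MiniLM-L6-v2-onnx-int8",
    )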
@@ -225,9 +227,9 @@ See this example for quantizing a model to ``int8`` with :doc:`avx512_vnni <opti
 
 model = SentenceTransformer("all-MiniLM-L6-v2", backend="onnx")
 export_dynamic_quantized_onnx_model(
-    model,
-    "avx512_vnni",
-    "sentence-transformers/all-MiniLM-L6-v2",
+    model=model,
+    quantization_config="avx512_vnni",
+    model_name_or_path="sentence-transformers/all-MiniLM-L6-v2",
     push_to_hub=True,
     create_pr=True,
 )
@@ -261,7 +263,9 @@ See this example for quantizing a model to ``int8`` with :doc:`avx512_vnni <opti
 from sentence_transformers import SentenceTransformer, export_dynamic_quantized_onnx_model
 
 model = SentenceTransformer("path/to/my/mpnet-legal-finetuned", backend="onnx")
-export_dynamic_quantized_onnx_model(model, "O3", "path/to/my/mpnet-legal-finetuned")
+export_dynamic_quantized_onnx_model(
+    model=model, quantization_config="avx512_vnni", model_name_or_path="path/to/my/mpnet-legal-finetuned"
+)
 
 After quantizing::
 
@@ -334,7 +338,7 @@ To do this, you can use the :func:`~sentence_transformers.backend.export_static_
 which saves the quantized model in a directory or model repository that you specify.
 Post-Training Static Quantization expects:
 
-- ``model``: a Sentence Transformer or Cross Encoder model loaded with the OpenVINO backend.
+- ``model``: a Sentence Transformer, Sparse Encoder, or Cross Encoder model loaded with the OpenVINO backend.
 - ``quantization_config``: (Optional) The quantization configuration. This parameter accepts either:
   ``None`` for the default 8-bit quantization, a dictionary representing quantization configurations, or
   an :class:`~optimum.intel.OVQuantizationConfig` instance.
@@ -357,7 +361,7 @@ See this example for quantizing a model to ``int8`` with `static quantization <h
 
 model = SentenceTransformer("all-MiniLM-L6-v2", backend="openvino")
 export_static_quantized_openvino_model(
-    model,
+    model=model,
     quantization_config=None,
     model_name_or_path="sentence-transformers/all-MiniLM-L6-v2",
     push_to_hub=True,
@@ -395,7 +399,9 @@ See this example for quantizing a model to ``int8`` with `static quantization <h
 
 model = SentenceTransformer("path/to/my/mpnet-legal-finetuned", backend="openvino")
 quantization_config = OVQuantizationConfig()
-export_static_quantized_openvino_model(model, quantization_config, "path/to/my/mpnet-legal-finetuned")
+export_static_quantized_openvino_model(
+    model=model, quantization_config=quantization_config, model_name_or_path="path/to/my/mpnet-legal-finetuned"
+)
 
 After quantizing::
 
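The "After quantizing::" block that follows in the documentation loads the statically quantized OpenVINO model back in. A minimal sketch of that pattern, assuming the default ``openvino/openvino_model_qint8.xml`` filename, which may differ in your setup:

    from sentence_transformers import SentenceTransformer

    # The file_name below is an assumption based on the default naming convention;
    # adjust it to the file actually written by export_static_quantized_openvino_model
    model = SentenceTransformer(
        "path/to/my/mpnet-legal-finetuned",
        backend="openvino",
        model_kwargs={"file_name": "openvino/openvino_model_qint8.xml"},
    )
    embeddings = model.encode(["The new regulation applies from January 2025"])
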
@@ -487,25 +493,25 @@ The following images show the benchmark results for the different backends on GP
 <code>onnx</code>: ONNX with float32 precision, via <code>backend="onnx"</code>.
 </li>
 <li>
-<code>onnx-O1</code>: ONNX with float32 precision and O1 optimization, via <code>export_optimized_onnx_model(..., "O1", ...)</code> and <code>backend="onnx"</code>.
+<code>onnx-O1</code>: ONNX with float32 precision and O1 optimization, via <code>export_optimized_onnx_model(..., optimization_config="O1", ...)</code> and <code>backend="onnx"</code>.
 </li>
 <li>
-<code>onnx-O2</code>: ONNX with float32 precision and O2 optimization, via <code>export_optimized_onnx_model(..., "O2", ...)</code> and <code>backend="onnx"</code>.
+<code>onnx-O2</code>: ONNX with float32 precision and O2 optimization, via <code>export_optimized_onnx_model(..., optimization_config="O2", ...)</code> and <code>backend="onnx"</code>.
 </li>
 <li>
-<code>onnx-O3</code>: ONNX with float32 precision and O3 optimization, via <code>export_optimized_onnx_model(..., "O3", ...)</code> and <code>backend="onnx"</code>.
+<code>onnx-O3</code>: ONNX with float32 precision and O3 optimization, via <code>export_optimized_onnx_model(..., optimization_config="O3", ...)</code> and <code>backend="onnx"</code>.
 </li>
 <li>
-<code>onnx-O4</code>: ONNX with float16 precision and O4 optimization, via <code>export_optimized_onnx_model(..., "O4", ...)</code> and <code>backend="onnx"</code>.
+<code>onnx-O4</code>: ONNX with float16 precision and O4 optimization, via <code>export_optimized_onnx_model(..., optimization_config="O4", ...)</code> and <code>backend="onnx"</code>.
 </li>
 <li>
-<code>onnx-qint8</code>: ONNX quantized to int8 with "avx512_vnni", via <code>export_dynamic_quantized_onnx_model(..., "avx512_vnni", ...)</code> and <code>backend="onnx"</code>. The different quantization configurations resulted in roughly equivalent speedups.
+<code>onnx-qint8</code>: ONNX quantized to int8 with "avx512_vnni", via <code>export_dynamic_quantized_onnx_model(..., quantization_config="avx512_vnni", ...)</code> and <code>backend="onnx"</code>. The different quantization configurations resulted in roughly equivalent speedups.
 </li>
 <li>
 <code>openvino</code>: OpenVINO, via <code>backend="openvino"</code>.
 </li>
 <li>
-<code>openvino-qint8</code>: OpenVINO quantized to int8 via <code>export_static_quantized_openvino_model(..., OVQuantizationConfig(), ...)</code> and <code>backend="openvino"</code>.
+<code>openvino-qint8</code>: OpenVINO quantized to int8 via <code>export_static_quantized_openvino_model(..., quantization_config=OVQuantizationConfig(), ...)</code> and <code>backend="openvino"</code>.
 </li>
 </ul>
 </li>
