Conversation

kozistr (Contributor) commented Jun 17, 2025

What does this PR do?

Fixes #642

Changes

  • left padding (with the pad token id)
  • causal attention mask
  • fix last-token pooling for the batch case (see the sketch after this list)
  • Currently, batch inference works fine, but single-query inference has a problem.
  • (updated) I've investigated this further and found that it's due to the attention logic.
    • self.config._attn_implementation in the Qwen3 Embedding config defaults to sdpa; if we change this value to eager, the naive implementation of self-attention, the output matches this PR as expected.
    • but sdpa seems to do something different; this needs more investigation.
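
A minimal sketch of the left-padding plus last-token-pooling idea described above, written against the transformers Python API rather than the TEI code in this PR; the model id and usage are illustrative assumptions:

```python
# Illustrative sketch (transformers API, not the TEI implementation from this PR).
# With left padding, the last non-padded token of every sequence sits at the
# final position, so last-token pooling for a batch is simply hidden[:, -1].
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

model_id = "Qwen/Qwen3-Embedding-0.6B"  # example checkpoint, assumed

tokenizer = AutoTokenizer.from_pretrained(model_id, padding_side="left")
model = AutoModel.from_pretrained(model_id)
model.eval()

texts = ["What is the capital of France?", "Paris is the capital of France."]
batch = tokenizer(texts, padding=True, return_tensors="pt")

with torch.no_grad():
    # Decoder-only model: causal attention plus the padding mask from the tokenizer.
    hidden = model(**batch).last_hidden_state  # (batch, seq_len, hidden)

# Left padding makes the batch case trivial: position -1 is always a real token.
embeddings = F.normalize(hidden[:, -1], dim=-1)
print(embeddings.shape)
```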

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline, Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@Narsil @alvarobartt

alvarobartt (Member) left a comment

Thanks for the PR @kozistr, and apologies I missed those points! LGTM

kozistr (Contributor, Author) commented Jun 18, 2025

@alvarobartt @Narsil thanks for the quick check!

I've investigated this issue further (single-query inference producing odd results) and found that it's due to the attention logic: self.config._attn_implementation in the Qwen3 Embedding config defaults to sdpa, and if we change this value to eager, the naive implementation of self-attention, the output matches this PR as expected.

It seems sdpa does something different; I need to investigate this further. A minimal way to compare the two is sketched below.
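
A minimal sketch of that comparison, assuming the checkpoint can be loaded through transformers (the model id is an example, not taken from this PR):

```python
# Illustrative sketch: compare eager vs. sdpa attention outputs for a single query.
import torch
from transformers import AutoModel, AutoTokenizer

model_id = "Qwen/Qwen3-Embedding-0.6B"  # example checkpoint, assumed

tokenizer = AutoTokenizer.from_pretrained(model_id, padding_side="left")
inputs = tokenizer(["What is the capital of France?"], return_tensors="pt")

last_token = {}
for impl in ("eager", "sdpa"):
    model = AutoModel.from_pretrained(model_id, attn_implementation=impl)
    model.eval()
    with torch.no_grad():
        last_token[impl] = model(**inputs).last_hidden_state[:, -1]

# If the two attention implementations agree, this should be within float tolerance.
print(torch.max(torch.abs(last_token["eager"] - last_token["sdpa"])))
```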

I'll let you know when I find the root cause :) Thanks!

Narsil merged commit 1193add into huggingface:main on Jun 18, 2025
2 of 13 checks passed
kozistr deleted the fix/qwen3 branch on June 18, 2025 at 08:28

Successfully merging this pull request may close these issues.

Qwen3-Embedding models: embeddings from TEI differ sharply from Sentence-Transformers reference