Support wav2vec base models #9637
Conversation
comfy/audio_encoders/wav2vec2.py (Outdated)

    if fps != 50:
Is there any reason this shouldn't be done by the node using the outputs like the s2v node? https://github.com/comfyanonymous/ComfyUI/blob/master/comfy_extras/nodes_wan.py#L892
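For illustration, node-side resampling of the encoder output could look roughly like the sketch below; the function name, the 50 Hz source rate, and the default target rate are assumptions, not the actual S2V node code.

```python
import torch
import torch.nn.functional as F

def resample_audio_features(features: torch.Tensor, src_fps: float = 50.0, dst_fps: float = 25.0) -> torch.Tensor:
    """Linearly resample encoder features of shape [B, T, C] from src_fps to dst_fps."""
    if src_fps == dst_fps:
        return features
    new_len = int(round(features.shape[1] * dst_fps / src_fps))
    x = features.transpose(1, 2)  # F.interpolate expects [B, C, T]
    x = F.interpolate(x, size=new_len, mode="linear", align_corners=False)
    return x.transpose(1, 2)      # back to [B, T, C]
```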
I'm unsure about that; this is the way it was done in the initial MultiTalk implementation. I guess the reason could be efficiency, since this way the model has less to process, though with models as light as these I'm not sure it matters. Mostly I was trying to match the initial MultiTalk implementation to get matching results.
I can test if it makes a difference.
I think it's best for the encoder nodes to return the raw output from the model as much as possible; that makes it easier when multiple models use the output of the same encoder but with different processing.
Well, I agree with that principle, of course. I tested it now, and there is a slight difference in results between interpolating before the encode and interpolating the already encoded features (using the same function as S2V). Should we worry about that?
AnimateDiff_00005-audio.webm
That's weird, why is there a difference?
I don't know, but looking at the tensors there's already a slight difference in the features if the interpolation is done before feature_projection:
lerp before proj: shape=torch.Size([1, 601, 768]) min=-12.0035 max=11.7886 mean=-0.0433 std=1.6125
lerp after proj: shape=torch.Size([1, 601, 768]) min=-11.9092 max=10.5639 mean=-0.0436 std=1.5525
Mean abs diff: 0.053896
and similarly between encoding interpolated features and interpolating encoded features:
lerp before encode: shape=torch.Size([13, 601, 768]) min=-4.3660 max=4.7442 mean=0.0022 std=0.3017
lerp after encode: shape=torch.Size([13, 601, 768]) min=-3.9754 max=5.0701 mean=0.0027 std=0.2990
Overall mean abs diff across all layers: 0.079161
It's very small; can it really be that sensitive? Maybe it matters if the model was trained that way.
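For reference, a comparison like the one above boils down to something like this hypothetical helper (not code from this PR):

```python
import torch

def tensor_stats(name: str, t: torch.Tensor) -> None:
    # Print the same summary stats as the comparison above.
    print(f"{name}: shape={tuple(t.shape)} min={t.min().item():.4f} max={t.max().item():.4f} "
          f"mean={t.mean().item():.4f} std={t.std().item():.4f}")

def mean_abs_diff(a: torch.Tensor, b: torch.Tensor) -> float:
    return (a - b).abs().mean().item()
```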
The original implementation specifically overrides the Wav2Vec2 model to do the interpolation before the projection and encode:
https://github.com/MeiGen-AI/MultiTalk/blob/636c774370d40214e6be1ea1ed98bc219dfbd8c4/src/audio_analysis/wav2vec2.py#L32
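Roughly, that override interpolates the CNN feature-extractor output to the target length before feature_projection and the transformer encoder. A simplified sketch of the idea against the stock transformers Wav2Vec2Model (not the exact MultiTalk code):

```python
import torch
import torch.nn.functional as F
from transformers import Wav2Vec2Model

def encode_with_early_interp(model: Wav2Vec2Model, waveform: torch.Tensor, target_len: int):
    feats = model.feature_extractor(waveform)                 # [B, C, T], ~50 features/s
    feats = F.interpolate(feats, size=target_len, mode="linear", align_corners=False)
    feats = feats.transpose(1, 2)                             # [B, T, C]
    hidden_states, _ = model.feature_projection(feats)
    out = model.encoder(hidden_states, output_hidden_states=True)
    return out.hidden_states                                  # per-layer features, already at target_len
```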
Another longer comparison test with better settings overall still shows a difference, but I can't really say if it's better or worse:
AnimateDiff_00002-audio.webm
Anyway, there's no hurry with this PR, as the native model implementation still needs to be done before this is of any use, so which way to do this can also be decided later.
Do all the models do the same interpolation?
> Do all the models do the same interpolation?
As far as I can see, the normal Wav2Vec2Model implementation in transformers uses the same forward for base and large, and it doesn't involve interpolation, so it seems it's just this Chinese model from Tencent that uses it, which would explain why they use custom code for it. The MultiTalk paper didn't mention the fps used or anything, so I'm just guessing.
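For comparison, the stock forward involves no interpolation at all. A minimal sketch, assuming the Tencent model is the one published as TencentGameMate/chinese-wav2vec2-base on Hugging Face:

```python
import torch
from transformers import Wav2Vec2Model

model = Wav2Vec2Model.from_pretrained("TencentGameMate/chinese-wav2vec2-base")  # assumed repo id
model.eval()
waveform = torch.randn(1, 16000)  # 1 second of 16 kHz audio
with torch.no_grad():
    out = model(waveform, output_hidden_states=True)
# out.hidden_states: 13 tensors of shape [1, ~49, 768] at roughly 50 features/s;
# any matching to the video frame rate has to happen outside this forward.
```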
I did end up doing the interpolation later, after the encode, as you suggested, and while the result is slightly different, it has been working fine. If there are no other issues left, it would be fine by me to merge this.
Added support for loading wav2vec2-base models, such as chinese-wav2vec2-base, which is used with Multi/InfiniteTalk.
Tested by using the resulting audio features with WanVideoWrapper's InfiniteTalk implementation.
Tested that the English base model also loads and encodes, but haven't tested the resulting audio features.
Tested that the native S2V workflow using the large English model still works and produces identical results.
This is the first step toward bringing Multi/InfiniteTalk to the core.