Support wav2vec base models #9637
Conversation
comfy/audio_encoders/wav2vec2.py (Outdated)

    if fps != 50:
Is there any reason this shouldn't be done by the node using the outputs like the s2v node? https://github.com/comfyanonymous/ComfyUI/blob/master/comfy_extras/nodes_wan.py#L892
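For illustration, node-side resampling of the encoder output could look roughly like the sketch below; the function name, the 50 Hz source rate, and the default target rate are assumptions, not the actual S2V node code.

```python
import torch
import torch.nn.functional as F

def resample_audio_features(features: torch.Tensor, src_fps: float = 50.0, dst_fps: float = 25.0) -> torch.Tensor:
    """Linearly resample encoder features of shape [B, T, C] from src_fps to dst_fps."""
    if src_fps == dst_fps:
        return features
    new_len = int(round(features.shape[1] * dst_fps / src_fps))
    x = features.transpose(1, 2)  # F.interpolate expects [B, C, T]
    x = F.interpolate(x, size=new_len, mode="linear", align_corners=False)
    return x.transpose(1, 2)      # back to [B, T, C]
```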
I'm unsure about that; this is the way it was done in the initial MultiTalk implementation. I guess the reason could be efficiency, since this way the model has less to process, though with models as light as these I'm not sure it matters. Mostly I was trying to match the initial MultiTalk implementation to get matching results.
I can test if it makes a difference.
I think it's best for the encoder nodes to return the raw output from the model as much as possible; that makes it easier when multiple models use the output of the same encoder but with different processing.
Well, I agree with that principle, of course. I tested it now, and there is a slight difference in results between interpolating before the encode and interpolating the already encoded features (using the same function as S2V). Should we worry about that?
AnimateDiff_00005-audio.webm
That's weird, why is there a difference?
I don't know, but looking at the tensors there's already a slight difference in the features if the interpolation is done before feature_projection:
lerp before proj: shape=torch.Size([1, 601, 768]) min=-12.0035 max=11.7886 mean=-0.0433 std=1.6125
lerp after proj: shape=torch.Size([1, 601, 768]) min=-11.9092 max=10.5639 mean=-0.0436 std=1.5525
Mean abs diff: 0.053896
and similarly between encoding interpolated features and interpolating encoded features:
lerp before encode: shape=torch.Size([13, 601, 768]) min=-4.3660 max=4.7442 mean=0.0022 std=0.3017
lerp after encode: shape=torch.Size([13, 601, 768]) min=-3.9754 max=5.0701 mean=0.0027 std=0.2990
Overall mean abs diff across all layers: 0.079161
It's very small; can it really be that sensitive? Maybe it matters if the model was trained that way.
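For reference, a comparison like the one above boils down to something like this hypothetical helper (not code from this PR):

```python
import torch

def tensor_stats(name: str, t: torch.Tensor) -> None:
    # Print the same summary stats as the comparison above.
    print(f"{name}: shape={tuple(t.shape)} min={t.min().item():.4f} max={t.max().item():.4f} "
          f"mean={t.mean().item():.4f} std={t.std().item():.4f}")

def mean_abs_diff(a: torch.Tensor, b: torch.Tensor) -> float:
    return (a - b).abs().mean().item()
```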
The original implementation specifically overrides the Wav2Vec2 model to do the interpolation before the projection and encode:
https://github.com/MeiGen-AI/MultiTalk/blob/636c774370d40214e6be1ea1ed98bc219dfbd8c4/src/audio_analysis/wav2vec2.py#L32
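Roughly, that override interpolates the CNN feature-extractor output to the target length before feature_projection and the transformer encoder. A simplified sketch of the idea against the stock transformers Wav2Vec2Model (not the exact MultiTalk code):

```python
import torch
import torch.nn.functional as F
from transformers import Wav2Vec2Model

def encode_with_early_interp(model: Wav2Vec2Model, waveform: torch.Tensor, target_len: int):
    feats = model.feature_extractor(waveform)                 # [B, C, T], ~50 features/s
    feats = F.interpolate(feats, size=target_len, mode="linear", align_corners=False)
    feats = feats.transpose(1, 2)                             # [B, T, C]
    hidden_states, _ = model.feature_projection(feats)
    out = model.encoder(hidden_states, output_hidden_states=True)
    return out.hidden_states                                  # per-layer features, already at target_len
```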
Another longer comparison test with better settings overall still shows a difference, but I can't really say if it's better or worse:
AnimateDiff_00002-audio.webm
Anyway, there's no hurry with this PR, as the native model implementation still needs to be done before this is of any use, so which way to do this can also be decided later.
Do all the models do the same interpolation?
> Do all the models do the same interpolation?
As far as I can see, the normal Wav2Vec2Model implementation in transformers uses the same forward for base and large, and it doesn't involve interpolation, so it seems it's just this Chinese model from Tencent that uses it, which would explain why they use custom code for it. The MultiTalk paper didn't mention the fps used or anything, so I'm just guessing.
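For comparison, the stock forward involves no interpolation at all. A minimal sketch, assuming the Tencent model is the one published as TencentGameMate/chinese-wav2vec2-base on Hugging Face:

```python
import torch
from transformers import Wav2Vec2Model

model = Wav2Vec2Model.from_pretrained("TencentGameMate/chinese-wav2vec2-base")  # assumed repo id
model.eval()
waveform = torch.randn(1, 16000)  # 1 second of 16 kHz audio
with torch.no_grad():
    out = model(waveform, output_hidden_states=True)
# out.hidden_states: 13 tensors of shape [1, ~49, 768] at roughly 50 features/s;
# any matching to the video frame rate has to happen outside this forward.
```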
I did end up doing the interpolation later, after the encode, as you suggested, and while the result is slightly different, it has been working fine. If there are no other issues left, it would be fine by me to merge this.
Added support for loading wav2vec2-base models, such as chinese-wav2vec2-base, which is used with Multi/InfiniteTalk.
Tested by using the resulting audio features with WanVideoWrapper's InfiniteTalk implementation.
Tested that the English base model also loads and encodes, but haven't tested the resulting audio features.
Tested that the native S2V workflow using the large English model still works and produces identical results.
This is the first step toward bringing Multi/InfiniteTalk to the core.