
Conversation

dimitribarbot (Contributor)

What does this PR do?

Add ControlNet (InstantX/Qwen-Image-ControlNet-Union) support for Qwen-Image-Edit.

This pipeline allows two latent images to be used as inputs: one for Qwen-Image-Edit and another for Qwen-Image-ControlNet-Union, giving finer control over the result.

Inference

import torch
from diffusers import QwenImageControlNetModel, QwenImageEditControlNetPipeline
from diffusers.utils import load_image

base_model = "Qwen/Qwen-Image-Edit"
controlnet_model = "InstantX/Qwen-Image-ControlNet-Union"

controlnet = QwenImageControlNetModel.from_pretrained(controlnet_model, torch_dtype=torch.bfloat16)

pipe = QwenImageEditControlNetPipeline.from_pretrained(
    base_model, controlnet=controlnet, torch_dtype=torch.bfloat16
).to("cuda")

image = load_image(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/living_room.png"
).convert("RGB")
control_image = load_image(
    "https://huggingface.co/InstantX/Qwen-Image-ControlNet-Union/resolve/main/conds/depth.png"
)
prompt = (
    "Anime style of a swanky, minimalist living room with a huge floor-to-ceiling window letting in loads of natural light. "
    "A beige couch with white and beige cushions sits on a wooden floor, with a matching coffee table in front. "
    "The walls are a soft, warm beige, decorated with two framed botanical prints. A potted plant chills in the corner near the window. "
    "Sunlight pours through the leaves outside, casting cool shadows on the floor."
)
output_image = pipe(
    image=image,
    prompt=prompt,
    negative_prompt=" ",
    control_image=control_image,
    controlnet_conditioning_scale=1.5,
    width=control_image.size[0],
    height=control_image.size[1],
    num_inference_steps=30,
    true_cfg_scale=2.5,
).images[0]
output_image.save("qwenimage_edit_controlnet.png")

N.B.1. If this PR and image location are accepted, I will upload the living_room.png file to the documentation-images repository.
N.B.2. To achieve the desired result, set controlnet_conditioning_scale to a value greater than 1. A good starting point is 1.5.
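
For instance, a quick way to compare conditioning strengths is a sweep over controlnet_conditioning_scale (a minimal sketch reusing the pipe, image, control_image, and prompt objects from the snippet above; the scale values are just starting points):

for scale in (1.0, 1.5, 2.0):
    result = pipe(
        image=image,
        prompt=prompt,
        negative_prompt=" ",
        control_image=control_image,
        controlnet_conditioning_scale=scale,
        width=control_image.size[0],
        height=control_image.size[1],
        num_inference_steps=30,
        true_cfg_scale=2.5,
    ).images[0]
    result.save(f"qwenimage_edit_controlnet_scale_{scale}.png")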

Examples

Depth

Input image:

living_room

Control image:

depth

prompt = (
    "Anime style of a swanky, minimalist living room with a huge floor-to-ceiling window letting in loads of natural light. "
    "A beige couch with white and beige cushions sits on a wooden floor, with a matching coffee table in front. "
    "The walls are a soft, warm beige, decorated with two framed botanical prints. A potted plant chills in the corner near the window. "
    "Sunlight pours through the leaves outside, casting cool shadows on the floor."
)

Result:

living_room_edited

Pose

Input image:

(input image)

Control image:

(pose control image)

prompt = (
    "Make this man sit on a concrete ledge in front of a large circular window, with a cityscape reflected in the glass. "
    "The wall is cream-colored, and the sky is clear blue. His shadow is cast on the wall."
)

Result:

pose_with_controlnet

For comparison, here is the result without ControlNet:

pose_without_controlnet

N.B. All examples were created using the Nunchaku version of the transformer.


Who can review?

@yiyixuxu
@asomoza

HuggingFaceDocBuilderDev (bot):

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

yiyixuxu (Collaborator) left a comment:

thanks! i left some comments
didn't know the instant x controlnet is compatible with qwen edit :)

Comment on lines +442 to +453
def enable_vae_slicing(self):
    r"""
    Enable sliced VAE decoding. When this option is enabled, the VAE will split the input tensor in slices to
    compute decoding in several steps. This is useful to save some memory and allow larger batch sizes.
    """
    depr_message = f"Calling `enable_vae_slicing()` on a `{self.__class__.__name__}` is deprecated and this method will be removed in a future version. Please use `pipe.vae.enable_slicing()`."
    deprecate(
        "enable_vae_slicing",
        "0.40.0",
        depr_message,
    )
    self.vae.enable_slicing()

Suggested change: remove these lines.

Comment on lines +455 to +466
def disable_vae_slicing(self):
    r"""
    Disable sliced VAE decoding. If `enable_vae_slicing` was previously enabled, this method will go back to
    computing decoding in one step.
    """
    depr_message = f"Calling `disable_vae_slicing()` on a `{self.__class__.__name__}` is deprecated and this method will be removed in a future version. Please use `pipe.vae.disable_slicing()`."
    deprecate(
        "disable_vae_slicing",
        "0.40.0",
        depr_message,
    )
    self.vae.disable_slicing()

Suggested change: remove these lines.

Comment on lines +468 to +480
def enable_vae_tiling(self):
    r"""
    Enable tiled VAE decoding. When this option is enabled, the VAE will split the input tensor into tiles to
    compute decoding and encoding in several steps. This is useful for saving a large amount of memory and to allow
    processing larger images.
    """
    depr_message = f"Calling `enable_vae_tiling()` on a `{self.__class__.__name__}` is deprecated and this method will be removed in a future version. Please use `pipe.vae.enable_tiling()`."
    deprecate(
        "enable_vae_tiling",
        "0.40.0",
        depr_message,
    )
    self.vae.enable_tiling()

Suggested change: remove these lines.

Comment on lines +482 to +493
def disable_vae_tiling(self):
    r"""
    Disable tiled VAE decoding. If `enable_vae_tiling` was previously enabled, this method will go back to
    computing decoding in one step.
    """
    depr_message = f"Calling `disable_vae_tiling()` on a `{self.__class__.__name__}` is deprecated and this method will be removed in a future version. Please use `pipe.vae.disable_tiling()`."
    deprecate(
        "disable_vae_tiling",
        "0.40.0",
        depr_message,
    )
    self.vae.disable_tiling()

Suggested change: remove these lines.

Comment on lines +442 to +452
def enable_vae_slicing(self):
    r"""
    Enable sliced VAE decoding. When this option is enabled, the VAE will split the input tensor in slices to
    compute decoding in several steps. This is useful to save some memory and allow larger batch sizes.
    """
    depr_message = f"Calling `enable_vae_slicing()` on a `{self.__class__.__name__}` is deprecated and this method will be removed in a future version. Please use `pipe.vae.enable_slicing()`."
    deprecate(
        "enable_vae_slicing",
        "0.40.0",
        depr_message,
    )

Suggested change: remove these lines.
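
For reference, the replacement that the deprecation messages point to is calling the VAE helpers directly (a minimal sketch, assuming pipe is the QwenImageEditControlNetPipeline loaded earlier):

# Call the VAE methods directly instead of the deprecated pipeline wrappers.
pipe.vae.enable_slicing()   # split batched decoding into per-image slices to save memory
pipe.vae.enable_tiling()    # decode/encode in tiles so larger images fit in memory

# And the corresponding calls to turn them off again:
pipe.vae.disable_slicing()
pipe.vae.disable_tiling()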


    return prompt_embeds, encoder_attention_mask

def encode_prompt(

missing a #Copied from ?
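
For context, the convention referred to is diffusers' # Copied from annotation, which the make fix-copies tooling uses to keep duplicated methods in sync; the source path and signature below are illustrative, not necessarily the exact ones for this pipeline:

# Copied from diffusers.pipelines.qwenimage.pipeline_qwenimage.QwenImagePipeline.encode_prompt
def encode_prompt(self, prompt, device=None):
    ...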

    raise AttributeError("Could not access latents of provided encoder_output")


def calculate_dimensions(target_area, ratio):

Copied from?


    return latents

def _encode_vae_image(self, image: torch.Tensor, generator: torch.Generator):

copied from?

@@ -639,7 +639,9 @@ def forward(
     if controlnet_block_samples is not None:
         interval_control = len(self.transformer_blocks) / len(controlnet_block_samples)
         interval_control = int(np.ceil(interval_control))
-        hidden_states = hidden_states + controlnet_block_samples[index_block // interval_control]
+        sample = controlnet_block_samples[index_block // interval_control]
can we adjust inputs in pipeline instead?
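
One possible pipeline-side alternative (a hedged sketch, not the PR's actual code: it assumes the ControlNet residuals only cover the leading image-latent portion of the packed sequence, so zero-padding them to the full sequence length reproduces the partial update without modifying the transformer's forward):

import torch.nn.functional as F

def pad_controlnet_block_samples(controlnet_block_samples, target_seq_len):
    # Zero-pad each (batch, seq, dim) residual along the sequence dimension.
    # The padded tail adds nothing to hidden_states, so the unmodified
    # "hidden_states = hidden_states + sample" in forward() behaves like a
    # partial update over the image-latent tokens only.
    return [
        F.pad(sample, (0, 0, 0, target_seq_len - sample.shape[1]))
        for sample in controlnet_block_samples
    ]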

dimitribarbot (Contributor, Author)

Thank you for your code review.

In fact, this PR is mostly a merge of the existing pipeline_qwenimage_edit.py and pipeline_qwenimage_controlnet.py pipelines. The only things I added were:

  • the sample code at the beginning of the pipeline,
  • the documentation for the new arguments specific to controlnet,
  • a partial update of the hidden_states in the forward of the Qwen-Image transformer.

Your feedback is all valid, and I can address it in this PR. However, it also applies to the other Qwen-Image pipelines. Given that there is already a complete refactoring PR #12322, which would you prefer:

  • address your comments only in the new pipeline added here,
  • address them in all the affected pipelines, or
  • leave them for now and fold them into the refactoring PR?
