syunar commented Sep 9, 2025

What does this PR do?

Add Lumina Accessory, a multi-task instruction fine-tuning framework for the Lumina series (currently supporting Lumina-Image-2.0). The official repository is Alpha-VLLM/Lumina-Accessory.

Inference

import torch
from diffusers.utils import load_image
from diffusers import Lumina2AccessoryTransformer2DModel, Lumina2AccessoryPipeline


ckpt_path = "https://huggingface.co/Alpha-VLLM/Lumina-Accessory/blob/main/consolidated.00-of-01.pth"
transformer = Lumina2AccessoryTransformer2DModel.from_single_file(ckpt_path, torch_dtype=torch.bfloat16)
pipe = Lumina2AccessoryPipeline.from_pretrained(
    "Alpha-VLLM/Lumina-Image-2.0", transformer=transformer, torch_dtype=torch.bfloat16
)
device = "cuda"
pipe.to(device)

test_cases = [
    {
        "task": "Image Infilling",
        "input_image": "https://github.com/Alpha-VLLM/Lumina-Accessory/blob/main/examples/case_1_condition.jpg?raw=true",
        "system_prompt": "You are an assistant designed to generate superior images with the highest degree of image-text alignment based on textual prompts and a partially masked image.",
        "prompt": "A classical oil painting of a young woman dressed in a modern DARK BLACK leather jacket.",
        "cond_position_type": "aligned",
    },
    {
        "task": "Palette Condition",
        "input_image": "https://github.com/Alpha-VLLM/Lumina-Accessory/blob/main/examples/case_2_condition.jpg?raw=true",
        "system_prompt": "You are an assistant designed to generate superior images with the highest degree of image-text alignment based on textual prompts and a palette map condition.",
        "prompt": "A still life photograph of a floral arrangement in a rustic, blue ceramic vase, centrally positioned on a round table draped with a delicate, white tablecloth. The bouquet features a mix of vibrant flowers, including large yellow roses, orange carnations, and smaller white blossoms, interspersed with green foliage and sprigs of orange buds. The vase is with the flowers extending upwards and outwards, creating a dynamic composition. In the background, hanging on the textured, beige wallpaper with a subtle floral pattern, is a traditional Chinese scroll featuring elegant calligraphy in classical Wenyanwen (文言文). The presence of the scroll adds a refined, cultural depth to the vintage setting. Soft, natural lighting casts gentle shadows, enhancing the textures of the vase and the lace. The overall atmosphere is serene and nostalgic, with a warm, muted color palette, medium depth of field, and a classic, timeless aesthetic.",
        "cond_position_type": "aligned",
    },
    {
        "task": "Depth Condition",
        "input_image": "https://github.com/Alpha-VLLM/Lumina-Accessory/blob/main/examples/case_3_condition.jpg?raw=true",
        "system_prompt": "You are an assistant designed to generate superior images with the highest degree of image-text alignment based on textual prompts and a depth map condition.",
        "prompt": "A contemplative photograph of a person with short brown hair, wearing a dark jacket, standing in the lower left foreground, facing away towards a field of tall, dried grasses. The grasses dominate the middle ground, their brown and beige tones contrasting with the dark jacket. The background features a cloudy, overcast sky with a soft, diffused light, creating a serene and introspective atmosphere. The composition is balanced with the person anchoring the lower left and the expansive sky occupying the upper half. The image has a muted color palette, emphasizing earthy tones and a sense of solitude. Photographic style, medium depth of field, natural lighting, soft focus, tranquil, introspective mood.",
        "cond_position_type": "aligned",
    },
]

generator = torch.Generator(device=device).manual_seed(0)
W, H = 1024, 1024
for test_case in test_cases:
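    img = load_image(test_case["input_image"])
    w, h = img.size  # remember the original size so it can be restored after generation
    img = img.resize((W, H))  # condition image is resized to the generation resolution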

    output = pipe(
        image=img,
        prompt=test_case["prompt"],
        system_prompt=test_case["system_prompt"],
        negative_prompt="",
        num_inference_steps=25,
        width=img.size[0],
        height=img.size[1],
        num_images_per_prompt=1,
        guidance_scale=4.0,
        cfg_trunc_ratio=1.0,
        cfg_normalization=True,
        cond_position_type=test_case["cond_position_type"],
        generator=generator,
    ).images[0]

    # Resize both images back to the original input size before saving.
    output = output.resize((w, h))
    img = img.resize((w, h))

    img.save(f"test_lumina2_accessory_{test_case['task'].strip().replace(' ', '_').lower()}_input.png")
    output.save(f"test_lumina2_accessory_{test_case['task'].strip().replace(' ', '_').lower()}_output.png")
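
The loop above derives both file names from the task name with the same inline expression. As a minimal sketch of that naming step only (the helper name `output_paths` and its `prefix` parameter are hypothetical and not part of this PR or of diffusers), the logic can be checked without loading any model:

```python
from typing import Tuple


def output_paths(task: str, prefix: str = "test_lumina2_accessory") -> Tuple[str, str]:
    """Return (input_path, output_path) for a task name such as "Image Infilling".

    Hypothetical helper: it mirrors the f-strings used in the loop above.
    """
    # Same normalization as the script: trim, replace spaces, lowercase.
    slug = task.strip().replace(" ", "_").lower()
    return f"{prefix}_{slug}_input.png", f"{prefix}_{slug}_output.png"


if __name__ == "__main__":
    for task in ("Image Infilling", "Palette Condition", "Depth Condition"):
        print(output_paths(task))
```

Computing the slug once keeps the input and output names in sync if the naming scheme changes later.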

Sanity Check

Image Infilling
Input Image | Output Image (images not shown)
Palette Condition
Input Image | Output Image (images not shown)
Depth Condition
Input Image | Output Image (images not shown)


syunar commented Sep 9, 2025

Hi @sayakpaul @yiyixuxu @a-r-r-o-w — ready for review. Let me know if I can do anything to make it easier.

sayakpaul requested review from DN6 and yiyixuxu on September 9, 2025 07:17

syunar commented Sep 16, 2025

@yiyixuxu gentle ping — I’ve fixed the failing checks, could you rerun the tests?
