DirectDrag: High-Fidelity, Mask-Free, Prompt-Free Drag-based Image Editing via Readout-Guided Feature Alignment
Drag-based image editing using generative models provides intuitive control over image structures. However, existing methods rely heavily on manually provided masks and textual prompts to preserve semantic fidelity and motion precision. Removing these constraints creates a fundamental trade-off: visual artifacts without masks and poor spatial control without prompts. To address these limitations, we propose DirectDrag, a novel mask-free and prompt-free editing framework. DirectDrag enables precise and efficient manipulation with minimal user input while maintaining high image fidelity and accurate point alignment. DirectDrag introduces two key innovations. First, we design an Auto Soft Mask Generation module that infers editable regions from point displacement, automatically localizing deformation along movement paths while preserving contextual integrity through the generative model's inherent capacity. Second, we develop a Readout-Guided Feature Alignment mechanism that leverages intermediate diffusion activations to maintain structural consistency during point-based edits, substantially improving visual fidelity. Despite operating without manual masks or prompts, DirectDrag achieves superior image quality compared to existing methods while maintaining competitive drag accuracy. Extensive experiments on DragBench and real-world scenarios demonstrate the effectiveness and practicality of DirectDrag for high-quality, interactive image manipulation.
Given an input image and point pairs, we apply DDIM inversion to obtain latent codes, initialize editing via a latent warpage function and generate a soft mask, then iteratively apply dragging and denoising guided by motion supervision and feature alignment.
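The iterative stage can be summarized in pseudocode. The sketch below is illustrative only: `extract_features`, `motion_loss`, and `readout_head` are hypothetical stand-ins for the UNet feature extractor, the motion-supervision loss, and the readout network, and the update schedule is a minimal assumption, not the released implementation.

```python
# Minimal sketch of one drag-optimization stage on an inverted latent z_t.
# The complement of the soft mask stays anchored to the original latent z_orig.
import torch

def drag_optimize(z_t, z_orig, soft_mask, handles, targets,
                  extract_features, motion_loss, readout_head,
                  n_iters=80, lr=0.01, lam=0.2):
    # Readout embedding of the original image serves as the alignment target.
    ref_readout = readout_head(extract_features(z_orig)).detach()
    z_t = z_t.clone().requires_grad_(True)
    opt = torch.optim.Adam([z_t], lr=lr)
    for _ in range(n_iters):
        feats = extract_features(z_t)                      # intermediate diffusion activations
        loss = motion_loss(feats, handles, targets)        # pull handle features toward targets
        loss = loss + lam * (readout_head(feats) - ref_readout).pow(2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
        with torch.no_grad():
            # Restrict edits to the soft-masked region.
            z_t.data = soft_mask * z_t.data + (1 - soft_mask) * z_orig
    return z_t.detach()

# Toy smoke test with stand-in callables (real use would plug in UNet features).
z = torch.randn(1, 4, 64, 64)
mask = torch.rand(1, 1, 64, 64)
out = drag_optimize(
    z, z.clone(), mask, handles=None, targets=None,
    extract_features=lambda x: x,
    motion_loss=lambda f, h, t: f.abs().mean(),
    readout_head=lambda f: f.mean(dim=(2, 3)),
    n_iters=2)
```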
Left: Compared to using no mask and a user-provided hard mask, applying the generated soft mask significantly improves visual fidelity and structure preservation, as reflected by higher image-fidelity scores (1-LPIPS↑). Right: Visualization of soft masks under different drag configurations and Gaussian widths (σ), illustrating their adaptiveness to motion magnitude and direction.
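As an illustration of how such a soft mask can be derived from the drag points alone, the sketch below weights each pixel by a Gaussian of width σ around the straight handle-to-target path; for multiple point pairs, the per-pair masks could be combined with an elementwise max. This is a minimal assumption-laden sketch, not the module's exact formulation.

```python
# Soft mask from a single handle->target displacement, with Gaussian falloff
# around the movement path (coordinates in pixels, mask values in [0, 1]).
import torch

def auto_soft_mask(h, w, handle, target, sigma=20.0):
    ys, xs = torch.meshgrid(torch.arange(h, dtype=torch.float32),
                            torch.arange(w, dtype=torch.float32), indexing="ij")
    p = torch.stack([xs, ys], dim=-1)                 # (h, w, 2) pixel grid
    a = torch.tensor(handle, dtype=torch.float32)     # handle point (x, y)
    b = torch.tensor(target, dtype=torch.float32)     # target point (x, y)
    ab = b - a
    # Project each pixel onto the segment a->b and clamp to the segment.
    t = ((p - a) @ ab) / (ab @ ab + 1e-8)
    closest = a + t.clamp(0.0, 1.0).unsqueeze(-1) * ab
    dist2 = ((p - closest) ** 2).sum(-1)
    return torch.exp(-dist2 / (2.0 * sigma ** 2))     # (h, w) soft mask

mask = auto_soft_mask(512, 512, handle=(150, 260), target=(300, 260), sigma=25)
```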
Left: We train the readout network using a triplet loss on diffusion features extracted from video frames (anchor, positive) and edited images (negative). Right: Incorporating readout guidance preserves appearance details and improves structural consistency during dragging.
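A minimal sketch of this training setup is shown below, assuming the readout network is a small convolutional head on intermediate UNet activations; the architecture, channel counts, and margin here are illustrative assumptions, not the paper's exact configuration.

```python
# Triplet training of a readout head on diffusion features:
# anchor/positive come from two frames of the same video (same content, small motion),
# negative from an edited image whose structure has drifted.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ReadoutHead(nn.Module):
    def __init__(self, in_ch=1280, out_ch=128):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Conv2d(in_ch, 256, 3, padding=1), nn.SiLU(),
            nn.Conv2d(256, out_ch, 1),
        )

    def forward(self, feat):                       # feat: (B, C, H, W) UNet activations
        z = self.proj(feat)
        return F.normalize(z.flatten(1), dim=1)    # unit-norm readout embedding

head = ReadoutHead()
triplet = nn.TripletMarginLoss(margin=0.5)

# Placeholder feature tensors; in practice these are extracted diffusion features.
anchor   = head(torch.randn(4, 1280, 16, 16))
positive = head(torch.randn(4, 1280, 16, 16))
negative = head(torch.randn(4, 1280, 16, 16))
loss = triplet(anchor, positive, negative)
loss.backward()
```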
Qualitative comparison. Compared to the baseline (GoodDrag) and mask-free methods (AdaptiveDrag, InstantDrag), our method DirectDrag produces fewer visual artifacts and better preserves image fidelity while accurately moving handle points to their targets.
Coming Soon