Unified Video Editing with Temporal Reasoner

¹University of Technology Sydney, ²Zhejiang University


Abstract

Existing video editing methods face a critical trade-off: expert models offer precision but rely on task-specific priors like masks, hindering unification; conversely, unified temporal in-context learning models are mask-free but lack explicit spatial cues, leading to weak instruction-to-region mapping and imprecise localization. To resolve this conflict, we propose VideoCoF, a novel Chain-of-Frames approach inspired by Chain-of-Thought reasoning.

VideoCoF enforces a “see → reason → edit” procedure by compelling the video diffusion model to first predict reasoning tokens (edit-region latents) before generating the target video tokens. This explicit reasoning step removes the need for user-provided masks while achieving precise instruction-to-region alignment and fine-grained video editing. Furthermore, we introduce a RoPE alignment strategy that leverages these reasoning tokens to ensure motion alignment and enable length extrapolation beyond the training duration. We demonstrate that with a minimal data cost of only 50k video pairs, VideoCoF achieves SOTA performance on VideoCoF-Bench, validating the efficiency and effectiveness of our approach.

Why Do We Need Reasoning Before Editing?

Current video editing methods typically follow two paths:
(1) expert models, which rely on external masks for precision but sacrifice unification;
(2) unified in-context learning models, which are mask-free but often struggle with spatial accuracy due to the lack of explicit cues.

This raises a critical question: Can we maintain the precision of expert models and the unification of in-context models without the mask dependency?

To resolve this conflict, we propose VideoCoF, a Chain-of-Frames approach that predicts reasoning tokens (edit-region latents) before generating the target video tokens, thereby removing the need for user-provided masks while achieving precise instruction-to-region alignment.
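
To make this concrete, the sketch below (a minimal illustration, not the released implementation) shows how a chain-of-frames training sequence could be laid out: clean source latents, then reasoning (edit-region) latents, then target latents, with the loss applied only to the last two segments so the model learns to predict the edit region before the edited video. All tensor names and shapes here are assumptions.

import torch

def build_cof_sequence(source_latents, reason_latents, target_latents):
    # Concatenate clean source latents, reasoning (edit-region) latents, and
    # target latents along the frame axis of (batch, frames, channels, h, w).
    return torch.cat([source_latents, reason_latents, target_latents], dim=1)

def loss_mask(num_src, num_reason, num_tgt):
    # Supervise only the reasoning and target segments; the source frames act
    # as clean in-context conditioning and receive no loss.
    mask = torch.zeros(num_src + num_reason + num_tgt, dtype=torch.bool)
    mask[num_src:] = True
    return mask

# Toy example: 9 source, 9 reasoning, and 9 target frames.
src = torch.randn(1, 9, 16, 8, 8)
rea = torch.randn(1, 9, 16, 8, 8)
tgt = torch.randn(1, 9, 16, 8, 8)
print(build_cof_sequence(src, rea, tgt).shape)  # torch.Size([1, 27, 16, 8, 8])
print(loss_mask(9, 9, 9).sum().item())          # 18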

Length Extrapolation

Trained on only 50k video pairs of 33 frames each, the following examples show multi-shot editing and robust 4× length generalization.
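
In our reading, the RoPE alignment strategy mentioned in the abstract is what makes this extrapolation possible: if the reasoning and target frames reuse the temporal position indices of their corresponding source frames, the relative offset between an edited frame and its source frame stays zero regardless of clip length. The sketch below illustrates that reading; it is an assumption about the mechanism, not the released code.

import torch

def aligned_frame_indices(num_frames):
    # Source frames take positions 0 .. F-1; the reasoning and target frames
    # reuse those same indices instead of continuing on to 3F-1, so each
    # edited frame stays aligned with the source frame it corresponds to.
    base = torch.arange(num_frames)
    return {"source": base, "reason": base.clone(), "target": base.clone()}

# The rule depends only on the source frame count, so the same index layout
# applies unchanged when the clip is 4x longer than the 33-frame training clips.
print(aligned_frame_indices(33)["target"].max().item())   # 32
print(aligned_frame_indices(132)["target"].max().item())  # 131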

Seeing, Reasoning, Editing

VideoCoF follows a "seeing, reasoning, editing" procedure: it first reasons about the edit region and only then applies the corresponding edit.
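
One plausible realization of this procedure at inference time is two-stage sampling: denoise the reasoning tokens first, then denoise the target tokens conditioned on them. The snippet below only illustrates that ordering; toy_denoise is a toy stand-in rather than the actual sampler, and the real model may handle both segments within a single pass.

import torch

def toy_denoise(context, num_frames):
    # Toy stand-in for a diffusion sampling loop over one segment of frames.
    c, h, w = context[0].shape[-3:]
    return torch.randn(1, num_frames, c, h, w)

def edit_video(source_latents):
    f = source_latents.shape[1]
    # See: the clean source latents serve as in-context conditioning.
    # Reason: predict the edit-region (reasoning) latents first.
    reason_latents = toy_denoise([source_latents], f)
    # Edit: predict the target latents, conditioned on source and reasoning tokens.
    return toy_denoise([source_latents, reason_latents], f)

print(edit_video(torch.randn(1, 33, 16, 8, 8)).shape)  # torch.Size([1, 33, 16, 8, 8])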

Citation

@article{yang2025videocof,
  title={Unified Video Editing with Temporal Reasoner},
  author={Yang, Xiangpeng and Xie, Ji and Yang, Yiyuan and Huang, Yan and Xu, Min and Wu, Qiang},
  journal={arXiv preprint arXiv:2512.07469},
  year={2025}
}