VLMs are Good Teachers for Video Reasoning via Adaptive Test-Time Optimization

Video generation models (VGMs) can produce temporally coherent visual trajectories, yet they often fail to follow task-specific rules. We introduce a VLM-as-Teacher framework in which a vision-language model (VLM) synthesizes task-specific reward queries and guides a VGM Reasoner through online test-time optimization of a lightweight LoRA module.

Junhao Cheng¹† Liang Hou² Tianxiong Zhong² Xin Tao² Pengfei Wan² Kun Gai² Jing Liao¹✉

¹ City University of Hong Kong ² Kling Team, Kuaishou Technology

Paper arXiv Code (Under Review)

Teaser image

VLM Teacher → Online Optimization → VGM Reasoner

+16.7 Average performance gain

36/36 Tasks with consistent gains

25/30 Best open-source results on RULER-Bench

~16 Optimization steps on average

VLMs as Teachers

Instead of asking a VLM to write a textual solution, we use it to supervise whether a generated visual reasoning trajectory follows task-specific rules and achieves the intended goal.

Method figure

Process & Goal Queries

The VLM Teacher extracts task-specific rules and formulates one final-goal query together with process queries that verify valid intermediate trajectories.
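
As an illustration of the query structure described above, the following minimal sketch represents a query set as plain data and checks that it contains exactly one final-goal query. The `RewardQuery` class, the `validate_queries` helper, and the maze-style questions are all hypothetical names and examples chosen for illustration; they are not taken from the released code or prompts.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class RewardQuery:
    """One yes/no question posed to the VLM Teacher (hypothetical container)."""
    kind: str  # "goal" or "process"
    text: str  # the question itself


def validate_queries(queries: List[RewardQuery]) -> None:
    """Check the structure described in the text: one goal query plus process queries."""
    kinds = [q.kind for q in queries]
    if any(k not in ("goal", "process") for k in kinds):
        raise ValueError("kind must be 'goal' or 'process'")
    if kinds.count("goal") != 1:
        raise ValueError("expected exactly one final-goal query")


# Hypothetical maze-navigation instance; the wording is illustrative only.
maze_queries = [
    RewardQuery("process", "Does the agent stay inside the corridors without crossing any wall?"),
    RewardQuery("process", "Does the agent keep the same shape and color in every frame?"),
    RewardQuery("goal", "Does the agent reach the exit by the final frame?"),
]
validate_queries(maze_queries)
```

Phrasing every query as a yes/no question is what lets each one be scored by the likelihood of a single answer token.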

Differentiable VQA Reward

For each query, the Teacher evaluates whether the current video prediction satisfies the requirement. The likelihood of answering “Yes” provides differentiable supervision.
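
To make the reward concrete, here is a minimal sketch of how a "Yes" likelihood could be turned into a loss. It assumes the Teacher exposes one logit each for the answer tokens "Yes" and "No" and that the probability is normalized over just those two tokens; it also assumes the per-query losses are averaged. These choices, and the function names, are illustrative assumptions rather than the authors' exact formulation. In a real pipeline an autodiff framework would propagate the gradient back to the video; the analytic gradient below only shows that the signal is nonzero whenever the Teacher is not fully certain.

```python
import math
from typing import List, Tuple


def yes_probability(logit_yes: float, logit_no: float) -> float:
    """Softmax over the two answer tokens, i.e. a sigmoid of the logit difference."""
    return 1.0 / (1.0 + math.exp(logit_no - logit_yes))


def vqa_loss(logit_pairs: List[Tuple[float, float]]) -> float:
    """Mean negative log-likelihood of answering "Yes" across all queries."""
    total = sum(-math.log(yes_probability(y, n)) for y, n in logit_pairs)
    return total / len(logit_pairs)


def grad_wrt_yes_logit(logit_yes: float, logit_no: float) -> float:
    """Derivative of -log p(Yes) with respect to the Yes logit: p(Yes) - 1."""
    return yes_probability(logit_yes, logit_no) - 1.0
```

Because the loss is a smooth function of the logits, it can supervise the generator directly, unlike a hard pass/fail judgment.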

Online LoRA Optimization

The VGM backbone remains frozen. At test time, VLM feedback updates only a lightweight LoRA module to improve visual execution for the specific reasoning instance.
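
The update described above can be written out using the standard LoRA parametrization. The notation here is an assumption introduced for illustration: $W_0$ is a frozen backbone weight, $A$ and $B$ are the trainable low-rank factors with rank $r$ and scaling $\alpha$, $v_{A,B}$ is the video produced with the adapted weights, $q_1, \dots, q_K$ are the Teacher's queries, and $\eta$ is a step size. Averaging the queries and using a plain gradient step are simplifying assumptions; the page does not specify the weighting or the optimizer.

```latex
\begin{aligned}
W &= W_0 + \frac{\alpha}{r}\, B A,
  \qquad B \in \mathbb{R}^{d_{\text{out}} \times r},\;
         A \in \mathbb{R}^{r \times d_{\text{in}}},\;
         r \ll \min(d_{\text{out}}, d_{\text{in}}), \\
\mathcal{L}(A, B) &= -\frac{1}{K} \sum_{k=1}^{K}
  \log p_{\text{VLM}}\!\left(\text{Yes} \mid v_{A,B},\, q_k\right), \\
(A, B) &\leftarrow (A, B) - \eta\, \nabla_{(A, B)}\, \mathcal{L}(A, B),
  \qquad W_0 \text{ held fixed}.
\end{aligned}
```

Only $A$ and $B$ receive gradients, so the number of parameters updated per instance scales with $r\,(d_{\text{out}} + d_{\text{in}})$ rather than $d_{\text{out}}\, d_{\text{in}}$.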

Efficient by design. A step-distilled Reasoner, lightweight surrogate decoding, and loss-based early stopping keep online optimization practical.
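
A loss-based early-stopping loop of the kind mentioned above might look like the following minimal sketch. The stopping rule (halt once the loss falls below a fixed threshold), the default budget, and the toy `step` function are assumptions made for illustration; the actual criterion and hyperparameters are not stated on this page.

```python
from typing import Callable, List


def optimize_with_early_stopping(
    step: Callable[[], float],
    max_steps: int = 50,          # hypothetical step budget
    loss_threshold: float = 0.1,  # hypothetical stopping threshold
) -> List[float]:
    """Run update steps until the loss drops below the threshold or the budget ends.

    `step` performs one optimization update and returns the resulting loss.
    """
    history: List[float] = []
    for _ in range(max_steps):
        loss = step()
        history.append(loss)
        if loss < loss_threshold:
            break  # the Teacher is satisfied; stop spending compute
    return history


def make_toy_step(start: float = 1.0, decay: float = 0.8) -> Callable[[], float]:
    """Toy stand-in for a real update: the loss shrinks geometrically each call."""
    state = {"loss": start}

    def step() -> float:
        state["loss"] *= decay
        return state["loss"]

    return step


history = optimize_with_early_stopping(make_toy_step())
```

Stopping on the loss lets easy instances finish in a few steps while harder ones use more of the budget.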

Quantitative Results

The proposed online test-time optimization consistently outperforms sampling-based scaling and prompt-space refinement.

VBVR-Bench

Overall score ↑

Wan2.2-5B-Distilled: 0.666
+ Pass@5: 0.683
+ VideoTPO: 0.634
+ Ours: 0.781

RULER-Bench

Overall score ↑

Wan2.2-5B-Distilled: 46.4
+ Pass@5: 49.1
+ VideoTPO: 50.3
+ Ours: 68.2

Video Results

Qualitative comparison with state-of-the-art methods.

Symbolic Results

General-Purpose Results

Both Rewards Matter

Final-goal supervision ensures successful completion, while process supervision prevents invalid shortcuts and entity-inconsistent trajectories.

Goal and Process Supervision

Citation

@article{cheng2026vlmteacher,
    title   = {VLMs are Good Teachers for Video Reasoning via Adaptive Test-Time Optimization},
    author  = {Cheng, Junhao and Hou, Liang and Zhong, Tianxiong and Tao, Xin and Wan, Pengfei and Gai, Kun and Liao, Jing},
    journal = {arXiv preprint},
    year    = {2026},
    url     = {https://arxiv.org/abs/2606.02564},
}