VLM-as-Teacher

VLMs are Good Teachers for Video Reasoning via Adaptive Test-Time Optimization

Video Generation Models (VGMs) can produce temporally coherent visual trajectories, yet they often fail to follow task-specific rules. We introduce a VLM-as-Teacher framework in which a Vision-Language Model (VLM) synthesizes task-specific reward queries and guides a VGM Reasoner through online test-time optimization of a lightweight LoRA module.

1 City University of Hong Kong; 2 Kling Team, Kuaishou Technology
Paper arXiv Code (Under Review)
Teaser figure
VLM Teacher → Online Optimization → VGM Reasoner
+16.7 Average performance gain
36/36 Tasks with consistent gains
25/30 Best open-source results on RULER-Bench
~16 Optimization steps on average

VLMs as Teachers

Instead of asking a VLM to write a textual solution, we use it to judge whether a generated visual reasoning trajectory follows the task-specific rules and achieves the intended goal.

VLM-as-Teacher framework overview
01 Process & Goal Queries

The VLM Teacher extracts task-specific rules and formulates one final-goal query together with process queries that verify valid intermediate trajectories.
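One way to picture the Teacher's output is as a small structured record: a single goal query plus a list of process queries. The sketch below is illustrative only. The `RewardQueries` class, its field names, and the maze task are assumptions made here, not the authors' data format.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class RewardQueries:
    """Hypothetical container for the yes/no questions a Teacher might pose."""
    goal: str                                          # exactly one final-goal query
    process: List[str] = field(default_factory=list)   # checks on the intermediate trajectory

    def all_queries(self) -> List[str]:
        # Process queries first, goal query last.
        return [*self.process, self.goal]


# Made-up example for a maze-navigation task (not taken from the paper).
maze_queries = RewardQueries(
    goal="Does the agent end the video on the exit cell?",
    process=[
        "Does the agent stay off wall cells throughout the video?",
        "Does the agent move at most one cell between consecutive frames?",
    ],
)
```

Each string is phrased so that "Yes" means the requirement is met, which is what the reward in the next step relies on.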

02 Differentiable VQA Reward

For each query, the Teacher evaluates whether the current video prediction satisfies the requirement. The likelihood the Teacher assigns to the answer “Yes” serves as the reward, and because it is differentiable it can supervise the Reasoner directly.
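A minimal sketch of a yes-likelihood reward, assuming the Teacher exposes one logit for "Yes" and one for "No" at the answer position. Restricting the softmax to those two tokens, and the function names, are assumptions made for illustration; the page does not specify the exact formulation.

```python
import math


def yes_probability(logit_yes: float, logit_no: float) -> float:
    """Softmax over the two answer logits, restricted to {Yes, No}."""
    m = max(logit_yes, logit_no)          # subtract the max for numerical stability
    e_yes = math.exp(logit_yes - m)
    e_no = math.exp(logit_no - m)
    return e_yes / (e_yes + e_no)


def vqa_loss(logit_yes: float, logit_no: float) -> float:
    """Negative log-likelihood of "Yes"; lower means the query is better satisfied."""
    return -math.log(yes_probability(logit_yes, logit_no))


def vqa_loss_grad(logit_yes: float, logit_no: float):
    """Analytic gradient of vqa_loss with respect to (logit_yes, logit_no)."""
    p = yes_probability(logit_yes, logit_no)
    return (p - 1.0, 1.0 - p)
```

The gradient is nonzero whenever the Teacher is not fully certain of "Yes", which is what lets the signal flow back toward the generated video. In a real system an autodiff framework would compute this gradient rather than a hand-written formula.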

03 Online LoRA Optimization

The VGM backbone remains frozen. At test time, VLM feedback updates only a lightweight LoRA module to improve visual execution for the specific reasoning instance.
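The frozen-backbone idea can be sketched with a toy rank-1 adapter in plain Python. Everything here is a stand-in: the 2×2 weight `W`, the squared-error loss (used in place of the Teacher's VQA loss), the learning rate, and the helper names are assumptions for illustration, not the authors' setup.

```python
def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]


def lora_step(W, A, B, x, target, lr=0.1):
    """One gradient step on the rank-1 adapter (A, B); W is read but never modified."""
    s = sum(a_j * x_j for a_j, x_j in zip(A, x))                  # A x (scalar for rank 1)
    y = [wx_i + b_i * s for wx_i, b_i in zip(matvec(W, x), B)]    # (W + B A) x
    err = [y_i - t_i for y_i, t_i in zip(y, target)]
    loss = 0.5 * sum(e * e for e in err)                          # toy stand-in loss
    grad_B = [e * s for e in err]                                 # dL/dB
    back = sum(e * b for e, b in zip(err, B))                     # B^T err
    grad_A = [back * x_j for x_j in x]                            # dL/dA
    new_A = [a - lr * g for a, g in zip(A, grad_A)]
    new_B = [b - lr * g for b, g in zip(B, grad_B)]
    return new_A, new_B, loss


W = [[1.0, 0.0], [0.0, 1.0]]   # frozen backbone weight (toy)
A = [0.5, -0.5]                # adapter "down" factor, small nonzero init
B = [0.0, 0.0]                 # adapter "up" factor, zero init so W + B A equals W at the start
x = [1.0, 2.0]
target = [2.0, 1.0]

losses = []
for _ in range(50):
    A, B, loss = lora_step(W, A, B, x, target)
    losses.append(loss)
```

Only `A` and `B` change across the loop, so the adaptation stays specific to this one instance and can be discarded afterwards without touching the backbone.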

Efficient by design. A step-distilled Reasoner, lightweight surrogate decoding, and loss-based early stopping keep online optimization practical.
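Loss-based early stopping can be sketched as a loop that halts once the loss falls below a threshold. The threshold, the step budget, and the decaying toy loss below are assumptions chosen for illustration; the page does not state the actual stopping criterion.

```python
from typing import Callable, List, Tuple


def optimize_with_early_stop(
    step_fn: Callable[[], float],
    max_steps: int = 50,
    loss_threshold: float = 0.05,
) -> Tuple[int, List[float]]:
    """Call step_fn until its loss drops below loss_threshold or max_steps is reached."""
    history: List[float] = []
    for step in range(1, max_steps + 1):
        loss = step_fn()
        history.append(loss)
        if loss < loss_threshold:
            return step, history      # stop early: this instance is considered solved
    return max_steps, history


def make_decaying_step(start: float = 1.0, decay: float = 0.8) -> Callable[[], float]:
    """Toy stand-in for one optimization step: the loss shrinks by a fixed factor per call."""
    state = {"loss": start}

    def step() -> float:
        state["loss"] *= decay
        return state["loss"]

    return step


steps_used, history = optimize_with_early_stop(make_decaying_step())
```

Easy instances exit after a few steps while harder ones use more of the budget, so the compute spent adapts to each instance.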

Quantitative Results

The proposed online test-time optimization consistently outperforms sampling-based scaling and prompt-space refinement.

VBVR-Bench

[Chart: overall score ↑ on VBVR-Bench. Highest value 0.781; other values 0.666, 0.683, 0.634.]

RULER-Bench

[Chart: overall score ↑ on RULER-Bench. Highest value 68.2; other values 46.4, 49.1, 50.3.]

Video Results

Qualitative comparison with state-of-the-art methods.

Both Rewards Matter

Final-goal supervision ensures successful completion, while process supervision prevents invalid shortcuts and entity-inconsistent trajectories.
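A small sketch of why both terms help, assuming the two kinds of query are combined as a goal loss plus the mean process loss. The aggregation rule, the `process_weight` parameter, and the probabilities are made up for illustration and are not reported values.

```python
import math
from typing import Sequence


def combined_loss(p_goal: float, p_process: Sequence[float], process_weight: float = 1.0) -> float:
    """Goal negative log-likelihood plus the weighted mean over process queries."""
    eps = 1e-8  # guards against log(0)
    goal_term = -math.log(max(p_goal, eps))
    if not p_process:
        return goal_term
    process_term = sum(-math.log(max(p, eps)) for p in p_process) / len(p_process)
    return goal_term + process_weight * process_term


# A trajectory that reaches the goal by breaking a rule (one process query near "No").
shortcut = combined_loss(p_goal=0.95, p_process=[0.10, 0.90])
# A trajectory with a slightly weaker goal score that respects every rule.
valid = combined_loss(p_goal=0.90, p_process=[0.90, 0.90])
```

With the goal term alone the shortcut trajectory would score better, since 0.95 is higher than 0.90. Adding the process term reverses that ordering and makes the rule-abiding trajectory the preferred one.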

Citation

@article{cheng2026vlmteacher,
    title   = {VLMs are Good Teachers for Video Reasoning via Adaptive Test-Time Optimization},
    author  = {Cheng, Junhao and Hou, Liang and Zhong, Tianxiong and Tao, Xin and Wan, Pengfei and Liao, Jing},
    journal = {arXiv preprint},
    year    = {2026},
    url     = {https://arxiv.org/abs/2606.02564},
}