We introduce V-Reason, a method that tunes the behavior of Large Multimodal Models during inference using entropy-based optimization, improving video reasoning accuracy and efficiency without reinforcement learning or supervised fine-tuning.
Video reasoning with Large Multimodal Models (LMMs) typically relies on costly reinforcement learning (RL) and verbose chain-of-thought, resulting in substantial computational overhead during both training and inference. Moreover, the mechanisms available to control the thinking process of these reasoning models are very limited. In this paper, using the entropy of the model's output as a signal, we discover that high-quality models go through a series of micro-explorations and micro-exploitations that keep the reasoning process grounded (i.e., they avoid excessive randomness while the model explores or thinks through an answer). We further observe that once this "thinking" process is over, more accurate models converge more decisively, reducing entropy significantly via a final exploitation phase (i.e., a more certain convergence toward a solution trajectory). We then use these novel, theoretically grounded insights to tune the model's behavior directly at inference, without any RL or supervised fine-tuning. Specifically, during inference, our proposed approach, called V-Reason (Video-Reason), adapts the value cache of the LMM via a few optimization steps on a small, trainable controller using an entropy-based objective, i.e., no supervision from any dataset or RL is necessary. This tuning improves the model's micro-exploration and exploitation behavior during inference. Our experiments show that V-Reason achieves significant improvements over the base instruction-tuned models across several video reasoning datasets, narrowing the gap with RL-trained models to within 0.6% average accuracy without any training, while offering substantial efficiency benefits: output tokens are reduced by 58.6% compared to the RL model.
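To make the entropy signal concrete, the following minimal sketch records the Shannon entropy of each next-token distribution during greedy decoding. It assumes a Hugging Face-style causal LM interface and a batch size of one; the function name and arguments are illustrative and not part of the released code.

import torch
import torch.nn.functional as F

@torch.no_grad()
def decoding_entropy_trace(model, input_ids, max_new_tokens=64):
    """Greedy-decode and record the Shannon entropy of each next-token distribution."""
    entropies = []
    ids = input_ids
    past = None
    for _ in range(max_new_tokens):
        out = model(input_ids=ids if past is None else ids[:, -1:],
                    past_key_values=past, use_cache=True)
        past = out.past_key_values
        logits = out.logits[:, -1, :]                   # next-token logits
        log_p = F.log_softmax(logits, dim=-1)
        entropy = -(log_p.exp() * log_p).sum(-1)        # H = -sum p log p
        entropies.append(entropy.item())                # assumes batch size 1
        next_tok = logits.argmax(dim=-1, keepdim=True)  # greedy step
        ids = torch.cat([ids, next_tok], dim=-1)
    return entropies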
We see clear macro-exploration and macro-exploitation phases, with larger, more accurate models showing lower overall entropy (a lower and later peak, followed by lower final entropy during macro-exploitation). We use these key insights to adapt a model's behavior in a training-free way through an inference-time optimization technique.
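As a rough illustration of how these macro phases can be read off an entropy trace, the snippet below smooths the per-step entropies and locates the peak separating the rising (macro-exploration) and falling (macro-exploitation) portions. The smoothing window and peak criterion are assumptions made for illustration, not the paper's exact analysis.

import numpy as np

def split_entropy_phases(entropies, window=9):
    """Smooth an entropy trace and summarize its peak and final values."""
    e = np.asarray(entropies, dtype=float)
    kernel = np.ones(window) / window
    smooth = np.convolve(e, kernel, mode="same")   # moving-average smoothing
    peak = int(np.argmax(smooth))                  # highest-entropy step: end of macro-exploration
    return {
        "peak_step": peak,
        "peak_entropy": float(smooth[peak]),
        "final_entropy": float(smooth[-1]),        # low final entropy indicates confident convergence
    }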
Applying V-Reason to Qwen2.5-VL-7B-Instruct makes its entropy profile behave more like that of larger models or the RL-trained Video-R1-7B model.
Our method achieves higher accuracy than the base LMM and narrows the accuracy gap with the RL-trained model.
V-Reason also significantly reduces the total output tokens compared to all other models, owing to its dedicated entropy-minimization phase.
Proposed approach for enhancing video reasoning in a training-free manner using an entropy-based objective. V-Reason uses inference-time optimization to modulate the value cache of the last decoder layer with an entropy-switching loss (Lswitch), further enhancing video reasoning performance.
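The sketch below illustrates the kind of inference-time update described in this figure: a tiny controller modulates the last decoder layer's value cache and is optimized for a few steps with an entropy-based switching objective. The controller form (an elementwise scale and shift), the stand-in readout, and the specific loss schedule are assumptions made for a self-contained toy example; they are not the paper's exact Lswitch.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ValueCacheController(nn.Module):
    """Learnable per-channel scale and shift applied to the cached value states."""
    def __init__(self, head_dim):
        super().__init__()
        self.scale = nn.Parameter(torch.ones(head_dim))
        self.shift = nn.Parameter(torch.zeros(head_dim))

    def forward(self, values):                      # values: [batch, heads, seq, head_dim]
        return values * self.scale + self.shift

def switching_loss(logits, in_final_phase, target_entropy=2.0):
    """Entropy objective: keep entropy near a target band while 'thinking',
    then minimize it once the final exploitation phase begins (assumed form)."""
    log_p = F.log_softmax(logits, dim=-1)
    entropy = -(log_p.exp() * log_p).sum(-1).mean()
    if in_final_phase:
        return entropy                              # drive toward confident convergence
    return (entropy - target_entropy).abs()         # keep exploration grounded, not random

# Toy demonstration with random tensors standing in for the real LMM internals.
batch, heads, seq, head_dim, vocab = 1, 8, 16, 64, 1000
values = torch.randn(batch, heads, seq, head_dim)   # stand-in for the last layer's value cache
lm_head = nn.Linear(heads * head_dim, vocab)        # stand-in readout producing next-token logits

controller = ValueCacheController(head_dim)
optim = torch.optim.Adam(controller.parameters(), lr=1e-2)

for step in range(5):                               # "a few optimization steps" at inference
    modulated = controller(values)
    pooled = modulated[:, :, -1, :].reshape(batch, -1)  # last-position values -> pseudo hidden state
    logits = lm_head(pooled)
    loss = switching_loss(logits, in_final_phase=(step >= 3))
    optim.zero_grad()
    loss.backward()
    optim.step()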
An example output and comparison with the baseline Qwen2.5-VL-7B-Instruct and the RL-trained Video-R1 model.
@misc{sridhar2025vreason,
  title={Video Reasoning Without Training},
  author={Deepak Sridhar* and Kartikeya Bhardwaj* and Jeya Pradha Jeyaraj and Nuno Vasconcelos and Ankita Nayak and Harris Teague},
  year={2025},
  eprint={2510.17045},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2510.17045},
}