The official GitHub page for ''Beyond the Last Frame: Process-aware Evaluation for Generative Video Reasoning''


Beyond the Last Frame: Process-aware Evaluation for Generative Video Reasoning



Illustration of outcome-hacking, where the generated video has the correct final state but an incorrect process.


👀 Overview

Current video generation models often suffer from outcome-hacking: they generate a video that reaches the correct final state through an incorrect process, fooling traditional evaluation metrics that judge only the final frame.

VIPER (VIdeo Process Evaluation for Reasoning) is designed to bridge this gap:

  • πŸ† Comprehensive Benchmark: 309 carefully curated samples spanning 6 distinct domains (Temporal, Structural, Symbolic, Spatial, Physics, and Planning).
  • πŸ“ New Metric (POC@r): Process-Outcome Consistency. We evaluate correctness at both the process and outcome levels by uniformly sampling frames at rate $r$.
  • 🚫 Failure Pattern: We identify and summarize four common failure patterns in current generative video models.
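To make the sampling idea behind POC@r concrete, here is a minimal sketch. The exact scoring protocol is defined in the paper; the helper names (`uniform_sample`, `frame_is_correct`, `outcome_is_correct`) and the all-frames-must-pass aggregation are illustrative assumptions, not the official implementation.

```python
def uniform_sample(frames, r):
    """Uniformly sample roughly a fraction r of frames, always keeping the last one."""
    n = max(1, round(len(frames) * r))
    step = len(frames) / n
    idx = sorted({min(len(frames) - 1, int(i * step)) for i in range(n)} | {len(frames) - 1})
    return [frames[i] for i in idx]

def poc_at_r(frames, frame_is_correct, outcome_is_correct, r=0.5):
    """Illustrative POC@r-style score: 1 only if the final outcome is correct
    AND every uniformly sampled intermediate frame is also correct."""
    sampled = uniform_sample(frames, r)
    process_ok = all(frame_is_correct(f) for f in sampled)
    return int(process_ok and outcome_is_correct(frames[-1]))
```

Under this sketch, a video that "hacks" the outcome (correct last frame, wrong intermediate frames) scores 0, whereas a last-frame-only metric would score it 1.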


Overview of VIPER, which consists of 16 tasks across 6 domains.

📊 Dataset Statistics

VIPER covers diverse reasoning tasks to ensure a holistic evaluation of video generation capabilities.

Domain      Samples  Task Types
Physics        32    experiment, game
Planning       44    navigation, manipulation
Spatial        60    rotate, restore
Structural     70    chess, maze, sudoku
Symbolic       60    math, multimodal
Temporal       43    obj_move, zoom
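The per-domain counts above add up to the 309 samples in the benchmark:

```python
# Per-domain sample counts from the table above.
samples_per_domain = {
    "Physics": 32, "Planning": 44, "Spatial": 60,
    "Structural": 70, "Symbolic": 60, "Temporal": 43,
}
total = sum(samples_per_domain.values())  # 32+44+60+70+60+43 = 309
```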

🚀 Quick Start

Download

from datasets import load_dataset

# Load the full VIPER benchmark
dataset = load_dataset("Monosail/VIPER")

Data Fields

  • id: Unique identifier for the sample
  • domain: The reasoning domain (Physics, Planning, Spatial, Structural, Symbolic, Temporal)
  • task_type: Specific task category within the domain
  • prompt: Text prompt describing the task
  • image: The input image
  • reference_frames: Ground-truth image frames
  • reference_texts: Ground-truth text descriptions
  • protocol: Process-level task constraints
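Each record is a dict-like mapping with the fields above. A minimal sketch of per-domain filtering is shown below; the records here are fabricated placeholders that only mirror the schema (real samples come from `load_dataset("Monosail/VIPER")`, and the ids are made up for illustration):

```python
from collections import defaultdict

def group_by_domain(records):
    """Group benchmark records by their `domain` field."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec["domain"]].append(rec)
    return dict(groups)

# Hypothetical records mirroring the schema above (ids invented for illustration).
records = [
    {"id": "phys_0001", "domain": "Physics", "task_type": "experiment"},
    {"id": "spat_0001", "domain": "Spatial", "task_type": "rotate"},
]
groups = group_by_domain(records)
```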

πŸ› οΈ Evaluation (Coming Soon)

πŸ“ Citation

If you find our benchmark useful for your research, please consider citing:

@article{li2026viper,
  title={Beyond the Last Frame: Process-aware Evaluation for Generative Video Reasoning},
  author={Li, Yifan and Gu, Yukai and Min, Yingqian and Liu, Zikang and Du, Yifan and Zhou, Kun and Yang, Min and Zhao, Wayne Xin and Qiu, Minghui},
  journal={arXiv preprint arXiv:2512.24952},
  year={2025}
}
