MME-CoF-Pro

Evaluating Reasoning Coherence of Video Generative Models

*Equal contribution.  Project Lead.  Corresponding author.
1 Northeastern University, 2 The Chinese University of Hong Kong, 3 Westlake University, 4 ByteDance Seed, 5 Peking University, 6 NVIDIA

Reasoning Coherence

We define reasoning coherence as the extent to which generated events follow consistent and plausible cause-effect dynamics across frames. To quantify this property, we introduce a Reasoning Score (RS), a human-annotated metric that evaluates whether necessary intermediate reasoning steps are correctly preserved.

Examples

Prompt: Generate the motion that an apple falls straight down. The camera stays fixed throughout. Static shot.

Veo3.1-fast
Reasoning Score: 71.4%
Cosmos-Predict-2
Reasoning Score: 28.6%
Kling-v2.1
Reasoning Score: 57.1%
Reasoning Score Rubric
1. An apple appears above the container and begins falling straight downward.
2. The apple follows a continuous downward trajectory toward the container opening.
3. The apple reaches the container and collides with the apples inside.
4. A visible interaction occurs between the falling apple and the apples in the container.
5. The falling apple is deflected and does not remain inside the container.
6. The apple exits the container and lands on the table.
7. The apples originally inside the container remain inside after the interaction.
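
The three scores above are consistent with RS being the share of rubric items judged satisfied: with seven items, 71.4%, 57.1%, and 28.6% correspond to 5, 4, and 2 satisfied steps. The page does not spell out the formula, so the sketch below is an assumption. The function name `reasoning_score` and the per-step boolean judgments are hypothetical, and which specific steps each model passed is not stated here, so the lists are illustrative only.

```python
from typing import Sequence


def reasoning_score(step_passed: Sequence[bool]) -> float:
    """Return the percentage of rubric steps marked as satisfied.

    Assumption: RS is the plain fraction of satisfied rubric items.
    """
    if not step_passed:
        raise ValueError("rubric must contain at least one step")
    return 100.0 * sum(step_passed) / len(step_passed)


# Illustrative judgments only: the counts (5, 4, and 2 of 7) match the
# reported scores, but which steps passed is made up for this example.
five_of_seven = [True, True, True, True, True, False, False]
four_of_seven = [True, True, True, True, False, False, False]
two_of_seven = [True, True, False, False, False, False, False]

for label, judgments in [
    ("5/7", five_of_seven),
    ("4/7", four_of_seven),
    ("2/7", two_of_seven),
]:
    print(f"{label}: {reasoning_score(judgments):.1f}%")
```

Under this reading, every rubric item carries equal weight, so a model that gets the early steps right but fails the final outcome still receives partial credit.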

No Hint & Text Hint & Visual Hint

We evaluate each sample under three controlled settings that differ only in the provided reasoning guidance. In the no-hint setting, models rely solely on the original instruction; the text-hint setting adds explicit textual reasoning steps; and the visual-hint setting (for visually demanding tasks) highlights relevant regions or directions in the image using annotations.
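
One way to picture the three settings is as a single sample carrying up to three prompt variants. The sketch below is only an illustration: the class `HintedSample` and its field names are hypothetical and not part of the benchmark's released format. The prompt strings are taken from the embodied reasoning example on this page.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class HintedSample:
    """One sample with its prompt under each guidance setting (hypothetical)."""

    no_hint: str
    text_hint: str
    # Assumption: the visual hint is optional, since it is described as
    # applying to visually demanding tasks.
    visual_hint: Optional[str] = None

    def prompts(self) -> Dict[str, str]:
        """Return the prompts for the settings available for this sample."""
        available = {"no_hint": self.no_hint, "text_hint": self.text_hint}
        if self.visual_hint is not None:
            available["visual_hint"] = self.visual_hint
        return available


base = "Generate the correct trajectory for the gripper to pick up the green cup"
sample = HintedSample(
    no_hint=base + ".",
    text_hint=base + ". The gripper should gradually move leftwards, "
    "grasp the green cup, and then lift it up.",
    visual_hint=base + " as indicated by red arrow.",
)

for setting, prompt in sample.prompts().items():
    print(f"[{setting}] {prompt}")
```

Keeping the variants on one record makes it explicit that the settings share the same underlying instruction and differ only in the added guidance.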

Embodied Reasoning

Seedance-1.0-fast

No Hint

No Hint Prompt: Generate the correct trajectory for the gripper to pick up the green cup.

Reasoning Score: 66.7%
Visual Hint

Visual Hint Prompt: Generate the correct trajectory for the gripper to pick up the green cup as indicated by red arrow.

Reasoning Score: 33.3%
Text Hint

Text Hint Prompt: Generate the correct trajectory for the gripper to pick up the green cup. The gripper should gradually move leftwards, grasp the green cup, and then lift it up.

Reasoning Score: 50.0%
Reasoning Score Rubric
1. The gripper starts from its initial position above the dishwasher rack and aligns horizontally with the green cup located near the upper-left area of the rack.
2. The trajectory moves smoothly leftward toward the green cup, avoiding contact with the surrounding clear glasses and the rack wires.
3. The gripper positions itself directly above the green cup's opening, centered over the cup body.
4. The gripper moves vertically downward in a controlled manner until it reaches grasping height around the upper rim/body of the green cup.
5. The gripper closes once to securely grasp the green cup without knocking over adjacent cups or glasses.
6. After grasping, the gripper lifts the green cup upward along a smooth vertical path, clearly separating it from the rack.

4D Dynamics Reasoning

Seedance-1.0-fast

No Hint

No Hint Prompt: Generate the motion that a banana above a container falls vertically downward.

Reasoning Score: 33.3%
Visual Hint

Visual Hint Prompt: Generate the motion that a banana above a container falls vertically downward indicated by the arrow.

Reasoning Score: 50.0%
Text Hint

Text Hint Prompt: Generate the motion that a banana above a container falls vertically downward, lands on the top surface of the container, and falls down to rest while all other objects remain stationary.

Reasoning Score: 33.3%
Reasoning Score Rubric
1. The banana begins falling straight downward from its initial position.
2. The banana follows a continuous vertical trajectory toward the box opening.
3. The banana aligns with and passes through the opening of the box.
4. The banana fully enters the box without crossing the horizontal boundary of the opening.
5. The banana comes to rest inside the box.
6. No part of the banana remains above the box opening after it settles.

Object Counting Reasoning

Seedance-1.0-fast

No Hint

No Hint Prompt: Draw bounding boxes around each of the purple objects in the diagram to support counting the total number of cars.

Reasoning Score: 40.0%
Visual Hint

Visual Hint Prompt: Draw bounding boxes around each of the purple objects in the diagram to support counting the total number of cars.

Reasoning Score: 60.0%
Text Hint

Text Hint Prompt: Draw two bounding boxes separately around purple motorcycle and purple sedan in the diagram to support counting the total number of cars. There is one cyan motorcycle and one purple motorcycle in the center and the left part of the image.

Reasoning Score: 20.0%
Reasoning Score Rubric
1. Bounding boxes appear over the purple objects in the figure.
2. The bounding boxes are placed on the purple motorcycle and the purple sedan.
3. Each vehicle is highlighted with a separate bounding box.
4. The bounding boxes align tightly with the boundaries of the purple motorcycle and the purple sedan.
5. The total number of bounding boxes is 2.

Visual Trace Reasoning

Seedance-1.0-fast

No Hint

No Hint Prompt: Animate the elf starting from its initial position and proceeding step-by-step toward the gift via shortest path in the bottom-down area of the map, avoiding the icy holes while showing clear directional movement across adjacent cells. End with the elf standing beside the gift.

Reasoning Score: 25.0%
Visual Hint

Visual Hint Prompt: Animate the elf starting from its initial position and proceeding step-by-step toward the gift via shortest path in the bottom-down area of the map indicated by the red arrow, avoiding the icy holes while showing clear directional movement across adjacent cells. End with the elf standing beside the gift.

Reasoning Score: 100%
Text Hint

Text Hint Prompt: Animate the elf moving step-by-step: go straight down by 1 cell, and then go right by 1 cell to the bottom-right cell toward the gift via shortest path, carefully avoiding the icy holes. End with the elf standing beside the gift.

Reasoning Score: 100%
Reasoning Score Rubric
1. The elf starts at the correct cell on the map.
2. The first movement goes down exactly three cells.
3. The second movement goes right exactly one cell.
4. The elf reaches the gift's location while avoiding the icy holes.

Visual Detail Reasoning

Kling-v2.1

No Hint

No Hint Prompt: Zoom in toward the lower part of the image, focusing on the person wearing a helmet who is walking past the bank entrance. Center the frame on the helmet and hold a steady close-up shot to clearly show its color. Draw a bounding box around the helmet. Keep the surrounding area slightly blurred to emphasize the helmet.

Reasoning Score: 0%
Visual Hint

Visual Hint Prompt: Zoom in toward the lower part of the image, focusing on the person wearing a helmet who is walking past the bank entrance, as indicated by red bounding box. Center the frame on the helmet and hold a steady close-up shot to clearly show its color. Draw a bounding box around the helmet. Keep the surrounding area slightly blurred to emphasize the helmet.

Reasoning Score: 33.3%
Text Hint

Text Hint Prompt: Zoom in toward the lower part of the image, focusing on the man standing near the bank entrance wearing a helmet. Center the frame on the helmet and hold a steady close-up shot so the white color of the helmet is clearly visible. Draw a bounding box tightly around the helmet. Keep the surrounding area slightly blurred to emphasize the helmet.

Reasoning Score: 33.3%
Reasoning Score Rubric
1. Successfully zoom in on the helmet worn by the person near the bank entrance.
2. Successfully draw an accurate bounding box tightly enclosing the helmet.
3. Hold a clear, steady close-up shot where the helmet is fully visible and its color (white) can be easily identified.

MME-CoF-Pro Benchmark

MME-CoF-Pro is a benchmark of 303 samples across 16 categories for evaluating reasoning in video generative models. It compares three controlled settings—no hint, text hint, and visual hint—to isolate the effect of reasoning guidance. It further introduces Reasoning Score (RS), a process-level metric that evaluates correctness over intermediate reasoning steps.
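
To compare settings, per-sample scores can be grouped by guidance setting and averaged. The sketch below is a minimal illustration under stated assumptions: the helper names `mean_rs_by_setting` and `hint_gain` and the record layout are hypothetical, and the numbers are only the four single-sample Seedance-1.0-fast examples shown on this page, not the benchmark's aggregate results.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, List, Tuple

# (category, setting, Reasoning Score in percent); hypothetical record layout.
Record = Tuple[str, str, float]


def mean_rs_by_setting(records: Iterable[Record]) -> Dict[str, float]:
    """Average the Reasoning Score over all records of each setting."""
    buckets: Dict[str, List[float]] = defaultdict(list)
    for _category, setting, score in records:
        buckets[setting].append(score)
    return {setting: mean(scores) for setting, scores in buckets.items()}


def hint_gain(means: Dict[str, float], setting: str) -> float:
    """Difference between a hinted setting and the no-hint baseline."""
    return means[setting] - means["no_hint"]


# Single-sample example scores from this page (Seedance-1.0-fast only).
records: List[Record] = [
    ("Embodied", "no_hint", 66.7),
    ("Embodied", "visual_hint", 33.3),
    ("Embodied", "text_hint", 50.0),
    ("4D Dynamics", "no_hint", 33.3),
    ("4D Dynamics", "visual_hint", 50.0),
    ("4D Dynamics", "text_hint", 33.3),
    ("Object Counting", "no_hint", 40.0),
    ("Object Counting", "visual_hint", 60.0),
    ("Object Counting", "text_hint", 20.0),
    ("Visual Trace", "no_hint", 25.0),
    ("Visual Trace", "visual_hint", 100.0),
    ("Visual Trace", "text_hint", 100.0),
]

means = mean_rs_by_setting(records)
for setting in ("text_hint", "visual_hint"):
    print(f"{setting}: gain over no hint = {hint_gain(means, setting):+.1f} points")
```

Four hand-picked examples are far too few to support any conclusion; the same grouping would need to run over all samples and models before the gains mean anything.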

Figure 1
Figure 2
Figure 3

Visualization


Visual Detail Reasoning

Prompt: Zoom in on the Apple logo. Hold a steady close-up shot so the Apple logo is clearly visible. Focus and concentrate on the logo's color.

Veo-3
Kling-v2.1
Cosmos-Predict-2

Prompt: Gradually zoom in on the handbag. Keep the surrounding park and benches softly blurred to emphasize the handbag's color. Static shot.

Veo-3
Kling-v2.1
Cosmos-Predict-2

3D Geometry Reasoning

Prompt: Choose picture 4 as a base. Fold the other 5 faces to form a cube, with folding edges clearly shown. Static camera perspective, no zoom or pan.

Veo-3
Kling-v2.1
Cosmos-Predict-2

Prompt: Move the object up and rotate the object 90 degrees along z-axis. Static camera view, no zoom or pan, and the perspective of the object remains unchanged throughout.

Veo-3
Kling-v2.1
Cosmos-Predict-2

Real-world Spatial Reasoning

Prompt: A red arrow points from the green chair toward the door. Another red arrow points from the door toward the balcony. Static camera view, no zoom or pan.

Veo-3
Kling-v2.1
Cosmos-Predict-2

Prompt: An arrow points from the player wearing jersey number 10 in purple towards the basketball. Static camera view, no zoom or pan.

Veo-3
Kling-v2.1
Cosmos-Predict-2

Visual Trace Reasoning

Prompt: Animate the elf starting from its initial position and proceeding step-by-step toward the gift in the bottom-down area of the map, avoiding the icy holes while showing clear directional movement across adjacent cells. End with the elf standing beside the gift. Static shot.

Veo-3
Kling-v2.1
Cosmos-Predict-2

Prompt: Starting at the red dot in the top-left cell, animate moves to reach location B. Draw arrows for each step and finishing with a glow around the final cell. Static shot.

Veo-3
Kling-v2.1
Cosmos-Predict-2

Physics-based Reasoning

Prompt: Animate the ball reflecting at equal angles off the walls and landing near a numbered brick. Static shot.

Veo-3
Kling-v2.1
Cosmos-Predict-2

Prompt: Animate the red ball moving along the blue arrow reflecting at equal angles off the walls and landing near a numbered brick. Static shot.

Veo-3
Kling-v2.1
Cosmos-Predict-2

2D Geometry Reasoning

Prompt: In the figure shown, let 'n' represent the length of side AB of the inscribed rectangle ABCD, where n is an undetermined value. With BC equal to 6.0 and the diameter of circle O equal to 10.0. Generate an auxiliary line in order to calculate the value of n. The video ends once the connection process is complete. Static view, no zoom or pan.

Veo-3
Kling-v2.1
Cosmos-Predict-2

Prompt: The length of the unit of the square is known. Draw an auxiliary line to calculate the length of the segment MO. The video ends once the connection process is complete. Static view, no zoom or pan.

Veo-3
Kling-v2.1
Cosmos-Predict-2

Table & Chart Reasoning

Prompt: Start with a static, full view of the chart. Then, smoothly zoom the camera in to focus on the vertical area corresponding to the year 2014. The chart itself, including all its data, lines, and labels, must remain completely static and unchanged throughout the video. Static shot.

Veo-3
Kling-v2.1
Cosmos-Predict-2

Prompt: Start with smoothly zooming in to focus on the 'Nova Scotia' row. Then, smoothly zoom out to the full view of the chart. End with smoothly zooming in to focus on the 'Manitoba' row. The chart itself, including all its data, lines, and labels, must remain completely static and unchanged throughout the video.

Veo-3
Kling-v2.1
Cosmos-Predict-2

Object Counting Reasoning

Prompt: A scanner dot moves along the black line. As soon as this dot enters a new grid square, that entire square is instantly filled with yellow color and stays yellow. A square only turns yellow if the scanner dot on the line has entered it. Static camera, no zoom.

Veo-3
Kling-v2.1
Cosmos-Predict-2

Prompt: Highlight only the rectangles in the figure with a bright yellow color. Not highlight any other shapes like squares, triangles, circles, or irregular polygons. Static camera, no zoom, no pan.

Veo-3
Kling-v2.1
Cosmos-Predict-2

GUI Reasoning

Prompt: Click to scroll down through the list of years. Static shot.

Veo-3
Kling-v2.1
Cosmos-Predict-2

Prompt: Click the calendar icon. Static shot.

Veo-3
Kling-v2.1
Cosmos-Predict-2

Rotation Reasoning

Prompt: Rotate the scene certain degrees clockwise to make the person upright. Then draw bounding boxes around the frontmost skiing character. Static shot.

Veo-3
Kling-v2.1
Cosmos-Predict-2

Prompt: Rotate the scene certain degrees clockwise to make the scene upright. Then draw a bounding box around the leftmost vending machine. Static shot.

Veo-3
Kling-v2.1
Cosmos-Predict-2

Medical Reasoning

Prompt: Show the full axial CT, then pan and zoom smoothly to examine which lobe contains the pulmonary nodule. Static shot.

Veo-3
Kling-v2.1
Cosmos-Predict-2

Prompt: Show the full axial CT, and pan and zoom smoothly to examine the distribution pattern of stenotic segments. Static shot.

Veo-3
Kling-v2.1
Cosmos-Predict-2

Embodied Reasoning

Prompt: Generate the correct trajectory for the gripper to place the spatula into the pot. Static shot.

Veo-3
Kling-v2.1
Cosmos-Predict-2

Prompt: Generate the correct trajectory for the gripper to place the spoon to the right side of the table. Static shot.

Veo-3
Kling-v2.1
Cosmos-Predict-2

4D Dynamics Reasoning

Prompt: The camera records a static indoor scene where a toy car is placed on a wooden floor. Generate the motion that the car moves forward in a straight line. Static Shot.

Veo-3
Kling-v2.1
Cosmos-Predict-2

Prompt: The camera shows a static tabletop scene on a wooden tray. Generate the motion that the red candle then moves slowly along the white arrow while all other objects remain stationary. The camera stays fixed to show the visibility change.

Veo-3
Kling-v2.1
Cosmos-Predict-2

Natural Science Domain

Prompt: The dropper releases several drops of brown iodine solution into the beaker. The camera holds as the liquid quickly turns deep blue-black and spreads uniformly with no bubbles or precipitate. Static shot.

Veo-3
Kling-v2.1
Cosmos-Predict-2

Prompt: A gloved hand uses forceps to drop a small, silvery piece of sodium metal into a petri dish of water. The camera holds a steady close-up shot of the dish after the metal is released, clearly capturing what happens next. Static shot.

Veo-3
Kling-v2.1
Cosmos-Predict-2

Text-based Reasoning

Prompt: Solve the problem step by step by generating the human hand in writing the solution on the whiteboard.

Veo-3
Kling-v2.1
Cosmos-Predict-2

Prompt: Solve the problem step by step by generating the human hand in writing the solution on the whiteboard. Carefully follow the string transformation rules and show each intermediate step clearly.

Veo-3
Kling-v2.1
Cosmos-Predict-2

Visual Logical Reasoning

Prompt: Show the board with several empty slots and multiple colored pieces. Place each piece into the correct slot based on its shape so that all openings are properly filled. Keep the board centered and clearly visible. Static shot.

Veo-3
Kling-v2.1
Cosmos-Predict-2

Prompt: Show the tic-tac-toe grid with one move remaining. Complete the board by placing the blue piece in the correct empty cell to win the game. Keep the grid centered and clearly visible. Static shot.

Veo-3
Kling-v2.1
Cosmos-Predict-2

Takeaways

Takeaway 1: Reasoning Coherence shows no clear correlation with Generation Quality

High generation quality does not guarantee reasoning coherence. Some models (e.g., Kling) produce visually stunning videos but fail to capture the correct reasoning. Conversely, lower-fidelity outputs can still answer the question correctly.

Kling-v2.1 — Source: kling_1.mp4
Prompt: Gradually zoom in on the group of people walking along the path, centering on the person carrying the handbag. Keep the surrounding park and benches softly blurred to emphasize the handbag's color. Static shot.
Compared with Veo 3.1 Fast, Kling-v2.1 renders the wind blowing through the trees more realistically, but it fails to focus on the bag and its color.
Veo-3.1-fast — Source: veo_3.1_fast_1.mp4
Prompt: Gradually zoom in on the group of people walking along the path, centering on the person carrying the handbag. Keep the surrounding park and benches softly blurred to emphasize the handbag's color. Static shot.
In contrast, Veo 3.1 Fast pays better attention to the bag and color details, but the wind-driven tree motion is less realistic.
Kling-v2.1
Prompt: Generate the cursor movement to click the camera shutter button. Keep the shot static and perform a single precise click without triggering unintended actions or altering other interface.
Although Kling-v2.1 zooms in on the phone well, it fails to perform the click action.
Sora-2 — Source: sora_2.mp4
Prompt: Generate the cursor movement to click the camera shutter button. Keep the shot static and perform a single precise click without triggering unintended actions or altering other interface.
Sora 2 performs the click motion perfectly.

Takeaway 2: Text Hint improves Reasoning Score but introduces hallucination

Text hints generally improve Reasoning Score, but consistently degrade Consistency Score and cause hallucination, suggesting that explicit guidance may shift model attention rather than enhance genuine understanding.

Sora-2 — Elf
No Hint
Text Hint

No Hint Prompt: Generate the motion that the elf starting from its initial position and move step-by-step to reach the gift. Static shot.

Text Hint Prompt: Generate the motion that the elf starting from its initial position and move 3 cells down, then move 1 cell right to reach the gift. Static shot.

The text hint introduces an additional ice hole compared with the no-hint setting.
Veo-3.1 — Robotic gripper
No Hint
Text Hint

No Hint Prompt: Static shot of the scene. The robotic gripper grasps the silver pot while all other objects remain unchanged and stationary.

Text Hint Prompt: Static shot of the scene. The robotic gripper moves smoothly toward the silver pot, descends vertically to align with the pot’s rim or handle, gently closes to grasp it securely, and lifts straight upward while avoiding contact with nearby objects.

At the end of the video, the text hint changes the silver pot into a silver pan and introduces an additional handle.

Takeaway 3: Visual Hints help structured tasks but hurt fine-grained perception and induce hallucination

Visual hints are more effective for structured and spatially guided tasks but are less reliable for fine-grained visual tasks. They also introduce hallucinations by being mistakenly rendered as part of the scene.

Sora-2 — Chart zoom
No Hint
Visual Hint

No Hint Prompt: Start with smoothly zooming in to focus on the 'Nova Scotia' row. Then, smoothly zoom out to the full view of the chart. End with smoothly zooming in to focus on the 'Manitoba' row. The chart itself, including all its data, lines, and labels, must remain completely static and unchanged throughout the video.

Visual Hint Prompt: Start with smoothly zooming in to focus on the 'Nova Scotia' row, as indicated by red bounding box. Then, smoothly zoom out to the full view of the chart. End with smoothly zooming in to focus on the 'Manitoba' row. The chart itself, including all its data, lines, and labels, must remain completely static and unchanged throughout the video.

The visual bounding box improves visual grounding.
Veo-3.1 — Red arrow
Visual Hint

Visual Hint Prompt: A wooden box with several books sits on a table. Generate the motion that an apple falls freely from its current position indicated as the red arrow. Static shot.

The red arrow is rendered as part of the scene.

BibTeX

@article{qi2025mmecofpro,
  title={MME-CoF-Pro: Evaluating Reasoning Coherence of Video Generative Models},
  author={Yu Qi and Xinyi Xu and Ziyu Guo and Siyuan Ma and Renrui Zhang and Xinyan Chen and Ruichuan An and Ruofan Xing and Jiayi Zhang and Haojie Huang and Pheng-Ann Heng and Jonathan Tremblay and Lawson L. S. Wong},
  journal={arXiv preprint arXiv:2603.20194v1},
  year={2025}
}