Examples
Prompt: Generate the motion that an apple falls straight down. The camera stays fixed throughout. Static shot.
We evaluate each sample under three controlled settings that differ only in the provided reasoning guidance. In the no-hint setting, models rely solely on the original instruction; the text-hint setting adds explicit textual reasoning steps; and the visual-hint setting (for visually demanding tasks) highlights relevant regions or directions in the image using annotations.
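A minimal sketch of how the three settings could be assembled for one sample, using the Embodied Reasoning prompts from this page. The `Sample` class, its field names, and `build_prompts` are hypothetical and not part of the benchmark's released code. The sketch also assumes a text hint is simply appended to the original instruction, which holds for some samples here but not all.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Sample:
    """Hypothetical container for one benchmark sample (names are assumptions)."""
    base_prompt: str                          # original instruction (no-hint setting)
    text_hint: Optional[str] = None           # explicit textual reasoning steps
    visual_hint_prompt: Optional[str] = None  # full prompt that refers to the image annotation


def build_prompts(sample: Sample) -> Dict[str, str]:
    """Return one prompt per available setting; only the guidance differs."""
    prompts = {"no_hint": sample.base_prompt}
    if sample.text_hint is not None:
        # Assumption: the text hint is appended after the original instruction.
        prompts["text_hint"] = f"{sample.base_prompt} {sample.text_hint}"
    if sample.visual_hint_prompt is not None:
        # Visual hints are only provided for visually demanding tasks.
        prompts["visual_hint"] = sample.visual_hint_prompt
    return prompts


sample = Sample(
    base_prompt="Generate the correct trajectory for the gripper to pick up the green cup.",
    text_hint=(
        "The gripper should gradually move leftwards, grasp the green cup, "
        "and then lift it up."
    ),
    visual_hint_prompt=(
        "Generate the correct trajectory for the gripper to pick up the green cup "
        "as indicated by red arrow."
    ),
)
prompts = build_prompts(sample)
```

Keeping the original instruction fixed and varying only the hint is what lets differences in output be attributed to the guidance itself.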
Embodied Reasoning
Seedance-1.0-fast

No Hint Prompt: Generate the correct trajectory for the gripper to pick up the green cup.

Visual Hint Prompt: Generate the correct trajectory for the gripper to pick up the green cup as indicated by the red arrow.

Text Hint Prompt: Generate the correct trajectory for the gripper to pick up the green cup. The gripper should gradually move leftwards, grasp the green cup, and then lift it up.
4D Dynamics Reasoning
Seedance-1.0-fast

No Hint Prompt: Generate the motion that a banana above a container falls vertically downward.

Visual Hint Prompt: Generate the motion that a banana above a container falls vertically downward indicated by the arrow.

Text Hint Prompt: Generate the motion that a banana above a container falls vertically downward, lands on the top surface of the container, and falls down to rest while all other objects remain stationary.
Object Counting Reasoning
Seedance-1.0-fast

No Hint Prompt: Draw bounding boxes around each of the purple objects in the diagram to support counting the total number of cars.

Visual Hint Prompt: Draw bounding boxes around each of the purple objects in the diagram to support counting the total number of cars.

Text Hint Prompt: Draw two bounding boxes separately around purple motorcycle and purple sedan in the diagram to support counting the total number of cars. There is one cyan motorcycle and one purple motorcycle in the center and the left part of the image.
Visual Trace Reasoning
Seedance-1.0-fast

No Hint Prompt: Animate the elf starting from its initial position and proceeding step-by-step toward the gift via shortest path in the bottom-down area of the map, avoiding the icy holes while showing clear directional movement across adjacent cells. End with the elf standing beside the gift.

Visual Hint Prompt: Animate the elf starting from its initial position and proceeding step-by-step toward the gift via shortest path in the bottom-down area of the map indicated by the red arrow, avoiding the icy holes while showing clear directional movement across adjacent cells. End with the elf standing beside the gift.

Text Hint Prompt: Animate the elf moving step-by-step: go straight down by 1 cell, and then go right by 1 cell to the bottom-right cell toward the gift via shortest path, carefully avoiding the icy holes. End with the elf standing beside the gift.
Visual Detail Reasoning
Kling-v2.1

No Hint Prompt: Zoom in toward the lower part of the image, focusing on the person wearing a helmet who is walking past the bank entrance. Center the frame on the helmet and hold a steady close-up shot to clearly show its color. Draw a bounding box around the helmet. Keep the surrounding area slightly blurred to emphasize the helmet.

Visual Hint Prompt: Zoom in toward the lower part of the image, focusing on the person wearing a helmet who is walking past the bank entrance, as indicated by the red bounding box. Center the frame on the helmet and hold a steady close-up shot to clearly show its color. Draw a bounding box around the helmet. Keep the surrounding area slightly blurred to emphasize the helmet.

Text Hint Prompt: Zoom in toward the lower part of the image, focusing on the man standing near the bank entrance wearing a helmet. Center the frame on the helmet and hold a steady close-up shot so the white color of the helmet is clearly visible. Draw a bounding box tightly around the helmet. Keep the surrounding area slightly blurred to emphasize the helmet.
MME-CoF-Pro is a benchmark of 303 samples across 16 categories for evaluating reasoning in video generative models. It compares three controlled settings—no hint, text hint, and visual hint—to isolate the effect of reasoning guidance. It further introduces Reasoning Score (RS), a process-level metric that evaluates correctness over intermediate reasoning steps.
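The page does not give the formula for RS. Purely as an illustration of what a process-level metric can look like, the sketch below assumes each video has been judged step by step and scores it as the fraction of intermediate steps judged correct. The function names and this averaging rule are assumptions, not the benchmark's definition.

```python
from statistics import mean
from typing import Dict, Sequence


def reasoning_score(step_correct: Sequence[bool]) -> float:
    """Fraction of intermediate reasoning steps judged correct (assumed definition)."""
    if not step_correct:
        raise ValueError("at least one reasoning step is required")
    return sum(step_correct) / len(step_correct)


def mean_score_by_setting(
    judgments: Dict[str, Sequence[Sequence[bool]]],
) -> Dict[str, float]:
    """Average the per-video score within each setting (no hint, text hint, visual hint)."""
    return {
        setting: mean(reasoning_score(steps) for steps in videos)
        for setting, videos in judgments.items()
    }
```

Scoring intermediate steps rather than only the final frame is what separates a process-level metric from an outcome-only check: a video that reaches the right end state through an incorrect path is still penalized.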
Prompt: Zoom in on the Apple logo. Hold a steady close-up shot so the Apple logo is clearly visible. Focus and concentrate on the logo's color.



Prompt: Gradually zoom in on the handbag. Keep the surrounding park and benches softly blurred to emphasize the handbag's color. Static shot.



Prompt: Choose picture 4 as a base. Fold the other 5 faces to form a cube, with folding edges clearly shown. Static camera perspective, no zoom or pan.



Prompt: Move the object up and rotate the object 90 degrees along z-axis. Static camera view, no zoom or pan, and the perspective of the object remains unchanged throughout.



Prompt: A red arrow points from the green chair toward the door. Another red arrow points from the door toward the balcony. Static camera view, no zoom or pan.



Prompt: An arrow points from the player wearing jersey number 10 in purple towards the basketball. Static camera view, no zoom or pan.



Prompt: Animate the elf starting from its initial position and proceeding step-by-step toward the gift in the bottom-down area of the map, avoiding the icy holes while showing clear directional movement across adjacent cells. End with the elf standing beside the gift. Static shot.



Prompt: Starting at the red dot in the top-left cell, animate moves to reach location B. Draw arrows for each step and finish with a glow around the final cell. Static shot.



Prompt: Animate the ball reflecting at equal angles off the walls and landing near a numbered brick. Static shot.



Prompt: Animate the red ball moving along the blue arrow reflecting at equal angles off the walls and landing near a numbered brick. Static shot.



Prompt: In the figure shown, let 'n' represent the length of side AB of the inscribed rectangle ABCD, where n is an undetermined value, BC is equal to 6.0, and the diameter of circle O is equal to 10.0. Generate an auxiliary line in order to calculate the value of n. The video ends once the connection process is complete. Static view, no zoom or pan.



Prompt: The length of the unit of the square is known. Draw an auxiliary line to calculate the length of the segment MO. The video ends once the connection process is complete. Static view, no zoom or pan.



Prompt: Start with a static, full view of the chart. Then, smoothly zoom the camera in to focus on the vertical area corresponding to the year 2014. The chart itself, including all its data, lines, and labels, must remain completely static and unchanged throughout the video. Static shot.



Prompt: Start with smoothly zooming in to focus on the 'Nova Scotia' row. Then, smoothly zoom out to the full view of the chart. End with smoothly zooming in to focus on the 'Manitoba' row. The chart itself, including all its data, lines, and labels, must remain completely static and unchanged throughout the video.



Prompt: A scanner dot moves along the black line. As soon as this dot enters a new grid square, that entire square is instantly filled with yellow color and stays yellow. A square only turns yellow if the scanner dot on the line has entered it. Static camera, no zoom.



Prompt: Highlight only the rectangles in the figure with a bright yellow color. Do not highlight any other shapes like squares, triangles, circles, or irregular polygons. Static camera, no zoom, no pan.



Prompt: Click to scroll down through the list of years. Static shot.



Prompt: Click the calendar icon. Static shot.



Prompt: Rotate the scene certain degrees clockwise to make the person upright. Then draw bounding boxes around the frontmost skiing character. Static shot.



Prompt: Rotate the scene certain degrees clockwise to make the scene upright. Then draw a bounding box around the leftmost vending machine. Static shot.



Prompt: Show the full axial CT, then pan and zoom smoothly to examine which lobe contains the pulmonary nodule. Static shot.



Prompt: Show the full axial CT, and pan and zoom smoothly to examine the distribution pattern of stenotic segments. Static shot.



Prompt: Generate the correct trajectory for the gripper to place the spatula into the pot. Static shot.



Prompt: Generate the correct trajectory for the gripper to place the spoon to the right side of the table. Static shot.



Prompt: The camera records a static indoor scene where a toy car is placed on a wooden floor. Generate the motion that the car moves forward in a straight line. Static Shot.



Prompt: The camera shows a static tabletop scene on a wooden tray. Generate the motion that the red candle then moves slowly along the white arrow while all other objects remain stationary. The camera stays fixed to show the visibility change.



Prompt: The dropper releases several drops of brown iodine solution into the beaker. The camera holds as the liquid quickly turns deep blue-black and spreads uniformly with no bubbles or precipitate. Static shot.



Prompt: A gloved hand uses forceps to drop a small, silvery piece of sodium metal into a petri dish of water. The camera holds a steady close-up shot of the dish after the metal is released, clearly capturing what happens next. Static shot.



Prompt: Solve the problem step by step by generating the human hand in writing the solution on the whiteboard.



Prompt: Solve the problem step by step by generating the human hand in writing the solution on the whiteboard. Carefully follow the string transformation rules and show each intermediate step clearly.



Prompt: Show the board with several empty slots and multiple colored pieces. Place each piece into the correct slot based on its shape so that all openings are properly filled. Keep the board centered and clearly visible. Static shot.



Prompt: Show the tic-tac-toe grid with one move remaining. Complete the board by placing the blue piece in the correct empty cell to win the game. Keep the grid centered and clearly visible. Static shot.



High generation quality does not guarantee reasoning coherence. Some models (e.g., Kling) produce visually stunning videos but fail to capture the correct reasoning. Conversely, lower-fidelity outputs can still answer the question correctly.
Text hints generally improve Reasoning Score, but consistently degrade Consistency Score and cause hallucination, suggesting that explicit guidance may shift model attention rather than enhance genuine understanding.
Visual hints are more effective for structured and spatially guided tasks but are less reliable for fine-grained visual tasks. They also introduce hallucinations by being mistakenly rendered as part of the scene.
@article{qi2025mmecofpro,
title={MME-CoF-Pro: Evaluating Reasoning Coherence of Video Generative Models},
author={Yu Qi and Xinyi Xu and Ziyu Guo and Siyuan Ma and Renrui Zhang and Xinyan Chen and Ruichuan An and Ruofan Xing and Jiayi Zhang and Haojie Huang and Pheng-Ann Heng and Jonathan Tremblay and Lawson L. S. Wong},
journal={arXiv preprint arXiv:2603.20194v1},
year={2025}
}