Revisiting the Data Sampling in Multimodal Post-training from a Difficulty-Distinguish View

Published in AAAI 2026

This paper revisits data sampling strategies for reinforcement-learning-based multimodal post-training from a difficulty-distinguishing perspective.

We propose two complementary difficulty metrics:

PISM: Progressive Image Semantic Masking, effective for perception-intensive tasks
CMAB: Cross-Modal Attention Balance, effective for reasoning-intensive tasks
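
To make the idea of a masking-based difficulty metric concrete, here is a minimal sketch of what a PISM-style score could look like. It is not the paper's implementation: the function name `progressive_masking_difficulty`, the `answers_correctly` callback standing in for the model, the mask ratios, and the random choice of which patches to mask are all assumptions made for illustration. The sketch also assumes that a sample which fails at a low mask ratio counts as harder than one which survives heavy masking.

```python
import random
from typing import Callable, Sequence


def progressive_masking_difficulty(
    patches: Sequence[int],
    answers_correctly: Callable[[Sequence[int]], bool],
    ratios: Sequence[float] = (0.0, 0.2, 0.4, 0.6, 0.8),
    seed: int = 0,
) -> float:
    """Return the smallest mask ratio at which the (hypothetical) model fails.

    Returns 1.0 if the model never fails on any tested ratio.
    `patches` is a stand-in for image patches; `answers_correctly` is a
    hypothetical callback that receives the still-visible patches.
    """
    rng = random.Random(seed)
    # Fix one random masking order so each step masks a superset of the last.
    order = list(range(len(patches)))
    rng.shuffle(order)

    for ratio in ratios:
        n_masked = int(round(ratio * len(patches)))
        masked = set(order[:n_masked])
        visible = [p for i, p in enumerate(patches) if i not in masked]
        if not answers_correctly(visible):
            return ratio
    return 1.0


# Toy stand-in model: it answers correctly while at least 5 patches are visible.
score = progressive_masking_difficulty(list(range(10)), lambda v: len(v) >= 5)
print(score)
```

Under these assumptions, scores like this could be used to sort or bucket training samples by difficulty before post-training; how the paper actually schedules samples is not specified here.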

Based on these metrics, we design a hierarchical post-training framework supporting both GRPO-only and SFT+GRPO paradigms, enabling effective fusion of perception and reasoning abilities in multimodal large models.
