2.8m - Gmail.txt

The model is tested on subsets ranging from 200k to 2.8 million samples.

The paper demonstrates that MSRL significantly outperforms pure SFT models by optimizing for both textual structure and visual fidelity, surpassing the performance limit reached at 2.8M SFT samples [11, 25].

                   SFT Stage                           MSRL Stage
Max Dataset Size   2.8 million samples [11, 22]        33k curated samples [11]
GPU Requirement    16 H800 GPUs [11]                   24 H800 GPUs [11]
Training Goal      Min. Negative Log-Likelihood [22]   Hybrid Text-Visual Reward [11]
Outcome            Performance Plateaus [22]           Breaks SFT Performance Limit [11]
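To make the idea of a hybrid text-visual reward concrete, here is a minimal sketch that blends a textual score and a visual score into one scalar. The function name `hybrid_reward`, the `text_weight` parameter, its default of 0.5, and the assumption that both scores lie in [0, 1] are illustrative choices, not details taken from the paper.

```python
def hybrid_reward(text_reward: float, visual_reward: float,
                  text_weight: float = 0.5) -> float:
    """Blend a textual and a visual reward into a single scalar.

    Assumption: both inputs are already normalized to [0, 1], and the
    blend is a simple convex combination. The actual weighting used in
    MSRL may differ.
    """
    if not 0.0 <= text_weight <= 1.0:
        raise ValueError("text_weight must be in [0, 1]")
    return text_weight * text_reward + (1.0 - text_weight) * visual_reward


# Example: structurally perfect output that renders poorly.
reward = hybrid_reward(text_reward=1.0, visual_reward=0.2, text_weight=0.5)
print(f"hybrid reward: {reward:.2f}")
```

A convex combination keeps the result in the same range as its inputs, which is one simple way to stop either signal from dominating; other balancing schemes are equally plausible.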

Qwen2.5-VL-72B-Instruct is used as the judge model for calculating visual rewards during training [11].

4. Experimental Results

Increasing data from 2M to 2.8M results in no further performance gains, confirming the plateau [22].

Multimodal Structured Reinforcement Learning (MSRL)

The SFT stage requires 60 hours of training on 16 H800 GPUs. The RL stages take an additional 34 hours on 24 H800 GPUs [11].
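The reported wall-clock times and GPU counts can be turned into a rough GPU-hour budget. The short calculation below assumes every GPU is in use for the full duration of its stage, which the source does not state explicitly.

```python
# Figures as reported: (wall-clock hours, number of H800 GPUs).
sft_hours, sft_gpus = 60, 16   # SFT stage
rl_hours, rl_gpus = 34, 24     # RL stages

# Assumption: all GPUs are busy for the whole stage.
sft_gpu_hours = sft_hours * sft_gpus
rl_gpu_hours = rl_hours * rl_gpus
total_gpu_hours = sft_gpu_hours + rl_gpu_hours

print(sft_gpu_hours, rl_gpu_hours, total_gpu_hours)  # 960 816 1776
```

Under that assumption, the RL stages add roughly 816 GPU-hours on top of the 960 GPU-hours spent on SFT.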

Uses 11k pairs with a balance of textual and visual rewards (