Fast-Slow Efficient Training for Multimodal Large Language Models via Visual Token Pruning

Researchers have developed DualSpeed, a framework that improves the training efficiency of Multimodal Large Language Models (MLLMs) by addressing inefficiencies stemming from massive model sizes and large numbers of visual tokens. DualSpeed uses a dual-mode approach: a fast mode that employs Visual Token Pruning (VTP) to reduce the number of visual tokens, and a slow mode that trains on full sequences to keep training consistent with inference. The method significantly accelerates training—2.1x for LLaVA-1.5 and 4.0x for LLaVA-NeXT—while retaining over 99% of original performance. Code is available on GitHub.
New Framework Enhances Training Efficiency for Multimodal Large Language Models
Researchers have unveiled a novel framework called DualSpeed that significantly improves the training efficiency of Multimodal Large Language Models (MLLMs). The approach targets inefficiencies caused by massive model sizes and the large number of visual tokens, both of which have hindered training.
Current efficiency methods typically focus on reducing model sizes or limiting the number of trainable parameters. Visual Token Pruning (VTP) offers another lever, but it faces challenges when applied during training, leading to a mismatch between the training and inference processes.
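To make the idea concrete, the following is a minimal sketch of visual token pruning: keep only the top-scoring fraction of tokens while preserving their original order. The scoring signal, the keep ratio, and the function names are illustrative assumptions, not the specific technique used in the paper.

```python
def prune_visual_tokens(tokens, scores, keep_ratio=0.5):
    """Keep the top `keep_ratio` fraction of tokens by score.

    `scores` stands in for any per-token importance signal
    (e.g. attention-based saliency); the exact criterion here
    is an assumption for illustration.
    """
    k = max(1, int(len(tokens) * keep_ratio))
    # Indices of the k highest-scoring tokens, restored to original order.
    top = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)[:k]
    keep = sorted(top)
    return [tokens[i] for i in keep]


# Example: prune 8 visual tokens down to 4.
tokens = ["t0", "t1", "t2", "t3", "t4", "t5", "t6", "t7"]
scores = [0.1, 0.9, 0.2, 0.8, 0.3, 0.7, 0.4, 0.6]
pruned = prune_visual_tokens(tokens, scores, keep_ratio=0.5)
# pruned == ["t1", "t3", "t5", "t7"]
```

Because the sequence fed to the language model is shorter, each training step is cheaper; the mismatch arises because inference may still present the full token sequence.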
DualSpeed Framework
The DualSpeed framework operates in two modes. The fast mode integrates existing VTP techniques to minimize the number of visual tokens and includes a mode isolator to improve training efficiency. The slow mode serves as an auxiliary training phase in which the model is exposed to complete visual sequences, ensuring consistency between training and inference. The slow mode also employs self-distillation, allowing it to learn from the better-trained fast mode.
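The dual-mode interplay described above can be sketched as a schedule that runs cheap fast-mode steps most of the time and periodically runs a slow-mode step whose loss adds a self-distillation term toward the fast-mode teacher. The schedule period, the loss weight `alpha`, and the direction of the KL term are all assumptions for illustration; the paper's exact formulation may differ.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for two discrete probability distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def dual_speed_mode(step, slow_every=8):
    """Assumed schedule: one slow-mode (full-sequence) step every
    `slow_every` steps; all other steps run in fast mode on pruned tokens."""
    return "slow" if step % slow_every == 0 else "fast"


def slow_mode_loss(task_loss, fast_probs, slow_probs, alpha=0.5):
    """Slow-mode objective: task loss plus a self-distillation term that
    pulls the slow-mode output toward the fast-mode teacher.
    `alpha` is an assumed weighting, not a value from the paper."""
    return task_loss + alpha * kl_divergence(fast_probs, slow_probs)


# Example: identical teacher/student distributions add no distillation penalty.
loss = slow_mode_loss(1.0, [0.5, 0.5], [0.5, 0.5])
# loss == 1.0
```

The design intuition: fast-mode steps dominate wall-clock cost and benefit from pruning, while the occasional slow-mode step keeps the model calibrated on full sequences so inference behavior does not drift.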
Performance Gains
Initial experiments demonstrate that the DualSpeed framework accelerates training times without sacrificing model performance. Specifically, LLaVA-1.5's training has been expedited by a factor of 2.1 and LLaVA-NeXT by 4.0, maintaining over 99% of the models' original performance metrics.
Developers and researchers interested in exploring this framework can access the code in the DualSpeed repository on GitHub.
📰 Original Source: https://arxiv.org/abs/2602.03815v1
All rights and credit belong to the original publisher.