Investigating the Viability of Multi-modal Large Language Models for Audio Deepfake Detection

A study investigates the use of Multimodal Large Language Models (MLLMs) for audio deepfake detection, an area that has so far been underexplored. By combining audio inputs with text prompts, the researchers evaluated two models, Qwen2-Audio-7B-Instruct and SALMONN, in zero-shot and fine-tuned modes. Results indicate that while performance on out-of-domain data remains weak, the models perform strongly on in-domain data after fine-tuning with minimal supervision, suggesting a promising direction for audio deepfake detection.
Multi-modal Large Language Models Show Promise for Audio Deepfake Detection
Recent research into Multi-modal Large Language Models (MLLMs) has opened new avenues for audio deepfake detection. This study investigates the effectiveness of MLLMs by integrating audio inputs with text prompts to enhance detection capabilities.
The study focuses on two MLLMs, Qwen2-Audio-7B-Instruct and SALMONN, evaluating each in zero-shot and fine-tuned modes. The researchers pair audio data with text prompts so the models can reason over both modalities, with the aim of improving feature learning for audio deepfake detection.
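To make the zero-shot setup concrete, here is a minimal sketch of prompting Qwen2-Audio-7B-Instruct with an audio clip plus a text question, using the Hugging Face transformers library. The file path and the exact prompt wording are illustrative assumptions, not the prompt reported in the paper.

```python
# Zero-shot sketch: ask Qwen2-Audio-7B-Instruct whether a clip is real or fake.
# The wav path and prompt text are illustrative assumptions.
import librosa
from transformers import AutoProcessor, Qwen2AudioForConditionalGeneration

processor = AutoProcessor.from_pretrained("Qwen/Qwen2-Audio-7B-Instruct")
model = Qwen2AudioForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-Audio-7B-Instruct", device_map="auto"
)

# Load the clip at the sampling rate the feature extractor expects (16 kHz).
waveform, _ = librosa.load("clip.wav", sr=processor.feature_extractor.sampling_rate)

# Pair the audio with a text prompt via the model's chat template.
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "audio", "audio_url": "clip.wav"},
            {"type": "text",
             "text": "Is this speech genuine or an AI-generated deepfake? "
                     "Answer 'real' or 'fake'."},
        ],
    }
]
text = processor.apply_chat_template(
    conversation, add_generation_prompt=True, tokenize=False
)
inputs = processor(
    text=text, audios=[waveform], return_tensors="pt", padding=True
).to(model.device)

generated = model.generate(**inputs, max_new_tokens=16)
# Strip the prompt tokens, keep only the model's answer.
answer = processor.batch_decode(
    generated[:, inputs["input_ids"].size(1):], skip_special_tokens=True
)[0]
print(answer)
```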
Findings
The experiments revealed mixed results:
- Without task-specific training, the models exhibited poor performance in detecting audio deepfakes.
- With minimal supervision, the models detected in-domain audio deepfakes effectively, suggesting that targeted training can unlock this capability (see the fine-tuning sketch after this list).
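The "minimal supervision" setting points toward parameter-efficient fine-tuning. The sketch below attaches LoRA adapters to Qwen2-Audio-7B-Instruct via the PEFT library and runs one supervised step on a labeled clip. The LoRA hyperparameters, target modules, file path, and "real"/"fake" label wording are assumptions for illustration, not the paper's recipe.

```python
# Hedged fine-tuning sketch with LoRA adapters (PEFT library).
# Hyperparameters, target modules, and label strings are illustrative
# assumptions, not the configuration reported in the paper.
import torch
import librosa
from peft import LoraConfig, get_peft_model
from transformers import AutoProcessor, Qwen2AudioForConditionalGeneration

processor = AutoProcessor.from_pretrained("Qwen/Qwen2-Audio-7B-Instruct")
model = Qwen2AudioForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-Audio-7B-Instruct", torch_dtype=torch.bfloat16
)

# Train only small low-rank adapters on the attention projections.
model = get_peft_model(model, LoraConfig(
    r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
))
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# One training step on a single labeled example (path/label are placeholders).
waveform, _ = librosa.load("spoofed_clip.wav",
                           sr=processor.feature_extractor.sampling_rate)
conversation = [
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "spoofed_clip.wav"},
        {"type": "text", "text": "Is this speech real or fake?"},
    ]},
    {"role": "assistant", "content": "fake"},  # gold label
]
text = processor.apply_chat_template(conversation, tokenize=False)
inputs = processor(text=text, audios=[waveform], return_tensors="pt")

# Simplification: the LM loss here covers the whole sequence; a real run
# would mask the prompt tokens with -100 so only the answer is supervised.
out = model(**inputs, labels=inputs["input_ids"].clone())
out.loss.backward()
optimizer.step()
optimizer.zero_grad()
```

In practice this loop would run over a full labeled dataset; the point is that only the small adapter matrices are updated, which matches the low-supervision regime the study describes.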
The findings suggest that MLLMs can detect audio deepfakes effectively once trained on relevant data, but their performance depends heavily on exposure to in-domain examples during fine-tuning.
📰 Original Source: https://arxiv.org/abs/2601.00777v1
All rights and credit belong to the original publisher.