Investigating the Viability of Multi-modal Large Language Models for Audio Deepfake Detection

A study investigates the use of Multimodal Large Language Models (MLLMs) for audio deepfake detection, an area that has so far been underexplored. By combining audio inputs with text prompts, the researchers evaluated two models, Qwen2-Audio-7B-Instruct and SALMONN, in zero-shot and fine-tuned modes. Results indicate that while performance on out-of-domain data remains weak, the models perform strongly on in-domain data after fine-tuning with minimal supervision, suggesting a promising direction for audio deepfake detection.
Multi-modal Large Language Models Show Promise for Audio Deepfake Detection
Recent research into Multi-modal Large Language Models (MLLMs) has opened new avenues for audio deepfake detection. This study investigates the effectiveness of MLLMs by integrating audio inputs with text prompts to enhance detection capabilities.
The study focuses on two MLLMs, Qwen2-Audio-7B-Instruct and SALMONN, evaluating each in zero-shot and fine-tuned modes. The researchers pair audio data with text prompts so the models can reason over both modalities, with the aim of improving feature learning for audio deepfake detection.
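To make the zero-shot setup concrete, here is a minimal sketch of prompting Qwen2-Audio-7B-Instruct with an audio clip plus a text question, using the Hugging Face transformers library. The file path and the exact prompt wording are illustrative assumptions, not the prompt reported in the paper.

```python
# Zero-shot sketch: ask Qwen2-Audio-7B-Instruct whether a clip is real or fake.
# The wav path and prompt text are illustrative assumptions.
import librosa
from transformers import AutoProcessor, Qwen2AudioForConditionalGeneration

processor = AutoProcessor.from_pretrained("Qwen/Qwen2-Audio-7B-Instruct")
model = Qwen2AudioForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-Audio-7B-Instruct", device_map="auto"
)

# Load the clip at the sampling rate the feature extractor expects (16 kHz).
waveform, _ = librosa.load("clip.wav", sr=processor.feature_extractor.sampling_rate)

# Pair the audio with a text prompt via the model's chat template.
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "audio", "audio_url": "clip.wav"},
            {"type": "text",
             "text": "Is this speech genuine or an AI-generated deepfake? "
                     "Answer 'real' or 'fake'."},
        ],
    }
]
text = processor.apply_chat_template(
    conversation, add_generation_prompt=True, tokenize=False
)
inputs = processor(
    text=text, audios=[waveform], return_tensors="pt", padding=True
).to(model.device)

generated = model.generate(**inputs, max_new_tokens=16)
# Strip the prompt tokens, keep only the model's answer.
answer = processor.batch_decode(
    generated[:, inputs["input_ids"].size(1):], skip_special_tokens=True
)[0]
print(answer)
```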
Findings
The experiments revealed mixed results:
- Without task-specific training, the models exhibited poor performance in detecting audio deepfakes.
- With minimal supervision, the models detected in-domain audio deepfakes effectively, suggesting that targeted training can unlock this capability (see the fine-tuning sketch after this list).
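The "minimal supervision" setting points toward parameter-efficient fine-tuning. The sketch below attaches LoRA adapters to Qwen2-Audio-7B-Instruct via the PEFT library and runs one supervised step on a labeled clip. The LoRA hyperparameters, target modules, file path, and "real"/"fake" label wording are assumptions for illustration, not the paper's recipe.

```python
# Hedged fine-tuning sketch with LoRA adapters (PEFT library).
# Hyperparameters, target modules, and label strings are illustrative
# assumptions, not the configuration reported in the paper.
import torch
import librosa
from peft import LoraConfig, get_peft_model
from transformers import AutoProcessor, Qwen2AudioForConditionalGeneration

processor = AutoProcessor.from_pretrained("Qwen/Qwen2-Audio-7B-Instruct")
model = Qwen2AudioForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-Audio-7B-Instruct", torch_dtype=torch.bfloat16
)

# Train only small low-rank adapters on the attention projections.
model = get_peft_model(model, LoraConfig(
    r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
))
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# One training step on a single labeled example (path/label are placeholders).
waveform, _ = librosa.load("spoofed_clip.wav",
                           sr=processor.feature_extractor.sampling_rate)
conversation = [
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "spoofed_clip.wav"},
        {"type": "text", "text": "Is this speech real or fake?"},
    ]},
    {"role": "assistant", "content": "fake"},  # gold label
]
text = processor.apply_chat_template(conversation, tokenize=False)
inputs = processor(text=text, audios=[waveform], return_tensors="pt")

# Simplification: the LM loss here covers the whole sequence; a real run
# would mask the prompt tokens with -100 so only the answer is supervised.
out = model(**inputs, labels=inputs["input_ids"].clone())
out.loss.backward()
optimizer.step()
optimizer.zero_grad()
```

In practice this loop would run over a full labeled dataset; the point is that only the small adapter matrices are updated, which matches the low-supervision regime the study describes.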
The findings suggest that MLLMs can detect audio deepfakes effectively once trained on relevant data, but their performance depends heavily on exposure to in-domain examples during fine-tuning.
📰 Original Source: https://arxiv.org/abs/2601.00777v1
All rights and credit belong to the original publisher.