Audio Classification

Audio classification is the task of assigning a label or class to an audio clip. It can be used to recognize which command a user is giving, to detect the emotion of a statement, or to identify a speaker.

About Audio Classification

Use Cases

Command Recognition

Command recognition or keyword spotting classifies utterances into a predefined set of commands. This is often done on-device for fast response time.

As an example, using the Google Speech Commands dataset, a model can classify which of the following commands the user is saying in a given audio input:

'yes', 'no', 'up', 'down', 'left', 'right', 'on', 'off', 'stop', 'go', 'unknown', 'silence'
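To make the classification step concrete, here is a minimal sketch of how a model's raw scores could be turned into one of the commands above. The score values are made up for illustration, and `softmax` and `predict_command` are hypothetical helpers written for this example, not part of any library.

```python
import math

# The 12 Speech Commands labels listed above.
LABELS = ["yes", "no", "up", "down", "left", "right",
          "on", "off", "stop", "go", "unknown", "silence"]

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict_command(logits):
    """Return the most likely label and its probability (hypothetical helper)."""
    if len(logits) != len(LABELS):
        raise ValueError("expected one score per label")
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[best], probs[best]

# Made-up scores for illustration; a real model would compute these from audio.
example_logits = [0.1, 0.3, 2.5, 0.2, 0.0, -0.4, 0.1, 0.0, 0.6, 0.2, -1.0, -2.0]
label, prob = predict_command(example_logits)
print(label, round(prob, 2))
```

The highest score sits at the third position, so the sketch picks 'up'. An on-device keyword spotter would typically also compare the probability against a threshold before acting on the command.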

SpeechBrain models can perform this task with just a couple of lines of code!

from speechbrain.pretrained import EncoderClassifier
model = EncoderClassifier.from_hparams(

Language Identification

Datasets such as VoxLingua107 allow anyone to train language identification models for up to 107 languages! This can be extremely useful as a preprocessing step for other systems. Here's an example model trained on VoxLingua107.
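As a rough illustration of language identification used as a preprocessing step, the sketch below routes a clip to a downstream system based on the predicted language. It assumes predictions arrive as a list of label and score entries; the system names, the `route_by_language` helper, and the 0.5 threshold are all hypothetical.

```python
# Hypothetical mapping from language code to a downstream system name.
DOWNSTREAM = {"en": "english-asr", "fr": "french-asr", "de": "german-asr"}

def route_by_language(predictions, min_score=0.5, fallback="multilingual-asr"):
    """Pick a downstream system from language identification predictions.

    `predictions` is a list of {"label": ..., "score": ...} dicts.
    """
    if not predictions:
        return fallback
    top = max(predictions, key=lambda p: p["score"])
    if top["score"] < min_score:
        return fallback  # not confident enough to commit to one language
    # Languages without a dedicated system also go to the fallback.
    return DOWNSTREAM.get(top["label"], fallback)

# Made-up predictions for illustration.
preds = [{"label": "fr", "score": 0.91}, {"label": "en", "score": 0.06}]
print(route_by_language(preds))  # → french-asr
```

Falling back to a multilingual system when the score is low keeps a wrong language guess from sending audio to the wrong model.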

Emotion Recognition

Emotion recognition classifies the emotion a speaker expresses in an utterance. In addition to trying the widgets, you can use the Inference API to perform audio classification. Here is a simple example that uses a HuBERT model fine-tuned for this task.

import requests

# Placeholder: replace with your own Hugging Face access token.
API_TOKEN = "hf_xxx"
headers = {"Authorization": f"Bearer {API_TOKEN}"}
API_URL = "https://api-inference.huggingface.co/models/superb/hubert-large-superb-er"

def query(filename):
    # Send the raw audio bytes to the Inference API and return the parsed JSON.
    with open(filename, "rb") as f:
        data = f.read()
    response = requests.post(API_URL, headers=headers, data=data)
    return response.json()

data = query("sample1.flac")
# [{'label': 'neu', 'score': 0.60},
# {'label': 'hap', 'score': 0.20},
# {'label': 'ang', 'score': 0.13},
# {'label': 'sad', 'score': 0.07}]
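The response is a list of label and score entries, so picking the predicted emotion comes down to taking the entry with the highest score. The sketch below does this on the sample output above, hard-coded so that it runs offline. The expansions of the abbreviated labels are an assumption based on the label names, and `top_emotion` is a hypothetical helper.

```python
# Assumed expansions of the abbreviated labels in the sample output.
EMOTION_NAMES = {"neu": "neutral", "hap": "happy", "ang": "angry", "sad": "sad"}

def top_emotion(results):
    """Return (readable_label, score) for the highest-scoring entry."""
    best = max(results, key=lambda r: r["score"])
    # Unknown abbreviations are passed through unchanged.
    return EMOTION_NAMES.get(best["label"], best["label"]), best["score"]

# The sample response from above, hard-coded for illustration.
sample = [
    {"label": "neu", "score": 0.60},
    {"label": "hap", "score": 0.20},
    {"label": "ang", "score": 0.13},
    {"label": "sad", "score": 0.07},
]
print(top_emotion(sample))  # → ('neutral', 0.6)
```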

You can use huggingface.js to run inference with audio classification models on the Hugging Face Hub.

import { HfInference } from "@huggingface/inference";

const inference = new HfInference(HF_ACCESS_TOKEN);
await inference.audioClassification({
  data: await (await fetch("sample.flac")).blob(),
  model: "facebook/mms-lid-126",
});

Speaker Identification

Speaker identification classifies audio by who is speaking. The set of speakers is usually predefined. You can try out this task with this model. A useful dataset for this task is VoxCeleb1.
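One common way to identify a speaker from a predefined set is to compare an embedding of the utterance against one stored embedding per enrolled speaker and pick the closest match. The sketch below illustrates that idea with cosine similarity. It is an assumption about the approach, not a description of how any particular model works, and the 3-dimensional embeddings and the `identify_speaker` helper are made up for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def identify_speaker(embedding, enrolled):
    """Return the enrolled speaker whose embedding is most similar."""
    scores = {name: cosine_similarity(embedding, ref)
              for name, ref in enrolled.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]

# Made-up 3-dimensional embeddings; real speaker embeddings are much larger.
enrolled = {
    "speaker_a": [0.9, 0.1, 0.0],
    "speaker_b": [0.0, 0.8, 0.6],
}
utterance = [0.1, 0.7, 0.7]
name, score = identify_speaker(utterance, enrolled)
print(name)  # → speaker_b
```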

Solving audio classification for your own data

We have some great news! You do not have to train a model from scratch. By fine-tuning (transfer learning) a pretrained model such as Wav2Vec2 or HuBERT, you can train a well-performing model with much less labeled data. Facebook's Wav2Vec2 XLS-R, for example, is a large multilingual model pretrained on 436K hours of speech in 128 languages.
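As a small illustration of the preparation that precedes fine-tuning, a classification head needs a fixed mapping between class names and integer ids. The sketch below builds that mapping from a made-up list of training labels. In the transformers library, mappings like these are commonly passed as `label2id` and `id2label` when loading a pretrained checkpoint with `AutoModelForAudioClassification.from_pretrained`. The label names and the `build_label_maps` helper are hypothetical.

```python
def build_label_maps(labels):
    """Build label->id and id->label mappings from the training labels."""
    classes = sorted(set(labels))  # sorted so ids are stable across runs
    label2id = {name: idx for idx, name in enumerate(classes)}
    id2label = {idx: name for name, idx in label2id.items()}
    return label2id, id2label

# Made-up labels standing in for the annotations of your own dataset.
train_labels = ["dog_bark", "siren", "dog_bark", "rain", "siren"]
label2id, id2label = build_label_maps(train_labels)

# Integer targets that a training loop would consume.
encoded = [label2id[name] for name in train_labels]

print(label2id)  # → {'dog_bark': 0, 'rain': 1, 'siren': 2}
print(encoded)   # → [0, 2, 0, 1, 2]
```

The number of entries in `label2id` is also the size of the new classification layer that replaces the pretrained model's original head.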

Useful Resources

Would you like to learn more about the topic? Awesome! Here you can find some curated resources that you may find helpful!


Scripts for training


Compatible libraries


Metrics for Audio Classification