Small Vocabulary Speech Recognition for Chhattisgarhi
Presented by:
Aman Kumar Seth (CSE - 16100009)
Himanshu Singh (CSE - 16100030)
Mayank Kumar Giri (CSE - 16101032)
Content
Introduction
Data Collection
Feature Extraction
Algorithms and Approaches
GUI
Conclusion
References
Introduction
Objective - To build an Automatic Speech Recognition (ASR) system for the Chhattisgarhi language.
Current focus - Recognizing a small vocabulary of Chhattisgarhi with decent accuracy.
Data Collection
The audio samples were recorded in the “.wav” format.
It is an uncompressed format, unlike MP3, so recordings are reproduced without any loss in audio quality.
The sampling rate was 16 kHz.
A signal sampled at 16 kHz contains more information than one sampled at 8 kHz; the disadvantage of a higher sampling rate is the presence of additional noise.
The audio format was set to “mono”.
Stereo is used to create a cinematic experience with a perception of depth in sound; in mono, the voice of the speaker is amplified and other noises are minimized.
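As an illustration of these recording settings (the actual recordings were made by hand on phones; the library choice here is an assumption), a 16 kHz mono WAV clip can be captured in Python with the sounddevice and soundfile packages:

    import sounddevice as sd
    import soundfile as sf

    SAMPLE_RATE = 16000      # 16 kHz, as described above
    DURATION = 3             # seconds per clip (hypothetical)

    # Record a single-channel (mono) clip from the default microphone.
    audio = sd.rec(int(DURATION * SAMPLE_RATE), samplerate=SAMPLE_RATE,
                   channels=1, dtype="float32")
    sd.wait()                # block until the recording finishes

    # Save as an uncompressed 16-bit PCM .wav file.
    sf.write("sample.wav", audio, SAMPLE_RATE, subtype="PCM_16")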
Contd...
Initial Approach
Selecting volunteers to record 20 words.
Voice data (words) were collected manually using a phone,
and the silent parts were then trimmed from each word.
This turned out to be inefficient for collecting large amounts of data.
Smarter Approach
Completely automated the data collection process for a large dataset.
Collected sentences instead of isolated words.
Used a Python script for trimming and naming the audio files (a sketch follows).
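A minimal sketch of such a trimming script, assuming librosa (which the project already uses for features); the file names and silence threshold are illustrative, as the slides do not state them:

    import librosa
    import soundfile as sf

    # Load a recorded sentence at 16 kHz mono.
    y, sr = librosa.load("sentence_01.wav", sr=16000, mono=True)

    # Split on silence: keep intervals no quieter than top_db below the peak.
    intervals = librosa.effects.split(y, top_db=30)

    # Write each non-silent interval out as its own word file.
    for i, (start, end) in enumerate(intervals):
        sf.write(f"word_{i:02d}.wav", y[start:end], sr)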
Database Architecture
The database contains two folders:
Training Folder
One sub-folder per word.
Audio files of 20 speakers for training.
For labeling, the folder name can be used directly (see the sketch below).
Testing Folder
Same structure as the training folder, but containing fewer audio files.
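A sketch of how the folder names double as labels (paths are illustrative, not the project's actual ones):

    import os

    samples = []                                 # (file path, word label) pairs
    for word in os.listdir("Training"):          # one sub-folder per word
        word_dir = os.path.join("Training", word)
        for fname in os.listdir(word_dir):
            if fname.endswith(".wav"):
                samples.append((os.path.join(word_dir, fname), word))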
Feature Extraction
Feature extraction is the process of transforming the input data or signal into a set of features that represent the data well.
These signal-processing techniques convert the audio signals into numerical feature sets, which can then be used to train our machine learning models.
Python's librosa library is used for feature extraction.
Contd...
The following features were extracted for the audio samples (a combined extraction sketch follows the list):
MFCC
Mel Spectrogram
Chroma-STFT
Tonnetz
Spectral Contrast
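A sketch of extracting all five features with librosa and averaging each over time to get one fixed-length vector per file; the averaging step and parameter values are assumptions, since the slides do not spell them out:

    import numpy as np
    import librosa

    def extract_features(path):
        """Return one fixed-length feature vector per audio file (sketch)."""
        y, sr = librosa.load(path, sr=16000)
        stft = np.abs(librosa.stft(y))
        mfcc     = np.mean(librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40), axis=1)
        mel      = np.mean(librosa.feature.melspectrogram(y=y, sr=sr), axis=1)
        chroma   = np.mean(librosa.feature.chroma_stft(S=stft, sr=sr), axis=1)
        contrast = np.mean(librosa.feature.spectral_contrast(S=stft, sr=sr), axis=1)
        tonnetz  = np.mean(librosa.feature.tonnetz(y=librosa.effects.harmonic(y), sr=sr), axis=1)
        return np.concatenate([mfcc, mel, chroma, contrast, tonnetz])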
MFCC (Mel-Frequency Cepstral Coefficients)
The cepstrum is the result of taking the inverse Fourier transform of the logarithm of the estimated spectrum of a signal.
The power cepstrum in particular finds applications in the analysis of human speech.
In the mel-frequency cepstrum (MFC), the frequency bands are equally spaced on the mel scale, which approximates the human auditory system's response more closely than linearly spaced bands.
The MFCC function of librosa returns a 1-D feature vector of size 40.
Mel Spectrogram
A spectrogram is a visual representation of the spectrum of frequencies of sound or another signal.
Spectrograms are also known as voicegrams, sonographs, or voiceprints.
A mel spectrogram is a spectrogram whose frequency axis uses the nonlinear mel scale.
The mel scale is a scale of pitches judged by listeners to be equal in distance from one another.
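For reference, a commonly used conversion from frequency f (in Hz) to mels (the HTK variant; librosa's default uses the slightly different Slaney formulation) is:

    m = 2595 · log10(1 + f / 700)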
Chroma - stft
The term chromagram (or chroma feature) closely relates to the twelve different pitch classes.
Chroma features are also referred to as pitch class profiles.
They are a powerful tool for analyzing music whose pitches can be meaningfully categorized.
The main property of chromagrams is that they capture the melodic and harmonic characteristics of music while being robust to changes in instrumentation and timbre.
A chromagram is usually calculated from the waveform or the power spectrum.
Tonnetz
This feature detects changes in the harmonic content of musical audio signals.
A peak in the detection function indicates a transition from one harmonically stable region to another.
This algorithm can successfully detect harmonic changes, such as chord boundaries, in polyphonic audio recordings.
Spectral Contrast
Octave-based Spectral Contrast considers:
Spectral peaks
Spectral valleys
and their difference in each sub-band.
It roughly represents the relative distribution of the harmonic and non-harmonic components in the spectrum.
Unlike MFCC, which averages the spectral distribution in each sub-band and is thus prone to losing valuable spectral information, spectral contrast preserves this relative information.
Characteristics of Voices
The main characteristics of the human voice that can be used to uniquely identify it are:
Loudness
It is the magnitude of the change in air pressure.
The mel spectrogram is used to represent the loudness of an audio signal.
Pitch
It is the frequency: the number of times a pressure pattern is repeated per unit time.
Chroma-STFT, Spectral Contrast, and Tonnetz all tell us about the pitch of the audio samples.
Timbre
A general term for the distinguishable characteristics of a tone.
It is the quality of sound that makes voices sound different from each other.
It is mainly determined by the harmonic content of a sound and its dynamic characteristics.
Tonnetz gives us an idea of the timbre of the voice.
Algorithms and Approaches
Due to the lack of a large dataset, instead of a neural network we perform recognition using MFCC (Mel-Frequency Cepstral Coefficients) and DTW (Dynamic Time Warping), with the help of a classifier.
Block Diagram
(a) Training: Audio Files → MFCC → DTW → Features + Labels → Machine Learning Algorithm → Classifier Model
(b) Testing: Audio Sample → MFCC → DTW → Features → Classifier Model → Label
DTW (Dynamic Time Warping)
Dynamic Time Warping is an algorithm for measuring the similarity between two temporal sequences, which may vary in speed.
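A minimal dynamic-programming sketch of the DTW distance between two feature sequences; the frame-wise Euclidean cost and step pattern are standard choices, not necessarily the exact variant used here:

    import numpy as np

    def dtw_distance(a, b):
        """DTW distance between sequences a (n x d) and b (m x d) of frames."""
        n, m = len(a), len(b)
        D = np.full((n + 1, m + 1), np.inf)
        D[0, 0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                cost = np.linalg.norm(a[i - 1] - b[j - 1])  # frame-wise distance
                # Extend the cheapest of the three allowed warping moves.
                D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
        return D[n, m]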
References:
[1] A Review on Different Approaches for Speech Recognition System
[2] Speech Recognition using MFCC
[3] Speech Recognition using MFCC and DTW
Procedure
After data collection and preprocessing, we use MFCC feature extraction to get the feature vectors for each sample.
We use these feature vectors with DTW to calculate the distance (error value) between the sample and all others.
The class with the minimum DTW distance to the sample being predicted becomes the output (a sketch follows).
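In sketch form, this is a nearest-neighbour search under the DTW distance, reusing the dtw_distance function above (names are illustrative):

    def predict(sample_frames, train_set):
        """train_set: list of (frames, label) pairs; return the closest label."""
        best_label, best_dist = None, float("inf")
        for frames, label in train_set:
            d = dtw_distance(sample_frames, frames)
            if d < best_dist:
                best_dist, best_label = d, label
        return best_label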
Precisions of 58 Classes in DTW
Precisions
[ 80, 28, -1, 100, 66, 100, 75, 100, 100, 100, 80, 66, 100, 66, 100, 50,
100, 100, 100, 50, 27, 100, 100, 80, 40, 100, 100, 50, 100, 100, 100, 100,
100, 50, 100, 42, 100, 100, 80, 75, 100, 57, 50, 75, 100, 100, 66, 100,
66, 100, 50, 100, 100, 100, 66, 80, 100, -1]
Using Classifiers for Recognition
The initial (DTW) approach was used while the audio samples were quite small in number (800).
As the size of our dataset increased to a significant number (2300+), we switched to the machine learning approach,
where various models were trained on the feature set
and the test samples were classified into one of the 58 target classes.
1. SVM
A Support Vector Machine constructs a hyperplane, or a set of hyperplanes, in a high- or infinite-dimensional space,
which can be used for classification, regression, or other tasks such as outlier detection.
A good separation is achieved by the hyperplane that has the largest distance to the nearest training-data point of any class (the so-called functional margin).
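A sketch of this step with scikit-learn; the file names, kernel, and hyperparameters are illustrative assumptions, not the values actually used:

    import numpy as np
    from sklearn.svm import SVC
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score

    # X: one feature vector per audio file (e.g. from extract_features above);
    # y: word labels read from the folder names. File names are hypothetical.
    X, y = np.load("features.npy"), np.load("labels.npy")
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1)

    clf = SVC(kernel="rbf", C=10)        # hypothetical hyperparameters
    clf.fit(X_train, y_train)
    print("accuracy:", accuracy_score(y_test, clf.predict(X_test)))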
Precisions of 58 Classes in SVM
Precisions
[33.34, 40.0, 16.67, 66.67, 50.0, 50.0, 100.0, 100.0, 40.0, 20.0, 100.0, 33.34,
66.67, 0.0, 33.34, -1, 50.0, 100.0, 100.0, 50.0, 33.34, 100.0, -1, 33.34, 50.0,
57.14, 75.0, 50.0, 50.0, 100.0, 66.67, 50.0, 100.0, 33.34, 25.0,
33.34, 40.0, 50.0, 33.34, 50.0, 33.34, 33.34, 33.34, 37.5, 75.0, -1, 0.0, 37.5,
33.34, 33.34, 100.0, 66.67, 100.0, 100.0, 50.0, 100.0, 100.0, 0.0]
2. Random Forest
Random forests, or random decision forests, are an ensemble learning method for classification and regression.
The “forest” it builds is an ensemble of decision trees, most often trained with the “bagging” method.
An advantage of random forest is that it can be used for both classification and regression problems.
It adds additional randomness to the model.
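A corresponding scikit-learn sketch; n_estimators is illustrative (the next slide plots number of trees against accuracy), and the data files are the same hypothetical ones as in the SVM sketch:

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    X, y = np.load("features.npy"), np.load("labels.npy")  # hypothetical files
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1)

    # max_features="sqrt" gives each split a random subset of the features.
    rf = RandomForestClassifier(n_estimators=200, max_features="sqrt")
    rf.fit(X_train, y_train)
    print("accuracy:", rf.score(X_test, y_test))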
No. of trees vs Accuracy
Precisions of 58 Classes in Random Forest
Precisions
[100.0, -1, -1, 66.67, 100.0, 100.0, 100.0, 100.0, 100.0, 42.85, 50.0,
50.0, 40.0, 66.67, 26.67, 100.0, 16.67, 57.14, 100.0,
57.14, 33.34, 75.0, 66.67, 40.0, 100.0, 60.0, 66.67, 33.34, 100.0,
66.67, 50.0, 75.0, 44.45, -1, 100.0, -1, 33.34, -1, 0.0, 66.67, 37.5,
50.0, -1, 33.34, 57.14, -1, 0.0, 0.0, 0.0, 57.14, 40.0, 100.0, 20.0, 50.0,
66.67, 100.0, 100.0, 33.34]
Contd...
In Random Forest, only a random subset of the features is taken into consideration by the
algorithm for splitting a node.
On our dataset the random forest produced satisfactory results: the maximum accuracy achieved was 71%.
This could likely be improved further by increasing the size of the training dataset.
3. Artificial Neural Network
A neural network itself is not an algorithm, but rather a framework for many different machine learning algorithms to work together and process complex data inputs.
An ANN is based on a collection of connected units or nodes called artificial neurons.
In ANN implementations, the signal at a connection between artificial neurons is a real number, and the output of each artificial neuron is computed by some non-linear function of the sum of its inputs.
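A sketch of a small feed-forward network with scikit-learn's MLPClassifier; the architecture and the data files are illustrative assumptions (the slides do not state which framework was used):

    import numpy as np
    from sklearn.neural_network import MLPClassifier
    from sklearn.model_selection import train_test_split

    X, y = np.load("features.npy"), np.load("labels.npy")  # hypothetical files
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1)

    # Two hidden layers; each neuron applies a non-linearity (ReLU) to a
    # weighted sum of its inputs, as described above.
    mlp = MLPClassifier(hidden_layer_sizes=(256, 128), activation="relu",
                        max_iter=500)
    mlp.fit(X_train, y_train)
    print("accuracy:", mlp.score(X_test, y_test))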
Contd...
The basic requirement of this approach is a huge dataset (a minimum of roughly 30,000 audio samples), so it did not produce good results.
We had only 2300+ samples, and even after bootstrapping the minimum threshold could not be satisfied, so the ANN was not included in the final model.
Epochs vs Accuracy
Precisions of 58 Classes in ANN
Precisions
[-1, 25.0, 0.0, 37.5, 60.0, 100.0, 33.34, 100.0, 28.57, 100.0, 40.0,
33.34, 0.0, 50.0, -1, 20.0, 50.0, 33.34, 60.0, 66.67, -1, 11.11, 100.0,
50.0, 50.0, 37.5, 40.0, 40.0, 66.67, -1, 100.0, 0.0, 10.52, -1, 50.0, -1,
16.67, 25.0, 0.0, 0.0, 66.67, 30.76, 60.0, 25.0, 25.0, -1, 0.0, 50.0,
100.0, 40.0, 100.0, 100.0, 42.85, 100.0, 0.0, 60.0, 80.0, -1]
Comparison of all Accuracies
Windows Application
We have created a multi-threaded Windows application with the following features:
English/Hindi speech-to-text conversion using the Google Speech API.
Small-vocabulary Chhattisgarhi word recognition from a selected input audio file.
For programming the user interface, we used the Tkinter package in Python.
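A sketch of the English/Hindi transcription feature, assuming the SpeechRecognition package (a common Python wrapper for the Google Web Speech API; the slides do not name the exact client library):

    import speech_recognition as sr

    recognizer = sr.Recognizer()
    with sr.AudioFile("input.wav") as source:   # the selected audio file
        audio = recognizer.record(source)

    # language="hi-IN" for Hindi; "en-IN" for Indian English.
    text = recognizer.recognize_google(audio, language="hi-IN")
    print(text)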
Accuracy on the 5-word Dataset
So far, we have worked on 5 words. We collected data from 20 speakers, each word recorded twice (200 samples in total).
Trained on 180 of these 200 samples (18 speakers).
Tested on the remaining 20 samples (2 speakers), achieving 85% accuracy (17/20 correctly recognized).
Accuracy on the 20-word Dataset
So far, we have worked on 20 words. We collected data from 20 speakers, each word recorded twice.
Trained on 720 samples (18 speakers) out of the total of 800 samples.
Tested on the remaining 80 samples (2 speakers), achieving 87.5% accuracy (70/80 correctly recognized).
Accuracy on the 58-word Dataset
So far, we have worked on 58 words. We collected data from 20 speakers, each word recorded twice.
Trained on 2208 samples (18 speakers) out of the total of 2320 samples.
Tested on the remaining 232 samples (2 speakers), achieving 75.43% accuracy (175/232 correctly recognized).
Graphical User Interface (GUI)
Developed a Windows application using Python.