Small Vocabulary Speech Recognition for Chhattisgarhi
Presented by:
Aman Kumar Seth (CSE - 16100009)
Himanshu Singh (CSE - 16100030)
Mayank Kumar Giri (CSE - 16101032)
Content
Introduction
Data Collection
Feature Extraction
Algorithms and Approaches
GUI
Conclusion
References
Introduction
Objective - To build an Automatic Speech Recognition (ASR) system for the Chhattisgarhi language.
Current focus - Recognizing a small vocabulary of Chhattisgarhi with decent accuracy.
Data Collection
The audio samples were recorded in the “.wav” format.
It is an uncompressed format, unlike MP3, so recordings are reproduced without any loss in audio quality.
The sampling rate was 16 kHz.
A signal sampled at 16 kHz contains more information than one sampled at 8 kHz; the disadvantage of a higher sampling rate is the presence of additional noise.
The audio format was set to “mono”.
Stereo is used to create a cinematic experience with a perception of depth in sound; in mono, the voice of the speaker is amplified and other noises are minimized.
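As an illustration of these recording settings (the actual recordings were made by hand on phones; the library choice here is an assumption), a 16 kHz mono WAV clip can be captured in Python with the sounddevice and soundfile packages:

    import sounddevice as sd
    import soundfile as sf

    SAMPLE_RATE = 16000      # 16 kHz, as described above
    DURATION = 3             # seconds per clip (hypothetical)

    # Record a single-channel (mono) clip from the default microphone.
    audio = sd.rec(int(DURATION * SAMPLE_RATE), samplerate=SAMPLE_RATE,
                   channels=1, dtype="float32")
    sd.wait()                # block until the recording finishes

    # Save as an uncompressed 16-bit PCM .wav file.
    sf.write("sample.wav", audio, SAMPLE_RATE, subtype="PCM_16")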
Contd...
Initial Approach
Selecting volunteers to record 20 words.
Voice data (words) were collected manually using a phone,
and the silent parts were then trimmed from each word.
This turned out to be inefficient for collecting large amounts of data.
Smarter Approach
Completely automated the data collection process for a large dataset.
Collected sentences instead of isolated words.
Used a Python script for trimming and naming the audio files (a sketch follows).
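A minimal sketch of such a trimming script, assuming librosa (which the project already uses for features); the file names and silence threshold are illustrative, as the slides do not state them:

    import librosa
    import soundfile as sf

    # Load a recorded sentence at 16 kHz mono.
    y, sr = librosa.load("sentence_01.wav", sr=16000, mono=True)

    # Split on silence: keep intervals no quieter than top_db below the peak.
    intervals = librosa.effects.split(y, top_db=30)

    # Write each non-silent interval out as its own word file.
    for i, (start, end) in enumerate(intervals):
        sf.write(f"word_{i:02d}.wav", y[start:end], sr)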
Database Architecture
The database contains two folders:
Training Folder
One sub-folder per word.
Audio files of 20 speakers for training.
For labeling, the folder name can be used directly (see the sketch below).
Testing Folder
Same structure as the training folder, but containing fewer audio files.
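A sketch of how the folder names double as labels (paths are illustrative, not the project's actual ones):

    import os

    samples = []                                 # (file path, word label) pairs
    for word in os.listdir("Training"):          # one sub-folder per word
        word_dir = os.path.join("Training", word)
        for fname in os.listdir(word_dir):
            if fname.endswith(".wav"):
                samples.append((os.path.join(word_dir, fname), word))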
Feature Extraction
Feature extraction is the process of transforming the input data or signal into a set of features that represent the data well.
These signal-processing techniques convert the audio signals into numerical feature sets, which can then be used to train our machine learning models.
Python's librosa library is used for feature extraction.
Contd...
The following features were extracted for the audio samples (a combined extraction sketch follows the list):
MFCC
Mel Spectrogram
Chroma-STFT
Tonnetz
Spectral Contrast
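A sketch of extracting all five features with librosa and averaging each over time to get one fixed-length vector per file; the averaging step and parameter values are assumptions, since the slides do not spell them out:

    import numpy as np
    import librosa

    def extract_features(path):
        """Return one fixed-length feature vector per audio file (sketch)."""
        y, sr = librosa.load(path, sr=16000)
        stft = np.abs(librosa.stft(y))
        mfcc     = np.mean(librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40), axis=1)
        mel      = np.mean(librosa.feature.melspectrogram(y=y, sr=sr), axis=1)
        chroma   = np.mean(librosa.feature.chroma_stft(S=stft, sr=sr), axis=1)
        contrast = np.mean(librosa.feature.spectral_contrast(S=stft, sr=sr), axis=1)
        tonnetz  = np.mean(librosa.feature.tonnetz(y=librosa.effects.harmonic(y), sr=sr), axis=1)
        return np.concatenate([mfcc, mel, chroma, contrast, tonnetz])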
MFCC (Mel-Frequency Cepstral Coefficients)
The cepstrum is the result of taking the inverse Fourier transform of the logarithm of the estimated spectrum of a signal.
The power cepstrum in particular finds applications in the analysis of human speech.
In the mel-frequency cepstrum (MFC), the frequency bands are equally spaced on the mel scale, which approximates the human auditory system's response more closely than linearly spaced bands.
The MFCC function of librosa returns a 1-D feature vector of size 40.
Mel Spectrogram
A spectrogram is a visual representation of the spectrum of frequencies of sound or another signal.
Spectrograms are also known as voicegrams, sonographs, or voiceprints.
A mel spectrogram is a spectrogram whose frequency axis uses the nonlinear mel scale.
The mel scale is a scale of pitches judged by listeners to be equal in distance from one another.
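For reference, a commonly used conversion from frequency f (in Hz) to mels (the HTK variant; librosa's default uses the slightly different Slaney formulation) is:

    m = 2595 · log10(1 + f / 700)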
Chroma - stft
The term chromagram (or chroma feature) closely relates to the twelve different pitch classes.
Chroma features are also referred to as pitch class profiles.
They are a powerful tool for analyzing music whose pitches can be meaningfully categorized.
The main property of chromagrams is that they capture the melodic and harmonic characteristics of music while being robust to changes in instrumentation and timbre.
A chromagram is usually calculated from the waveform or the power spectrum.
Tonnetz
This feature detects changes in the harmonic content of musical audio signals.
A peak in the detection function indicates a transition from one harmonically stable region to another.
This algorithm can successfully detect harmonic changes, such as chord boundaries, in polyphonic audio recordings.
Spectral Contrast
Octave-based Spectral Contrast considers:
Spectral peaks
Spectral valleys
and their difference in each sub-band.
It roughly represents the relative distribution of the harmonic and non-harmonic components in the spectrum.
Unlike MFCC, which averages the spectral distribution in each sub-band and is thus prone to losing valuable spectral information, spectral contrast preserves this relative information.
Characteristics of Voices
The main characteristics of the human voice that can be used to uniquely identify it are:
Loudness
It is the magnitude of the change in air pressure.
The mel spectrogram is used to represent the loudness of an audio signal.
Pitch
It is the frequency: the number of times a pressure pattern is repeated per unit time.
Chroma-STFT, Spectral Contrast, and Tonnetz all tell us about the pitch of the audio samples.
Timbre
A general term for the distinguishable characteristics of a tone.
It is the quality of sound that makes voices sound different from each other.
It is mainly determined by the harmonic content of a sound and its dynamic characteristics.
Tonnetz gives us an idea of the timbre of the voice.
Algorithms and Approaches
Due to the lack of a large dataset, instead of a neural network we perform recognition using MFCC (Mel-Frequency Cepstral Coefficients) and DTW (Dynamic Time Warping), with the help of a classifier.
Block Diagram
(a) Training: Audio Files → MFCC → DTW → Features + Labels → Machine Learning Algorithm → Classifier Model
(b) Testing: Audio Sample → MFCC → DTW → Features → Classifier Model → Label
DTW (Dynamic Time Warping)
Dynamic Time Warping is an algorithm for measuring the similarity between two temporal sequences, which may vary in speed.
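A minimal dynamic-programming sketch of the DTW distance between two feature sequences; the frame-wise Euclidean cost and step pattern are standard choices, not necessarily the exact variant used here:

    import numpy as np

    def dtw_distance(a, b):
        """DTW distance between sequences a (n x d) and b (m x d) of frames."""
        n, m = len(a), len(b)
        D = np.full((n + 1, m + 1), np.inf)
        D[0, 0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                cost = np.linalg.norm(a[i - 1] - b[j - 1])  # frame-wise distance
                # Extend the cheapest of the three allowed warping moves.
                D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
        return D[n, m]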
References:
[1] A Review on Different Approaches for Speech Recognition System
[2] Speech Recognition using MFCC
[3] Speech Recognition using MFCC and DTW
Procedure
After data collection and preprocessing, we use MFCC feature extraction to get the feature vectors for each sample.
We use these feature vectors with DTW to calculate the distance (error value) between the sample and all others.
The class with the minimum DTW distance to the sample being predicted becomes the output (a sketch follows).
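In sketch form, this is a nearest-neighbour search under the DTW distance, reusing the dtw_distance function above (names are illustrative):

    def predict(sample_frames, train_set):
        """train_set: list of (frames, label) pairs; return the closest label."""
        best_label, best_dist = None, float("inf")
        for frames, label in train_set:
            d = dtw_distance(sample_frames, frames)
            if d < best_dist:
                best_dist, best_label = d, label
        return best_label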
Precisions of 58 Classes in DTW
Precisions
[ 80, 28, -1, 100, 66, 100, 75, 100, 100, 100, 80, 66, 100, 66, 100, 50,
100, 100, 100, 50, 27, 100, 100, 80, 40, 100, 100, 50, 100, 100, 100, 100,
100, 50, 100, 42, 100, 100, 80, 75, 100, 57, 50, 75, 100, 100, 66, 100,
66, 100, 50, 100, 100, 100, 66, 80, 100, -1]
Using Classifiers for Recognition
The initial (DTW) approach was used while the audio samples were quite small in number (800).
As the size of our dataset increased to a significant number (2300+), we switched to the machine learning approach,
where various models were trained on the feature set
and the test samples were classified into one of the 58 target classes.
1. SVM
A Support Vector Machine constructs a hyperplane, or a set of hyperplanes, in a high- or infinite-dimensional space,
which can be used for classification, regression, or other tasks such as outlier detection.
A good separation is achieved by the hyperplane that has the largest distance to the nearest training-data point of any class (the so-called functional margin).
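A sketch of this step with scikit-learn; the file names, kernel, and hyperparameters are illustrative assumptions, not the values actually used:

    import numpy as np
    from sklearn.svm import SVC
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score

    # X: one feature vector per audio file (e.g. from extract_features above);
    # y: word labels read from the folder names. File names are hypothetical.
    X, y = np.load("features.npy"), np.load("labels.npy")
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1)

    clf = SVC(kernel="rbf", C=10)        # hypothetical hyperparameters
    clf.fit(X_train, y_train)
    print("accuracy:", accuracy_score(y_test, clf.predict(X_test)))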
Precisions of 58 Classes in SVM
Precisions
[33.34, 40.0, 16.67, 66.67, 50.0, 50.0, 100.0, 100.0, 40.0, 20.0, 100.0, 33.34,
66.67, 0.0, 33.34, -1, 50.0, 100.0, 100.0, 50.0, 33.34, 100.0, -1, 33.34, 50.0,
57.14, 75.0, 50.0, 50.0, 100.0, 66.67, 50.0, 100.0, 33.34, 25.0,
33.34, 40.0, 50.0, 33.34, 50.0, 33.34, 33.34, 33.34, 37.5, 75.0, -1, 0.0, 37.5,
33.34, 33.34, 100.0, 66.67, 100.0, 100.0, 50.0, 100.0, 100.0, 0.0]
2. Random Forest
Random forests, or random decision forests, are an ensemble learning method for classification and regression.
The “forest” it builds is an ensemble of decision trees, most often trained with the “bagging” method.
An advantage of random forest is that it can be used for both classification and regression problems.
It adds additional randomness to the model.
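A corresponding scikit-learn sketch; n_estimators is illustrative (the next slide plots number of trees against accuracy), and the data files are the same hypothetical ones as in the SVM sketch:

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    X, y = np.load("features.npy"), np.load("labels.npy")  # hypothetical files
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1)

    # max_features="sqrt" gives each split a random subset of the features.
    rf = RandomForestClassifier(n_estimators=200, max_features="sqrt")
    rf.fit(X_train, y_train)
    print("accuracy:", rf.score(X_test, y_test))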
No. of trees vs Accuracy
Precisions of 58 Classes in Random Forest
Precisions
[100.0, -1, -1, 66.67, 100.0, 100.0, 100.0, 100.0, 100.0, 42.85, 50.0,
50.0, 40.0, 66.67, 26.67, 100.0, 16.67, 57.14, 100.0,
57.14, 33.34, 75.0, 66.67, 40.0, 100.0, 60.0, 66.67, 33.34, 100.0,
66.67, 50.0, 75.0, 44.45, -1, 100.0, -1, 33.34, -1, 0.0, 66.67, 37.5,
50.0, -1, 33.34, 57.14, -1, 0.0, 0.0, 0.0, 57.14, 40.0, 100.0, 20.0, 50.0,
66.67, 100.0, 100.0, 33.34]
Contd...
In Random Forest, only a random subset of the features is taken into consideration by the
algorithm for splitting a node.
On our dataset the random forest produced satisfactory results: the maximum accuracy achieved was 71%.
This could likely be improved further by increasing the size of the training dataset.
3. Artificial Neural Network
A neural network itself is not an algorithm, but rather a framework for many different machine learning algorithms to work together and process complex data inputs.
An ANN is based on a collection of connected units or nodes called artificial neurons.
In ANN implementations, the signal at a connection between artificial neurons is a real number, and the output of each artificial neuron is computed by some non-linear function of the sum of its inputs.
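A sketch of a small feed-forward network with scikit-learn's MLPClassifier; the architecture and the data files are illustrative assumptions (the slides do not state which framework was used):

    import numpy as np
    from sklearn.neural_network import MLPClassifier
    from sklearn.model_selection import train_test_split

    X, y = np.load("features.npy"), np.load("labels.npy")  # hypothetical files
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1)

    # Two hidden layers; each neuron applies a non-linearity (ReLU) to a
    # weighted sum of its inputs, as described above.
    mlp = MLPClassifier(hidden_layer_sizes=(256, 128), activation="relu",
                        max_iter=500)
    mlp.fit(X_train, y_train)
    print("accuracy:", mlp.score(X_test, y_test))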
Contd...
The basic requirement of this approach is a huge dataset (a minimum of roughly 30,000 audio samples), so it did not produce good results.
We had only 2300+ samples, and even after bootstrapping the minimum threshold could not be satisfied, so the ANN was not included in the final model.
Epochs vs Accuracy
Precisions of 58 Classes in ANN
Precisions
[-1, 25.0, 0.0, 37.5, 60.0, 100.0, 33.34, 100.0, 28.57, 100.0, 40.0,
33.34, 0.0, 50.0, -1, 20.0, 50.0, 33.34, 60.0, 66.67, -1, 11.11, 100.0,
50.0, 50.0, 37.5, 40.0, 40.0, 66.67, -1, 100.0, 0.0, 10.52, -1, 50.0, -1,
16.67, 25.0, 0.0, 0.0, 66.67, 30.76, 60.0, 25.0, 25.0, -1, 0.0, 50.0,
100.0, 40.0, 100.0, 100.0, 42.85, 100.0, 0.0, 60.0, 80.0, -1]
Comparison of all Accuracies
Windows Application
We have created a multi-threaded Windows application with the following features:
English/Hindi speech-to-text conversion using the Google Speech API.
Small-vocabulary Chhattisgarhi word recognition from a selected input audio file.
For programming the user interface, we used the Tkinter package in Python.
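A sketch of the English/Hindi transcription feature, assuming the SpeechRecognition package (a common Python wrapper for the Google Web Speech API; the slides do not name the exact client library):

    import speech_recognition as sr

    recognizer = sr.Recognizer()
    with sr.AudioFile("input.wav") as source:   # the selected audio file
        audio = recognizer.record(source)

    # language="hi-IN" for Hindi; "en-IN" for Indian English.
    text = recognizer.recognize_google(audio, language="hi-IN")
    print(text)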
Accuracy on the 5-word Dataset
So far, we have worked on 5 words. We collected data from 20 speakers, each word recorded twice (200 samples in total).
Trained on 180 of these 200 samples (18 speakers).
Tested on the remaining 20 samples (2 speakers), achieving 85% accuracy (17/20 correctly recognized).
Accuracy on the 20-word Dataset
So far, we have worked on 20 words. We collected data from 20 speakers, each word recorded twice.
Trained on 720 samples (18 speakers) out of the total of 800 samples.
Tested on the remaining 80 samples (2 speakers), achieving 87.5% accuracy (70/80 correctly recognized).
Accuracy on the 58-word Dataset
So far, we have worked on 58 words. We collected data from 20 speakers, each word recorded twice.
Trained on 2208 samples (18 speakers) out of the total of 2320 samples.
Tested on the remaining 232 samples (2 speakers), achieving 75.43% accuracy (175/232 correctly recognized).
Graphical User Interface (GUI)
Developed a Windows application using Python.