Detect Multi Spoken Languages Using Bidirectional Long Short-Term Memory

Abstract


1. INTRODUCTION
Humans are currently the most accurate language detection system in the world: a listener can tell within seconds whether a spoken language is their mother tongue. When the language is unfamiliar, listeners can often make subjective comparisons with a language they know well to draw on implicit knowledge [1]. Of the many deep neural network techniques available, this research focuses on one recurrent neural network technique, Bidirectional Long Short-Term Memory (BiLSTM), and uses the TensorFlow library to build and train a deep neural network model that detects the speaker's language from recorded audio files [2]. The problem addressed is how to build a language detection system whose acoustic model detects the language regardless of the speaker's gender, accent, or pronunciation. The aim of the research is to build an efficient intelligent computer system that detects the speaker's language from audio files, using the best available methods for preparing the audio files and extracting features with the MFCCs algorithm. To the best of our knowledge, no previous study has used the Arabic or Kurdish languages. The proposed system obtains a detection accuracy of 100% between two languages (Arabic and English) and 99.19% among three languages (Arabic, English, and Kurdish), with samples from speakers of both sexes. The structure of this paper is as follows: Section 2 presents related work on language detection methods. Section 3 briefly explains feature extraction using Mel-Frequency Cepstral Coefficients (MFCCs). Section 4 briefly explains the Bidirectional Long Short-Term Memory (BiLSTM) algorithm used in the proposed model. Section 5 describes the details of the proposed system. Section 6 presents the results of the proposed model. Finally, Section 7 presents the conclusions.

Related Works
Research in recent years has addressed the detection of speaker languages around the world, and many studies have been conducted on the subject. The findings of previous researchers are summarized as follows:
• In 2016, Ruben Zazo and others proposed an automatic language recognition system using a long short-term memory (LSTM) algorithm to classify eight languages (English, Spanish, Dari, French, Pashto, Russian, Urdu, and Mandarin Chinese). The data comprised about 200 hours of recorded broadcasts from the Voice of America news channel, and the researchers used the MFCCs algorithm to extract the features. The proposed system reached an accuracy of 50% when the sample length was half a second and 70% when the sample length was three seconds [3].
• In 2019, Andreas Lindgren used a convolutional neural network (CNN) to classify two languages (English and French). The researcher used MFCCs to extract the features with different numbers of coefficients (5, 6, 7, 8, 10, 13, and 17) and obtained different results depending on the number of coefficients; the best classification accuracy, 92.03%, was obtained with 13 coefficients. The audio files for the classification and testing process were obtained from VoxForge [4].
• In 2019, Shauna Revay and Matthew Teschke used language identification for audio spectrograms (LIFAS), in which spectrograms of raw audio signals are the input to a convolutional neural network (CNN) used for language identification. The proposed method can classify effectively from short audio clips (about 4 seconds); the audio samples were obtained from the VoxForge.org site. The classification accuracy between two languages was 97%, while the accuracy among six languages (English, German, Italian, French, Spanish, and Russian) reached 89% [5].
• In 2020, Lucas Rafael and Arnaldo Candido proposed an automatic language identification model based on a convolutional neural network (CNN) trained on audio spectrograms for three languages (Portuguese, English, and Spanish). The sample length used in the system is five seconds per sample. The proposed model identified the languages with an accuracy of 96.8% on data from the database used for training, while the system obtained an accuracy of 83% on new test data [6].
• In 2021, Herman Groenbroek introduced a methodology called VGGish and used it to classify six languages (English, Dutch, German, French, Spanish, and Portuguese) in a music data set called the 6L5K Music Corpus. The audio data set was obtained by taking 3-second excerpts from the 6L5K corpus, and VGGish was compared with a deep neural network (DNN). Features were extracted using Mel-frequency cepstral coefficients (MFCCs). The results indicate that, for language discrimination in songs, the DNN had a training accuracy of 35% on the six languages, while VGGish obtained a training accuracy of 41% on the same six-class data set. On test data, the accuracy of the DNN was 18.1%, while that of VGGish was 35.2% [7].

Features Extraction
Pattern recognition for any type of data requires representing the data in a simplified way by deriving only useful, non-redundant values or features, which simplifies the later steps of any computer system. These important properties can be arranged into a feature matrix that contains the information from the input data relevant to the required task, so that this more compact representation can be used instead of the complete raw data [8].
To classify any incoming signal, some attributes are extracted from it. The set of D extracted features is represented as a D-dimensional vector $C = [C_1, C_2, \ldots, C_D]^T$, called the feature vector. The main point is that the selected features must carry the information needed to discriminate correctly: they must measure characteristics of the signal whose values allow different sound classes to be distinguished [9].

Mel Frequency Cepstral Coefficients (MFCCs)
The vast majority of speaker language detection systems today, as well as many classification algorithms, use features based on either Mel Frequency Cepstral Coefficients (MFCCs) or perceptual linear predictive (PLP) analysis of speech. MFCCs are a compressed representation of the audio signal spectrum that takes into account the nonlinear human perception of pitch. To extract MFCCs, Fast Fourier Transform bins are combined according to a set of triangular weighting functions that approximate human pitch perception. The spectrum is filtered with a bank of triangular band-pass filters, and a discrete cosine transform (DCT) is then applied to obtain the MFCCs [9]. The human peripheral auditory system provides the basis for MFCCs. Humans do not perceive the frequency content of speech signals on a linear scale. Thus, for each tone with an actual frequency measured in hertz, a subjective pitch is measured on a scale called the Mel scale. The Mel scale uses linear frequency spacing below 1000 Hz and logarithmic spacing above 1 kHz. As a reference point, the pitch of a 1 kHz tone, 40 dB above the perceptual hearing threshold, is defined as 1000 mels [10], as shown below in Figure (1) [11].
The common formula for converting from the frequency scale to the Mel scale is [10]:
$$f_{mel} = 2595 \log_{10}\left(1 + \frac{f_{Hz}}{700}\right)$$
where $f_{mel}$ is the frequency in mels and $f_{Hz}$ is the normal frequency in hertz.
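As a minimal illustration of this conversion, the following sketch assumes the widely used form $f_{mel} = 2595 \log_{10}(1 + f_{Hz}/700)$. The function names `hz_to_mel` and `mel_to_hz` are illustrative and are not taken from the proposed system.

```python
import math


def hz_to_mel(f_hz: float) -> float:
    """Convert hertz to mels, assuming the 2595 * log10(1 + f / 700) form."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(f_mel: float) -> float:
    """Invert hz_to_mel by solving the same formula for the frequency in hertz."""
    return 700.0 * (10.0 ** (f_mel / 2595.0) - 1.0)


# Print a few conversions; equal steps in hertz shrink on the Mel scale
# as the frequency grows.
for f in (250.0, 500.0, 1000.0, 2000.0, 4000.0, 8000.0):
    print(f"{f:7.0f} Hz -> {hz_to_mel(f):7.1f} mel")
```

Under this form, 1000 Hz maps to approximately 1000 mels, which agrees with the reference point described above.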
As shown in Figure (2), MFCCs consist of seven computational steps. Each step has its own function and its own mathematical method as shown below [10]:

Pre-emphasis
In this step the signal is passed through a high-pass filter, which pre-emphasizes the speech input to boost the energy of its high-frequency portion, which is attenuated during speech production [12]:
$$Y(n) = X(n) - a\,X(n-1)$$
where $X(n)$ is the input signal, $a$ is the pre-emphasis coefficient (typically between 0.9 and 1.0), and $Y(n)$ represents the output signal after the pre-emphasis operation.
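The pre-emphasis filter can be sketched as a first-order difference, assuming the common form $Y(n) = X(n) - a\,X(n-1)$. The coefficient value 0.97 and the function name `pre_emphasis` are assumptions made for illustration; the value used in the proposed system is not stated here.

```python
def pre_emphasis(signal, alpha=0.97):
    """Apply y[n] = x[n] - alpha * x[n-1]; the first sample is passed through.

    alpha = 0.97 is an assumed, commonly quoted value, not one taken from the paper.
    """
    if not signal:
        return []
    out = [signal[0]]
    for n in range(1, len(signal)):
        out.append(signal[n] - alpha * signal[n - 1])
    return out


# A constant (zero-frequency) signal is almost cancelled, while a rapidly
# alternating (high-frequency) signal is amplified.
print(pre_emphasis([1.0, 1.0, 1.0, 1.0]))
print(pre_emphasis([1.0, -1.0, 1.0, -1.0]))
```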

Framing
Framing is the process of segmenting the audio samples into small frames with a length of 20-40 milliseconds. The speech signal is divided into frames of N samples, with adjacent frames separated by M samples (M < N). Typical values are M = 100 and N = 256. An optional overlap equal to half or a third of the frame size may be used to smooth the transition from one frame to the next [10].
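The framing step can be sketched as follows, using the values N = 256 and M = 100 quoted above as defaults. The function name `frame_signal` and the choice to drop trailing samples that do not fill a whole frame are assumptions of this sketch.

```python
def frame_signal(signal, frame_len=256, hop=100):
    """Split a sequence into overlapping frames of frame_len samples.

    Consecutive frames start hop samples apart (hop < frame_len gives overlap).
    Trailing samples that do not fill a whole frame are dropped (an assumption).
    """
    if frame_len <= 0 or hop <= 0:
        raise ValueError("frame_len and hop must be positive")
    if len(signal) < frame_len:
        return []
    n_frames = 1 + (len(signal) - frame_len) // hop
    return [signal[i * hop:i * hop + frame_len] for i in range(n_frames)]


frames = frame_signal(list(range(1000)))
print(len(frames), len(frames[0]), frames[1][0])
```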

Hamming Window
This step applies a window to each individual frame to reduce signal discontinuities at the beginning and end of the frame; the Hamming window is used as the window shape. If we define the window as $W(n)$, $0 \le n \le N-1$, where N is the number of samples in each frame, the Hamming window equation is [10]:
$$W(n) = 0.54 - 0.46 \cos\left(\frac{2\pi n}{N-1}\right), \quad 0 \le n \le N-1$$
The result of windowing is then given by the following equation [10]:
$$Y(n) = X(n)\,W(n)$$
where $X(n)$ is the input signal and $Y(n)$ is the output signal.
The Hamming window is the most commonly used window shape in speech recognition technology. It is chosen with the next block in the feature extraction processing chain in mind, because it integrates all the closest frequency lines [11]. The Hamming window tapers the signal value towards zero at the window boundaries and so avoids discontinuities, which is why it is used to extract MFCCs [10].
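The windowing step can be illustrated with the standard Hamming definition, $W(n) = 0.54 - 0.46\cos(2\pi n/(N-1))$, which is assumed here. The function names are illustrative only.

```python
import math


def hamming_window(n_samples):
    """Return the Hamming window W(n) = 0.54 - 0.46 * cos(2*pi*n / (N - 1))."""
    if n_samples == 1:
        return [1.0]  # a single-sample window avoids dividing by N - 1 = 0
    return [
        0.54 - 0.46 * math.cos(2.0 * math.pi * n / (n_samples - 1))
        for n in range(n_samples)
    ]


def apply_window(frame):
    """Multiply a frame sample by sample with a Hamming window of equal length."""
    window = hamming_window(len(frame))
    return [x * w for x, w in zip(frame, window)]


# The window is close to 0.08 at both ends and 1.0 in the middle, so the
# frame edges are attenuated and discontinuities are reduced.
print([round(w, 3) for w in hamming_window(5)])
```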

Fast Fourier Transform
The Fast Fourier Transform (FFT) is an algorithm that computes the Discrete Fourier Transform (DFT), which converts the signal from its original domain (usually time or space) to a representation in the frequency domain.

Mel Filter Bank Processing
The frequency range in the Fast Fourier Transform spectrum is very wide, and the perception of the audio signal does not follow a linear scale. The filter bank is therefore spaced according to the Mel scale, as shown in Figure 6 [4].
Figure 6 shows a set of triangular filters used to compute a weighted sum of spectral components so that the output approximates a Mel scale. The magnitude frequency response of each filter is triangular: it equals 1 at the center frequency and decreases linearly to zero at the center frequencies of the two adjacent filters. The output of each filter is therefore the sum of its filtered spectral components. The following equation is then used to calculate the Mel value for a given frequency f in Hz [12]:
$$F(Mel) = 2595 \log_{10}\left(1 + \frac{f}{700}\right)$$
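A triangular Mel filter bank of this kind can be sketched as below. The sketch assumes the conversion $2595 \log_{10}(1 + f/700)$, filters spanning 0 Hz to half the sample rate, and center frequencies equally spaced on the Mel scale; the function names and the example sizes are illustrative, not the settings of the proposed system.

```python
import math


def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_filterbank(n_filters, n_fft, sample_rate):
    """Build n_filters triangular filters over the n_fft // 2 + 1 FFT bins."""
    low, high = hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0)
    # n_filters + 2 points equally spaced in mels: two outer edges plus the centers.
    step = (high - low) / (n_filters + 1)
    hz_points = [mel_to_hz(low + i * step) for i in range(n_filters + 2)]
    bin_freqs = [k * sample_rate / n_fft for k in range(n_fft // 2 + 1)]
    bank = []
    for j in range(1, n_filters + 1):
        left, center, right = hz_points[j - 1], hz_points[j], hz_points[j + 1]
        row = []
        for f in bin_freqs:
            if left < f <= center:
                row.append((f - left) / (center - left))    # rising edge
            elif center < f < right:
                row.append((right - f) / (right - center))  # falling edge
            else:
                row.append(0.0)
        bank.append(row)
    return bank


def apply_filterbank(power_spectrum, bank):
    """Each filter output is the weighted sum of the spectral components it covers."""
    return [sum(w * p for w, p in zip(row, power_spectrum)) for row in bank]


bank = mel_filterbank(n_filters=10, n_fft=512, sample_rate=22050)
print(len(bank), len(bank[0]))
```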

Discrete Cosine Transform
The Mel spectrum provides a good representation of the local spectral properties of the signal for a given analysis frame. In this step, the log Mel spectrum is converted back into the time domain using the Discrete Cosine Transform (DCT). The result is called the Mel-frequency cepstral coefficients (MFCCs), and the set of coefficients is called the acoustic vector, as shown in the following equation [13]:
$$C_n = \sum_{k=1}^{K} (\log S_k) \cos\left[n\left(k - \frac{1}{2}\right)\frac{\pi}{K}\right]$$
where $n = 1, 2, \ldots, K$, and $S_k$, $k = 1, 2, \ldots, K$, are the outputs of the previous step.
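The cosine transform of the log filter-bank outputs can be sketched directly from its usual definition, $C_n = \sum_{k=1}^{K} (\log S_k)\cos[n(k - 1/2)\pi/K]$, which is assumed here. The function name is illustrative, and the inputs are assumed to be strictly positive filter-bank energies.

```python
import math


def mfcc_from_filter_energies(filter_energies, n_coeffs):
    """Compute C_n = sum_k log(S_k) * cos(n * (k - 0.5) * pi / K) for n = 1..n_coeffs.

    filter_energies holds the K Mel filter-bank outputs S_k, all assumed > 0.
    """
    K = len(filter_energies)
    log_s = [math.log(s) for s in filter_energies]
    return [
        sum(log_s[k - 1] * math.cos(n * (k - 0.5) * math.pi / K)
            for k in range(1, K + 1))
        for n in range(1, n_coeffs + 1)
    ]


# A perfectly flat spectrum carries no shape information, so every
# coefficient is (numerically) zero.
print(mfcc_from_filter_energies([10.0] * 8, 4))
```

Because the cosine terms sum to zero over a flat input, the coefficients respond only to the shape of the log spectrum, not to its overall level.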

Delta Energy and Delta Spectrum
The energy of a frame is related to the identity of the sound. The energy of the signal x in a window from time sample $t_1$ to time sample $t_2$ is expressed by the following equation [14]:
$$Energy = \sum_{t=t_1}^{t_2} x^2(t) \qquad (9)$$
In addition, the audio signal is not constant from one frame to the next. The way the signal changes between frames can provide useful clues for language detection. For this reason, we also add features related to the change in MFCCs over time: a delta (velocity) feature and a double-delta (acceleration) feature are added for each of the 13 features (12 Mel cepstral coefficients plus energy). A simple way to calculate the deltas is to take the difference between frames; the delta $d(t)$ for a cepstral value $c(t)$ at time t can therefore be estimated as [14]:
$$d(t) = \frac{c(t+1) - c(t-1)}{2} \qquad (10)$$
Each of the 13 delta features represents the change between frames in the corresponding cepstral or energy feature, as in equation (10), while each of the 13 double-delta features represents the change between frames in the corresponding delta feature, giving 39 features in total. The performance of MFCCs can be affected by two factors: the number of filters and the window type [14].
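The delta and double-delta computation can be sketched with the simple two-frame difference $d(t) = (c(t+1) - c(t-1))/2$, which is assumed here. Repeating the boundary frame at the two edges is an assumption of this sketch, as are the function names.

```python
def delta(frames):
    """Estimate d(t) = (c(t+1) - c(t-1)) / 2 for every coefficient of every frame.

    frames is a list of equal-length feature vectors, one per time step. At the
    first and last frame the boundary frame is repeated (an assumed edge rule).
    """
    last = len(frames) - 1
    out = []
    for t in range(len(frames)):
        prev_frame = frames[max(t - 1, 0)]
        next_frame = frames[min(t + 1, last)]
        out.append([(nxt - prv) / 2.0 for prv, nxt in zip(prev_frame, next_frame)])
    return out


def add_deltas(frames):
    """Append delta and double-delta features to each frame (13 -> 39 values)."""
    d1 = delta(frames)
    d2 = delta(d1)
    return [c + a + b for c, a, b in zip(frames, d1, d2)]


# Five frames of 13 static features that increase by 1.0 per frame.
static = [[float(t)] * 13 for t in range(5)]
features = add_deltas(static)
print(len(features), len(features[0]))
```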

Bidirectional Long Short-Term Memory (BiLSTM)
The long short-term memory (LSTM) algorithm, an improvement on recurrent neural networks (RNNs), was introduced by Hochreiter and Schmidhuber in 1997 to address the drawbacks of recurrent neural networks by adding additional interactions to each unit (or cell). LSTM is a special type of RNN that is capable of learning long-term dependencies and remembering information for long periods of time [15]. BiLSTM is a hybrid of LSTM and the Bidirectional Recurrent Neural Network (BiRNN). Both the RNN and the LSTM can only obtain information from the previous context. To overcome this limitation, the BiRNN was introduced; it consists of two different layers that receive the input data separately in two different directions [16]. The idea of BiLSTM comes from the BiRNN, in which data sequences are processed in both forward and backward directions by two separate hidden layers. The BiLSTM network connects the two hidden layers to the same output layer. The output of the forward layer, $\overrightarrow{H}_{t_k}$, is computed iteratively using the inputs in forward order, from time $t_{k-n}$ to $t_{k-1}$, and the output of the backward layer, $\overleftarrow{H}_{t_k}$, is computed using the inputs in reverse order, from $t_{k-1}$ to $t_{k-n}$. The final output of each BiLSTM layer is calculated according to the following equation:
$$Y_{t_k} = \psi\left(\overrightarrow{H}_{t_k}, \overleftarrow{H}_{t_k}\right)$$
where ψ is a function that combines the outputs of the forward and backward layers by summing them [17].
The following figure shows a model of a bidirectional long short-term memory (BiLSTM) network [18].
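The bidirectional idea itself can be illustrated without a full LSTM cell. The sketch below deliberately replaces the LSTM cell with a plain scalar tanh recurrence to stay short, so it shows only how the forward and backward passes are run and summed; the weights and function names are hypothetical.

```python
import math


def run_direction(inputs, w_in, w_rec):
    """Run h_t = tanh(w_in * x_t + w_rec * h_{t-1}) over a sequence.

    This simple recurrence is a stand-in for an LSTM cell, not an LSTM.
    """
    h, outputs = 0.0, []
    for x in inputs:
        h = math.tanh(w_in * x + w_rec * h)
        outputs.append(h)
    return outputs


def bidirectional(inputs, fwd_weights, bwd_weights):
    """Combine a forward pass and a backward pass by summation at each time step."""
    forward = run_direction(inputs, *fwd_weights)
    # The backward layer reads the sequence in reverse; its outputs are
    # reversed again so that index t refers to the same time step in both.
    backward = run_direction(inputs[::-1], *bwd_weights)[::-1]
    return [f + b for f, b in zip(forward, backward)]


# Hypothetical weights; each output now depends on past and future inputs.
print(bidirectional([1.0, 0.0, 0.0], (1.0, 0.5), (1.0, 0.5)))
```

In the example, the forward pass alone would carry the first input only into later steps, while the backward pass lets each step also see the inputs that follow it.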

Activation Functions
The activation function determines whether a neuron should be activated or not and gives the neuron a non-linear output. A neural network without activation functions is just a linear regression model. There are many activation functions; two of them are described below [4].

1-Sigmoid Function
The sigmoid function maps an input, which can take any value between negative infinity and positive infinity, to a value in the range 0 to 1 [19]. It can be expressed by the following equation:
$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

2-Hyperbolic Tangent function (Tanh)
This function is similar to the sigmoid function, but the output of the hyperbolic tangent function (Tanh) ranges from -1 to 1, unlike the sigmoid function, whose output ranges from 0 to 1; as a result, it is often easier to optimize than the sigmoid function [20]. It can be expressed by the following equation:
$$\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$$
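Both activation functions can be sketched in a few lines, assuming the standard definitions $\sigma(x) = 1/(1 + e^{-x})$ and $\tanh(x) = (e^{x} - e^{-x})/(e^{x} + e^{-x})$. The hyperbolic tangent is computed here through the algebraically equivalent identity $\tanh(x) = 2\sigma(2x) - 1$ to avoid overflow for large inputs; the function names are illustrative.

```python
import math


def sigmoid(x):
    """Logistic sigmoid 1 / (1 + e^-x), written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def tanh_from_sigmoid(x):
    """Hyperbolic tangent via the identity tanh(x) = 2 * sigmoid(2x) - 1."""
    return 2.0 * sigmoid(2.0 * x) - 1.0


# The sigmoid output stays between 0 and 1; the tanh output between -1 and 1.
for x in (-5.0, -1.0, 0.0, 1.0, 5.0):
    print(f"x={x:5.1f}  sigmoid={sigmoid(x):.4f}  tanh={tanh_from_sigmoid(x):.4f}")
```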

Proposed System
The system was designed in successive stages. The proposed system includes four basic stages:
• First stage: preparing the database of the speakers' audio files.
• Second stage: preprocessing the audio files.
• Third stage: extracting features using the MFCCs algorithm.
• Fourth stage: building the classifier and detecting the speaker's language using the bidirectional long short-term memory (BiLSTM) algorithm.
Python (3.10.1) was used to build the system, as it is an easy-to-learn, open-source programming language. The system was run on a laptop computer with a 5th-generation Intel Core i7 processor and 8 GB of RAM.

Preparing Dataset
Databases are the basis of any language classification system and form the infrastructure of a computer communication system. The first challenge of this research was to find a data set of audio clips in different languages that is large enough to train a network. This work relies on the M2L-Dataset [21], which covers three languages (Arabic, English, and Kurdish) with one thousand samples per language, stored as .WAV files at a sample rate of 22,050 Hz. The Arabic samples consist of audio recordings of 40 people of both sexes, with an average of 25 samples per person, recorded using mobile phones in quiet rooms. The Kurdish samples were obtained from lectures and lessons, which were cut and processed to match the database in order to obtain a sufficient number of samples for training and testing. The English samples were obtained from VoxForge (voxforge.org), an open-source collection of audio clips in different languages recorded by middle-aged users of both sexes.

Pre-processing
Data pre-processing is an essential step for speaker language recognition, as it ensures that the data is well prepared for the subsequent analysis. In this paper, pre-processing was carried out in two steps:
• Removing periods of silence, using the function (split_on_silence) provided by the (pydub.silence) module in Python.
• Unifying the length of the files to one or two seconds. If an audio file is longer than the required length, the excess part is cut off; if it is shorter, the audio is repeated until it reaches the required length.
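The length-unification rule can be sketched on a plain list of samples. The function name `fix_length` is hypothetical, and the sketch works on already decoded samples rather than on audio files, which is a simplification.

```python
def fix_length(samples, target_len):
    """Trim or loop a sample sequence so that it has exactly target_len samples."""
    if not samples:
        raise ValueError("cannot extend an empty signal")
    if len(samples) >= target_len:
        return samples[:target_len]          # cut off the excess part
    repeats = -(-target_len // len(samples))  # ceiling division
    return (samples * repeats)[:target_len]  # repeat, then trim to length


print(fix_length([1, 2, 3], 7))
print(fix_length([1, 2, 3, 4], 2))
```

At the 22,050 Hz sample rate of the data set, a one-second target would correspond to `target_len = 22050` and a two-second target to `target_len = 44100`.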

Feature Extraction
In this paper, the MFCCs algorithm was used to extract the features, which are stored in a dictionary file (with the fields Mapping, Labels, and MFCCs) using the libraries (librosa and json). The values used in the feature extraction process are shown in the following table:

Detection of Speaker Language and Testing
The proposed model is a bidirectional LSTM (BiLSTM) network for speaker language detection, built as a Sequential model using the Keras library. In this model, the shape of the input must be specified, so the input layer requires information about the input shape, while the remaining layers infer their shapes automatically. The input is a two-dimensional matrix of shape (39,44) or (39,87), representing 39 features for each frame. The structure of the proposed model is shown in the following table, where the number (2 or 3) in the output layer refers to the number of languages.

Results and discussion
The main metric used in this research is accuracy. To verify the validity of the system, additional evaluation criteria were used: precision, recall, and F1 score. These values can be obtained from the confusion matrix, as shown in the following table [8]:
The following table shows the final results for speaker language detection accuracy.
The following table gives the values of the evaluation criteria for the four test cases of speaker language detection.
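The way these criteria follow from a confusion matrix can be sketched as below, assuming the standard definitions of accuracy, precision, recall, and F1 score. The function name is illustrative, and the small matrix in the example is made up for demonstration; it is not a result of the proposed system.

```python
def classification_metrics(confusion):
    """Compute accuracy and per-class precision, recall, and F1 score.

    confusion[i][j] is the number of samples of true class i predicted as class j.
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(n))
    per_class = []
    for c in range(n):
        tp = confusion[c][c]
        fp = sum(confusion[r][c] for r in range(n)) - tp  # predicted c, truly other
        fn = sum(confusion[c]) - tp                       # truly c, predicted other
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        per_class.append({"precision": precision, "recall": recall, "f1": f1})
    return {"accuracy": correct / total, "per_class": per_class}


# Made-up two-class example (rows: true class, columns: predicted class).
example = [[8, 2],
           [1, 9]]
print(classification_metrics(example))
```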

Conclusion
The goal of building a model capable of distinguishing between two languages was achieved with an accuracy of 100%, and the accuracy of the model in distinguishing among three languages was 99.19% when the sample length was two seconds, using one LSTM layer and five BiLSTM layers. The best setup in terms of signal processing was the use of MFCCs with 39 filter banks; if the model is implemented in a voice control application, a sample length of 1 or 2 seconds is suggested. We also conclude that a better result might have been possible with more computational power, since the long running time of each test limited the number of tests that could be run. The M2L-Dataset was very useful because its audio material was of good quality. Finally, it would be interesting for future projects to consider adding more languages.