Advancements in biometric systems using various modalities allow a particular person to be recognized from behavioural and physiological traits in a faster and more efficient manner. The number of studies on recognition systems for the face and speech modalities has been growing every year. People can perceive and interpret different emotions consistently, by observing cues such as movements of the facial muscles, voice, hand gestures, and so on.
The face is the most widely used and powerful of all biometric modalities, since face images can be acquired non-intrusively and without the cooperation of the individual. The Dual Tree Complex Wavelet Transform (DTCWT) is a novel enhancement of the DWT and an effective strategy for implementing an analytic wavelet transform. The complex coefficients produced by the DTCWT introduce limited redundancy and give the transform directional selectivity and shift-invariant filters.
The DTCWT can be implemented as two real, separable 2D wavelet transforms running in parallel. The first real wavelet transform is implemented by applying the low-pass and high-pass filter coefficients H_{0}(k) and H_{1}(k) along the row and column dimensions of the 2D data; this forms the upper filter bank of the DTCWT.
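As an illustration, one analysis level of such a separable row/column filter bank can be sketched in Python. The Haar coefficients used below are simple placeholders, not the actual DTCWT filter pair; a real implementation would use properly designed q-shift filters.

```python
import numpy as np

# Placeholder (Haar-like) filters standing in for H0/H1 of the upper tree.
H0 = np.array([1.0, 1.0]) / np.sqrt(2)   # low-pass
H1 = np.array([1.0, -1.0]) / np.sqrt(2)  # high-pass

def filter_downsample(x, h):
    """Convolve every row with h, then keep every second sample."""
    y = np.apply_along_axis(lambda r: np.convolve(r, h, mode='full')[:len(r)], 1, x)
    return y[:, ::2]

def analysis_level(img, h0, h1):
    """One separable 2D wavelet level -> (LL, LH, HL, HH) sub-bands."""
    lo = filter_downsample(img, h0)          # rows, low-pass
    hi = filter_downsample(img, h1)          # rows, high-pass
    LL = filter_downsample(lo.T, h0).T       # columns of the low branch
    LH = filter_downsample(lo.T, h1).T
    HL = filter_downsample(hi.T, h0).T
    HH = filter_downsample(hi.T, h1).T
    return LL, LH, HL, HH

img = np.random.rand(128, 512)
LL, LH, HL, HH = analysis_level(img, H0, H1)
# Each sub-band is half the size in both dimensions: 64 x 256
```

The lower filter bank of the dual tree would run the same routine with its own filter pair G0/G1.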
The second real wavelet transform, which forms the lower filter bank of the DTCWT, is built from the low-pass and high-pass filter coefficients G_{0}(k) and G_{1}(k), which are approximately analytic with respect to the upper-bank coefficients, yielding near-perfect reconstruction of the input image data. The QFT (Quick Fourier Transform) is a fast algorithm for the Discrete Fourier Transform, used in signal-processing applications such as correlation analysis, linear filtering and spectrum analysis, where direct computation is time-consuming. In the QFT, the data sequence is repeatedly split into smaller sequences until single-point sequences are obtained. For N = 2^s, this decomposition takes s = log_2 N stages. The total number of complex multiplications is therefore reduced to (N/2) log_2 N, versus N^2 for direct computation of the DFT. Likewise, the number of complex additions is reduced to N log_2 N, versus N^2 − N for the direct DFT computation.
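The operation counts above can be checked with a few lines of Python; this sketch treats the QFT as the standard radix-2 decomposition described in the text.

```python
import math

def dft_direct_ops(N):
    """Complex multiplications and additions of a direct N-point DFT."""
    return N * N, N * N - N

def fft_ops(N):
    """Counts for the radix-2 decomposition: s = log2(N) stages."""
    s = int(math.log2(N))
    assert 2 ** s == N, "N must be a power of two"
    return (N // 2) * s, N * s

muls_direct, adds_direct = dft_direct_ops(1024)
muls_fast, adds_fast = fft_ops(1024)
# For N = 1024: 1,048,576 direct multiplications vs 5,120 with the fast algorithm.
```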
A good amount of work has been carried out in the field of speech and face recognition systems by various researchers. For active speaker detection, relevant work is done in the paper
The work in article
The historical development of face recognition technologies, the present state-of-the-art techniques, and directions for future work are discussed
An improved acoustic feature extraction approach based on a hybrid procedure of Perceptual Wavelet Packet (PWP) and Mel Frequency Cepstral Coefficients (MFCC) is presented
The various available and newly created data sets used for face and speech recognition are discussed in this section.
Spacek Face database: This database was created by Libor Spacek
Extended Yale Face Database B+: The Extended Yale Face Database B+
Near Infrared Face Database:
ORL (Olivetti Research Lab) database:
As the data set for the speech recognition system, the LibriSpeech corpus is used
In this section, the proposed methodology for the face and speech recognition system is discussed. The extracted QFT and DTCWT features are fused to generate the final face feature set. The extracted MFCC and RASTA features are fused to form the final speech feature set. The proposed model concentrates on enhancing the recognition rates for both the face and speech modalities. The diagram illustrating the proposed model is given in
The various face databases contain images of different dimensions; therefore, the images are preprocessed into uniformly sized images. Every image is resized to 2^p × 2^q, where p and q are integers; here the face images are resized to 128 × 512. The DTCWT and QFT algorithms are then applied to the resized images.
The two-dimensional DTCWT feature extraction follows the steps given below:
Firstly, the input image is decomposed using the two-dimensional DWT. The proposed model applies a five-stage DTCWT to the face images, which gives sixteen sub-bands at every stage: four low-frequency sub-bands and twelve high-frequency sub-bands. At each stage the image size is halved in each dimension, down to a 4 × 16 sub-band size at the fifth stage.
Secondly, each pair of corresponding sub-bands with the same pass type is linearly combined by averaging and differencing. The two-dimensional CWT sub-bands at every stage are thus computed as (SPx + SPy)/√2, (SPx − SPy)/√2, (PSx + PSy)/√2, (PSx − PSy)/√2, (PPx + PPy)/√2 and (PPx − PPy)/√2.
Thirdly, for recognition of the face features, the magnitudes of the real and imaginary sub-bands are considered. The two-dimensional DWT yields three high-frequency sub-bands at every decomposition level, PS, SP and PP, which provide directional information. The DTCWT, built from two real two-dimensional wavelet transforms, generates six complex wavelets whose real and imaginary parts carry directional information in six different orientations. The magnitudes of the set of six complex wavelets are computed by Equations 1 and 2 below, and the final magnitude coefficients are produced by the concatenation operation given in Equation 3.
In these equations, x_p, x_q, x_y and x_z are the high-frequency DTCWT coefficient vectors, each of size 1 × 192 in the five-stage DTCWT. The resultant 1 × 384 feature vector X is generated by concatenating the magnitudes x_pq and x_yz.
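The averaging/differencing step and the magnitude computation above can be sketched as follows. The assumption that Equations 1 and 2 take the usual complex-magnitude form sqrt(a^2 + b^2) is mine; the random arrays are placeholders for real coefficient vectors.

```python
import numpy as np

def combine_pair(sub_x, sub_y):
    """Average and difference two same-pass sub-bands from the upper and
    lower trees, giving the real and imaginary parts of one complex band."""
    return (sub_x + sub_y) / np.sqrt(2), (sub_x - sub_y) / np.sqrt(2)

def magnitude_vector(x_p, x_q, x_y, x_z):
    """Assumed form of Eqns 1-3: element-wise complex magnitudes of the two
    high-frequency coefficient pairs, concatenated to a 1 x 384 vector."""
    x_pq = np.sqrt(x_p ** 2 + x_q ** 2)   # Eqn 1, size 1 x 192
    x_yz = np.sqrt(x_y ** 2 + x_z ** 2)   # Eqn 2, size 1 x 192
    return np.concatenate([x_pq, x_yz])   # Eqn 3, size 1 x 384

x_p, x_q, x_y, x_z = (np.random.rand(192) for _ in range(4))
X = magnitude_vector(x_p, x_q, x_y, x_z)   # shape (384,)
```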
For Quick Fourier Transform (QFT) feature extraction, the 2D QFT is applied to the preprocessed 128 × 512 face images to obtain the QFT coefficients using Equation 4. The absolute values of the QFT coefficients are sorted in non-increasing order, and the 384 most dominant coefficients are taken as the QFT features. The dominant QFT features are fused with the DTCWT features by arithmetic addition to obtain the resultant features and an improved recognition rate; the fused features are generated using Equation 5. To obtain the features of a test image, the QFT and the five-stage DTCWT are computed on the test image, and these features are compared with the database features using the Euclidean distance given in Equation 6, producing the FAR, FRR, TSR and EER values.
In the above equation, F(p,q) represents a QFT coefficient and f(k,i) represents the input face image.
FFT_n and DTCWT_n denote the Quick Fourier Transform and DTCWT coefficients, respectively.
In Equation 6, x_k represents the feature values of the database images and y_k the feature values of the test images.
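A minimal sketch of the addition fusion (Equation 5) and the Euclidean matching (Equation 6), assuming 384-element feature vectors as described above:

```python
import numpy as np

def fuse_features(dtcwt_feats, qft_feats):
    """Eqn 5: element-wise arithmetic addition of the two feature vectors."""
    return np.asarray(dtcwt_feats) + np.asarray(qft_feats)

def euclidean_distance(db_feats, test_feats):
    """Eqn 6: sqrt of the summed squared differences between x_k and y_k."""
    d = np.asarray(db_feats) - np.asarray(test_feats)
    return float(np.sqrt(np.sum(d ** 2)))

# A test image is assigned to the database identity with the smallest distance;
# thresholding that distance yields the FAR/FRR operating points.
```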
The voice data from the speech database are preprocessed before the MFCC and RASTA techniques are applied for feature extraction.
The Mel-frequency scale is linear at low frequencies, up to 1000 Hz, and logarithmic at higher frequencies above 1000 Hz. Equation 7 gives the relationship between the Mel scale and frequency in Hz.
In Equation 7, FRmel is the Mel-scale value and f is the frequency in Hz. The Mel scale is applied to the frequency spectrum through a filter bank that mimics the response of the human ear. If the spectrum F[N] is the input to this process, the output is the modified spectrum M[N], which carries the power output of those filters. The number of Mel spectrum coefficients is typically set to 20. As for the cepstrum, humans perceive speech information through time-domain signals; in this phase, the Mel spectrum is therefore transformed back to the time domain using the Discrete Cosine Transform (DCT), and the outcome is the MFCC. The cosine transform is expressed by Equation 8.
In Equation 8, X_i is an MFCC coefficient, Z_i is the Mel-frequency power spectrum, and i = 1, 2, 3, …, N, where N is the number of coefficients desired and M is the number of filters.
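Since Equations 7 and 8 are not reproduced here, the sketch below assumes the commonly used Mel mapping FRmel = 2595 log10(1 + f/700) and the standard DCT-II form of the cosine transform; the paper's exact constants may differ.

```python
import math

def hz_to_mel(f):
    """Assumed form of Eqn 7: the widely used Hz-to-Mel mapping."""
    return 2595.0 * math.log10(1.0 + f / 700.0)

def dct_mfcc(log_mel, n_coeffs):
    """Assumed DCT-II form of Eqn 8: MFCC coefficients X_i from the
    M filter-bank outputs in log_mel."""
    M = len(log_mel)
    return [
        sum(log_mel[j] * math.cos(math.pi * i * (j + 0.5) / M) for j in range(M))
        for i in range(n_coeffs)
    ]
```

By construction, 1000 Hz maps to roughly 1000 Mel, matching the linear/logarithmic break described above.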
In the RASTA algorithm, a specialized band-pass filter is applied to each frequency sub-band to smooth out short-term noise variations and to remove any constant offset in the speech channel. As shown in the
The face recognition experiments for the proposed model are carried out using MATLAB. Face images are drawn from the databases discussed in Section III, namely the Spacek face database, the Extended Yale Face Database B+, the Near Infrared face database and the ORL database, along with the Indian male and Indian female databases.
For the performance analysis, the experimental results obtained with DTCWT, QFT, and their fusion are analysed. Numerous combinations of Humans Inside the Database (HID) and Humans Outside the Database (HOD) are used for each database to observe the variations in the performance parameters.
The performance parameters used for evaluation, namely the False Acceptance Rate (FAR), False Rejection Rate (FRR), Total Success Rate (TSR), Partial Error Rate (PER) and Equal Error Rate (EER), are defined below:
Let A be the number of human faces accepted from outside the database and B be the total number of humans outside the database. Then, FAR = (A/B) × 100%.
Let C be the number of genuine humans rejected inside the database and D be the total number of humans in the database. Then, FRR = (C/D) × 100%.
Let X be the number of correctly matched humans and Y be the total number of humans inside the database. Then, TSR = (X/Y) × 100%.
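The three rate definitions above translate directly into code:

```python
def far(accepted_outside, total_outside):
    """False Acceptance Rate: A impostors accepted out of B outside the DB."""
    return 100.0 * accepted_outside / total_outside

def frr(rejected_inside, total_inside):
    """False Rejection Rate: C genuine users rejected out of D in the DB."""
    return 100.0 * rejected_inside / total_inside

def tsr(matched, total_inside):
    """Total Success Rate: X correct matches out of Y enrolled users."""
    return 100.0 * matched / total_inside
```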
The EER is the error rate at which FRR and FAR are equal. The parameter values for DTCWT, QFT and their fusion are recorded in
Database | HID:HOD | DTCWT TSR (%) | DTCWT EER (%) | DTCWT Max TSR (%) | QFT TSR (%) | QFT EER (%) | QFT Max TSR (%) | Fusion TSR (%) | Fusion EER (%) | Fusion Max TSR (%)
Spacek | 20:30 | 85 | 15 | 100 | 80 | 20 | 100 | 82.5 | 18 | 100
Spacek | 30:20 | 80 | 20 | 100 | 80 | 20 | 94 | 80 | 18 | 100
Spacek | 10:30 | 88 | 12 | 100 | 84 | 16 | 100 | 88 | 14 | 100
Spacek | 30:10 | 92 | 6 | 98 | 88 | 10 | 92 | 92 | 10 | 100
Extended Yale | 12:8 | 84 | 14 | 98 | 86 | 12 | 98 | 85 | 15 | 100
Extended Yale | 8:12 | 90 | 10 | 96 | 82 | 16 | 96 | 92 | 10 | 100
Extended Yale | 15:20 | 91.43 | 8.5 | 99 | 85.71 | 15 | 100 | 92 | 12 | 100
Extended Yale | 20:15 | 88.57 | 10.5 | 98 | 82.86 | 17 | 98 | 98.8 | 9 | 100
Near Infrared | 10:20 | 88 | 12 | 100 | 85 | 15 | 100 | 86 | 14 | 100
Near Infrared | 20:10 | 85 | 15 | 100 | 88 | 12 | 100 | 86 | 14 | 100
Near Infrared | 15:20 | 85.71 | 14 | 100 | 82.86 | 16 | 98 | 88 | 12 | 100
Near Infrared | 20:15 | 91.43 | 8.5 | 100 | 85.71 | 15 | 98 | 90 | 12 | 100
ORL | 30:10 | 88 | 12 | 100 | 92 | 10 | 98 | 90 | 12 | 100
ORL | 10:30 | 84 | 16 | 95 | 88 | 12 | 100 | 88 | 14 | 100
ORL | 20:30 | 80 | 21 | 92 | 85 | 15 | 94 | 85 | 15 | 100
ORL | 30:20 | 80 | 21 | 92 | 85 | 16 | 92 | 84 | 16 | 100
Indian male | 10:15 | 82.5 | 17 | 98 | 85 | 15 | 98 | 96 | 8 | 94
Indian male | 15:10 | 82.5 | 17 | 98 | 85 | 15 | 98 | 94 | 10 | 94
Indian male | 20:15 | 87.5 | 13 | 96 | 83 | 15 | 95 | 84 | 15 | 92
Indian male | 15:20 | 87.5 | 13 | 96 | 83 | 15 | 95 | 88 | 12 | 92
Indian female | 15:18 | 90.90 | 10 | 93 | 87.88 | 13 | 92 | 98.5 | 8 | 92
Indian female | 18:15 | 90.90 | 10 | 93 | 87.88 | 13 | 92 | 92 | 12 | 93
Indian female | 20:25 | 86.66 | 14 | 95 | 80 | 16 | 90 | 88 | 12 | 88
Indian female | 25:20 | 88.88 | 12 | 95 | 80 | 16 | 90 | 92 | 15 | 90
The comparison of the proposed fusion with existing face recognition techniques across the six databases is given below.
Techniques | Spacek TSR% | Spacek EER% | Extended Yale TSR% | Extended Yale EER% | Near Infrared TSR% | Near Infrared EER% | ORL TSR% | ORL EER% | Indian male TSR% | Indian male EER% | Indian female TSR% | Indian female EER%
FFT + DWT | 90 | 10 | 88.4 | 10 | 72 | 19 | 90 | 10 | 84.40 | 16.40 | 87.61 | 8.21
LBP + DWT + SOM | 60 | 30 | 64 | 36 | 76 | 34 | 60 | 30 | 55 | 40 | 72 | 30
DWT + SVM | 88 | 12 | 65 | 36 | 84 | 15 | 90 | 10 | 58.33 | 44 | 77.5 | 12.5
RLM for Canny Edge | 60 | 40 | 70 | 32 | 82 | 18 | 50 | 50 | 70 | 32 | 80 | 19
Proposed | 92 | 6 | 98.8 | 8.5 | 91.43 | 8.5 | 92 | 10 | 96 | 8 | 98.5 | 8
The proposed fusion technique gives the highest TSR on every database among the compared techniques.
The speech recognition system based on the fusion of the MFCC and RASTA algorithms is evaluated on the LibriSpeech corpus, which consists of 1000 hours of English speech; 16 speakers, both men and women, speaking distinct words in their sentences are used. The results are compared with existing speech recognition methods. During the experimentation, the recognition rate is computed from the total number of speakers and the number of correct matches. The performance metrics accuracy, F1-score, recall and precision are used for the evaluation: Accuracy (ACC) is the proportion of data correctly matched out of the total data set; Recall (RC) gives the proportion of speech marked positive that is recognized; Precision (PR) gives the proportion of speech precisely recognized; and the F1-score is computed from the recall and precision values. The quantities True Positive (TP), False Positive (FP), True Negative (TN) and False Negative (FN) used to compute these metrics are defined below.
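The four metrics can be computed from the confusion counts as follows (a sketch of the standard definitions):

```python
def metrics(tp, fp, tn, fn):
    """Accuracy, recall, precision and F1-score from TP, FP, TN, FN."""
    acc = (tp + tn) / (tp + fp + tn + fn)  # correct decisions over all decisions
    rc = tp / (tp + fn)                    # positives recovered
    pr = tp / (tp + fp)                    # positives that were correct
    f1 = 2 * pr * rc / (pr + rc)           # harmonic mean of precision and recall
    return acc, rc, pr, f1
```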
Performance metric | RASTA+LPC+DWT | LPCC+MFCC | SVM | HMM | RNN | Proposed
Accuracy (ACC) | 89.5 | 96.33 | 87 | 91 | 97.49 | 98.67
Recall (RC) | 57.3 | 96.93 | 92 | 93 | 94.3 | 98.97
Precision (PR) | 83.4 | 95.73 | 83 | 89.5 | 95.6 | 98.23
F1-Score | 67.93 | 96.35 | 87.27 | 91.22 | 94.94 | 98.6
The proposed MFCC and RASTA fusion achieves the highest score on all four metrics among the compared techniques.
The results of the experimentation are tabulated and compared with other existing research works in the tables above.
In this work, both the face and voice modalities are discussed, along with the data sets required for testing face and speech recognition systems. The proposed model comprises face and speech recognition systems: the face recognition system extracts features from the face images using the DTCWT and QFT techniques and then fuses the two, while the speech recognition system extracts features from the voice data using the MFCC and RASTA techniques and then fuses the two to obtain effective recognition results. The unimodal speech recognition accuracy is 98.67% and the overall recognition rate achieved by the proposed model is 98.8%. The results for both modalities are compared with various techniques using the performance metrics, demonstrating that the proposed model works better. In future work, different biometric traits will be considered in order to develop a fusion of more than two biometric modalities, towards a more advanced and secure human recognition system.