• P-ISSN 0974-6846 E-ISSN 0974-5645

Indian Journal of Science and Technology



Year: 2021, Volume: 14, Issue: 40, Pages: 3026-3050

Review Article

A review on state-of-the-art Automatic Speaker verification system from spoofing and anti-spoofing perspective

Received Date: 09 July 2021, Accepted Date: 19 October 2021, Published Date: 29 November 2021


Background/Objectives: Anti-spoofing measures are developing rapidly with the aim of protecting Automatic Speaker Verification (ASV) systems from spoofing attacks. This review brings together the possible attack types, the required datasets, the established feature representation techniques, machine learning-based modeling algorithms, and score normalization techniques. Method/Findings: Existing datasets are analyzed in detail based on the total number of speaker samples, the number of speakers, and availability (open or licensed). This may help in choosing the right dataset for building anti-spoofing frameworks. The feature extraction schemes are then elaborated to cover the wide range of features present in different parts of raw speech from which speaker-specific traits are obtained. Machine learning algorithms, ranging from discriminative to generative to mixed forms, are explored to identify the right algorithm for specific attack conditions. Together, these analyses of existing features and machine learning algorithms contribute to classifying unknown test samples as genuine or spoofed. Score normalization techniques are also considered in this review, with the aim of reducing misclassifications and ultimately the False Acceptance Ratio. The performance of an anti-spoofing speaker verification system may be evaluated using standard objective measures such as Equal Error Rate, false positive ratios, and graphical plots; these measures are briefly explained in this review. Overall, a critical analysis of the individual methods (feature extraction, machine learning, and score normalization) and of the anti-spoofing datasets is presented to give a starting point to researchers beginning to explore this direction. The shortcomings and risks involved in building an enhanced speaker verification system that is robust to almost all attack types are listed in this article. The review of studies conducted so far has led to vital future directions, which are listed in the concluding remarks of the article.
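
As an illustration of the Equal Error Rate mentioned above, the following minimal Python sketch finds the threshold at which the false acceptance and false rejection rates are closest. It is not taken from the review: the function name `equal_error_rate`, the toy scores, and the convention that a higher score means "more likely genuine" are all assumptions made for this example.

```python
def equal_error_rate(genuine_scores, spoof_scores):
    """Approximate the EER by sweeping every observed score as a threshold.

    Assumption: higher score = more likely genuine; a trial is accepted
    when its score is at or above the threshold.
    """
    thresholds = sorted(set(genuine_scores) | set(spoof_scores))
    best_gap, eer = float("inf"), None
    for t in thresholds:
        # False acceptance rate: spoofed trials accepted at this threshold
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        # False rejection rate: genuine trials rejected at this threshold
        frr = sum(g < t for g in genuine_scores) / len(genuine_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            # Keep the operating point where the two error rates are closest
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Toy scores (hypothetical): one spoofed trial overlaps the genuine range.
genuine = [0.9, 0.8, 0.7, 0.4]
spoofed = [0.6, 0.3, 0.2, 0.1]
print(equal_error_rate(genuine, spoofed))  # 0.25
```

With real countermeasure scores the two error curves rarely cross exactly at an observed threshold, so practical toolkits usually interpolate between neighbouring operating points; this sketch simply reports the closest one.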

Keywords: Automatic Speaker Verification, Spoofed Detection, Anti-Spoofing, Voice Conversion, Speech Synthesis, Replay Speech




© 2021 Chadha et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. Published By Indian Society for Education and Environment (iSee)

