Vol. 16 No. 2 (2020)

Published: December 31, 2020

Pages: 91-102

Review Article

A Review on Voice-based Interface for Human-Robot Interaction

Abstract

With recent developments in technology and advances in artificial intelligence and machine learning techniques, it has become possible for robots to understand and respond to voice as part of Human-Robot Interaction (HRI). A robot with a voice-based interface can recognize speech information from humans, allowing it to interact more naturally with its human counterpart in different environments. In this work, a review of voice-based interfaces for HRI systems is presented. The review examines voice-based perception in HRI systems from three facets: feature extraction, dimensionality reduction, and semantic understanding. For feature extraction, numerous types of features are reviewed across domains such as the time, frequency, cepstral (i.e., applying the inverse Fourier transform to the logarithm of the signal spectrum), and deep domains. For dimensionality reduction, subspace learning can be used to eliminate the redundancy of high-dimensional features by further processing the extracted features so that they better reflect their semantic information. For semantic understanding, the aim is to infer objects or human behaviors from the extracted features. Numerous types of semantic understanding are reviewed, such as speech recognition, speaker recognition, speaker gender detection, speaker gender and age estimation, and speaker localization. Finally, some existing issues of voice-based interfaces and recommendations for future work are outlined.
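As an illustrative sketch only (not code from the reviewed work), the short Python snippet below makes the cepstral-domain definition above concrete by taking the inverse Fourier transform of the log magnitude spectrum of each speech frame, and then applies PCA as one common stand-in for subspace-learning-based dimensionality reduction. The sampling rate, frame and hop sizes, number of retained coefficients, the synthetic signal, and the use of NumPy and scikit-learn are all assumptions made for the example.

```python
import numpy as np
from sklearn.decomposition import PCA

def real_cepstrum(frame: np.ndarray) -> np.ndarray:
    """Real cepstrum of one windowed speech frame:
    inverse FFT of the log magnitude spectrum."""
    spectrum = np.fft.rfft(frame)
    log_magnitude = np.log(np.abs(spectrum) + 1e-10)  # avoid log(0)
    return np.fft.irfft(log_magnitude)

# Toy example: frame a 1-second synthetic signal into 25 ms windows
# with a 10 ms hop (16 kHz sampling rate is an assumption).
sr, frame_len, hop = 16000, 400, 160
signal = np.random.randn(sr)
frames = [signal[i:i + frame_len] * np.hamming(frame_len)
          for i in range(0, len(signal) - frame_len, hop)]

# Feature matrix: one cepstral vector per frame (keep the first 20 coefficients).
features = np.array([real_cepstrum(f)[:20] for f in frames])

# Subspace-learning stand-in: PCA projects the high-dimensional features onto
# a lower-dimensional subspace that preserves most of their variance.
reduced = PCA(n_components=10).fit_transform(features)
print(features.shape, "->", reduced.shape)
```

In practice, the reduced feature vectors would feed a downstream semantic-understanding model (e.g., a speaker or gender classifier); PCA is used here only as a generic example of subspace learning.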
