Vol. 20 No. 1 (2024)

Published: June 30, 2024

Pages: 111–121

Original Article

NNMF with Speaker Clustering in a Uniform Filter-Bank for Blind Speech Separation

Abstract

This study proposes a single-channel blind speech separation algorithm. The algorithm's input is a segment of mixed speech from two speakers. First, filter-bank analysis transforms the input from the time domain to the time-frequency domain (a spectrogram); the uniform filter bank has 257 sub-bands. Non-Negative Matrix Factorization (NNMF) then factorizes each sub-band output into 28 sub-signals, giving 257 × 28 = 7196 sub-speech signals in total. A binary mask splits each of these sub-signals into two groups, one belonging to the first speaker and the other to the second; however, the masking alone cannot identify which group belongs to which speaker. That identification is performed by speaker clustering algorithms, and since speaker clustering cannot proceed without speaker segmentation, standard overlapping windowed frames are used to partition the speech. After clustering, the phase angle extracted from the spectrogram of the mixture is merged into the spectrogram of each recovered speech signal, and filter-bank synthesis produces a full-band speech signal for each speaker. Subjective listening tests indicate that the algorithm's outputs are acceptable. For objective evaluation, the researchers tested the algorithm on 66 mixed conversations drawn from 12 speakers (6 female and 6 male). The average SIR is 11.1 dB, the average SDR is 1.7 dB, and the average SAR is 2.8 dB.
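For illustration, the pipeline can be sketched in Python under simplifying assumptions: a 512-point STFT stands in for the uniform 257-band analysis/synthesis filter bank, a single rank-28 NMF of the whole magnitude spectrogram replaces the paper's per-sub-band factorization, and k-means over the NMF spectral bases approximates the speaker-clustering stage. All names and parameters are illustrative, not the authors' implementation.

```python
# Minimal sketch of the described pipeline (not the authors' code).
# Assumptions: a 512-point STFT stands in for the uniform 257-band
# filter bank, one rank-28 NMF of the full magnitude spectrogram
# replaces the per-sub-band factorization, and k-means on the NMF
# spectral bases approximates the speaker-clustering stage.
import numpy as np
from scipy.signal import stft, istft
from sklearn.cluster import KMeans
from sklearn.decomposition import NMF


def separate_two_speakers(mixture, fs, n_fft=512, rank=28):
    """Return two estimated speaker signals from a single-channel mixture."""
    # Analysis: 512-point STFT -> 257 uniform frequency bins.
    _, _, Z = stft(mixture, fs=fs, nperseg=n_fft)
    mag, phase = np.abs(Z), np.angle(Z)

    # NNMF: mag ~= W @ H, with W (257 x 28) spectral bases and
    # H (28 x frames) temporal activations.
    nmf = NMF(n_components=rank, init="nndsvda", max_iter=400, random_state=0)
    W = nmf.fit_transform(mag)
    H = nmf.components_

    # Speaker clustering: group the 28 components into two speakers,
    # using their (log) spectral bases as features.
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(
        np.log(W.T + 1e-9)
    )

    estimates = []
    total = W @ H + 1e-9
    for spk in (0, 1):
        part = W[:, labels == spk] @ H[labels == spk, :]
        # Binary mask: assign each time-frequency cell to the speaker
        # whose components dominate it.
        mask = (part / total >= 0.5).astype(float)
        # Merge the mixture's phase into the masked magnitude, then
        # synthesize a full-band time signal for this speaker.
        _, x_hat = istft(mask * mag * np.exp(1j * phase), fs=fs, nperseg=n_fft)
        estimates.append(x_hat)
    return estimates
```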
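The reported SIR, SDR, and SAR figures are the BSS-Eval criteria used in the evaluation campaign of [20]. A hypothetical per-mixture check, assuming clean reference signals ref1 and ref2 are available and using the mir_eval implementation of those metrics, might look like this:

```python
# Hypothetical evaluation of one mixture with the BSS-Eval metrics
# (SDR/SIR/SAR) via the mir_eval package; mixture, ref1, ref2, and fs
# are assumed to exist and are not taken from the paper's data.
import numpy as np
import mir_eval

est1, est2 = separate_two_speakers(mixture, fs)
n = min(len(ref1), len(ref2), len(est1), len(est2))
sdr, sir, sar, _ = mir_eval.separation.bss_eval_sources(
    np.stack([ref1[:n], ref2[:n]]),  # reference sources
    np.stack([est1[:n], est2[:n]]),  # estimated sources
)
print(f"SDR {sdr.mean():.1f} dB, SIR {sir.mean():.1f} dB, SAR {sar.mean():.1f} dB")
```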

References

  1. D. Wang and J. Chen, “Supervised speech separation based on deep learning: An overview,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 26, no. 10, pp. 1702–1726, 2018.
  2. D. Raj, P. Denisov, Z. Chen, H. Erdogan, Z. Huang, M. He, S. Watanabe, J. Du, T. Yoshioka, Y. Luo, et al., “Integration of speech separation, diarization, and recognition for multi-speaker meetings: System description, comparison, and analysis,” in 2021 IEEE Spoken Language Technology Workshop (SLT), pp. 897–904, IEEE, 2021.
  3. P. O. Hoyer, “Non-negative sparse coding,” in Proceedings of the 12th IEEE Workshop on Neural Networks for Signal Processing, pp. 557–565, IEEE, 2002.
  4. A. Ozerov and C. Févotte, “Multichannel nonnegative matrix factorization in convolutive mixtures for audio source separation,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 3, pp. 550–563, 2009.
  5. B. King, C. Févotte, and P. Smaragdis, “Optimal cost function and magnitude power for NMF-based speech separation and music interpolation,” in 2012 IEEE International Workshop on Machine Learning for Signal Processing, pp. 1–6, IEEE, 2012.
  6. H. Kameoka, H. Kagami, and M. Yukawa, “Complex NMF with the generalized Kullback-Leibler divergence,” in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 56–60, IEEE, 2017.
  7. H. M.-A. Kadhim, Single channel overlapped-speech detection and separation of spontaneous conversations. PhD thesis, 2018.
  8. H. Sawada, N. Ono, H. Kameoka, D. Kitamura, and H. Saruwatari, “A review of blind source separation methods: Two converging routes to ILRMA originating from ICA and NMF,” APSIPA Transactions on Signal and Information Processing, vol. 8, p. e12, 2019.
  9. W. Yao, D. Lv, X. Huang, J. Zi, M. Gao, R. Xi, and Y. Zhang, “Layered convolutive nonnegative matrix factorization for speech separation,” in Journal of Physics: Conference Series, vol. 2258, p. 012020, IOP Publishing, 2022.
  10. D. Wang, T. Li, P. Deng, F. Zhang, W. Huang, P. Zhang, and J. Liu, “A generalized deep learning clustering algorithm based on non-negative matrix factorization,” ACM Transactions on Knowledge Discovery from Data, vol. 17, no. 7, pp. 1–20, 2023.
  11. K. Shimada, Y. Bando, M. Mimura, K. Itoyama, K. Yoshii, and T. Kawahara, “Unsupervised speech enhancement based on multichannel NMF-informed beamforming for noise-robust automatic speech recognition,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 5, pp. 960–971, 2019.
  12. V. Leplat, N. Gillis, and A. M. Ang, “Blind audio source separation with minimum-volume beta-divergence NMF,” IEEE Transactions on Signal Processing, vol. 68, pp. 3400–3410, 2020.
  13. C. Subakan, M. Ravanelli, S. Cornell, M. Bronzi, and J. Zhong, “Attention is all you need in speech separation,” in ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 21–25, IEEE, 2021.
  14. L. Rabiner and R. Schafer, Theory and applications of digital speech processing. Prentice Hall Press, 2010.
  15. C. Févotte, E. Vincent, and A. Ozerov, “Single-channel audio source separation with NMF: Divergences, constraints and algorithms,” Audio Source Separation, pp. 1–24, 2018.
  16. H. A. Song and S.-Y. Lee, “Hierarchical representation using NMF,” in Neural Information Processing: 20th International Conference, ICONIP 2013, Daegu, Korea, November 3-7, 2013. Proceedings, Part I 20, pp. 466–473, Springer, 2013.
  17. M. Kotti, V. Moschou, and C. Kotropoulos, “Speaker segmentation and clustering,” Signal Processing, vol. 88, no. 5, pp. 1091–1124, 2008.
  18. I. Kavalerov, S. Wisdom, H. Erdogan, B. Patton, K. Wilson, J. Le Roux, and J. R. Hershey, “Universal sound separation,” in 2019 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 175–179, IEEE, 2019.
  19. S. Fernández, A. Graves, and J. Schmidhuber, “Phoneme recognition in TIMIT with BLSTM-CTC,” arXiv preprint arXiv:0804.3269, 2008.
  20. E. Vincent, S. Araki, F. Theis, G. Nolte, P. Bofill, H. Sawada, A. Ozerov, V. Gowreesunker, D. Lutter, and N. Q. Duong, “The signal separation evaluation campaign (2007–2010): Achievements and remaining challenges,” Signal Processing, vol. 92, no. 8, pp. 1928–1936, 2012.
  21. P. Mowlaee, R. Saeidi, M. G. Christensen, and R. Martin, “Subjective and objective quality assessment of single-channel speech separation algorithms,” in 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 69–72, IEEE, 2012.
  22. T. Virtanen, A. T. Cemgil, and S. Godsill, “Bayesian extensions to non-negative matrix factorisation for audio signal modelling,” in 2008 IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 1825–1828, IEEE, 2008.
  23. R. Jaiswal, “Non-negative matrix factorization based algorithms to cluster frequency basis functions for monaural sound source separation,” 2013.
  24. B. Gao, W. L. Woo, and S. S. Dlay, “Unsupervised single-channel separation of nonstationary signals using gammatone filterbank and Itakura-Saito nonnegative matrix two-dimensional factorizations,” IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 60, no. 3, pp. 662–675, 2012.
  25. B. Gao, W. L. Woo, and S. S. Dlay, “Variational regularized 2-D nonnegative matrix factorization,” IEEE Transactions on Neural Networks and Learning Systems, vol. 23, no. 5, pp. 703–716, 2012.
  26. B. Gao, W. L. Woo, and S. S. Dlay, “Adaptive sparsity non-negative matrix factorization for single-channel source separation,” IEEE Journal of Selected Topics in Signal Processing, vol. 5, no. 5, pp. 989–1001, 2011.
  27. A. Al-Tmeme, W. L. Woo, S. S. Dlay, and B. Gao, “Underdetermined convolutive source separation using GEM-MU with variational approximated optimum model order NMF2D,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, no. 1, pp. 35–49, 2016.
  28. G. Cantisani, S. Essid, and G. Richard, “Neuro-steered music source separation with EEG-based auditory attention decoding and contrastive-NMF,” in ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 36–40, IEEE, 2021.
  29. A. Alghamdi, G. Healy, and H. Abdelhafez, “Real time blind audio source separation based on machine learning algorithms,” in 2020 2nd Novel Intelligent and Leading Emerging Sciences Conference (NILES), pp. 35–40, IEEE, 2020.