Indian classical music is an inseparable part of Indian culture and society. A raga, the fundamental structure within Indian classical music, is a musical entity with its own distinct personality. A raga is a blend of several distinctive swaras and can be identified using methods such as scale matching, Aaroh-Avroh patterns, Pakad matching, pitch-class distribution, and swara intonation.
Carnatic (South Indian) music and Hindustani (North Indian) music are the two branches of Indian classical music (in Hindi, "Bhartiya Shaastriya Sangeet").
A composition of Hindustani classical music is known as a bandish, which literally means "binding". Each bandish consists of a unique blend of the central elements of Indian classical music. Swara (note), Aaroh-Avroh, Vadi-Samvadi, Gamakas, Pakad, Tala, and Thaat are the central elements used to identify a raga.
In classical music, raga identification is a challenging task for any researcher, for several reasons:
• Understanding music is a challenging task that needs a high level of expertise.
• Many different instruments might be used during the composition of music.
• The notes of a raga need not occur in any fixed order.
• Because of the many file formats, gathering musical data is challenging.
Carnatic raga classification has received considerable attention in the past, but Hindustani raga classification still needs much more effort. Owing to the rapid expansion of the digital music industry, interest in automated music classification and identification has grown significantly in recent years.
The remaining part of the paper is organised as follows: The methodology is discussed in Section 2. Section 3 is on results and discussion. The scope and future work are included in Section 4.
In our experiment, we collected all the vocal and instrumental audio files from the open platform YouTube. The dataset consists of songs sung by various musicians, a wide range of compositions, and a diverse variety of ragas. First, we converted each recording to .wav format and then divided it into 15-second audio clips. We have 2940 audio clips of classical ragas in total, covering Yaman, Bhairav, Bhairavi, Multani, and Dhanashree: 584 clips of Yaman, 574 of Bhairav, 579 of Bhairavi, 603 of Dhanashree, and 600 of Multani. In this experiment, we implemented both the MFCC and Mel-spectrogram feature extraction approaches with a CNN, with the aim of comparing the two and determining which performs better.
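The clip-splitting step above can be sketched as follows. This is a minimal illustration operating on a raw sample array (how the .wav file is loaded into that array is left out; any reader such as `soundfile` would do):

```python
import numpy as np

def split_into_clips(samples: np.ndarray, sr: int, clip_seconds: int = 15) -> list:
    """Split a mono waveform into consecutive fixed-length clips.

    Trailing samples that do not fill a whole clip are discarded,
    so every clip holds exactly clip_seconds * sr samples.
    """
    clip_len = clip_seconds * sr
    n_clips = len(samples) // clip_len
    return [samples[i * clip_len:(i + 1) * clip_len] for i in range(n_clips)]

# Example: a 50-second recording at 22,050 Hz yields three full 15-second clips.
sr = 22050
recording = np.zeros(50 * sr, dtype=np.float32)
clips = split_into_clips(recording, sr)
print(len(clips), len(clips[0]))  # 3 330750
```

Discarding the final partial clip keeps all training examples the same length, which simplifies batching later.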
The effectiveness of the entire system can be significantly affected by data pre-processing. The main goal of the pre-processing steps is to represent the audio input effectively so that the deep learning models can extract features quickly.
Feature extraction means deriving meaningful characteristics from an audio signal before training any model; it describes how the raw audio signal is processed and transformed.
MFCC is one of the most important and most frequently used approaches for extracting characteristics from an audio signal. It models how the human auditory system perceives the voice.
In audio processing, the Mel-frequency cepstrum (MFC) is a representation of the short-term power spectrum of a sound, based on a linear cosine transform of a log power spectrum on a nonlinear Mel scale of frequency. Computing Mel-frequency cepstral coefficients (MFCCs) involves windowing the signal, computing the discrete Fourier transform (DFT) of each windowed segment, taking the log of the magnitude of the DFT, warping the frequencies onto the Mel scale with a filter bank, and finally extracting the MFCC coefficients. The following diagram shows the process of MFCC feature extraction.
An MFCC matrix is extracted from each raga wave file. The feature set consists of the power spectrogram, the mean of the MFCCs, the spectral centroid, the zero-crossing rate, the roll-off frequency, and the spectral bandwidth. When we applied MFCC extraction to the audio dataset, we obtained the result given in
A Mel spectrogram is a spectrogram whose frequencies have been mapped onto the Mel scale. The Mel scale (from "melody") is a perceptual scale of pitch: mathematically, it is a nonlinear transformation of the frequency scale, constructed so that sounds equally spaced on it are perceived by human listeners as equidistant in pitch. Converting the audio files to such images proved a more effective audio pre-processing method than extracting numerical features.
• Sample the input signal at a rate of 22,050 Hz and slide a window of size n_fft = 2048 over it, advancing by a hop length of 512 samples each time.
• Compute the FFT (Fast Fourier Transform) of each window, converting it from the time domain to the frequency domain.
• Construct a Mel filter bank by dividing the full frequency range into n_mels = 128 bands equally spaced on the Mel scale, i.e. perceived as equidistant by the human ear.
• For each window, build the spectrogram column by decomposing the signal magnitude into its components at the Mel-scale frequencies.
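The four steps above can be sketched in plain NumPy. This is a minimal illustration of the standard triangular mel filter-bank construction, not the exact routine used in the experiments (a library such as librosa would normally be used instead):

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_spectrogram(y, sr=22050, n_fft=2048, hop=512, n_mels=128):
    # 1) Slide a window of n_fft samples over the signal with a hop of 512.
    frames = np.array([y[i:i + n_fft]
                       for i in range(0, len(y) - n_fft + 1, hop)])
    window = np.hanning(n_fft)
    # 2) FFT of each window: time domain -> frequency domain (power spectrum).
    power = np.abs(np.fft.rfft(frames * window, axis=1)) ** 2  # (T, n_fft//2+1)
    # 3) Triangular filter bank: n_mels bands equally spaced on the mel scale.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        lo, c, hi = bins[m - 1], bins[m], bins[m + 1]
        for k in range(lo, c):
            fbank[m - 1, k] = (k - lo) / max(c - lo, 1)   # rising slope
        for k in range(c, hi):
            fbank[m - 1, k] = (hi - k) / max(hi - c, 1)   # falling slope
    # 4) Project each power spectrum onto the mel bands -> (n_mels, T).
    return fbank @ power.T

sr = 22050
t = np.arange(sr) / sr
S = mel_spectrogram(np.sin(2 * np.pi * 440 * t))
print(S.shape)  # (128, 40): 128 mel bands x 40 windows for 1 s of audio
```

In practice the result is converted to decibels and saved as an image, which then serves as the CNN input.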
The Mel spectrogram thus produces a time-frequency representation of a sound that simulates the human auditory system.
A CNN (Convolutional Neural Network) is a class of neural network that can extract a raga's specific characteristics from the prominent pitch values of a song.
CNN Architecture
We propose a CNN-based approach to extracting features from an image representation of the music audio. The model is built from convolutional and max-pooling layers, followed by a softmax output layer, as explained in the
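A minimal Keras sketch of this convolution + max-pooling + softmax design. The input shape, the number of layers, and the filter counts are our assumptions for illustration; the paper's exact architecture is given in its figure:

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

def build_model(input_shape=(128, 128, 1), n_ragas=5):
    """Stacked Conv2D/MaxPooling2D blocks feeding a softmax classifier,
    one output probability per raga class."""
    return keras.Sequential([
        keras.Input(shape=input_shape),          # e.g. a 128x128 spectrogram image
        layers.Conv2D(32, 3, activation="relu"),
        layers.MaxPooling2D(2),
        layers.Conv2D(64, 3, activation="relu"),
        layers.MaxPooling2D(2),
        layers.Flatten(),
        layers.Dense(64, activation="relu"),
        layers.Dense(n_ragas, activation="softmax"),
    ])

model = build_model()
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# A forward pass on a dummy input returns one probability per raga.
probs = model.predict(np.zeros((1, 128, 128, 1), dtype=np.float32), verbose=0)
print(probs.shape)  # (1, 5)
```

With five ragas in the dataset, the softmax layer has five units, and the predicted raga is the argmax of the output probabilities.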
This section presents a comparison of various studies.
| Sr. No. | Authors | Classifier | Feature / Method | Dataset | Accuracy |
|---|---|---|---|---|---|
| 1. | Anand A. | CNN | Pitch values method | Carnatic | 96% |
| 2. | Joshi D., Pareek J., Ambatkar P. | KNN; SVM | MFCC | Hindustani | 98%; 95% |
| 3. | Vishnupriya S., Meenakshi K. | CNN | MFCC; Mel spectrogram | Music genre dataset | 47%; 76% |
| 4. | John S., Sinith M., RS S., PP L. | CNN | Pitch detection algorithm | Carnatic | 94% |
| 5. | Shah D., Jagtap N., Talekar P., Gawande K. | CNN | Spectrogram | Hindustani | 98.98% |
| 6. | Bidkar A., Deshpande R., Dandawate Y. | Ensemble bagged tree; Ensemble KNN | MFCC | Hindustani | 96.32%; 95.83% |
| 7. | Patil N., Nemade M. | KNN; Linear-kernel SVM; Poly-kernel SVM | MFCC | GTZAN dataset | 64%; 60%; 78% |
| 8. | Hebbar D., Jagtap V. | 1-D CNN; 2-D CNN; LSTM; ANN | MFCC; Mel spectrogram | Carnatic (pairs of ragas) | 97.4%; 98.1%; 97.54%; 97% |
| 9. | Ghosal D., Kolekar M. | CNN-LSTM | Mel spectrogram | GTZAN dataset | 94.2% |
| 10. | Dalmazzo D., Ramirez R. | 1-D CNN; 2-D CNN; CNN-LSTM | Mel spectrogram | Professional violinists dataset | 95.16%; 84.30%; 97.47% |
| 11. | Rajan R., Sreejith S. | CNN | Mel spectrogram | Carnatic YouTube dataset | F1 measure of 0.61 |
| 12. | Phulmante V., Bidkar A., Mundada Y., Kulkarni P. | CNN | Spectrogram; MFCC; Chroma STFT | GTZAN dataset | 91%; 72%; 57% |
From the table, it is clear that these authors used a range of different datasets, whereas our dataset is entirely distinct. When the CNN was run on our dataset with both the MFCC and Mel-spectrogram feature extraction approaches, we obtained the confusion matrices shown in
It shows the comparison between the CNN with MFCC and the CNN with Mel spectrogram. The overall performance of the CNN for all five ragas, in terms of training accuracy, validation accuracy, and testing accuracy, is illustrated diagrammatically in
For Indian music information retrieval systems, raga identification is a crucial stage. Identifying a raga involves recognising its distinctive notes and their prescribed order. In this paper, we have demonstrated an automatic technique for classifying and identifying selected ragas. The goal of this study was to compare two state-of-the-art approaches side by side: CNN with MFCC and CNN with Mel spectrogram. We conclude that a model as described here can automatically classify and identify ragas, and that it performs noticeably better with the Mel spectrogram. Our research approach is distinctive in the variety of ragas selected for the dataset (Yaman, Bhairav, Bhairavi, Multani, and Dhanashree), in extracting features with both the MFCC and Mel-spectrogram methods, and in using a CNN to compare the two.
The study also carries the potential for further investigation: additional ragas can be covered and the dataset enlarged to achieve even higher performance, and the approach can be combined with different algorithms. In future work, researchers could also apply a one-dimensional CNN directly to the raga signal to perform raga classification.