Abstract
A speech/music/silence discrimination and gender detection algorithm is presented in this document. First, silence segments are extracted from the audio stream using energy and Zero Crossing Rate (ZCR) features. Speech segments are then detected using the energy envelope and harmonic features, and music segments are classified using the same features. For gender detection, we propose a feature that discriminates between men's and women's voices. The proposed algorithm needs no training phase, unlike Gaussian Mixture Model based algorithms, and it classifies the audio stream into four classes (speech, music, silence, and else) with a delay of 1 s. Once speech is extracted, gender detection can be applied, and the detection can run in real time. To evaluate the proposed algorithm, we applied it to 10 min of audio extracted from CNN programs; 93% classification accuracy for speech and 84% for music were achieved, along with 80% gender detection accuracy in speech segments.
Introduction
With more and more digital video data added and archived every day, and with the digitization of existing archives, the amount of video data available today is massive. Searching for segments of interest within video programs, such as broadcast news, TV programs, or scenes in a film, is very hard, while manual indexing of video programs is costly and slow. Clearly, powerful automatic methods for content indexing are needed. Since a video program is usually composed of a visual channel and one or several audio channels, an automatic video segmentation process should rely on both visual and audio channel analysis. Until now, most video segmentation techniques have been based on visual analysis; here we investigate the audio part of a video for indexing purposes. A very basic segmentation within the audio channel is speech / music / silence / noise discrimination, which helps improve scene segmentation when combined with visual based segmentation techniques. Other motivations for audio channel segmentation and classification include:
1- Giving more reliable data (speech only) to the ASR (Automatic Speech Recognizer), which reduces word error rates, out-of-vocabulary errors and computation on non-speech data. Also, when reliable text spoken in a video (ASR output) is available, we can extract knowledge from this text and combine it with knowledge extracted from image processing.
2- Giving meaningful descriptors such as music, silence, speech, etc., as in the MPEG-7 specifications.
3- By extracting speech we can apply speaker recognition techniques for identifying and tracking specific speakers in a video, which is a very important descriptor for content indexing.
4- Improving audio coding, by decreasing the bit rate for silence segments, which are already extracted reliably.
Humans classify and segment audio every day without considerable effort. In fact, recognizing audio classes and discriminating between music, speech, and silence is one of many pattern recognition tasks handled all the time by humans. Solving the audio recognition problem, as with all pattern recognition problems, is done efficiently and reliably by our brain and nervous system, with an efficiency that the most powerful computers, using current state-of-the-art pattern recognition algorithms, are far from reaching. What makes audio recognition problems such as music / speech discrimination hard is that finding features or mathematical models that efficiently describe all the variability of the classes is not obvious. Features used for audio recognition are low level features, such as MFCC, cepstrum or DFT features. These features perform well for speech recognition, where small variations are important for phoneme recognition. But when the classes are music, speech or noise, where the definition of the classes is more general, we need some kind of high level features that can describe a class in a general way.
Approaches proposed in the literature for audio segmentation and classification can be summarized as follows:
1. Model-based approaches: models for each audio class are created, such as Gaussian Mixture Models. These models are in general based on low level features, such as cepstrum or MFCC. After a training phase, audio can be classified as one of these classes, with the advantage of classifying and segmenting at the same time. The drawbacks are the data and time needed for training.
2. Metric-based segmentation: segmentation is done using distances between neighboring windows. Different acoustic features can be used to create the acoustic vectors, such as auto-regressive Gaussian model parameters or DFT parameters. The advantage of this approach is that no prior knowledge of the audio classes is needed, and it can be applied for real time segmentation. But it is sensitive to local changes in acoustic features, and it gives no information about the audio segments.
3. Rule-based approaches: rules describing each class are created [Khou]. These rules are based on high and low level features.
4. Decoder-based approaches: the Hidden Markov Models (HMM) of a speech recognition system are used. The HMMs are trained to give the class of the audio signal [Kubala].
Here we propose a set of features, or combinations of features, that aims to describe the music, speech and silence classes in a general manner. Using these features we propose an algorithm that discriminates between our audio classes in an IF-THEN fashion. We then propose a parameter that can discriminate between men's and women's voices.
Speech, Music, Silence Properties
Before segmentation and classification, a study of the properties of audio classes is essential.
Silence
Silence is defined as an audio signal that is not perceptible to the human ear. Normally the energy level of silence is relatively low, so energy thresholding could extract silence segments. However, other types of audio segments, such as low energy music or speech, would also be classified as silence if the energy level were used alone. Fortunately, the Zero Crossing Rate (ZCR) for silence is generally lower than the ZCR of other types of audio. Intuitively, combining these two features, energy and ZCR, should improve the accuracy of the silence detection process.
Let S_w = E_w · ZCR_w, where E_w and ZCR_w are respectively the normalized energy and the Zero Crossing Rate of the window w. Thresholding S_w extracts silence segments.
Zero crossing detection is the most common method for measuring the frequency or the period of a periodic signal. When measuring the frequency of a signal, the number of cycles of a reference signal is usually counted over one or more periods of the signal being measured.
Fig. 2.1 A signal with background noise and music (a), silence detection using our variable S (b), and silence detection by simply thresholding the energy (c)
In Fig. 2.1, we show a signal containing background noise and music. Using the energy feature alone, we risk detecting a large part of the non-silence signal as silence. When using our variable S, based on both the energy and the ZCR, we minimize this risk.
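As an illustration, a minimal Python sketch of this silence detector is given below. It assumes 10 ms non-overlapping frames of a mono signal and an empirically chosen threshold on S_w; the frame length, sampling rate and threshold value are assumptions for illustration, not values taken from the report.

import numpy as np

def frame_signal(x, sr, frame_ms=10):
    """Split a mono signal into non-overlapping frames of frame_ms milliseconds."""
    n = int(sr * frame_ms / 1000)
    n_frames = len(x) // n
    return x[:n_frames * n].reshape(n_frames, n)

def silence_feature(frames):
    """Compute S_w = E_w * ZCR_w per frame, with the energy normalized to [0, 1]."""
    energy = np.sum(frames.astype(float) ** 2, axis=1)
    energy = energy / (energy.max() + 1e-12)                       # normalized energy E_w
    signs = np.sign(frames)
    zcr = 0.5 * np.mean(np.abs(np.diff(signs, axis=1)), axis=1)    # ZCR_w in [0, 1]
    return energy * zcr                                            # S_w

def detect_silence(x, sr, threshold=1e-4):
    """Return a boolean array marking 10 ms frames detected as silence (assumed threshold)."""
    frames = frame_signal(x, sr)
    return silence_feature(frames) < threshold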
Speech
Speech is defined as a series of words spoken in a continuous fashion; other vocal sounds, such as a shouting person, are considered non-speech. One can expect higher energy levels when words are spoken, and lower energy in the intervals between words.
Two things characterize speech:
First, an alternation of the energy between peaks, corresponding to words, and almost zero energy, corresponding to inter-word silence. This characteristic is helpful for detecting speech signals. We propose to count, in a window of 1 s, the number of times the energy falls below the silence level. We call this feature the Silence Crossing Rate (SCR). Experiments show that the SCR is between 5 and 10 for speech; for other types of audio, such as music, it is higher or lower.
Fig. The shape of the energy envelope for speech and music (bottom); the distribution of energy peaks is visible in the speech signal
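A minimal sketch of the SCR feature follows. It assumes per-frame energies computed on 10 ms frames (as in the previous sketch) and an empirically chosen silence level; both the frame rate and the silence level are assumptions for illustration.

import numpy as np

def silence_crossing_rate(frame_energy, silence_level, frames_per_second=100):
    """Count, for each 1 s window, how many times the energy falls below the silence level.

    frame_energy: per-frame energies on 10 ms frames (100 frames per second assumed).
    Returns one SCR value per full second of audio.
    """
    below = frame_energy < silence_level
    # a "crossing" is a transition from above the silence level to below it
    crossings = (~below[:-1]) & below[1:]
    n_sec = len(frame_energy) // frames_per_second
    scr = [int(np.sum(crossings[i * frames_per_second:(i + 1) * frames_per_second]))
           for i in range(n_sec)]
    return np.array(scr)

# per the report, speech windows typically give SCR values between 5 and 10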
Another important characteristic of speech is that Frequency Tracking gives short, dispersed segments.
The Frequency Tracking is as follows:
In the spectrum of the signal, we track the five most important frequencies in the DFT (Discrete Fourier Transform) coefficient vectors. Tracking means checking whether these frequencies remain the most important frequencies, within a 100 Hz band, in the next DFT vector. When a frequency cannot be tracked, we signal a cut. The number of cuts in a window of 1 s is the Frequency Tracking (FT) feature. For speech signals, the FT gives short, dispersed segments, as shown below:
Fig. Spectrum and frequency tracking for speech signals
Experiments show that the FT for speech signals is clearly higher than for music. For music, the changes in the spectrum are generally smooth, and the FT gives long parallel segments, as shown below:
Fig. Spectrum and frequency tracking for music signals
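A minimal sketch of the Frequency Tracking feature is given below. The 100 Hz tolerance and the five tracked frequencies come from the description above, while the 10 ms frame rate and the way magnitude DFT vectors are provided are assumptions for illustration.

import numpy as np

def frequency_tracking(spectra, freq_resolution_hz, band_hz=100.0,
                       n_peaks=5, frames_per_second=100):
    """Count, per second, the 'cuts': tracked peak frequencies that disappear from the
    top peaks of the next DFT vector (within a +/- band_hz tolerance).

    spectra: array of shape (n_frames, n_bins) of DFT magnitudes, one row per 10 ms.
    Returns one FT value per full second of audio.
    """
    tol_bins = int(round(band_hz / freq_resolution_hz))
    cuts = np.zeros(len(spectra) - 1, dtype=int)
    for t in range(len(spectra) - 1):
        peaks_now = np.argsort(spectra[t])[-n_peaks:]
        peaks_next = np.argsort(spectra[t + 1])[-n_peaks:]
        for p in peaks_now:
            # a cut is signalled when the peak cannot be found near any next-frame peak
            if not np.any(np.abs(peaks_next - p) <= tol_bins):
                cuts[t] += 1
    n_sec = len(cuts) // frames_per_second
    ft = [int(cuts[i * frames_per_second:(i + 1) * frames_per_second].sum())
          for i in range(n_sec)]
    return np.array(ft)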
A feature based on the combination of the FT and the SCR can now discriminate between music and speech.
Let P_w = g(SCR_w) + f(FT_w), where SCR_w and FT_w are the Silence Crossing Rate and the Frequency Tracking of the window w.
The functions f(.) and g(.) map the real numbers to the [0, 1] interval. These functions could be linear, and they are defined after experiments so as to maximize P_w for speech segments. The same can be done for music, by defining a feature M_w:
M_w = g1(SCR_w) + f1(FT_w)
where the functions are chosen so as to maximize M_w for music.
A basic definition of the functions g(.) and f(.) is as follows:
g(x) = 1 if 4 < x < 11, and 0 otherwise
f(x) = 1 if x > 75, and 0 otherwise
g1(x) = 1 - g(x)
f1(x) = 1 - f(x)
Finally, if P_w = 2 the window is classified as a speech segment; if M_w = 2 it is classified as a music segment; otherwise it is an "else" segment.
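A minimal sketch of this rule-based decision, using the basic threshold functions above, is shown below; the SCR and FT values per 1 s window are assumed to come from the earlier sketches.

def g(x):  return 1 if 4 < x < 11 else 0      # 1 when the SCR is in the speech range
def f(x):  return 1 if x > 75 else 0          # 1 when the FT is high, as for speech
def g1(x): return 1 - g(x)                    # complementary function for music
def f1(x): return 1 - f(x)                    # complementary function for music

def classify_window(scr, ft):
    """Classify one 1 s window as 'speech', 'music' or 'else' from its SCR and FT."""
    p = g(scr) + f(ft)     # speech score P_w
    m = g1(scr) + f1(ft)   # music score M_w
    if p == 2:
        return "speech"
    if m == 2:
        return "music"
    return "else"

# usage: classify_window(scr=7, ft=120) returns 'speech'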
Gender detection
In videos, a meaningful descriptor that can help improve scene detection using the visual stream is the gender of the speaker. We are interested in discriminating between men and women in the speech segments already extracted. Humans discriminate between men and women according to frequency: women speak with higher fundamental frequencies than men, and the ZCR for a woman's voice is higher than that for a man's voice.
Another important difference between men's and women's voices is the center of gravity of the spectrum: for men's voices it is close to the low frequencies, and for women's voices it lies at higher frequencies. The center of gravity of an acoustic vector can be calculated as follows. Let X be the acoustic vector containing the frequency coefficients, and G the center of gravity of this vector:
G = (Σ_f f · X_f) / (Σ_f X_f)
where X_f is the coefficient corresponding to the frequency f in the vector X, and f is the frequency index.
Fig. 3.2 Frequency distribution for a woman's voice and a man's voice
Fig. 3.2 shows the frequency distribution for women's and men's voices. The frequency distribution for men's voices is clearly concentrated at lower frequencies than that of women's voices.
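As an illustration, a minimal Python sketch of this spectral center of gravity is shown below; it assumes a magnitude DFT vector and its frequency axis as inputs.

import numpy as np

def spectral_center_of_gravity(X, freqs):
    """Center of gravity G of an acoustic vector X of DFT magnitudes.

    G = sum_f f * X_f / sum_f X_f, where freqs gives the frequency of each coefficient.
    Lower values are expected for men's voices, higher values for women's voices.
    """
    X = np.asarray(X, dtype=float)
    freqs = np.asarray(freqs, dtype=float)
    return float(np.sum(freqs * X) / (np.sum(X) + 1e-12))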
The proposed variable W for discriminating between men's and women's voices is the following:
W is calculated for every acoustic vector in a speech segment, and refined every 1 s using the mean ZCR. This variable should be higher for men's voices.
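The exact expression of W is not reproduced here. The sketch below is only an illustrative assumption: it combines the per-second mean of the spectral center of gravity with the mean ZCR, inverted so that the score is higher for men's voices (lower centroid, lower ZCR); the combination and the threshold are hypothetical, not the report's formula.

import numpy as np

def gender_score(G_per_frame, zcr_per_frame):
    """Illustrative per-second gender score, higher for men's voices.

    G_per_frame: spectral centers of gravity of the acoustic vectors in one second.
    zcr_per_frame: ZCR values of the same frames.
    NOTE: this combination is an assumption for illustration, not the report's W.
    """
    g_mean = np.mean(G_per_frame)
    zcr_mean = np.mean(zcr_per_frame)
    return 1.0 / (g_mean * zcr_mean + 1e-12)

def is_male(G_per_frame, zcr_per_frame, threshold=1.0):
    """Hypothetical decision: scores above the threshold are labeled 'man'."""
    return gender_score(G_per_frame, zcr_per_frame) > threshold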
Discrimination algorithm
1- Every 10 ms, an acoustic vector containing the DFT coefficients for frequencies from 100 Hz to 4 kHz is extracted. The energy is then calculated as E_t = Σ_f |X_t(f)|², the sum of the squared magnitudes of the DFT coefficients, where t indicates that the energy is calculated for every 10 ms acoustic vector.
2- Every 10 ms, calculate the ZCR as ZCR_t = (1/2N) Σ_n |sgn(x(n)) - sgn(x(n-1))| w(t-n), where w(n) is a rectangular window of length N, 10 ms in our case.
3- Extract silence segments using the feature S.
4- Discard silence segments from further processing.
5- Extract speech segments using the feature P.
6- Discriminate between men's and women's voices in speech segments, using the feature W.
7- Discard speech segments from further processing.
8- Extract music segments using the feature M.
9- Label unclassified segments as "else".
The segmentation of the audio stream is done after this classification: segment boundaries are placed wherever there is a class change.
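A minimal sketch of this boundary placement from per-second labels is shown below; the labels are assumed to come from the classification steps above.

def segments_from_labels(labels):
    """Turn a list of per-second class labels into (start_second, end_second, label)
    segments, placing a boundary at each class change."""
    segments = []
    start = 0
    for t in range(1, len(labels) + 1):
        if t == len(labels) or labels[t] != labels[start]:
            segments.append((start, t, labels[start]))
            start = t
    return segments

# usage: segments_from_labels(['silence', 'speech', 'speech', 'music'])
# returns [(0, 1, 'silence'), (1, 3, 'speech'), (3, 4, 'music')]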
Results
To evaluate the proposed algorithm, we used it to classify 10 min of audio data collected from CNN programs and 10 min from the film "Un indien dans la ville". These streams contain music segments, speech (spoken by men and women), silence and noise.
Results are listed in Table 5.1.
The insertion rate is the percentage of non-speech segments classified as speech; the deletion rate is the percentage of speech segments classified as non-speech.
The accuracy for music classification was measured on different segments extracted from songs and from instrumental music.
Conclusion
First, a study of the audio classes we use (speech, music and silence) was carried out. We proposed a set of features that reflect the properties of the defined audio classes; by thresholding these features, a combined classification and segmentation can be done.
We showed that our method for silence detection is more robust to noise and background music. The problem of gender detection was then presented, and a feature for this task was proposed.
We showed in our experiments that the proposed features can discriminate between our audio classes.
With no prior training phase, unlike GMM-based algorithms, and by simply thresholding the proposed features, about 90% classification accuracy is achieved. The remaining problem with this algorithm is its reliance on thresholds. As future work, these features will be modeled, for example by GMMs, to eliminate the use of thresholds and to increase robustness. Speaker detection, and the merging of audio information with existing visual analyses to improve scene detection in video programs, will also be investigated.