I'm new to audio analysis, but need to perform a (seemingly) simple task. I have a byte array containing a 16 bit recording (single channel) and a sample rate of 44100. How do I perform a quick analysis to get the volume at any given moment? I need to calculate a threshold, so a function to return true if it's above a certain amplitude (volume) and false if not. I thought I could iterate through the byte array and check its value, with 255 being the loudest, but this doesn't seem to work as even when I don't record anything, background noise gets in and some of the array is filled with 255. Any suggestions would be great.
Thanks
As you have 16-bit data, you should expect the signal to vary between -32768 and +32767.
To calculate the volume you can take intervals of say 1000 samples, and calculate their RMS value. Sum the squared sample values divide by 1000 and take the square root. check this number against you threshold.
Typically one measures the energy of waves using root mean square.
If you want to be more perceptually accurate you can take the time-domain signal through a discrete fourier transform to a frequency-domain signal, and integrate over the magnitudes with some weighting function (since low-frequency waves are perceptually louder than high-frequency waves at the same energy).
But I don't know audio stuff either so I'm just making stuff up. ☺
I might try applying a standard-deviation sliding-window. OTOH, I would not have assumed that 255 = loudest. It may be, but I'd want to know what encoding is being used. If any compression is present, then I doubt 255 is "loudest."
Related
I need to create a sound analyzer to isolate certain song frequencies. For now, I'm interested in bass (60-250Hz).
I read the signal (IEEE float), for each block of 1024: do a FFT, and then extract the value corresponding to each frequency.
What I don't understand is this: I know FFT needs powers of 2 in order to work. I've seen code using blocks of 512, code using 2048, 4096 and so on.
I've settled on 1024 (which gives me roughly 47 datapoints/second). Am I correct in assuming that using, 2048, for instance will work just the same, giving me 23.5 datapoints/second, and the only difference is accuracy (and speed of computation of course)?
Also, am I required to read at 1024-boundary blocks? Like, for instance, say I simply skip the first 200 floats, will the results end up being very similar? (my tests seem to say yes)
LATER EDIT: updated title to make it easier to understand
1024/48kHz is barely longer than one period of a 60 Hz signal. Too short to determine if the signal is even fully periodic (repeats). Humans typically require somewhere around 6 periods of repetitions to hear a sound as a having a definite pitch.
60 Hz is B1. You might need 2 Hz resolution to separate B1 from C1 with a clear gap in between the two nearest FFT frequency bins. To do that, just using FFT magnitude results, would require an FFT of 48kHz/2Hz or a half second, or longer. The nearest power of 2, for 48ksps samples, is 32768.
For music pitch frequencies, there are much better pitch detector/estimators than using a bare FFT or FFT frequency peak magnitude, as they solve the missing or weak fundamental issue common in recorded instrumental or vocal music. Those pitch estimators can work with shorter time interval windows than a half second, but require more computation than a bare FFT magnitude peak picking.
I am developing a little application in Visual Studio 2010 in C# to draw a spectrogram (frequency "heat map").
I have already done the basic things:
Cut a rectangular windowed array out of the input signal array
Feed that array into FFT, which returns complex values
Store magnitude values in an array (spectrum for that window)
Step the window, and store the new values in other arrays, resulting in a jagged array that holds every step of windowing and their spectra
Draw these into a Graphics object, in color that uses the global min/max values of the heat map as relative cold and hot
The LEFT side of the screenshot shows my application, and on the RIGHT there is a spectrogram for the same input (512 samples long) and same rectangular window with size 32 from a program called "PAST - time series analysis" (https://folk.uio.no/ohammer/past/index.html). My 512 long sample array only consists of integer elements ranging from around 100 to 1400.
(Note: the light-blue bar on the very right of the PAST spectrogram is only because I accidentally left an unnecessary '0' element at the end of thats input array. Otherwise they are the same.)
Link to screenshot: https://drive.google.com/open?id=1UbJ4GyqmS6zaHoYZCLN9c0JhWONlrbe3
But I have encountered a few problems here:
The spectrogram seems very undetailed, related to another one that I made in "PAST time series analysis" for reference, and that one looks extremely detailed. Why is that?
I know that for an e.g. 32 long time window, the FFT returns 32 elements, the 0. elem is not needed here, the next 32/2 elements have the magnitude values I need. But this means that the frequency "resolution" on the output for a 32 long window is 16. That is exactly what my program uses. But "PAST" program shows a lot more detail. If you look at the narrow lines in the blue background, you can see that they show a nice pattern in the frequency axis, but in my spectrogram that information remains unseen. Why?
In the beginning (windowSize/2) wide window step-band and the ending (windowSize/2) step-band, there are less values for FFT input, thus there is less output, or just less precision. But in the "PAST" program those parts also seem relatively detailed, not just stretched bars like in mine. How can I improve that?
The 0. element of the FFT return array (the so called "DC" element) is a huge number, which is a lot bigger than the sample average, or even its sum. Why is that?
Why are my values (e.g. the maximum that you see near the color bar) so huge? That is just a magnitude value from the FFT output. Why are there different values in the PAST program? What correction should I use on the FFT output to get those values?
Please share your ideas, if you know more about this topic. I am very new to this. I only read first about Fourier transform a little more than a week ago.
Thanks in advance!
To get more smoothness in the vertical axis, zero pad your FFT so that there are more (interpolated) frequency bins in the output. For instance, zero pad your 32 data points so that you can use a 256 point or larger FFT.
To get more smoothness in the horizontal axis, overlap your FFT input windows (75% overlap, or more).
For both, use a smooth window function (Hamming or Von Hann, et.al.), and try wider windows, more than 32 (thus even more overlapped).
To get better coloring, try using a color mapping table, with the input being the log() of the (non zero) magnitudes.
You can also use multiple different FFTs per graph XY point, and decide which to color with based on local properties.
Hello LimeAndConconut,
Even though I do not know about PAST, I can provide you with some general information about FFT. Here is an answer for each of your points
1- You are right, a FFT performed on 32 elements returns 32 frequencies (null frequency, positive and negative components). It means that you already have all the information in your data, and PAST cannot get more information with the same 32 sized window. That's why I suspect the data to be interpolated for plotting, but this just visual. Once again PAST cannot create more information than the one you have in your data.
2- Once again I agree with you. On the borders, you have access to less frequency components. You can decide different strategies: not show data at the borders, or extend this data with zero-padding or circular padding
3- The zero element of the FFT should be the sum of your 32 windowed array. You need to check FFT normalization, have a look at the documentation of your FFT function.
4- Once again check the FFT normalization. Since PAST colorbar exhibit negative values, it seems to be plotted in logarithmic scale. This is common usage to use logarithm for plotting data with high dynamics in order to enhance details.
How can we identify silence packet in a byte array (Buffer provided by WaveInEventArgs) using Naudio. basically I am trying to loop through the array and checking for 0 values in the array. Is that correct?
I'm not sure what you mean by "packet", but finding silence is usually a matter of looking for consecutive samples having an absolute value less than a "threshold" amount. 0.00006 is -84.437 dB, so silence detection can be done on most audio with that value (though you should feel free to adjust that threshold to fit your audio). Depending on exactly what you are doing, you'll want to see a sequence of anywhere from 440 to 48000 "silent" samples before deciding it's a silent "packet".
I have made some research but I couldn't find what I am exactly looking for. At the moment, I have to send channel values by com port.
For example:
the content of file freqs.ini
low=0-xx khz;
mid=xx-yy khz;
high=yy-zz khz;
Then I will get values by percentage like
the expecting values
lowPercent = 10;
midPercent = 77;
highPercent = 53;
So, I will be able to send these values by rs232 and my room will turn into club :) (I am using this code to illuminate LED strips). I have found some spectrum analyser projects but they all have 9 channels, that is, 3*3 combinations from low-low to high-high.
I know how to communicate with com port, but how can I get integer values of 3 frequency range I have set before?
I don't know if you still need of that but....
Do you want to know how to get a real-time spectral analsys of sound?
1.implement a queue to take a buffer of audio samples
2.take the product of buffer and a proper window function (tipically , Hamming or Hann) calculated by your program as float array
3.do FFT of yelded array: there are may algortihms out there for every language....find the best one for you, use it and take the square module from each output coefficent ( Real_part^2 + Imaginary_part^2 , if FFT returns you algebrical representation of coefficients)
sum coefficients across your bands: to know what coefficient is associated to a frequency you've just got to know that the k-th coefficient is at (SampFrequency/BufferLength)*k Hz.....so it's easy to find band boundaries
if you need to normalize in [0 , 1] interval, you have to do nothing but divide each of 3 yelded bands value for maximum value between the 3
pop your buffer queue by a Shift value that is Shift <= BufferLength and start again
the number of coefficients coming from FFT alg is equal to BufferLength (this is beacause the Discrete Fourier Transform definition) so, the frequency resolution is better when you select a long buffer, but the program goes slower. The light intensity wont' vary after BufferLength audio frames, buf after Shift audio frames.....and high ratio beetwen BufferLength gives you slowly light changes....so you must select parameters that fits your desires, remembering that you have just to turn on & off some light....make your alg fast and lo-fi!
The last thing to do is discover freqeuncy bands from your mixer's eq knobs....i don't remember if this information was on mixers handbooks
I am making a pitch detection program using fft. To get the pitch I need to find the lowest frequency that is significantly above the noise floor.
All the results are in an array. Each position is for a frequency. I don't have any idea how to find the peak.
I am programming in C#.
Here is a screenshot of the frequency analysis in audacity.
Instead of attempting to find the lowest peak, I would look for a fundamental frequency which maximizes the spectral energy captured by its first 5 integer multiples. Note that every peak is an integer multiple of the lowest peak. This is a hack of the cepstrum method. Don't judge :).
N.B. From your plots, I assume a 1024 sample window and 44.1kHZ sampling Rate. This yields a frequency granularity of only 44.1kHz/1024 = 43Hz. Given a 44.1kHz audio, I recommend using a longer analysis window of ~50 ms or 2048 samples. This would yield a finer frequency granularity of ~21 Hz.
Assuming a Matlab vector 'psd' of size 2048 with the PSD values.
% 50 Hz (Dude) -> 50Hz/44100Hz * 2048 -> ~2 Lower Lim
% 300 Hz (Baby) -> 300Hz/44100Hz * 2048 -> ~14 Upper Lim
lower_lim = 2;
upper_lim = 14
for fund_cand = lower_lim:1:upper_lim
i_first_five_multiples = [1:1:5]*fund_cand;
sum_energy = sum(psd(i_first_five_multiples));
end
I would find the frequency which maximizes the sum_energy value.
It would be easier if you had some notion of the absolute values to expect, but I would suggest:
find the lowest (weakest) value first. It is your noise level.
compute the average level, it is your signal strength
define some function to decide the noise threshold. This is the tricky part, it may require some experimentation.
In a bad situation, signal may be only 2 or 3 times the noise level. If the signal is better you can probably use a threshold of 2xnoise.
Edit, after looking at the picture:
You should probably just start at the left and find a local maximum. Looks like you could use 30 dB threshold and a 10-bin window or something.
Finding the lowest peak won't work reliably for estimating pitch, as this frequency is sometimes completely missing, or down in the noise floor. For better reliability, try another algorithm: autocorrelation (AMDF, ASDF lag), cepstrum (FFT log FFT), harmonic product spectrum, state space density, and variations thereof that use neural nets, genetic algorithms or decision matrices to decide between alternative pitch hypothesis (RAPT, YAAPT, et.al.).
Added:
That said, you could guess a frequency, compute the average and standard deviation of spectral magnitudes for, say, a 2-to-1 frequency range around your guess, and see if there exists a peak significantly above the average (2 sigma?). Rinse and repeat for some number of frequency guesses, and see which one, or the lowest of several, has a peak that stands out the most from the average. Use that peak.