Improving the accuracy of a voice detection algorithm - C#

I have a mono audio file of people talking, pausing, and then talking again. Whether or not they are talking, there is occasional background noise: children crying, car brakes squealing, the usual things you hear outdoors.
My goal is to keep the parts where they are talking and cut out the parts where they are not. It is not necessary to filter out the background noise.
Basically, my final goal is to have a cut list like this:
Start in seconds, End in seconds
What have I tried?
I manually created a voice-only file by splicing together all of the parts that contain speech (10 seconds).
I manually created a noise-only file by splicing together all of the parts that do not contain speech (50 seconds).
I got the frequencies and their amplitudes by applying a Fast Fourier Transform.
I walk through the audio file every 100 ms and take an FFT snapshot.
I put all the values of one snapshot (in my case 512) into a List and feed it to a machine learning algorithm (numl) together with a label (voice = true for the speech file, voice = false for the noise file).
Then I walk through my main audio file the same way, but this time use the trained model to decide whether each snapshot is speech or not, and output the time in seconds at which it decides this.
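For reference, the feature-extraction loop described above might look roughly like the sketch below. NAudio (AudioFileReader and NAudio.Dsp.FastFourierTransform) is assumed here purely for illustration; the question does not say which audio/FFT library is used, and the numl training step is omitted.

```csharp
// Rough sketch of the per-frame feature extraction described above: step through
// the file in 100 ms hops, take a 1024-point FFT of each hop and keep 512
// magnitude bins as the feature vector. NAudio is an assumption here, and the
// numl training part is not shown.
using System;
using System.Collections.Generic;
using NAudio.Dsp;
using NAudio.Wave;

class FftFeatures
{
    public static List<double[]> Extract(string path)
    {
        var snapshots = new List<double[]>();
        using (var reader = new AudioFileReader(path))       // mono file, samples as floats
        {
            int hop = reader.WaveFormat.SampleRate / 10;      // 100 ms
            const int fftSize = 1024;                         // 2^10, yields 512 usable bins
            var frame = new float[hop];

            while (reader.Read(frame, 0, hop) == hop)
            {
                var fft = new Complex[fftSize];
                for (int i = 0; i < fftSize; i++)
                {
                    float sample = i < hop ? frame[i] : 0f;   // zero-pad if the hop is shorter
                    fft[i].X = sample * (float)FastFourierTransform.HammingWindow(i, fftSize);
                    fft[i].Y = 0f;
                }
                FastFourierTransform.FFT(true, 10, fft);      // 10 = log2(1024)

                var magnitudes = new double[fftSize / 2];     // 512 values per snapshot
                for (int i = 0; i < magnitudes.Length; i++)
                    magnitudes[i] = Math.Sqrt(fft[i].X * fft[i].X + fft[i].Y * fft[i].Y);
                snapshots.Add(magnitudes);
            }
        }
        return snapshots;
    }
}
```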
My problem is that I get a lot of false positives and false negatives. It seems to recognize voice when there is none and vice versa.
Is the likely reason a badly trained model (I use a decision tree), or do I need to take other measures to get a better result?

A common misconception about speech is to treat it as an unrelated sequence of data frames. The core property of speech is that it is a continuous process in time, not just an array of independent data points.
Any reasonable VAD should take that into account and use time-aware classifiers such as HMMs. In your case, any classifier that takes time into account, be it a simple energy-based voice activity detector that monitors the background level or a GMM-HMM-based VAD, will do much better than any static classifier.
For descriptions of the simple algorithms, you can check Wikipedia.
If you are looking for a good, sophisticated VAD implementation, you can find one in the WebRTC project; this VAD was developed by Google:
https://code.google.com/p/webrtc/source/browse/trunk/webrtc/common_audio/vad/
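To illustrate the simpler end of that spectrum, below is a rough C# sketch of an energy-based VAD that takes time into account: it tracks a slowly adapting noise floor and applies a short hangover so brief dips inside speech are not cut. The frame length, threshold factor, and hangover count are arbitrary assumptions rather than values from this answer; the output matches the "start in seconds, end in seconds" cut list asked for. The WebRTC VAD linked above is far more robust; this only shows the basic idea.

```csharp
// Sketch of an energy-based VAD that takes time into account: it tracks a
// slowly rising noise floor and applies a hangover of a few frames so that
// short dips inside speech are not cut. Frame length (100 ms), the threshold
// factor and the hangover count are arbitrary assumptions.
using System;
using System.Collections.Generic;

static class EnergyVad
{
    // samples: mono audio in [-1, 1]; returns (start, end) pairs in seconds.
    public static List<(double Start, double End)> Detect(float[] samples, int sampleRate)
    {
        int frameSize = sampleRate / 10;            // 100 ms frames, as in the question
        double noiseFloor = 1.0;                    // running estimate of background level
        const int hangoverFrames = 3;               // keep "speech" on briefly after it drops
        int hangover = 0;

        var segments = new List<(double, double)>();
        double? segmentStart = null;

        for (int offset = 0; offset + frameSize <= samples.Length; offset += frameSize)
        {
            double sum = 0;
            for (int i = 0; i < frameSize; i++)
                sum += samples[offset + i] * samples[offset + i];
            double rms = Math.Sqrt(sum / frameSize);

            // Track the quietest recent level; let the floor creep up slowly so it can adapt.
            noiseFloor = Math.Min(noiseFloor * 1.001, Math.Max(rms, 1e-6));

            bool speech = rms > noiseFloor * 3.0;   // "well above background" (arbitrary factor)
            if (speech) hangover = hangoverFrames;
            else if (hangover > 0) { hangover--; speech = true; }

            double time = (double)offset / sampleRate;
            if (speech && segmentStart == null) segmentStart = time;
            if (!speech && segmentStart != null)
            {
                segments.Add((segmentStart.Value, time));
                segmentStart = null;
            }
        }

        if (segmentStart != null)
            segments.Add((segmentStart.Value, (double)samples.Length / sampleRate));
        return segments;
    }
}
```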

Related

How do I transpose the pitch of a .wav in .NET C#

I'm trying to create a small utility application (console app or WinForms) in C# that will load a single-cycle waveform. The user can then enter a chord, and the app will generate a "single-cycle chord", i.e. a perfectly loopable version of the chord.
A perfect fifth (not a chord, but two notes), for instance, would be the original single cycle looped twice, mixed with a second copy of the single cycle transposed 7 semitones up and looped 3 times in the same timeframe.
What I can't find is how to transpose the wave simply by playing it faster, just like most basic samplers do. What would be a good way to do this in C#?
I've tried NAudio and CSCore, but can't find how to play the wave at a different pitch by playing it faster. The pitch-shift and varispeed examples are not what I'm looking for, because they either try to keep the length the same or try to keep the pitch the same.
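"Playing it faster" can be done on the raw samples without any library support, by resampling the single cycle at a factor of 2^(semitones/12). The sketch below is plain C# on a float buffer, not tied to NAudio or CSCore, and ignores anti-aliasing; pitch and length change together, which is exactly the sampler-like behaviour described.

```csharp
// Naive sampler-style transposition: read the source wave faster by a factor of
// 2^(semitones/12), using linear interpolation between samples. Pitch and length
// change together. Sketch only; input is assumed to be one channel of floats.
using System;

static class Sampler
{
    public static float[] Transpose(float[] singleCycle, int semitones)
    {
        double rate = Math.Pow(2.0, semitones / 12.0);      // +7 semitones ≈ 1.498 (a fifth)
        int outLength = (int)(singleCycle.Length / rate);
        var output = new float[outLength];

        for (int i = 0; i < outLength; i++)
        {
            double pos = i * rate;                          // read position in the source
            int index = (int)pos;
            double frac = pos - index;
            float next = singleCycle[(index + 1) % singleCycle.Length]; // wrap: it's a loopable cycle
            output[i] = (float)((1 - frac) * singleCycle[index] + frac * next);
        }
        return output;
    }
}
```

For the perfect-fifth example, you would then loop the original twice and the transposed copy three times over the same number of output samples, and sum them sample by sample.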

How can I use my microphone to get the volume/amplitude or "loudness" in C#

I have been searching for answers for a long time now, but every solution I find seems too complex for what I want to do, or perhaps there is just no "easier" way of doing it.
What I want to do is simply use my system microphone to get the volume or loudness (or whatever it is called) in the room. Then according to that volume, I want to adjust my system volume so that the sound from my system always "sounds the same" (the same loudness), no matter if a train passes by or an airplane flies over.
How do I get this loudness or volume in my room into a C# application to use that to change my system volume?
I am using C# and a laptop with a built in microphone.
It is better to use a library to read the input from the microphone. NAudio is probably the best one.
Calibrate the input by determining the microphone gain (per MSalters' comment).
Every second, iterate over the waveform recorded in memory: square each amplitude (to get an energy), average the squared values, and take the square root of that (or the log, to convert to dB) (per MSalters' comment).
Depending on the result, set the system volume with the Windows API.
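A minimal sketch of the capture and RMS steps using NAudio's WaveInEvent is below. The sample rate, 16-bit mono format, and one-second buffer are assumptions; the gain calibration and the WinAPI volume-setting step are left out.

```csharp
// Sketch: capture the default microphone with NAudio and print an RMS level
// (and dB) roughly once per second. Format and buffer length are assumptions;
// setting the system volume from this value is not shown.
using System;
using NAudio.Wave;

class MicLevel
{
    static void Main()
    {
        var waveIn = new WaveInEvent
        {
            WaveFormat = new WaveFormat(44100, 16, 1),   // 16-bit mono
            BufferMilliseconds = 1000                    // roughly one reading per second
        };

        waveIn.DataAvailable += (s, e) =>
        {
            double sum = 0;
            int samples = e.BytesRecorded / 2;           // 16-bit samples
            for (int i = 0; i < samples; i++)
            {
                short sample = BitConverter.ToInt16(e.Buffer, i * 2);
                double normalized = sample / 32768.0;
                sum += normalized * normalized;          // square the amplitude
            }
            double rms = Math.Sqrt(sum / Math.Max(samples, 1));
            double db = 20 * Math.Log10(Math.Max(rms, 1e-10));
            Console.WriteLine($"RMS: {rms:F4}  ({db:F1} dB)");
        };

        waveIn.StartRecording();
        Console.ReadLine();                              // record until Enter is pressed
        waveIn.StopRecording();
    }
}
```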

Is there a way to get the raw data from a DVI port?

I have not seen anyone else trying to do this, so it is entirely possible I am approaching it the wrong way. Basically, I have a computer with a DVI input. If nothing is attached to the DVI input, then a program on the computer displays some images on screen. If an output source is connected to the DVI port, then my program should stop drawing images and use the DVI video feed instead.
What mechanisms exist to determine if a DVI input exists, and if there is currently a valid video signal present? How can I read the video stream?
Or am I going about this the completely wrong way?
At a hardware level, most video input subsystems, analog or digital, can detect the presence of an input signal, or at least something that has many of the characteristics of one.
For a digital standard, you have actual clocking data, either on its own wire or encoded in a serial data stream. A first test would be whether there appears to be a clock and whether its frequency is regular and reasonable (though for some standards, "reasonable" can cover a huge range of frequencies).
Next, video (not just digital, even analog) has a repeating structure of lines and fields, so there should be two identifiable submultiples of the pixel clock, one corresponding to the start or end of each line, and the other to the start or end of each field (screen). Again, these might have their own wires, might have unique means of encoding (special voltages in the analog case), or might represent time gaps in the pixel data. Even if there were no sync and no retrace times, statistical analysis of the pixel data would probably give clues to the X and Y dimensions as many features in the picture would repeat.
Actual video input subsystems (think flat-panel monitors) can have even more complicated detection and auto-adapting circuits: they may, for example, resample the input in time to change the dots-per-line resolution, or they may even put it in a frame buffer and scale it in both X and Y.
Which details of the inner workings of the video capture circuit are exposed to consumer, or even driver-level, software depends a lot on the specifics of the chipset used; hopefully a data sheet is available. It's pretty likely, though, that somewhere there is a readable register bit that indicates whether the input is capturing something the circuit "thinks" is a video signal. You might even be able to read out parameters such as the X and Y resolution and the scanning rates or pixel clock rate.
Similarly, the ability to get data out of the port would be chipset-dependent, but if the port is going to be useful for anything, there is presumably an operating system driver for it that provides some sort of useful API to video-consuming applications.

Detecting people crossing a line with OpenCV

I want to count the number of people crossing a line from either side. I have a camera placed on the ceiling, pointing at the floor where the line is (so the camera sees just the tops of people's heads; this makes it more of an object-detection problem than a people-detection one).
Is there any sample solution for this problem, or similar problems like it, that I can learn from?
Edit 1: More than one person may be crossing the line at any given moment.
If nothing but humans is likely to cross the line, then you do not need to detect people; you only have to detect motion.
There are several approaches to motion detection.
Probably the simplest one fits your goals: you calculate the difference between successive frames of the video stream, use it to determine a "motion mask", and thus detect a line-crossing event.
As an improvement on this "algorithm", you may consider the "running average" method.
To determine the direction of motion, you can use "motion templates".
To increase the accuracy of your detector, you may try a background subtraction technique (which, in turn, is not a simple solution), for example if there is a moving background that should be filtered out (e.g. using statistical learning).
All of the algorithms mentioned are included in the OpenCV library.
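To make the frame-differencing idea concrete, here is a rough C# sketch using the OpenCvSharp wrapper (the answer itself is not tied to any particular binding). The camera index, blur size, threshold, and the band around the line are all arbitrary assumptions, and disposal of intermediate Mats is omitted for brevity.

```csharp
// Frame-differencing sketch: the "motion mask" is the thresholded difference
// between successive frames, and counting non-zero pixels in a band around the
// line gives a crude line-crossing signal. OpenCvSharp is assumed; all numeric
// values below are placeholders to tune.
using OpenCvSharp;

class MotionMask
{
    static void Main()
    {
        using var capture = new VideoCapture(0);            // camera index is an assumption
        var prevGray = new Mat();
        var frame = new Mat();

        while (capture.Read(frame) && !frame.Empty())
        {
            var gray = new Mat();
            Cv2.CvtColor(frame, gray, ColorConversionCodes.BGR2GRAY);
            Cv2.GaussianBlur(gray, gray, new Size(5, 5), 0); // reduce noise before differencing

            if (!prevGray.Empty())
            {
                var diff = new Mat();
                Cv2.Absdiff(prevGray, gray, diff);           // difference of successive frames
                Cv2.Threshold(diff, diff, 25, 255, ThresholdTypes.Binary);

                // Look only at a horizontal band around the line (y-position is arbitrary here)
                var lineBand = new Mat(diff, new Rect(0, 230, diff.Width, 20));
                if (Cv2.CountNonZero(lineBand) > 500)        // arbitrary sensitivity
                    System.Console.WriteLine("motion on the line");
            }
            prevGray = gray;
        }
    }
}
```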
UPD:
how to compute motion mask
Useful functions for determining motion direction: cvCalcMotionGradient, cvSegmentMotion, cvUpdateMotionHistory (search the docs). The OpenCV library contains example code for motion analysis; see motempl.c.
advanced background subtraction from "Learning OpenCV" book
I'm not an expert in video-based CV, but if you can reduce the problem to a finite set of images (for instance, entering the frame, standing on the line, exiting the frame), then you can use one of many shape recognition algorithms. I know of Shape Context, which is good, but I doubt it is subtle enough for this application (it won't tell the difference between a head and most other round objects).
Basically, try to extract key images from the video, and then test them with shape recognition algorithms.
P.S. Finding the key images might be possible with good motion detection methods.

C# WinForms application to display waveforms of playback and recorded sound

I wish to write a C# WinForms application that can play a WAV file. While playing the file, it shows a waveform (similar to an oscilloscope).
At the same time, a user can record sound via the microphone, attempting to follow the original sound being played (like karaoke). The program displays the waveform of the recorded sound in real time, so comparisons can be made between the waveform of the original WAV file and the one recorded by the user. The comparison will be of the difference in time (the delay) between the original and the recorded sound. The waveform displays don't have to be very advanced (there is no need for cut, copy or paste); just being able to see them with a timeline would suffice.
I hope this is clear enough. Please do not hesitate to ask for more clarification if it's not clear. Thank you very much.
You can do what you want with C#, but it isn't going to work like you think. There is effectively no relationship at all between how a recording looks in an oscilloscope-type display and how that recording sounds to a human ear. So, for example, if I showed you two WAV files displayed in an oscilloscope display and told you that one recording was of a tuba playing and the other was of a person speaking a sentence, you would have no idea which was which just from looking at them.
If you want to compare a user's sounds to a pre-recorded WAV, you have to get more sophisticated and do FFT analysis of both and compare the frequency spectra, but even that won't really work for what you're trying to do.
Update: after some thought, I don't think I fully agree with my statements above. What you want might sort of work if the goal is to use the oscilloscope-type display to compare the pitch (or frequency) of the WAV file and the person's voice. If you tuned the oscilloscope to show a relatively small number of wavelengths at a time (maybe 20), the user would be able to quickly see the effect of raising or lowering the pitch of their voice.
I have a small sample C# app that I wrote about 2 years ago that does something kind of like this, only it displays an FFT-produced spectrograph instead of an oscilloscope (the difference is basically that a spectrograph shows frequency-domain information while an oscilloscope shows time-domain information). It's realtime, so you can talk/sing/whatever into a microphone and watch the spectrograph change dynamically.
I can dig this out and post the code here if you like. Or if you want the fun of doing it all yourself, I can post some links to the code resources you'd need.
The NAudio library has plenty of functionality that will (possibly) give you what you need. I've used it in the past for some simple operations, but it is much more powerful than what I've needed so far.
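One way to keep the display under your own control is to pull the samples out with NAudio and reduce them to per-pixel min/max pairs, then draw those however you like. The sketch below assumes NAudio's AudioFileReader; the 800-pixel width is arbitrary and the actual painting code is not shown.

```csharp
// Sketch: read a WAV with NAudio's AudioFileReader and reduce it to per-pixel
// min/max pairs for a custom oscilloscope-style display. Only the sample
// extraction and peak reduction are shown; drawing (e.g. in a Paint handler)
// is up to the application.
using System;
using NAudio.Wave;

static class WaveformPeaks
{
    public static (float Min, float Max)[] ReadPeaks(string path, int pixels = 800)
    {
        using (var reader = new AudioFileReader(path))        // delivers 32-bit float samples
        {
            long totalSamples = reader.Length / 4;            // 4 bytes per float sample
            int samplesPerPixel = (int)Math.Max(totalSamples / pixels, 1);

            var peaks = new (float Min, float Max)[pixels];
            var buffer = new float[samplesPerPixel];

            for (int x = 0; x < pixels; x++)
            {
                int read = reader.Read(buffer, 0, buffer.Length);
                if (read == 0) break;                         // end of file
                float min = 0f, max = 0f;
                for (int i = 0; i < read; i++)
                {
                    if (buffer[i] < min) min = buffer[i];
                    if (buffer[i] > max) max = buffer[i];
                }
                peaks[x] = (min, max);                        // one vertical line per pixel column
            }
            return peaks;
        }
    }
}
```

The recording side can be fed into the same peak reduction from a WaveInEvent DataAvailable handler, so both waveforms can be drawn with the same code.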
#ZombieSheep
NAudio is indeed useful, but it has limitations. For example, there is not much control over the waveform display; it cannot be cleared and redrawn. Also, if the waveform gets too long, it's impossible to scroll back to see the earlier part. One more thing: it only works when playing sound, not when recording it.
Thank you.
