Detect when speech ended in Windows Text-to-Speech for Unity? (C#)

How would one use the Microsoft Speech API from C# in Unity in a way that lets one lip-sync a character, i.e. detect when the speech has ended (or get the result as a raw audio file, or make the audio output monitorable)? For instance, this library lets one use the Microsoft Speech API in Unity, but the speak call has no apparent return object, and polling WindowsVoice.GetStatusMessage() on an interval crashes Unity for me. Thanks!
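For reference, outside that plugin the managed System.Speech API exposes both a completion event and file output directly. A minimal sketch (SpeechSynthesizer lives in System.Speech.dll, desktop .NET on Windows, so whether it is usable from Unity depends on the scripting backend):

    using System;
    using System.Speech.Synthesis;  // System.Speech.dll (desktop .NET / Windows)

    class TtsCompletionDemo
    {
        static void Main()
        {
            using (var synth = new SpeechSynthesizer())
            {
                // Option 1: get notified when the utterance has finished playing.
                synth.SpeakCompleted += (s, e) => Console.WriteLine("Speech ended.");
                synth.SpeakAsync("Hello from the lip-synced character.");
                Console.ReadLine();  // keep the process alive for the async callback

                // Option 2: render to a WAV file instead of the speakers,
                // so the raw audio is available for lip-sync analysis.
                synth.SetOutputToWaveFile("speech.wav");
                synth.Speak("Hello again.");
            }
        }
    }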

Related

Disable general speech commands in Hololens

I am making an application with Unity and MRTK for Hololens 2.
I would like to know if it is possible to disable general speech commands.
That is, I want to keep the voice commands that I have created, but I don't want phrases like "Take a picture" or "Take a video" to be recognized.
I've searched the internet but haven't found anything about it.
Does anyone know if there is an option to do this?
Speech commands in a Unity UWP app are powered by the same engine that supports speech in the Windows 10 system. This means that if you disable the Speech feature under Settings > Privacy > Speech, speech recognition will no longer be available in your Unity app. Apart from this, HoloLens does not currently provide any way to modify system commands. Therefore, it is recommended that you avoid using system commands in your app, or consider using the Unity plug-in for the Cognitive Speech Services SDK or another third-party speech engine.
You can also go to the Settings page, then Camera, find "Choose which apps can access your camera", and uncheck Mixed Reality Camera.

System.Speech.Recognition; background control or voice recognition

I'm not sure if it is possible, but anyway:
I use using System.Speech.Recognition; in a WinForms C# app.
I'm wondering if it is possible not only to recognize speech, but also to recognize the voice itself, i.e. somehow tell different voices apart, so as to get something close to reading separate content from each voice, for example treating two users speaking simultaneously or separately as two different sources.
Or at least maybe some way to control background loudness: the AudioLevelUpdated event lets me see the input volume, but maybe there is also some specific way to separate a loud voice from extra noise or voices in the background.
System.Speech.Recognition will not help you with voice (speaker) recognition.
System.Speech.Recognition is intended for speech-to-text. Adding a grammar to it improves its accuracy, and you can train the Windows desktop for better conversion; see Speech Recognition in the Control Panel.
There are a couple of third-party libraries available for voice recognition.
For removal of noise, you can refer to the Sound Visualizer in C#.
You can find an interesting discussion on the MSDN forum.
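For concreteness, a minimal sketch of that grammar-driven speech-to-text usage, including the AudioLevelUpdated event mentioned in the question (the phrases are illustrative, and note that the audio level is an overall input volume, not per-speaker):

    using System;
    using System.Speech.Recognition;  // speech-to-text only; no speaker identification

    class GrammarDemo
    {
        static void Main()
        {
            using (var engine = new SpeechRecognitionEngine())
            {
                // A small grammar improves accuracy over free dictation.
                engine.LoadGrammar(new Grammar(new GrammarBuilder(new Choices("start", "stop", "help"))));
                engine.SetInputToDefaultAudioDevice();

                engine.SpeechRecognized += (s, e) =>
                    Console.WriteLine("Heard: " + e.Result.Text);

                // Overall input volume (0-100); it cannot tell voices apart.
                engine.AudioLevelUpdated += (s, e) =>
                    Console.WriteLine("Input level: " + e.AudioLevel);

                engine.RecognizeAsync(RecognizeMode.Multiple);
                Console.ReadLine();
            }
        }
    }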
I think you should take a look at CRIS, which is part of Microsoft Cognitive Services, at least for your question about noise.
CRIS is a Custom Speech Service, and its basic use is to improve the quality of speech-to-text using custom acoustic models (e.g. for background noise) and learning vocabulary from samples.
You can import:
Acoustic Datasets
Language Datasets
Pronunciation Datasets
For example, in acoustic models you have:
Microsoft Conversational Model for recognizing speech spoken in a conversational style (i.e. speech directed at another person).
Microsoft Search and Dictation Model for speech directed to an application, such as commands, search queries or dictation.
There is also a Speaker Recognition API, available in preview.

Microsoft Speech Recognition: wildcard blank content

In my speech engine I activate/deactivate multiple grammars.
At one particular step, I'd like to run a grammar whose only purpose is to capture the audio of the next spoken sentence, according to the engine's properties.
But to start/stop matching something, I assume the engine needs "words", so I don't know how to do it.
(Underlying explanation: my application converts all garbage audio to text using the Google Speech API, because dictation is too poor and not available on the Kinect.)
Well, actually, no: the SR engine only needs to know that the incoming audio is "speech-like" (usually determined by the spectral characteristics of the audio). In particular, you could use the AudioPosition property and the SpeechDetected and RecognitionRejected events to send all rejected audio to the Google Speech API.
So your workflow would look like this:
Ask question of user.
Enable appropriate grammars.
Wait for recognition or recognition rejected.
If recognition succeeds, process accordingly.
If recognition is rejected, collect the retained audio and send it to the Google Speech API.
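A minimal sketch of that last step with System.Speech, where SendToGoogleSpeechApi is a hypothetical helper standing in for your Google Speech API call:

    using System;
    using System.IO;
    using System.Speech.Recognition;

    class RejectedAudioCapture
    {
        static void Main()
        {
            using (var engine = new SpeechRecognitionEngine())
            {
                // Grammar for the expected answers to the current question.
                engine.LoadGrammar(new Grammar(new GrammarBuilder(new Choices("yes", "no"))));
                engine.SetInputToDefaultAudioDevice();

                engine.SpeechRecognized += (s, e) =>
                    Console.WriteLine("Recognized: " + e.Result.Text);

                engine.SpeechRecognitionRejected += (s, e) =>
                {
                    if (e.Result.Audio == null) return;  // no audio retained
                    using (var ms = new MemoryStream())
                    {
                        // Write the retained audio of the rejected utterance as WAV.
                        e.Result.Audio.WriteToWaveStream(ms);
                        // SendToGoogleSpeechApi(ms.ToArray());  // hypothetical helper
                    }
                };

                engine.RecognizeAsync(RecognizeMode.Multiple);
                Console.ReadLine();
            }
        }
    }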

Silverlight Speech Recognition (In browser)

Since this topic is a bit outdated, I would like to re-discuss it here.
After searching the web, I came across the following link:
http://archive.msdn.microsoft.com/nesl, which runs only out-of-browser, because Silverlight (in-browser) can't access certain COM libraries that are related to Windows.
I wish (for obvious performance purposes) to perform the speech recognition through Silverlight (on the client machine) and then send the result (text) to the server via a postback to perform the corresponding action.
I already achieved a way to get the voice from the microphone and store it in Silverlight in a byte array. Is there a way to convert the speech byte array to text?
The HTML5 Google service is not an acceptable approach, since IE is required.
My final goal is to implement a speech recognition in ASP.NET Web Application.
Any suggestion is appreciated.
You can't do it in Silverlight; you'll need to send the audio somewhere. You can call some third-party service (I'm sure there are plenty, and it shouldn't matter that you're using IE) or your own ASP.NET application (which can call System.Speech or any other free or commercial system). But before you do that, you should compress the audio. There aren't a lot of options in Silverlight; I recommend NSpeex, or at least convert the audio to 16 kHz PCM (either linear or a-law).
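On the server side, the System.Speech route could look roughly like this; a minimal sketch, assuming the uploaded bytes arrive as a WAV-formatted buffer:

    using System.IO;
    using System.Speech.Recognition;

    static class ServerSideRecognizer
    {
        // Recognize speech from a WAV byte array posted by the Silverlight client.
        public static string RecognizeWav(byte[] wavBytes)
        {
            using (var engine = new SpeechRecognitionEngine())
            using (var stream = new MemoryStream(wavBytes))
            {
                engine.LoadGrammar(new DictationGrammar());  // free-form dictation
                engine.SetInputToWaveStream(stream);
                RecognitionResult result = engine.Recognize();  // null if nothing recognized
                return result != null ? result.Text : null;
            }
        }
    }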
Here's a list of Speech SDKs (many of which have a cloud service component): http://www.toolsjournal.com/mobile-articles/item/918-top-10-sdks-to-voice-enable-mobile-apps-quickly
To make Trusted In-browser Silverlight application:
http://msdn.microsoft.com/en-us/library/gg192793(v=vs.95).aspx
http://www.pitorque.de/MisterGoodcat/post/Silverlight-5-Tidbits-Trusted-applications.aspx
And for security background:
http://msdn.microsoft.com/en-us/library/ee721083%28v=vs.95%29.aspx
Note that NESL doesn't support DictationGrammar; grammars need to be pre-defined:
http://archive.msdn.microsoft.com/nesl/Thread/View.aspx?ThreadId=4905

Speech Recognition in C# without using Windows' speech recognition

I know the first comment will be that I'm duplicating previous threads, but the code I found (from MSDN) uses Windows' speech recognition. I'm doing my graduation project, and speech recognition is part of it. I can't include that code; I have to try and do it from scratch. I've done some research about it, and I would be really thankful if someone has already done this and can give me a link to a paper or code I can benefit from!
Thanks in advance!
Microsoft Server Speech Platform 10.1 (SR and TTS in 26 languages)
http://www.microsoft.com/download/en/details.aspx?displaylang=en&id=24003
The basic operations that speech recognition applications perform:
Starting the speech recognizer.
Creating a recognition grammar.
Loading the grammar into a speech recognizer.
Registering for speech recognition event notification.
Creating a handler for the speech recognition event.
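As a rough sketch, those steps map onto the platform's Microsoft.Speech.Recognition namespace like this (the culture and grammar phrases are illustrative):

    using System;
    using System.Globalization;
    using Microsoft.Speech.Recognition;  // Server Speech Platform, not System.Speech

    class Program
    {
        static void Main()
        {
            // 1. Start the speech recognizer for an installed language pack.
            using (var recognizer = new SpeechRecognitionEngine(new CultureInfo("en-US")))
            {
                // 2. Create a recognition grammar.
                var grammar = new Grammar(new GrammarBuilder(new Choices("red", "green", "blue")));

                // 3. Load the grammar into the speech recognizer.
                recognizer.LoadGrammar(grammar);

                // 4. Register for speech recognition event notification.
                recognizer.SpeechRecognized += OnSpeechRecognized;

                recognizer.SetInputToDefaultAudioDevice();
                recognizer.RecognizeAsync(RecognizeMode.Multiple);
                Console.ReadLine();
            }
        }

        // 5. Handler for the speech recognition event.
        static void OnSpeechRecognized(object sender, SpeechRecognizedEventArgs e)
        {
            Console.WriteLine("Recognized: " + e.Result.Text);
        }
    }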
Language Packs
http://www.microsoft.com/download/en/details.aspx?displaylang=en&id=3971
Runtime Download
http://www.microsoft.com/download/en/details.aspx?displaylang=en&id=24974
These libraries could at least give you a starting point for understanding the makeup of the interfaces, and some core/base code to copy or emulate ;)
There's also this paper:
http://www.cs.nyu.edu/~mohri/pub/hbka.pdf
Best of Luck!
Writing the code that implements the basic recognition algorithm (Hidden Markov Model-based recognizers are the norm these days) is only part of your challenge. Virtually every speech recognition system is trained on actual speech data, so you also have to identify a corpus (a collection of audio files and transcriptions) to train your mathematical models.
Have a look at the open source Sphinx speech recognizer (and related tools) from CMU if you are still interested in doing it all by yourself.
