I'm trying to change the pitch of spoken text via SSML and the .NET SpeechSynthesizer (System.Speech.Synthesis)
SpeechSynthesizer synthesizer = new SpeechSynthesizer();
PromptBuilder builder = new PromptBuilder();
builder.AppendSsml(@"C:\Users\me\Documents\ssml1.xml");
synthesizer.Speak(builder);
The content of the ssml1.xml file is:
<?xml version="1.0" encoding="ISO-8859-1"?>
<ssml:speak version="1.0"
xmlns:ssml="http://www.w3.org/2001/10/synthesis"
xml:lang="en-US">
<ssml:sentence>
Your order for <ssml:prosody pitch="+30%" rate="-90%" >8 books</ssml:prosody>
will be shipped tomorrow.
</ssml:sentence>
</ssml:speak>
The rate is recognized: "8 books" is spoken much slower than the rest, but no matter what value is set for "pitch", it makes no difference! Allowed values can be found here:
http://www.w3.org/TR/speech-synthesis/#S3.2.4
Am I missing something, or is changing the pitch just not supported by the Microsoft Speech engine?
fritz
While the SsmlParser used by System.Speech accepts a pitch attribute in its ProcessProsody method, it does not actually process it.
It only processes the range, rate, volume and duration attributes. It also parses contour, but that is processed as range (not sure why)...
Edit: if you don't really need to read the text from an SSML XML file, you can create the prompt programmatically.
Instead of
builder.AppendSsml(@"C:\Users\me\Documents\ssml1.xml");
use
builder.Culture = CultureInfo.CreateSpecificCulture("en-US");
builder.StartVoice(builder.Culture);
builder.StartSentence();
builder.AppendText("Your order for ");
builder.StartStyle(new PromptStyle() { Emphasis = PromptEmphasis.Strong, Rate = PromptRate.ExtraSlow });
builder.AppendText("8 books");
builder.EndStyle();
builder.AppendText(" will be shipped tomorrow.");
builder.EndSentence();
builder.EndVoice();
Related
I use SpeechSynthesizer.SpeakSsml(String).
However, I have not been able to build SSML containing pitch.
So far I could not find any working example online.
MS's Speech Synthesizer isn't built to do singing synthesis, but you can change the pitch characteristics using the <prosody> element:
synthesizer.SpeakSsml("<speak version=\"1.0\" xmlns=\"http://www.w3.org/2001/10/synthesis\" xml:lang=\"en\"><prosody pitch=\"x-low\">Hello World</prosody>.<prosody pitch=\"x-high\">Hello World</prosody></speak>");
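Here synthesizer is a SpeechSynthesizer instance; note that System.Speech expects the SSML namespace on the speak element. A minimal self-contained sketch of the same call (assuming the default installed voice; how audible the pitch change is depends on the engine):
using System.Speech.Synthesis;

class PitchDemo
{
    static void Main()
    {
        using (var synthesizer = new SpeechSynthesizer())
        {
            synthesizer.SetOutputToDefaultAudioDevice();
            string ssml =
                "<speak version=\"1.0\" xmlns=\"http://www.w3.org/2001/10/synthesis\" xml:lang=\"en-US\">" +
                "<prosody pitch=\"x-low\">Hello World.</prosody> " +
                "<prosody pitch=\"x-high\">Hello World.</prosody>" +
                "</speak>";
            // The prosody element carries the pitch hints; the engine decides how to honour them.
            synthesizer.SpeakSsml(ssml);
        }
    }
}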
I want to choose a specific sample rate for my audio card programmatically in C# with NAudio.
My output is a WasapiOut in exclusive mode.
I already tried a lot of things, but nothing worked. I've searched everywhere and only found this: How to Change Speaker Configuration in Windows in C#?
But they didn't really find a proper solution there.
Here's my WasapiOut:
var enumerator = new MMDeviceEnumerator();
MMDevice device = enumerator.EnumerateAudioEndPoints(DataFlow.Render, DeviceState.Active).FirstOrDefault(d => d.DeviceFriendlyName == name);
outputDevice = new WasapiOut(device, AudioClientShareMode.Exclusive, false, 200);
What I don't understand is that here:
https://github.com/naudio/NAudio/blob/master/Docs/WasapiOut.md
it says:
"If you choose AudioClientShareMode.Exclusive then you are requesting exclusive access to the sound card. The benefits of this approach are that you can specify the exact sample rate you want"
And I couldn't find anywhere how to specify the sample rate.
If someone here knows the answer it would be great, thanks!
Edit:
I think I found a way by doing this:
var waveFormat5 = WaveFormat.CreateIeeeFloatWaveFormat(Int32.Parse(comboBox1.Text), 2);
var test2 = new MixingSampleProvider(waveFormat5);
var audioFile = new AudioFileReader("test.wav");
var input = audioFile;
test2.ReadFully = true;
test2.AddMixerInput(new AutoDisposeFileReader(input, waveFormat5));
outputDevice.Init(test2);
With "outputDevice" as my WasapiOut.
So I set the outputDevice sample rate to the one I chose via the MixingSampleProvider and then send an audio file to that mixer. Is that the right way to do it?
Because my audio file's sample rate is 44100, and I chose to set my outputDevice sample rate to 44100 as well, but when I call outputDevice.Play(), the sound I hear is faster than the original.
Once you've created an instance of WasapiOut you call Init, passing the audio you want to play. It will try to use the sample rate (and WaveFormat) of that audio directly, assuming the soundcard supports it.
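So one way to open the device at a specific rate is to convert the audio to that rate yourself before calling Init. A rough sketch of that idea (not from the original answer): it uses NAudio's WdlResamplingSampleProvider, the same device lookup as in the question, and converts to 16-bit PCM because exclusive-mode hardware usually wants a plain PCM format.
using System.Linq;
using NAudio.CoreAudioApi;
using NAudio.Wave;
using NAudio.Wave.SampleProviders;

int desiredSampleRate = 48000; // the rate you want the exclusive-mode stream opened at

var enumerator = new MMDeviceEnumerator();
var device = enumerator.EnumerateAudioEndPoints(DataFlow.Render, DeviceState.Active)
    .FirstOrDefault(d => d.DeviceFriendlyName == name); // name as in the question

var audioFile = new AudioFileReader("test.wav"); // e.g. a 44.1 kHz source
var resampled = new WdlResamplingSampleProvider(audioFile, desiredSampleRate);

var outputDevice = new WasapiOut(device, AudioClientShareMode.Exclusive, false, 200);
// Exclusive mode usually needs plain PCM, so convert the float samples to 16-bit.
outputDevice.Init(new SampleToWaveProvider16(resampled));
outputDevice.Play();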
I solved my problem: I used an AudioPlaybackEngine (https://markheath.net/post/fire-and-forget-audio-playback-with) with a MixingSampleProvider, and a try/catch to handle the error message about the inputs not being at the same sample rate.
In a text to speech application by C# I use SpeechSynthesizer class, it has an event named SpeakProgress which is fired for every spoken word. But for some voices the parameter e.AudioPosition is not synchronized with the output audio stream, and the output wave file is played faster than what this position shows (see this related question).
Anyway, I am trying to find the exact bit rate and other format information for the selected voice. In my experience, if I initialize the wave file with this information, the synchronization problem is resolved. However, if I can't find such information in SupportedAudioFormats, I know of no other way to find it. For example, the "Microsoft David Desktop" voice provides no supported format in its VoiceInfo, but it seems to support a PCM 16000 Hz, 16 bit format.
How can I find the audio format of the selected voice of the SpeechSynthesizer?
var formats = CurVoice.VoiceInfo.SupportedAudioFormats;
if (formats.Count > 0)
{
var format = formats[0];
reader.SetOutputToWaveFile(CurAudioFile, format);
}
else
{
var format = // How can I find it, if the audio hasn't provided it?
reader.SetOutputToWaveFile(CurAudioFile, format );
}
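For illustration, the only thing I can think of for the else branch is to hard-code the format I suspect the voice uses (the 16 kHz, 16 bit, mono mentioned above), but that is just a guess, not something the voice reports:
using System.Speech.AudioFormat;

// Assumed fallback only: nothing in VoiceInfo actually confirms this format.
var fallback = new SpeechAudioFormatInfo(16000, AudioBitsPerSample.Sixteen, AudioChannel.Mono);
reader.SetOutputToWaveFile(CurAudioFile, fallback);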
Update: This answer has been edited following investigation. Initially I was suggesting from memory that SupportedAudioFormats is likely just from (possibly misconfigured) registry data; investigation has shown that for me, on Windows 7, this is definitely the case, and it is backed up anecdotally on Windows 8.
Issues with SupportedAudioFormats
System.Speech wraps the venerable COM speech API (SAPI), and some voices are 32 vs 64 bit, or can be misconfigured (on a 64 bit machine's registry, HKLM/Software/Microsoft/Speech/Voices vs HKLM/Software/Wow6432Node/Microsoft/Speech/Voices).
I've pointed ILSpy at System.Speech and its VoiceInfo class, and I'm pretty convinced that SupportedAudioFormats comes solely from registry data, hence it's possible to get zero results back when enumerating SupportedAudioFormats if either your TTS engine isn't properly registered for your application's Platform target (x86, Any or 64 bit), or if the vendor simply doesn't provide this information in the registry.
Voices may still support different, additional or fewer formats, as that's up to the speech engine (code) rather than the registry (data). So it can be a shot in the dark. Standard Windows voices are often more consistent in this regard than third-party voices, but they still don't necessarily usefully provide SupportedAudioFormats.
Finding this Information the Hard Way
I've found it's still possible to get the current format of the current voice - but this does rely on reflection to access the internals of the System.Speech SAPI wrappers.
Consequently this is quite fragile code! And I wouldn't recommend using it in production.
Note: the code below requires you to have called Speak() once for setup; more work would be needed to force the setup without a Speak() call. However, calling Speak("") to say nothing works just fine.
Implementation:
using System.Diagnostics;
using System.Reflection;
using System.Runtime.InteropServices;
using System.Speech.Synthesis;

[StructLayout(LayoutKind.Sequential)]
struct WAVEFORMATEX
{
public ushort wFormatTag;
public ushort nChannels;
public uint nSamplesPerSec;
public uint nAvgBytesPerSec;
public ushort nBlockAlign;
public ushort wBitsPerSample;
public ushort cbSize;
}
WAVEFORMATEX GetCurrentWaveFormat(SpeechSynthesizer synthesizer)
{
// Grab the internal VoiceSynthesizer instance that SpeechSynthesizer wraps.
var voiceSynthesis = synthesizer.GetType()
.GetProperty("VoiceSynthesizer", BindingFlags.Instance | BindingFlags.NonPublic)
.GetValue(synthesizer, null);
// Ask it for the currently selected TTS voice.
var ttsVoice = voiceSynthesis.GetType()
.GetMethod("CurrentVoice", BindingFlags.Instance | BindingFlags.NonPublic)
.Invoke(voiceSynthesis, new object[] { false });
// The voice caches its wave format as a raw WAVEFORMATEX byte array.
var waveFormat = (byte[])ttsVoice.GetType()
.GetField("_waveFormat", BindingFlags.Instance | BindingFlags.NonPublic)
.GetValue(ttsVoice);
// Pin the byte array and marshal it into the managed struct.
var pin = GCHandle.Alloc(waveFormat, GCHandleType.Pinned);
var format = (WAVEFORMATEX)Marshal.PtrToStructure(pin.AddrOfPinnedObject(), typeof(WAVEFORMATEX));
pin.Free();
return format;
}
Usage:
SpeechSynthesizer s = new SpeechSynthesizer();
s.Speak("Hello");
var format = GetCurrentWaveFormat(s);
Debug.WriteLine($"{s.Voice.SupportedAudioFormats.Count} formats are claimed as supported.");
Debug.WriteLine($"Actual format: {format.nChannels} channel {format.nSamplesPerSec} Hz {format.wBitsPerSample} audio");
To test it, I renamed Microsoft Anna's AudioFormats registry key under HKLM/Software/Wow6432Node/Microsoft/Speech/Voices/Tokens/MS-Anna-1033-20-Dsk/Attributes, causing SpeechSynthesizer.Voice.SupportedAudioFormats to have no elements when queried. The below is the output in this situation:
0 formats are claimed as supported.
Actual format: 1 channel 16000 Hz 16-bit audio
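If you then want to feed the discovered format back into SetOutputToWaveFile (as in the question), you can map it onto a SpeechAudioFormatInfo yourself. A rough sketch, reusing format and s from the usage snippet above and assuming a plain PCM format (the output path is just an example):
using System.Speech.AudioFormat;

// Map the reflected WAVEFORMATEX onto the managed format type (PCM assumed).
var formatInfo = new SpeechAudioFormatInfo(
    (int)format.nSamplesPerSec,
    format.wBitsPerSample == 8 ? AudioBitsPerSample.Eight : AudioBitsPerSample.Sixteen,
    format.nChannels == 1 ? AudioChannel.Mono : AudioChannel.Stereo);
s.SetOutputToWaveFile(@"C:\temp\out.wav", formatInfo); // example path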
You can't get this information from code. You can only listen to all formats (from a poor format like 8 kHz up to a high-quality format like 48 kHz) and observe where it stops getting better, which is what you did, I think.
Internally, the speech engine "asks" the voice for the original audio format only once, and I believe that this value is used only internally by the speech engine, and the speech engine does not expose this value in any way.
For further information:
Let's say you are a voice company.
You have recorded your computer voice at 16 kHz, 16 bit, mono.
The user can let your voice speak at 48 kHz, 32 bit, Stereo.
The speech engine does this conversion. It does not care whether it really sounds better; it simply does the format conversion.
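On the System.Speech side, that requested format is simply whatever you pass to SetOutputToWaveFile; the engine converts the voice's native format to it on the fly. A small sketch (the output path is just an example):
using System.Speech.AudioFormat;
using System.Speech.Synthesis;

// Ask for 48 kHz, 16-bit stereo; the engine converts from the voice's recorded format.
var synthesizer = new SpeechSynthesizer();
var requested = new SpeechAudioFormatInfo(48000, AudioBitsPerSample.Sixteen, AudioChannel.Stereo);
synthesizer.SetOutputToWaveFile(@"C:\temp\speech.wav", requested);
synthesizer.Speak("This conversion is done by the engine, not by the voice.");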
Let's say the user wants to let your voice speak something.
He requests that the file be saved as 48 kHz, 16 bit, stereo.
SAPI / System.Speech calls your voice with this method:
STDMETHODIMP SpeechEngine::GetOutputFormat(const GUID * pTargetFormatId, const WAVEFORMATEX * pTargetWaveFormatEx,
GUID * pDesiredFormatId, WAVEFORMATEX ** ppCoMemDesiredWaveFormatEx)
{
HRESULT hr = S_OK;
//Here we need to return which format our audio data will be that we pass to the speech engine.
//Our format (16 kHz, 16 bit, mono) will be converted to the format that the user requested. This will be done by the SAPI engine.
enum SPSTREAMFORMAT sample_rate_at_which_this_voice_was_recorded = SPSF_16kHz16BitMono; //Here you tell the speech engine which format the data you will pass back is in. This way the engine knows whether it should upsample or downsample your voice data to match the format that the user requested.
hr = SpConvertStreamFormatEnum(sample_rate_at_which_this_voice_was_recorded, pDesiredFormatId, ppCoMemDesiredWaveFormatEx);
return hr;
}
This is the only place where you have to "reveal" what the recorded format of your voice is.
All the "Available formats" rather tell you which conversions your sound card / Windows can do.
I hope I explained it well.
As a voice vendor, you don't support any formats. You just tell the speech engine what format your audio data is in so that it can do the further conversions.
I'm from Greece and I want to make an application which will use SAPI to interact with the user, but I can't find a way to change the language of SAPI from English to Greek.
My OS is by default Greek & English, I have the SAPI SDK installed, and the Greek language is supported by SAPI.
The problem is that SAPI doesn't automatically recognise the language passed to it, and reverts to saying the individual letters one-by-one.
Here is the code I'm using, with English text:
using SpeechLib;
SpVoice voice = new SpVoice();
voice.Speak("Pdf File Successfully Installed", SpeechVoiceSpeakFlags.SVSFlagsAsync);
voice.WaitUntilDone(30000);
This works, but when I pass Greek text to the function (e.g. "Να ενα κειμενο", "here is a text"), I see the problem occur.
You can set a language by passing SSML to the Speak API, and including the xml:lang attribute.
For example this should work:
SpVoice voice = new SpVoice();
voice.Speak(
"<speak version='1.0' xmlns='http://www.w3.org/2001/10/synthesis' xml:lang='el-GR'>"
+ "Να ενα κειμενο"
+ "</speak>",
SpeechVoiceSpeakFlags.SVSFlagsAsync | SpeechVoiceSpeakFlags.SVSFIsXML);
voice.WaitUntilDone(30000);
You can also switch language mid-speech. The documentation has this example:
<speak version="1.0"
xmlns="http://www.w3.org/2001/10/synthesis"
xml:lang="en-US">
For English, press 1.
<voice xml:lang="fr-FR" gender="female">
Pour le français, appuyez sur 2 </voice>
</speak>
For more, see here:
https://msdn.microsoft.com/en-us/library/jj127898.aspx
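Putting the two together, the multi-language example from the documentation can be passed to SpVoice the same way as the snippet above (a sketch reusing the documentation's French example; substitute xml:lang='el-GR' and Greek text as needed):
SpVoice voice = new SpVoice();
voice.Speak(
    "<speak version='1.0' xmlns='http://www.w3.org/2001/10/synthesis' xml:lang='en-US'>"
    + "For English, press 1. "
    + "<voice xml:lang='fr-FR' gender='female'>Pour le français, appuyez sur 2</voice>"
    + "</speak>",
    SpeechVoiceSpeakFlags.SVSFlagsAsync | SpeechVoiceSpeakFlags.SVSFIsXML);
voice.WaitUntilDone(30000);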
I am currently testing the SpeechRecognitionEngine by loading a pretty simple rule from an XML file. In fact it is a simple choice between ("decrypt the email", "remove encryption") and ("encrypt the email", "add encryption").
I have trained my Windows 7 PC and additionally added the words encrypt and decrypt, as I realize they are very similar. The recognizer already has a problem telling these two apart.
The issue I am having is that it recognizes things too often. I have set the confidence threshold to 0.93 because, with my voice in a quiet room saying the exact words, it sometimes only gets to 0.93. But then if I turn on the radio, the voice of the announcer or a song can make this recognizer think it has heard the words "decrypt the email" with over 0.93 confidence.
Maybe Lady Gaga is backmasking Applause to secretly decrypt emails :-)
Can anyone help me work out how to make this recognizer workable?
In fact the recognizer is also picking up keyboard noise as "decrypt the email". I don't understand how this is possible.
Further to my editing buddy: there are at least two managed namespaces for MS Speech, Microsoft.Speech and System.Speech - it is important for this question to know that this one is about System.Speech.
If the only thing the System.Speech recognizer is listening for is "encrypt the email", then the recognizer will generate lots of false positives. (Particularly in a noisy environment.) If you add a DictationGrammar (particularly a pronunciation grammar) in parallel, the DictationGrammar will pick up the noise, and you can check the (e.g.) name of the grammar in the event handler to discard the bogus recognitions.
A (subset) example:
using System.Globalization;
using System.Speech.Recognition;

static void Main(string[] args)
{
Choices gb = new Choices();
gb.Add("encrypt the document");
gb.Add("decrypt the document");
Grammar commands = new Grammar(gb);
commands.Name = "commands";
DictationGrammar dg = new DictationGrammar("grammar:dictation#pronunciation");
dg.Name = "Random";
using (SpeechRecognitionEngine recoEngine = new SpeechRecognitionEngine(new CultureInfo("en-US")))
{
recoEngine.SetInputToDefaultAudioDevice();
recoEngine.LoadGrammar(commands);
recoEngine.LoadGrammar(dg);
recoEngine.RecognizeCompleted += recoEngine_RecognizeCompleted;
recoEngine.RecognizeAsync();
System.Console.ReadKey(true);
recoEngine.RecognizeAsyncStop();
}
}
static void recoEngine_RecognizeCompleted(object sender, RecognizeCompletedEventArgs e)
{
if (e.Result.Grammar.Name != "Random")
{
System.Console.WriteLine(e.Result.Text);
}
}
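If you want continuous recognition rather than a single result, the same filtering works with the SpeechRecognized event and RecognizeAsync(RecognizeMode.Multiple) in place of the single RecognizeAsync() call above, and you can also apply the confidence threshold from the question there. A sketch (0.93 is just the value mentioned in the question):
// Continuous recognition: discard dictation-grammar hits and low-confidence results.
recoEngine.SpeechRecognized += (s, e) =>
{
    if (e.Result.Grammar.Name != "Random" && e.Result.Confidence >= 0.93f)
    {
        System.Console.WriteLine(e.Result.Text);
    }
};
recoEngine.RecognizeAsync(RecognizeMode.Multiple);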