Any libraries to convert number Pinyin to Pinyin with tone markings? - c#

Just wondering if anyone knows of a class library that can convert Chinese Pinyin to ones with tones, such as nin2 hao3 ma to nín hǎo ma. It would be similar to this answer, but hopefully using the .NET framework.

Here is my port of @Greg-Hewgill's Python algorithm to C#. I haven't run into any issues so far.
public static string ConvertNumericalPinYinToAccented(string input)
{
Dictionary<int, string> PinyinToneMark = new Dictionary<int, string>
{
{0, "aoeiuv\u00fc"},
{1, "\u0101\u014d\u0113\u012b\u016b\u01d6\u01d6"},
{2, "\u00e1\u00f3\u00e9\u00ed\u00fa\u01d8\u01d8"},
{3, "\u01ce\u01d2\u011b\u01d0\u01d4\u01da\u01da"},
{4, "\u00e0\u00f2\u00e8\u00ec\u00f9\u01dc\u01dc"}
};
string[] words = input.Split(' ');
string accented = "";
string t = "";
foreach (string pinyin in words)
{
foreach (char c in pinyin)
{
if (c >= 'a' && c <= 'z')
{
t += c;
}
else if (c == ':')
{
if (t[t.Length - 1] == 'u')
{
t = t.Substring(0, t.Length - 2) + "\u00fc";
}
}
else
{
if (c >= '0' && c <= '5')
{
int tone = (int)Char.GetNumericValue(c) % 5;
if (tone != 0)
{
Match match = Regex.Match(t, "[aoeiuv\u00fc]+");
if (!match.Success)
{
t += c;
}
else if (match.Groups[0].Length == 1)
{
t = t.Substring(0, match.Groups[0].Index) +
PinyinToneMark[tone][PinyinToneMark[0].IndexOf(match.Groups[0].Value[0])]
+ t.Substring(match.Groups[0].Index + match.Groups[0].Length);
}
else
{
if (t.Contains("a"))
{
t = t.Replace("a", PinyinToneMark[tone][0].ToString());
}
else if (t.Contains("o"))
{
t = t.Replace("o", PinyinToneMark[tone][1].ToString());
}
else if (t.Contains("e"))
{
t = t.Replace("e", PinyinToneMark[tone][2].ToString());
}
else if (t.Contains("ui"))
{
t = t.Replace("i", PinyinToneMark[tone][3].ToString());
}
else if (t.Contains("iu"))
{
t = t.Replace("u", PinyinToneMark[tone][4].ToString());
}
else
{
t += "!";
}
}
}
}
accented += t;
t = "";
}
}
accented += t + " ";
}
accented = accented.TrimEnd();
return accented;
}

I've used the Microsoft Visual Studio International Pack: version 1.0 and Feature Pack 2.0.
Hope this helps!

I think this line
t = t.Substring(0, t.Length - 2) + "\u00fc";
Should be this instead
t = t.Substring(0, t.Length - 1) + "\u00fc";
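To see why the offset matters, here is a minimal sketch that isolates the "u:" handling. The class and method names are hypothetical, not part of the ported code. It assumes, as in the loop above, that the ':' character is never appended to t, so only the trailing 'u' has to be dropped.

```csharp
using System;

public static class UmlautFix
{
    // Hypothetical helper: t holds the letters read so far,
    // and the ':' that triggered this call is not part of t.
    public static string ApplyColon(string t)
    {
        // Guard against an empty buffer before looking at the last character.
        if (t.Length > 0 && t[t.Length - 1] == 'u')
        {
            // Remove exactly one character (the 'u') and append U+00FC.
            return t.Substring(0, t.Length - 1) + "\u00fc";
        }
        return t;
    }
}
```

With this version, ApplyColon("nu") returns "n\u00fc". With Length - 2 the leading "n" would be cut off as well.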

Related

Make every other a-z letter Upper / Lower case, ignoring whitespace

Can somebody tell me what I am doing wrong, please? I can't seem to get the expected output, i.e. alternate upper/lower case on a-z characters only, ignoring whitespace regardless of how many whitespace characters there are.
My code:
var sentence = "dancing sentence";
var charSentence = sentence.ToCharArray();
var rs = "";
for (var i = 0; i < charSentence.Length; i++)
{
if (charSentence[i] != ' ')
{
if (i % 2 == 0 && charSentence[i] != ' ')
{
rs += charSentence[i].ToString().ToUpper();
}
else if (i % 2 == 1 && charSentence[i] != ' ')
{
rs += sentence[i].ToString().ToLower();
}
}
else
{
rs += " ";
}
}
Console.WriteLine(rs);
Expected output: DaNcInG sEnTeNcE
Actual output: DaNcInG SeNtEnCe
I use a flag instead of i because, as you mentioned, the whitespace throws off an algorithm based on the index:
var sentence = "dancing sentence";
var charSentence = sentence.ToCharArray();
var rs = "";
var flag = true;
for (var i = 0; i < charSentence.Length; i++)
{
if (charSentence[i] != ' ')
{
if (flag)
{
rs += charSentence[i].ToString().ToUpper();
}
else
{
rs += sentence[i].ToString().ToLower();
}
flag = !flag;
}
else
{
rs += " ";
}
}
Console.WriteLine(rs);
Try a simple finite state automaton with just two states (upper == true/false); another suggestion is to use StringBuilder:
private static string ToDancing(string value) {
if (string.IsNullOrEmpty(value))
return value;
bool upper = false;
StringBuilder sb = new StringBuilder(value.Length);
foreach (var c in value)
if (char.IsLetter(c))
sb.Append((upper = !upper) ? char.ToUpper(c) : char.ToLower(c));
else
sb.Append(c);
return sb.ToString();
}
Test
var sentence = "dancing sentence";
Console.Write(ToDancing(sentence));
Outcome
DaNcInG sEnTeNcE
I think you should declare one more variable called isUpper. Now you have two variables: i is the index of the character being processed, and isUpper indicates whether the next letter should be uppercase.
You increment i as usual, but set isUpper to true at first:
// before the loop
bool isUpper = true;
Then, rather than checking whether i is divisible by 2, check isUpper:
if (isUpper)
{
rs += charSentence[i].ToString().ToUpper();
}
else
{
rs += sentence[i].ToString().ToLower();
}
Immediately after the above if statement, "flip" isUpper:
isUpper = !isUpper;
Linq version
var sentence = "dancing sentence";
int i = 0;
string result = string.Concat(sentence.Select(x => { i += x == ' ' ? 0 : 1; return i % 2 != 0 ? char.ToUpper(x) : char.ToLower(x); }));
Sidenote:
please replace charSentence[i].ToString().ToUpper() with char.ToUpper(charSentence[i])
Thanks @Dmitry Bychenko, that is the best approach. But I wondered what a solution closer to the OP's mindset (they might be a beginner) would look like, so here is another one.
The code is lengthy and I don't like it much myself, but I'm presenting it anyway.
class Program
{
static void Main(string[] args)
{
var sentence = "dancing sentence large also";
string newString = string.Empty;
StringBuilder newStringdata = new StringBuilder();
string[] arr = sentence.Split(' ');
for (int i=0; i< arr.Length;i++)
{
if (i==0)
{
newString = ReturnEvenModifiedString(arr[i]);
newStringdata.Append(newString);
}
else
{
if(char.IsUpper(newString[newString.Length - 1]))
{
newString = ReturnOddModifiedString(arr[i]);
newStringdata.Append(" ");
newStringdata.Append(newString);
}
else
{
newString = ReturnEvenModifiedString(arr[i]);
newStringdata.Append(" ");
newStringdata.Append(newString);
}
}
}
Console.WriteLine(newStringdata.ToString());
Console.Read();
}
//For Even Test
private static string ReturnEvenModifiedString(string initialString)
{
string newString = string.Empty;
var temparr = initialString.ToCharArray();
for (var i = 0; i < temparr.Length; i++)
{
if (temparr[i] != ' ')
{
if (i % 2 == 0 && temparr[i] != ' ')
{
newString += temparr[i].ToString().ToUpper();
}
else
{
newString += temparr[i].ToString().ToLower();
}
}
}
return newString;
}
//For Odd Test
private static string ReturnOddModifiedString(string initialString)
{
string newString = string.Empty;
var temparr = initialString.ToCharArray();
for (var i = 0; i < temparr.Length; i++)
{
if (temparr[i] != ' ')
{
if (i % 2 != 0 && temparr[i] != ' ')
{
newString += temparr[i].ToString().ToUpper();
}
else
{
newString += temparr[i].ToString().ToLower();
}
}
}
return newString;
}
}

Split string in square brackets from Google translator

I am receiving data from a Google Language Translator service and need help splitting the data.
void Start()
{
translateText("Hello, This is a test!", "en", "fr");
}
void translateText(string text, string fromLanguage, string toLanguage)
{
string url = "https://translate.googleapis.com/translate_a/single?client=gtx&sl=" + fromLanguage + "&tl=" + toLanguage + "&dt=t&q=" + Uri.EscapeUriString(text);
StartCoroutine(startTranslator(url));
}
IEnumerator startTranslator(string url)
{
UnityWebRequest www = UnityWebRequest.Get(url);
yield return www.Send();
Debug.Log("Raw string Received: " + www.downloadHandler.text);
LanguageResult tempResult = decodeResult(www.downloadHandler.text);
Debug.Log("Original Text: " + tempResult.originalText);
Debug.Log("Translated Text: " + tempResult.translatedText);
Debug.Log("LanguageIso: " + tempResult.languageIso);
yield return null;
}
LanguageResult decodeResult(string result)
{
char[] delims = { '[', '\"', ']', ',' };
string[] arr = result.Split(delims, StringSplitOptions.RemoveEmptyEntries);
LanguageResult tempLang = null;
if (arr.Length >= 4)
{
tempLang = new LanguageResult();
tempLang.translatedText = arr[0];
tempLang.originalText = arr[1];
tempLang.unknowValue = arr[2];
tempLang.languageIso = arr[3];
}
return tempLang;
}
public class LanguageResult
{
public string translatedText;
public string originalText;
public string unknowValue;
public string languageIso;
}
I call it with translateText("Hello, This is a test!", "en", "fr"); from the Start() function, which translates the English sentence to French using ISO 639-1 language codes.
The received data looks like this:
[[["Bonjour, Ceci est un test!","Hello, This is a test!",,,0]],,"en"]
I want to split it like this:
Bonjour, Ceci est un test!
Hello, This is a test!
0
en
and put them into a string array in order.
I currently use this:
char[] delims = { '[', '\"', ']', ',' };
string[] arr = result.Split(delims, StringSplitOptions.RemoveEmptyEntries);
This works if there is no comma in the received string. If there is a comma, the split values are messed up. What's the best way of splitting this?
EDIT:
With Blorgbeard's solution, the final working code is as below. Hopefully, this will help somebody else. It shouldn't be used for commercial purposes, only for personal or school projects.
void Start()
{
//translateText("Hello, This is \" / \\ a test !", "en", "fr");
//translateText("Hello, This is , \\ \" a test !", "en", "fr");
translateText("Hello, This is a test!", "en", "fr");
}
void translateText(string text, string fromLanguage, string toLanguage)
{
string url = "https://translate.googleapis.com/translate_a/single?client=gtx&sl=" + fromLanguage + "&tl=" + toLanguage + "&dt=t&q=" + Uri.EscapeUriString(text);
StartCoroutine(startTranslator(url));
}
IEnumerator startTranslator(string url)
{
UnityWebRequest www = UnityWebRequest.Get(url);
yield return www.Send();
Debug.Log("Raw string Received: " + www.downloadHandler.text);
LanguageResult tempResult = decodeResult(www.downloadHandler.text);
displayResult(tempResult);
yield return null;
}
void displayResult(LanguageResult translationResult)
{
Debug.Log("Original Text: " + translationResult.originalText);
Debug.Log("Translated Text: " + translationResult.translatedText);
Debug.Log("LanguageIso: " + translationResult.languageIso);
}
LanguageResult decodeResult(string result)
{
string[] arr = Decode(result);
LanguageResult tempLang = null;
if (arr.Length >= 4)
{
tempLang = new LanguageResult();
tempLang.translatedText = arr[0];
tempLang.originalText = arr[1];
tempLang.unknowValue = arr[2];
tempLang.languageIso = arr[3];
}
return tempLang;
}
public class LanguageResult
{
public string translatedText;
public string originalText;
public string unknowValue;
public string languageIso;
}
private string[] Decode(string input)
{
List<string> finalResult = new List<string>();
bool inToken = false;
bool inString = false;
bool escaped = false;
var seps = ",[]\"".ToArray();
var current = "";
foreach (var chr in input)
{
if (!inString && chr == '"')
{
current = "";
inString = true;
continue;
}
if (inString && !escaped && chr == '"')
{
finalResult.Add(current);
current = "";
inString = false;
continue;
}
if (inString && !escaped && chr == '\\')
{
escaped = true;
continue;
}
if (inString && (chr != '"' || escaped))
{
escaped = false;
current += chr;
continue;
}
if (inToken && seps.Contains(chr))
{
finalResult.Add(current);
current = "";
inToken = false;
continue;
}
if (!inString && chr == '"')
{
inString = true;
current = "";
continue;
}
if (!inToken && !seps.Contains(chr))
{
inToken = true;
current = "";
}
current += chr;
}
return finalResult.ToArray();
}
You could code up a simple parser yourself. Here's one I threw together (could use some cleaning up, but demonstrates the idea):
private static IEnumerable<string> Parse(string input) {
bool inToken = false;
bool inString = false;
bool escaped = false;
var seps = ",[]\"".ToArray();
var current = "";
foreach (var chr in input) {
if (!inString && chr == '"') {
current = "";
inString = true;
continue;
}
if (inString && !escaped && chr == '"') {
yield return current;
current = "";
inString = false;
continue;
}
if (inString && !escaped && chr == '\\') {
escaped = true;
continue;
}
if (inString && (chr != '"' || escaped)) {
escaped = false;
current += chr;
continue;
}
if (inToken && seps.Contains(chr)) {
yield return current;
current = "";
inToken = false;
continue;
}
if (!inString && chr == '"') {
inString = true;
current = "";
continue;
}
if (!inToken && !seps.Contains(chr)) {
inToken = true;
current = "";
}
current += chr;
}
}
Here's a jsfiddle demo.
Using Regex.Split you could do something like this for example:
using System;
using System.Text.RegularExpressions;
public class Example
{
public static void Main()
{
var input ="[[[\"Bonjour, Ceci est un test!\",\"Hello, This is a test!\",,,0]],,\"en\"]";
var parse = Regex.Split(input, "\\[|\\]|[^a-zA-Z ],|\",\"|\"|\"");
foreach(var item in parse) {
bool result = !String.IsNullOrEmpty(item) && (Char.IsLetter(item[0]) || Char.IsDigit(item[0]));
if (result) {
Console.WriteLine(item);
}
}
}
}
Output:
Bonjour, Ceci est un test!
Hello, This is a test!
0
en
If you want everything that was split you can simply remove the bool check for alphacharacters.
Here is a crazy idea - split by " and then by the rest (but won't work if there is " between the "'s)
var s = @"[[[""Bonjour, Ceci est un test!"",""Hello, This is a test!"",,,0]],,""en""]";
var a = s.Split('"').Select((x, i) => (i & 1) > 0 ? new[] { x } : x.Split("[],".ToArray(),
StringSplitOptions.RemoveEmptyEntries)).SelectMany(x => x).ToArray();
Debug.Print(string.Join("|", a)); // "Bonjour, Ceci est un test!|Hello, This is a test!|0|en"
You can try regex for splitting. I tested with the sample you provided. It results like this.
var str="[[[\"Bonjour, Ceci est un test!\",\"Hello, This is a test!\",,,0]],,\"en\"]";
var splitted=Regex.Split(str,@"\[|\]|\,");
foreach(var split in splitted){
Console.WriteLine(split );
}
"Bonjour Ceci est un test!"
"Hello This is a test!"
0
"en"

Parsing CSV strings (not files) in C#

Using C#, I need to parse a CSV string that doesn't come from a file. I've found a great deal of material on parsing CSV files, but virtually nothing on strings. It seems as though this should be simple, yet thus far I can come up only with inefficient methods, such as this:
using Microsoft.VisualBasic.FileIO;
var csvParser = new TextFieldParser(new StringReader(strCsvLine));
csvParser.SetDelimiters(new string[] { "," });
csvParser.HasFieldsEnclosedInQuotes = true;
Are there good ways of making this more efficient and less ugly? I will be processing huge volumes of strings, so I wouldn't want to pay the cost of all the above. Thanks.
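For reference, here is a minimal sketch of how the TextFieldParser snippet above can be completed to read fields from an in-memory string. The class and method names are hypothetical. It creates one parser for a whole multi-line string instead of one per line, which may reduce the setup cost, though I have not measured that.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using Microsoft.VisualBasic.FileIO;

public static class CsvStringDemo
{
    // Hypothetical helper: parses CSV text held in memory, no file involved.
    public static List<string[]> ParseCsv(string csv)
    {
        var rows = new List<string[]>();
        // A StringReader lets the parser consume a string like a stream.
        using (var parser = new TextFieldParser(new StringReader(csv)))
        {
            parser.SetDelimiters(",");
            parser.HasFieldsEnclosedInQuotes = true;
            // ReadFields returns the fields of one line per call.
            while (!parser.EndOfData)
            {
                rows.Add(parser.ReadFields());
            }
        }
        return rows;
    }
}
```

Calling ParseCsv("a,\"b,c\",d\n1,2,3") yields two rows, and the second field of the first row is "b,c" because the quoted comma is not treated as a delimiter.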
Here is a lightly tested parser that handles quotes
List<string> Parse(string line)
{
var columns = new List<string>();
var sb = new StringBuilder();
bool isQuoted = false;
int nQuotes = 0;
foreach(var c in line)
{
if (sb.Length == 0 && !isQuoted && c == '"')
{
isQuoted = true;
continue;
}
if (isQuoted)
{
if (c == '"')
{
nQuotes++;
continue;
}
else
{
if (nQuotes > 0)
{
sb.Append('"', nQuotes / 2);
if (nQuotes % 2 != 0)
{
isQuoted = false;
}
nQuotes = 0;
}
}
}
if (!isQuoted && c == ',')
{
columns.Add(sb.ToString());
sb.Clear();
continue;
}
sb.Append(c);
}
if (nQuotes > 0)
{
sb.Append('"', nQuotes / 2);
}
columns.Add(sb.ToString());
return columns;
}

Read Text File from specific places

I have a question about reading a text file, because I don't know if I'm approaching it correctly. I want to read from a specific string up to a specific character.
My text would look like this:
...
...
CM_ "Hello, how are you?
Rules: Don't smoke!
- love others
End";
...
CM_ "Why you?";
...// Many CM_
...
After splitting, it should look like this:
1. CM_
2. "Hello, how are you?
Rules: Don't smoke!
- love others
End"
3. CM_
4. "Why you?"
... // many CM_
I want to read from "CM_" until ";".
The code I've tried so far:
StreamReader fin = new StreamReader("text.txt");
string tmp = "";
tmp = fin.ReadToEnd();
if (tmp.StartsWith("CM_ ") && tmp.EndsWith(";"))
{
var result = tmp.Split(new[] { '"' }).SelectMany((s, i) =>
{
if (i % 2 == 1) return new[] { s };
return s.Split(new[] { ' ', ';' }, StringSplitOptions.RemoveEmptyEntries);
}).ToList();
}
foreach (string x in result)
{
Console.WriteLine(x);
}
static void PRegex()
{
using (StreamReader fin = new StreamReader("text.txt"))
{
string tmp = fin.ReadToEnd();
var matches = Regex.Matches(tmp, "(CM_) ([^;]*);", RegexOptions.Singleline);
for (int i = 0; i < matches.Count; i++)
if (matches[i].Groups.Count == 3)
Console.WriteLine((2 * i + 1).ToString() + ". " + matches[i].Groups[1].Value + "\r\n" + (2 * (i + 1)).ToString() + ". " + matches[i].Groups[2].Value);
}
Console.ReadLine();
}
static void PLineByLine()
{
using (StreamReader fin = new StreamReader("text.txt"))
{
int index = 0;
string line = null;
string currentCMBlock = null;
bool endOfBlock = true;
while ((line = fin.ReadLine()) != null)
{
bool endOfLine = false;
while (!endOfLine)
{
if (endOfBlock)
{
int startIndex = line.IndexOf("CM_ ");
if (startIndex == -1)
{
endOfLine = true;
continue;
}
line = line.Substring(startIndex + 4, line.Length - startIndex - 4);
endOfBlock = false;
}
if (!endOfBlock)
{
int startIndex = line.IndexOf(";");
if (startIndex == -1)
{
currentCMBlock += line + "\r\n";
endOfLine = true;
continue;
}
currentCMBlock += line.Substring(0, startIndex);
if (!string.IsNullOrEmpty(currentCMBlock))
Console.WriteLine((++index) + ". CM_\r\n" + (++index) + ". " + currentCMBlock);
currentCMBlock = null;
line = line.Substring(startIndex + 1, line.Length - startIndex - 1);
endOfBlock = true;
}
}
}
}
Console.ReadLine();
}
You are reading the whole file into tmp. So, if there is any text before "CM_" then your conditional statement won't be entered.
Instead, try reading line by line with fin.ReadLine in a loop over all lines.
Read the whole file:
string FileToRead = File.ReadAllText("Path");
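string GetContent(string StartAt, string EndAt, bool LastIndex)
{
// Use the last occurrence of StartAt when LastIndex is true, otherwise the first.
int start = LastIndex ? FileToRead.LastIndexOf(StartAt) : FileToRead.IndexOf(StartAt);
if (start < 0) return null;
start += StartAt.Length;
// Look for the end marker after the chosen start marker.
int end = FileToRead.IndexOf(EndAt, start);
if (end < 0) return null;
// Keep only the text between the two markers.
return FileToRead.Substring(start, end - start);
}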
I hope I didn't get anything wrong here (typed from memory).
After reading the file, this removes everything before the start marker and everything after the end marker.
The LastIndex flag chooses whether the first or the last match is returned.
NOTE: I think it would be better to use a StringReader (if I remember correctly) if you are concerned about your application's memory usage.
I tried something else, but I don't know if it is any good. It still reads only the first line, and I don't know what I did wrong here.
My code:
while ((tmp = fin.ReadLine()) != null)
{
if (tmp.StartsWith("CM_ "))
{
//string[] tmpList = tmp.Split(new Char[] { ' ', ';' }, StringSplitOptions.RemoveEmptyEntries);
var result = tmp.Split(new[] { '"' }).SelectMany((s, i) =>
{
if (i % 2 == 1) return new[] { s };
return s.Split(new[] { ' ', ';' }, StringSplitOptions.RemoveEmptyEntries);
}).ToList();
if (tmp.EndsWith(";")) break;
fin.ReadLine();
if (tmp.EndsWith(";"))
{
result.ToList();
break;
}
else
{
result.ToList();
fin.ReadLine();
}
foreach (string x in result)
{
Console.WriteLine(x);
}
}
I suggest you look into using Regular Expressions. It may be just what you need and much more flexible than Split().
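As a rough sketch of the regular expression route, the following pulls out the quoted text of every CM_ entry. The class name, method name and pattern are my own assumptions, and the pattern assumes the quoted text never contains a double quote itself.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class CmBlocks
{
    // Hypothetical helper: returns the quoted text of every CM_ "..."; entry.
    public static List<string> Extract(string text)
    {
        var bodies = new List<string>();
        // [^"]* also matches line breaks, so multi-line entries are captured whole.
        foreach (Match m in Regex.Matches(text, "CM_\\s+\"([^\"]*)\"\\s*;"))
        {
            bodies.Add(m.Groups[1].Value);
        }
        return bodies;
    }
}
```

Passing the whole file content (for example from File.ReadAllText) to Extract gives one list item per CM_ entry, in file order, regardless of what text surrounds the entries.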

Create Space Between Capital Letters and Skip Space Between Consecutive

I found a way to insert a space so that "ThisCourse" becomes "This Course":
Add Space Before Capital Letter By (EtienneT) LINQ Statement
But I cannot turn "ThisCourseID" into "This Course ID", with no space inside "ID".
Is there a way to do this in LINQ?
Well, if it has to be a single linq statement...
var s = "ThisCourseIDMoreXYeahY";
s = string.Join(
string.Empty,
s.Select((x,i) => (
char.IsUpper(x) && i>0 &&
( char.IsLower(s[i-1]) || (i<s.Count()-1 && char.IsLower(s[i+1])) )
) ? " " + x : x.ToString()));
Console.WriteLine(s);
Output: "This Course ID More X Yeah Y"
var s = "ThisCourseID";
for (var i = 1; i < s.Length; i++)
{
if (char.IsLower(s[i - 1]) && char.IsUpper(s[i]))
{
s = s.Insert(i, " ");
}
}
Console.WriteLine(s); // "This Course ID"
You can improve this using StringBuilder if you are going to use this on very long strings, but for your purpose, as you presented it, it should work just fine.
FIX:
var s = "ThisCourseIDSomething";
for (var i = 1; i < s.Length - 1; i++)
{
if (char.IsLower(s[i - 1]) && char.IsUpper(s[i]) ||
s[i - 1] != ' ' && char.IsUpper(s[i]) && char.IsLower(s[i + 1]))
{
s = s.Insert(i, " ");
}
}
Console.WriteLine(s); // This Course ID Something
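The StringBuilder variant mentioned above could look roughly like this. It is only a sketch: the class and method names are hypothetical, and it applies the same two rules as the loop (a space at a lowercase-to-uppercase boundary, and a space before the last capital of an acronym when a lowercase letter follows), but reads from the original string so no index shifting is needed.

```csharp
using System;
using System.Text;

public static class CaseSpacer
{
    // Hypothetical helper: builds the result once instead of calling Insert repeatedly.
    public static string AddSpaces(string s)
    {
        if (string.IsNullOrEmpty(s)) return s;
        var sb = new StringBuilder(s.Length + 8);
        sb.Append(s[0]);
        for (int i = 1; i < s.Length; i++)
        {
            bool upper = char.IsUpper(s[i]);
            // Rule 1: previous character is lowercase, e.g. "sC" in "ThisCourse".
            bool afterLower = char.IsLower(s[i - 1]);
            // Rule 2: end of an acronym, e.g. the "S" after "ID" in "IDSomething".
            bool acronymEnd = char.IsUpper(s[i - 1])
                && i + 1 < s.Length && char.IsLower(s[i + 1]);
            if (upper && (afterLower || acronymEnd)) sb.Append(' ');
            sb.Append(s[i]);
        }
        return sb.ToString();
    }
}
```

AddSpaces("ThisCourseIDSomething") returns "This Course ID Something", and AddSpaces("ThisCourseID") returns "This Course ID".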
You don't need LINQ - but you could 'enumerate' and use lambda to make it more generic...
(though not sure if any of this makes sense)
static IEnumerable<string> Split(this string text, Func<char?, char?, char, int?> shouldSplit)
{
StringBuilder output = new StringBuilder();
char? before = null;
char? before2nd = null;
foreach (var c in text)
{
var where = shouldSplit(before2nd, before, c);
if (where != null)
{
var str = output.ToString();
switch(where)
{
case -1:
output.Remove(0, str.Length -1);
yield return str.Substring(0, str.Length - 1);
break;
case 0: default:
output.Clear();
yield return str;
break;
}
}
output.Append(c);
before2nd = before;
before = c;
}
yield return output.ToString();
}
...and call it like this e.g. ...
static IEnumerable<string> SplitLines(this string text)
{
return text.Split((before2nd, before, now) =>
{
if ((before2nd ?? 'A') == '\r' && (before ?? 'A') == '\n') return 0; // split on 'now'
return null; // don't split
});
}
static IEnumerable<string> SplitOnCase(this string text)
{
return text.Split((before2nd, before, now) =>
{
if (char.IsLower(before ?? 'A') && char.IsUpper(now)) return 0; // split on 'now'
if (char.IsUpper(before2nd ?? 'a') && char.IsUpper(before ?? 'a') && char.IsLower(now)) return -1; // split one char before
return null; // don't split
});
}
...and somewhere...
var text = "ToSplitOrNotToSplitTHEQuestionIsNow";
var words = text.SplitOnCase();
foreach (var word in words)
Console.WriteLine(word);
text = "To\r\nSplit\r\nOr\r\nNot\r\nTo\r\nSplit\r\nTHE\r\nQuestion\r\nIs\r\nNow";
words = text.SplitLines();
foreach (var word in words)
Console.WriteLine(word);
:)
