Converting Arabic Words to Unicode format in C# - c#

I am designing an API whose users need Arabic text to be returned in Unicode escape format. To do so, I tried the following:
public static class StringExtensions
{
public static string ToUnicodeString(this string str)
{
StringBuilder sb = new StringBuilder();
foreach (var c in str)
{
sb.Append("\\u" + ((int)c).ToString("X4"));
}
return sb.ToString();
}
}
The issue with the above code is that it returns the Unicode code point of each letter regardless of its position in the word.
Example: let us assume we have the following word:
"سمير" which consists of:
'س' which is written like 'سـ' because it is the first letter in word.
'م' which is written like 'ـمـ' because it is in the middle of word.
'ي' which is written like 'ـيـ' because it is in the middle of word.
'ر' which is written like 'ـر' because it is last letter of word.
The above code returns unicode of { 'س', 'م' , 'ي' , 'ر'} which is:
\u0633\u0645\u064A\u0631
instead of { 'سـ' , 'ـمـ' , 'ـيـ' , 'ـر'} which is
\uFEB3\uFEE4\uFEF4\uFEAE
Any ideas on how to update code to get correct Unicode?

The string is just a sequence of Unicode code points; it does not know the rules of Arabic. You're getting out exactly the data you put in; if you want different data out, then put different data in!
Try this:
Console.WriteLine("\u0633\u0645\u064A\u0631");
Console.WriteLine("\u0633\u0645\u064A\u0631".ToUnicodeString());
Console.WriteLine("\uFEB3\uFEE4\uFEF4\uFEAE");
Console.WriteLine("\uFEB3\uFEE4\uFEF4\uFEAE".ToUnicodeString());
As expected the output is
سمير
\u0633\u0645\u064A\u0631
ﺳﻤﻴﺮ
\uFEB3\uFEE4\uFEF4\uFEAE
Those two sequences of Unicode code points render the same in the browser, but they're different sequences. If you want to write out the second sequence, then don't pass in the first sequence.

Based on Eric's answer I figured out how to solve my problem, and I have published a solution on GitHub.
There you will find a simple tool that runs on Windows; if you want to use the code in your own projects, just copy UnicodesTable.cs and Unshaper.cs.
Basically, you need a table of Unicode code points for the positional forms of each Arabic letter; then you can use something like the following extension method.
public static string GetUnShapedUnicode(this string original)
{
original = Regex.Unescape(original.Trim());
var words = original.Split(' ');
StringBuilder builder = new StringBuilder();
var unicodesTable = UnicodesTable.GetArabicGliphes();
foreach (var word in words)
{
string previous = null;
for (int i = 0; i < word.Length; i++)
{
string shapedUnicode = @"\u" + ((int)word[i]).ToString("X4");
if (!unicodesTable.ContainsKey(shapedUnicode))
{
builder.Append(shapedUnicode);
previous = null;
continue;
}
else
{
if (i == 0 || previous == null)
{
builder.Append(unicodesTable[shapedUnicode][1]);
}
else
{
if (i == word.Length - 1)
{
if (!string.IsNullOrEmpty(previous) && unicodesTable[previous][4] == "2")
{
builder.Append(unicodesTable[shapedUnicode][0]);
}
else
builder.Append(unicodesTable[shapedUnicode][3]);
}
else
{
bool previouChar = unicodesTable[previous][4] == "2";
if (previouChar)
builder.Append(unicodesTable[shapedUnicode][1]);
else
builder.Append(unicodesTable[shapedUnicode][2]);
}
}
}
previous = shapedUnicode;
}
if (words.ToList().IndexOf(word) != words.Length - 1)
builder.Append(@"\u" + ((int)' ').ToString("X4"));
}
return builder.ToString();
}
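The table-lookup idea can be illustrated with a much smaller sketch. The class name ArabicShapingSketch, the Forms table, and the JoinsToNext helper are hypothetical names made up for this example; they are not taken from the GitHub project. The table covers only the four letters of the example word, and reh is treated as a letter that does not connect to the letter after it. A real solution would need entries for the whole alphabet.

```csharp
using System.Collections.Generic;
using System.Text;

public static class ArabicShapingSketch
{
    // Hypothetical mini-table: base letter -> { isolated, final, initial, medial }.
    // Reh has no initial/medial form, so those slots reuse isolated/final.
    private static readonly Dictionary<char, char[]> Forms = new Dictionary<char, char[]>
    {
        { '\u0633', new[] { '\uFEB1', '\uFEB2', '\uFEB3', '\uFEB4' } }, // seen
        { '\u0645', new[] { '\uFEE1', '\uFEE2', '\uFEE3', '\uFEE4' } }, // meem
        { '\u064A', new[] { '\uFEF1', '\uFEF2', '\uFEF3', '\uFEF4' } }, // yeh
        { '\u0631', new[] { '\uFEAD', '\uFEAE', '\uFEAD', '\uFEAE' } }, // reh
    };

    // True when the letter connects to the letter that follows it.
    private static bool JoinsToNext(char c)
    {
        return Forms.ContainsKey(c) && c != '\u0631';
    }

    // Replaces each known letter of a single word with its positional form.
    public static string Shape(string word)
    {
        var sb = new StringBuilder(word.Length);
        for (int i = 0; i < word.Length; i++)
        {
            char c = word[i];
            char[] forms;
            if (!Forms.TryGetValue(c, out forms))
            {
                sb.Append(c); // unknown character: copy unchanged
                continue;
            }
            bool joinPrev = i > 0 && JoinsToNext(word[i - 1]);
            bool joinNext = i < word.Length - 1
                && Forms.ContainsKey(word[i + 1])
                && JoinsToNext(c);
            int index = joinPrev ? (joinNext ? 3 : 1) : (joinNext ? 2 : 0);
            sb.Append(forms[index]);
        }
        return sb.ToString();
    }
}
```

Calling ArabicShapingSketch.Shape("\u0633\u0645\u064A\u0631").ToUnicodeString() with the extension method from the question should then give \uFEB3\uFEE4\uFEF4\uFEAE.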

Related

Equals() method not recognizing similar/same characters when comparing

Why comparing characters with .Equals always returns false?
char letter = 'a';
Console.WriteLine(letter.Equals("a")); // false
Overall, I'm trying to write an English to Morse code translator. I ran into a problem comparing char values, as shown above. I began with a foreach loop that goes through all the characters from a ReadLine() input. Printing them with WriteLine() worked fine, but when I compared them using the .Equals() method, it always returned false, no matter what I did.
I have used the .Equals() method with other strings successfully, but it seems to not work with my chars.
using System;
public class MorseCode {
public static void Main (string[] args) {
Console.WriteLine ("Hello, write anything to convert it to morse code!");
var input = Console.ReadLine();
foreach (char letter in input) {
if(letter.Equals("a")) {
Console.WriteLine("Its A - live");
}
Console.WriteLine(letter);
}
var morseTranslation = "";
foreach (char letter in input) {
if(letter.Equals("a")) {
morseTranslation += ". _ - ";
}
if(letter.Equals("b")) {
morseTranslation += "_ . . . - ";
}
if(letter.Equals("c")) {
morseTranslation += "_ . _ . - ";
}
...
}
Console.WriteLine("In morse code, " + input + " is '" + morseTranslation + "'");
}
}
At the beginning, I wrote the foreach to test if it recognized and ran the correct output, but in the end, when I wrote "sample" into the ReadLine(), it gave me :
Hello, write anything to convert it to morse code!
sample
s
a
m
p
l
e
When you do this:
var c = 'x';
var isEqual = c.Equals("x");
the result (isEqual) will always be false because it's comparing a string to a char. This would return true:
var isEqual = c.Equals('x');
The difference is that "x" is a string literal and 'x' is a char literal.
Part of what makes this confusing is that when you use an object's Equals method, it allows you to compare any type to any other type. So you could do this:
var x = 0;
var y = "y";
var isEqual = x.Equals(y);
...and the compiler will allow it, even though a comparison between an int and a string can never be true. At most, it will give you a warning.
When comparing value types like int or char with other values of the same type, we usually use ==, like
if (someChar == someOtherChar)
Then if you tried to do this:
if(someChar == "a")
It wouldn't compile. The compiler would tell you that you're comparing a char to a string, which is easier to deal with: instead of running the program and hunting for the error, you are told exactly where the problem is.
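As a minimal sketch of the three comparisons discussed above (the class and method names are made up for illustration):

```csharp
public static class CharEqualsDemo
{
    // Returns the results of three ways of comparing the char 'a'.
    public static bool[] Compare()
    {
        char letter = 'a';
        return new[]
        {
            letter.Equals("a"), // char vs. string: the types differ, so false
            letter.Equals('a'), // char vs. char: true
            letter == 'a'       // operator comparison of two chars: true
        };
    }
}
```

Printing the three values with Console.WriteLine gives False, True and True.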
Just for the fun of it, here's another implementation.
public static class MorseCodeConverter
{
private static readonly Dictionary<char, string> Codes
= CreateMorseCodeDictionary();
public static string Convert(string input)
{
var lowerCase = input.ToLower();
var result = new StringBuilder();
foreach (var character in lowerCase)
{
if (Codes.ContainsKey(character))
result.Append(Codes[character]);
}
return result.ToString();
}
static Dictionary<char, string> CreateMorseCodeDictionary()
{
var result = new Dictionary<char, string>();
result.Add('a', ". _ - ");
result.Add('b', "_ . . . - ");
// add all the rest
return result;
}
}
One difference is that it's a class by itself without the console app. Then you can use it in a console app. Read the input from the keyboard and then call
MorseCodeConverter.Convert(input);
to get the result, and then you can print it to the console.
Putting all of the characters in a dictionary means that instead of repeating the if/then you can just check to see if each character is in the dictionary.
It's important to remember that although char and string values look similar when printed, they are not handled in the same way.
When you check a string you can use:
string s = "A";
if(s.Equals("A"))
{
//Do Something
}
However, the above will not work with a char. On the surface, the difference between chars (value types) and strings (reference types) is the delimiter used in the literal: single quotes (apostrophes) for a char versus double quotes for a string.
To compare a char you can do this:
char s = 'A';
if(s.Equals('A'))
{
//Do Something
}
On a point relevant to your specific case, however: Morse code only requires a single-case alphabet, so rather than comparing against both 'A' and 'a' you can call input.ToLower() to reduce your string to all lower case and handle only one case.
Note that you can also compare the char directly with ==:
if (letter == 'a')
{
Console.WriteLine("Its A - live");
}
This works and is the usual idiom for value types such as char. The string version, however:
if (letter == "a")
{
Console.WriteLine("Its A - live");
}
does not compile, because C# has no == operator between a char and a string. Between two strings, == in C# compares the values rather than the references; see here.
For a char comparison you have to use the single quote character ', not the double quote ".
By the way, it writes "sample" one letter per line because your first foreach loop writes every letter on a new line. So the code below will work for you:
using System;
public class MorseCode {
public static void Main (string[] args) {
Console.WriteLine ("Hello, write anything to convert it to morse code!");
var input = Console.ReadLine();
/*foreach (char letter in input) {
if(letter.Equals("a")) {
Console.WriteLine("Its A - live");
}
Console.WriteLine(letter);
}*/
var morseTranslation = "";
foreach (char letter in input) {
if(letter.Equals('a')) {
morseTranslation += ". _ - ";
}
if(letter.Equals('b')) {
morseTranslation += "_ . . . - ";
}
if(letter.Equals('c')) {
morseTranslation += "_ . _ . - ";
}
...
}
Console.WriteLine("In morse code, " + input + " is '" + morseTranslation + "'");
}
}
In C#, you can compare strings the same way as integers, that is, with the == operator. Equals is a method inherited from the object class, and implementations normally perform a type check first. char letter is (obviously) a character, while "a" is a single-letter string.
That's why it returns false.
You could use if (letter.Equals('a')) { ... }, or simpler if (letter == 'a') { ... }
Even simpler than that would be switch (letter) { case 'a': ...; break; ... }.
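A minimal sketch of the switch variant, assuming a hypothetical helper class MorseSwitchSketch and covering only the first three letters (the remaining cases follow the same pattern):

```csharp
using System.Text;

public static class MorseSwitchSketch
{
    // Translates the letters a, b and c; every other character is skipped.
    public static string Translate(string input)
    {
        var result = new StringBuilder();
        foreach (char letter in input.ToLowerInvariant())
        {
            switch (letter)
            {
                case 'a': result.Append(".- "); break;
                case 'b': result.Append("-... "); break;
                case 'c': result.Append("-.-. "); break;
                // remaining letters are left out of this sketch
                default: break;
            }
        }
        return result.ToString().TrimEnd();
    }
}
```

With this sketch, MorseSwitchSketch.Translate("Cab") returns "-.-. .- -...".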
Or something that is more elegant, but maybe too advanced for a beginner, using LINQ:
var validCharacters = "ABCDE...";
var codes = new string[] {
".-", "-...", "-.-.", "-..", ".", ...
};
var translated = input.ToUpper() // make uppercase
.ToCharArray() // explode string into single characters
.Select(c => validCharacters.IndexOf(c)) // for each element (i.e. character), get the result of "validCharacters.IndexOf",
// which equals the index of the morse code in the array "codes"
.Where(i => i > -1) // only take the indexes of characters that were found in "validCharacters"
.Select(i => codes[i]); // retrieve the matching entry from "codes" by index
// "translated" is now an IEnumerable<string>, a structure saying
// "I am a list of strings over which you can iterate,
// and I know how to generate the elements as you request them."
// Now concatenate all single codes to one long result string
var result = string.Join(" ", translated);

How to get parentheses inside parentheses

I'm trying to keep a parenthesized group intact inside a string that is itself surrounded by parentheses.
The string in question is: test (blue,(hmmm) derp)
The desired output into an array is: test and (blue,(hmmm) derp).
The current output is: (blue,, (hmm) and derp).
My current code is this:
var input = Regex
.Split(line, @"(\([^()]*\))")
.Where(s => !string.IsNullOrEmpty(s))
.ToList();
How can I extract the text inside the outer parentheses (keeping them) and keep the inner parentheses as part of one string in the array?
EDIT:
To clarify my question, I want to ignore the inner parentheses and only split on the outer parentheses.
herpdediderp (orange,(hmm)) some other crap (red,hmm)
Should become:
herpdediderp, orange,(hmm), some other crap and red,hmm.
The code works for everything except the double parentheses: (orange,(hmm)) to orange,(hmm).
You can use the method
public string Trim(params char[] trimChars)
Like this
string trimmedLine = line.Trim('(', ')'); // Specify undesired leading and trailing chars.
// Specify separator characters for the split (here comma and space):
string[] input = trimmedLine.Split(new[]{',', ' '}, StringSplitOptions.RemoveEmptyEntries);
If the line can start or end with 2 consecutive parentheses, use simply good old if-statements:
if (line.StartsWith("(")) {
line = line.Substring(1);
}
if (line.EndsWith(")")) {
line = line.Substring(0, line.Length - 1);
}
string[] input = line.Split(new[]{',', ' '}, StringSplitOptions.RemoveEmptyEntries);
Lot's o' guessing going on here - from me and the others. You could try
[^(]+|\([^(]*(?:\([^(]*\)[^(]*)*\)
It handles one level of parentheses recursion (could be extended though).
Here at regexstorm.
Visual illustration at regex101.
If this piques your interest, I'll add an explanation ;)
Edit:
If you need to use split, put the selection in to a group, like
([^(]+|\([^(]*(?:\([^(]*\)[^(]*)*\))
and filter out empty strings. See example here at ideone.
Edit 2:
Not quite sure what behaviour you want with multiple levels of parentheses, but I assume this could do it for you:
([^(]+|\([^(]*(?:\([^(]*(?:\([^(]*\)[^(]*)*\)[^(]*)*\))
^^^^^^^^^^^^^^^^^^^ added
For each level of recursion you want, you "just" add another inner level. So this is for two levels of recursion ;)
See it here at ideone.
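Since the question is about .NET, another option, not part of the patterns above, is a balancing group, which lets the regex engine count nesting levels instead of spelling out each level by hand. The sketch below uses a hypothetical helper class OuterParens and only extracts the outermost parenthesized groups; the text between the groups would have to be collected separately.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class OuterParens
{
    // \(                opening parenthesis of an outer group
    // (?> ... )*        any number of: non-parenthesis text,
    //                   "(" which pushes onto the "depth" stack,
    //                   ")" which pops from the "depth" stack
    // (?(depth)(?!))    fail if an inner "(" is still unclosed
    // \)                closing parenthesis of the outer group
    private static readonly Regex Pattern = new Regex(
        @"\((?>[^()]+|\((?<depth>)|\)(?<-depth>))*(?(depth)(?!))\)");

    // Returns every outermost parenthesized group, parentheses included.
    public static List<string> Extract(string line)
    {
        var groups = new List<string>();
        foreach (Match m in Pattern.Matches(line))
        {
            groups.Add(m.Value);
        }
        return groups;
    }
}
```

For the input herpdediderp (orange,(hmm)) some other crap (red,hmm) this returns the two strings (orange,(hmm)) and (red,hmm).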
Hopefully someone will come up with a regex. Here's my code answer.
static class ExtensionMethods
{
static public IEnumerable<string> GetStuffInsideParentheses(this IEnumerable<char> input)
{
int levels = 0;
var current = new Queue<char>();
foreach (char c in input)
{
if (levels == 0)
{
if (c == '(') levels++;
continue;
}
if (c == ')')
{
levels--;
if (levels == 0)
{
yield return new string(current.ToArray());
current.Clear();
continue;
}
}
if (c == '(')
{
levels++;
}
current.Enqueue(c);
}
}
}
Test program:
public class Program
{
public static void Main()
{
var input = new []
{
"(blue,(hmmm) derp)",
"herpdediderp (orange,(hmm)) some other crap (red,hmm)"
};
foreach ( var s in input )
{
var output = s.GetStuffInsideParentheses();
foreach ( var o in output )
{
Console.WriteLine(o);
}
Console.WriteLine();
}
}
}
Output:
blue,(hmmm) derp
orange,(hmm)
red,hmm
Code on DotNetFiddle
I think that if you think about the problem backwards, it becomes a bit easier: don't split on what you don't want, extract what you do want.
The only slightly tricky part is matching nested parentheses; I assume you will only go one level deep.
The first example:
var s1 = "(blue, (hmmm) derp)";
var input = Regex.Matches(s1, @"\((?:\(.+?\)|[^()]+)+\)").Cast<Match>().Select(m => Regex.Matches(m.Value, @"\(\w+\)|\w+").Cast<Match>().Select(m2 => m2.Value).ToArray()).ToArray();
// input is string[][] { string[] { "blue", "(hmmm)", "derp" } }
The second example uses an extension method:
public static string TrimOutside(this string src, string openDelims, string closeDelims) {
if (!String.IsNullOrEmpty(src)) {
var openIndex = openDelims.IndexOf(src[0]);
if (openIndex >= 0 && src.EndsWith(closeDelims.Substring(openIndex, 1)))
src = src.Substring(1, src.Length - 2);
}
return src;
}
The code/patterns are different because the two examples are being handled differently:
var s2 = "herpdediderp (orange,(hmm)) some other crap (red,hmm)";
var input2 = Regex.Matches(s2, @"\w(?:\w| )+\w|\((?:[^(]+|\([^)]+\))+\)").Cast<Match>().Select(m => m.Value.TrimOutside("(",")")).ToArray();
// input2 is string[] { "herpdediderp", "orange,(hmm)", "some other crap", "red,hmm" }

select alternate alphabet from string of word in asp.net

I want to generate a code from alternate letters of a given word. E.g. if the word is SPACEORION, the code should be SPCO. I need this to generate client codes from client names. What would be a suitable solution?
OK, from what I understand, this might be what you want, but the result for SPACEORION would be SAER and not SACO, so I hope I understood you correctly.
string name = "SPACEORION";
var shortName = "";
// Take every other letter (indexes 0, 2, 4, ...) until we have 4 characters.
for (int i = 0; i < name.Length && shortName.Length < 4; i += 2)
{
shortName += name[i];
}
// shortName is now "SAER"

Custom Uppercase on String

Hi, I am trying to write a program that converts certain words in a string to uppercase.
The words to convert are wrapped in a tag, like this:
the <upcase>weather</upcase> is very <upcase>hot</upcase>
The result:
the WEATHER is very HOT
My code is like this:
string upKey = "<upcase>";
string lowKey = "</upcase>";
string quote = "the lazy <upcase>fox jump over</upcase> the dog <upcase> something here </upcase>";
int index = quote.IndexOf(upKey);
int indexEnd = quote.IndexOf(lowKey);
while(index!=-1)
{
for (int a = 0; a < index; a++)
{
Console.Write(quote[a]);
}
string upperQuote = "";
for (int b = index + 8; b < indexEnd; b++)
{
upperQuote += quote[b];
}
upperQuote = upperQuote.ToUpper().ToString();
Console.Write(upperQuote);
for (int c = indexEnd+9;c<quote.Length;c++)
{
if (quote[c]=='<')
{
break;
}
Console.Write(quote[c]);
}
index = quote.IndexOf(upKey, index + 1);
indexEnd = quote.IndexOf(lowKey, index + 1);
}
Console.WriteLine();
}
I have also tried this code inside a while (indexEnd != -1) loop:
index = quote.IndexOf(upKey, index + 1);
indexEnd = quote.IndexOf(lowKey, index + 1);
but that does not work: the program runs into an infinite loop. By the way, I'm a beginner, so please give an answer that I can understand :)
You can use a regular expression for this:
string input = "the <upcase>weather</upcase> is very <upcase>hot</upcase>";
var regex = new Regex("<upcase>(?<theMatch>.*?)</upcase>");
var result = regex.Replace(input, match => match.Groups["theMatch"].Value.ToUpper());
// result will be: "the WEATHER is very HOT"
Here's an explanation taken from here for the regular expression used above:
<upcase> matches the characters <upcase> literally (case sensitive)
(?<theMatch>.*?) Named capturing group theMatch
.*? matches any character (except newline)
Quantifier: *? Between zero and unlimited times, as few times as possible, expanding as needed [lazy]
< matches the characters < literally
/ matches the character / literally
upcase> matches the characters upcase> literally (case sensitive)
The following will work as long as there are only matching tags and none of them are nested.
public static string Upper(string str)
{
const string start = "<upcase>";
const string end = "</upcase>";
var builder = new StringBuilder();
// Find the first start tag
int startIndex = str.IndexOf(start);
// If no start tag found then return the original
if (startIndex == -1)
return str;
// Append the part before the first tag as is
builder.Append(str.Substring(0, startIndex));
// Continue as long as we find another start tag.
while (startIndex != -1)
{
// Find the end tag for the current start tag
var endIndex = str.IndexOf(end, startIndex);
// Append the text between the start and end as upper case.
builder.Append(
str.Substring(
startIndex + start.Length,
endIndex - startIndex - start.Length).ToUpper());
// Find the next start tag.
startIndex = str.IndexOf(start, endIndex);
// Append the part after the end tag, but before the next start as is
builder.Append(
str.Substring(
endIndex + end.Length,
(startIndex == -1 ? str.Length : startIndex) - endIndex - end.Length));
}
return builder.ToString();
}
I'm not rewriting your code. Just answering your (main) question:
You need to keep a variable of the index you're at, and check for IndexOf from there only (See MSDN). Something like this:
int index = 0;
while (quote.IndexOf(upKey, index) != -1)
{
//Your code, including updating the value of index.
}
(I didn't check this on Visual Studio. This is just to point you in the direction that I think you're looking for.)
The reason for the infinite loop is that you're always testing IndexOf of the same index. Perhaps you mean to have quote.IndexOf(upKey, index += 1); which would change the value of index?
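One way this could look in full is sketched below. The class name UpcaseSketch and the method name Apply are made up for this example, and it assumes, like the question, that tags are not nested. The key point is that index always moves past the closing tag that was just handled, so every search starts further along the string and the loop has to end.

```csharp
using System;
using System.Text;

public static class UpcaseSketch
{
    public static string Apply(string quote)
    {
        const string upKey = "<upcase>";
        const string lowKey = "</upcase>";
        var result = new StringBuilder();
        int index = 0; // position from which the next search starts

        while (true)
        {
            int open = quote.IndexOf(upKey, index, StringComparison.Ordinal);
            if (open == -1) break; // no more opening tags

            int start = open + upKey.Length;
            int close = quote.IndexOf(lowKey, start, StringComparison.Ordinal);
            if (close == -1) break; // unmatched tag: the rest is copied unchanged

            // Text before the tag is copied as is.
            result.Append(quote, index, open - index);
            // Text between the tags is upper-cased.
            result.Append(quote.Substring(start, close - start).ToUpperInvariant());
            // Continue searching after the closing tag.
            index = close + lowKey.Length;
        }

        // Whatever is left after the last tag.
        result.Append(quote, index, quote.Length - index);
        return result.ToString();
    }
}
```

For the sample from the question, UpcaseSketch.Apply("the <upcase>weather</upcase> is very <upcase>hot</upcase>") returns "the WEATHER is very HOT".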
The way to go here is probably to use Regex, but these easy parsing exercises are always fun to do manually. This can be easily solved using a very simple state machine.
What states can we have when dealing with strings of this nature? I can think of 4:
We are either parsing normal text
Or we are parsing an opening format tag '<...>'
Or we are parsing a closing format tag '</...>'
Or we are parsing text to be formatted between tags
I can't think of any other states. Now we need to think about the normal flow / transition between states. What should happen when we parse a string with the correct format?
Parser starts up expecting normal text. That is easy to understand.
If expecting normal text we encounter a '<' then the parser should switch to parsing opening format tag state. There is no other valid state transition.
If in parsing opening format tag state we encounter a '>' then the parser should switch to parsing text to be formatted. There is no other valid state transition.
If in parsing text to be formatted we encounter a '<' then the parser should switch to parsing closing tag. Again, there is no other valid state transition.
If in parsing closing tag we encounter a '>' then the parser should switch to normal text. Once more, there is no other valid transition. Note that we are disallowing nested tags.
Ok, so that seems pretty easy to understand. What do we need to implement this?
First we'll need something to represent the parsing states. A good old enum will do:
private enum ParsingState
{
UnformattedText,
OpenTag,
CloseTag,
FormattedText,
}
Now we need some string buffers to keep track of the final formatted string, the current format tag we are parsing and finally the substring we need to format. We will use several StringBuilders for these, as we don't know how long these buffers will be or how many concatenations will be performed:
var formattedStringBuffer = new StringBuilder();
var formatBuffer = new StringBuilder();
var tagBuffer = new StringBuilder();
We will also need to keep track of the parser's state and the current active tag if any (so we can make sure that the parsed closing tag matches the current active tag):
var state = ParsingState.UnformattedText;
var activeFormatTag = string.Empty;
And now we are good to go, but before we do, can we generalize this so it works with any format tag?
Yes we can; we just need to tell the parser what to do for each supported tag. We can do this easily by passing along a Dictionary that ties each tag to the action it should perform. We do this the following way:
var formatter = new Dictionary<string, Func<string, string>>();
formatter.Add("upcase", s => s.ToUpperInvariant());
formatter.Add("lcase", s => s.ToLowerInvariant());
Great! Now our implementation could be the following:
public static string Parse(this string str, Dictionary<string, Func<string,string>> formatter)
{
var formattedStringBuffer = new StringBuilder();
var formatBuffer = new StringBuilder();
var tagBuffer = new StringBuilder();
var state = ParsingState.UnformattedText;
var activeFormatTag = string.Empty;
foreach (var c in str)
{
switch (state)
{
case ParsingState.UnformattedText:
{
if (c != '<')
{
formattedStringBuffer.Append(c);
}
else
{
state = ParsingState.OpenTag;
}
break;
}
case ParsingState.OpenTag:
{
if (c != '>')
{
tagBuffer.Append(c);
}
else
{
state = ParsingState.FormattedText;
activeFormatTag = tagBuffer.ToString();
tagBuffer.Clear();
}
break;
}
case ParsingState.FormattedText:
{
if (c != '<')
{
formatBuffer.Append(c);
}
else
{
state = ParsingState.CloseTag;
}
break;
}
case ParsingState.CloseTag:
{
if (c!='>')
{
tagBuffer.Append(c);
}
else
{
var expectedTag = $"/{activeFormatTag}";
var tag = tagBuffer.ToString();
if (tag != expectedTag)
throw new FormatException($"Expected closing tag not found: <{expectedTag}>.");
if (formatter.ContainsKey(activeFormatTag))
{
var formatted = formatter[activeFormatTag](formatBuffer.ToString());
formattedStringBuffer.Append(formatted);
tagBuffer.Clear();
formatBuffer.Clear();
state = ParsingState.UnformattedText;
}
else
throw new FormatException($"Format tag <{activeFormatTag}> not recognized.");
}
break;
}
}
}
if (state != ParsingState.UnformattedText)
throw new FormatException($"Bad format in specified string '{str}'");
return formattedStringBuffer.ToString();
}
Is it the most elegant solution? No, Regex will do a much better job, but since you are a beginner I would not recommend that you start solving these kinds of problems that way; you'll learn a whole lot more solving them manually. You'll have plenty of time to learn Regex later on.

How can I convert PascalCase to split words?

I have variables containing text such as:
ShowSummary
ShowDetails
AccountDetails
Is there a simple function / method in C# that I can apply to these variables to yield:
"Show Summary"
"Show Details"
"Account Details"
I was wondering about an extension method but I've never coded one and I am not sure where to start.
See this post by Jon Galloway and one by Phil
In the application I am currently working on, we have a delegate based split extension method. It looks like so:
public static string Split(this string target, Func<char, char, bool> shouldSplit, string splitFiller = " ")
{
if (target == null)
throw new ArgumentNullException("target");
if (shouldSplit == null)
throw new ArgumentNullException("shouldSplit");
if (String.IsNullOrEmpty(splitFiller))
throw new ArgumentNullException("splitFiller");
int targetLength = target.Length;
// We know the resulting string is going to be at least the length of target
StringBuilder result = new StringBuilder(targetLength);
result.Append(target[0]);
// Loop from the second character to the last character.
for (int i = 1; i < targetLength; ++i)
{
char firstChar = target[i - 1];
char secondChar = target[i];
if (shouldSplit(firstChar, secondChar))
{
// If a split should be performed add in the filler
result.Append(splitFiller);
}
result.Append(secondChar);
}
return result.ToString();
}
Then it could be used as follows:
string showSummary = "ShowSummary";
string spacedString = showSummary.Split((c1, c2) => Char.IsLower(c1) && Char.IsUpper(c2));
This allows you to split on any conditions between two chars, and insert a filler of your choice (default of a space).
The best would be to iterate through each character within the string. Check if the character is upper case. If so, insert a space character before it. Otherwise, move onto the next character.
Also, ideally start from the second character so that a space would not be inserted before the first character.
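The approach described above can be sketched as follows. The class and method names are made up for illustration, and the sketch deliberately does nothing special for runs of capitals such as acronyms.

```csharp
using System.Text;

public static class PascalCaseSketch
{
    // Inserts a space before every upper-case letter except the first character.
    public static string SplitWords(string text)
    {
        if (string.IsNullOrEmpty(text))
            return text;

        var result = new StringBuilder(text.Length + 4);
        result.Append(text[0]); // never put a space before the first character

        for (int i = 1; i < text.Length; i++)
        {
            if (char.IsUpper(text[i]))
                result.Append(' ');
            result.Append(text[i]);
        }
        return result.ToString();
    }
}
```

For example, PascalCaseSketch.SplitWords("ShowSummary") returns "Show Summary".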
Try something like this:
var word = "AccountDetails";
word = string.Join(string.Empty, word
.Select(c => new string(c, 1)).Select(c => char.IsUpper(c[0]) ? " " + c : c)).Trim();
