Unicode to ASCII conversion/mapping - c#

I need some sort of conversion/mapping like the one done by the CLCL clipboard manager.
For example:
I copy the following Unicode text: ūī
And CLCL converts it to: ui
Is there any technique to do such a conversion? Or maybe there are mapping tables that can be used to convert, let's say, symbol ū is mapped to u.
UPDATE
Thanks to all for the help. Here is what I came up with (a hybrid of two solutions), one posted by Erik Schierboom and one taken from http://blogs.infosupport.com/normalizing-unicode-strings-in-c/#comment-8984
using System.Globalization;
using System.Linq;
using System.Text;

public static string ConvertUnicodeToAscii(string unicodeStr, bool skipNonConvertibleChars = false)
{
    if (string.IsNullOrWhiteSpace(unicodeStr))
    {
        return unicodeStr;
    }

    // Decompose accented characters into a base character plus combining marks.
    var normalizedStr = unicodeStr.Normalize(NormalizationForm.FormD);

    if (skipNonConvertibleChars)
    {
        // Keep only characters in the ASCII range.
        return new string(normalizedStr.Where(c => c <= 127).ToArray());
    }

    // Drop only the combining marks; other non-ASCII characters are kept.
    return new string(
        normalizedStr
            .Where(c => CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
            .ToArray());
}

I have used the following code for some time:
private static string NormalizeDiacriticalCharacters(string value)
{
    if (value == null)
    {
        throw new ArgumentNullException("value");
    }

    // Decompose, then keep only the ASCII characters.
    var normalised = value.Normalize(NormalizationForm.FormD).ToCharArray();
    return new string(normalised.Where(c => (int)c <= 127).ToArray());
}
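The question also asks about mapping tables. Decomposition alone does not help with letters such as ø, ł, or ß, which have no canonical decomposition, so a small fallback table is one way to cover them. The following is only a sketch: the class name `AsciiFolding`, the method `Fold`, and the contents of the table are assumptions made for illustration, not part of either answer above.

```csharp
using System.Collections.Generic;
using System.Globalization;
using System.Text;

public static class AsciiFolding
{
    // Hypothetical fallback table for letters that FormD does not decompose.
    // Extend it with whatever mappings your data needs.
    private static readonly Dictionary<char, string> Fallback = new Dictionary<char, string>
    {
        { '\u00F8', "o" },  // ø
        { '\u00D8', "O" },  // Ø
        { '\u0142', "l" },  // ł
        { '\u0141', "L" },  // Ł
        { '\u00DF', "ss" }, // ß
        { '\u00E6', "ae" }, // æ
        { '\u00C6', "AE" }  // Æ
    };

    public static string Fold(string input)
    {
        if (string.IsNullOrEmpty(input))
        {
            return input;
        }

        var sb = new StringBuilder(input.Length);
        foreach (char c in input.Normalize(NormalizationForm.FormD))
        {
            if (CharUnicodeInfo.GetUnicodeCategory(c) == UnicodeCategory.NonSpacingMark)
            {
                continue; // drop the combining mark, keep the base letter
            }

            string mapped;
            if (c <= 127)
            {
                sb.Append(c);
            }
            else if (Fallback.TryGetValue(c, out mapped))
            {
                sb.Append(mapped);
            }
            // Anything else has no ASCII equivalent in this sketch and is dropped.
        }
        return sb.ToString();
    }
}
```

With this table, `AsciiFolding.Fold("ūī")` goes through the decomposition path, while a word containing ø goes through the fallback table.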

In general, it is not possible to convert arbitrary Unicode text to ASCII without loss, because ASCII is only a subset of Unicode.
That being said, it is possible to convert characters within the ASCII subset of Unicode to ASCII.
In C#, generally there's no need to do the conversion, since all strings are Unicode by default anyway, and all components are Unicode-aware, but if you must do the conversion, use the following:
string myString = "SomeString";
byte[] asciiString = System.Text.Encoding.ASCII.GetBytes(myString);
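Keep in mind that this encoder does not transliterate: with the default settings, any character outside the ASCII range is replaced by a question mark. A small sketch to see that for yourself (the class and method names are made up for illustration):

```csharp
using System.Text;

public static class AsciiEncodingDemo
{
    // Encodes to ASCII bytes and decodes again, so the loss becomes visible.
    public static string RoundTrip(string text)
    {
        byte[] asciiBytes = Encoding.ASCII.GetBytes(text);
        return Encoding.ASCII.GetString(asciiBytes);
    }
}
```

Plain ASCII input survives the round trip unchanged, while each of the two characters in "ūī" comes back as `?`.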

Related

Converting Arabic Words to Unicode format in C#

I am designing an API where the API user needs Arabic text to be returned in Unicode format, to do so I tried the following:
public static class StringExtensions
{
    public static string ToUnicodeString(this string str)
    {
        StringBuilder sb = new StringBuilder();
        foreach (var c in str)
        {
            sb.Append("\\u" + ((int)c).ToString("X4"));
        }
        return sb.ToString();
    }
}
The issue with the above code is that it returns the Unicode value of each letter regardless of its position in the word.
Example: let us assume we have the following word:
"سمير" which consists of:
'س' which is written like 'سـ' because it is the first letter in word.
'م' which is written like 'ـمـ' because it is in the middle of word.
'ي' which is written like 'ـيـ' because it is in the middle of word.
'ر' which is written like 'ـر' because it is last letter of word.
The above code returns unicode of { 'س', 'م' , 'ي' , 'ر'} which is:
\u0633\u0645\u064A\u0631
instead of { 'سـ' , 'ـمـ' , 'ـيـ' , 'ـر'} which is
\uFEB3\uFEE4\uFEF4\uFEAE
Any ideas on how to update code to get correct Unicode?
The string is just a sequence of Unicode code points; it does not know the rules of Arabic. You're getting out exactly the data you put in; if you want different data out, then put different data in!
Try this:
Console.WriteLine("\u0633\u0645\u064A\u0631");
Console.WriteLine("\u0633\u0645\u064A\u0631".ToUnicodeString());
Console.WriteLine("\uFEB3\uFEE4\uFEF4\uFEAE");
Console.WriteLine("\uFEB3\uFEE4\uFEF4\uFEAE".ToUnicodeString());
As expected the output is
سمير
\u0633\u0645\u064A\u0631
ﺳﻤﻴﺮ
\uFEB3\uFEE4\uFEF4\uFEAE
Those two sequences of Unicode code points render the same in the browser, but they're different sequences. If you want to write out the second sequence, then don't pass in the first sequence.
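One way to see how the two sequences relate: the presentation forms (the U+FExx code points) carry compatibility decompositions to the base letters, so compatibility normalization maps the second sequence back to the first. A small sketch, where the class and method names are made up for illustration:

```csharp
using System.Text;

public static class ArabicFormsDemo
{
    // Maps Arabic presentation forms back to their base letters via NFKC.
    public static string ToBaseLetters(string shaped)
    {
        return shaped.Normalize(NormalizationForm.FormKC);
    }
}
```

Normalization only goes in this direction. Going from base letters to positional forms needs knowledge of how each letter joins, which is not something the string itself provides.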
Based on Eric's answer I worked out how to solve my problem, and I have created a solution on GitHub.
You will find a simple tool to run on Windows, and if you want to use the code in your projects then just copy and paste UnicodesTable.cs and Unshaper.cs.
Basically you need a table of the Unicode code points for each Arabic letter; then you can use something like the following extension method.
public static string GetUnShapedUnicode(this string original)
{
    original = Regex.Unescape(original.Trim());
    var words = original.Split(' ');
    StringBuilder builder = new StringBuilder();
    var unicodesTable = UnicodesTable.GetArabicGliphes();
    foreach (var word in words)
    {
        string previous = null;
        for (int i = 0; i < word.Length; i++)
        {
            string shapedUnicode = @"\u" + ((int)word[i]).ToString("X4");
            if (!unicodesTable.ContainsKey(shapedUnicode))
            {
                builder.Append(shapedUnicode);
                previous = null;
                continue;
            }
            else
            {
                if (i == 0 || previous == null)
                {
                    builder.Append(unicodesTable[shapedUnicode][1]);
                }
                else
                {
                    if (i == word.Length - 1)
                    {
                        if (!string.IsNullOrEmpty(previous) && unicodesTable[previous][4] == "2")
                        {
                            builder.Append(unicodesTable[shapedUnicode][0]);
                        }
                        else
                            builder.Append(unicodesTable[shapedUnicode][3]);
                    }
                    else
                    {
                        bool previouChar = unicodesTable[previous][4] == "2";
                        if (previouChar)
                            builder.Append(unicodesTable[shapedUnicode][1]);
                        else
                            builder.Append(unicodesTable[shapedUnicode][2]);
                    }
                }
            }
            previous = shapedUnicode;
        }
        if (words.ToList().IndexOf(word) != words.Length - 1)
            builder.Append(@"\u" + ((int)' ').ToString("X4"));
    }
    return builder.ToString();
}

Format Exception when trying to convert

So I'm trying to read from a text file and store each field into an array. But when I tried to convert accountNumber to an Int, I get an error.
public bool matchCustomer(int accountID)
{
    string[] data = null;
    string line = Global.currentFile.reader.ReadLine();
    while (line != null)
    {
        data = line.Split('*');
        this.accountNumber = Convert.ToInt32(data[0]);
        line = Global.currentFile.reader.ReadLine();
        if (accountID == this.accountNumber)
        {
            return true;
        }
    }
    return false;
}
That's because data[0] isn't convertible into an int. What is data[0] at runtime?
You could use:
int value;
if (Int32.TryParse(data[0], out value))
{
    accountNumber = value;
}
else
{
    // Something when data[0] can't be turned into an int.
    // You'll have to decide this logic.
}
Likely, because you split by delimiter * in the string:
12345 * Shrek * 1209 * 100,000 * 50,000
You are left with a number that has a trailing space, "12345 ", instead of just the digits "12345". That makes it unconvertible. Try applying Trim:
this.accountNumber = Convert.ToInt32(data[0].Trim());
Also, beware of strings with thousands separator comma (50,000 and 100,000). You might need to replace it with empty string if it is unconvertible:
data[4].Replace(",","").Trim();
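As an alternative to calling Trim and Replace by hand, `int.TryParse` has an overload that takes `NumberStyles`, which can accept surrounding whitespace and thousands separators in one go. A minimal sketch, assuming the fields use the invariant culture's comma as the group separator; the class and method names here are hypothetical:

```csharp
using System.Globalization;

public static class AccountLineParser
{
    // Accepts surrounding spaces and thousands separators, e.g. " 100,000 ".
    public static bool TryParseField(string field, out int value)
    {
        const NumberStyles styles =
            NumberStyles.AllowLeadingWhite |
            NumberStyles.AllowTrailingWhite |
            NumberStyles.AllowThousands;
        return int.TryParse(field, styles, CultureInfo.InvariantCulture, out value);
    }

    // Reads the account number from the first '*'-separated field of a line.
    public static bool TryGetAccountNumber(string line, out int accountNumber)
    {
        accountNumber = 0;
        if (line == null)
        {
            return false;
        }
        string[] data = line.Split('*');
        return TryParseField(data[0], out accountNumber);
    }
}
```

Because both methods return false instead of throwing, the caller can decide what to do with a malformed line.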
The other two answers address the issue and the fix, so here is an alternative that uses LINQ.
You can replace the complete while block content with this.
return line.Split('*').Select(s => s.Trim().Replace(",", ""))
    .Where(c => Regex.IsMatch(c, @"^\d+$"))
    .Select(s => int.Parse(s))
    .Any(e => e == accountID);

Convert Unicode string made up of culture-specific digits to integer value

I am developing a program in the Marathi language. In it, I want to add/validate numbers entered in Marathi Unicode by getting their actual integer value.
For example, in Marathi:
४५ = 45
९९ = 99
How do I convert this Marathi string "४५" to its actual integer value i.e. 45?
I googled a lot, but found nothing useful. I tried using System.Text.Encoding.Unicode.GetString() to get string and then tried to parse, but failed here also.
The correct way would be to use Char.GetNumericValue, which lets you convert individual characters to their corresponding numeric values and then construct the complete value. For example, Char.GetNumericValue('९') gives you 9.
Depending on your goal, it may be easier to replace each national digit character with the corresponding invariant digit and use the regular parsing functions.
Int32.Parse("९९".Replace("९", "9"))
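The first approach, building the value digit by digit, could look like the sketch below. It uses `CharUnicodeInfo.GetDecimalDigitValue`, which returns -1 for anything that is not a decimal digit, rather than `Char.GetNumericValue`; the class and method names are assumptions made for the example.

```csharp
using System;
using System.Globalization;

public static class NativeDigits
{
    // Builds an int from decimal digits of any script, e.g. Devanagari or Latin.
    public static int ParseNativeDigits(string text)
    {
        if (string.IsNullOrEmpty(text))
        {
            throw new FormatException("No digits to parse.");
        }

        int result = 0;
        foreach (char c in text)
        {
            int digit = CharUnicodeInfo.GetDecimalDigitValue(c);
            if (digit < 0)
            {
                throw new FormatException("Not a decimal digit: " + c);
            }
            // checked: overflow raises an exception instead of wrapping around.
            result = checked(result * 10 + digit);
        }
        return result;
    }
}
```

This version handles unsigned whole numbers only; signs and decimal points would need extra handling.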
Quick hack of @Alexei's response.
public static double ParseValue(string value)
{
    return double.Parse(string.Join("",
        value.Select(c => "+-.".Contains(c)
            ? "" + c : "" + char.GetNumericValue(c)).ToArray()),
        NumberFormatInfo.InvariantInfo);
}
Calling ParseValue("१२३.३२१") yields 123.321 as the result.
I found my solution.
The following code converts a given Marathi number to its equivalent Latin number.
Thanks to @Alexei, I just changed some of your code and it's working fine.
string ToLatinDigits(string nativeDigits)
{
    int n = nativeDigits.Length;
    StringBuilder latinDigits = new StringBuilder(capacity: n);
    for (int i = 0; i < n; ++i)
    {
        if (char.IsDigit(nativeDigits, i))
        {
            latinDigits.Append(char.GetNumericValue(nativeDigits, i));
        }
        else if (nativeDigits[i].Equals('.') || nativeDigits[i].Equals('+') || nativeDigits[i].Equals('-'))
        {
            latinDigits.Append(nativeDigits[i]);
        }
        else
        {
            throw new Exception("Invalid Argument");
        }
    }
    return latinDigits.ToString();
}
This method is working for both + and - numbers.
Regards Guruprasad
Windows.Globalization.DecimalFormatter will parse different numeral systems in addition to Latin, including Devanagari (which is what is used by Marathi).

Ignore Zero in Calculate Hash by HMACSHA256

I use Crypto-JS v2.5.3 (hmac.min.js) http://code.google.com/p/crypto-js/ library to calculate client side hash and the script is:
$("#PasswordHash").val(Crypto.HMAC(Crypto.SHA256, $("#pwd").val(), $("#PasswordSalt").val(), { asByte: true }));
This returns something like:
b3626b28c57ea7097b6107933c6e1f24f586cca63c00d9252d231c715d42e272
Then on the server side I use the following code to calculate the hash:
private string CalcHash(string PlainText, string Salt)
{
    string result = "";
    ASCIIEncoding enc = new ASCIIEncoding();
    byte[]
        baText2BeHashed = enc.GetBytes(PlainText),
        baSalt = enc.GetBytes(Salt);
    System.Security.Cryptography.HMACSHA256 hasher = new HMACSHA256(baSalt);
    byte[] baHashedText = hasher.ComputeHash(baText2BeHashed);
    result = string.Join("", baHashedText.ToList().Select(b => b.ToString("x")).ToArray());
    return result;
}
This method returned:
b3626b28c57ea797b617933c6e1f24f586cca63c0d9252d231c715d42e272
As you can see, the server-side method drops some of the zero characters. Where is the problem? Is there a fault in my server-side method? I just need the two values to be the same for the same string and salt.
As you see there is just some zero characters that the server side method ignore that. where is the problem?
Here - your conversion to hex in C#:
b => b.ToString("x")
If b is 10, that will just give "a" rather than "0a".
Personally I'd suggest a simpler hex conversion:
return BitConverter.ToString(baHashedText).Replace("-", "").ToLowerInvariant();
(You could just change "x" to "x2" instead, to specify a length of 2 characters, but it's still a somewhat roundabout way of performing a bytes-to-hex conversion.)
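To make the difference concrete, here is a small sketch comparing the three conversions on the same bytes. The class and method names are made up for the example.

```csharp
using System;
using System.Linq;

public static class HexDemo
{
    // "x" writes each byte with as few digits as possible, so 0x0a becomes "a".
    public static string ToHexUnpadded(byte[] bytes)
    {
        return string.Concat(bytes.Select(b => b.ToString("x")));
    }

    // "x2" always writes two digits, so 0x0a becomes "0a".
    public static string ToHexPadded(byte[] bytes)
    {
        return string.Concat(bytes.Select(b => b.ToString("x2")));
    }

    // BitConverter yields "0A-FF-00"; strip the dashes and lowercase it.
    public static string ToHexBitConverter(byte[] bytes)
    {
        return BitConverter.ToString(bytes).Replace("-", "").ToLowerInvariant();
    }
}
```

For the bytes 0x0a, 0xff, 0x00, the unpadded version produces a shorter string because the leading zero of 0x0a and one zero of 0x00 are lost, which is exactly the symptom in the question. The other two methods agree with each other.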
Everyone else keeps recommending things like using BitConverter and trimming "-", or using ToString("x2"). There is a better solution: SoapHexBinary, a class that has been in .NET since 1.1.
using System.Runtime.Remoting.Metadata.W3cXsd2001;
public byte[] StringToBytes(string value)
{
    SoapHexBinary soapHexBinary = SoapHexBinary.Parse(value);
    return soapHexBinary.Value;
}
public string BytesToString(byte[] value)
{
    SoapHexBinary soapHexBinary = new SoapHexBinary(value);
    return soapHexBinary.ToString();
}
This will produce the exact format you want.
I believe the problem is here:
result = string.Join("", baHashedText.ToList().Select(b => b.ToString("x")).ToArray());
change it to:
result = string.Join("", baHashedText.ToList().Select(b => b.ToString("x2")).ToArray());

How to convert a string containing escape characters to a string

I have a string that is returned to me which contains escape characters.
Here is a sample string
"test\40gmail.com"
As you can see it contains escape characters. I need it to be converted to its real value which is
"test@gmail.com"
How can I do this?
If you are looking to replace all escaped character codes, not only the code for @, you can use this snippet of code to do the conversion:
public static string UnescapeCodes(string src)
{
    // Match a backslash followed by exactly two hex digits, e.g. \40.
    var rx = new Regex("\\\\([0-9A-Fa-f]{2})");
    var res = new StringBuilder();
    var pos = 0;
    foreach (Match m in rx.Matches(src))
    {
        res.Append(src.Substring(pos, m.Index - pos));
        pos = m.Index + m.Length;
        res.Append((char)Convert.ToInt32(m.Groups[1].ToString(), 16));
    }
    res.Append(src.Substring(pos));
    return res.ToString();
}
The code relies on a regular expression to find each backslash followed by two hex digits, converts the digits to an int, and casts the resulting value to a char. Limiting the match to two digits keeps it from swallowing hex-looking letters that follow the code, as in "\40abc".
string test = @"test\40gmail.com";
test = test.Replace(@"\40", "@");
If you want a more general approach ...
HTML Decode
The sample string provided ("test\40gmail.com") is JID escaped. It is not malformed, and HttpUtility/WebUtility will not correctly handle this escaping scheme.
You can certainly do it with string or regex functions, as suggested in the answers from dasblinkenlight and C.Barlow. This is probably the cleanest way to achieve the desired result. I'm not aware of any .NET libraries for decoding JID escaping, and a brief search hasn't turned up much. Here is a link to some source which may be useful, though.
I just wrote this piece of code and it seems to work beautifully. It requires that the escape sequence is in hex, and it is valid for values 0x00 to 0xFF.
// Example
str = remEscChars(@"Test\x0D"); // str = "Test\r"
Here is the code.
private string remEscChars(string str)
{
    int pos;
    // Replace every \xNN sequence with the character it encodes.
    while ((pos = str.IndexOf(@"\x")) >= 0)
    {
        string subStr = str.Substring(pos + 2, 2);
        string escStr = Convert.ToString(Convert.ToChar(Convert.ToInt32(subStr, 16)));
        str = str.Replace(@"\x" + subStr, escStr);
    }
    return str;
}
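.NET provides the static methods Regex.Unescape and Regex.Escape to convert regex-style escape sequences and back again. Regex.Unescape handles sequences such as \x40 or \u0040, but it reads \40 as an octal escape (a space), so the two-digit hex form in the sample string would first have to be rewritten as \x40.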
https://learn.microsoft.com/en-us/dotnet/api/system.text.regularexpressions.regex.unescape
