C# - Replace chars with their Unicode escape sequences

I'm developing an Android application that reads books stored as JSON. To author those books I wanted a desktop application for convenience, and I chose C#.
My native language has many characters that cannot be represented in ASCII and need Unicode, for example:
[ə ç ş ğ ö ü and so on]
My problem is that JSON has trouble with some of these characters, so I want to replace each character with its Unicode escape sequence. For instance:
string text = "asdsdas";
text = ConvertToUnicode(text); // -> \u231\u213\u123...
I tried many ways to achieve this in JavaScript but couldn't. Could someone help me solve this in C#? Thanks in advance; any suggestion is welcome :).

You can define an extension method:
using System.Text;

public static class Extension {
    // Replaces every char (UTF-16 code unit) with its \uXXXX escape.
    public static string ToUnicodeString(this string str) {
        StringBuilder sb = new StringBuilder();
        foreach (var c in str) {
            sb.Append("\\u" + ((int) c).ToString("X4"));
        }
        return sb.ToString();
    }
}
which can be called like myString.ToUnicodeString().
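If only the non-ASCII characters need escaping, so that plain ASCII text stays readable, a variant of the method above could look like the following. This is a minimal sketch and not part of the original answer: the names JsonEscapeSketch and EscapeNonAscii are hypothetical, and it does not escape quotes or backslashes, so it is not a complete JSON string encoder.

```csharp
using System.Text;

public static class JsonEscapeSketch
{
    // Hypothetical helper: escapes control characters and everything
    // outside printable ASCII, and copies all other characters unchanged.
    public static string EscapeNonAscii(string input)
    {
        var sb = new StringBuilder(input.Length);
        foreach (char c in input)
        {
            if (c < 0x20 || c > 0x7E)
            {
                // Four lowercase hex digits of the UTF-16 code unit.
                sb.Append("\\u").Append(((int)c).ToString("x4"));
            }
            else
            {
                sb.Append(c);
            }
        }
        return sb.ToString();
    }
}
```

For example, JsonEscapeSketch.EscapeNonAscii("şəkil") yields \u015f\u0259kil, leaving the ASCII letters as they are.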

Related

Assess if a c# string is a single Emoji OR an Emoji ZWJ Sequence?

What would be a way to tell if a C# string is a single Emoji, or a valid Emoji ZWJ Sequence?
I would basically like to be able to find any Emoji from the official Unicode list, http://www.unicode.org/reports/tr51/tr51-15.html#emoji_data
I can't find a NuGet package for this, and most SO questions don't seem easily applicable to my case (e.g. Is there a way to check if a string in JS is one single emoji?)
I ended up using Unicode regular expressions, which are partially supported in .NET.
Using this question (C# - Regular expression to find a surrogate pair of a unicode codepoint from any string?), I came up with the following.
Regex
//Returns the Emoji
@"([\uD800-\uDBFF][\uDC00-\uDFFF]\p{M}*){1,5}|\p{So}"
//Returns true if the string is a single Emoji
@"^(?>(?>[\uD800-\uDBFF][\uDC00-\uDFFF]\p{M}*){1,5}|\p{So})$"
Tests
using System.Text.RegularExpressions;
using Xunit;

public class EmojiTests
{
    private static readonly Regex IsEmoji = new Regex(@"^(?>(?>[\uD800-\uDBFF][\uDC00-\uDFFF]\p{M}*){1,5}|\p{So})$", RegexOptions.Compiled);

    [Theory]
    [InlineData("⭐")]
    [InlineData("😁")]
    [InlineData("🃏")]
    [InlineData("🏴")]
    [InlineData("👪🏿")]
    [InlineData("🤌")]//pinched fingers, coming soon :p
    public void ValidEmojiCases(string input)
    {
        Assert.Matches(IsEmoji, input);
    }

    [Theory]
    [InlineData("")]
    [InlineData(":p")]
    [InlineData("a")]
    [InlineData("<")]
    [InlineData("⭐⭐")]
    [InlineData("🃏a")]
    [InlineData("‼️")]
    [InlineData("↔️")]
    public void InvalidEmojiCases(string input)
    {
        Assert.DoesNotMatch(IsEmoji, input);
    }
}
It is not perfect (i.e. returns true for "™️", false for "◻️"), but that will do.

How to convert HTML characters like &amp; to their proper form in C#

How do I convert these characters to plain text?
â„¢,  ®, â„¢, ® and —
This problem occurs when I scrape text from a website and store it in the database.
The stored text contains special characters and entities such as &amp;.
I want to remove all of these.
You can use this:
Encoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(myvalue));
Try this:
// Must be declared inside a static class to be usable as an extension method.
public static string RemoveUTFCharactes(this string input)
{
    string output = string.Empty;
    if (!string.IsNullOrEmpty(input))
    {
        // Re-read the bytes of the mis-decoded text as UTF-8.
        byte[] data = System.Text.Encoding.Default.GetBytes(input);
        output = System.Text.Encoding.UTF8.GetString(data);
    }
    return output;
}
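Both snippets above rely on Encoding.Default, which is the system ANSI code page on .NET Framework but always UTF-8 on .NET Core and .NET 5+, where the round trip would return the text unchanged. The sketch below names the code page explicitly instead. It assumes the garbling came from UTF-8 bytes being read as Windows-1252 (which matches sequences like â„¢) and that the runtime is .NET Core 3.0 or later, where CodePagesEncodingProvider is available; MojibakeSketch and Repair are hypothetical names.

```csharp
using System.Text;

public static class MojibakeSketch
{
    // Hypothetical helper: undoes "UTF-8 bytes decoded as Windows-1252".
    public static string Repair(string garbled)
    {
        // Code page 1252 is not built into .NET Core / .NET 5+,
        // so the code pages provider has to be registered first.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        Encoding windows1252 = Encoding.GetEncoding(1252);

        // Recover the original bytes, then decode them as UTF-8.
        byte[] originalBytes = windows1252.GetBytes(garbled);
        return Encoding.UTF8.GetString(originalBytes);
    }
}
```

With these assumptions, MojibakeSketch.Repair("â„¢") returns "™". Text that was never garbled in this way should not be passed through the helper, since any non-ASCII characters in it would be damaged.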
The short solution to your question is below.
If you only have a limited set of symbols, you can use the Replace method in C#, like this:
string symbol = "this is the book &amp; laptop";
string formattedterm = symbol.Replace("&amp;", "&");
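When the set of entities is not known in advance, a chain of Replace calls gets long. As a hedged alternative, the framework's System.Net.WebUtility.HtmlDecode decodes named entities in one call. The wrapper below is only a sketch; EntityDecodeSketch and DecodeEntities are hypothetical names.

```csharp
using System.Net;

public static class EntityDecodeSketch
{
    // Hypothetical wrapper: decodes HTML entities such as &amp; or &reg;
    // and returns an empty string for null or empty input.
    public static string DecodeEntities(string scraped)
    {
        if (string.IsNullOrEmpty(scraped))
        {
            return string.Empty;
        }
        return WebUtility.HtmlDecode(scraped);
    }
}
```

For example, EntityDecodeSketch.DecodeEntities("this is the book &amp; laptop &reg;") returns "this is the book & laptop ®".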

Converting Unicode char Ids string to unicode text .NET

I'm doing a web scraping project and I get a JSON file from the scraper. The problem is that for any language other than English, the numeric Unicode character IDs are written instead of the characters. For example, it will store
&#1508;&#1500;&#1505;&#1496;&#1497;&#1504;&#1497;&#1501;
instead of
פלסטינים
What I want to do is take a string that contains character IDs, English text, and HTML entities, and replace every Unicode ID/HTML entity with the Unicode character it represents. Does anyone know of a method that can help me with this task?
Using
.NET
ASP.NET
JSON.NET
IronWebScraper
-A Bit new to stackoverflow
Edit:
Here's a code sample:
using (StreamReader r = new StreamReader(AppDomain.CurrentDomain.BaseDirectory + @"DataBase\net\net.jsonl"))
{
    string json = r.ReadToEnd();
    List<string> items = JsonConvert.DeserializeObject<List<string>>(json);
    foreach (var str in items)
        Logger.Log(WebUtility.HtmlDecode(str));
}
It's fairly simple: just use the WebUtility.HtmlDecode method:
var plainText = WebUtility.HtmlDecode("&#1508;&#1500;&#1505;&#1496;&#1497;&#1504;&#1497;&#1501;");
If there are any regular characters in there, they will be left alone:
var plainText = WebUtility.HtmlDecode("This is a Hebrew character: &#1508;");
That will result in:
This is a Hebrew character: פ
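To illustrate the mixed input described in the question, the sketch below runs a list of scraped strings through WebUtility.HtmlDecode. It shows that decimal references (&#1508;), hexadecimal references (&#x5E4;) and plain text can appear side by side. The class and method names are hypothetical and the sample strings are made up for the example.

```csharp
using System.Collections.Generic;
using System.Net;

public static class ScrapedTextSketch
{
    // Hypothetical helper: decodes every string in a scraped batch.
    public static List<string> DecodeAll(IEnumerable<string> items)
    {
        var decoded = new List<string>();
        foreach (string item in items)
        {
            // Numeric (decimal or hex) and named entities are replaced;
            // ordinary characters pass through untouched.
            decoded.Add(WebUtility.HtmlDecode(item));
        }
        return decoded;
    }
}
```

Calling DecodeAll with "Hebrew: &#1508;" and "Hex form: &#x5E4; &amp; more" gives "Hebrew: פ" and "Hex form: פ & more".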

UTF-8 escape sequence as string: surely a better way

Reviewing some old code of mine, and wondered if there was a better way to create a literal string with unicode symbols...
I have a REST interface that requires certain escaped characters; for example, a property called username with value of john%foobar+Smith that must be requested like this:
{"username":"john\u0025foobar\u002bSmith"}
My C# method to replace certain characters like % and + is pretty basic:
public static string EncodeUTF8(string unescaped) {
    string utf8_ampersand = @"\u0026";
    string utf8_percent = @"\u0025";
    string utf8_plus = @"\u002b";
    return unescaped.Replace("&", utf8_ampersand).Replace("+", utf8_plus).Replace("%", utf8_percent);
}
This seems an antiquated way to do this; surely there is some single line method using Encoding that would output literal UTF code, but I can't find any examples that aren't essentially replace statements like mine... is there a better way?
You could do it with Regex:
static readonly Regex ReplacerRegex = new Regex("[&+%]");
public static string Replace(Match match)
{
    // 4-digit hex of the matched char
    return @"\u" + ((int)match.Value[0]).ToString("x4");
}
public static string EncodeUTF8(string unescaped)
{
    return ReplacerRegex.Replace(unescaped, Replace);
}
But I don't recommend it much (unless you have dozens of replacements). I think it would be slower and take more code to write.
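A third option, sketched here under the same assumption that only &, + and % need escaping, is a single pass with a StringBuilder. It avoids both the chained Replace calls and the regex. JsonCharEscapeSketch, CharsToEscape and EscapeSelected are hypothetical names, not part of either post.

```csharp
using System.Text;

public static class JsonCharEscapeSketch
{
    // Assumed set of characters the REST interface wants escaped.
    private const string CharsToEscape = "&+%";

    public static string EscapeSelected(string unescaped)
    {
        var sb = new StringBuilder(unescaped.Length);
        foreach (char c in unescaped)
        {
            if (CharsToEscape.IndexOf(c) >= 0)
            {
                // Emit a literal backslash-u followed by 4 hex digits.
                sb.Append("\\u").Append(((int)c).ToString("x4"));
            }
            else
            {
                sb.Append(c);
            }
        }
        return sb.ToString();
    }
}
```

JsonCharEscapeSketch.EscapeSelected("john%foobar+Smith") produces john\u0025foobar\u002bSmith, the form shown in the request above. Adding a character to CharsToEscape is the only change needed to escape it as well.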

Best way to decode hex sequence of unicode characters to string

I'm working with C# .NET.
I would like to know how to convert a Unicode escape string like "\u1D0EC"
(note that it's above "\uFFFF") to its symbol... "𝃬"
Thanks in advance!
That Unicode codepoint is written as a UTF-32 value. .NET and Windows encode Unicode as UTF-16, so you'll have to translate. UTF-16 uses "surrogate pairs" to handle codepoints above 0xffff, a similar kind of approach to UTF-8. The first code unit of the pair is in 0xd800..0xdbff, the second in 0xdc00..0xdfff. Try this sample code to see that at work:
using System;
using System.Text;
class Program {
    static void Main(string[] args) {
        uint utf32 = uint.Parse("1D0EC", System.Globalization.NumberStyles.HexNumber);
        // Decode the 4 UTF-32 bytes into a UTF-16 string (assumes a little-endian machine).
        string s = Encoding.UTF32.GetString(BitConverter.GetBytes(utf32));
        // Print each UTF-16 code unit of the surrogate pair.
        foreach (char c in s.ToCharArray()) {
            Console.WriteLine("{0:X}", (uint)c);
        }
        Console.ReadLine();
    }
}
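The arithmetic behind the surrogate pair can also be written out by hand, which may make the ranges above easier to follow. This is an illustrative sketch only (SurrogateSketch and ToSurrogates are hypothetical names); in real code char.ConvertFromUtf32 does the same job.

```csharp
using System;

public static class SurrogateSketch
{
    // Splits a code point above 0xFFFF into its UTF-16 surrogate pair.
    public static (char High, char Low) ToSurrogates(int codePoint)
    {
        if (codePoint < 0x10000 || codePoint > 0x10FFFF)
        {
            throw new ArgumentOutOfRangeException(nameof(codePoint));
        }

        // Subtract 0x10000 to get a 20-bit offset.
        int offset = codePoint - 0x10000;

        // Top 10 bits go into the high surrogate (0xD800..0xDBFF),
        // bottom 10 bits into the low surrogate (0xDC00..0xDFFF).
        char high = (char)(0xD800 + (offset >> 10));
        char low = (char)(0xDC00 + (offset & 0x3FF));
        return (high, low);
    }
}
```

For 0x1D0EC the offset is 0xD0EC, so the pair is 0xD834 followed by 0xDCEC.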
Convert each sequence with int.Parse(String, NumberStyles) and char.ConvertFromUtf32 (NumberStyles lives in System.Globalization):
string s = @"\U1D0EC";
string converted = char.ConvertFromUtf32(int.Parse(s.Substring(2), NumberStyles.HexNumber));
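To apply the same conversion to every escape inside a longer string, one possibility is a regex with a replacement callback. The sketch below assumes the C# escape conventions, that is \u followed by exactly 4 hex digits and \U followed by exactly 8; other digit counts (such as the 5-digit form in the question) are left untouched. UnicodeUnescapeSketch and Unescape are hypothetical names.

```csharp
using System.Globalization;
using System.Text.RegularExpressions;

public static class UnicodeUnescapeSketch
{
    // Group 1: 8 hex digits after \U. Group 2: 4 hex digits after \u.
    private static readonly Regex EscapePattern =
        new Regex(@"\\U([0-9A-Fa-f]{8})|\\u([0-9A-Fa-f]{4})");

    public static string Unescape(string text)
    {
        return EscapePattern.Replace(text, match =>
        {
            if (match.Groups[1].Success)
            {
                // Full code point; may expand to a surrogate pair.
                int codePoint = int.Parse(match.Groups[1].Value, NumberStyles.HexNumber);
                return char.ConvertFromUtf32(codePoint);
            }

            // A single UTF-16 code unit.
            int codeUnit = int.Parse(match.Groups[2].Value, NumberStyles.HexNumber);
            return ((char)codeUnit).ToString();
        });
    }
}
```

Note that char.ConvertFromUtf32 throws an ArgumentOutOfRangeException for values above 0x10FFFF or inside the surrogate range, so untrusted input may need a try/catch around the call.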
I recently published my FOSS Unicode Converter on CodePlex (http://unicode.codeplex.com).
It can convert any character to its hex code and a hex code back to the right character; it also includes a full character information database.
I use this code:
public static char ConvertHexToUnicode(string hexCode)
{
    // A single char only holds values up to 0xFFFF, so this covers the BMP only.
    if (hexCode != string.Empty)
        return ((char)int.Parse(hexCode, NumberStyles.AllowHexSpecifier));
    char empty = new char();
    return empty;
}//end
You can see the entire code at http://unicode.codeplex.com/
It appears you just want this in your code... you can type it as a string literal using the escape code \Uxxxxxxxx (note that this is a capital U, and there must be 8 digits). For this example, it would be: "\U0001D0EC".
