Converting Unicode char Ids string to unicode text .NET - c#

I'm doing a web scraping project and I get a JSON file from the scraper. The problem is that for any language other than English, the Unicode character IDs are written out instead of the characters themselves. For example,
it will store
&#1508;&#1500;&#1505;&#1496;&#1497;&#1504;&#1497;&#1501;
instead of
םויסלפנ
What I want to do is take a string that contains char IDs, English text, and HTML entities, and replace every Unicode ID or HTML entity with the Unicode character it stands for. Does anyone know of a method that can help with this task?
Using
.NET
ASP.NET
JSON.NET
IronWebScraper
-A Bit new to stackoverflow
Edit:
Here's a code sample:
using (StreamReader r = new StreamReader(AppDomain.CurrentDomain.BaseDirectory + @"DataBase\net\net.jsonl"))
{
    string json = r.ReadToEnd();
    List<string> items = JsonConvert.DeserializeObject<List<string>>(json);
    foreach (var str in items)
        Logger.Log(WebUtility.HtmlDecode(str));
}

It's fairly simple: just use the WebUtility.HtmlDecode method:
var plainText = WebUtility.HtmlDecode("&#1508;&#1500;&#1505;&#1496;&#1497;&#1504;&#1497;&#1501;");
If there are any regular characters in there, they will be left alone:
var plainText = WebUtility.HtmlDecode("This is a Hebrew character: &#1508;");
That will result in:
This is a Hebrew character: פ
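As a minimal sketch of the same idea applied to a mixed string (English text, numeric entities, and a named entity), the helper below simply wraps WebUtility.HtmlDecode. The class and method names (EntityDecoder, Decode) are made up for illustration and are not part of any library.

```csharp
using System;
using System.Net;

// Hypothetical helper; only WebUtility.HtmlDecode is a real framework API here.
public static class EntityDecoder
{
    // Decodes numeric entities (&#1508;) and named entities (&amp;).
    // Plain text without entities is returned unchanged.
    public static string Decode(string scraped)
    {
        return string.IsNullOrEmpty(scraped) ? scraped : WebUtility.HtmlDecode(scraped);
    }
}
```

For example, EntityDecoder.Decode("Hebrew: &#1508; &amp; more") returns "Hebrew: פ & more".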

Related

How to get the char from a string that contains unicode escape sequence which starts with "\u"

I did quite a lot of research but still could not figure it out. For example, I have a string that contains "\uf022" (a character from another language). How can I change the whole string into the char '\uf022'?
Update:
The string "\uf022" is retrieved at runtime (read from other sources) rather than written as a static character in a string literal.
For example:
string url = @"https://somesite/files/abc\uf022def.pdf"; // verbatim literal: a real backslash followed by "uf022"
int i = url.IndexOf("\\");
string specialChar = url.Substring(i, 6);
How do I get the char saved in the string specialChar?
I would like to use this char to do UTF-8 encoding and generate the accessible URL "https://somesite/files/abc%EF%80%A2def.pdf".
Thank you!
how can I change the whole string into the char '\uf022'?
Strictly speaking, you can't change the characters of the string you have (because strings are immutable), but you can make a new one that meets your needs:
var s = new string('\uf022', oldstring.Length);
The title of your question reads slightly differently. It sounds like you want a string that contains only the F022 chars, i.e. if your string has 10 chars and only 3 of them are F022, you want just those 3. That can be done by changing oldstring.Length above to oldstring.Count(c => c == '\uf022').
If instead your string is like "hello\uf022world" and you want it to be like "hello🍄world", then do:
var s = oldstring.Replace("\\uf022", "\uf022");
If you have \uf022 as text in a string (6 chars) and you want to replace it with the actual character, you can parse the hex digits to an int and convert that to a char when you replace:
var oldstring = @"hello\uf022world"; // verbatim literals: the backslash is a real character
var given = @"\uf022";
var givenParsed = ((char)Convert.ToInt32(given.Substring(2), 16)).ToString();
var s = oldstring.Replace(given, givenParsed);
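When the escape value is not known in advance, the same parse-and-replace step can be applied to every \uXXXX sequence with a regular expression. The sketch below is one possible way to do that, and it also covers the URL-encoding step the question asks about by using Uri.EscapeDataString, which percent-encodes the UTF-8 bytes. The class and method names (UnicodeEscapes, Unescape, ToUrlSegment) are invented for this example.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

// Hypothetical helper names; Regex, int.Parse and Uri.EscapeDataString are framework APIs.
public static class UnicodeEscapes
{
    // Matches a literal backslash, 'u', and exactly four hex digits.
    private static readonly Regex EscapePattern = new Regex(@"\\u([0-9A-Fa-f]{4})");

    // Replaces each textual \uXXXX sequence with the UTF-16 code unit it names.
    public static string Unescape(string input)
    {
        return EscapePattern.Replace(input, m =>
            ((char)int.Parse(m.Groups[1].Value, NumberStyles.HexNumber)).ToString());
    }

    // Unescapes a single path segment, then percent-encodes it as UTF-8.
    public static string ToUrlSegment(string segment)
    {
        return Uri.EscapeDataString(Unescape(segment));
    }
}
```

With the file name from the question, UnicodeEscapes.ToUrlSegment(@"abc\uf022def.pdf") gives "abc%EF%80%A2def.pdf". Only the file name is passed in, because EscapeDataString would also encode the ':' and '/' of a full URL.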

How to convert HTML characters like &amp; to their proper form in C#

How to convert these characters to plain text?
â„¢,  ®, â„¢, ® and —
This problem occurs when I scrape text from a website and store it in the database.
The stored text ends up with special characters and entities such as &amp;.
I want to remove all of these.
You can use this:
Encoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(myvalue));
Try this:
public static string RemoveUTFCharactes(this string input)
{
    string output = string.Empty;
    if (!string.IsNullOrEmpty(input))
    {
        byte[] data = System.Text.Encoding.Default.GetBytes(input);
        output = System.Text.Encoding.UTF8.GetString(data);
    }
    return output;
}
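Sequences such as "â„¢" typically appear when UTF-8 bytes were decoded as Windows-1252. The sketch below reverses that specific mistake. It swaps in an explicit Windows-1252 encoding instead of Encoding.Default, on the assumption that Windows-1252 was the wrong decoder used; if the text was garbled some other way, this will not help. MojibakeRepair and Repair are made-up names, and the RegisterProvider call is only needed on runtimes where code page 1252 is not available by default.

```csharp
using System.Text;

// Hypothetical helper; assumes the input is UTF-8 text that was mis-decoded as Windows-1252.
public static class MojibakeRepair
{
    public static string Repair(string garbled)
    {
        if (string.IsNullOrEmpty(garbled))
            return garbled;

        // Makes code page encodings such as 1252 available on .NET Core / .NET 5+.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        Encoding windows1252 = Encoding.GetEncoding(1252);

        // Recover the original bytes, then decode them as UTF-8.
        byte[] originalBytes = windows1252.GetBytes(garbled);
        return Encoding.UTF8.GetString(originalBytes);
    }
}
```

Under that assumption, MojibakeRepair.Repair("â„¢") returns "™". Text that was never garbled should not be passed through this method, since any non-ASCII characters in it would be damaged.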
A short solution to your question is below.
If you only have a limited set of symbols, you can use the Replace method in C#, like this:
string symbol = "this is the book &amp; laptop";
string formattedterm = symbol.Replace("&amp;", "&");

Base64 string encoding contains +, / and = instead of A, B, C

I need to apply the following transformation to a string:
convert the string to a byte[]
apply the SHA-256 function
encode the result in Base64
I wrote the following code:
string codeRaw = "C0643778W.EUC06AG978W.EUFWELP2014-11-2153.50000GBP24.00000MWh/h10YCB-EUROPEU--12015-01-012015-01-31";
byte[] utiCodeByteArr = Encoding.UTF8.GetBytes(codeRaw);
byte[] hashByteArr = new SHA256Managed().ComputeHash(utiCodeByteArr);
string hash = Convert.ToBase64String(hashByteArr);
It works, but the result is slightly different from what I should get: the string contains the chars '+', '/' and '=' instead of 'A', 'B' and 'C'.
"qWAIh1CgYAuvoRTGcvXKLBHC9UxRunSBRjRXlqhYh6gC" //expected result
"qW+Ih1CgYAuvoRTGcvXKLBHC9UxRunS/RjRXlqhYh6g=" //got result
I've solved it with a replace:
string hash = Convert.ToBase64String(hashByteArr)?.Replace("+", "A")?.Replace("/", "B")?.Replace("=", "C");
Is there a better way to get the right string without using the replaces?
I don't like them.
The requirements manual says: "The APIs used are the ones provided by .NET framework", but it doesn't contain the source code. Maybe there is a way to get the A, B and C chars directly, but I'm missing it.
Thanks.
The provider sent me the source code: there was a replace like the one I did.
They forgot to write that information in the manual.
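Since the provider's own code turned out to use the same substitution, the three steps plus the replace can be kept together in one small method. This is only a sketch of that arrangement: CodeHasher and Compute are invented names, and SHA256.Create() is used here in place of SHA256Managed, which produces the same digest.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

// Hypothetical helper that bundles UTF-8 bytes -> SHA-256 -> Base64 -> provider substitution.
public static class CodeHasher
{
    public static string Compute(string codeRaw)
    {
        byte[] input = Encoding.UTF8.GetBytes(codeRaw);
        using (SHA256 sha256 = SHA256.Create())
        {
            string base64 = Convert.ToBase64String(sha256.ComputeHash(input));
            // Provider-specific mapping described in the question: + -> A, / -> B, = -> C.
            return base64.Replace('+', 'A').Replace('/', 'B').Replace('=', 'C');
        }
    }
}
```

Note that this mapping is not reversible, because 'A', 'B' and 'C' are also ordinary Base64 characters. That is fine for comparing hashes but rules out decoding the result back to bytes.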

C# - Replace Chars with its Unicode instance

I'm developing an Android application that reads books stored in JSON format. To create such books I needed a desktop application for convenience, and I chose C#.
First of all, my native language has lots of characters that cannot be encoded in ASCII and need Unicode, for example:
[ə ç ş ğ ö ü and so on]
My problem is that JSON has trouble with some of these characters, so I need to use their Unicode escape form instead. For instance:
string text = "asdsdas";
text = ConvertToUnicode(text); // -> /u231/u213/u123...
I tried many ways to achieve this in JavaScript but couldn't. Please help me solve this in C#. Thanks in advance; any suggestion is welcome :).
You can define an extension method:
public static class Extension {
    public static string ToUnicodeString(this string str) {
        StringBuilder sb = new StringBuilder();
        foreach (var c in str) {
            sb.Append("\\u" + ((int) c).ToString("X4"));
        }
        return sb.ToString();
    }
}
It can be called like myString.ToUnicodeString().
Check it in this demo.
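The extension method above escapes every character, including plain ASCII letters. If the goal is only to protect the non-ASCII characters, a variant like the following sketch leaves ASCII untouched and escapes everything above code 127. JsonEscaper and EscapeNonAscii are hypothetical names, and the method does not handle the other JSON escaping rules (quotes, backslashes, control characters).

```csharp
using System.Text;

// Hypothetical variant: escape only characters outside the ASCII range.
public static class JsonEscaper
{
    public static string EscapeNonAscii(string text)
    {
        var sb = new StringBuilder(text.Length);
        foreach (char c in text)
        {
            if (c > 127)
                sb.Append("\\u").Append(((int)c).ToString("X4")); // e.g. ç -> \u00E7
            else
                sb.Append(c); // ASCII is copied as-is
        }
        return sb.ToString();
    }
}
```

For example, JsonEscaper.EscapeNonAscii("ə ç") produces the text \u0259 \u00E7, with the space kept as it is.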

Encoding and decoding a string that may have slashes in it

I have strings like this:
RowKey = "Local (Automatic/Manual) Tests",
When I try to store this in Windows Azure it fails, I assume because "/" is not accepted as part of the row key.
Is there a simple way that I can encode the value before putting into RowKey?
Also once the data is in the table I get it out with the following:
var Stores = storeTable.GetAll(u => u.PartitionKey == "ABC");
Is there a simple way that I can get out the value of RowKey and decode it?
One possible way of handling this is to convert the PartitionKey and RowKey values to Base64-encoded strings before saving them. Later, when you retrieve the values, you just decode them. I ran into this issue some days ago in our tool, and Base64 encoding was suggested to me on the MSDN forums: http://social.msdn.microsoft.com/Forums/en-US/windowsazuredata/thread/a20cd3ce-20cb-4273-a1f2-b92a354bd868. But again, it is not foolproof.
I'm not familiar with Azure, so I don't know if there is an existing API for that. But it's not hard to code:
encode:
const char escapeChar = '|';
// Double any existing escape char first, then replace "/" with "|S".
string encoded = RowKey
    .Replace(escapeChar.ToString(), new string(escapeChar, 2))
    .Replace("/", escapeChar + "S");
decode:
StringBuilder sb = new StringBuilder(s.Length);
bool escape = false;
foreach (char c in s)
{
    if (escape)
    {
        if (c == 'S')
            sb.Append('/');
        else if (c == escapeChar)
            sb.Append(escapeChar);
        else
            throw new ArgumentException("Invalid escape sequence " + escapeChar + c);
        escape = false; // the escape sequence has been consumed
    }
    else if (c != escapeChar)
    {
        sb.Append(c);
    }
    else
    {
        escape = true;
    }
}
return sb.ToString();
When a string is Base64 encoded, the only character that is invalid in an Azure Table Storage key column is the forward slash ('/'). To address this, simply replace the forward slash character with another character that is both (1) valid in an Azure Table Storage key column and (2) not a Base64 character. The most common example I have found (which is cited in other answers) is to replace the forward slash ('/') with the underscore ('_').
private static String EncodeToKey(String originalKey)
{
    var keyBytes = System.Text.Encoding.UTF8.GetBytes(originalKey);
    var base64 = System.Convert.ToBase64String(keyBytes);
    return base64.Replace('/','_');
}
When decoding, simply undo the replaced character (first!) and then Base64 decode the resulting string. That's all there is to it.
private static String DecodeFromKey(String encodedKey)
{
    var base64 = encodedKey.Replace('_', '/');
    byte[] bytes = System.Convert.FromBase64String(base64);
    return System.Text.Encoding.UTF8.GetString(bytes);
}
Some people have suggested that other Base64 characters also need encoding. According to the Azure Table Storage docs this is not the case.
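To check that the two steps really are inverses, the Base64 approach can be put in one class and exercised with a round trip. The sketch below repeats the same logic as the answer; only the wrapper name TableKeyCodec and the method names Encode and Decode are made up.

```csharp
using System;
using System.Text;

// Hypothetical wrapper around the Base64-with-underscore scheme described in the answer.
public static class TableKeyCodec
{
    // UTF-8 bytes -> Base64, then swap '/' for '_' so the key contains no forward slash.
    public static string Encode(string originalKey)
    {
        string base64 = Convert.ToBase64String(Encoding.UTF8.GetBytes(originalKey));
        return base64.Replace('/', '_');
    }

    // Undo the swap first, then Base64-decode back to the original text.
    public static string Decode(string encodedKey)
    {
        byte[] bytes = Convert.FromBase64String(encodedKey.Replace('_', '/'));
        return Encoding.UTF8.GetString(bytes);
    }
}
```

As a small worked case, the three bytes of "???" encode to the Base64 text "Pz8/", so TableKeyCodec.Encode("???") returns "Pz8_" and TableKeyCodec.Decode("Pz8_") returns "???" again. The row key from the question, "Local (Automatic/Manual) Tests", survives the same round trip.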
