Find and replace ASCII character with a new line - c#

I am trying to find every occurrence of an ASCII character in a string and replace it with a new line. Here is what I have so far:
public string parseText(string inTxt)
{
    //String builder based on the string passed into the method
    StringBuilder n = new StringBuilder(inTxt);
    //Convert the ASCII character we're looking for to a string
    string replaceMe = char.ConvertFromUtf32(187);
    //Replace all occurrences of string with a new line
    n.Replace(replaceMe, Environment.NewLine);
    //Convert our StringBuilder to a string and output it
    return n.ToString();
}
This does not add in a new line and the string all remains on one line. I’m not sure what the problem is here. I have tried this as well, but same result:
n.Replace(replaceMe, "\n");
Any suggestions?

char.ConvertFromUtf32, whilst correct, is not the simplest way to read a character based on its ASCII numeric value. (ConvertFromUtf32 is mainly intended for Unicode code points that lie outside the BMP, which result in surrogate pairs. This is not something you'd encounter in English or most modern languages.) Rather, you should just cast it using (char).
char c = (char)187;
string replaceMe = c.ToString();
You may, of course, define a string with the required character as a literal in your code: "»".
Your Replace would then be simplified to:
n.Replace("»", "\n");
Finally, on a technical level, ASCII only covers characters whose value lies in the 0–127 range. Character 187 is not ASCII; however, it corresponds to » in ISO 8859-1, Windows-1252, and Unicode, which collectively are by far the most popular encodings in use today.
Edit: I just tested your original code, and found that it actually worked. Are you sure the result remains on one line? It might be an issue with the way the debugger renders strings in single-line view, where newlines are shown as literal \r\n sequences.
Note that the \r\n sequences actually do represent newlines, despite being displayed as literals. You can check this from the multi-line display (by clicking on the magnifying glass).
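One way to confirm this without relying on the debugger view is to count the lines in the result. The following is a minimal sketch; the class name NewlineReplaceDemo and the sample input are assumptions made for illustration, and the method mirrors the question's parseText with the (char) cast suggested above.

```csharp
using System;
using System.Text;

public static class NewlineReplaceDemo
{
    // Same logic as the question's parseText, using a cast instead of ConvertFromUtf32.
    public static string ParseText(string inTxt)
    {
        var builder = new StringBuilder(inTxt);
        // (char)187 is U+00BB, the » character.
        builder.Replace(((char)187).ToString(), Environment.NewLine);
        return builder.ToString();
    }

    public static void Main()
    {
        // Hypothetical input; \u00BB is written as an escape to avoid source-encoding issues.
        string result = ParseText("first\u00BBsecond\u00BBthird");

        // Splitting on the platform newline shows that real line breaks were inserted.
        string[] lines = result.Split(new[] { Environment.NewLine }, StringSplitOptions.None);
        Console.WriteLine(lines.Length); // 3
    }
}
```

If the count is greater than one, the replacement worked and the single-line appearance comes from how the string is being displayed.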

StringBuilder.Replace modifies the builder in place and returns a reference to that same StringBuilder, so the original code should already work. Capturing the return value is equivalent:
StringBuilder replaced = n.Replace(replaceMe, Environment.NewLine);
return replaced.ToString();
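What Replace hands back can be checked directly. The .NET documentation describes the return value as a reference to the same instance, and this small sketch tests that claim; the class name ReplaceReturnDemo and the sample text are assumptions for illustration.

```csharp
using System;
using System.Text;

public static class ReplaceReturnDemo
{
    // Returns true when Replace gives back the very builder it was called on.
    public static bool ReturnsSameInstance()
    {
        var n = new StringBuilder("a\u00BBb");
        StringBuilder returned = n.Replace("\u00BB", Environment.NewLine);
        return ReferenceEquals(n, returned);
    }

    public static void Main()
    {
        Console.WriteLine(ReturnsSameInstance()); // True

        // The original builder already holds the change, with or without the assignment.
        var n = new StringBuilder("a\u00BBb");
        n.Replace("\u00BB", Environment.NewLine);
        Console.WriteLine(n.ToString() == "a" + Environment.NewLine + "b"); // True
    }
}
```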

Related

How to unescape multibyte unicode in c# [duplicate]

This question already has answers here:
How to unescape unicode string in C#
(2 answers)
Closed 2 years ago.
The following unicode string from a text file encodes a single apostrophe using 3 bytes:
It\u00e2\u0080\u0099s working
This should decode to:
It’s working
How can I decode this string in C#?
For example, when I try the following code:
string test = @"It\u00e2\u0080\u0099s working";
string test2 = System.Text.RegularExpressions.Regex.Unescape(test);
it incorrectly decodes the first byte only:
Itâ\u0080\u0099s working
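This is UTF-8. Try decoding it with the UTF-8 encoding:
using System.Text;
// Non-verbatim literal: the compiler turns each \u00XX escape into one char.
string test = "It\u00e2\u0080\u0099s working";
// ISO 8859-1 (code page 28591) maps the chars U+0000 to U+00FF back to single bytes.
byte[] bytes = Encoding.GetEncoding(28591).GetBytes(test);
// Reinterpreting those bytes as UTF-8 gives: It’s working
var converted = Encoding.UTF8.GetString(bytes);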
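Try this to parse the file:
using System.Globalization;
using System.Text.RegularExpressions;
// Matches a literal backslash-u followed by exactly four hexadecimal digits.
private static Regex _regex = new Regex(@"\\u(?<Value>[a-fA-F0-9]{4})", RegexOptions.Compiled);
public string decodeString(string value)
{
    // Replace each escape with the char whose value the four hex digits encode.
    return _regex.Replace(
        value,
        m => ((char)int.Parse(m.Groups["Value"].Value, NumberStyles.HexNumber)).ToString()
    );
}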
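For text read from a file, where the \uXXXX sequences are literal characters rather than compiler escapes, the two answers above can be combined: first unescape, then reinterpret the resulting chars as UTF-8 bytes. This is only a sketch under the assumption that every escaped value is a single byte (\u00XX); the class and method names are made up for illustration.

```csharp
using System;
using System.Globalization;
using System.Text;
using System.Text.RegularExpressions;

public static class EscapedUtf8Demo
{
    private static readonly Regex EscapePattern = new Regex(@"\\u([0-9a-fA-F]{4})");

    // Step 1: turn each literal \uXXXX sequence into the char it names.
    public static string Unescape(string value)
    {
        return EscapePattern.Replace(
            value,
            m => ((char)int.Parse(m.Groups[1].Value, NumberStyles.HexNumber)).ToString());
    }

    // Step 2: map chars U+0000..U+00FF back to bytes (ISO 8859-1), then decode as UTF-8.
    // Assumption: the input contains no chars above U+00FF.
    public static string ReinterpretAsUtf8(string value)
    {
        byte[] bytes = Encoding.GetEncoding(28591).GetBytes(value);
        return Encoding.UTF8.GetString(bytes);
    }

    public static void Main()
    {
        // Verbatim literal: the backslashes are real characters, as they would be in a file.
        string raw = @"It\u00e2\u0080\u0099s working";
        string decoded = ReinterpretAsUtf8(Unescape(raw));
        Console.WriteLine(decoded == "It\u2019s working"); // True
    }
}
```

The second step would damage text that already contains characters above U+00FF, so it only suits input known to be mis-encoded in this particular way.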
That is javascript unicode encoding. Use a C# javascript deserializer to convert it.
(I don't have enough reputation to comment, so I will write here)
Where did you get those characters from in the first place?
\uXXXX is an escape notation used by JavaScript and C# (I didn't know about C# until now) to encode 16-bit Unicode characters in string literals. 16 bits is 4 hex characters, hence \uXXXX, with each X representing one hexadecimal digit.
Note that this is used to encode string literals in source code! It is not used to encode the bytes stored in files or memory. It is an older style: modern source code editors usually support UTF-8, UTF-16 or some other encoding that can store Unicode characters in source files, and they can display those characters and let you type them directly. So typing \uXXXX is no longer needed and is going out of style.
That is why I asked where you got the string initially. You wrote in one comment that you read it from a file? What generated the file?
If each \uXXXX is taken by itself as a Unicode character, which is what \uXXXX means, the sequence doesn't make sense. 00e2 is the letter a with a circumflex (â); 0080 and 0099 are non-printable control characters.
If e2 80 99 are taken together as three single bytes, i.e. dropping the leading 00 byte of each \u00XX, then they form the UTF-8 representation of the Unicode character with hexadecimal code point 2019, which is "Unicode Character 'RIGHT SINGLE QUOTATION MARK' (U+2019)".
That is what you are looking for, but whatever generated that string does not seem to be using the encoding correctly. If you end up with such strings and have to evaluate them, the answer above by "C# Novice" works, but it may not work in every case.
You could convert string literals that use \uXXXX escapes with a JavaScript evaluator, or with CSharpScript.RunAsync() by building a string literal from them, assigning it to a variable, and then looking at its bytes. I tried that later, but because those byte values/characters don't make sense I don't get anything meaningful from them. I get an a with a circumflex, and CSharpScript refuses to decode the next two and leaves them as is, because they are control characters when decoded.
Here are three different ways of doing \uXXXX decoding using available C# libraries. The first two use the Newtonsoft.Json package, the last uses Roslyn/CSharpScript; both are available from NuGet. Note that none of these print a single apostrophe, for the reasons described above. In contrast, if I change the string to "\u3053\u3093\u306B\u3061\u306F\u4E16\u754C!", the debug output window shows this Japanese text: "こんにちは世界!", which Google Translate tells me is the Japanese translation of "Hello World!"
https://translate.google.com/?sl=ja&tl=en&text=%E3%81%93%E3%82%93%E3%81%AB%E3%81%A1%E3%81%AF%E4%B8%96%E7%95%8C!&op=translate
So in summary, whatever generated those strings doesn't seem to be doing standard things.
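using System;
using System.IO;
using System.Threading.Tasks;
using Newtonsoft.Json;
using Microsoft.CodeAnalysis.CSharp.Scripting;
using Microsoft.CodeAnalysis.Scripting;
string test = @"It\u00e2\u0080\u0099s working";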
// Using JSON deserialization, since \uXXXX is valid encoding JavaScript string literals
// Have to add starting and ending quotes to make it a script literal definition, then deserialize as string
var d = Newtonsoft.Json.JsonConvert.DeserializeObject("\"" + test + "\"", typeof(string));
Console.WriteLine(d);
System.Diagnostics.Debug.WriteLine(d);
// Another way of JavaScript deserialization. If you are using a stream like reading from file this maybe better:
TextReader reader = new StringReader("\"" + test + "\"");
Newtonsoft.Json.JsonTextReader rdr = new JsonTextReader(reader);
rdr.Read();
Console.WriteLine(rdr.Value);
System.Diagnostics.Debug.WriteLine(rdr.Value);
// lastly overkill and too heavy: Using Roslyn CSharpScript, and letting C# compiler to decode \uXXXX's in string literal:
ScriptOptions opt = ScriptOptions.Default;
//opt = opt.WithFileEncoding(Encoding.Unicode);
Task<ScriptState<string>> task = Task.Run(async () => { return CSharpScript.RunAsync<string>("string str = \"" + test + "\".ToString();", opt); }).Result;
ScriptState<string> s = task.Result;
var ddd = s.Variables[0];
Console.WriteLine(ddd.Value);
System.Diagnostics.Debug.WriteLine(ddd.Value);

Simplest way to get rid of zero-width-space in c# string

I am parsing emails using a regex in a c# VSTO project. Once in a while, the regex does not seem to work (although if I paste the text and regex in regexbuddy, the regex correctly matches the text). If I look at the email in gmail, I see
=E2=80=8B
at the beginning and end of some lines (which I understand is the UTF-8 zero-width space); this appears to be what is messing up the regex. This seems to be the only sequence showing up.
What is the easiest way to get rid of this exact sequence? I cannot do the obvious
MailItem.Body.Replace("=E2=80=8B", "")
because those characters don't show up in the c# string.
I also tried
byte[] bytes = Encoding.Default.GetBytes(MailItem.TextBody);
string myString = Encoding.UTF8.GetString(bytes);
But the zero-width spaces just show up as ?. I suppose I could go through the bytes array and remove the bytes comprising the zero width space, but I don't know what the bytes would look like (it does not seem as simple as converting E2 80 8B to decimal and searching for that).
As strings in C# are stored as UTF-16 (not UTF-8), the zero-width space is the single character U+200B, so the following might do the trick:
MailItem.Body.Replace("\u200B", "");
As all the Regex.Replace() methods operate on strings, that's not going to be useful here.
The string indexer returns a char, so for want of a better solution (and if you can't predict where these characters are going to be), as long-winded as it seems, you may be best off with:
string a = MailItem.Body;
StringBuilder newText = new StringBuilder();
for (int i = 0; i < a.Length; i++)
{
    // Copy every character except the zero-width space (U+200B).
    if (a[i] != '\u200b')
    {
        newText.Append(a[i]);
    }
}
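To connect the =E2=80=8B seen in Gmail with the character in the C# string: E2 80 8B is the UTF-8 byte sequence for U+200B, which a .NET string holds as one char. A small sketch, with a made-up class name and sample address, showing both the byte sequence and the Replace approach:

```csharp
using System;
using System.Text;

public static class ZeroWidthSpaceDemo
{
    // string.Replace(string, string) compares ordinally, so the invisible char is matched exactly.
    public static string StripZeroWidthSpaces(string text)
    {
        return text.Replace("\u200B", "");
    }

    // Shows how U+200B looks once encoded as UTF-8.
    public static string Utf8BytesOfZeroWidthSpace()
    {
        return BitConverter.ToString(Encoding.UTF8.GetBytes("\u200B"));
    }

    public static void Main()
    {
        // Hypothetical line with a zero-width space at each end.
        string line = "\u200Bname@example.com\u200B";

        Console.WriteLine(Utf8BytesOfZeroWidthSpace());        // E2-80-8B
        Console.WriteLine(line.Length);                        // 18
        Console.WriteLine(StripZeroWidthSpaces(line).Length);  // 16
    }
}
```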
Use System.Web.HttpUtility.HtmlDecode(string);
Quite simple.

How to get index of any character in unicode string

I have a string variable that holds the Chinese translation of an English message.
String temp = "'％1'不能输入步骤'％2'";
But when I want to know whether the string contains %1 by using the IndexOf function
if(temp.IndexOf("%1") != -1)
{
}
the condition is not true even though the string contains %1.
Is this an issue caused by the Chinese characters, or something else?
Please suggest how I can get the index of a character in the above case.
That is because ％1 is not equal to %1: the string contains the full-width percent sign ％, not the ASCII %. As a workaround, you can select the symbols out of the string you have, like
var s = "'％1'不能输入步骤'％2'";
var firstFragment = s.Substring(1, 2); // this selects ％1
and then do
if(temp.IndexOf(firstFragment) != -1){
}
Comments gave the answer. Use the same percent character, so instead of:
"%1"
use:
"％1"
Or, if you find that problematic (your source code is in a "poor" code page, or you fear the code is hard to read when it contains full-width characters that resemble ASCII characters), use:
"\uFF051"
or even:
"\uFF05" + "1"
(concatenation will be done by the C# compiler, no extra concatting done at run-time).
Another approach might be Unicode normalization:
temp = temp.Normalize(NormalizationForm.FormKC);
which seems to project the "exotic" percent char into the usual ASCII percent char, although I am not sure if that behavior is guaranteed, but see the Decomposition field on Unicode Character 'FULLWIDTH PERCENT SIGN' (U+FF05).
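The three options above can be compared side by side. This is a sketch only; the class name is an assumption, and StringComparison.Ordinal is passed explicitly so the result does not depend on the current culture.

```csharp
using System;
using System.Text;

public static class FullWidthPercentDemo
{
    // The question's string, with the full-width percent sign written as \uFF05.
    public const string Temp = "'\uFF051'不能输入步骤'\uFF052'";

    // Searching for the ASCII percent sign finds nothing.
    public static int IndexOfAscii()
    {
        return Temp.IndexOf("%1", StringComparison.Ordinal);
    }

    // Searching for the full-width sign matches.
    public static int IndexOfFullWidth()
    {
        return Temp.IndexOf("\uFF05" + "1", StringComparison.Ordinal);
    }

    // After NFKC normalization the ASCII search matches too.
    public static int IndexOfAsciiAfterNormalize()
    {
        return Temp.Normalize(NormalizationForm.FormKC).IndexOf("%1", StringComparison.Ordinal);
    }

    public static void Main()
    {
        Console.WriteLine(IndexOfAscii());               // -1
        Console.WriteLine(IndexOfFullWidth());           // 1
        Console.WriteLine(IndexOfAsciiAfterNormalize()); // 1
    }
}
```

Normalizing changes the stored text, so it fits best when the full-width form does not need to be preserved.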

Is it possible to enter a New Line in a string without Escape Sequences?

I want a String to have a New Line in it, but I cannot use escape sequences because the interface I am sending my string to does not recognize them. As far as I know, C# does not actually store a New Line in the String, but rather it stores the escape sequence, causing the literal contents to be passed, rather than what they actually mean.
My best guess is that I would have to somehow parse the number 10 (the decimal value of a New Line according to the ASCII table) into ASCII. But I'm not sure how to do that, because C# parses numbers directly to String if attempting this:
"hello" + 10 + "world"
Any suggestions?
If you say "hello\nworld", the actual string will contain:
hello
world
There will be an actual new-line character in the string. At no point are the characters \ and n stored in the string.
There are a few ways to get the exact same result, but a simple \n in the string is a common way.
A simple cast should also do the same:
"hello" + (char)10 + "world"
Although likely slightly slower because of string concatenation. I say "likely" because it could probably be optimized away, or an actual example using \n will also result in string concatenation, taking roughly the same amount of time.
Test.
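A quick check of the claims above, as a minimal sketch with an illustrative class name: the escape sequence and the cast produce the same string, and no backslash is stored in it.

```csharp
using System;

public static class NewlineLiteralDemo
{
    public static readonly string Escaped = "hello\nworld";
    public static readonly string Cast = "hello" + (char)10 + "world";

    public static void Main()
    {
        Console.WriteLine(Escaped == Cast);         // True
        Console.WriteLine(Escaped.Length);          // 11
        Console.WriteLine((int)Escaped[5]);         // 10
        Console.WriteLine(Escaped.Contains("\\"));  // False
    }
}
```

The length of 11 (five letters, one line feed, five letters) shows that \n occupies a single character in the string.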
The preferred new line character is Environment.NewLine for its cross-platform capability.
You could use XML for communication, if your receiver can handle it.
