I've seen this problem a lot: you have some obscure Unicode character which is somewhat like a certain ASCII character and needs to be converted at run time for whatever reason.
In this case I am trying to export to CSV. Having already used a nasty fix for dash, em dash, en dash and horizontal bar, I have just received a new request for ' ` '. Aside from another nasty fix, is there a better way to do this?
Here's what I have at the moment...
formattedString = formattedString.Replace(char.ConvertFromUtf32(8211), "-");
formattedString = formattedString.Replace(char.ConvertFromUtf32(8212), "-");
formattedString = formattedString.Replace(char.ConvertFromUtf32(8213), "-");
Any ideas?
It's a rather inelegant problem, so no method will really be deeply elegant.
Still, we can certainly improve things. Which approach will work best depends on the number of changes that need to be made (and on the size of the string to change, though it's often best to assume this is, or could be, quite large).
With only one replacement character, the approach you use so far - .Replace - is superior, though I would replace char.ConvertFromUtf32(8211) with "\u2013". The effect on performance is negligible, but it's more readable, since it's more usual to refer to that character in hexadecimal, as in U+2013, than in decimal notation (char.ConvertFromUtf32(0x2013) would have the same advantage there, but no advantage over just using the char literal). (One could also put '–' straight into the code - more readable in some cases, but less so here, where it looks much the same as ‒, — or - to the reader.)
I'd also replace the string replace with the marginally faster character replace (in this case at least, where you are replacing a single char with a single char).
Taking this approach to your code it becomes:
formattedString = formattedString.Replace('\u2013', '-');
formattedString = formattedString.Replace('\u2014', '-');
formattedString = formattedString.Replace('\u2015', '-');
Even with as few as 3 replacements, this is likely to be less efficient than doing all such replacements in one pass (I'm not going to test how long formattedString would need to be for this to matter; above a certain number of replacements a single pass becomes more efficient even for strings of only a few characters). One approach is:
StringBuilder sb = new StringBuilder(formattedString.Length); // we know this is the capacity, so we initialise with it
foreach (char c in formattedString)
{
    switch (c)
    {
        case '\u2013': case '\u2014': case '\u2015':
            sb.Append('-');
            break;
        default:
            sb.Append(c);
            break;
    }
}
formattedString = sb.ToString();
(Another possibility is to check whether c >= '\u2013' && c <= '\u2015', as sketched below, but the reduction in the number of branches is small, and irrelevant if most of the characters you look for aren't numerically close to each other.)
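For example, reusing the StringBuilder loop from above (a minimal sketch; the range test only covers the three dashes):
foreach (char c in formattedString)
    sb.Append(c >= '\u2013' && c <= '\u2015' ? '-' : c);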
With various variants (e.g. if formattedString is going to be output to a stream at some point, it may be best to do so as each final character is obtained, rather than buffering again).
Note that this approach doesn't deal with multi-char strings in your search, but can with strings in your output, e.g. we could include:
case 'ß':
    sb.Append("ss");
    break;
Now, this is more efficient than the previous, but still becomes unwieldy after a certain number of replacement cases. It also involves many branches, which have their own performance issues.
Let's consider for a moment the opposite problem. Say you wanted to convert characters from a source that was only in the US-ASCII range. You would have only 128 possible characters so your approach could be:
char[] replacements = { /* list of 128 replacement characters */ };
StringBuilder sb = new StringBuilder(formattedString.Length);
foreach (char c in formattedString)
    sb.Append(replacements[(int)c]);
formattedString = sb.ToString();
Now, this isn't practical with Unicode, which has over 109,000 assigned characters in a range going from 0 to 1,114,111. However, chances are the characters you care about are not only a much smaller set than that (and if you really did care about that many cases, you'd want the approach given just above) but also fall in a relatively restricted block.
Suppose also that you don't especially care about any surrogates (we'll come to those later). Most characters you just don't care about, so let's consider this:
char[] unchanged = new char[128];
for(int i = 0; i != 128; ++i)
unchanged[i] = (char)i;
char[] error = new string('\uFFFD', 128).ToCharArray();
char[] block0 = (new string('\uFFFD', 19) + "---" + new string('\uFFFD', 106)).ToCharArray(); // U+2013..U+2015 sit at offsets 19..21 of the block starting at U+2000
char[][] blocks = new char[8704][];
for(int i = 1; i != 8704; ++i)
blocks[i] = error;
blocks[0] = unchanged;
blocks[64] = block0;
/* the above need only happen once, so it could be done with static members of a helper class that are initialised in a static constructor*/
StringBuilder sb = new StringBuilder(formattedString.Length);
foreach(char c in formattedString)
{
int cAsI = (int)c;
sb.Append(blocks[cAsI / 128][cAsI % 128]);
}
string ret = sb.ToString();
if(ret.IndexOf('\uFFFD') != -1)
throw new ArgumentException("Unconvertable character");
formattedString = ret;
The balance between whether it's better to test for an unconvertible character in one go at the end (as above) or on each conversion varies according to how likely this is to happen. It's obviously even better if you can be sure (due to knowledge of your data) that it won't happen, and can remove that check - but you have to be really sure.
The advantage here is that while we are using a look-up method, we are only taking up 384 characters' worth of memory to hold the look-up (and some more for the array overhead) rather than 109,000 characters' worth. The best size for the blocks within this varies according to your data, (that is, what replacements you want to make), but the assumption that there will be blocks that are identical to each other tends to hold.
Now, finally, what if you care about a character in the "astral planes" which are represented as surrogate pairs in the UTF-16 used internally in .NET, or if you care about replacing some multi-char strings in a particular way?
In this case, you are probably going to have to at the very least read a character or more ahead in your switch (if using the block method for most cases, you can use an unconvertible case to signal that such work is required). In such a case, it might well be worth converting to and then back from US-ASCII with System.Text.Encoding and a custom implementation of EncoderFallback and EncoderFallbackBuffer, and handling it there. This means that most of the conversion (the obvious cases) will be done for you, while your implementation deals only with the special cases. A sketch of that follows.
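A minimal sketch of what such a fallback might look like (the class names are mine, only the dash mapping from above is shown, and this is a starting point rather than a drop-in implementation):
public class DashFallback : EncoderFallback
{
    public override int MaxCharCount { get { return 1; } }
    public override EncoderFallbackBuffer CreateFallbackBuffer() { return new DashFallbackBuffer(); }
}

public class DashFallbackBuffer : EncoderFallbackBuffer
{
    private char replacement;
    private bool pending;

    public override int Remaining { get { return pending ? 1 : 0; } }

    public override bool Fallback(char charUnknown, int index)
    {
        // Called for any char US-ASCII cannot encode; map the ones we know about.
        switch (charUnknown)
        {
            case '\u2013': case '\u2014': case '\u2015': replacement = '-'; break;
            default: replacement = '?'; break; // or throw, or record an error
        }
        pending = true;
        return true;
    }

    public override bool Fallback(char charUnknownHigh, char charUnknownLow, int index)
    {
        // A surrogate pair (astral plane character): replace the whole pair with one char.
        replacement = '?';
        pending = true;
        return true;
    }

    public override char GetNextChar()
    {
        if (!pending) return '\0';
        pending = false;
        return replacement;
    }

    public override bool MovePrevious() { return false; }
    public override void Reset() { pending = false; }
}
It would then be used along the lines of:
Encoding ascii = Encoding.GetEncoding("us-ascii", new DashFallback(), DecoderFallback.ReplacementFallback);
formattedString = Encoding.ASCII.GetString(ascii.GetBytes(formattedString));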
You could maintain a lookup table that maps your problem characters to replacement characters. For efficiency you can work on a character array, which avoids the intermediary string churn that would result from using string.Replace.
For example:
var lookup = new Dictionary<char, char>
{
{ '`', '-' },
{ 'இ', '-' },
//next pair, etc, etc
};
var input = "blah இ blah ` blah";
char r;
var result = input.Select(c => lookup.TryGetValue(c, out r) ? r : c);
string output = new string(result.ToArray());
Or if you want blanket treatment of non ASCII range characters:
string output = new string(input.Select(c => c <= 127 ? c : '-').ToArray());
Unfortunately, given that you're doing a bunch of specific transforms within your data, you will likely need to do these via replacements.
That being said, you could make a few improvements.
If this is common, and the strings are long, storing these in a StringBuilder instead of a string would allow in-place replacements of the values, which could potentially improve things.
You could store the conversion characters, both from and to, in a Dictionary or other structure, and perform these operations in a simple loop.
You could load both the "from" and "to" characters at runtime from a configuration file, instead of having to hard-code every transformation. Later, when more of these were requested, you wouldn't need to alter your code - it could be done via configuration. A rough sketch combining these suggestions follows.
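For instance, a minimal sketch combining these ideas (the file name and its format are assumptions; it needs System.IO, System.Text and System.Collections.Generic):
// Assumed format of replacements.txt: one mapping per line, hex code point then the replacement character, e.g. "2013 -"
var map = new Dictionary<char, char>();
foreach (string line in File.ReadAllLines("replacements.txt"))
{
    string[] parts = line.Split(' ');
    map[(char)Convert.ToInt32(parts[0], 16)] = parts[1][0];
}
var sb = new StringBuilder(formattedString);
foreach (var pair in map)
    sb.Replace(pair.Key, pair.Value); // in-place, no intermediate string per replacement
formattedString = sb.ToString();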
If they are all replaced with the same string:
formattedString = string.Join("-", formattedString.Split('\u2013', '\u2014', '\u2015'));
or
foreach (char c in "\u2013\u2014\u2015")
formattedString = formattedString.Replace(c, '-');
I am parsing emails using a regex in a C# VSTO project. Once in a while, the regex does not seem to work (although if I paste the text and regex into RegexBuddy, the regex correctly matches the text). If I look at the email in Gmail, I see
=E2=80=8B
at the beginning and end of some lines (which I understand is the UTF-8 encoding of the zero-width space); this appears to be what is messing up the regex. It seems to be the only such sequence showing up.
What is the easiest way to get rid of this exact sequence? I cannot do the obvious
MailItem.Body.Replace("=E2=80=8B", "")
because those characters don't show up in the C# string.
I also tried
byte[] bytes = Encoding.Default.GetBytes(MailItem.TextBody);
string myString = Encoding.UTF8.GetString(bytes);
But the zero-width spaces just show up as ?. I suppose I could go through the bytes array and remove the bytes comprising the zero width space, but I don't know what the bytes would look like (it does not seem as simple as converting E2 80 8B to decimal and searching for that).
As strings in C# are stored as UTF-16 (not UTF-8), the following might do the trick:
MailItem.Body.Replace("\u200B", "");
As all the Regex.Replace() methods operate on strings, that's not going to be useful here.
The string indexer returns a char, so for want of a better solution (and if you can't predict where these characters are going to be), as long-winded as it seems, you may be best off with:
StringBuilder newText = new StringBuilder();
for (int i = 0; i < MailItem.Body.Length; i++)
{
    if (MailItem.Body[i] != '\u200b')
    {
        newText.Append(MailItem.Body[i]);
    }
}
Use System.Web.HttpUtility.HtmlDecode(string);
Quite simple.
ReadOnlySpan<char> is said to be perfect for parsing, so I tried to use it and came across a use case I don't know how to handle.
I have a command-line string where the argument prefix - and the separator (space) are escaped (I know I could quote them here but for the sake of this problem let's assume it's not an option):
var str = @"foo -bar \-baz\ qux".AsMemory();
The tokenizer should return the following tokens:
foo - command name
bar - argument name
-baz qux - argument value
Cases 1 & 2 are simple because here I can just use str.Slice(i, length) but how can I create the 3rd case and return only a single ReadOnlySpan<char>? The Slice method doesn't allow me to specify multiple start/length ranges which would be necessary in order to jump over the escape char \.
Example:
str.Slice((10, 4), (15, 4));
where (10, 4) = "-baz" and (15, 4) = " qux"
With StringBuilder you can just skip a couple of characters and Append the others later. How would I achieve the same result with ReadOnlySpan<char>?
A Span/ReadOnlySpan is a contiguous block of memory. It cannot contain multiple ranges. This design is necessary for performance. Span/ReadOnlySpan is supposed to be roughly as fast as an array is. Arrays are fast because they are contiguous memory blocks with no further abstractions.
I don't see a way to do this without allocating a new string. You can use Span/ReadOnlySpan for all contiguous substrings, but it seems your parsing problem isn't suited to using spans to store the results. One possible shape for that allocation is sketched below.
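If you do need the unescaped value in one piece, one option is to accept a single allocation at that point. A minimal sketch (the method name and the way the escape character is handled are my assumptions):
static string Unescape(ReadOnlySpan<char> value, char escapeChar)
{
    var sb = new StringBuilder(value.Length);
    for (int i = 0; i < value.Length; i++)
    {
        if (value[i] == escapeChar && i + 1 < value.Length)
            i++; // skip the escape character and emit the character it protects
        sb.Append(value[i]);
    }
    return sb.ToString();
}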
have a look at:
https://github.com/nemesissoft/Nemesis.TextParsers
and more precisely at:
TokenSequence.cs
Usage:
var tokens = @"ABC|CD\|E".AsSpan().Tokenize('|', '\\', false); // no allocation; result is 2 elements: "ABC" and "CD\|E"
Consume via:
var result = new List<string>();
foreach (var part in tokens)
result.Add(part.ToString());
Unescaping can be done via:
ParsedSequence.cs
or
SpanParserHelper.UnescapeCharacter()
Hope this helps
Let's say we have a variable myString = "blabla" or myString = "998769".
myString.Length; // will get you the total length
myString.Count(char.IsLetter); // if you only want the count of letters
How do I get the unique character count? I mean for "blabla" the result must be 3, and for "998769" it will be 4. Is there a ready-to-go function? Any suggestions?
You can use LINQ:
var count = myString.Distinct().Count();
It relies on the fact that string implements IEnumerable<char>.
Without LINQ, you can do the same stuff Distinct does internally and use HashSet<char>:
var count = (new HashSet<char>(myString)).Count;
If you handle only ANSI text in English (or characters from the BMP) then 80% of the time, if you write:
myString.Distinct().Count()
You will live happily and won't ever have any trouble. Let me post this answer only for those who really need to handle this properly. I'd say everyone should, but I know it's not true (quote from Wikipedia):
Because the most commonly used characters are all in the Basic Multilingual Plane, handling of surrogate pairs is often not thoroughly tested. This leads to persistent bugs and potential security holes, even in popular and well-reviewed application software (e.g. CVE-2008-2938, CVE-2012-2135)
The problem with our first naïve solution is that it doesn't handle Unicode properly and also doesn't consider what a user perceives as a character. Try "𠀑".Distinct().Count() and your code will wrongly return 2, because its UTF-16 representation is 0xD840 0xDC11 (by the way, neither of them, alone, is a valid Unicode character, because they're a high and a low surrogate, respectively).
Here I won't be very strict about terms and definitions, so please refer to www.unicode.org as a reference. For a (much) broader discussion please read How can I perform a Unicode aware character by character comparison?; encoding isn't the only issue you have to consider.
1) It doesn't take into account that .NET System.Char doesn't represent a character (or, more precisely, a grapheme) but a code unit of UTF-16 encoded text. Often they coincide, but not always (for example, with ideographic characters outside the BMP).
2) If you're counting what the user thinks of (or perceives) as a character then this will fail again, because it doesn't handle combining characters like ا́ (there are many examples of this in Arabic). There are also duplicates that exist for historical reasons: é, for example, is both a single Unicode code point and a combination (so that code will fail).
3) We're talking about a Western/American definition of character. If you're counting characters for end users you may need to change your definition to what they expect (for example, in Korean the definition of a character may not be so obvious; another example is the Czech digraph ch, which is always counted as a single character). Finally, don't forget some strange things that happen when you convert characters to upper/lower case (for example, in German ß becomes SS in upper case; see also this post).
Encoding
C# strings are encoded as UTF-16 (char is two bytes) but UTF-16 isn't a fixed-size encoding, and char should properly be called a code unit. What does that mean? That you may have a string whose Length is 2 but where the user will see (and it actually is) just one character (so the count should be 1).
If you need to handle this properly then you have to make things much more complicated (and slower). Fortunately the Char class has some helpful methods for handling surrogates.
The following code is untested (and for illustration purposes, so absolutely not optimized; I'm sure it can be done much better than this), so take it just as a starting point for further investigation:
int CountCharacters(string text)
{
    HashSet<string> characters = new HashSet<string>();
    string currentCharacter = "";
    for (int i = 0; i < text.Length; ++i)
    {
        if (Char.IsHighSurrogate(text, i))
        {
            // Do not count this, the next one will give the full pair
            currentCharacter = text[i].ToString();
            continue;
        }
        else if (Char.IsLowSurrogate(text, i))
        {
            // Our "character" is encoded as the previous one plus this one
            currentCharacter += text[i];
        }
        else
        {
            currentCharacter = text[i].ToString();
        }

        if (!characters.Contains(currentCharacter))
            characters.Add(currentCharacter);
    }
    return characters.Count;
}
Note that this example doesn't handle duplicates (where the same character may have different codes, or can be either a single code point or a combined character).
Combined Characters
If you have to handle combined characters (and of course encoding) then the best way to do it is to use the StringInfo class. You'll enumerate (and then count) both combined and encoded characters:
StringInfo.GetTextElementEnumerator(text).Walk()
.Distinct().Count();
Walk() is a trivial-to-implement extension method that simply walks through all the enumerator's elements (we need it because GetTextElementEnumerator() returns an IEnumerator instead of an IEnumerable); a possible sketch is given below.
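For example, a minimal sketch of such an extension (the class name is mine; it needs System.Collections.Generic and System.Globalization):
static class TextElementExtensions
{
    // Walks the text elements produced by StringInfo.GetTextElementEnumerator.
    public static IEnumerable<string> Walk(this TextElementEnumerator enumerator)
    {
        while (enumerator.MoveNext())
            yield return (string)enumerator.Current;
    }
}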
Please note that after the text has been properly split, it can be counted with our first solution (the point is that the building block isn't a char but a sequence of chars, for simplicity returned here as a string). Again, this code doesn't handle duplicates.
Culture
There is not much you can do to handle the issues listed in point 3. Each language has its own rules, and supporting them all can be a pain. More examples of culture issues are in this longer, more specific post.
It's important to be aware of them (so you have to know a little bit about the languages you're targeting), and don't forget that Unicode and a few translated resx files won't make your application global.
If text processing is important in your application, you can solve many issues using specialized DLLs for each locale you support (to count characters, to count words and so on), as word processors do. For example, the issues I listed can be solved simply using dictionaries. What I usually do is not use the standard .NET functions for strings (also because of some bugs); instead I create a Unicode class with static methods for everything I need (character counting, conversions, comparison) and many specialized derived classes for each supported language. At run time the static methods use the current thread's culture name to pick the proper implementation from a dictionary and delegate the work to it. A skeleton may be something like this:
abstract class Unicode
{
    public static int CountCharacters(string text)
    {
        return GetConcreteClass().CountCharactersCore(text);
    }

    protected virtual int CountCharactersCore(string text)
    {
        // Default implementation, overridden in derived classes if needed
        return StringInfo.GetTextElementEnumerator(text).Walk()
            .Distinct().Count();
    }

    private static Dictionary<string, Unicode> _implementations;

    private static Unicode GetConcreteClass()
    {
        string cultureName = Thread.CurrentThread.CurrentCulture.Name;
        // Check if the concrete class has been loaded and put in the dictionary
        ...
        return _implementations[cultureName];
    }
}
If you're using C# then Linq comes nicely to the rescue - again:
"blabla".Distinct().Count()
will do it.
In my controller method for handling a (potentially hostile) user input field I have the following code:
string tmptext = comment.Replace(System.Environment.NewLine, "{break was here}"); //marks line breaks for later re-insertion
tmptext = Encoder.HtmlEncode(tmptext);
//other sanitizing goes in here
tmptext = tmptext.Replace("{break was here}", "<br />");
var regex = new Regex("(<br /><br />)\\1+");
tmptext = regex.Replace(tmptext, "$1");
My goal is to preserve line breaks for typical non-malicious use and display user input in safe, HTML-encoded strings. I take the user input, parse it for newline characters and place a delimiter at the line breaks. I perform the HTML encoding and reinsert the breaks. (I will likely change this to reinserting paragraphs as p tags instead of br, but for now I'm using br.)
Now, actually inserting real HTML breaks opens me up to a subtle vulnerability: the Enter key. The Regex.Replace code is there to stop a malicious user just standing on the Enter key and filling the page with crap.
This fixes big floods of pure whitespace, but it still leaves me open to abuse like entering one character, two line breaks, one character, two line breaks, all the way down the page.
My question is about a method for determining that this is abusive and failing it on validation. I'm afraid there might not be a simple procedural way to do it and that I'll instead need heuristic techniques or Bayesian filters. Hopefully someone has an easier, better way.
EDIT: Perhaps I wasn't clear in the problem description: the regex handles seeing multiple line breaks in a row and converting them to just one or two. That problem is solved. The real problem is distinguishing legitimate text from a crap flood like this:
a
a
a
...imagine 1000 of these...
a
a
a
a
A random suggestion, inspired by slashdot.org's comment filters: compress your user input with a System.IO.Compression.DeflateStream, and if it is too small in comparison with the original (you'll have to do some experimentation to find a useful cut-off) reject it.
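A minimal sketch of that idea (the 200-byte minimum and the 0.2 ratio are arbitrary assumptions you would need to tune; it needs System.IO, System.IO.Compression and System.Text):
// Returns true when the input compresses suspiciously well, i.e. it is highly repetitive.
static bool LooksLikeFlood(string input)
{
    byte[] raw = Encoding.UTF8.GetBytes(input);
    if (raw.Length < 200)
        return false; // too short to judge either way
    using (var buffer = new MemoryStream())
    {
        using (var deflate = new DeflateStream(buffer, CompressionMode.Compress, leaveOpen: true))
            deflate.Write(raw, 0, raw.Length);
        double ratio = (double)buffer.Length / raw.Length;
        return ratio < 0.2; // highly repetitive input compresses extremely well
    }
}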
I would HttpUtility.HtmlEncode the string, then convert newline characters to <br/>.
HttpUtility.HtmlEncode(subject).Replace("\r\n", "<br/>").Replace("\r", "<br/>").Replace("\n", "<br/>");
Also, you should perform this logic when you are outputting to the user, not when saving to the database. The only validation I do at the database level is to make sure it's properly escaped (other than normal business rules, that is).
EDIT: To fix the actual problem however, you can use Regex to replace multiple newlines with a single newline beforehand.
subject = Regex.Replace(subject, @"(\r\n|\r|\n)+", "\n", RegexOptions.Singleline);
I'm not sure if you would need RegexOptions.Singleline.
It sounds like you're tempted to try something "clever" with a regex, but IMO the simplest approach is to just loop through the characters of the string copying them to a StringBuilder, filtering as you go.
Any that pass a char.IsWhiteSpace() test are not copied as-is. (If one of these is a newline, then insert a <br/> and don't allow any more <br/>s to be added until you have hit a non-whitespace character.) A rough sketch of that loop is below.
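For instance (a rough sketch; note that, as written, it drops ordinary spaces too, so you may prefer to emit a single space for whitespace runs that contain no newline):
static string FilterInput(string input)
{
    var sb = new StringBuilder(input.Length);
    bool brAllowed = true; // at most one <br/> until the next non-whitespace character
    foreach (char c in input)
    {
        if (char.IsWhiteSpace(c))
        {
            if (c == '\n' && brAllowed)
            {
                sb.Append("<br/>");
                brAllowed = false;
            }
        }
        else
        {
            sb.Append(c);
            brAllowed = true; // a non-whitespace character re-enables breaks
        }
    }
    return sb.ToString();
}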
edit
If you want to stop the user entering any old crap, give up now. You will never find a way filtering that a user can't find a way around in less than a minute, if they really want to.
You will be much better off putting a limit on the number of newlines, or the total number of characters, in the input.
Think of how much effort it will take to do something clever to sanitise "bad input", and then consider how likely it is that this will happen. Probably there is no point. Probably all the sanitisation you really need is to ensure the data is legal (not too large for your system to handle, all dangerous characters stripped or escaped, etc.). (This is exactly why forums have human moderators who can filter the posts based on whatever criteria are appropriate.)
This is not the most efficient way of handling this, nor the smartest (disclaimer), but if your text is not too big it doesn't matter much, and it works short of any smarter algorithm (note: it's hard to detect something like char\nchar\nchar\n..., though you could set a limit on the line length).
You could just Split on whitespace characters (add any you can think of, short of \n), then Join with a single space, then split on \n (to get lines) and join those with <br />. While joining the lines you can test for something like line.Length > 2. A rough sketch is below.
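For example (a rough sketch; the whitespace list is not exhaustive and the length cut-off is arbitrary):
var lines = input.Split('\n')
    .Select(line => string.Join(" ",
        line.Split(new[] { ' ', '\t', '\r' }, StringSplitOptions.RemoveEmptyEntries)))
    .Where(line => line.Length > 2); // arbitrary minimum line length
string output = string.Join("<br />", lines);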
To make this faster you can iterate with a more efficient algorithm, char by char, using IndexOf etc.
Again, this is not the most efficient or perfect way of handling it, but it would give you something fast.
EDIT: To filter 'same lines' you could use e.g. DistinctUntilChanged from Ix (the Interactive Extensions; see the Ix-experimental NuGet package, I think), which filters out consecutive identical lines; you could add the line-length test on top of that.
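Roughly (assuming the DistinctUntilChanged extension from the Ix package is available):
var lines = input.Split('\n')
    .Select(l => l.Trim())
    .DistinctUntilChanged()      // drops consecutive duplicate lines
    .Where(l => l.Length > 2);   // arbitrary minimum useful line length
string output = string.Join("<br />", lines);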
Rather than attempting to replace the newlines with filtered text and then attempting to use regular expressions on that, why not sanitize your data before inserting the <br /> tags? Don't forget to sanitize the input with HttpUtility.HtmlEncode first.
In an attempt to take care of multiple short lines in a row, here's my best attempt:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
class Program {
    static void Main() {
        // Arbitrary cutoff used to join short strings.
        const int Cutoff = 6;
        string input =
            "\r\n\r\n\n\r\r\r\n\nthisisatest\r\nstring\r\nwith\nsome\r\n" +
            "unsanatized\r\nbreaks\r\nand\ra\nsh\nor\nt\r\n\na\na\na\na" +
            "\na\na\na\na\na\na\na\na\na\na\na\na\na\na\na\na\na";
        input = (input ?? String.Empty).Trim(); // Don't forget to HtmlEncode it.
        StringBuilder temp = new StringBuilder();
        List<string> result = new List<string>();
        var items = input.Split(
            new[] { '\r', '\n' },
            StringSplitOptions.RemoveEmptyEntries)
            .Select(i => new { i.Length, Value = i });
        foreach (var item in items) {
            if (item.Length > Cutoff) {
                if (temp.Length > 0) {
                    result.Add(temp.ToString());
                    temp.Clear();
                }
                result.Add(item.Value);
                continue;
            }
            if (temp.Length > 0) { temp.Append(" "); }
            temp.Append(item.Value);
        }
        if (temp.Length > 0) {
            result.Add(temp.ToString());
        }
        Console.WriteLine(String.Join("<br />", result));
    }
}
Produces the following output:
thisisatest<br />string with some<br />unsanatized<br />breaks and a sh or t a a
a a a a a a a a a a a a a a a a a a a
I'm sure you've already come up with this solution, but unfortunately what you're asking for isn't very straightforward.
For those interested, here's my first attempt:
using System;
using System.Text.RegularExpressions;
class Program {
    static void Main() {
        string input = "\r\n\r\n\n\r\r\r\n\nthisisatest\r\nstring\r\nwith\nsome" +
                       "\r\nunsanatized\r\nbreaks\r\n\r\n";
        input = (input ?? String.Empty).Trim().Replace("\r", String.Empty);
        string output = Regex.Replace(
            input,
            "\\\n+",
            "<br />",
            RegexOptions.Multiline);
        Console.WriteLine(output);
    }
}
producing the following output:
thisisatest<br />string<br />with<br />some<br />unsanatized<br />breaks