Dealing with escape sequences with ReadOnlySpan<char>

Dealing with escape sequences with ReadOnlySpan<char> - c#

The ReadOnlySpan<char> is said to be perfect for parsing so I tried to use it and I came across a use case that I don't know how to handle.
I have a command-line string where the argument prefix - and the separator (space) are escaped (I know I could quote them here but for the sake of this problem let's assume it's not an option):
var str = #"foo -bar \-baz\ qux".AsMemory();
The tokenizer should return the following tokens:
foo - command name
bar - argument name
-baz qux - argument value
Cases 1 & 2 are simple because here I can just use str.Slice(i, length) but how can I create the 3rd case and return only a single ReadOnlySpan<char>? The Slice method doesn't allow me to specify multiple start/length ranges which would be necessary in order to jump over the escape char \.
Example:
str.Slice((10, 4), (15, 3));
where (10,4) = "-bar" and (15,3) = " qux"
With StringBuilder you can just skip a couple of characters and Append the others later. How would I achieve the same result with ReadOnlySpan<char>?

A Span/ReadOnlySpan is a contiguous block of memory. It cannot contain multiple ranges. This design is necessary for performance. Span/ReadOnlySpan is supposed to be roughly as fast as an array is. Arrays are fast because they are contiguous memory blocks with no further abstractions.
I don't see a way to do this without allocating a new string. You can use Span/ReadOnlySpan for all contiguous substrings but it seems your parsing problem is not suitable to use span to store results.

have a look at:
https://github.com/nemesissoft/Nemesis.TextParsers
and more precisely at:
TokenSequence.cs
Usage:
var tokens = "ABC|CD\|E".AsSpan().Tokenize('|', '\\', false); //no allocation. Result in 2 elements: "ABC", "CD\|E".
Consume via:
var result = new List<string>();
foreach (var part in tokens)
result.Add(part.ToString());
Unescaping can be done via:
ParsedSequence.cs
or
SpanParserHelper.UnescapeCharacter()
Hope this helps

Related

String.Contains and String.LastIndexOf C# return different result?

I have this problem where String.Contains returns true and String.LastIndexOf returns -1. Could someone explain to me what happened? I am using .NET 4.5.
static void Main(string[] args)
{
String wikiPageUrl = #"http://it.wikipedia.org/wiki/ʿAbd_Allāh_al-Sallāl";
if (wikiPageUrl.Contains("wikipedia.org/wiki/"))
{
int i = wikiPageUrl.LastIndexOf("wikipedia.org/wiki/");
Console.WriteLine(i);
}
}

While #sa_ddam213's answer definitely fixes the problem, it might help to understand exactly what's going on with this particular string.
If you try the example with other "special characters," the problem isn't exhibited. For example, the following strings work as expected:
string url1 = #"http://it.wikipedia.org/wiki/»Abd_Allāh_al-Sallāl";
Console.WriteLine(url1.LastIndexOf("it.wikipedia.org/wiki/")); // 7
string url2 = #"http://it.wikipedia.org/wiki/~Abd_Allāh_al-Sallāl";
Console.WriteLine(url2.LastIndexOf("it.wikipedia.org/wiki/")); // 7
The character in question, "ʿ", is called a spacing modifier letter1. A spacing modifier letter doesn't stand on its own, but modifies the previous character in the string, this case a "/". Another way to put this is that it doesn't take up its own space when rendered.
LastIndexOf, when called with no StringComparison argument, compares strings using the current culture.
When strings are compared in a culture-sensitive manner, the "/" and "ʿ" characters are not seen as two distinct characters--they're processed into one character, which does not match the parameter passed in to LastIndexOf.
When you pass in StringComparison.Ordinal to LastIndexOf, the characters are treated as distinct, due to the nature of Ordinal comparison.
Another way to make this work would be to use CompareInfo.LastIndexOf and supply the CompareOptions.IgnoreNonSpace option:
Console.WriteLine(
CultureInfo.CurrentCulture.CompareInfo.LastIndexOf(
wikiPageUrl, #"it.wikipedia.org/wiki/", CompareOptions.IgnoreNonSpace));
// 7
Here we're saying that we don't want combining characters included in our string comparison.
As a sidenote, this means that #Partha's answer and #Noctis' answer only work because the character is being applied to a character that doesn't appear in the search string that's passed to LastIndexOf.
Contrast this with the Contains method, which by default performs an Ordinal (case sensitive and culture insensitive) comparison. This explains why Contains returns true and LastIndexOf returns false.
For a fantastic overview of how strings should be manipulated in the .NET framework, check out this article.
1: Is this different than a combining character or is it a type of combining character? would appreciate if someone would clear that up for me.

Try using StringComparison.Ordinal
This will compare the string by evaluating the numeric values of the corresponding chars in each string, this should work with the special chars you have in that example string
string wikiPageUrl = #"http://it.wikipedia.org/wiki/ʿAbd_Allāh_al-Sallāl";
int i = wikiPageUrl.LastIndexOf("http://it.wikipedia.org/wiki/", StringComparison.Ordinal);
// returns 0;

The thing is C# lastindexof looks from behind.
And wikipedia.org/wiki/ is followed by ' which it takes as escape sequence. So either remove ' after wiki/ or have an # there too.
The following syntax will work( anyone )
string wikiPageUrl = #"http://it.wikipedia.org/wiki/Abd_Allāh_al-Sallāl";
string wikiPageUrl = #"http://it.wikipedia.org/wiki/#ʿAbd_Allāh_al-Sallāl";
int i = wikiPageUrl.LastIndexOf("wikipedia.org/wiki");
All 3 works
If you want a generalized solution for this problem replace ' with #' in your string before you perform any operations.

the ' characters throws it off.
This should work, when you escape the ' as \':
wikiPageUrl = #"http://it.wikipedia.org/wiki/\'Abd_Allāh_al-Sallāl";
if (wikiPageUrl.Contains("wikipedia.org/wiki/"))
{
"contains".Dump();
int i = wikiPageUrl.LastIndexOf("wikipedia.org/wiki/");
Console.WriteLine(i);
}
figure out what you want to do (remove the ', escape it, or dig deeper :) ).

Replacing specific numbers with String.Replace or Regex.Replace

Just having a little problem in attempting a string or regex replace on specific numbers in a string.
For example, in the string
#1 is having lunch with #10 #11
I would like to replace "#1", "#10" and "#11" with the respective values as indicated below.
"#1" replace with "#bob"
"#10" replace with "#joe"
"#11" replace with "#sam"
So the final output would look like
"#bob is having lunch with #joe #sam"
Attempts with
String.Replace("#1", "#bob")
results in the following
#bob is having lunch with #bob0 #bob1
Any thoughts on what the solution might be?

I would prefer more declarative way of doing this. What if there will be another replacements, for example #2 change to luke? You will have to change the code (add another Replace call).
My proposition with declarations of the replacements:
string input = "#1 is having lunch with #10 #11";
var rules = new Dictionary<string,string>()
{
{ "#1", "#bob" },
{ "#10", "#joe" },
{ "#11", "#sam"}
};
string output = Regex.Replace(input,
#"#\d+",
match => rules[match.Value]);
Explanation:
Regular expression is searching for pattern #\d+ which means # followed by one or more digits. And replaces this match thanks to MatchEvaluator by the proper entry from the rules dictionary, where the key is the match value itself.

Assuming all placeholder start with # and contain only digits, you can use the Regex.Replace overload that accepts a MatchEvaluator delegate to pick the replacement value from a dictionary:
var regex = new Regex(#"#\d+");
var dict = new Dictionary<string, string>
{
{"#1","#bob"},
{"#10","#joe"},
{"#11","#sam"},
};
var input = "#1 is having lunch with #10 #11";
var result=regex.Replace(input, m => dict[m.Value]);
The result will be "#bob is having lunch with #joe #sam"
There are a few advantages compared to multiple String.Replace calls:
The code is more concise, for an arbitrary number of placeholders
You avoid mistakes due to the order of the replacements (eg #11 must come before #1)
It's faster because you don't need to search and replace the placeholders multiple times
It doesn't create temporary strings for each parameter. This can be an issue for server applications because a large number of orphaned strings will put pressure on the garbage collector
The reason for advantages 3-4 is that the regex will parse the input and create an internal representation that contains the indexes for any match. When the time comes to create the final string, it uses a StringBuilder to read characters from the original string but substitute the replacement values when a match is encountered.

Start with the biggest (read longest) number like #11 and #10 first and then replace #1.
string finalstring = mystring.Replace("#11", "#sam")
.Replace("#10", "#joe")
.Replace("#1", "#bob");

Make your regular expression look for the string #1_
The space after will ensure that it only gets the number #1.

Reading in a text file more 'intelligently'

I have a text file which contains a list of alphabetically organized variables with their variable numbers next to them formatted something like follows:
aabcdef 208
abcdefghijk 1191
bcdefga 7
cdefgab 12
defgab 100
efgabcd 999
fgabc 86
gabcdef 9
h 11
ijk 80
...
...
I would like to read each text as a string and keep it's designated id# something like read "aabcdef" and store it into an array at spot 208.
The 2 issues I'm running into are:
I've never read from file in C#, is there a way to read, say from
start of line to whitespace as a string? and then the next string as
an int until the end of line?
given the nature and size of these files I do not know the highest ID value of each file (not all numbers are used so some
files could house a number like 3000, but only actually list 200
variables) So how could I make a flexible way to store these
variables when I don't know how big the array/list/stack/etc.. would
need to be.

Basically you need a Dictionary instead of an array or list. You can read all lines with File.ReadLines method then split each of them based on space and \t (tab), like this:
var values = File.ReadLines("path")
.Select(line => line.Split(new [] { ' ', '\t' }, StringSplitOptions.RemoveEmptyEntries))
.ToDictionary(parts => int.Parse(parts[1]), parts => parts[0]);
Then values[208] will give you aabcdef. It looks like an array doesn't it :)
Also make sure you have no duplicate numbers because Dictionary keys should be unique otherwise you will get an exception.

I've been thinking about how I would improve other answers and I've found this alternative solution based on Regex which makes the search into the whole string (either coming from a file or not) safer.
Check that you can alter the whole regular expression to include other separators. Sample expression will detect spaces and tabs.
At the end of the day, I found that MatchCollection returns a safer result, since you always know that 3rd group is an integer and 2nd group is a text because regular expression does a lot of checking for you!
StringBuilder builder = new StringBuilder();
builder.AppendLine("djdodjodo\t\t3893983");
builder.AppendLine("dddfddffd\t\t233");
builder.AppendLine("djdodjodo\t\t39838");
builder.AppendLine("djdodjodo\t\t12");
builder.AppendLine("djdodjodo\t\t444");
builder.AppendLine("djdodjodo\t\t5683");
builder.Append("djdodjodo\t\t33");
// Replace this line with calling File.ReadAllText to read a file!
string text = builder.ToString();
MatchCollection matches = Regex.Matches(text, #"([^\s^\t]+)(?:[\s\t])+([0-9]+)", RegexOptions.IgnoreCase | RegexOptions.Multiline);
// Here's the magic: we convert an IEnumerable<Match> into a dictionary!
// Check that using regexps, int.Parse should never fail because
// it matched numbers only!
IDictionary<int, string> lines = matches.Cast<Match>()
.ToDictionary(match => int.Parse(match.Groups[2].Value), match => match.Groups[1].Value);
// Now you can access your lines as follows:
string value = lines[33]; // <-- By value
Update:
As we discussed in chat, this solution wasn't working in some actual use case you showed me, but it's not the approach what's not working but your particular case, because keys are "[something].[something]" (for example: address.Name).
I've changed given regular expression to ([\w\.]+)[\s\t]+([0-9]+) so it covers the case of key having a dot.
It's about improving the matching regular expression to fit your requirements! ;)
Update 2:
Since you told me that you need keys having any character, I've changed the regular expression to ([^\s^\t]+)(?:[\s\t])+([0-9]+).
Now it means that key is anything excepting spaces and tabs.
Update 3:
Also I see you're stuck in .NET 3.0 and ToDictionary was introduced in .NET 3.5. If you want to get the same approach in .NET 3.0, replace ToDictionary(...) with:
Dictionary<int, string> lines = new Dictionary<int, string>();
foreach(Match match in matches)
{
lines.Add(int.Parse(match.Groups[2].Value), match.Groups[1].Value);
}

Find keyword fastest algorithm in C#

Hello I am trying to create a very fast algorithm to detect keywords or list of keywords in a collection.
Before anything, I have read a lot of stackoverflow (and other) posts without being able to improve the performances to the level I expect.
My current solution is able to analyze an input of 200 chars and a collection of 400 list in 0.1825ms (5 inputs analyzed in 1 ms) but this is way too long and I am hoping to improve this performances by at least 5 times (which is the requirement I have).
Solution tested :
Manual research
Highly complex Regex (groups, backreferences...)
Simple regex called multiple times (to match each of the keyword)
Simple regex to match input keywords followed by an intersect with the tracked keywords (current solution)
Multi-threading (huge impact on performances (*100) so I am not sure that would be the best solution for this problem)
Current solution :
input (string) : string to parse and analyze to verify the keywords list contained in it.
Example : "hello world! How are you Mr #piloupe?".
tracks (string[]) : array of string that we want to match (space means AND). Example : "hello world" matches a string that contains both 'hello' and 'world' whatever their location
keywordList (string[][]) : list of string to match from the input.
Example : { { "hello" }, { "#piloupe" }, { "hello", "world" } }
uniqueKeywords (string[]) : array of string representing all the unique keywords of the keywordList. With the previous keywordList that would be : { "hello", "#piloupe", "world" }
All these previous information does not require any performances improvement as they are constructed only once for any input.
Find the tracks algorithm:
// Store in the class performing the queries
readonly Regex _regexToGetAllInputWords = new Regex(#"\#\w+|\w+", RegexOptions.Compiled);
List<string> GetInputMatches(input)
{
// Extract all the words from the input
var inputWordsMatchCollection = _regexToGetAllInputWords.Matches(input.ToLower()).OfType<Match>().Select(x => x.Value).ToArray();
// Get all the words from the input matching the tracked keywords
var matchingKeywords = uniqueKeywords.Intersect(inputWordsMatchCollection).ToArray();
List<string> result = new List<string>();
// For all the tracks check whether they match
for (int i = 0; i < tracksKeywords.Length; ++i)
{
bool trackIsMatching = true;
// For all the keywords of the track check whether they exist
for (int j = 0; j < tracksKeywords[i].Length && trackIsMatching; ++j)
{
trackIsMatching = matchingKeywords.Contains(tracksKeywords[i][j]);
}
if (trackIsMatching)
{
string keyword = tracks[i];
result.Add(keyword);
}
}
return result;
}
Any help will be greatly appreciated.

The short answer is to parse every word, and store it into a binary tree-like collection. SortedList or SortedDictionary would be your friend here.
With very little code, you can add your words to a SortedList and then do a .BinarySearch() on that SortedList. This is a O(log n) implementation and you should be able to search through thousands or millions of words in a few iterations. When using SortedList, the performance issue will be on the inserts to SortedList (since it will sort while inserting). But this is necessary to do a binary search.
I wouldn't bother with threading since you need results in less than 1ms.
The long answer is to look at something like Lucene, which can be especially helpful if you're doing an autocomplete-style search. RavenDB uses Lucene under the covers and can do background indexing for you, it will search through millions of records in a few milliseconds.

I would like to suggest using hash table.
with hashing you can convert string text to integer representing the index of this string in hash table.
It's much more faster than sequential search.

The ultimate solution is the Elastic binary tree data structure. It is used in HAProxy to match rules against URLs in the proxied HTTP requests (and for many other purposes as well).
ebtree is a data structure built from your 'keyword' patterns, which allows faster matching than either SortedList or hashing. To be faster than hashing is possible because hashing reads the input string once (or at least several characters of it) to generate hash code, then again to evaluate .Equals(). Thus hashing reads all characters of the input 1+ times. ebtree reads all characters at most once and finds the match, or if there's no match, tells it after O(log(n)) characters where n is the number of patterns.
I'm not aware of existing C# implementation of ebtree, but surely many would be pleased if someone would take it.

Is there a more elegant way to change Unicode to Ascii?

I seen the problem a lot where you have some obscure unicode character which is somewhat like a certain ascii character and needs to be converted at run time for whatever reason.
In this case I am trying to export to csv. Having already used a nasty fix for dash, emdash, endash and hbar I have just recieved a new request for ' ` '. Aside from another nasty fix is there another better way to do this?
Heres what I have at the moment...
formattedString = formattedString.Replace(char.ConvertFromUtf32(8211), "-");
formattedString = formattedString.Replace(char.ConvertFromUtf32(8212), "-");
formattedString = formattedString.Replace(char.ConvertFromUtf32(8213), "-");
Any Ideas?

It's a rather inelegant problem, so no method will really be deeply elegant.
Still, we can certainly improve things. Just which approach will work best will depend on the number of changes that need to be made (and the size of the string to change, though it's often best to assume this either is or could be quite large).
At one replacement character, the approach you use so far - using .Replace is superior, though I would replace char.ConvertFromUtf32(8211) with "\u2013". The effect on performance is negligible but it's more readable, since it's more usual to refer to that character in hexadecimal as in U+2013 than in decimal notation (of course char.ConvertFromUtf32(0x2013) would have the same advantage there, but no advantage on just using the char notation). (One could also just put '–' straight into the code - more readable in some cases, but less so in this where it looks much the same as ‒, — or - to the reader).
I'd also replace the string replace with the marginally faster character replace (in this case at least, where you are replacing a single char with a single char).
Taking this approach to your code it becomes:
formattedString = formattedString.Replace('\u2013', '-');
formattedString = formattedString.Replace('\u2014', '-');
formattedString = formattedString.Replace('\u2015', '-');
Even with as few replacements as 3, this is likely to be less efficient than doing all such replacements in one pass (I'm not going to do a test to find how long formattedString would need to be for this, above a certain number it becomes more efficient to use a single pass even for strings of only a few characters). One approach is:
StringBuilder sb = new StringBuilder(formattedString.length);//we know this is the capacity so we initialise with it:
foreach(char c in formattedString)
switch(c)
{
case '\u2013': case '\u2014': case '\u2015':
sb.Append('-');
default:
sb.Append(c)
}
formattedString = sb.ToString();
(Another possibility is to check if (int)c >= 0x2013 && (int)c <= 0x2015 but the reduction in number of branches is small, and irrelevant if most of the characters you look for aren't numerically close to each other).
With various variants (e.g. if formattedString is going to be output to a stream at some point, it may be best to do so as each final character is obtained, rather than buffering again).
Note that this approach doesn't deal with multi-char strings in your search, but can with strings in your output, e.g. we could include:
case 'ß':
sb.Append("ss");
Now, this is more efficient than the previous, but still becomes unwieldy after a certain number of replacement cases. It also involves many branches, which have their own performance issues.
Let's consider for a moment the opposite problem. Say you wanted to convert characters from a source that was only in the US-ASCII range. You would have only 128 possible characters so your approach could be:
char[] replacements = {/*list of replacement characters*/}
StringBuilder sb = new StringBuilder(formattedString.length);
foreach(char c in formattedString)
sb.Append(replacements[(int)c]);
formattedString = sb.ToString();
Now, this isn't practical with Unicode, which has over assigned 109,000 characters in a range going from 0 to 1114111. However, chances are the characters you care about are not only much smaller than that (and if you really did care about that many cases, you'd want the approach given just above) but also in a relatively restricted block.
Consider also if you don't especially care about any surrogates (we'll come to those later). Well, most characters you just don't care about, so, let's consider this:
char[] unchanged = new char[128];
for(int i = 0; i != 128; ++i)
unchanged[i] = (char)i;
char[] error = new string('\uFFFD', 128).ToCharArray();
char[] block0 = (new string('\uFFFD', 13) + "---" + new string('\uFFFD', 112)).ToCharArray();
char[][] blocks = new char[8704][];
for(int i = 1; i != 8704; ++i)
blocks[i] = error;
blocks[0] = unchanged;
blocks[64] = block0;
/* the above need only happen once, so it could be done with static members of a helper class that are initialised in a static constructor*/
StringBuilder sb = new StringBuilder(formattedString.Length);
foreach(char c in formattedString)
{
int cAsI = (int)c;
sb.Append(blocks[i / 128][i % 128]);
}
string ret = sb.ToString();
if(ret.IndexOf('\uFFFD') != -1)
throw new ArgumentException("Unconvertable character");
formattedString = ret;
The balance between whether it's better to test for an uncovertable character in one go at the end (as above) or on each conversion varies according to how likely this is to happen. It's obviously even better if you can be sure (due to knowledge of your data) that it won't, and can remove that check - but you have to be really sure.
The advantage here is that while we are using a look-up method, we are only taking up 384 characters' worth of memory to hold the look-up (and some more for the array overhead) rather than 109,000 characters' worth. The best size for the blocks within this varies according to your data, (that is, what replacements you want to make), but the assumption that there will be blocks that are identical to each other tends to hold.
Now, finally, what if you care about a character in the "astral planes" which are represented as surrogate pairs in the UTF-16 used internally in .NET, or if you care about replacing some multi-char strings in a particular way?
In this case, you are probably going to have to at the very least read a character or more ahead in your switch (if using the block-method for most cases, you can use an unconvertable case to signal such work is required). In such a case, it might well be worth converting to and then back from US-ASCII with System.Text.Encoding and a custom implementation of EncoderFallback and EncoderFallbackBuffer and handle it there. This means that most of the conversion (the obvious cases) will be done for you, while your implementation can deal only with the special cases.

You could maintain a lookup table that maps your problem characters to replacement characters. For efficiency you can work on character array to prevent lots of intermediary string churn which would be a result of using string.Replace.
For example:
var lookup = new Dictionary<char, char>
{
{ '`', '-' },
{ 'இ', '-' },
//next pair, etc, etc
};
var input = "blah இ blah ` blah";
var r;
var result = input.Select(c => lookup.TryGetValue(c, out r) ? r : c);
string output = new string(result.ToArray());
Or if you want blanket treatment of non ASCII range characters:
string output = new string(input.Select(c => c <= 127 ? c : '-').ToArray());

Unfortunately, given that you're doing a bunch of specific transforms within your data, you will likely need to do these via replacements.
That being said, you could make a few improvements.
If this is common, and the strings are long, storing these in a StringBuilder instead of a string would allow in-place replacements of the values, which could potentially improve things.
You could store the conversion characters, both from and to, in a Dictionary or other structure, and perform these operations in a simple loop.
You could load both the "from" and "to" character at runtime from a configuration file, instead of having to hard-code every transformation operation. Later, when more of these were requested, you wouldn't need to alter your code - it could be done via configuration.

If they are all replaced with the same string:
formattedString = string.Join("-", formattedString.Split('\u2013', '\u2014', '\u2015'));
or
foreach (char c in "\u2013\u2014\u2015")
formattedString = formattedString.Replace(c, '-');

Develop Reference

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.