Sorting a MatchCollection in C#


I am trying to rewrite TCL code in C#. The code of concern is the following:
set list [regexp -all -inline -line {.+\d+.+\d+} $string]
In this case the regexp procedure returns a list of all matches in the string, which I then sort by the numeric value at the end of each string:
set sortedList [lsort -decreasing -integer -index end $list]
The question is, how to achieve the same in C#? I tried the following:
MatchCollection mc = Regex.Matches(inputString, regexPattern, RegexOptions.Multiline);
However, I found that I cannot sort a MatchCollection directly in C#, so I copied every match to an array:
string[] arrayOfMatches = new string[mc.Count];
for (int i = 0; i < mc.Count; i++)
{
arrayOfMatches[i] = mc[i].Groups[1].Value;
}
However, when I try to sort the arrayOfMatches array, I do not see the Sort method available. What am I missing and am I moving in the right direction? Thanks!

To sort arrays, you use the static Array.Sort() method. That said, to sort the matches you would need to define an IComparer. Perhaps an easier way to do this would be to use a little linq-fu:
var mc = Regex.Matches(input, pattern);
var matches = new Match[mc.Count];
mc.CopyTo(matches, 0);
var sorted = matches
.Select(x => x.Groups[1].Value)
.OrderBy(x => x);
sorted will contain the value of Groups[1] (the second item in each match's Groups collection, i.e. the first capture group) for every match, in ascending order. The .Select creates the projection you want and the .OrderBy sorts the resulting sequence.

The Array.Sort() method is static, so you have to call it like this:
Array.Sort(arrayOfMatches, comparison);
Where comparison is either a delegate that can compare two strings or an implementation of IComparer<T> that can do the same.
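As a rough sketch of the delegate form, assuming each match ends in an integer as in the TCL example. The class name and the TrailingNumber helper are made up for illustration and are not part of the original code:

```csharp
using System;
using System.Text.RegularExpressions;

public static class TrailingNumberSort
{
    // Hypothetical helper: returns the last run of digits in s as an int,
    // or int.MinValue when s has no digits or the number does not fit in an int.
    public static int TrailingNumber(string s)
    {
        Match m = Regex.Match(s, @"([0-9]+)[^0-9]*$");
        int n;
        if (m.Success && int.TryParse(m.Groups[1].Value, out n))
            return n;
        return int.MinValue;
    }

    // Sorts the array in place, largest trailing number first.
    public static void SortDescending(string[] items)
    {
        Comparison<string> comparison =
            (a, b) => TrailingNumber(b).CompareTo(TrailingNumber(a));
        Array.Sort(items, comparison);
    }
}
```

Calling TrailingNumberSort.SortDescending(arrayOfMatches) would then reorder the array from the question in place.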
But it might be easier to use LINQ:
var matches =
from Match m in mc
let value = m.Groups[1].Value
let numericValue = int.Parse(value)
orderby numericValue descending
select value;
This assumes the whole value is the number. If I understand you correctly and you want to get a numeric value from the end of the string, you would have to add code to do that.
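If the number sits at the end of each matched line, one possible way to add that step is to capture it with a named group and order by it. This is only a sketch: the pattern, the group name num, and the class and method names are assumptions and would need adjusting to the real input.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class MatchSorting
{
    // Returns every line that ends in an integer, ordered by that integer, largest first.
    // Lines without a trailing number are not matched and therefore dropped.
    public static List<string> SortByTrailingNumber(string input)
    {
        // Assumed pattern: a whole line whose last token is a run of ASCII digits,
        // optionally followed by spaces or tabs.
        var pattern = new Regex(@"^.*?(?<num>[0-9]+)[ \t]*\r?$", RegexOptions.Multiline);
        return pattern.Matches(input)
            .Cast<Match>()
            .OrderByDescending(m => long.Parse(m.Groups["num"].Value))
            .Select(m => m.Value.TrimEnd())
            .ToList();
    }
}
```

long.Parse will still throw on numbers with more digits than a long can hold, so untrusted input may call for long.TryParse instead.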

Related

Sort Array by contains specific char count

I have an array and I want to sort it by how many times a specific character occurs in each element.
var myNewArray = myArray.ToList().Sort(u => u.Name.Split(' ').Length);
but this does not work at all.
How can I provide the LINQ code for this problem ?
myArray[0] = "word1 word2"
myArray[1] = "word1"
myArray[2] = "word3 word2 word2 word2"
When I apply the sort, the element order must be:
myArray[2],myArray[0],myArray[1]

Use:
var myNewArray = myArray.OrderByDescending(u => u.Name.Split(' ').Length).ToList();
To count the number of words

Use OrderByDescending instead
var myNewArray = myArray.OrderByDescending(u => u.Name.Split(' ').Length).ToList();
This will save you producing two in-memory lists as well

Your code will not compile: List.Sort modifies the current list in place; it doesn't return a new collection.
Having said that, you need Enumerable.OrderByDescending, given your requirement:
"sentence which have more words must be top of the array"
Since you have an Array to begin with you can simply do:
var myNewArray = myArray.OrderByDescending(u => u.Name.Split(' ').Length).ToArray();
Make sure to include using System.Linq;
(Remove ToArray if you only need an IEnumerable<T>)
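For reference, a small self-contained sketch of the same idea on plain strings, like the sample data in the question. The class and method names are invented here, and RemoveEmptyEntries is an added assumption so that repeated spaces are not counted as words:

```csharp
using System;
using System.Linq;

public static class WordCountSort
{
    // Orders sentences by how many space-separated words they contain, most words first.
    // OrderByDescending is stable, so sentences with equal counts keep their original order.
    public static string[] ByWordCountDescending(string[] sentences)
    {
        return sentences
            .OrderByDescending(s => s.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries).Length)
            .ToArray();
    }
}
```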

Check if Characters in ArrayList C# exist - C# (2.0)

I was wondering if there is a way to search an ArrayList to see whether a record contains certain characters and, if so, grab the entire entry and put it into a string. For example:
list[0] = "C:\Test3\One_Title_Here.pdf";
list[1] = "D:\Two_Here.pdf";
list[2] = "C:\Test\Hmmm_Joke.pdf";
list[3] = "C:\Test2\Testing.pdf";
Looking for: "Hmmm_Joke.pdf"
Want to get: "C:\Test\Hmmm_Joke.pdf" and put it in the Remove()
protected void RemoveOther(ArrayList list, string Field)
{
string removeStr;
-- Put code in here to search for part of a string which is Field --
-- Grab that string here and put it into a new variable --
list.Contains();
list.Remove(removeStr);
}
Hope this makes sense. Thanks.

Loop through each string in the array list and if the string does not contain the search term then add it to new list, like this:
string searchString = "Hmmm_Joke.pdf";
ArrayList newList = new ArrayList();
foreach(string item in list)
{
if(!item.ToLower().Contains(searchString.ToLower()))
{
newList.Add(item);
}
}
Now you can work with the new list that has excluded any matches of the search string value.
Note: Made string be lowercase for comparison to avoid casing issues.

In order to remove a value from your ArrayList you'll need to loop through the values and check each one to see if it contains the desired value. Keep track of that index, or indexes if there are many.
Then after you have found all of the values you wish to remove, you can call ArrayList.RemoveAt to remove the values you want. If you are removing multiple values, start with the largest index and then process the smaller indexes, otherwise, the indexes will be off if you remove the smallest first.
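A minimal sketch of that approach, folding the "find" and "remove" passes into a single loop that runs from the highest index down. The class and method names are made up for illustration, and the case-insensitive comparison is an assumption borrowed from the other answers:

```csharp
using System;
using System.Collections;

public static class ArrayListRemoval
{
    // Removes every string entry that contains 'field' (case-insensitive)
    // and returns how many entries were removed.
    public static int RemoveContaining(ArrayList list, string field)
    {
        int removed = 0;
        // Walk from the last index down so a removal never shifts an index still to be visited.
        for (int i = list.Count - 1; i >= 0; i--)
        {
            string item = list[i] as string;
            if (item != null && item.IndexOf(field, StringComparison.OrdinalIgnoreCase) >= 0)
            {
                list.RemoveAt(i);
                removed++;
            }
        }
        return removed;
    }
}
```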

This will do the job without raising an InvalidOperationException:
string searchString = "Hmmm_Joke.pdf";
foreach (string item in list.ToArray())
{
if (item.IndexOf(searchString, StringComparison.OrdinalIgnoreCase) >= 0)
{
list.Remove(item);
}
}
I also made it case insensitive.
Good luck with your task.

I would rather use LINQ to solve this. Since a collection cannot be modified while it is being enumerated, we should first collect what we want removed and then remove it.
var toDelete = Array.FindAll(list.ToArray(), s =>
s.ToString().IndexOf("Hmmm_Joke.pdf", StringComparison.OrdinalIgnoreCase) >= 0
).ToList();
toDelete.ForEach(item => list.Remove(item));
Of course, use a variable in place of the hardcoded string.
I would also recommend reading this question: Case insensitive 'Contains(string)'
It discusses the proper way to compare strings, since converting to upper or lower case costs a lot of performance and may result in unexpected behaviour when dealing with file names like: 文書.pdf

String to dictionary using regex (want to optimize)

I have a string in the format "$0Option one$1$Option two$2$Option three" (etc) that I want to convert into a dictionary where each number corresponds to an option. I currently have a working solution for this problem, but since this method is called for every entry I'm importing (a few thousand) I want it to be as optimized as possible.
public Dictionary<string, int> GetSelValsDictBySelValsString(string selectableValuesString)
{
// Get all numbers in the string.
var correspondingNumbersArray = Regex.Split(selectableValuesString, @"[^\d]+").Where(x => (!String.IsNullOrWhiteSpace(x))).ToArray();
List<int> correspondingNumbers = new List<int>();
int number;
foreach (string s in correspondingNumbersArray)
{
Int32.TryParse(s, out number);
correspondingNumbers.Add(number);
}
selectableValuesString = selectableValuesString.Replace("$", "");
var selectableStringValuesArray = Regex.Split(selectableValuesString, @"[\d]+").Where(x => (!String.IsNullOrWhiteSpace(x))).ToArray();
var selectableValues = new Dictionary<string, int>();
for (int i = 0; i < selectableStringValuesArray.Count(); i++)
{
selectableValues.Add(selectableStringValuesArray.ElementAt(i), correspondingNumbers.ElementAt(i));
}
return selectableValues;
}

The first thing that caught my attention in your code is that it processes the input string three times: twice with Split() and once with Replace(). The Matches() method is a much better tool than Split() for this job. With it, you can extract everything you need in a single pass. It makes the code a lot easier to read, too.
The second thing I noticed was all those loops and intermediate objects. You're using LINQ already; really use it, and you can eliminate all of that clutter and improve performance. Check it out:
public static Dictionary<int, string> GetSelectValuesDictionary(string inputString)
{
return Regex.Matches(inputString, @"(?<key>[0-9]+)\$*(?<value>[^$]+)")
.Cast<Match>()
.ToDictionary(
m => int.Parse(m.Groups["key"].Value),
m => m.Groups["value"].Value);
}
notes:
Cast<Match>() is necessary because MatchCollection only advertises itself as an IEnumerable, and we need it to be an IEnumerable<Match>.
I used [0-9] instead of \d on the off chance that your values might contain digits from non-Latin writing systems; in .NET, \d matches them all.
Static Regex methods like Matches() automatically cache the Regex objects, but if this method is going to be called a lot (especially if you're using a lot of other regexes, too), you might want to create a static Regex object anyway. If performance is really critical, you can specify the Compiled option while you're at it.
My code, like yours, makes no attempt to deal with malformed input. In particular, mine will throw an exception if the number turns out to be too large, while yours just converts it to zero. This probably isn't relevant to your real code, but I felt compelled to express my unease at seeing you call TryParse() without checking the return value. :/
You also don't make sure your keys are unique. Like @Gabe, I flipped it around and used the numeric values as the keys, because they happened to be unique and the string values weren't. I trust that, too, is not a problem with your real data. ;)
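To illustrate the notes about a static Regex object, the Compiled option and checking TryParse, here is one possible variant. It is a sketch only: the class, field and method names are invented, and silently skipping bad or repeated numbers is an assumed policy, not something the question asks for.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class SelectableValues
{
    // Built once and reused; RegexOptions.Compiled trades start-up cost for faster matching.
    private static readonly Regex PairPattern =
        new Regex(@"(?<key>[0-9]+)\$*(?<value>[^$]+)", RegexOptions.Compiled);

    // Parses input such as "$0Option one$1$Option two" into number -> text.
    // Entries whose number does not fit in an int, or repeats an earlier one, are skipped.
    public static Dictionary<int, string> Parse(string input)
    {
        var result = new Dictionary<int, string>();
        foreach (Match m in PairPattern.Matches(input))
        {
            int key;
            if (!int.TryParse(m.Groups["key"].Value, out key))
                continue; // out-of-range number
            if (!result.ContainsKey(key))
                result.Add(key, m.Groups["value"].Value);
        }
        return result;
    }
}
```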

Your selectableStringValuesArray is not actually an array! This means that every time you index into it (with ElementAt or count it with Count) it has to rerun the regex and walk through the list of results looking for non-whitespace. You need something like this instead:
var selectableStringValuesArray = Regex.Split(selectableValuesString, @"[\d]+").Where(x => (!String.IsNullOrWhiteSpace(x))).ToArray();
You should also fix your correspondingNumbersString because it has the same problem.
I see you're using C# 4, though, so you can use Zip to combine the lists and then you wouldn't have to create an array or use any loops. You could create your dictionary like this:
return correspondingNumbersString.Zip(selectableStringValuesArray,
(number, str) => new KeyValuePair<int, string>(int.Parse(number), str))
.ToDictionary(kvp => kvp.Key, kvp => kvp.Value);

C# Efficient Substring with many inputs

Assuming I do not want to use external libraries or more than a dozen or so extra lines of code (i.e. clear code, not code golf code), can I do better than string.Contains to handle a collection of input strings and a collection of keywords to check for?
Obviously one can use objString.Contains(objString2) to do a simple substring check. However, there are many well-known algorithms which are able to do better than this under special circumstances, particularly if one is working with multiple strings. But sticking such an algorithm into my code would probably add length and complexity, so I'd rather use some sort of shortcut based on a built in function.
E.g. an input would be a collection of strings, a collection of positive keywords, and a collection of negative keywords. Output would be a subset of the first collection, all of which had at least 1 positive keyword but 0 negative keywords.
Oh, and please don't mention regular expressions as a suggested solution.
It may be that my requirements are mutually exclusive (not much extra code, no external libraries or regex, better than String.Contains), but I thought I'd ask.
Edit:
A lot of people are only offering silly improvements that won't beat an intelligently used call to contains by much, if anything. Some people are trying to call Contains more intelligently, which completely misses the point of my question. So here's an example of a problem to try solving. LBushkin's solution is an example of someone offering a solution that probably is asymptotically better than standard contains:
Suppose you have 10,000 positive keywords of length 5-15 characters, 0 negative keywords (this seems to confuse people), and one 1,000,000-character string. Check if the 1,000,000-character string contains at least 1 of the positive keywords.
I suppose one solution is to create an FSA. Another is to delimit on spaces and use hashes.

Your discussion of "negative and positive" keywords is somewhat confusing - and could use some clarification to get more complete answers.
As with all performance related questions - you should first write the simple version and then profile it to determine where the bottlenecks are - these can be unintuitive and hard to predict. Having said that...
One way to optimize the search (if you are always searching for "words" - and not phrases that could contain spaces) would be to build a search index from your string.
The search index could either be a sorted array (for binary search) or a dictionary. A dictionary would likely prove faster - both because dictionaries are hashmaps internally with O(1) lookup, and because a dictionary will naturally eliminate duplicate values in the search source - thereby reducing the number of comparisons you need to perform.
The general search algorithm is:
For each string you are searching against:
Take the string you are searching within and tokenize it into individual words (delimited by whitespace)
Populate the tokens into a search index (either a sorted array or dictionary)
Search the index for your "negative keywords", if one is found, skip to the next search string
Search the index for your "positive keywords"; when one is found, add it to a dictionary of found keywords (you could also track a count of how often the word appears)
Here's an example using a sorted array and binary search in C# 2.0:
NOTE: You could switch from string[] to List<string> easily enough, I leave that to you.
string[] FindKeyWordOccurence( string[] stringsToSearch,
                               string[] positiveKeywords,
                               string[] negativeKeywords )
{
    Dictionary<string,int> foundKeywords = new Dictionary<string,int>();
    foreach( string searchIn in stringsToSearch )
    {
        // tokenize and sort the input to make searches faster
        string[] tokenizedList = searchIn.Split( ' ' );
        Array.Sort( tokenizedList );
        // if any negative keyword exists, skip to the next search string...
        // (a flag is needed: a bare continue here would only advance the inner loop)
        bool hasNegative = false;
        foreach( string negKeyword in negativeKeywords )
        {
            if( Array.BinarySearch( tokenizedList, negKeyword ) >= 0 )
            {
                hasNegative = true;
                break;
            }
        }
        if( hasNegative )
            continue; // skip to next search string...
        // for each positive keyword, add to dictionary to keep track of it
        // we could have also used a SortedList, but the dictionary is easier
        foreach( string posKeyword in positiveKeywords )
            if( Array.BinarySearch( tokenizedList, posKeyword ) >= 0 )
                foundKeywords[posKeyword] = 1;
    }
    // convert the Keys in the dictionary (our found keywords) to an array...
    string[] foundKeywordsArray = new string[foundKeywords.Keys.Count];
    foundKeywords.Keys.CopyTo( foundKeywordsArray, 0 );
    return foundKeywordsArray;
}
Here's a version that uses a dictionary-based index and LINQ in C# 3.0:
NOTE: This is not the most LINQ-y way to do it, I could use Union() and SelectMany() to write the entire algorithm as a single big LINQ statement - but I find this to be easier to understand.
public IEnumerable<string> FindOccurences( IEnumerable<string> searchStrings,
                                           IEnumerable<string> positiveKeywords,
                                           IEnumerable<string> negativeKeywords )
{
    var foundKeywordsDict = new Dictionary<string, int>();
    foreach( var searchIn in searchStrings )
    {
        // tokenize the search string...
        // Distinct() is required: ToDictionary throws if a word appears twice
        var tokenizedDictionary = searchIn.Split( ' ' ).Distinct().ToDictionary( x => x );
        // skip if any negative keywords exist...
        if( negativeKeywords.Any( tokenizedDictionary.ContainsKey ) )
            continue;
        // merge found positive keywords into dictionary...
        // an example of where Enumerable.ForEach() would be nice...
        var found = positiveKeywords.Where( tokenizedDictionary.ContainsKey );
        foreach( var keyword in found )
            foundKeywordsDict[keyword] = 1;
    }
    return foundKeywordsDict.Keys;
}

If you add this extension method:
public static bool ContainsAny(this string testString, IEnumerable<string> keywords)
{
foreach (var keyword in keywords)
{
if (testString.Contains(keyword))
return true;
}
return false;
}
Then this becomes a one line statement:
var results = testStrings.Where(t => !t.ContainsAny(badKeywordCollection)).Where(t => t.ContainsAny(goodKeywordCollection));
This isn't necessarily any faster than doing the contains checks, except that it will do them efficiently, due to LINQ's streaming of results preventing any unnecessary contains calls.... Plus, the resulting code being a one liner is nice.

If you're truly just looking for space-delimited words, this code would be a very simple implementation:
static void Main(string[] args)
{
string sIn = "This is a string that isn't nearly as long as it should be " +
"but should still serve to prove an algorithm";
string[] sFor = { "string", "as", "not" };
Console.WriteLine(string.Join(", ", FindAny(sIn, sFor)));
}
private static string[] FindAny(string searchIn, string[] searchFor)
{
HashSet<String> hsIn = new HashSet<string>(searchIn.Split());
HashSet<String> hsFor = new HashSet<string>(searchFor);
return hsIn.Intersect(hsFor).ToArray();
}
If you only wanted a yes/no answer (as I see now may have been the case) there's another method of hashset "Overlaps" that's probably better optimized for that:
private static bool FindAny(string searchIn, string[] searchFor)
{
HashSet<String> hsIn = new HashSet<string>(searchIn.Split());
HashSet<String> hsFor = new HashSet<string>(searchFor);
return hsIn.Overlaps(hsFor);
}

Well, there is the Split() method you can call on a string. You could split your input strings into arrays of words using Split() then do a one-to-one check of words with keywords. I have no idea if or under what circumstances this would be faster than using Contains(), however.

First get rid of all the strings that contain negative words. I would suggest doing this using the Contains method. I would think that Contains() is faster than splitting, sorting, and searching.

Seems to me that the best way to do this is to take your match strings (both positive and negative) and compute a hash of them. Then march through your million-character string computing n hashes at each position (in your case up to 11, one for each keyword length from 5 to 15) and match against the hashes for your match strings. If you get hash matches, then you do an actual string compare to rule out false positives. There are a number of good ways to optimize this by bucketing your match strings by length and creating hashes based on the string size for a particular bucket.
So you get something like:
IList<Bucket> buckets = BuildBuckets(matchStrings);
int shortestLength = buckets[0].Length;
for (int i = 0; i <= inputString.Length - shortestLength; i++) {
foreach (Bucket b in buckets) {
if (i + b.Length > inputString.Length)
continue;
string candidate = inputString.Substring(i, b.Length);
int hash = ComputeHash(candidate);
foreach (MatchString match in b.MatchStrings) {
if (hash != match.Hash)
continue;
if (candidate == match.String) {
if (match.IsPositive) {
// positive case
}
else {
// negative case
}
}
}
}
}

To optimize Contains(), you need a tree (or trie) structure of your positive/negative words.
That should speed up everything (O(n) vs O(nm), n=size of string, m=avg word size) and the code is relatively small & easy.
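A rough sketch of the trie idea follows. It is a plain trie scanned from every start position, not the full Aho-Corasick automaton, so each position costs at most the length of the longest keyword regardless of how many keywords there are. The class and member names are invented for illustration.

```csharp
using System.Collections.Generic;

public sealed class KeywordTrie
{
    private sealed class Node
    {
        public readonly Dictionary<char, Node> Children = new Dictionary<char, Node>();
        public bool IsWordEnd;
    }

    private readonly Node root = new Node();

    // Builds the trie once from all keywords; empty keywords are ignored.
    public KeywordTrie(IEnumerable<string> keywords)
    {
        foreach (string keyword in keywords)
        {
            if (string.IsNullOrEmpty(keyword)) continue;
            Node node = root;
            foreach (char c in keyword)
            {
                Node next;
                if (!node.Children.TryGetValue(c, out next))
                {
                    next = new Node();
                    node.Children.Add(c, next);
                }
                node = next;
            }
            node.IsWordEnd = true;
        }
    }

    // True if any keyword occurs anywhere in text as a substring (case-sensitive).
    public bool ContainsAny(string text)
    {
        for (int start = 0; start < text.Length; start++)
        {
            Node node = root;
            for (int i = start; i < text.Length; i++)
            {
                if (!node.Children.TryGetValue(text[i], out node))
                    break; // no keyword continues with this character
                if (node.IsWordEnd)
                    return true;
            }
        }
        return false;
    }
}
```

Negative keywords could be handled with a second trie built the same way and checked first.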

How to Compare Values in Array

If you have a string of "1,2,3,1,5,7" you can put this in an array or hash table or whatever is deemed best.
How do you determine that all value are the same? In the above example it would fail but if you had "1,1,1" that would be true.

This can be done nicely using lambda expressions.
For an array, named arr:
var allSame = Array.TrueForAll(arr, x => x == arr[0]);
For a list (List<T>), named lst:
var allSame = lst.TrueForAll(x => x == lst[0]);
And for an iterable (IEnumerable<T>), named col:
var first = col.First();
var allSame = col.All(x => x == first);
Note that these methods don't handle empty arrays/lists/iterables. Such support would be trivial to add, however.

Iterate through each value, store the first value in a variable and compare the rest of the array to that variable. The instant one fails, you know all the values are not the same.
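That loop might look something like the sketch below, working on the comma-separated string from the question. The class and method names are made up, and treating an empty input as "all the same" is an arbitrary choice.

```csharp
public static class AllSameCheck
{
    // True when every comma-separated value equals the first one.
    public static bool AllValuesSame(string csv)
    {
        string[] values = csv.Split(',');
        string first = values[0]; // Split always returns at least one element
        for (int i = 1; i < values.Length; i++)
        {
            if (values[i] != first)
                return false; // stop at the first mismatch
        }
        return true;
    }
}
```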

How about something like...
string numArray = "1,1,1,1,1";
return numArray.Split( ',' ).Distinct().Count() <= 1;

I think using List<T>.TrueForAll would be a slick approach.
http://msdn.microsoft.com/en-us/library/kdxe4x4w.aspx

Not as efficient as a simple loop (as it always processes all items even if the result could be determined sooner), but:
if (new HashSet<string>(numbers.Split(',')).Count == 1) ...
