Fastest way to compare two lists

Fastest way to compare two lists - c#

I have a List (Foo) and I want to see if it's equal to another List (foo). What is the fastest way ?

From 3.5 onwards you may use a LINQ function for this:
List<string> l1 = new List<string> {"Hello", "World","How","Are","You"};
List<string> l2 = new List<string> {"Hello","World","How","Are","You"};
Console.WriteLine(l1.SequenceEqual(l2));
It also knows an overload to provide your own comparer

Here are the steps I would do:
Do an object.ReferenceEquals() if true, then return true.
Check the count, if not the same, return false.
Compare the elements one by one.
Here are some suggestions for the method:
Base the implementation on ICollection. This gives you the count, but doesn't restrict to specific collection type or contained type.
You can implement the method as an extension method to ICollection.
You will need to use the .Equals() for comparing the elements of the list.

Something like this:
public static bool CompareLists(List<int> l1, List<int> l2)
{
if (l1 == l2) return true;
if (l1.Count != l2.Count) return false;
for (int i=0; i<l1.Count; i++)
if (l1[i] != l2[i]) return false;
return true;
}
Some additional error checking (e.g. null-checks) might be required.

Something like this maybe using Match Action.
public static CompareList<T>(IList<T> obj1, IList<T> obj2, Action<T,T> match)
{
if (obj1.Count != obj2.Count) return false;
for (int i = 0; i < obj1.Count; i++)
{
if (obj2[i] != null && !match(obj1[i], obj2[i]))
return false;
}
}

Assuming you mean that you want to know if the CONTENTS are equal (not just the list's object reference.)
If you will be doing the equality check much more often than inserts then you may find it more efficient to generate a hashcode each time a value is inserted and compare hashcodes when doing the equality check. Note that you should consider if order is important or just that the lists have identical contents in any order.
Unless you are comparing very often I think this would usually be a waste.

One shortcut, that I didn't see mentioned, is that if you know how the lists were created, you may be able to join them into strings and compare directly.
For example...
In my case, I wanted to prompt the user for a list of words. I wanted to make sure that each word started with a letter, but after that, it could contain letters, numbers, or underscores. I'm particularly concerned that users will use dashes or start with numbers.
I use Regular Expressions to break it into 2 lists, and them join them back together and compare them as strings:
var testList = userInput.match(/[-|\w]+/g)
/*the above catches common errors:
using dash or starting with a numeric*/
listToUse = userInput.match(/[a-zA-Z]\w*/g)
if (listToUse.join(" ") != testList.join(" ")) {
return "the lists don't match"
Since I knew that neither list would contain spaces, and that the lists only contained simple strings, I could join them together with a space, and compare them.

Related

Search for an existing object in a list

This is my first question here so I hope I'm doing right.
I have to create a List of array of integer:
List<int[]> finalList = new List<int[]>();
in order to store all the combinations of K elements with N numbers.
For example:
N=5, K=2 => {1,2},{1,3},{1,4},...
Everything is all right but I want to avoid the repetitions of the same combination in the list({1,2} and {2,1} for example). So before adding the tmpArray (where I temporally store the new combination) in the list, I want to check if it's already stored.
Here it's what I'm doing:
create the tmpArray with the next combination (OK)
sort tmpArray (OK)
check if the List already contains tmpArray with the following code:
if (!finalList.Contains(tmpArray))
finalList.Add(tmpArray);
but it doesn't work. Can anyone help me with this issue?

Array is a reference type - your Contains query will not do what you want (compare all members in order).
You may use something like this:
if (!finalList.Any(x => x.SequenceEqual(tmpArray))
{
finalList.Add(tmpArray);
}
(Make sure you add a using System.Linq to the top of your file)
I suggest you learn more about value vs. reference types, Linq and C# data structure fundamentals. While above query should work it will be slow - O(n*m) where n = number of arrays in finalList and m length of each array.
For larger arrays some precomputing (e.g. a hashcode for each of the arrays) that allows you a faster comparison might be beneficial.

If I remember correctly, contains will either check the value for value data types or it will check the address for object types. An array is an object type, so the contains is only checking if the address in memory is stored in your list. You'll have to check each item in this list and perform some type of algorithm to check that the values of the array are in the list.
Linq, Lambda, or brute force checking comes to mind.
BrokenGlass gives a good suggestion with Linq and Lambda.
Brute Force:
bool itemExists = true;
foreach (int[] ints in finalList)
{
if (ints.Length != tmpArray.Length)
{
itemExists = false;
break;
}
else
{
// Compare each element
for (int i = 0; i < tmpArray.Length; i++)
{
if (ints[i] != tmpArray[i])
{
itemExists = false;
break;
}
}
// Have to check to break from the foreach loop
if (itemExists == false)
{
break;
}
}
}
if (itemExists == false)
{
finalList.add(tmpArray);
}

Check if Characters in ArrayList C# exist - C# (2.0)

I was wondering if there is a way in an ArrayList that I can search to see if the record contains a certain characters, If so then grab the whole entire sentence and put in into a string. For Example:
list[0] = "C:\Test3\One_Title_Here.pdf";
list[1] = "D:\Two_Here.pdf";
list[2] = "C:\Test\Hmmm_Joke.pdf";
list[3] = "C:\Test2\Testing.pdf";
Looking for: "Hmmm_Joke.pdf"
Want to get: "C:\Test\Hmmm_Joke.pdf" and put it in the Remove()
protected void RemoveOther(ArrayList list, string Field)
{
string removeStr;
-- Put code in here to search for part of a string which is Field --
-- Grab that string here and put it into a new variable --
list.Contains();
list.Remove(removeStr);
}
Hope this makes sense. Thanks.

Loop through each string in the array list and if the string does not contain the search term then add it to new list, like this:
string searchString = "Hmmm_Joke.pdf";
ArrayList newList = new ArrayList();
foreach(string item in list)
{
if(!item.ToLower().Contains(searchString.ToLower()))
{
newList.Add(item);
}
}
Now you can work with the new list that has excluded any matches of the search string value.
Note: Made string be lowercase for comparison to avoid casing issues.

In order to remove a value from your ArrayList you'll need to loop through the values and check each one to see if it contains the desired value. Keep track of that index, or indexes if there are many.
Then after you have found all of the values you wish to remove, you can call ArrayList.RemoveAt to remove the values you want. If you are removing multiple values, start with the largest index and then process the smaller indexes, otherwise, the indexes will be off if you remove the smallest first.

This will do the job without raising an InvalidOperationException:
string searchString = "Hmmm_Joke.pdf";
foreach (string item in list.ToArray())
{
if (item.IndexOf(searchString, StringComparison.OrdinalIgnoreCase) >= 0)
{
list.Remove(item);
}
}
I also made it case insensitive.
Good luck with your task.

I would rather use LINQ to solve this. Since IEnumerables are immutable, we should first get what we want removed and then, remove it.
var toDelete = Array.FindAll(list.ToArray(), s =>
s.ToString().IndexOf("Hmmm_Joke.pdf", StringComparison.OrdinalIgnoreCase) >= 0
).ToList();
toDelete.ForEach(item => list.Remove(item));
Of course, use a variable where is hardcoded.
I would also recommend read this question: Case insensitive 'Contains(string)'
It discuss the proper way to work with characters, since convert to Upper case/Lower case since it costs a lot of performance and may result in unexpected behaviours when dealing with file names like: 文書.pdf

Why does my comparison always return false?

I have bunch of usernames in this arraylist and I want to check whether username exists in the arraylist or not but the method always returns false.
public bool check_username(ArrayList userList, string username)
{
for (int i = 0; i < userList.Count; i++)
{
if (userList[i].ToString() == username)
{
return true;
}
}
return false;
}

Consider making your string comparison case insensitive.
username.Equals(userList[i].ToString(), StringComparison.OrdinalIgnoreCase);
Or, assuming all elements of your ArrayList userList are strings, and you're using .NET 3.5 or later, you can simplify this using LINQ:
public bool check_username(ArrayList userList, string username)
{
return userList.Cast<string>()
.Any(s => s.Equals(username, StringComparison.OrdinalIgnoreCase);
}

Without seeing your list or what you are passing we can't know for sure, but it could be a normalization problem.
Bob, bob, BOB, BOb etc... are not the same when comparing strings.
Replace your if statement with this:
if(userList[i].ToString().ToLower() == username.ToLower())
{
return true;
}

There are several reasons I could think of that could cause that function to always return false.
Are you sure that your userList and username will always have the same casing? It would be a good practice to use .ToLower() or .ToUpper() to insure that the casing matches unless you intend for casing to be part of the match.
Are you sure there is no extra whitespace on either string? It is good practice to use .Trim() when you are comparing strings where there may be extra whitespace.
Using the .Equals() method when comparing string is more reliable than the logical operator ==. Occasionally the logical operator produces incorrect results.
Are you positive that you should be getting a true result? Is it possible one string contains a hidden character that you are unaware of? Use the debugger to check the values of the strings to be sure.

Examining two string arrays for equivalence

Is there a better way to examine whether two string arrays have the same contents than this?
string[] first = new string[]{"cat","and","mouse"};
string[] second = new string[]{"cat","and","mouse"};
bool contentsEqual = true;
if(first.Length == second.Length){
foreach (string s in first)
{
contentsEqual &= second.Contains(s);
}
}
else{
contentsEqual = false;
}
Console.WriteLine(contentsEqual.ToString());// true

Enumerable.SequenceEquals if they're supposed to be in the same order.

You should consider using the intersect method. It will give you all the matching values and then you can just compare the count of the resulting array with one the arrays that were compared.
http://msdn.microsoft.com/en-us/library/system.linq.enumerable.intersect.aspx

This is O(n^2). If the arrays have the same length, sort them, then compare elements in the same position. This is O(n log n).
Or you can use a hash set or dictionary: insert each word in the first array, then see if every word in the second array is in the set or dictionary. This is O(n) on average.

Nothing wrong with the logic of the method, but the fact that you're testing Contains for each item in the first sequence means the algorithm runs in O(n^2) time in general. You can also make one or two other smaller optimisations and improvements
I would implement such a function as follows. Define an extension method as such (example in .NET 4.0).
public static bool SequenceEquals<T>(this IEnumerable<T> seq1, IEnumerable<T> seq2)
{
foreach (var pair in Enumerable.Zip(seq1, seq2)
{
if (!pair.Item1.Equals(pair.Item2))
return;
}
return false;
}

You could try Enumerable.Intersect: http://msdn.microsoft.com/en-us/library/bb460136.aspx
The result of the operation is every element that is common to both arrays. If the length of the result is equal to the length of both arrays, then the two arrays contain the same items.
Enumerable.Union: http://msdn.microsoft.com/en-us/library/bb341731.aspx would work too; just check that the result of the Union operation has length of zero (meaning there are no elements that are unique to only one array);
Although I'm not exactly sure how the functions handle duplicates.

C# Efficient Substring with many inputs

Assuming I do not want to use external libraries or more than a dozen or so extra lines of code (i.e. clear code, not code golf code), can I do better than string.Contains to handle a collection of input strings and a collection of keywords to check for?
Obviously one can use objString.Contains(objString2) to do a simple substring check. However, there are many well-known algorithms which are able to do better than this under special circumstances, particularly if one is working with multiple strings. But sticking such an algorithm into my code would probably add length and complexity, so I'd rather use some sort of shortcut based on a built in function.
E.g. an input would be a collection of strings, a collection of positive keywords, and a collection of negative keywords. Output would be a subset of the first collection of keywords, all of which had at least 1 positive keyword but 0 negative keywords.
Oh, and please don't mention regular expressions as a suggested solutions.
It may be that my requirements are mutually exclusive (not much extra code, no external libraries or regex, better than String.Contains), but I thought I'd ask.
Edit:
A lot of people are only offering silly improvements that won't beat an intelligently used call to contains by much, if anything. Some people are trying to call Contains more intelligently, which completely misses the point of my question. So here's an example of a problem to try solving. LBushkin's solution is an example of someone offering a solution that probably is asymptotically better than standard contains:
Suppose you have 10,000 positive keywords of length 5-15 characters, 0 negative keywords (this seems to confuse people), and 1 1,000,000 character string. Check if the 1,000,000 character string contains at least 1 of the positive keywords.
I suppose one solution is to create an FSA. Another is delimit on spaces and use hashes.

Your discussion of "negative and positive" keywords is somewhat confusing - and could use some clarification to get more complete answers.
As with all performance related questions - you should first write the simple version and then profile it to determine where the bottlenecks are - these can be unintuitive and hard to predict. Having said that...
One way to optimize the search may (if you are always searching for "words" - and not phrases that could contains spaces) would be to build a search index of from your string.
The search index could either be a sorted array (for binary search) or a dictionary. A dictionary would likely prove faster - both because dictionaries are hashmaps internally with O(1) lookup, and a dictionary will naturally eliminate duplicate values in the search source - thereby reducing the number of comparions you need to perform.
The general search algorithm is:
For each string you are searching against:
Take the string you are searching within and tokenize it into individual words (delimited by whitespace)
Populate the tokens into a search index (either a sorted array or dictionary)
Search the index for your "negative keywords", if one is found, skip to the next search string
Search the index for your "positive keywords", when one is found, add it to a dictionary as they (you could also track a count of how often the word appears)
Here's an example using a sorted array and binary search in C# 2.0:
NOTE: You could switch from string[] to List<string> easily enough, I leave that to you.
string[] FindKeyWordOccurence( string[] stringsToSearch,
string[] positiveKeywords,
string[] negativeKeywords )
{
Dictionary<string,int> foundKeywords = new Dictionary<string,int>();
foreach( string searchIn in stringsToSearch )
{
// tokenize and sort the input to make searches faster
string[] tokenizedList = searchIn.Split( ' ' );
Array.Sort( tokenizedList );
// if any negative keywords exist, skip to the next search string...
foreach( string negKeyword in negativeKeywords )
if( Array.BinarySearch( tokenizedList, negKeyword ) >= 0 )
continue; // skip to next search string...
// for each positive keyword, add to dictionary to keep track of it
// we could have also used a SortedList, but the dictionary is easier
foreach( string posKeyword in positiveKeyWords )
if( Array.BinarySearch( tokenizedList, posKeyword ) >= 0 )
foundKeywords[posKeyword] = 1;
}
// convert the Keys in the dictionary (our found keywords) to an array...
string[] foundKeywordsArray = new string[foundKeywords.Keys.Count];
foundKeywords.Keys.CopyTo( foundKeywordArray, 0 );
return foundKeywordsArray;
}
Here's a version that uses a dictionary-based index and LINQ in C# 3.0:
NOTE: This is not the most LINQ-y way to do it, I could use Union() and SelectMany() to write the entire algorithm as a single big LINQ statement - but I find this to be easier to understand.
public IEnumerable<string> FindOccurences( IEnumerable<string> searchStrings,
IEnumerable<string> positiveKeywords,
IEnumerable<string> negativeKeywords )
{
var foundKeywordsDict = new Dictionary<string, int>();
foreach( var searchIn in searchStrings )
{
// tokenize the search string...
var tokenizedDictionary = searchIn.Split( ' ' ).ToDictionary( x => x );
// skip if any negative keywords exist...
if( negativeKeywords.Any( tokenizedDictionary.ContainsKey ) )
continue;
// merge found positive keywords into dictionary...
// an example of where Enumerable.ForEach() would be nice...
var found = positiveKeywords.Where(tokenizedDictionary.ContainsKey)
foreach (var keyword in found)
foundKeywordsDict[keyword] = 1;
}
return foundKeywordsDict.Keys;
}

If you add this extension method:
public static bool ContainsAny(this string testString, IEnumerable<string> keywords)
{
foreach (var keyword in keywords)
{
if (testString.Contains(keyword))
return true;
}
return false;
}
Then this becomes a one line statement:
var results = testStrings.Where(t => !t.ContainsAny(badKeywordCollection)).Where(t => t.ContainsAny(goodKeywordCollection));
This isn't necessarily any faster than doing the contains checks, except that it will do them efficiently, due to LINQ's streaming of results preventing any unnecessary contains calls.... Plus, the resulting code being a one liner is nice.

If you're truly just looking for space-delimited words, this code would be a very simple implementation:
static void Main(string[] args)
{
string sIn = "This is a string that isn't nearly as long as it should be " +
"but should still serve to prove an algorithm";
string[] sFor = { "string", "as", "not" };
Console.WriteLine(string.Join(", ", FindAny(sIn, sFor)));
}
private static string[] FindAny(string searchIn, string[] searchFor)
{
HashSet<String> hsIn = new HashSet<string>(searchIn.Split());
HashSet<String> hsFor = new HashSet<string>(searchFor);
return hsIn.Intersect(hsFor).ToArray();
}
If you only wanted a yes/no answer (as I see now may have been the case) there's another method of hashset "Overlaps" that's probably better optimized for that:
private static bool FindAny(string searchIn, string[] searchFor)
{
HashSet<String> hsIn = new HashSet<string>(searchIn.Split());
HashSet<String> hsFor = new HashSet<string>(searchFor);
return hsIn.Overlaps(hsFor);
}

Well, there is the Split() method you can call on a string. You could split your input strings into arrays of words using Split() then do a one-to-one check of words with keywords. I have no idea if or under what circumstances this would be faster than using Contains(), however.

First get rid of all the strings that contain negative words. I would suggest doing this using the Contains method. I would think that Contains() is faster then splitting, sorting, and searching.

Seems to me that the best way to do this is take your match strings (both positive and negative) and compute a hash of them. Then march through your million string computing n hashes (in your case it's 10 for strings of length 5-15) and match against the hashes for your match strings. If you get hash matches, then you do an actual string compare to rule out the false positive. There are a number of good ways to optimize this by bucketing your match strings by length and creating hashes based on the string size for a particular bucket.
So you get something like:
IList<Buckets> buckets = BuildBuckets(matchStrings);
int shortestLength = buckets[0].Length;
for (int i = 0; i < inputString.Length - shortestLength; i++) {
foreach (Bucket b in buckets) {
if (i + b.Length >= inputString.Length)
continue;
string candidate = inputString.Substring(i, b.Length);
int hash = ComputeHash(candidate);
foreach (MatchString match in b.MatchStrings) {
if (hash != match.Hash)
continue;
if (candidate == match.String) {
if (match.IsPositive) {
// positive case
}
else {
// negative case
}
}
}
}
}

To optimize Contains(), you need a tree (or trie) structure of your positive/negative words.
That should speed up everything (O(n) vs O(nm), n=size of string, m=avg word size) and the code is relatively small & easy.

Develop Reference

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.