C# - Compare 2 lists with custom elements

I have 2 lists. One contains the search terms, the other contains the data.
I need to find every element in list2 that contains any string from list1 ("Cat" or "Dog"). For example:
List<string> list1 = new List<string>();
list1.Add("Cat");
list1.Add("Dog");
list1.Add... ~1000 items;
List<string> list2 = new List<string>();
list2.Add("Gray Cat");
list2.Add("Black Cat");
list2.Add("Green Duck");
list2.Add("White Horse");
list2.Add("Yellow Dog Tasmania");
list2.Add("White Horse");
list2.Add... ~million items;
My expected listResult is {"Gray Cat", "Black Cat", "Yellow Dog Tasmania"} (because those entries contain "Cat" or "Dog" from list1). Instead of nested loops, do you have any idea how to make this run faster?
My current solution is below, but it seems too slow:
foreach (string str1 in list1)
{
    foreach (string str2 in list2)
    {
        if (str2.Contains(str1))
        {
            listResult.Add(str2);
        }
    }
}

An excellent use case for parallelization!
LINQ approach without parallelization (internally this is equivalent to your approach, except that the inner search stops as soon as one match is found, whereas your version keeps searching for further matches):
List<string> listResult = list2.Where(x => list1.Any(x.Contains)).ToList();
Parallelize the loop with AsParallel() - if you have a multicore system there will be a huge performance improvement:
List<string> listResult = list2.AsParallel().Where(x => list1.Any(x.Contains)).ToList();
Runtime comparison:
(4-core system, list1 with 1,000 items, list2 with 1,000,000 items)
Without AsParallel(): 91 seconds
With AsParallel(): 23 seconds
The other way is Parallel.ForEach with a thread-safe result collection:
System.Collections.Concurrent.ConcurrentBag<string> listResult = new System.Collections.Concurrent.ConcurrentBag<string>();
System.Threading.Tasks.Parallel.ForEach<string>(list2, str2 =>
{
    foreach (string str1 in list1)
    {
        if (str2.Contains(str1))
        {
            listResult.Add(str2);
            // break the loop once a match is found to avoid duplicates and improve performance
            break;
        }
    }
});
Side note: you have to iterate over list2 in the outer loop and break; after a match, otherwise you add items twice: https://dotnetfiddle.net/VxoRUW

Contains uses a naive approach to string searching. You can improve on that by looking into string-search algorithms.
One way to do this could be to create a generalized suffix tree for all your search words, then iterate through the items in list2 to see if they match.
Still, this might be overkill. You can first try some simple optimizations, as proposed by fubo, to see if that's fast enough for you.

List<string> is not a suitable data structure for solving this problem efficiently.
What you are looking for is a trie or a DAWG to store every word from your original dictionary list1.
The aim is that for every letter of a word from list2, you only have at most 26 branches to check.
With this data structure, instead of reading through a big list of words until you find one, you look a word up like in a paper dictionary, and that should be faster. Applications that look up every word of a language in a text use this principle.
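As an illustration of the idea (a hypothetical minimal trie, not a full DAWG, and simplified to lowercase matching): insert every word from list1, then for each phrase in list2 try to walk the trie from every starting position.

```csharp
using System;
using System.Collections.Generic;

// Minimal trie sketch (illustrative; TrieNode/Trie are hypothetical names).
class TrieNode
{
    public Dictionary<char, TrieNode> Children = new Dictionary<char, TrieNode>();
    public bool IsWord;
}

class Trie
{
    private readonly TrieNode _root = new TrieNode();

    public void Add(string word)
    {
        TrieNode node = _root;
        foreach (char c in word.ToLowerInvariant())
        {
            if (!node.Children.TryGetValue(c, out TrieNode next))
            {
                next = new TrieNode();
                node.Children[c] = next;
            }
            node = next;
        }
        node.IsWord = true;
    }

    // Returns true if any stored word occurs as a substring of text.
    public bool ContainsAnyWord(string text)
    {
        string lower = text.ToLowerInvariant();
        for (int start = 0; start < lower.Length; start++)
        {
            TrieNode node = _root;
            for (int i = start; i < lower.Length; i++)
            {
                if (!node.Children.TryGetValue(lower[i], out node))
                    break; // no branch for this letter: abandon this start position
                if (node.IsWord)
                    return true;
            }
        }
        return false;
    }
}
```

Building the trie costs the total length of list1 once; each lookup then depends only on the phrase length, not on how many words list1 holds.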

Since it seems you want to match entire words, you can use a HashSet to do a more efficient search and prevent iterating list1 and list2 more than once.
HashSet<string> species = new HashSet<string>(list1);
List<string> result = new List<string>();
foreach (string animal in list2)
{
    if (animal.Split(' ').Any(species.Contains))
        result.Add(animal);
}
If I run this (with list1 containing 1000 items and list2 containing 100,000 items) on a 4 core laptop:
The algorithm in the question: 37 seconds
The algorithm using AsParallel: 7 seconds
This algorithm: 0.17 seconds
With 1 million items in list2 this algorithm takes about a second.
Now while this approach does work, it might produce incorrect results. If list1 contains Lion, then a Sea lion in list2 will be added to the results even if there is no Sea lion in list1. (That is, if you use a case-insensitive StringComparer in the HashSet as suggested below.)
To solve that problem, you would need some way to parse the strings in list2 into a more complex Animal object. If you can control your input, that may be a trivial task, but in general it is hard. If you have some way of doing that, you can use a solution like the following:
public class Animal
{
    public string Color { get; set; }
    public string Species { get; set; }
    public string Breed { get; set; }
}
And then search the species in a HashSet.
HashSet<string> species = new HashSet<string>
{
    "Cat",
    "Dog",
    // etc.
};
List<Animal> animals = new List<Animal>
{
    new Animal { Color = "Gray", Species = "Cat" },
    new Animal { Color = "Green", Species = "Duck" },
    new Animal { Color = "White", Species = "Horse" },
    new Animal { Color = "Yellow", Species = "Dog", Breed = "Tasmania" }
    // etc.
};
var result = animals.Where(a => species.Contains(a.Species));
Note that the string search in the HashSet is case sensitive, if you do not want that you can supply a StringComparer as constructor argument:
new HashSet<string>(StringComparer.CurrentCultureIgnoreCase)
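For example, with an ignore-case comparer the same set matches any casing:

```csharp
using System;
using System.Collections.Generic;

// Case-insensitive species lookup: the comparer is applied to every Contains call.
var species = new HashSet<string>(StringComparer.CurrentCultureIgnoreCase) { "Cat", "Dog" };

Console.WriteLine(species.Contains("cat"));  // True
Console.WriteLine(species.Contains("DOG"));  // True
Console.WriteLine(species.Contains("Duck")); // False
```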

Related

compare string from 2 different collections and extract the difference

I need to compare 2 sets of strings which have some similar names, and I need to extract the similar names. How can I do that? They are both collections; let's say one of them is "Sanjay, Race" and the other is "Let, Sanjay". I need to extract Sanjay.
It depends on what data structure you have, but I suggest you work with an Array or a List if your collection is big enough to care about optimisation.
You want to go through the first of the two lists and, for each element of list1, compare it to every element of list2. Be careful, this might take a while (if your collections are big enough).
Might look like :
using System.Collections.Generic;

LinkedList<string> set1 = new LinkedList<string>();
LinkedList<string> set2 = new LinkedList<string>();
LinkedList<string> extracted = new LinkedList<string>();
// fill in your sets with loops if needed, see:
// https://learn.microsoft.com/fr-fr/dotnet/api/system.collections.generic.linkedlist-1?view=net-7.0
foreach (string name in set1)
{
    foreach (string name2 in set2)
    {
        if (string.Compare(name, name2) == 0)
        {
            extracted.AddLast(name); // AddAfter needs an existing node; AddLast appends the value
        }
    }
}
Please, do correct me (nicely) :)

How to exclude unwanted matches from randomly matched strings

For example I have such a code.
string[] person = new string[] { "Son", "Father", "Grandpa" };
string[] age = new string[] { "Young", "In his 40-s", "Old" };
string[] unwanted = new string[] { "Old Son", "Young GrandPa" }; // new string[] - the original was missing the []
Random X = new Random();
string Who = person[X.Next(0, person.Length)];  // X, not i - X is the Random instance
string HowOld = age[X.Next(0, age.Length)];
Console.WriteLine("{0} {1}", Who, HowOld);
I want to get all random matches BUT THEN exclude the two variants from the unwanted array (surely it's just an example; there can be many more arrays and possible bad matches).
What is a good way to do it? The key point is that I want to keep the possibility of getting ALL results anyway. So I want the option to exclude stuff AFTER generation, not to make it impossible to generate "Old Son".
First, define a class that holds both values from the arrays:
class PersonWithAge
{
    public string Person { get; set; }
    public string Age { get; set; }
}
Next, use LINQ to generate all possible combinations of Person and Age:
// Create cross product
var results = (from x in person
               from y in age
               select new PersonWithAge { Person = x, Age = y }).ToList();
Now (if desired) remove the exceptions:
results.RemoveAll(n => n.Person == "Son" && n.Age == "Old"
                    || n.Person == "Grandpa" && n.Age == "Young");
If you want to prevent some combinations, I believe you could have 'rules' of pairs/groups that cannot be matched together. For instance, a string[][] blocked or int[][] blocked where, accessing blocked[i][j], i is the current word and the array blocked[i] holds the indexes of the words (or the words themselves) it cannot be matched with. All of this assumes you might have more than one word you potentially don't want to match to the current one; with just one, a simple array will suffice. Then you just have to make sure the random value you use is not one of those 'forbidden' ones. Hope this helps.

Filter a IEnumerable<string> for unwanted strings

Edit: I have received a few very good suggestions. I will try to work through them and accept an answer at some point.
I have a large list of strings (800k) that I would like to filter in the quickest time possible against a list of unwanted words (ultimately profanity, but it could be anything).
The result I would ultimately like to see would be a list such as
Hello,World,My,Name,Is,Yakyb,Shell
would become
World,My,Name,Is,Yakyb
after being checked against
Hell,Heaven.
My code so far is:
var words = items
    .Distinct()
    .AsParallel()
    .Where(x => !WordContains(x, WordsUnwanted));

public static bool WordContains(string word, List<string> words)
{
    for (int i = 0; i < words.Count; i++)
    {
        if (word.Contains(words[i]))
        {
            return true;
        }
    }
    return false;
}
This is currently taking about 2.3 seconds (9.5 without parallel) to process 800k words, which as a one-off is no big deal. However, as a learning exercise: is there a quicker way of processing?
The unwanted words list is 100 words long.
None of the words contain punctuation or spaces.
Steps taken: duplicates removed from all lists.
Step to see if working with an array is quicker (it isn't); interestingly, changing the parameter words to a string[] makes it 25% slower.
Adding AsParallel() reduced the time to ~2.3 seconds.
Try the method called Except.
http://msdn.microsoft.com/en-AU/library/system.linq.enumerable.except.aspx
var words = new List<string>() {"Hello","Hey","Cat"};
var filter = new List<string>() {"Cat"};
var filtered = words.Except(filter);
Also how about:
var words = new List<string>() {"Hello","Hey","cat"};
var filter = new List<string>() {"Cat"};
// Perhaps a Except() here to match exact strings without substrings first?
var filtered = words.Where(i=> !ContainsAny(i,filter)).AsParallel();
// You could experiment with AsParallel() and see
// if running the query parallel yields faster results on larger string[]
// AsParallel probably not worth the cost unless list is large
public bool ContainsAny(string str, IEnumerable<string> values)
{
    // && (not ||): only search when we have a string AND at least one filter value
    if (!string.IsNullOrEmpty(str) && values.Any())
    {
        foreach (string value in values)
        {
            // Ignore-case comparison from @TimSchmelter
            if (str.IndexOf(value, StringComparison.OrdinalIgnoreCase) != -1) return true;
            //if (str.ToLowerInvariant().Contains(value.ToLowerInvariant()))
            //    return true;
        }
    }
    return false;
}
A couple of things.
Alteration 1 (nice and simple):
I was able to speed up the run (fractionally) by using a HashSet instead of the Distinct method.
var words = new HashSet<string>(items) //this uses HashCodes
.AsParallel()...
Alteration 2 (bear with me ;) ):
Regarding @Tim's comment, Contains may not give you enough to search for blacklisted words. For example, Takeshita is a street name.
You have already identified that you would like the stemmed form of the word: for example, Apples would be treated as Apple. To do this we can use stemming algorithms such as the Porter stemmer.
If we stem a word then we may not need Contains(x); we can use Equals(x) or, even better, compare the hash codes (the fastest way).
var filter = new HashSet<string>(
    new[] { "hello", "of", "this", "and", "for", "is",
            "bye", "the", "see", "in", "an",
            "top", "v", "t", "e", "a" });

var list = new HashSet<string>(items)
    .AsParallel()
    .Where(x => !filter.Contains(new PorterStemmer().Stem(x)))
    .ToList();
this will compare the words on their hash codes, int == int.
The use of the stemmer did not slow things down, as we complemented it with the HashSet (an O(1) lookup for the filter list), and it returned a larger list of results.
I am using the Porter stemmer located in the Lucene.Net code; it is not thread-safe, thus we new one up each time.
Issue with Alteration 2 (Alteration 2a): as with most natural language processing, it's not simple. What happens when:
the word is a combination of banned words, "GrrArgh" (where Grr and Argh are banned);
the word is spelt intentionally wrong, "Frack", but still has the same meaning as a banned word (sorry to the forum ppl);
the word is spelt with spaces, "G r r";
the banned word is not a word but a phrase, poor example: "son of a Barrel".
Forums use humans to fill these gaps, or introduce a whitelist. (Given that you mentioned big-O: we check 2 lists for every item, so this has a performance hit of roughly 2n^2; removing the leading constant leaves n^2, but I'm a little rusty on my big-O.)
Change your WordContains method to use a single Aho-Corasick search instead of ~100 Contains calls (and of course initialize the Aho-Corasick search tree just once).
You can find an open-source implementation here: http://www.codeproject.com/script/Articles/ViewDownloads.aspx?aid=12383
After initialization of the StringSearch class, you call its public bool ContainsAny(string text) method for each of your 800k strings.
A single call will take O(length of the string) time no matter how long your list of unwanted words is.
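The linked StringSearch class implements this; as an illustration of the technique itself, here is a minimal self-contained Aho-Corasick sketch (not the CodeProject code, and simplified: it only answers "contains any pattern", case-sensitively):

```csharp
using System;
using System.Collections.Generic;

// Compact Aho-Corasick sketch: build the automaton once, then each
// ContainsAny call is O(text length) regardless of the pattern count.
class AhoCorasick
{
    private class Node
    {
        public Dictionary<char, Node> Next = new Dictionary<char, Node>();
        public Node Fail;
        public bool Terminal;
    }

    private readonly Node _root = new Node();

    public AhoCorasick(IEnumerable<string> patterns)
    {
        // Build the pattern trie.
        foreach (string p in patterns)
        {
            Node node = _root;
            foreach (char c in p)
            {
                if (!node.Next.TryGetValue(c, out Node child))
                {
                    child = new Node();
                    node.Next[c] = child;
                }
                node = child;
            }
            node.Terminal = true;
        }
        // Breadth-first pass to set failure links.
        var queue = new Queue<Node>();
        foreach (Node child in _root.Next.Values)
        {
            child.Fail = _root;
            queue.Enqueue(child);
        }
        while (queue.Count > 0)
        {
            Node node = queue.Dequeue();
            foreach (KeyValuePair<char, Node> kv in node.Next)
            {
                Node f = node.Fail;
                while (f != null && !f.Next.ContainsKey(kv.Key))
                    f = f.Fail;
                kv.Value.Fail = f == null ? _root : f.Next[kv.Key];
                kv.Value.Terminal |= kv.Value.Fail.Terminal; // a suffix may itself be a pattern
                queue.Enqueue(kv.Value);
            }
        }
    }

    public bool ContainsAny(string text)
    {
        Node node = _root;
        foreach (char c in text)
        {
            while (node != _root && !node.Next.ContainsKey(c))
                node = node.Fail; // follow failure links on mismatch
            if (node.Next.TryGetValue(c, out Node next))
                node = next;
            if (node.Terminal)
                return true;
        }
        return false;
    }
}
```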
I was interested to see if I could come up with a faster way of doing this - but I only managed one little optimization. That was to check the index of a string occurring within another, because it firstly seems to be slightly faster than Contains and secondly lets you specify case insensitivity (if that is useful to you).
Included below is a test class I wrote - I have used >1 million words and am searching using a case-sensitive test in all cases. It tests your method, and also a regular expression I am trying to build up on the fly. You can try it for yourself and see the timings; the regular expression doesn't work as fast as the method you provided, but then I could be building it incorrectly. I use (?i) before (word1|word2...) to specify case insensitivity in a regular expression (I would love to find out how that could be optimised - it's probably suffering from the classic backtracking problem!).
The searching methods (be it regular expressions or the original method provided) seem to get progressively slower as more 'unwanted' words are added.
Anyway - hope this simple test helps you out a bit:
class Program
{
    static void Main(string[] args)
    {
        // Load your string here - I got War and Peace from Project Gutenberg (http://www.gutenberg.org/ebooks/2600.txt.utf-8) and loaded it twice to give 1.2 million words
        List<string> loaded = File.ReadAllText(@"D:\Temp\2600.txt").Split(new string[] { " " }, StringSplitOptions.RemoveEmptyEntries).ToList();
        List<string> items = new List<string>();
        items.AddRange(loaded);
        items.AddRange(loaded);
        Console.WriteLine("Loaded {0} words", items.Count);

        Stopwatch sw = new Stopwatch();
        List<string> WordsUnwanted = new List<string> { "Hell", "Heaven", "and", "or", "big", "the", "when", "ur", "cat" };
        StringBuilder regexBuilder = new StringBuilder("(?i)(");
        foreach (string s in WordsUnwanted)
        {
            regexBuilder.Append(s);
            regexBuilder.Append("|");
        }
        regexBuilder.Replace("|", ")", regexBuilder.Length - 1, 1);
        string regularExpression = regexBuilder.ToString();
        Console.WriteLine(regularExpression);

        List<string> words = null;
        bool loop = true;
        while (loop)
        {
            Console.WriteLine("Enter test type - 1, 2, 3, 4 or Q to quit");
            ConsoleKeyInfo testType = Console.ReadKey();
            switch (testType.Key)
            {
                case ConsoleKey.D1:
                    sw.Reset();
                    sw.Start();
                    words = items
                        .Distinct()
                        .AsParallel()
                        .Where(x => !WordContains(x, WordsUnwanted)).ToList();
                    sw.Stop();
                    Console.WriteLine("Parallel (original) process took {0}ms and found {1} matching words", sw.ElapsedMilliseconds, words.Count);
                    words = null;
                    break;
                case ConsoleKey.D2:
                    sw.Reset();
                    sw.Start();
                    words = items
                        .Distinct()
                        .Where(x => !WordContains(x, WordsUnwanted)).ToList();
                    sw.Stop();
                    Console.WriteLine("Non-parallel (original) process took {0}ms and found {1} matching words", sw.ElapsedMilliseconds, words.Count);
                    words = null;
                    break;
                case ConsoleKey.D3:
                    sw.Reset();
                    sw.Start();
                    words = items
                        .Distinct()
                        .AsParallel()
                        .Where(x => !Regex.IsMatch(x, regularExpression)).ToList();
                    sw.Stop();
                    Console.WriteLine("Non-compiled regex (parallel) process took {0}ms and found {1} matching words", sw.ElapsedMilliseconds, words.Count);
                    words = null;
                    break;
                case ConsoleKey.D4:
                    sw.Reset();
                    sw.Start();
                    words = items
                        .Distinct()
                        .Where(x => !Regex.IsMatch(x, regularExpression)).ToList();
                    sw.Stop();
                    Console.WriteLine("Non-compiled regex (non-parallel) process took {0}ms and found {1} matching words", sw.ElapsedMilliseconds, words.Count);
                    words = null;
                    break;
                case ConsoleKey.Q:
                    loop = false;
                    break;
                default:
                    continue;
            }
        }
    }

    public static bool WordContains(string word, List<string> words)
    {
        for (int i = 0; i < words.Count; i++)
        {
            // Found that this was a bit faster and also lets you check the casing...!
            //if (word.Contains(words[i]))
            if (word.IndexOf(words[i], StringComparison.InvariantCultureIgnoreCase) >= 0)
                return true;
        }
        return false;
    }
}
Ah, filtering words based on matches from a "bad" list. This is a clbuttic problem that has tested the consbreastution of many programmers. My mate from Scunthorpe wrote a dissertation on it.
What you really want to avoid is a solution that tests a word in O(lm), where l is the length of the word to test and m is the number of bad words. In order to do this, you need a solution other than looping through the bad words. I had thought that a regular expression would solve this, but I forgot that typical implementations have an internal data structure that is increased at every alternation. As one of the other solutions says, Aho-Corasick is the algorithm that does this. The standard implementation finds all matches, yours would be more efficient since you could bail out at the first match. I think this provides a theoretically optimal solution.

Removing one items from an array and moving the others backwards

I'll just go straight to the point. I want to remove an item from an array and shift the following items backwards by one position. Let's say I have this:
string[] fruits = { "Banana", "Apple", "Watermelon", "Pear", "Mango" };
For example, let's say I want to remove the "Apple" so I'll do this.
fruits[1] = "";
Now all that left are:
{ "Banana", "", "Watermelon", "Pear", "Mango" }
How do I really remove the Apple part and get only:
{ "Banana", "Watermelon", "Pear", "Mango" }
Note that the index of all the items from "Watermelon" until the end of the array moves 1 backward. Any ideas?
The List<T> class is the right one for you. It provides a Remove method which automatically moves the following elements backwards.
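A minimal sketch of that:

```csharp
using System.Collections.Generic;

var fruits = new List<string> { "Banana", "Apple", "Watermelon", "Pear", "Mango" };
fruits.Remove("Apple");   // remove by value...
// fruits.RemoveAt(1);    // ...or by index; either way later elements shift backwards
// fruits is now { "Banana", "Watermelon", "Pear", "Mango" }
```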
If you really want to use Arrays, you can use Linq to filter your list and convert to array:
string[] fruits = { "Banana", "Apple", "Watermelon", "Pear", "Mango" };
fruits = fruits.Where(f => f != "Apple").ToArray();
If you're not required to use an array, look at the List class. A list allows items to be added and removed.
Similar to Wouter's answer, if you want to remove by item index rather than item value, you could do:
fruits = fruits.Where((s, i) => i != 1).ToArray();
You can do something like this:
for (int i = 1; i + 1 < fruits.Length; i++)
    fruits[i] = fruits[i + 1];
System.Array.Resize(ref fruits, fruits.Length - 1); // Array.Resize takes the array by ref and returns void
If you do not care about the order of the fruit in the array, a smarter way to do it is as follows:
fruits[1] = fruits[fruits.Length - 1];
System.Array.Resize(ref fruits, fruits.Length - 1);
I think one of the most useful things a new programmer can do is study and understand the various collection types.
While I think the List option that others have mentioned is probably what you are looking for, it's worth looking at a LinkedList class if you are doing a lot of insertions and deletions and not a lot of looking up by index.
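A quick sketch of that trade-off: LinkedList<T> removes and inserts in O(1) once you hold a node, but offers no index lookup:

```csharp
using System.Collections.Generic;

var fruits = new LinkedList<string>(new[] { "Banana", "Apple", "Mango" });
LinkedListNode<string> node = fruits.Find("Apple"); // O(n) scan to locate the node
fruits.Remove(node);                                // O(1) removal given the node
fruits.AddAfter(fruits.First, "Watermelon");        // O(1) insertion after a node
// fruits is now: Banana, Watermelon, Mango
```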
This is an example of how I used lists and arrays to remove an item from an array. Note I also show you how to use linq to search an array full of bad names to remove. Hope this helps someone.
public static void CheckBadNames(ref string[] parts)
{
    string[] BadName = new string[] { "LIFE", "ESTATE", "(", ")", "-", "*", "AN", "LIFETIME", "INTREST", "MARRIED",
                                      "UNMARRIED", "MARRIED/UNMARRIED", "SINGLE", "W/", "/W", "THE", "ET",
                                      "ALS", "AS", "TENANT" };
    List<string> list = new List<string>(parts); // convert the input array to a list (filter parts, not BadName)
    // RemoveAll instead of Remove inside a foreach: modifying a list while enumerating it throws
    list.RemoveAll(part => BadName.Any(s => part.ToUpper().Contains(s)));
    parts = list.ToArray(); // convert list back to array
}
As a beginner 3 years ago, I started making the software that I'm still working on today. I used an array for the 'PartyMembers' of the game, and today I'm basically regretting it and having to spend a ton of time converting all this hard-coded $#!t into a list.
Case in point: just use Lists if you can; arrays are a nightmare in comparison.

Get the symmetric difference from generic lists

I have 2 separate Lists and I need to compare the two and get everything but the intersection of the two lists. How can I do this in C#?
If you mean the set of everything but the intersection (symmetric difference) you can try:
var set = new HashSet<Type>(list1);
set.SymmetricExceptWith(list2);
You can use Except to get everything but the intersection of the two lists.
var differences = listA.Except(listB).Union(listB.Except(listA));
If you want to get everything but the union:
var allButUnion = new List<MyClass>();
(The union is everything in both lists - everything but the union is the empty set...)
Do you mean everything that's only in one list or the other? How about:
var allButIntersection = a.Union(b).Except(a.Intersect(b));
That's likely to be somewhat inefficient, but it fairly simply indicates what you mean (assuming I've interpreted you correctly, of course).
Here is a generic Extension method. Rosetta Code uses Concat, and Djeefther Souza says it's more efficient.
public static class LINQSetExtensions
{
    // Made aware of the name for this from Swift
    // https://stackoverflow.com/questions/1683147/get-the-symmetric-difference-from-generic-lists
    // Generic implementation adapted from https://www.rosettacode.org/wiki/Symmetric_difference
    public static IEnumerable<T> SymmetricDifference<T>(this IEnumerable<T> first, IEnumerable<T> second)
    {
        // I've used Union in the past, but I suppose Concat works.
        // No idea if they perform differently.
        return first.Except(second).Concat(second.Except(first));
    }
}
I haven't actually benchmarked it. I think it would depend on how Union vs. Concat are implemented. In my dreamworld, .NET uses a different algorithm depending on data type or set size, though for IEnumerable it can't determine set size in advance.
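As a quick sanity check of the Union-versus-Concat question: Except already yields distinct elements, and the two Except results are disjoint by construction, so Concat and Union produce the same set here (Union just performs a redundant extra de-duplication):

```csharp
using System;
using System.Linq;

int[] a = { 1, 2, 2, 3, 4 };
int[] b = { 3, 4, 5 };

// Except de-duplicates, so the duplicate 2 is already collapsed,
// and the two halves cannot overlap.
var viaConcat = a.Except(b).Concat(b.Except(a)).ToList(); // { 1, 2, 5 }
var viaUnion  = a.Except(b).Union(b.Except(a)).ToList();  // { 1, 2, 5 }
```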
Also, you can pretty much ignore my answer - Jon Skeet says that the HashSet method "Excellent - that looks like the best way of doing it to me."
Something like this?
String[] one = new String[] { "Merry", "Metal", "Median", "Medium", "Malfunction", "Mean", "Measure", "Melt", "Merit", "Metaphysical", "Mental", "Menial", "Mend", "Find" };
String[] two = new String[] { "Merry", "Metal", "Find", "Puncture", "Revise", "Clamp", "Menial" };
List<String> tmp = one.Except(two).ToList();
tmp.AddRange(two.Except(one));
String[] result = tmp.ToArray();
var theUnion = list1.Concat(list2);
var theIntersection = list1.Intersect(list2);
var theSymmetricDifference = theUnion.Except(theIntersection);
Use Except:
List<int> l1 = new List<int>(new[] { 1, 2, 3, 4 });
List<int> l2 = new List<int>(new[] { 2, 4 });
var l3 = l1.Except(l2);
