Efficient way to find the difference between 2 IEnumerables - c#

I have
IEnumerable<Tuple<string, string>> allInfo
and IEnumerable<string> info1dim. What is an efficient way to find the difference between info1dim and the first dimension of allInfo? For example:
allInfo = {<"data1", "addinfo1">, <"data2", "addinfo2">, <"data3", "addinfo3">}
and
info1dim = {"data3", "data1", "data4"}
The result I expect is
{"data4"}
What is the most efficient way to do that?
I don't want to run two loops. The IEnumerables are huge (~100000 elements)

The C# HashSet collection has ExceptWith, UnionWith, and IntersectWith methods. What you want could be done like this.
var set1 = new HashSet<string>(allInfo.Select(t => t.Item1));
var set2 = new HashSet<string>(info1dim);
var set1_but_not_set2 = new HashSet<string>(set1);
set1_but_not_set2.ExceptWith(set2);
var set2_but_not_set1 = new HashSet<string>(set2);
set2_but_not_set1.ExceptWith(set1);
Be careful, though: HashSet is a mutable collection, and these methods modify the set they are called on. The whole thing is O(n): constructing the HashSet objects requires iterating, and so do the ExceptWith operations.

You could use a LINQ Except() like so:
info1dim.Except(allInfo.Select(i => i.Item1));
Note that Except() uses a hash-based set internally, so this is still O(n).
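To make the Except() approach concrete, here is a minimal sketch using the sample data from the question. The class name DiffExample and the method name MissingKeys are made up for illustration; only Except, Select, and ToList come from LINQ.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class DiffExample
{
    // Hypothetical helper: returns the items of info1dim that do not
    // appear as Item1 of any tuple in allInfo.
    public static List<string> MissingKeys(
        IEnumerable<Tuple<string, string>> allInfo,
        IEnumerable<string> info1dim)
    {
        // Except builds a set from its argument once, then streams
        // info1dim through it, so each input is enumerated a single time.
        return info1dim.Except(allInfo.Select(t => t.Item1)).ToList();
    }
}
```

With the question's data (keys "data1", "data2", "data3" in allInfo and "data3", "data1", "data4" in info1dim), MissingKeys returns a list containing only "data4".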

Maybe something like this?
var diff = info1dim.Where(x => !allInfo.Any(c => c.Item1 == x));
This scans allInfo once for every element of info1dim, so it is O(n * m). If you store the IEnumerable<Tuple<string, string>> in a Dictionary<string, string> instead, it becomes a lot faster. Then you could write:
Dictionary<string, string> allInfo;
IEnumerable<string> info1dim;
var diff = info1dim.Where(x => !allInfo.ContainsKey(x));

Load your info1dim into a HashSet and call Remove for each item in allInfo:
// n: size of info1dim ; m: size of allInfo
var diff = new HashSet<string> (info1dim); // O(n)
foreach (var tuple in allInfo) // O(m)
diff.Remove (tuple.Item1); // O(1)
I didn't recall that ExceptWith existed before Ollie's answer. After checking the reference source, ExceptWith basically does the same thing (foreach -> Remove), so it should be preferred; I'm keeping my code as is for informational purposes, though.

Related

Dictionary .Values and .Keys to array (in same order)

Is there an elegant way to get a dictionary's keys and values in the same order? I am worried that if I use dict.Values.ToArray() and dict.Keys.ToArray() (or dict.Select(obj => obj.Key) and dict.Select(obj => obj.Value)), that they won't be in the same order.
The simple way to execute this is:
foreach (var keyAndVal in dict)
{
keyList.Add(keyAndVal.Key);
valueList.Add(keyAndVal.Value);
}
var keyArray = keyList.ToArray();
var valueArray = valueList.ToArray();
To me, this feels like the kind of thing that LINQ was made for (but I know that dictionary iteration order is not guaranteed to stay the same in two different calls). Is there an elegant (i.e. LINQ, etc.) way to get these in the same order?
Thanks a lot for your help.
As vcsjones points out, for a standard dictionary, the Keys and Values collections will be in the same order. However, if you want a method that will create key and value arrays that will always be in the same order, for any implementation of IDictionary<TKey, TValue>, you could do something like this:
var keyArray = new TKey[dict.Count];
var valueArray = new TValue[dict.Count];
var i = 0;
foreach (var keyAndVal in dict)
{
keyArray[i] = keyAndVal.Key;
valueArray[i] = keyAndVal.Value;
i++;
}
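The single-pass loop above can be wrapped in a generic helper so that TKey and TValue are actually in scope. This is only a sketch; the names DictionarySplit and ToArrays are hypothetical and not part of any library.

```csharp
using System.Collections.Generic;

public static class DictionarySplit
{
    // Hypothetical helper: copies keys and values into two arrays during a
    // single enumeration, so index i in both arrays refers to the same entry.
    public static void ToArrays<TKey, TValue>(
        IDictionary<TKey, TValue> dict,
        out TKey[] keys,
        out TValue[] values)
    {
        keys = new TKey[dict.Count];
        values = new TValue[dict.Count];
        var i = 0;
        foreach (var pair in dict)
        {
            keys[i] = pair.Key;
            values[i] = pair.Value;
            i++;
        }
    }
}
```

Because both arrays are filled from the same enumeration, the pairing holds for any IDictionary<TKey, TValue> implementation, whatever order it enumerates in.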

Set of values in one or other list but not both

I am diffing two dictionaries, and I want the set of all keys in one or the other dictionary but not both (I don't care about order). Since this only involves the keys, we can do this with the IEnumerables of the keys of the dictionaries.
The easy way, involving 2 passes:
return first.Keys.Except(second.Keys).Concat(second.Keys.Except(first.Keys));
We can concat because the Excepts guarantee the lists will be entirely different.
But I sense there is a better, linqy way to do it.
I prefer a non-LINQy way:
var set = new HashSet<KeyType>(first.Keys);
set.SymmetricExceptWith(second.Keys);
Here's an alternative (but not better) LINQy way to yours:
var result = first.Keys.Union(second.Keys)
.Except(first.Keys.Intersect(second.Keys));
If you're looking for something (possibly) more performant:
var result = new HashSet<KeyType>();
foreach(var firstKey in first.Keys)
{
if(!second.ContainsKey(firstKey))
result.Add(firstKey);
}
foreach(var secondKey in second.Keys)
{
if(!first.ContainsKey(secondKey))
result.Add(secondKey);
}
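As a worked example of the SymmetricExceptWith approach mentioned above, here is a small sketch wrapped in a method. The names KeyDiff and SymmetricDifference are invented for illustration.

```csharp
using System.Collections.Generic;

public static class KeyDiff
{
    // Hypothetical helper: returns the keys present in exactly one
    // of the two dictionaries.
    public static HashSet<TKey> SymmetricDifference<TKey, TValue>(
        IDictionary<TKey, TValue> first,
        IDictionary<TKey, TValue> second)
    {
        // Start from a copy of the first key set so neither input is mutated.
        var set = new HashSet<TKey>(first.Keys);
        // Keeps elements that are in either collection, but not in both.
        set.SymmetricExceptWith(second.Keys);
        return set;
    }
}
```

For dictionaries with keys {"a", "b"} and {"b", "c"}, the result contains "a" and "c".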

How to remove x items from collection using LINQ?

Is there a way to remove all items except first one from any type of collection (Control.Items, List ....) using LINQ only ?
No. LINQ is designed for querying collections (no side-effects), not for adding or removing items.
What you can do is write a query that takes the first element of the collection:
var result = source.Take(1);
Note that LINQ doesn't work with all types of collections; you need a LINQ provider to make LINQ work. For instance, source must implement IEnumerable<T> to use the extension methods of the Enumerable Class (LINQ-to-Objects).
How about something using reflection?
static void RemoveButFirst(object o){
Type t = o.GetType();
System.Reflection.MethodInfo rm = t.GetMethod("RemoveAt",
new Type[]{typeof(int)});
System.Reflection.PropertyInfo count = t.GetProperty("Count");
for (int n = (int)(count.GetValue(o,null)) ; n>1; n--)
rm.Invoke(o, new object[]{n-1});
}
This would work any time your collection exposed an int Count property and a RemoveAt(int) method, which I think those collections should.
And a more concise version, using dynamic, if you work with C# 4.0:
public static void RemoveBut(dynamic col, int k){
for (int n = col.Count; n>k; n--)
col.RemoveAt(n-1);
}
You can use .Take(1), but it returns a new collection, and leaves the original intact.
The idea of LINQ came from functional programming, where everything is immutable. Because of that, LINQ was not designed to modify collections.
Jon Skeet has a comment on the subject: LINQ equivalent of foreach for IEnumerable<T>
How about (in linq):
var result = list.Where(l => l != list.First());
But this would be better:
var result = list.Take(1);
List<string> collection = new List<string>();
collection.RemoveAll(p => p.StartsWith("something"));
listXpto.Where(x => true /* here goes your query */)
.ToList() // snapshot first; removing while enumerating listXpto would throw
.ForEach(x => listXpto.Remove(x));
But I don't know the real utility of that.
Remember that the remove method is for ILists, not IQueryable in general.
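If the collection is known to implement IList<T>, the dynamic-based helper shown earlier in this thread can be written with static typing instead. This is a sketch under that assumption; it will not cover collections such as Control.Items that only implement the non-generic IList. The class name CollectionTrim is hypothetical.

```csharp
using System.Collections.Generic;

public static class CollectionTrim
{
    // Keeps the first k items and removes the rest, in place.
    // Removing from the end avoids shifting the remaining elements.
    public static void RemoveBut<T>(IList<T> col, int k)
    {
        for (int n = col.Count; n > k; n--)
            col.RemoveAt(n - 1);
    }
}
```

Calling CollectionTrim.RemoveBut(list, 1) on a list holding 1, 2, 3, 4 leaves only the first element, 1.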

Linq Query On IDictionaryEnumerator Possible?

I need to clear items from cache that contain a specific string in the key. I have started with the following and thought I might be able to do a linq query
var enumerator = HttpContext.Current.Cache.GetEnumerator();
But I can't? I was hoping to do something like
var enumerator = HttpContext.Current.Cache.GetEnumerator().Key.Contains("subcat");
Any ideas on how I could achieve this?
The Enumerator created by the Cache generates DictionaryEntry objects. Furthermore, a Cache may have only string keys.
Thus, you can write the following:
var httpCache = HttpContext.Current.Cache;
var toRemove = httpCache.Cast<DictionaryEntry>()
.Select(de=>(string)de.Key)
.Where(key=>key.Contains("subcat"))
.ToArray(); //use .ToArray() to avoid concurrent modification issues.
foreach(var keyToRemove in toRemove)
httpCache.Remove(keyToRemove);
However, this is a potentially slow operation when the cache is large: the cache is not designed to be used like this. You should ask yourself whether an alternative design isn't possible and preferable. Why do you need to remove several cache keys at once, and why aren't you grouping cache keys by substring?
Since Cache is an IEnumerable, you can freely apply all LINQ methods you need to it. The only thing you need is to cast it to IEnumerable<DictionaryEntry>:
var keysQuery = HttpContext.Current.Cache
.Cast<DictionaryEntry>()
.Select(entry => (string)entry.Key)
.Where(key => key.Contains("subcat"));
Now keysQuery is a lazily evaluated sequence of all keys containing "subcat". But if you need to remove such entries from the cache, the simplest way is to just use a foreach statement.
I don't think it is a great idea to walk the entire cache anyway, but you could do it non-LINQ with something like:
var iter = HttpContext.Current.Cache.GetEnumerator();
using (iter as IDisposable)
{
while (iter.MoveNext())
{
string s;
if ((s = iter.Key as string) != null && s.Contains("subcat"))
{
//... let the magic happen
}
}
}
to do it with LINQ you could do something like:
public static class Utils
{
public static IEnumerable<KeyValuePair<object, object>> ForLinq(this IDictionaryEnumerator iter)
{
using (iter as IDisposable)
{
while (iter.MoveNext()) yield return new KeyValuePair<object, object>(iter.Key, iter.Value);
}
}
}
and use like:
var items = HttpContext.Current.Cache.GetEnumerator().ForLinq()
.Where(pair => ((string)pair.Key).Contains("subcat"));
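The Cast<DictionaryEntry>() pattern from the answers above can be tried without an ASP.NET context. The sketch below uses a Hashtable as a stand-in for HttpContext.Current.Cache, on the assumption that both enumerate as DictionaryEntry items; the names CacheKeyCleanup and RemoveKeysContaining are invented for illustration.

```csharp
using System.Collections;
using System.Collections.Generic;
using System.Linq;

public static class CacheKeyCleanup
{
    // Hypothetical helper: removes every entry whose string key contains
    // the given fragment and returns the keys that were removed.
    // Hashtable stands in for the ASP.NET Cache here.
    public static List<string> RemoveKeysContaining(Hashtable cache, string fragment)
    {
        var toRemove = cache.Cast<DictionaryEntry>()
            .Select(entry => entry.Key)
            .OfType<string>()                 // skip any non-string keys
            .Where(key => key.Contains(fragment))
            .ToList();                        // snapshot before mutating the table

        foreach (var key in toRemove)
            cache.Remove(key);

        return toRemove;
    }
}
```

The ToList() call plays the same role as ToArray() in the first answer: the query is fully evaluated before any entry is removed.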

Enumerable.ElementAt vs foreach

I have a dictionary which I need to keep updated with incoming data, after parsing the incoming data I have to check if there are any entries in the dictionary which are not present in the incoming data (incoming data when parsed is a list and I need to map it with the dictionary entries).
To avoid multiple loops when removing entries, I run a decrementing for loop over the dictionary count, fetch the key at each index using ElementAt, and check whether the entry is present in the incoming data; if not, I remove that entry from the dictionary. I did this because running a foreach loop over the dictionary keys and removing from it raises an exception, as the dictionary's key collection would be modified.
I want to understand whether doing this has any impact on execution time, and what the complexity of the ElementAt operation is.
ElementAt is useful if you need indexing semantics and cannot guarantee that the underlying enumeration supports indexing. It uses O(1) indexing when the source is an IList<T> (which includes List<T> and arrays), but otherwise it is O(n)*. Calling it once per element therefore turns what would be an O(n) pass over a list into O(n * n).
If however you got a copy of the keys with dict.Keys.ToList() then you could safely foreach through that, as it won't be changed by changes to your dictionary.
What isn't clear is why you don't just replace the old dictionary with the new one, which would be considerably faster again (simple reference assignment).
*Update: In the .NET Core version of LINQ there is a wider range of cases where ElementAt() is O(1), such as the results of a Select() done on an IList<T>. Also, OrderBy(…).ElementAt(…) is now O(n) rather than O(n log n), as the combined sequence is turned into a quick-select rather than a quicksort followed by an iteration.
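The suggestion to iterate over a copy of the keys could look roughly like the sketch below. The names DictionarySync and RemoveMissing are hypothetical, and putting the incoming data into a HashSet is an added assumption so that each membership check is O(1) on average.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class DictionarySync
{
    // Hypothetical helper: removes every dictionary entry whose key
    // does not occur in the incoming data.
    public static void RemoveMissing<TKey, TValue>(
        Dictionary<TKey, TValue> dict,
        IEnumerable<TKey> incoming)
    {
        var incomingSet = new HashSet<TKey>(incoming);

        // dict.Keys.ToList() is a snapshot, so removing from dict inside
        // the loop does not invalidate the enumeration.
        foreach (var key in dict.Keys.ToList())
        {
            if (!incomingSet.Contains(key))
                dict.Remove(key);
        }
    }
}
```

This makes one pass over the keys instead of calling ElementAt once per index, so the overall cost stays linear in the size of the dictionary plus the incoming data.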
Use the "mark then remove" trick as a workaround for the inability to modify a collection while iterating over it.
var dict = new Dictionary<int, string>
{
{3, "kuku" },
{1, "zOl"}
};
var newKeys = new List<int> { 1, 2, 4 };
var toRemove = dict.Keys.Except(newKeys).ToList();
foreach (var k in toRemove)
dict.Remove(k);
ElementAt() does use the enumerator as pointed out, so if you want fastest index access you should use an array. Of course, that comes at the price of a fixed length, but if the size of the array is not constantly changing, it is feasible that Array.Resize() might be the way to go.
Except() seems like it would work here:
Dictionary<int, string> dict = new Dictionary<int, string>
{
{3, "kuku" },
{1, "zOl"}
};
IEnumerable<int> data = new List<int> { 1, 2, 4 };
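// Materialize the keys first: Except() is lazy, and on .NET Framework removing
// from dict while dict.Keys is still being enumerated throws InvalidOperationException.
IEnumerable<int> toRemove = dict.Keys.Except(data).ToList();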
foreach(var x in toRemove)
dict.Remove(x);
I think that ElementAt() uses the enumerator to get to the required element.
It will be the same as:
object returnedElement = null;
int i = 0;
foreach (var obj in dictionary.Keys)
{
if (i++ == at)
{
returnedElement = obj;
break;
}
}
You can get a Dictionary of the matching entries in target (Dictionary) and source (List) as follows:
using System;
using System.Collections.Generic;
using System.Linq;
Dictionary<string, string> target = new Dictionary<string, string>();
List<string> source = new List<string>();
target.Add("a", "this is a");
target.Add("b", "this is b");
source.Add("a");
source.Add("c");
target = Enumerable.Select(target, n => n.Key).
Where(n => source.Contains(n)).ToDictionary(n => n, k => target[k]);
It's not clear to me if you want to include new entries from the List into the Dictionary - if so, I am not sure what the new entry values would be if you only have a List of new incoming data.
