I am aware of hashsetA.ExceptWith(hashsetB) to remove elements from hashsetA that exist in hashsetB. However, I want to remove elements from hashsetA that don't exist in hashsetB.
Currently I just copy hashsetA to a new HashSet, then use ExceptWith() twice:
var hashsetC = new HashSet<T>(hashsetA); // T is the element type of your sets
hashsetC.ExceptWith(hashsetB);
hashsetA.ExceptWith(hashsetC);
The performance of this is plenty good enough for my purposes, but I was wondering if there's a built-in method to make this faster/more concise?
Or am I missing an obvious way to select from the sets?
Simply use the IntersectWith method.
hashsetA.IntersectWith(hashsetB);
var res = hashsetA.Where(p => hashsetB.Contains(p));
Given that lookup in a HashSet is O(1), that should sum to O(n).
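For completeness, a minimal sketch of that approach (assuming the sets hold ints and using System.Linq is available) that materializes the filtered result into a new set:
var hashsetA = new HashSet<int> { 1, 2, 3, 4 };
var hashsetB = new HashSet<int> { 2, 4, 6 };

// Keep only the elements of A that also appear in B; this builds a new set
// instead of mutating A (IntersectWith would modify A in place).
var res = new HashSet<int>(hashsetA.Where(p => hashsetB.Contains(p)));
// res now contains { 2, 4 }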
Can you suggest a way to sort IEnumerable<byte> indexes in .NET 3.0 (no LINQ)?
Of course it is possible to determine the length of indexes, create an array, copy it element by element, and then call Array.Sort(array). But maybe you can suggest something else?
As long as you aren't using the 2.0 compiler (so: VS 2008 / 2010 / 2012), you can use LINQBridge, and use LINQ-to-Objects from your .NET 2.0/3.0 code.
The other lazy solution is:
List<byte> list = new List<byte>(indexes);
list.Sort();
// list is now a sorted clone of the data
I don't think there is any other solution than iterating over it "manually" in C# 2.0.
Another option besides creating an array:
You can create a List<byte>
var list = new List<byte>(indexes);
list.Sort(delegate(byte b1, byte b2)
{
    return b1.CompareTo(b2); // your comparison logic here (ascending order)
});
It's more compact than a simple for or foreach iteration over the collection.
The entire IEnumerable<> has to be read when you sort it, so there is no way around that. Even the LINQ to Objects OrderBy method keeps the entire collection in memory.
Create a List<byte> from the IEnumerable<byte> and sort it:
List<byte> list = new List<byte>(indexes);
list.Sort();
Since you can't really change an IEnumerable, you're going to have to copy the data somewhere else to sort it.
However, note that you're sorting bytes, so you can use a bucket (counting) sort for ultra-efficient sorting.
http://www.codeproject.com/Articles/80546/Comparison-Sorting-Algorithms-in-C-Explained
This came in handy when I was searching for a solution.
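As a rough illustration (not taken from the linked article), a counting-sort sketch for bytes, assuming the input is the IEnumerable<byte> named indexes from the question:
// Counting sort: one pass to tally each byte value, one pass to emit them in ascending order.
int[] counts = new int[256];
foreach (byte b in indexes)
    counts[b]++;

List<byte> sorted = new List<byte>();
for (int value = 0; value < 256; value++)
    for (int i = 0; i < counts[value]; i++)
        sorted.Add((byte)value);
This runs in O(n) time with no comparisons and no LINQ, so it works fine in C# 2.0.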
So, here's the deal.
(My current use-case is in C#, but I'm also interested in the general algorithmic case)
I am given two Arrays of objects (I don't get to alter the code that creates these arrays, unfortunately).
Each object has (as part of it) a .Name property, a string.
These strings are unique per object, and each has zero or one matching string in the other array.
What I need to do is efficiently pair these objects based on that string, into some sort of collection that allows me access to the paired objects. The strings need to match exactly to be considered a match, so I don't need any Upper or CaseInsensitive, etc.
Sadly, these lists are not sorted.
The lists themselves are maybe 30-50 items, but I need to repeat the algorithm on thousands of these array-pairs in a row, so efficiency is important.
Since I know that there's 0 or 1 match, and I know that most of them will be 1 match, I feel like there's a more efficient algorithm than x*y (Foreach item in x, foreach item in y, if x=y then x and y are a match)
I believe the most likely options are:
Keep the unsorted list and just do x*y, but drop items from the list once I've found them so I don't check ones already-found,
OR:
Convert both to Dictionaries and then do an indexed lookup on each (array2[currentArray1Item])
OR:
Sort the lists myself (Array.Sort()), and then having sorted arrays I can probably do something clever like jump to the index in B where I'd expect to find it (wherever it was in A) and then move up or down based on string until I either find it or pass where it should've been.
Then once that's done I need to figure out how to store it, I suppose I can make a custom ObjectPair class that just holds objects A and B. No need to do anything fancy here, since I'm just going to ForEach on the pairs.
So the questions are:
Are any of the above algorithms the fastest way to do this (if not, what is?) and is there some existing C# structure that'd conveniently hold the found pairs?
EDIT: Array.Sort() is a method that exists, so I don't need to convert the array to List to sort. Good to know. Updated above.
The question I have is: how much efficiency do we gain from the special handling if it requires us to sort both input arrays? According to the documentation for Array.Sort, it is O(n log n) on average and O(n ^ 2) in the worst case (quicksort). Once we have both arrays sorted, we then have another O(n) amount of work because we have to loop through the first one.
I interpret this to mean that the overall amount of work might actually increase because of the number of iterations required to sort, then process. This of course would be a different story if you could guarantee sorted arrays at the start, but as you said you cannot. (I should also note that you would need to create a custom IComparer<T> implementation to pass to Array.Sort so it knows to use the .Name property. That's not runtime work, but it's still work :-)
You might consider using a LINQ join, which only iterates the inner array a single time (see here for pseudocode). This is as opposed to the nested foreach statements, which would iterate the inner array for each element of the outer array. It's about as efficient as it can be in the general case and doesn't introduce the complexity of the special handling you suggested.
Here is an example implementation:
var pairs =
from item1 in array1
join item2 in array2 on item1.Name equals item2.Name
select new { item1, item2 };
foreach(var pair in pairs)
{
// Use the pair somehow
}
That very clearly states what you are doing with the data and also gives you an anonymous type representing each pair (so you don't have to invent a pairing). If you do end up going a different route, I would be interested in how it compares to this approach.
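If you do want to try the dictionary route you listed as your second option, a rough sketch might look like this (the Item type is a stand-in for your actual object type; only .Name comes from your description):
// Index the second array by Name once (O(n)), then look up each item from the first array (O(1) each).
var byName = array2.ToDictionary(x => x.Name);

var pairs = new List<Tuple<Item, Item>>();
foreach (var item1 in array1)
{
    Item item2;
    if (byName.TryGetValue(item1.Name, out item2))   // zero or one match per name
        pairs.Add(Tuple.Create(item1, item2));
}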
Sort the second array using the Array.Sort method, then look up each object from the first array in it using a binary search.
Generally, for 30-50 items this would be a little faster than brute force x*y.
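A hedged sketch of that approach (again assuming an Item type with a .Name string; Comparer<T>.Create requires .NET 4.5, otherwise write a small IComparer<Item> class):
// Comparer that orders items by Name, used for both the sort and the binary search.
var byName = Comparer<Item>.Create((a, b) => string.CompareOrdinal(a.Name, b.Name));

Array.Sort(array2, byName);   // O(n log n), done once

foreach (var item1 in array1)
{
    int index = Array.BinarySearch(array2, item1, byName);   // O(log n) per lookup
    if (index >= 0)
    {
        var item2 = array2[index];
        // item1 and item2 are a matching pair
    }
}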
I need a list of strings and a way to quickly determine if a string is contained within that list.
To enhance lookup speed, I considered SortedList and Dictionary; however, both work with KeyValuePairs when all I need is a single string.
I know I could use a KeyValuePair and simply ignore the Value portion. But I do prefer to be efficient and am just wondering if there is a collection better suited to my requirements.
If you're on .NET 3.5 or higher, use HashSet<String>.
Failing that, a Dictionary<string, byte> (or whatever type you want for the TValue type parameter) would be faster than a SortedList if you have a lot of entries - the latter will use a binary search, so it'll be O(log n) lookup, instead of O(1).
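On .NET 2.0/3.0, a minimal sketch of that workaround might look like this (the string values are just examples):
// The byte values are never read; the dictionary is used purely for its O(1) key lookup.
Dictionary<string, byte> set = new Dictionary<string, byte>();
set["apple"] = 0;
set["pear"] = 0;

bool found = set.ContainsKey("apple");   // true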
If you just want to know if a string is in the set use HashSet<string>
This sounds like a job for
var keys = new HashSet<string>();
Per MSDN, the Contains method is an O(1) operation.
But you should be aware that it does not throw an error for duplicates when adding; Add simply returns false.
HashSet<string> is like a Dictionary, but with only keys.
If you feel like rolling your own data structure, use a Trie.
http://en.wikipedia.org/wiki/Trie
The worst case is when the string is present: O(length of the string).
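If you do roll your own, here is a minimal trie sketch restricted to lowercase ASCII for brevity (a real implementation would use something like a Dictionary<char, TrieNode> per node to handle arbitrary characters):
class TrieNode
{
    public readonly TrieNode[] Children = new TrieNode[26];  // 'a'..'z' only, for brevity
    public bool IsWord;
}

class Trie
{
    private readonly TrieNode root = new TrieNode();

    public void Add(string word)
    {
        TrieNode node = root;
        foreach (char c in word)
        {
            int i = c - 'a';
            if (node.Children[i] == null)
                node.Children[i] = new TrieNode();
            node = node.Children[i];
        }
        node.IsWord = true;
    }

    public bool Contains(string word)   // O(length of word)
    {
        TrieNode node = root;
        foreach (char c in word)
        {
            node = node.Children[c - 'a'];
            if (node == null)
                return false;
        }
        return node.IsWord;
    }
}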
I know this answer is a bit late to the party, but I was running into an issue where our systems were running slowly. After profiling we found there were a LOT of string lookups happening because of the way our data structures were organized.
So we did some research, came across these benchmarks, did our own tests, and have switched over to using SortedList now.
if (sortedlist.ContainsKey(thekey))
{
//found it.
}
Even though a Dictionary proved to be faster, it was less code we had to refactor, and the performance increase was good enough for us.
Anyway, wanted to share the website in case other people are running into similar issues. They do comparisons between data structures where the string you're looking for is a "key" (like HashTable, Dictionary, etc) or in a "value" (List, Array, or in a Dictionary, etc) which is where ours are stored.
I know the question is old as hell, but I just had to solve the same problem, only for a very small set of strings(between 2 and 4).
In my case, I actually used a manual lookup over an array of strings, which turned out to be much faster than HashSet<string> (I benchmarked it).
for (int i = 0; i < this.propertiesToIgnore.Length; i++)
{
if (this.propertiesToIgnore[i].Equals(propertyName))
{
return true;
}
}
Note that it is better than a hash set only for tiny arrays!
EDIT: works only with a manual for loop, do not use LINQ, details in comments
I have a large list of integers that are sent to my webservice. Our business rules state that these values must be unique. What is the most performant way to figure out if there are any duplicates? I don't need to know the values, I only need to know if 2 of the values are equal.
At first I was thinking about using a generic List of integers and the list.Exists() method, but that is O(n);
Then I was thinking about using a Dictionary and the ContainsKey method. But, I only need the Keys, I do not need the values. And I think this is a linear search as well.
Is there a better datatype to use to find uniqueness within a list? Or am I stuck with a linear search?
Use a HashSet<T>:
The HashSet<T> class provides high-performance set operations. A set is a collection that contains no duplicate elements, and whose elements are in no particular order.
HashSet<T> even exposes a constructor that accepts an IEnumerable<T>. By passing your List<T> to the HashSet<T>'s constructor you will end up with a reference to a new HashSet<T> that will contain a distinct sequence of items from your original List<T>.
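A minimal sketch of that idea, assuming the incoming integers arrive in a List<int> named values:
var unique = new HashSet<int>(values);   // duplicates are silently dropped

bool hasDuplicates = unique.Count != values.Count;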
Sounds like a job for a HashSet...
If you are using framework 3.5 you can use the HashSet collection.
Otherwise the best option is the Dictionary. The value of each item will be wasted, but that will give you the best performance.
If you check for duplicates while you add the items to the HashSet/Dictionary instead of counting them afterwards, you get better performance than O(n) in case there are duplicates, as you don't have to continue looking after finding the first duplicate.
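A sketch of that early-exit check; HashSet<T>.Add returns false when the item is already present, so you can stop at the first duplicate:
bool ContainsDuplicate(IEnumerable<int> values)
{
    var seen = new HashSet<int>();
    foreach (int value in values)
    {
        if (!seen.Add(value))   // Add returns false if 'value' was already in the set
            return true;        // stop as soon as a duplicate is found
    }
    return false;
}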
If the set of numbers is sparse, then as others suggest use a HashSet.
But if the set of numbers is mostly in sequence with occasional gaps, it would be a lot better if you stored the number set as a sorted array or binary tree of begin,end pairs. Then you could search to find the pair with the largest begin value that was smaller than your search key and compare with that pair's end value to see if it exists in the set.
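As a rough illustration of that idea (the Range type and the sorted list of pairs are assumptions, not a full implementation):
// Each Range covers a contiguous run of numbers; the list is kept sorted by Begin.
struct Range { public int Begin; public int End; }

static bool ContainsKey(List<Range> ranges, int key)
{
    // Binary search for the last range whose Begin is <= key.
    int lo = 0, hi = ranges.Count - 1, candidate = -1;
    while (lo <= hi)
    {
        int mid = (lo + hi) / 2;
        if (ranges[mid].Begin <= key) { candidate = mid; lo = mid + 1; }
        else hi = mid - 1;
    }
    return candidate >= 0 && key <= ranges[candidate].End;
}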
What about doing:
list.Distinct().Count() != list.Count()
I wonder about the performance of this. I think it would be as good as O(n) but with less code and still easily readable.
I am looking for a structure that holds a sorted set of double values. I want to query this set to find the closest value to a specified reference value.
I have looked at SortedList<double, double>, and it does quite well for me. However, since I do not need explicit key/value pairs, this seems to be overkill, and I wonder if I could do better.
Conditions:
The structure is initialised only once, and does never change (no insert/deletes)
The amount of values is in the range of 100k.
The structure is queried often with new references, which must execute fast.
For simplicity and speed, the set's value just below the reference may be returned, not necessarily the nearest value
I want to use LINQ for the query, if possible, for simplicity of code.
I want to use no 3rd party code if possible. .NET 3.5 is available.
Speed is more important than memory footprint
I currently use the following code, where SortedValues is the aforementioned SortedList
IEnumerable<double> nearest = from item in SortedValues.Keys
where item <= suggestion
select item;
return nearest.ElementAt(nearest.Count() - 1);
Can I do faster?
Also I am not 100% sure if this code is really safe. IEnumerable<double>, the return type of my query, is not by definition sorted anymore. However, a unit test with a large test data set has shown that it is in practice, so this works for me. Do you have hints regarding this aspect?
P.S. I know that there are many similar questions, but none actually answers my specific needs. Especially there is this one C# Data Structure Like Dictionary But Without A Value, but the questioner does just want to check the existence not find anything.
The way you are doing it is incredibly slow as it must search from the beginning of the list each time giving O(n) performance.
A better way is to put the elements into a List and then sort the list. You say you don't need to change the contents once initialized, so sorting once is enough.
Then you can use List<T>.BinarySearch to find elements or to find the insertion point of an element if it doesn't already exist in the list.
From the docs:
Return Value: The zero-based index of item in the sorted List<T>, if item is found; otherwise, a negative number that is the bitwise complement of the index of the next element that is larger than item or, if there is no larger element, the bitwise complement of Count.
Once you have the insertion point, you need to check the elements on either side to see which is closest.
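Put together, a sketch of that lookup (assuming the sorted values live in a List<double> named values):
// values must already be sorted ascending.
double FindClosest(List<double> values, double reference)
{
    int index = values.BinarySearch(reference);
    if (index >= 0)
        return values[index];            // exact match

    index = ~index;                      // index of the first element larger than 'reference'
    if (index == 0)
        return values[0];                // reference is below the smallest value
    if (index == values.Count)
        return values[values.Count - 1]; // reference is above the largest value

    double below = values[index - 1];
    double above = values[index];
    return (reference - below) <= (above - reference) ? below : above;
}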
Might not be useful to you right now, but .Net 4 has a SortedSet class in the BCL.
I think it can be more elegant as follows:
In case your items are not sorted:
double nearest = values.OrderBy(x => x.Key).Last(x => x.Key <= requestedValue).Key;
In case your items are sorted, you may omit the OrderBy call...