Currently, I am testing every element of one integer array against every element of another to find which ones match. The arrays do not contain duplicates within their own set. Also, the arrays are not always of equal length. Are there any tricks to speed this up? I am doing this thousands of times, so it's starting to become a bottleneck in my program, which is in C#.
You could use LINQ:
var query = firstArray.Intersect(secondArray);
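For example, with two small sample arrays (Intersect is lazy, so ToArray materializes the result):
int[] firstArray = { 1, 3, 5, 7 };
int[] secondArray = { 2, 3, 5, 8 };
int[] common = firstArray.Intersect(secondArray).ToArray(); // { 3, 5 }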
Or if the arrays are already sorted you could iterate over the two arrays yourself:
int[] a = { 1, 3, 5 };
int[] b = { 2, 3, 4, 5 };
List<int> result = new List<int>();
int ia = 0;
int ib = 0;
while (ia < a.Length && ib < b.Length)
{
if (a[ia] == b[ib])
{
result.Add(a[ia]);
ib++;
ia++;
}
else if (a[ia] < b[ib])
{
ia++;
}
else
{
ib++;
}
}
Use a HashSet
var set = new HashSet<int>(firstArray);
set.IntersectWith(secondArray);
The set now contains only the values that exist in both arrays.
If such a comparison is a bottleneck in your program, you are perhaps using an inappropriate data structure. The simplest way might be to keep your data sorted. Then for finding out the common entries, you would need to traverse both arrays only once. Another option would be to keep the data in a HashSet.
I'm having issues with a certain task. It's not homework or anything; it's more of a personal matter now. And I want to know if there's even a solution for this...
The point is to achieve expected O(n) worst-case time complexity for a function that takes two string arrays as input (let's call the first one A and the second one B) and returns an array of integers, where each element represents the index of the corresponding element in array A.
So, this is how the function should look:
private static int[] GetExistingStrings(string[] A, string[] B) { ... }
Array A contains all possible names
Array B contains names which should be excluded (i.e. if a name stored in B is also present in A, its index should not be included in the output int[] array). It's also possible for B to contain random strings that are not present in A at all, or even to be empty.
For example, if we have these arrays:
string[] A = { "one", "two", "three", "four" }; // 0, 1, 2, 3
string[] B = { "two", "three" }; // Indices of "two" and "three" not taken into account
The function should return:
int[] result = { 0, 3 }; // Indices of "one" and "four"
At first, I tried doing it the obvious and simple way (with nested for-loops):
private static int[] GetExistingStrings(string[] A, string[] B)
{
LinkedList<int> aIndices = new LinkedList<int>();
for (int n = 0; n < A.Length; n++)
{
bool isExcluded = false;
for (int m = 0; m < B.Length; m++)
{
if (A[n].Equals(B[m]))
{
isExcluded = true;
break;
}
}
if (!isExcluded)
{
            aIndices.AddLast(n);
}
}
int[] resultArray = new int[aIndices.Count];
aIndices.CopyTo(resultArray, 0);
return resultArray;
}
I used LinkedList because we can't possibly know what the output array's size should be, and also because adding new nodes to this list is a constant O(1) operation. The problem here, of course, is that this function is (as I assume) O(n*m) time complexity. So, we need to find another way...
My second approach was:
private static int[] GetExistingStrings(string[] A, string[] B)
{
int n = A.Length;
int m = B.Length;
if (m == 0)
{
return GetDefaultOutputArray(n);
}
HashSet<string> bSet = new HashSet<string>(B);
LinkedList<int> aIndices = new LinkedList<int>();
for (int i = 0; i < n; i++)
{
if (!bSet.Contains(A[i]))
{
aIndices.AddLast(i);
}
}
    int[] result = new int[aIndices.Count];
    aIndices.CopyTo(result, 0);
    return result;
}
// Just a utility function that returns a default array
// with length "arrayLength", where first element is 0, next one is 1 and so on...
private static int[] GetDefaultOutputArray(int arrayLength)
{
int[] array = new int[arrayLength];
for (int i = 0; i < arrayLength; i++)
{
array[i] = i;
}
return array;
}
Here the idea was to add all elements of array B to a HashSet and then use its Contains() method to check for matches in the for-loop. But I can't quite work out the time complexity of this function... I know for sure that the code in the for-loop will execute n times. But what bugs me the most is the HashSet initialization: should it be taken into account here? How does it affect the time complexity? Is this function O(n)? Or O(n+m) because of the HashSet initialization?
Is there any way to solve this task and achieve O(n)?
If you have n elements in A, m elements in B, and the strings are of length k, the expected time of a hash-based approach is O(k*(m + n)). Unfortunately the worst-case time is O(k*m*(m + n)) if the hashing goes badly (the odds of which are very low). I had this wrong before; thanks to @PaulHankin for the correction.
To get O(k*(m + n)) worst-case time we have to take a very different approach. What you do is build a trie out of B. And now you go through each element of A and look it up in the trie. Unlike a hash, a trie has guaranteed worst-case performance (and, better yet, allows prefix lookups, even though we aren't using that here). This approach gives us not just an expected average time of O(k*(m + n)) but the same worst-case time.
You cannot do better than this because just processing the lists requires processing O(k*(m + n)) data.
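Here is a minimal sketch of that idea (the TrieNode type and helper name are mine, not a library API; a Dictionary keyed by char keeps the sketch short, whereas a fixed-size child array per node would give the strictest per-character guarantee):
// (assumes using System.Collections.Generic)
sealed class TrieNode
{
    public readonly Dictionary<char, TrieNode> Children = new Dictionary<char, TrieNode>();
    public bool IsTerminal;
}

static int[] GetExistingStringsTrie(string[] A, string[] B)
{
    // Build the trie from B: O(k*m) for m strings of length up to k.
    var root = new TrieNode();
    foreach (string s in B)
    {
        TrieNode node = root;
        foreach (char c in s)
        {
            if (!node.Children.TryGetValue(c, out TrieNode child))
                node.Children[c] = child = new TrieNode();
            node = child;
        }
        node.IsTerminal = true;
    }

    // Look up each element of A: O(k*n) in total.
    var indices = new List<int>();
    for (int i = 0; i < A.Length; i++)
    {
        TrieNode node = root;
        foreach (char c in A[i])
        {
            if (!node.Children.TryGetValue(c, out node))
                break; // fell off the trie: A[i] is not in B
        }
        if (node == null || !node.IsTerminal)
            indices.Add(i);
    }
    return indices.ToArray();
}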
Here is how you could rewrite your second approach using LINQ, while also selecting case-insensitive string comparison:
public static int[] GetExistingStrings(string[] first, string[] second)
{
var secondSet = new HashSet<string>(second, StringComparer.OrdinalIgnoreCase);
return first
        .Select((e, i) => (Element: e, Index: i))
.Where(p => !secondSet.Contains(p.Element))
.Select(p => p.Index)
.ToArray();
}
The time and space complexity is the same (O(n + m)). It's just a fancier way to do the same thing.
What is the way to iterate over an array (say, a 10-cell array) in C# where only the first 4 cells are populated with values, avoiding the remaining 6?
I can keep an int index and update it on every array add/remove operation,
but I was wondering if C# has a built-in (and efficient) way to achieve this.
Thanks in advance.
You have two performant options, as far as I can tell.
A for loop; the downside is that you have to check the condition on each iteration:
for (var i = 0; i < ary.Length && ary[i] != null; i++)
{
}
However, if this is really mission critical, you will have to keep a list (or an index marking the end of your populated range), which is slower on update but faster on iteration.
If you want to squeeze out a bit more performance, use fixed and unsafe.
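For instance, a minimal unsafe sketch, assuming an int[] in which 0 marks an unused slot and a project compiled with unsafe code enabled (measure before committing to this):
static unsafe int SumPopulatedPrefix(int[] ary)
{
    int sum = 0;
    fixed (int* p = ary)
    {
        // Pointer access skips the bounds check the JIT would otherwise emit.
        for (int i = 0; i < ary.Length && p[i] != 0; i++)
            sum += p[i];
    }
    return sum;
}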
Yes there is: the built-in TakeWhile method that you can use:
var result = array.TakeWhile(item => !string.IsNullOrEmpty(item));
foreach (var value in result)
{
Console.WriteLine(value);
}
You could use a list and then convert it to an array
List<int> list = new List<int>();
list.Add(1);
list.Add(2);
list.Add(5);
int[] arr = list.ToArray();
int sum = 0;
foreach (int value in arr)
{
sum += value;
}
I would do something like this with the use of Lambda Expressions.
I made a little console application.
So it checks for nulls and skips them, and a zero-length array simply won't iterate at all.
It also can't crash with an IndexOutOfRangeException, unlike a for-loop whose condition tests the element before checking the index bound.
int?[] array = { 6, null, 4, 3, null, };
array.Where(t => t != null).ToList().ForEach(t => Console.WriteLine(t));
Console.ReadLine();
I have a List that contains these values: {1, 2, 3, 4, 5, 6, 7}, and I want to be able to retrieve unique combinations of three. The result should be like this:
{1,2,3}
{1,2,4}
{1,2,5}
{1,2,6}
{1,2,7}
{2,3,4}
{2,3,5}
{2,3,6}
{2,3,7}
{3,4,5}
{3,4,6}
{3,4,7}
{3,4,1}
{4,5,6}
{4,5,7}
{4,5,1}
{4,5,2}
{5,6,7}
{5,6,1}
{5,6,2}
{5,6,3}
I already have two for-loops that are able to do this:
for (int first = 0; first < test.Count - 2; first++)
{
int second = first + 1;
for (int offset = 1; offset < test.Count; offset++)
{
int third = (second + offset)%test.Count;
        if (Math.Abs(first - third) < 2)
            continue;
        List<int> temp = new List<int>();
        temp.Add(test[first]);
        temp.Add(test[second]);
        temp.Add(test[third]);
        result.Add(temp);
}
}
But since I'm learning LINQ, I wonder if there is a smarter way to do this?
UPDATE: I used this question as the subject of a series of articles starting here; I'll go through two slightly different algorithms in that series. Thanks for the great question!
The two solutions posted so far are correct, but inefficient for cases where the numbers get large. They both use the same algorithm: first enumerate all the possibilities:
{1, 1, 1 }
{1, 1, 2 },
{1, 1, 3 },
...
{7, 7, 7}
And while doing so, filter out any where the second is not larger than the first, and the third is not larger than the second. This performs 7 x 7 x 7 filtering operations, which is not that many, but if you were trying to get, say, permutations of ten elements from thirty, that's 30 x 30 x 30 x 30 x 30 x 30 x 30 x 30 x 30 x 30, which is rather a lot. You can do better than that.
I would solve this problem as follows. First, produce a data structure which is an efficient immutable set. Let me be very clear about what an immutable set is, because you are likely not familiar with them. You normally think of a set as something you add items to and remove items from. An immutable set has an Add operation, but it does not change the set; it gives you back a new set which has the added item. The same goes for removal.
Here is an implementation of an immutable set where the elements are integers from 0 to 31:
using System.Collections;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
using System;
// A super-cheap immutable set of integers from 0 to 31;
// just a convenient wrapper around bit operations on an int.
internal struct BitSet : IEnumerable<int>
{
public static BitSet Empty { get { return default(BitSet); } }
private readonly int bits;
private BitSet(int bits) { this.bits = bits; }
public bool Contains(int item)
{
Debug.Assert(0 <= item && item <= 31);
return (bits & (1 << item)) != 0;
}
public BitSet Add(int item)
{
Debug.Assert(0 <= item && item <= 31);
return new BitSet(this.bits | (1 << item));
}
public BitSet Remove(int item)
{
Debug.Assert(0 <= item && item <= 31);
return new BitSet(this.bits & ~(1 << item));
}
IEnumerator IEnumerable.GetEnumerator() { return this.GetEnumerator(); }
public IEnumerator<int> GetEnumerator()
{
for(int item = 0; item < 32; ++item)
if (this.Contains(item))
yield return item;
}
public override string ToString()
{
return string.Join(",", this);
}
}
Read this code carefully to understand how it works. Again, always remember that adding an element to this set does not change the set. It produces a new set that has the added item.
OK, now that we've got that, let's consider a more efficient algorithm for producing your permutations.
We will solve the problem recursively. A recursive solution always has the same structure:
Can we solve a trivial problem? If so, solve it.
If not, break the problem down into a number of smaller problems and solve each one.
Let's start with the trivial problems.
Suppose you have a set and you wish to choose zero items from it. The answer is clear: there is only one possible permutation with zero elements, and that is the empty set.
Suppose you have a set with n elements in it and you want to choose more than n elements. Clearly there is no solution, not even the empty set.
We have now taken care of the cases where the set is empty or the number of elements chosen is more than the number of elements total, so we must be choosing at least one thing from a set that has at least one thing.
Of the possible permutations, some of them have the first element in them and some of them do not. Find all the ones that have the first element in them and yield them. We do this by recursing to choose one fewer elements on the set that is missing the first element.
The ones that do not have the first element in them we find by enumerating the permutations of the set without the first element.
static class Extensions
{
public static IEnumerable<BitSet> Choose(this BitSet b, int choose)
{
if (choose < 0) throw new InvalidOperationException();
if (choose == 0)
{
// Choosing zero elements from any set gives the empty set.
yield return BitSet.Empty;
}
else if (b.Count() >= choose)
{
// We are choosing at least one element from a set that has
// a first element. Get the first element, and the set
// lacking the first element.
int first = b.First();
BitSet rest = b.Remove(first);
// These are the permutations that contain the first element:
foreach(BitSet r in rest.Choose(choose-1))
yield return r.Add(first);
// These are the permutations that do not contain the first element:
foreach(BitSet r in rest.Choose(choose))
yield return r;
}
}
}
Now we can ask the question that you need the answer to:
class Program
{
static void Main()
{
BitSet b = BitSet.Empty.Add(1).Add(2).Add(3).Add(4).Add(5).Add(6).Add(7);
foreach(BitSet result in b.Choose(3))
Console.WriteLine(result);
}
}
And we're done. We have generated only as many sequences as we actually need. (We have done a lot of set operations to get there, but set operations are cheap.) The point here is that understanding how this algorithm works is extremely instructive. Recursive programming on immutable structures is a powerful tool that many professional programmers do not have in their toolbox.
You can do it like this:
var data = Enumerable.Range(1, 7);
var r = from a in data
from b in data
from c in data
where a < b && b < c
select new {a, b, c};
foreach (var x in r) {
Console.WriteLine("{0} {1} {2}", x.a, x.b, x.c);
}
Edit: Thanks Eric Lippert for simplifying the answer!
var ints = new int[] { 1, 2, 3, 4, 5, 6, 7 };
var permutations = ints.SelectMany(a => ints.Where(b => (b > a)).
SelectMany(b => ints.Where(c => (c > b)).
Select(c => new { a = a, b = b, c = c })));
I'm trying to implement a paging algorithm for a dataset sortable via many criteria. Unfortunately, while some of those criteria can be implemented at the database level, some must be done at the app level (we have to integrate with another data source). We have a paging (actually infinite scroll) requirement and are looking for a way to minimize the pain of sorting the entire dataset at the app level with every paging call.
What is the best way to do a partial sort, only sorting the part of the list that absolutely needs to be sorted? Is there an equivalent to C++'s std::partial_sort function available in the .NET libraries? How should I go about solving this problem?
EDIT: Here's an example of what I'm going for:
Let's say I need to get elements 21-40 of a 1000 element set, according to some sorting criteria. In order to speed up the sort, and since I have to go through the whole dataset every time anyway (this is a web service over HTTP, which is stateless), I don't need the whole dataset ordered. I only need elements 21-40 to be correctly ordered. It is sufficient to create 3 partitions: Elements 1-20, unsorted (but all less than element 21); elements 21-40, sorted; and elements 41-1000, unsorted (but all greater than element 40).
OK. Here's what I would try based on what you said in reply to my comment.
I want to be able to say "4th through 6th" and get something like: 3,
2, 1 (unsorted, but all less than proper 4th element); 4, 5, 6 (sorted
and in the same place they would be for a sorted list); 8, 7, 9
(unsorted, but all greater than proper 6th element).
Let's add 10 to our list to make it easier: 10, 9, 8, 7, 6, 5, 4, 3, 2, 1.
So, what you could do is use the quickselect algorithm to find the ith and kth elements. In your case above, i is 4 and k is 6, and quickselect will of course return the values 4 and 6. That's going to take two passes through your list, so the runtime so far is O(2n) = O(n). The next part is easy, of course: we have lower and upper bounds on the data we care about, so all we need to do is make one more pass through the list, looking for any element that is between our bounds and throwing each match into a new List. Finally, we sort that List, which contains only the ith through kth elements that we care about.
So, I believe the total runtime ends up being O(n) + O((k-i) lg(k-i)).
static void Main(string[] args) {
//create an array of 10 million items that are randomly ordered
var list = Enumerable.Range(1, 10000000).OrderBy(x => Guid.NewGuid()).ToList();
var sw = Stopwatch.StartNew();
var slowOrder = list.OrderBy(x => x).Skip(10).Take(10).ToList();
sw.Stop();
Console.WriteLine(sw.ElapsedMilliseconds);
//Took ~8 seconds on my machine
sw.Restart();
var smallVal = Quickselect(list, 11);
var largeVal = Quickselect(list, 20);
var elements = list.Where(el => el >= smallVal && el <= largeVal).OrderBy(el => el);
Console.WriteLine(sw.ElapsedMilliseconds);
//Took ~1 second on my machine
}
public static T Quickselect<T>(IList<T> list, int k) where T : IComparable {
Random rand = new Random();
int r = rand.Next(0, list.Count);
T pivot = list[r];
List<T> smaller = new List<T>();
List<T> larger = new List<T>();
foreach (T element in list) {
var comparison = element.CompareTo(pivot);
        if (comparison < 0) {
smaller.Add(element);
}
        else if (comparison > 0) {
larger.Add(element);
}
}
if (k <= smaller.Count) {
return Quickselect(smaller, k);
}
else if (k > list.Count - larger.Count) {
return Quickselect(larger, k - (list.Count - larger.Count));
}
else {
return pivot;
}
}
You can use List<T>.Sort(int, int, IComparer<T>):
inputList.Sort(startIndex, count, Comparer<T>.Default);
Array.Sort() has an overload that accepts index and length arguments that lets you sort a subset of an array. The same exists for List.
You cannot sort an IEnumerable directly, of course.
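For example, a quick sketch with made-up data (note that this sorts a positional range in place; unlike C++'s std::partial_sort, it does not first move the smallest elements into that range):
int[] data = { 5, 3, 9, 1, 7, 2 };
Array.Sort(data, 1, 3); // sorts the 3 elements starting at index 1
// data is now { 5, 1, 3, 9, 7, 2 }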
I have what seems to be a simple problem but I can't figure it out so far.
Say I have two arrays:
int[] values = {10,20,20,10,30};
int[] keys = {1,2,3,4,5};
Array.Sort(values,keys);
Then the arrays would look like this:
values = {10,10,20,20,30};
keys = {4,1,2,3,5};
Now, what I want to do is make it so that the keys are also sorted in second priority so the key array to look like this:
keys = {1,4,2,3,5};
Notice the 1 and 4 values are switched and the order of the value array has not changed.
If an "in-place sorting" is not strictly necessary for you, I suggest to use OrderBy:
var sortedPairs = values.Select((x, i) => new { Value = x, Key = keys[i] })
.OrderBy(x => x.Value)
.ThenBy(x => x.Key)
.ToArray(); // this avoids sorting 2 times...
int[] sortedValues = sortedPairs.Select(x => x.Value).ToArray();
int[] sortedKeys = sortedPairs.Select(x => x.Key).ToArray();
// Result:
// sortedValues = {10,10,20,20,30};
// sortedKeys = {1,4,2,3,5};
Generally, parallel arrays are frowned upon; it is very easy for the data to get out of sync. What I would suggest is either using a map/Dictionary type, or storing the key and value in a single object and keeping an array of those objects.
Edit: after re-reading your question, I don't think a Dictionary is the data type you want, given your need to sort by the values. I would still suggest having an object that contains the key and value, however. You can then sort by the values and rest assured that the keys won't get out of sync.
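A minimal sketch of that suggestion, with an illustrative Entry type (the type and field names are mine):
struct Entry
{
    public int Key;
    public int Value;
}

static void SortByValueThenKey(Entry[] entries)
{
    // Primary sort on Value, ties broken by Key, so the pairing is never lost.
    Array.Sort(entries, (a, b) =>
    {
        int result = a.Value.CompareTo(b.Value);
        return result != 0 ? result : a.Key.CompareTo(b.Key);
    });
}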
Array.Sort(values,keys) will use the default Comparer to sort the values and keys. You would need to write a custom Comparer to do what you're describing, and pass your Comparer in to the Array.Sort method.
By converting this to a sort on an array of value pairs, you can supply your own comparator and make the sort work pretty much any way you like. (It seems awfully risky to use two separate arrays.) See the fourth method at http://msdn.microsoft.com/en-us/library/system.array.sort.aspx.
I think the accepted answer is great. One can use anonymous types, as shown in that answer, or declare a named type to hold the data while sorting.
Even better, declare a named type to hold the data all the time. Parallel arrays are usually not a good idea. There are some niche scenarios where they are needed for performance or interoperability reasons, but otherwise they should be avoided.
That said, for completeness I think it would be useful to also point out that the arrays can be sorted "by proxy". I.e. create a new array that is just the indexes of the original arrays and sort that array. Once the index array has been sorted, you can use that array to access the original data directly, or you can use that array to then copy the original data into new, sorted arrays.
For example:
static void Main(string[] args)
{
int[] values = { 10, 20, 20, 10, 30 };
int[] keys = { 1, 2, 3, 4, 5 };
int[] indexes = Enumerable.Range(0, values.Length).ToArray();
Array.Sort(indexes, (i1, i2) => Compare(i1, i2, values, keys));
// Use the index array directly to access the original data
for (int i = 0; i < values.Length; i++)
{
Console.WriteLine("{0}: {1}", values[indexes[i]], keys[indexes[i]]);
}
Console.WriteLine();
// Or go ahead and copy the old data into new arrays using the new order
values = OrderArray(values, indexes);
keys = OrderArray(keys, indexes);
for (int i = 0; i < values.Length; i++)
{
Console.WriteLine("{0}: {1}", values[i], keys[i]);
}
}
private static int Compare(int i1, int i2, int[] values, int[] keys)
{
int result = values[i1].CompareTo(values[i2]);
if (result == 0)
{
result = keys[i1].CompareTo(keys[i2]);
}
return result;
}
private static int[] OrderArray(int[] values, int[] indexes)
{
int[] result = new int[values.Length];
for (int i = 0; i < values.Length; i++)
{
result[i] = values[indexes[i]];
}
return result;
}