Optimizing C# loop comparison of two very large lists

Before you read my explanation, I want to tell you that I need to optimize the processing time of comparing two huge C# lists, index by index, in a nested loop.
It's a .NET Core app which I am creating with C#, of course.
In my algorithm I have to create a very long list holding ranges of integers, like this:
internal class Global
{
    public string ChromosomeName { get; set; }
    public int start { get; set; }
    public int end { get; set; }
    public string Cluster { get; set; }
    public string Data { get; set; }
}

var globals = new List<Global>(); // somewhere in my method
Now, this list will be very huge. For example, it will have values stored like this (this is my main list, so it's named 'globals'):
index 0: start=1, end=400, ...
index 1: start=401, end=800, ...
index (last): start=45090000, end=45090400, ...
These are just rough estimates so that you understand it's going to be a huge list.
Now, in my algorithm, what I actually have to do is this:
I take one text file, read it, and store its data in another list with exactly the same properties as shown in the code above.
Now I have two lists: the globals list and the other list which I read from the file.
Both of them are very huge.
I have to compare them index by index in a nested loop.
The outer loop iterates my globals list and the inner loop iterates the other list (the one read from the file).
After I finish the nested loops once, I read another file, create another list, and compare that list with the globals list in the same manner.
So there is one globals list which will be compared, index by index in a nested loop, with around 10 more lists, all of them nearly as huge as the globals list itself.
Below is pseudocode for the nested foreach loops:
foreach (var item in globals)
{
    var value = 0;
    foreach (var item2 in otherHugeList)
    {
        compareMethod(item, item2);
        // Below is the actual comparison I am doing, in case you guys want to know:
        // I am finding the overlap between two ranges.
        // value += Math.Max(0, Math.Min(range1.end, EndList[i]) - Math.Max(range1.start, StartList[i]) + 1);
    }
}
What is the fastest way I can do this? Right now it takes hours; I get frustrated and cancel the process because I don't know how long it is going to take, so I am not even able to get results on smaller files.
I need to know the fastest possible way to do this. Should I use some library compatible with .NET Core? Or multithreading somehow? I am not that good with threading concepts, though.
P.S.: I have tried Parallel.ForEach and its effect on performance was negligible.

If you need to make element-by-element comparisons of two lists with 10^6 items each, there are 10^12 comparisons that you need to make. That leaves you no hope of finishing in a sane amount of time, so the key to solving this problem is to drastically reduce the number of comparisons.
The exact approach to making the reduction depends on the kind of comparison that you are running, so let's use overlap computation from your post as an example.
You know that there is no overlap between ranges R and Q when one of the statements below is true:
Upper bound of R is below the lower bound of Q, or
Lower bound of R is above the upper bound of Q.
This wouldn't help if your ranges appear on the list in random order. However, if you sort your ranges on the lower bound, and resolve ties by the upper bound, you will be able to use binary search to find the relevant portion of the list for each range you compare, i.e. the elements for which the overlap is possible.
Assuming that there is little overlap among ranges on the same list, this will reduce the number of comparisons from roughly a million per element to well under a hundred per element, resulting in a roughly 1000-fold increase in performance. A sketch of this approach is below.
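Here is a minimal sketch of that idea, using the Global class and the overlap formula from the question (the method names are mine). It assumes the comparison list is sorted by start and, per the comment below, contains no self-overlapping ranges, so at most one range starting before q.start can still reach into q:

// 'sorted' is the comparison list, ordered by start ascending.
static long OverlapWithSorted(Global q, List<Global> sorted)
{
    // Binary search for the first range whose start is >= q.start.
    int lo = 0, hi = sorted.Count;
    while (lo < hi)
    {
        int mid = lo + (hi - lo) / 2;
        if (sorted[mid].start < q.start) lo = mid + 1; else hi = mid;
    }

    long total = 0;
    if (lo > 0)
        total += Overlap(q, sorted[lo - 1]); // the one range that may straddle q.start
    for (int i = lo; i < sorted.Count && sorted[i].start <= q.end; i++)
        total += Overlap(q, sorted[i]);      // stop once ranges begin past q.end
    return total;
}

// The overlap formula from the question, factored out.
static long Overlap(Global a, Global b) =>
    Math.Max(0, Math.Min(a.end, b.end) - Math.Max(a.start, b.start) + 1);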
None of my lists will have self-overlapping ranges (comment)
Then you can use a variation of the merge algorithm by sorting both range lists, and then iterating them in a single loop. Set indexes into two arrays to zero, then walk both lists one step at a time. If the current range on the global list is below the start level of the current range on the comparison list, move on to the next element of the global list; otherwise, move on to the next element of the comparison list. The two indexes will "chase" each other until you reach the end of both lists after 2M increments.
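In code, that single loop might look like the sketch below. One common formulation advances whichever of the two current ranges ends first; with both lists sorted by start and free of self-overlap, every overlapping pair is visited exactly once:

// Both lists must be sorted by start, and neither may contain ranges
// that overlap each other. Runs in O(N + M) after the O(N log N) sorts.
static long TotalOverlap(List<Global> a, List<Global> b)
{
    long total = 0;
    int i = 0, j = 0;
    while (i < a.Count && j < b.Count)
    {
        // Overlap of the current pair (zero when they are disjoint).
        total += Math.Max(0, Math.Min(a[i].end, b[j].end) - Math.Max(a[i].start, b[j].start) + 1);

        // Advance whichever range finishes first; the other may still
        // overlap the next range on the opposite list.
        if (a[i].end < b[j].end) i++; else j++;
    }
    return total;
}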

Related

C# (C++ would be cool too) Fastest way to find differences in two large arrays/lists with indexes

More Details:
For this problem, I'm specifically looking for the fastest way to do this, in general and specifically in C#. I don't necessarily mean the "theoretical"/algorithmic fastest; instead, I'm looking for practical implementation speed. In this specific situation, the arrays only have about 1000 elements each, which seems very small, but this computation is going to be running very rapidly and comparing many arrays (it blows up in size very quickly). I ultimately need the indexes of each element that is different.
I can obviously do a very simple implementation like:
public List<int> FindDifferences(List<double> Original, List<double> NewList)
{
    List<int> Changes = new List<int>();
    for (int i = 0; i < Original.Count; i++)
    {
        if (Original[i] != NewList[i])
        {
            Changes.Add(i);
        }
    }
    return Changes;
}
But from what I can see, this will be really slow overall, since it has to iterate once through each item in the list. Is there anything I can do to speed this up? Specifically, is there a way to do something like a parallel foreach that generates a list of the indexes of changes? I saw what I think was a similar question asked before, but I didn't quite understand the answer. Or would there be another way to run the calculation on all items of the list simultaneously (or somehow clustered)?
Assumptions
Each array or list being compared contains data of the same type (double, int, or string), so if array1 holds strings and is compared to array2, I know for certain that array2 will only hold strings, and it will be of the same size (in terms of item count; I can see if maybe they are the same byte count too, if that could come into play).
The vast majority of the items in these comparisons will remain the same. My resultant "differences" list will probably only contain a few(1-10) items, if any.
Concerns
1) After a comparison is made (old and new list in the block above), the new list will overwrite the old one. If computation takes longer than the time between incoming messages (each a new list to compare), I can have a problem with collision:
Let's say I have three lists: A, B, and C. A would be my global/"current state" list. When a message containing a new list (B) is received, A is the list B would be compared to.
In an ideal world, A would be compared to B, and I would receive a list of integers representing the indexes that contain elements differing between the two. After the method computes and returns this index list, A would become B (the values of B overwrite the values of A as my "current state"). When I receive another message (C), it would be compared to my new current state (A, but with the values previously belonging to B); I'd receive the list of differences, and C's values would overwrite A's and become the new current state. If the comparison between A and B is still running when C is received, I would need to make sure the new calculation either:
doesn't happen until after A and B's comparison finishes and A is overwritten with its new values, or
is instead made between B and C, with C overwriting A after the comparison finishes (the difference list is fired off elsewhere, so I'd still receive both change lists).
2) If this comparison between lists can't be sped up, is there somewhere else I can speed things up instead? The messages I receive come as an object with three values: an ASCII-encoded byte array, a long string (the already-parsed byte array), and a "type" (the name of the list it corresponds to, so I know the data type of its contents). I currently ignore the byte array and parse the string by splitting it at newline characters.
I know this is inefficient, but I have trouble converting the byte array into ints or doubles. The doubles, because they carry a lot of "noise" (a value of 1.50 could come in as 1.4976789, so I actually have to round it to get its "real" value); the ints, because there is no 0-padding, so I don't know the length to chunk the byte array into. Below is an example of what I'm doing:
public List<string> ListFromString(string request)
{
    List<string> fulllist = request.Split('\n').ToList<string>();
    return fulllist.GetRange(1, fulllist.Count - 2); // There's always a label tacked on the beginning, so I start from 1
}

public List<double> RequestListAsDouble(string request)
{
    List<string> RequestAsString = ListFromString(request);
    List<double> RequestListAsDouble = new List<double>();
    foreach (string requestElement in RequestAsString)
    {
        double requestElementAsDouble = Math.Round(Double.Parse(requestElement), 2);
        RequestListAsDouble.Add(requestElementAsDouble);
    }
    return RequestListAsDouble;
}
Your single-threaded comparison of the two parsed lists is probably the fastest way to do it. It is certainly the easiest. As noted by another poster, you can get some speed advantage by pre-allocating the size of the "Changes" list to be some percentage of the size of your input list.
If you want to try parallel thread comparisons, you should set up N threads in advance and have them wait for a starting event, where N is the number of real processors on your system. Each thread should compare a portion of the lists and write its answers to the interlocked output list "Changes". On completion, the threads go back to sleep, waiting for the next starting event.
When all the threads have gone back to their starting positions, the main thread can pick up the "Changes" and pass it along. Repeat with the next list.
Be sure to clean up all the worker threads when your application is supposed to exit - or it won't exit.
There is a lot of overhead in starting and ending threads. It is all too easy to lose all the processing speed from that overhead. That's why you would want a pool of worker threads already setup and waiting on an event flag. Threads only improve processing speed up to the number of real CPUs in the system.
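If you would rather not manage worker threads by hand, Parallel.For with thread-local buffers gives the same partitioned effect while reusing the runtime's thread pool, which already sits waiting between calls. A minimal sketch (the method name and the final sort are mine; partitions complete out of order, so the indexes need re-sorting):

using System.Collections.Concurrent;
using System.Threading.Tasks;

public static List<int> FindDifferencesParallel(List<double> original, List<double> newList)
{
    var changes = new ConcurrentBag<int>();

    Parallel.For(0, original.Count,
        () => new List<int>(),                    // one private buffer per worker
        (i, _, local) =>
        {
            if (original[i] != newList[i])
                local.Add(i);
            return local;
        },
        local => { foreach (var i in local) changes.Add(i); }); // merge once per worker

    var result = new List<int>(changes);
    result.Sort(); // restore index order
    return result;
}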
A small optimization would be to initialize the results list with the capacity of the original:
https://msdn.microsoft.com/en-us/library/4kf43ys3(v=vs.110).aspx
If the size of the collection can be estimated, using the List(Int32) constructor and specifying the initial capacity eliminates the need to perform a number of resizing operations while adding elements to the List.
List<int> Changes = new List<int>(Original.Count); // List<T> exposes Count, not Length

C# List - A more efficient multiple at once insertRange through shifts in lists

I have a list that I divided into a fixed number of sections (with some that might be empty).
Every section contains unordered elements, however the sections themselves are ordered backwards.
I am referencing the beginning of each section through a fixed-size array whose elements are the indexes at which each section starts in the list.
I regularly extract the whole section at the tail of the list and, when I do so, I set its index inside the array to 0 (so the section will start to regrow from the head of the list), and then I circularly increment the lastSection variable that I use to keep track of which section is at the tail of the list.
With the same frequency I also need to insert back into the list some new elements that will be spread across one or more sections.
I chose a single sectioned list (instead of a list of lists or something like that) because, even if the sections can vary a lot in length (from empty to some thousands of elements), the total number of elements varies little during the application's runtime, AND because I also frequently need to get all the elements in the list and didn't want to concatenate multiple lists to get that result.
Graphical representation of the data structure
Existential question:
Up to here, did I make any mistakes in the choice of the data structure, given that the operations described above are all I do with it?
Going forward:
The problem I am trying to address, since this is the core of the application I am building (and I want to squeeze out every bit of performance I can, since it should run on smartphones), is: how can I do those multiple inserts as fast as possible?
Trivial solution:
For each new group of elements belonging to a certain section, just do an insertRange (sectionBeginning, groupOfElements).
Performance footprint:
every insertRange will force the list to shift all the content after the root of a section to the right, and with multiple insertRanges this means that some data will be shifted up to M times, where M is the number of insertRanges done with index != list.Count.
Little smarter solution:
Knowing before every multiple-inserts step which and how many new elements per section I need to add, I can add empty elements to the back of the list, perform M shifts of determined size, then copy the new elements to the corresponding "holes" left inside the list.
I could extend the list class and implement a new InsertRange(int[] indexes, IEnumerable[] collections), where each index points to the beginning of a section. However, I am worried about internal optimizations the List class might have that my for-loop shifts would miss, like an Array.Copy to which I do not think I have access. Is there a way to do a performant list shift so I can implement this and gain an advantage over multiple standard InsertRanges?
Note: index and collections should be ordered by section.
Graphical representation of the multiple-at once insertRange approach
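For what it's worth, here is a minimal sketch of the "build once" variant of the smarter solution: rather than shifting in place, allocate the result at its final capacity and copy every untouched run and every new group exactly once, so each element moves at most once instead of up to M times. InsertMany and its parameters are hypothetical; indexes and groups must be ordered by section, as the note above says:

using System.Collections.Generic;
using System.Linq;

static List<T> InsertMany<T>(List<T> source, int[] indexes, List<T>[] groups)
{
    var result = new List<T>(source.Count + groups.Sum(g => g.Count));
    int copied = 0;
    for (int k = 0; k < indexes.Length; k++)
    {
        // Copy the untouched run up to this insertion point...
        result.AddRange(source.GetRange(copied, indexes[k] - copied));
        copied = indexes[k];
        // ...then splice in the new group for this section.
        result.AddRange(groups[k]);
    }
    result.AddRange(source.GetRange(copied, source.Count - copied)); // remaining tail
    return result;
}

The trade-off is a second list's worth of memory in exchange for strictly one copy per element.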
Another similar thread about insertRange:
Replace multiple InsertRange() into efficient way
Another similar thread about shifts in lists:
Does code exist, for shifting List elements to left or right by specified amount, in C#?

Getting a new sorted list from a list; I only need X items and don't want to sort the whole list before getting the top items

I have a class:
public class Entity : IComparable<Entity>
{
    public float Priority { get; set; }
}
I create a list and fill it with Y items that are in no particular order:
List<Entity> entities = get_unorderd_list();
Now I want to sort the list according to the Priority value, but I only care about getting the highest X items in the right order. For performance reasons I don't want to use a regular .Sort() method, as X is a lot smaller than Y.
Should I write a custom sorting method? Or is there a way to do this?
Edit: not talking about getting a single value by using .Max().
Should I write a custom sorting method?
I don't know of a way of doing it easily from .NET itself. When I implemented sorting for Edulinq, I took this sort of approach - the whole ordering is a quick-sort, but I only sort as much as I need to in order to return the results so far.
Now, that's still a general approach - if you know how many results you need beforehand, you can probably do a lot better. You might want to build a heap-based solution, where you have a bounded heap (max size X) and then iterate over your input, adding values into your heap as you go. As soon as you've filled up the first X elements, you can start discarding new ones which are smaller than your smallest heap element, without even looking at the rest of the tree. For other elements, you perform the normal operations to insert the element, maintaining the ordering within the heap.
See the Wikipedia article on binary heaps stored as arrays for more information about how you might structure your heap.
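A minimal sketch of that bounded heap, assuming .NET 6+ where PriorityQueue<TElement, TPriority> is available (on older frameworks, an array-backed heap as in the Wikipedia article plays the same role). Names here are mine:

static List<Entity> TopX(IEnumerable<Entity> items, int x)
{
    // Min-heap keyed on Priority: the root is always the smallest kept item.
    var heap = new PriorityQueue<Entity, float>();
    foreach (var e in items)
    {
        if (heap.Count < x)
            heap.Enqueue(e, e.Priority);
        else if (e.Priority > heap.Peek().Priority)
            heap.EnqueueDequeue(e, e.Priority); // replace the current smallest
    }

    // Drain: the smallest comes out first, so reverse for highest-first order.
    var result = new List<Entity>(heap.Count);
    while (heap.Count > 0)
        result.Add(heap.Dequeue());
    result.Reverse();
    return result;
}

This is O(Y log X) overall, which beats a full O(Y log Y) sort when X is much smaller than Y.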
The trouble is that the last item in your unsorted list could be the highest item, so I should think you will have to iterate over the whole of Y.
If you mean MAX, simply use LINQ's Max to find the highest-priority item. You cannot write a method more efficient than Max, because you must compare all items in the list to find the max anyway.
var highestPriorityItem = unorderd_list.Max(x => x.Priority);
EDIT:
There is another way, more efficient than this: keeping the list sorted from the very beginning. That means you must keep the list sorted on each insert (insert each item in sorted order). This way, the time complexity of finding the max item is O(1), and the time complexity of inserting a new item is between O(log N) (finding the position) and O(N) (shifting elements).
Hope this helps.
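A sketch of that sorted insert, relying on the IComparable<Entity> the question's class already declares (assuming its CompareTo orders by Priority); List<T>.BinarySearch returns the bitwise complement of the insertion point when the item is not found:

static void InsertSorted(List<Entity> list, Entity item)
{
    int pos = list.BinarySearch(item); // O(log N), uses Entity's CompareTo
    if (pos < 0)
        pos = ~pos;                    // complement -> index of first larger element
    list.Insert(pos, item);            // the insert itself still shifts, O(N) worst case
}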

Efficiently pairing objects in lists based on key

So, here's the deal.
(My current use-case is in C#, but I'm also interested in the general algorithmic case)
I am given two Arrays of objects (I don't get to alter the code that creates these arrays, unfortunately).
Each object has (as part of it) a .Name property, a string.
These strings are unique per object, and they have zero or one matching strings in the other object.
What I need to do is efficiently pair these objects based on that string, into some sort of collection that allows me access to the paired objects. The strings need to match exactly to be considered a match, so I don't need any Upper or CaseInsensitive, etc.
Sadly, these lists are not sorted.
The lists themselves are maybe 30-50 items, but I need to repeat the algorithm on thousands of these array-pairs in a row, so efficiency is important.
Since I know that there's 0 or 1 match, and I know that most of them will have 1 match, I feel like there's a more efficient algorithm than x*y (foreach item in x, foreach item in y, if x == y then x and y are a match).
I believe the most likely options are:
Keep the unsorted list and just do x*y, but drop items from the list once I've found them so I don't check ones already-found,
OR:
Convert both to Dictionaries and then do an indexed lookup on each (array2[currentArray1Item])
OR:
Sort the lists myself (Array.Sort()), and then having sorted arrays I can probably do something clever like jump to the index in B where I'd expect to find it (wherever it was in A) and then move up or down based on string until I either find it or pass where it should've been.
Then once that's done I need to figure out how to store it, I suppose I can make a custom ObjectPair class that just holds objects A and B. No need to do anything fancy here, since I'm just going to ForEach on the pairs.
So the questions are:
Are any of the above algorithms the fastest way to do this (if not, what is?) and is there some existing C# structure that'd conveniently hold the found pairs?
EDIT: Array.Sort() is a method that exists, so I don't need to convert the array to List to sort. Good to know. Updated above.
The question I have is: how much efficiency do we gain from the special handling if it requires us to sort both input arrays? According to the documentation for Array.Sort, it is O(n log n) on average and O(n ^ 2) in the worst case (quicksort). Once we have both arrays sorted, we then have another O(n) amount of work because we have to loop through the first one.
I interpret this to mean that the overall amount of work might actually increase because of the number of iterations required to sort, then process. This of course would be a different story if you could guarantee sorted arrays at the start, but as you said you cannot. (I should also note that you would need to create a custom IComparer<T> implementation to pass to Array.Sort so it knows to use the .Name property. That's not runtime work, but it's still work :-)
You might consider using a LINQ join, which only iterates the inner array a single time (see here for pseudocode). This is as opposed to the nested foreach statements, which would iterate the inner array once for each element of the outer array. It's about as efficient as it can be in the general case and doesn't introduce the complexity of the special handling you suggested.
Here is an example implementation:
var pairs =
    from item1 in array1
    join item2 in array2 on item1.Name equals item2.Name
    select new { item1, item2 };

foreach (var pair in pairs)
{
    // Use the pair somehow
}
That very clearly states what you are doing with the data and also gives you an anonymous type representing each pair (so you don't have to invent a pairing). If you do end up going a different route, I would be interested in how it compares to this approach.
Sort the second array using the Array.Sort method, then match objects from the first array against it using binary search.
Generally, for 30-50 items this would be a little faster than brute force x*y.
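The dictionary option from the question is also a two-pass, roughly O(x + y) approach and very little code. A sketch, with 'Item' standing in for the real element type (my placeholder, not from the question):

// Index the second array by the unique Name, then look up each element
// of the first array; TryGetValue covers the zero-or-one-match case.
Dictionary<string, Item> byName = array2.ToDictionary(o => o.Name);

var pairs = new List<(Item A, Item B)>(array1.Length);
foreach (var a in array1)
{
    if (byName.TryGetValue(a.Name, out var b))
        pairs.Add((a, b));
}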

How can I sort an array of strings?

I have a list of input words separated by commas. I want to sort these words alphabetically and by length. How can I do this without using the built-in sorting functions?
Good question!! Sorting is probably the most important concept to learn as an up-and-coming computer scientist.
There are actually lots of different algorithms for sorting a list.
When you break all of those algorithms down, the most fundamental operation is the comparison of two items in the list, defining their "natural order".
For example, in order to sort a list of integers, I'd need a function that tells me, given any two integers X and Y whether X is less than, equal to, or greater than Y.
For your strings, you'll need the same thing: a function that tells you which of the strings has the "lesser" or "greater" value, or whether they're equal.
Traditionally, these "comparator" functions look something like this:
int CompareStrings(String a, String b) {
    if (a < b)
        return -1;
    else if (a > b)
        return 1;
    else
        return 0;
}
I've left out some of the details (like, how do you compute whether a is less than or greater than b? clue: iterate through the characters), but that's the basic skeleton of any comparison function. It returns a value less than zero if the first element is smaller and a value greater than zero if the first element is greater, returning zero if the elements have equal value.
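One way to fill in that detail: compare character by character, and if one string is a prefix of the other, let the shorter one come first. This is my sketch of a plain ordinal comparison, not the only possible order:

static int CompareStrings(string a, string b)
{
    int shared = Math.Min(a.Length, b.Length);
    for (int i = 0; i < shared; i++)
    {
        if (a[i] != b[i])
            return a[i] < b[i] ? -1 : 1; // first differing character decides
    }
    return a.Length.CompareTo(b.Length);  // prefix case: shorter string is "smaller"
}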
But what does that have to do with sorting?
A sort routine will call that function for pairs of elements in your list, using the result of the function to figure out how to rearrange the items into a sorted list. The comparison function defines the "natural order", and the "sorting algorithm" defines the logic for calling and responding to the results of the comparison function.
Each algorithm is like a big-picture strategy for guaranteeing that ANY input will be correctly sorted. Here are a few of the algorithms that you'll probably want to know about:
Bubble Sort:
Iterate through the list, calling the comparison function for all adjacent pairs of elements. Whenever you get a result greater than zero (meaning that the first element is larger than the second one), swap the two values. Then move on to the next pair. When you get to the end of the list, if you didn't have to swap ANY pairs, then congratulations, the list is sorted! If you DID have to perform any swaps, go back to the beginning and start over. Repeat this process until there are no more swaps.
NOTE: this is usually not a very efficient way to sort a list, because in the worst cases, it might require you to scan the whole list as many as N times, for a list with N elements.
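As a concrete sketch, here is bubble sort over the question's words, with a comparator that orders by length first and alphabetically second (that key order is my assumption; reverse the precedence in Compare if alphabetical should dominate):

static void BubbleSort(string[] words)
{
    bool swapped;
    do
    {
        swapped = false;
        for (int i = 0; i < words.Length - 1; i++)
        {
            if (Compare(words[i], words[i + 1]) > 0)
            {
                (words[i], words[i + 1]) = (words[i + 1], words[i]); // swap the pair
                swapped = true;
            }
        }
    } while (swapped); // a full pass with no swaps means the list is sorted
}

static int Compare(string a, string b)
{
    int byLength = a.Length.CompareTo(b.Length);
    return byLength != 0 ? byLength : string.CompareOrdinal(a, b);
}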
Merge Sort:
This is one of the most popular divide-and-conquer algorithms for sorting a list. The basic idea is that, if you have two already-sorted lists, it's easy to merge them. Just start from the beginning of each list and remove the first element of whichever list has the smallest starting value. Repeat this process until you've consumed all the items from both lists, and then you're done!
1 4 8 10
2 5 7 9
------------ becomes ------------>
1 2 4 5 7 8 9 10
But what if you don't have two sorted lists? What if you have just one list, and its elements are in random order?
That's the clever thing about merge sort. You can break any single list into smaller pieces, each of which is either an unsorted list, a sorted list, or a single element (which, if you think about it, is actually a sorted list of length 1).
So the first step in a merge sort algorithm is to divide your overall list into smaller and smaller sublists. At the tiniest levels (where each list only has one or two elements), they're very easy to sort. And once sorted, it's easy to merge any two adjacent sorted lists into a larger sorted list containing all the elements of the two sublists.
NOTE: This algorithm is much better than the bubble sort method, described above, in terms of its worst-case-scenario efficiency. I won't go into a detailed explanation (which involves some fairly trivial math, but would take some time to explain), but the quick reason for the increased efficiency is that this algorithm breaks its problem into ideal-sized chunks and then merges the results of those chunks. The bubble sort algorithm tackles the whole thing at once, so it doesn't get the benefit of "divide-and-conquer".
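And a sketch of merge sort for the same words, reusing the length-then-alphabetical Compare from the bubble sort sketch above; the range syntax (a[..mid]) assumes C# 8 or later:

static string[] MergeSort(string[] a)
{
    if (a.Length <= 1)
        return a; // a single element is already sorted

    int mid = a.Length / 2;
    string[] left = MergeSort(a[..mid]);   // sort each half...
    string[] right = MergeSort(a[mid..]);

    var merged = new string[a.Length];     // ...then merge them
    int i = 0, j = 0, k = 0;
    while (i < left.Length && j < right.Length)
        merged[k++] = Compare(left[i], right[j]) <= 0 ? left[i++] : right[j++];
    while (i < left.Length) merged[k++] = left[i++];
    while (j < right.Length) merged[k++] = right[j++];
    return merged;
}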
Those are just two algorithms for sorting a list, but there are a lot of other interesting techniques, each with its own advantages and disadvantages: Quick Sort, Radix Sort, Selection Sort, Heap Sort, Shell Sort, and Bucket Sort.
The internet is overflowing with interesting information about sorting. Here's a good place to start:
http://en.wikipedia.org/wiki/Sorting_algorithms
Create a console application and paste this into Program.cs as the body of the class (note: the class must be declared static, since Sort is an extension method).
public static void Main(string[] args)
{
    string[] strList = "a,b,c,d,e,f,a,a,b".Split(new[] { ',' }, StringSplitOptions.RemoveEmptyEntries);
    foreach (string s in strList.Sort())
        Console.WriteLine(s);
}

public static string[] Sort(this string[] strList)
{
    return strList.OrderBy(i => i).ToArray();
}
Notice that I do use a built-in method, OrderBy. As other answers point out, there are many different sort algorithms you could implement there, and I think my code snippet does everything for you except the actual sort algorithm.
Some C# specific sorting tutorials
There is an entire area of study built around sorting algorithms. You may want to choose a simple one and implement it.
Though it won't be the most performant, it shouldn't take you too long to implement a bubble sort.
If you don't want to use built-in functions, you have to create one yourself. I would recommend bubble sort or some similar algorithm. Bubble sort is not an efficient algorithm, but it gets the work done and is easy to understand.
You will find much good reading on Wikipedia.
I would recommend reading the Wikipedia article on quicksort.
Still not sure why you don't want to use the built-in sort?
Bubble sort damages the brain.
Insertion sort is at least as simple to understand and code, and is actually useful in practice (for very small data sets, and nearly-sorted data). It works like this:
Suppose that the first n items are already in order (you can start with n = 1, since obviously one thing on its own is "in the correct order").
Take the (n+1)th item in your array. Call this the "pivot". Starting with the nth item and working down:
- if it is bigger than the pivot, move it one space to the right (to create a "gap" to the left of it).
- otherwise, leave it in place, put the "pivot" one space to the right of it (that is, in the "gap" if you moved anything, or where it started if you moved nothing), and stop.
Now the first n+1 items in the array are in order, because the pivot is to the right of everything smaller than it, and to the left of everything bigger than it. Since you started with n items in order, that's progress.
Repeat, with n increasing by 1 at each step, until you've processed the whole list.
This corresponds to one way that you might physically put a series of folders into a filing cabinet in order: put one in; then put another one into its correct position by pushing everything that belongs after it over by one space to make room; repeat until finished. Nobody ever sorts physical objects by bubble sort, so it's a mystery to me why it's considered "simple".
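A direct transcription of those steps, as a sketch (using the same length-then-alphabetical Compare as the earlier sketches in this thread; any comparator works):

static void InsertionSort(string[] a)
{
    for (int n = 1; n < a.Length; n++)   // first n items are already in order
    {
        string pivot = a[n];
        int i = n - 1;
        while (i >= 0 && Compare(a[i], pivot) > 0)
        {
            a[i + 1] = a[i];             // shift right to open the gap
            i--;
        }
        a[i + 1] = pivot;                // drop the pivot into the gap
    }
}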
All that's left now is that you need to be able to work out, given two strings, whether the first is greater than the second. I'm not quite sure what you mean by "alphabetical and length": alphabetical order is done by comparing one character at a time from each string. If they're not the same, that's your order. If they are the same, look at the next one, unless you're out of characters in one of the strings, in which case the shorter one is the "smaller" one.
Use NSort
I ran across the NSort library a couple of years ago in the book Windows Developer Power Tools. The NSort library implements a number of sorting algorithms. The main advantage of using something like NSort over writing your own sort is that it is already tested and optimized.
Posting link to fast string sort code in C#:
http://www.codeproject.com/KB/cs/fast_string_sort.aspx
Another point:
The suggested comparator above is not recommended for non-English languages:
int CompareStrings(String a, String b) {
    if (a < b)
        return -1;
    else if (a > b)
        return 1;
    else
        return 0;
}
Checkout this link for non-English language sort:
http://msdn.microsoft.com/en-us/goglobal/bb688122
And as mentioned, use NSort for really gigantic arrays that don't fit in memory.
