I'm searching through a generic list (or IQueryable) that has three columns. I'm trying to find the value of the third column based on the first and second, but the search is really slow. For a single search the delay isn't noticeable, but I'm performing this search in a loop, and 700 iterations take over 2 minutes in total, which is unusable. Columns 1 and 2 are int and column 3 is a double. Here is the LINQ I'm using:
public static Distance FindByStartAndEnd(int start, int end, IQueryable<Distance> distanceList)
{
Distance item = distanceList.Where(h => h.Start == start && h.End == end).FirstOrDefault();
return item ;
}
There could be up to 60,000 entries in the IQueryable list. I know that is quite a lot, but I didn't think it would pose any problem for searching.
So my question is: is there a better way to search through a collection when I need to match 2 columns to get the value of a third? I need all 700 searches to be almost instant, but each one takes about 300ms, which soon mounts up.
UPDATE - Final Solution
I've now created a dictionary using Tuple with start and end as the key. I think this could be the right solution.
var dictionary = new Dictionary<Tuple<int, int>, double>();
var key = new Tuple<int, int>(Convert.ToInt32(reader[0]), Convert.ToInt32(reader[1]));
var value = Convert.ToDouble(reader[2]);
if (value <= distance)
{
dictionary.Add(key, value);
}
var key = new Tuple<int, int>(5, 20);
Works fine - much faster
Create a dictionary where columns 1 and 2 create the key. You create the dictionary once and then your searches will be almost instant.
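A minimal sketch of this idea, assuming C# 7 value tuples and a `Distance` class shaped like the one in the question. The property name `Value` and the helper class `DistanceIndex` are hypothetical names chosen for illustration:

```csharp
using System.Collections.Generic;

public class Distance
{
    public int Start { get; set; }
    public int End { get; set; }
    public double Value { get; set; }   // hypothetical name for the third column
}

public static class DistanceIndex
{
    // Build once, O(n). Assumes each (Start, End) pair is unique;
    // if it is not, a later entry overwrites an earlier one here.
    public static Dictionary<(int, int), Distance> Build(IEnumerable<Distance> distances)
    {
        var index = new Dictionary<(int, int), Distance>();
        foreach (var d in distances)
            index[(d.Start, d.End)] = d;
        return index;
    }

    // Each lookup is a hash probe instead of a scan over all entries.
    // Returns null when the pair is not present.
    public static Distance Find(Dictionary<(int, int), Distance> index, int start, int end)
    {
        return index.TryGetValue((start, end), out var found) ? found : null;
    }
}
```

The index is built once before the loop of 700 searches, and each iteration then calls `Find` instead of re-running the query.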
If you have control over your collection and model classes, there is a library which allows you to index the properties of the class, which can greatly speed up searching.
http://i4o.codeplex.com/
I'd give a HashSet a try. This should speed things up ;)
Create a single value out of the first two columns, for example by packing them into a long, and use that as the key in a dictionary:
public static long Combine(int start, int end) {
    // Cast end to uint first so a negative end does not sign-extend over the start bits
    return ((long)start << 32) | (uint)end;
}
Dictionary<long, Distance> lookup = distanceList.ToDictionary(h => Combine(h.Start, h.End));
Then you can look up the value:
public static Distance FindByStartAndEnd(int start, int end, IQueryable<Distance> distanceList) {
    Distance item;
    // lookup is the dictionary built once above; distanceList is no longer scanned here
    if (!lookup.TryGetValue(Combine(start, end), out item)) {
        item = null;
    }
    return item;
}
Getting an item from a dictionary is close to an O(1) operation, which should make a dramatic difference compared with the O(n) operation of looping through the items to find one.
Your problem is that LINQ has to evaluate the expression tree every time you fetch an item. Instead, call this method once with multiple start and end values:
public static IEnumerable<Distance> FindByStartAndEnd
(IEnumerable<KeyValuePair<int, int>> startAndEnd,
IQueryable<Distance> distanceList)
{
return
from item in distanceList
where
startAndEnd.Select(s => s.Key).Contains(item.Start)
&& startAndEnd.Select(s => s.Value).Contains(item.End)
select item;
}
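Note that the query above tests Start and End independently, so it can also return items whose Start comes from one requested pair and whose End comes from another. A hedged sketch of an exact pair filter, assuming the data is already in memory (a LINQ provider may not be able to translate the tuple lookup) and using the hypothetical names `PairFilter` and `FindByPairs`:

```csharp
using System.Collections.Generic;
using System.Linq;

public class Distance
{
    public int Start { get; set; }
    public int End { get; set; }
}

public static class PairFilter
{
    // Returns only the items whose (Start, End) is one of the requested pairs.
    public static List<Distance> FindByPairs(
        IEnumerable<Distance> distances,
        IEnumerable<KeyValuePair<int, int>> startAndEnd)
    {
        // Hash the requested pairs once so each item is tested in O(1).
        var wanted = new HashSet<(int, int)>(startAndEnd.Select(p => (p.Key, p.Value)));
        return distances.Where(d => wanted.Contains((d.Start, d.End))).ToList();
    }
}
```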
Related
Say I have a list of int called filmDurations that holds film durations in minutes.
I also have an int parameter called flightDuration, the duration of a given flight in minutes.
My objective is:
For any given flightDuration, I want to find two films from filmDurations whose combined length ends exactly 30 minutes before the flight does.
For example :
filmDurations = {130,105,125,140,120}
flightDuration = 280
My output : (130 120)
I can do it with nested loops, but that is inefficient and time consuming.
I want to do it more efficiently.
I thought about using LINQ, but that is still O(n^2).
What is the most efficient way?
Edit: I want to clarify one thing.
I want to find filmDurations[i] and filmDurations[j] such that
filmDurations[i] + filmDurations[j] == flightDuration - 30
And assume I have a very large number of film durations.
You could sort all durations (removing duplicates), which is O(n log n), and then iterate through them (up to the length flight-duration - 30). For each one, binary-search for the corresponding length of the second film (O(log n)).
This way you get all duration pairs in O(n log n).
You can also use a HashMap (duration -> films) to find matching pairs.
This way you avoid sorting and binary search. Iterate through all durations and, for each one, look up in the map whether there is an entry with duration = (flight-duration - 30 - current duration).
Filling the map takes O(n), each lookup is O(1), and you need to iterate over all durations once.
-> The overall complexity is O(n), but you lose the ability to find 'nearly matching' pairs, which would be easy to implement with the sorted-list approach described above.
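A rough sketch of the sort plus binary search variant, using the hypothetical names `FilmPairs` and `FindPair`. Unlike the description above, it keeps duplicate durations so that two different films of equal length can still form a pair, and it returns only the first pair found rather than all of them:

```csharp
using System.Collections.Generic;

public static class FilmPairs
{
    // Returns one pair of durations that sums to flightDuration - margin,
    // or null when no such pair exists.
    public static (int, int)? FindPair(IEnumerable<int> filmDurations, int flightDuration, int margin = 30)
    {
        var sorted = new List<int>(filmDurations);
        sorted.Sort();                                   // O(n log n)
        int target = flightDuration - margin;

        for (int i = 0; i < sorted.Count - 1; i++)
        {
            int needed = target - sorted[i];
            // Search only to the right of i so a film is never paired with itself.
            int j = sorted.BinarySearch(i + 1, sorted.Count - i - 1, needed, null);
            if (j >= 0)
                return (sorted[i], sorted[j]);
        }
        return null;
    }
}
```

With the example from the question, `FilmPairs.FindPair(new[] { 130, 105, 125, 140, 120 }, 280)` yields the durations 120 and 130.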
As Leisen Chang said, you can put all items into a dictionary. After doing that, rewrite your equation
filmDurations[i] + filmDurations[j] == flightDuration - 30
as
filmDurations[i] == (flightDuration - 30 - filmDurations[j])
Now, for each item filmDurations[j], search the dictionary for (flightDuration - 30 - filmDurations[j]). If such an item is found, you have a solution.
The following code implements this concept:
public class IndicesSearch
{
    private readonly List<int> filmDurations;
    private readonly Dictionary<int, int> valuesAndIndices;
    public IndicesSearch(List<int> filmDurations)
    {
        this.filmDurations = filmDurations;
        // preprocessing O(n); the indexer keeps the last index of each value,
        // so duplicate durations do not throw as ToDictionary would
        valuesAndIndices = new Dictionary<int, int>();
        for (var i = 0; i < filmDurations.Count; ++i)
            valuesAndIndices[filmDurations[i]] = i;
    }
    public (int, int) FindIndices(
        int flightDuration,
        int diff = 30)
    {
        // search, also O(n)
        for (var i = 0; i < filmDurations.Count; ++i)
        {
            var filmDuration = filmDurations[i];
            var toFind = flightDuration - filmDuration - diff;
            // j != i prevents pairing a film with itself
            if (valuesAndIndices.TryGetValue(toFind, out var j) && j != i)
                return (i, j);
        }
        // no solution found
        return (-1, -1); // or throw exception
    }
}
I have a Dictionary<string,T> where the string is the key of a record, and I need to maintain two other pieces of information for each record in the dictionary: the category of the record and its redundancy (how many times it is repeated).
For example, the record XYZ1 is of category 1 and is repeated 1 time, so the stored entry has to be something like this:
"XYZ1", {1,1}
Now moving on, I may encounter the same record again in my dataset, so the value for that key has to be updated like:
"XYZ1", {1,2}
"XYZ1", {1,3}
...
Since I am processing a large number of records, such as 100K, I tried this approach, but it seems inefficient: fetching the value from the dictionary, slicing {1,1}, and converting both slices to integers adds a lot of overhead.
I was thinking of using binary digits to represent both category and repetition, and maybe a bitmask to extract each piece.
Edit: I tried using an object with 2 properties, and then Tuple<int,int>. Complexity got worse!
My question: is it possible to do so?
If not (in terms of complexity), any suggestions?
What is your type T? You could define a custom type which holds the information you need (category and occurrences).
class MyInfo {
public int c { get; set; }
public int o { get; set; }
}
Dictionary<String, MyInfo> data;
Then, when traversing your data, you can easily check whether a key is already present. If it is, just increment the occurrences; otherwise insert a new element.
MyInfo d;
foreach (var e in elements) {
    if (!data.TryGetValue(e.key, out d))
        data.Add(e.key, new MyInfo { c = e.cat, o = 1 });
    else
        d.o++;
}
EDIT
You could also combine the category and the number of occurrences into one UInt64. For instance, put the category in the upper 32 bits (so you can have 4 billion categories) and the number of occurrences in the lower 32 bits (so each key can occur 4 billion times).
Dictionary<string, UInt64> data;
UInt64 d;
foreach (var e in elements) {
    if (!data.TryGetValue(e.key, out d))
        // widen to 64 bits before shifting; shifting a 32-bit int by 32 would leave it unchanged
        data[e.key] = ((UInt64)e.cat << 32) + 1;
    else
        data[e.key] = d + 1;
}
And if you want to get the number of occurrences for one specific key you can just inspect the respective part of the value.
var d = data["somekey"];
var occurrences = d & 0xFFFFFFFF;
var category = d >> 32;
It seems like category never changes. So rather than using a simple string for the key of your dictionary, I would instead do something like:
Dictionary<Tuple<string,int>,int> where the key of the dictionary is a Tuple<string,int> where the string is the record and the int is the category. Then the value in the dictionary is just a count.
A dictionary is probably going to be the fastest data structure for what you're trying to accomplish, as it has near constant time, O(1), lookup and insertion.
You can speed it up a little bit by using the Tuple, as now the category is part of the key and no longer a bit of information you have to access separately.
At the same time you could also keep the string as the key and store a Tuple<int,int> as the value and simply set Item1 as the category and Item2 as the count.
Either way is going to be roughly equivalent in speed. Processing 100k records in such a manner should be pretty fast either way.
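A small sketch of the first variant, with the record and its category together as the key. It uses a C# 7 value tuple in place of Tuple<string,int>, and the names `RecordCounter` and `Count` are made up for the example:

```csharp
using System.Collections.Generic;

public static class RecordCounter
{
    // Counts how often each (record, category) pair occurs in a single pass.
    public static Dictionary<(string, int), int> Count(IEnumerable<(string, int)> records)
    {
        var counts = new Dictionary<(string, int), int>();
        foreach (var key in records)
        {
            // TryGetValue leaves seen at 0 when the key is not present yet.
            counts.TryGetValue(key, out var seen);
            counts[key] = seen + 1;
        }
        return counts;
    }
}
```

No string slicing or parsing is involved, so the per-record work is one hash lookup and one write.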
I'm storing some data in a Math.NET vector, because I have to do some calculations on it as a whole. Each data point comes with the time at which it was collected. For example:
Initial = 5, Time 2 = 7, Time 3 = 8, Time 4 = 10
So when I store the data in a Vector it looks like this.
stateVectorData = [5,7,8,10]
Now sometimes I need to extract a single entry of the vector. I don't have the index itself, only the time information. So what I tried is a dictionary that maps the time to the index of the data in my stateVector.
Dictionary<int, int> stateDictionary = new Dictionary<int, int>(); //Dict(Time, index)
Every time I get new data, I add an entry to the dictionary (and of course to the stateVector). So at Time 2 I did:
stateDictionary.Add(2,1);
Now this works as long as I don't change my vector. Unfortunately, I have to delete an entry in the vector when it gets too old. Assuming time 2 is too old, I delete the second entry and get the resulting vector:
stateVector = [5,8,10]
Now my dictionary has the wrong index values stored.
I can think of two possible solutions.
1. Loop through the dictionary and decrease every value (with key > 2) by 1.
2. What I think would be more elegant: store a reference to a vector entry in the dictionary instead of the index.
So something like
Dictionary<int, ref int> stateDictionary =
new Dictionary<int, ref int>(); //Dict(Time, reference to vectorentry)
stateDictionary.Add(2, ref stateVector[1]);
Using something like this, I wouldn't have to care about deleting entries in the vector, as I would still have references to the remaining vector entries. But I know it's not possible to store a reference like this in C#.
So my question is: is there any alternative to looping through the whole dictionary? Or is there another solution without a dictionary that I don't see at the moment?
Edit to answer juharr:
Time information doesn't always increase by one. Depends on some parallel running process and how long it takes. Probably increasing between 1 to 3. But also could be more.
There are some values in the vector which never get deleted. I tried to show this with the initial value of 5 which stays in the vector.
Edit 2:
The vector stores at least 5000 to 6000 elements. The maximum is not defined at the moment, as it is limited by the number of elements I can handle in real time; in my case I have about 0.01s to do my further calculations. This is why I am looking for an efficient way, so I can increase the number of elements in the vector (or increase the maximum "age" of my vector entries).
I need the whole vector for calculations about 3 times as often as I add a value.
Deleting an entry is the least frequent operation, and finding a single value by its time key is the most frequent, maybe 30 to 100 times a second.
I know this all sounds very vague, but the frequency of finding and deleting depends on another process, which can vary a lot.
I hope you can help me anyway. Thanks so far.
Edit 3:
@Robinson
The exact number of times I need the whole vector also depends on the parallel process. Minimum would be two times every iteration (so twice in 0.01s), maximum at least 4 to 6 times every iteration.
Again, the size of the vector is what I want to maximize. So assumed to be very big.
Edit Solution:
First, thanks to everyone who helped me.
After experimenting a bit, I'm using the following construction.
I'm using a List in which I store the indexes into my state vector.
Additionally, I use a Dictionary to map my time key to the List entry.
So when I delete something in the state vector, I loop only over the List, which seems to be much faster than looping over the dictionary.
So it is:
stateVectorData = [5,7,8,10]
IndexList = [1,2,3];
stateDictionary = { Time 2, indexInList = 0; Time 3, indexInList = 1; Time 4, indexInList = 2 }
TimeKey->stateDictionary->indexInList -> IndexList -> indexInStateVector -> data
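One possible reading of this construction, as a sketch only. The post does not say how a deleted entry is handled in IndexList, so marking the slot with -1 is an assumption, as are the class name `TimedState` and its method names. A `List<double>` stands in for the Math.NET vector:

```csharp
using System.Collections.Generic;

public class TimedState
{
    private readonly List<double> stateVectorData = new List<double>();
    // slot -> index in stateVectorData, or -1 once the entry was deleted (assumption)
    private readonly List<int> indexList = new List<int>();
    // time key -> slot in indexList
    private readonly Dictionary<int, int> stateDictionary = new Dictionary<int, int>();

    public int Count => stateVectorData.Count;

    // Values without a time key, such as the initial value, are never deleted.
    public void AddPermanent(double value) => stateVectorData.Add(value);

    public void Add(int time, double value)
    {
        stateDictionary.Add(time, indexList.Count);
        indexList.Add(stateVectorData.Count);
        stateVectorData.Add(value);
    }

    public double Get(int time) => stateVectorData[indexList[stateDictionary[time]]];

    public void Remove(int time)
    {
        int slot = stateDictionary[time];
        int removedIndex = indexList[slot];
        stateVectorData.RemoveAt(removedIndex);
        stateDictionary.Remove(time);
        indexList[slot] = -1;   // keep the slot so the other slot numbers stay valid

        // Only the plain list is walked; the dictionary itself is not touched.
        for (int i = 0; i < indexList.Count; i++)
            if (indexList[i] > removedIndex)
                indexList[i]--;
    }
}
```

In this form the list of slots only grows, so a long-running process would need to compact it occasionally.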
You can try this:
public class Vector
{
private List<int> _timeElements = new List<int>();
public Vector(int[] times)
{
Add(times);
}
public void Add(int time)
{
_timeElements.Add(time);
}
public void Add(int[] times)
{
_timeElements.AddRange(times);
}
public void Remove(int time)
{
_timeElements.Remove(time);
if (OnRemove != null)
OnRemove(this, time);
}
public List<int> Elements { get { return _timeElements; } }
public event Action<Vector, int> OnRemove;
}
public class Vectors
{
private Dictionary<int, List<Vector>> _timeIndex;
public Vectors(int maxTimeSize)
{
_timeIndex = new Dictionary<int, List<Vector>>(maxTimeSize);
for (var i = 0; i < maxTimeSize; i++)
_timeIndex.Add(i, new List<Vector>());
List = new List<Vector>();
}
public List<Vector> FindVectorsByTime(int time)
{
return _timeIndex[time];
}
public List<Vector> List { get; private set; }
public void Add(Vector vector)
{
List.Add(vector);
vector.Elements.ForEach(element => _timeIndex[element].Add(vector));
vector.OnRemove += OnRemove;
}
private void OnRemove(Vector vector, int time)
{
_timeIndex[time].Remove(vector);
}
}
To use:
var vectors = new Vectors(maxTimeSize: 6000);
var vector1 = new Vector(new[] { 5, 30, 8, 20 });
var vector2 = new Vector(new[] { 25, 5, 23, 11 });
vectors.Add(vector1);
vectors.Add(vector2);
var findsTwo = vectors.FindVectorsByTime(time: 5);
vector1.Remove(time: 5);
var findsOne = vectors.FindVectorsByTime(time: 5);
The same can be done for adding times. Note that the code is for illustration purposes only.
Let me explain the situation first:
I receive an index from my binary search on a collection, and quickly jump to it to do some processing. Next I want to jump to the next item in the list. But this next item is not necessarily the one that immediately follows; it could be 3 or 4 items later. Here is my data to illustrate the situation:
Time ID
0604 ABCDE
0604 EFGH
0604 IJKL
0626 Some Data1
0626 Some Data2
0626 Some Data3
0626 Some Data4
Let's say the binary search returns index 0, so I jump to index 0 (0604 ABCDE) and process/consume all 0604 records. Now that I am at index 0, how do I jump to index 3 (0626) and consume/process all of those? Keep in mind the data will not always be the same, so I can't simply jump to index + 3.
Here's my code:
var matches = recordList.Where(d => d.DateDetails == oldPointer);
var lookup = matches.ToLookup(d => d.DateDetails).First();
tempList = lookup.ToList();// build templist
oldPointer here is the index I get from Binary search. I take this up and build a templist. Now after this I want to jump to 0626.
How many records with the same "old pointer" do you typically expect? Is it usually going to be fewer than 100? If so, don't over-complicate it; just iterate:
public static int FindNextPointerIndex(int oldIndex, string oldPointer, ...)
{
for(int i = oldIndex + 1; i < collection.Count ; i++)
{
if(collection[i].DateDetails != oldPointer) return i;
}
return -1;
}
If you want something more elegant, you will have to pre-index the data by DateDetails, presumably using something like a ToLookup over the entire collection, but: note that this makes changes to the data more complicated.
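As a sketch of what such a pre-index could look like, assuming a record type with a string DateDetails property (the `Record` class, its `Id` property and the `RecordIndex` helper are hypothetical):

```csharp
using System.Collections.Generic;
using System.Linq;

public class Record
{
    public string DateDetails { get; set; }
    public string Id { get; set; }
}

public static class RecordIndex
{
    // Built once over the whole collection; it has to be rebuilt when the data changes.
    public static ILookup<string, Record> Build(IEnumerable<Record> records)
    {
        return records.ToLookup(r => r.DateDetails);
    }
}
```

After `var index = RecordIndex.Build(recordList);`, the expression `index["0626"]` yields every record with that DateDetails value, and a key that is not present yields an empty sequence.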
Have a look at the skip list: http://en.wikipedia.org/wiki/Skip_list
It will allow you to jump forward more than one item at a time in your linked list, but the downside is that finding the start of your search will be O(n).
I've got a List of structs. Each struct has a field x. I would like to select the structs that are close to each other in x. In other words, I'd like to cluster them by x.
I guess there should be a one-line solution.
Thanks in advance.
If I understood correctly what you want, then you might need to sort your list by the structure's field X.
Look at the GroupBy extension method:
var items = mylist.GroupBy(c => c.X);
This article gives a lot of examples using group by.
If you're doing graph-style clustering, the easiest way to do it is by building up a list of clusters which is initially empty. Then loop over the input and, for each value, find all of the clusters which have at least one element which is close to the current value. All those clusters should then be merged together with the value. If there aren't any, then the value goes into a cluster all by itself.
Here is some sample code for how to do it with a simple list of integers.
IEnumerable<int> input;
int threshold;
List<List<int>> clusters = new List<List<int>>();
foreach(var current in input)
{
// Search the current list of clusters for ones which contain at least one
// entry such that the difference between it and current is at most the threshold
var matchingClusters =
clusters.Where(
cluster => cluster.Any(
val => Math.Abs(current - val) <= threshold)
).ToList();
// Merge all the clusters that were found, plus current, into a new cluster.
// Replace the matched clusters with this new one.
IEnumerable<int> newCluster = new List<int>(new[] { current });
foreach (var match in matchingClusters)
{
clusters.Remove(match);
newCluster = newCluster.Concat(match);
}
clusters.Add(newCluster.ToList());
}
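For one-dimensional values like these, the same grouping can also be reached by sorting and then starting a new cluster wherever the gap between neighbouring values exceeds the threshold. A minimal sketch of that alternative, with the hypothetical names `GapClustering` and `Cluster`:

```csharp
using System.Collections.Generic;
using System.Linq;

public static class GapClustering
{
    public static List<List<int>> Cluster(IEnumerable<int> input, int threshold)
    {
        var clusters = new List<List<int>>();
        foreach (var value in input.OrderBy(v => v))
        {
            // Start a new cluster when the gap to the previous sorted value
            // is larger than the threshold (or when this is the first value).
            if (clusters.Count == 0 || value - clusters[clusters.Count - 1].Last() > threshold)
                clusters.Add(new List<int>());
            clusters[clusters.Count - 1].Add(value);
        }
        return clusters;
    }
}
```

This avoids scanning every existing cluster for each value, so the cost is dominated by the sort.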