I have two time series that contain Bar objects; each Bar contains a member variable of type long, and each time series is stored in its own BlockingCollection. Each time series is sorted in ascending order of the long values.
I'd like to devise a merge algorithm that takes away the Bar whose long member has the lowest value relative to the corresponding element in the other BlockingCollection.
For example, if the long value of the first Bar (bar1) in BlockingCollection1 is lower than the long value of the first Bar (bar2) in BlockingCollection2, then Take() from BlockingCollection1 and Add() to a master BlockingCollection, essentially ending up with a merged stream of Bar objects sorted by the value of each Bar's long member variable.
I'd like to later extend this to n BlockingCollections, not just 2. I played around with arrays that hold the long values to make the mapping easier, but arrays seem handier when working with pointers than for this particular algorithm.
I wonder whether anyone can point me to a LINQ implementation and comment on how computationally expensive such an approach is. I am asking because throughput is important: there are hundreds of millions of Bar objects flowing through the collections. If someone has a cleverer idea than using LINQ, that would be very welcome. I came across some ideas on merge algorithms at Dr. Dobb's some time ago but cannot find the article anymore. In case it is not apparent by now, I am targeting C# (.NET 4.0).
Thanks a lot
Edit: I forgot to mention that the merging process is supposed to happen concurrently with the workers that add new items to the BlockingCollections (running on different tasks).
Here's an implementation of Merge. It should run in O(cN) time, where c is the number of collections and N is the total number of items. Is this what you're looking for?
// Requires System.Collections.Concurrent, System.Collections.Generic and System.Linq.
public static BlockingCollection<Bar> Merge(IEnumerable<BlockingCollection<Bar>> collections)
{
    BlockingCollection<Bar> masterCollection = new BlockingCollection<Bar>();

    // One wrapper per source, kept sorted by each source's current lowest Bar.
    LinkedList<BarWrapper> orderedLows = new LinkedList<BarWrapper>();
    foreach (var c in collections)
        OrderedInsert(new BarWrapper { Value = c.Take(), Source = c }, orderedLows);

    while (orderedLows.Any())
    {
        BarWrapper currentLow = orderedLows.First.Value;
        orderedLows.RemoveFirst();

        // Refill from the source that just supplied the lowest Bar.
        BlockingCollection<Bar> collection = currentLow.Source;
        if (collection.Any())
            OrderedInsert(new BarWrapper { Value = collection.Take(), Source = collection }, orderedLows);

        masterCollection.Add(currentLow.Value);
    }
    return masterCollection;
}

private static void OrderedInsert(BarWrapper bar, LinkedList<BarWrapper> orderedLows)
{
    if (!orderedLows.Any())
    {
        orderedLows.AddFirst(bar);
        return;
    }

    var iterator = orderedLows.First;
    while (iterator != null && iterator.Value.Value.LongValue < bar.Value.LongValue)
        iterator = iterator.Next;

    if (iterator == null)
        orderedLows.AddLast(bar);
    else
        orderedLows.AddBefore(iterator, bar);
}

class BarWrapper
{
    public Bar Value { get; set; }
    public BlockingCollection<Bar> Source { get; set; }
}

class Bar
{
    public Bar(long l)
    {
        this.LongValue = l;
    }

    public long LongValue { get; set; }
}
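Given the edit about concurrent producers, one caveat: the collection.Any() check above ends a source the moment it is momentarily empty, even if its producer will add more items later. A hedged variant, assuming each producer calls CompleteAdding() when it is finished, is to block on Take() until the source is truly done:

private static bool TryTakeNext(BlockingCollection<Bar> source, out Bar bar)
{
    bar = null;
    try
    {
        // Blocks until an item arrives, or throws once CompleteAdding
        // has been called and the collection has drained.
        bar = source.Take();
        return true;
    }
    catch (InvalidOperationException)
    {
        return false; // source is complete and empty
    }
}

Inside Merge, the Any()/Take() pairs would then become calls to this helper, so each source keeps feeding the merge until its producer marks it complete.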
I am working with a class, say Widget, that has a large number of numeric real-world attributes (e.g., height, length, weight, cost, etc.). There are different types of widgets (sprockets, cogs, etc.), but each widget shares the exact same set of attributes (the values differ by widget, of course, but they all have a height, a weight, etc.). I have thousands of each type of widget (1,000 cogs, 1,000 sprockets, etc.).
I need to perform a lot of calculations on these attributes (say, calculating the weighted average of the attributes for thousands of different widgets). For the weighted averages, I have different weights for each widget type (i.e., I may care more about length for sprockets than for cogs).
Right now, I am storing all the attributes in a Dictionary<string, double> within each widget (the widgets have an enum that specifies their type: cog, sprocket, etc.). I then have some calculator classes that store the weights for each attribute as a Dictionary<WidgetType, Dictionary<string, double>>. To calculate the weighted average for each widget, I simply iterate through its attribute dictionary keys like:
double weightedAvg = 0.0;
foreach (string attributeName in widget.Attributes.Keys)
{
    double attributeValue = widget.Attributes[attributeName];
    double attributeWeight = calculator.Weights[widget.Type][attributeName];
    weightedAvg += (attributeValue * attributeWeight);
}
So this works fine and is pretty readable and easy to maintain, but based on some profiling it is very slow for thousands of widgets. My universe of attribute names is known and will not change during the life of the application, so I am wondering what some better options are. The few I can think of:
1) Store attribute values and weights in double[]s. I think this is probably the most efficient option, but then I need to make sure the arrays are always stored in the correct order between widgets and calculators. This also decouples the data from the metadata, so I will need to store an array (?) somewhere that maps between the attribute names and the indexes into the double[]s of attribute values and weights.
2) Store attribute values and weights in immutable structs. I like this option because I don't have to worry about the ordering and the data is "self-documenting". But is there an easy way to loop over these attributes in code? I have almost 100 attributes, so I don't want to hardcode all of them. I could use reflection, but I worry that this would incur an even larger penalty since I am looping over so many widgets and would have to use reflection on each one.
Any other alternatives?
Three possibilities come immediately to mind. The first, which I think you rejected too readily, is to have individual fields in your class. That is, individual double values named height, length, weight, cost, etc. You're right that it would be more code to do the calculations, but you wouldn't have the indirection of dictionary lookup.
Second is to ditch the dictionary in favor of an array. So rather than a Dictionary<string, double>, you'd just have a double[]. Again, I think you rejected this too quickly. You can easily replace the string dictionary keys with an enumeration. So you'd have:
enum WidgetProperty
{
    First = 0,
    Height = 0,
    Length = 1,
    Weight = 2,
    Cost = 3,
    ...
    Last = 100
}
Given that and an array of double, you can easily go through all of the values for each instance:
for (int i = (int)WidgetProperty.First; i < (int)WidgetProperty.Last; ++i)
{
    double attributeValue = widget.Attributes[i];
    double attributeWeight = calculator.Weights[widget.Type][i];
    weightedAvg += (attributeValue * attributeWeight);
}
Direct array access is going to be significantly faster than accessing a dictionary by string.
Finally, you can optimize your dictionary access a little bit. Rather than doing a foreach on the keys and then doing a dictionary lookup, do a foreach on the dictionary itself:
foreach (KeyValuePair<string, double> kvp in widget.Attributes)
{
    double attributeValue = kvp.Value;
    double attributeWeight = calculator.Weights[widget.Type][kvp.Key];
    weightedAvg += (attributeValue * attributeWeight);
}
To calculate weighted averages without looping or reflection, one way would be to calculate the weighted average of the individual attributes and store it somewhere. This should happen while you are creating the instance of the widget. Below is sample code, which would need to be adapted to your needs.
Also, for further processing of the widgets themselves, you can use data parallelism. See my other response in this thread.
public enum WidgetType { Cog, Sprocket }

public static class Calculator
{
    // Weights per widget type and attribute name, populated elsewhere
    public static Dictionary<WidgetType, Dictionary<string, double>> Weights { get; set; }
}

public static class WeightStore
{
    static Dictionary<int, double> widgetWeightedAvg = new Dictionary<int, double>();

    public static void AttWeightedAvgAvailable(double attWeightedAvg, int widgetId)
    {
        if (widgetWeightedAvg.ContainsKey(widgetId))
            widgetWeightedAvg[widgetId] += attWeightedAvg;
        else
            widgetWeightedAvg[widgetId] = attWeightedAvg;
    }
}

public class WidgetAttribute
{
    public string Name { get; private set; }
    public double Value { get; private set; }

    public WidgetAttribute(string name, double value, WidgetType type, int widgetId)
    {
        Name = name;
        Value = value;
        double attWeight = Calculator.Weights[type][name];
        WeightStore.AttWeightedAvgAvailable(Value * attWeight, widgetId);
    }
}

public class CogWidget
{
    public int Id { get; set; }
    public WidgetAttribute Height { get; set; }
    public WidgetAttribute Weight { get; set; }
}

public class Client
{
    public void BuildCogWidgets()
    {
        CogWidget widget = new CogWidget();
        widget.Id = 1;
        widget.Height = new WidgetAttribute("height", 12.22, WidgetType.Cog, widget.Id);
    }
}
As is always the case with data normalization, choosing your normalization level determines a good part of the performance. It looks like you would have to go from your current model to another model, or a mix of the two.
Better performance for your scenario is possible if you process this not on the C# side but in the database instead. You then get the benefit of indexes, no data transfer except the wanted result, plus the hundreds of thousands of man-hours already spent on performance optimization.
Use the data parallelism supported by .NET 4 and above.
https://msdn.microsoft.com/en-us/library/dd537608(v=vs.110).aspx
An excerpt from the above link:
When a parallel loop runs, the TPL partitions the data source so that the loop can operate on multiple parts concurrently. Behind the scenes, the Task Scheduler partitions the task based on system resources and workload. When possible, the scheduler redistributes work among multiple threads and processors if the workload becomes unbalanced.
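As a rough sketch of how that could look for the weighted-average loop from the question (widgets, calculator, and the Id property are assumed names for illustration, not from the original code):

using System.Collections.Concurrent;
using System.Threading.Tasks;

var results = new ConcurrentDictionary<int, double>();

Parallel.ForEach(widgets, widget =>
{
    double weightedAvg = 0.0;

    // Same inner loop as before; each widget is independent,
    // so iterations can run on different threads.
    foreach (KeyValuePair<string, double> kvp in widget.Attributes)
        weightedAvg += kvp.Value * calculator.Weights[widget.Type][kvp.Key];

    results[widget.Id] = weightedAvg; // thread-safe per-widget store
});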
I am doing some heavy computations in C# .NET, and when doing these computations in a Parallel.For loop I must collect some data in a collection, but because of limited memory I can't collect all results, so I only store the best ones.
Those computations must be as fast as possible because they are already taking too much time. So after optimizing a lot, I found out that the slowest thing was my ConcurrentDictionary collection. I am wondering if I should switch to something with faster add, remove, and find-the-highest operations (perhaps a sorted collection) and just use locks for my main operation, or whether I can do something good with a ConcurrentCollection and speed it up a little.
Here is my actual code. I know it's bad because of this huge lock, but without it I seem to lose consistency and a lot of my remove attempts fail.
public class SignalsMultiValueConcurrentDictionary : ConcurrentDictionary<double, ConcurrentBag<Signal>>
{
    public int Limit { get; set; }
    public double WorstError { get; private set; }

    public SignalsDictionaryState TryAddSignal(double key, Signal signal, out Signal removed)
    {
        SignalsDictionaryState state;
        removed = null;

        if (this.Count >= Limit && signal.AbsoluteError > WorstError)
            return SignalsDictionaryState.NoAddedNoRemoved;

        lock (this)
        {
            if (this.Count >= Limit)
            {
                ConcurrentBag<Signal> signals;
                if (TryRemove(WorstError, out signals))
                {
                    removed = signals.FirstOrDefault();
                    state = SignalsDictionaryState.AddedAndRemoved;
                }
                else
                    state = SignalsDictionaryState.AddedFailedRemoved;
            }
            else
                state = SignalsDictionaryState.AddedNoRemoved;

            this.Add(key, signal);
            WorstError = Keys.Max();
        }
        return state;
    }

    private void Add(double key, Signal value)
    {
        ConcurrentBag<Signal> values;
        if (!TryGetValue(key, out values))
        {
            values = new ConcurrentBag<Signal>();
            this[key] = values;
        }
        values.Add(value);
    }
}
Note also that because I use the absolute error of the signal, I sometimes (it should be very rare) store more than one value under one key.
The only operation used in my computations is TryAddSignal, because it does what I want: if I have more signals than the limit, it removes the signal with the highest error and adds the new signal.
Because I set the Limit property at the start of the computations, I don't need a resizable collection.
The main problem here is that even without that huge lock, Keys.Max is a little too slow. So maybe I need another collection?
Keys.Max() is the killer. That's O(N). No need for a dictionary if you do this.
You can't incrementally compute the max value because you are adding and removing. So you'd better use a data structure that is made for this; trees usually are. The BCL has SortedList, SortedSet, and SortedDictionary. SortedSet and SortedDictionary are backed by a red-black tree, and SortedSet exposes min and max operations.
Or, use a .NET collection library with a priority queue.
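As a minimal sketch of the SortedSet route (assumes .NET 4.5+ for Comparer<T>.Create; the Signal shape here is hypothetical, with a unique Id used only to break ties between equal errors; it is not thread-safe, so guard calls with a lock in parallel code):

using System.Collections.Generic;

public class Signal
{
    public long Id;               // hypothetical unique tie-breaker
    public double AbsoluteError;
}

public class BestSignals
{
    private readonly SortedSet<Signal> _set;
    private readonly int _limit;

    public BestSignals(int limit)
    {
        _limit = limit;
        _set = new SortedSet<Signal>(Comparer<Signal>.Create((a, b) =>
        {
            int c = a.AbsoluteError.CompareTo(b.AbsoluteError);
            return c != 0 ? c : a.Id.CompareTo(b.Id);
        }));
    }

    // Keeps the `limit` signals with the lowest error; Max is O(log n).
    public bool TryAdd(Signal signal, out Signal removed)
    {
        removed = null;
        if (_set.Count >= _limit)
        {
            if (signal.AbsoluteError >= _set.Max.AbsoluteError)
                return false; // worse than the current worst: reject
            removed = _set.Max;
            _set.Remove(removed);
        }
        return _set.Add(signal);
    }
}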
Bug: Add is racy. You might overwrite a non-empty collection.
The large lock statement is at least dubious. An easier improvement, given that you say Keys.Max() is slow, is to maintain the maximum value incrementally. You only need to recompute it after removing a key:
//...
if (TryRemove(WorstError, out signals))
{
    WorstError = Keys.Max(); // recompute only after a removal
//...
WorstError = Math.Max(WorstError, key); // cheap update on every add
What I did in the end was implement a heap based on a binary tree, as suggested by @usr. My final collection is not concurrent but synchronized (I used locks). I checked the performance, though, and it does the job fast enough.
Here is pseudocode:
public class SynchronizedCollectionWithMaxOnTop
{
    double Max => _items[0].AbsoluteError;

    public ItemChangeState TryAdd(Item item, out Item removed)
    {
        ItemChangeState state;
        removed = null;

        if (_items.Count >= Limit && item.AbsoluteError > Max)
            return ItemChangeState.NoAddedNoRemoved;

        lock (this)
        {
            if (_items.Count >= Limit)
            {
                removed = Remove();
                state = ItemChangeState.AddedAndRemoved;
            }
            else
                state = ItemChangeState.AddedNoRemoved;

            Insert(item);
        }
        return state;
    }

    private void Insert(Item item)
    {
        _items.Add(item);
        HeapifyUp(_items.Count - 1);
    }

    private Item Remove()
    {
        var result = new Item(_items[0]);
        var lastIndex = _items.Count - 1;
        _items[0] = _items[lastIndex];
        _items.RemoveAt(lastIndex);
        HeapifyDown(0);
        return result;
    }
}
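For completeness, here is a hedged sketch of the two helpers the pseudocode references, assuming _items is the backing List<Item> of an array-based max-heap ordered by AbsoluteError:

private void HeapifyUp(int i)
{
    while (i > 0)
    {
        int parent = (i - 1) / 2;
        if (_items[i].AbsoluteError <= _items[parent].AbsoluteError)
            break; // heap property restored
        Swap(i, parent);
        i = parent;
    }
}

private void HeapifyDown(int i)
{
    while (true)
    {
        int left = 2 * i + 1, right = 2 * i + 2, largest = i;
        if (left < _items.Count && _items[left].AbsoluteError > _items[largest].AbsoluteError)
            largest = left;
        if (right < _items.Count && _items[right].AbsoluteError > _items[largest].AbsoluteError)
            largest = right;
        if (largest == i)
            break; // neither child is larger
        Swap(i, largest);
        i = largest;
    }
}

private void Swap(int a, int b)
{
    Item tmp = _items[a];
    _items[a] = _items[b];
    _items[b] = tmp;
}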
Let's say I have a queue of integers (or any class T). Can I change the value of an element in the queue?
More specifically, if I define the queue as follow:
Queue<int> q = new Queue<int>();
Can we change the value of its elements similar to how we deal with an array? (If q were an array, we would be able to do something like q[0] = 1 to change an element.) I just would like to simplify the scenario and use int as an example, but my intention is to peek at the first item of class T in a queue, do some calculations, and update the queue for other programs to process. I do not want to dequeue it because the sequence in the queue would then no longer match the original. I hope what I am trying to do makes sense. Please advise.
If the item in the queue is a mutable type, then you can change the value of the queue's first item. Without re-creating the queue, or performing a lot of enqueues/dequeues, there is no way to change which item is at the front of the queue.
As an example of the first case, if you had a Queue<MyClass> with a definition of:
class MyClass
{
    public string Value { get; set; }
}

Queue<MyClass> queue = new Queue<MyClass>();
queue.Enqueue(new MyClass() { Value = "1" });
queue.Peek().Value = "2";
string value = queue.Peek().Value; // is "2"
You can't directly change an item in a Queue (although you can use a workaround, as Tudor suggested). But if you want to have a queue, you don't have to use Queue. Another possible type from .NET is LinkedList. It allows you to add and remove items from both ends, which can be used in your scenario:
LinkedList<int> list = new LinkedList<int>();

// enqueue an item
list.AddLast(1);

// dequeue an item
var item = list.First.Value;
list.RemoveFirst();

// put the item back at the front of the queue
list.AddFirst(item);
It seems you want to do this to process each item with several modules in sequence. But I'm not sure this is the right way to do this kind of work. A better way might be to have a queue between each pair of modules. A module would always take an item from its input queue, process it, and then put it in its output queue.
One of the advantages of this approach is greater flexibility: a module can have a different type on its output than on its input, which is not possible with the "one queue" approach (unless you resort to having a queue of objects, or something like that).
TPL Dataflow (new in .NET 4.5) uses this approach to improve performance through parallelization. It can do that because each module can process items independently of the other modules if you don't have a single central queue.
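A minimal sketch of that pipeline idea with TPL Dataflow (the blocks and processing steps here are illustrative only; requires the System.Threading.Tasks.Dataflow package):

using System;
using System.Threading.Tasks.Dataflow;

var parse = new TransformBlock<string, int>(s => int.Parse(s));
var square = new TransformBlock<int, int>(x => x * x);
var print = new ActionBlock<int>(x => Console.WriteLine(x));

// Each block has its own input queue; linking them forms the pipeline,
// and note the output type of one block can differ from its input type.
var opts = new DataflowLinkOptions { PropagateCompletion = true };
parse.LinkTo(square, opts);
square.LinkTo(print, opts);

parse.Post("3");
parse.Complete();
print.Completion.Wait(); // prints 9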
As long as you're storing a reference type, such as a class, any changes you make to it will be reflected in the Queue. The output of the code below is "2":
public class MyClass
{
    public int Value { get; set; }
}

static void Main(string[] args)
{
    Queue<MyClass> q = new Queue<MyClass>();
    q.Enqueue(new MyClass { Value = 1 });

    var i = q.Peek();
    i.Value++;

    i = q.Peek();
    Console.WriteLine(i.Value);
}
You could use a simple wrapper:
class Wrapper<T>
{
    public T Value { get; set; }
}

static void Main(string[] args)
{
    Queue<Wrapper<int>> q = new Queue<Wrapper<int>>();
    Wrapper<int> wr = new Wrapper<int> { Value = 1 };
    q.Enqueue(wr);

    Wrapper<int> wr1 = q.Peek();
    wr1.Value = 2;

    int value = q.Dequeue().Value;
    Console.WriteLine(value);
}
public static class Extensions
{
    public static Queue<T> SetFirstTo<T>(this Queue<T> q, T value)
    {
        T[] array = q.ToArray();
        array[0] = value;
        return new Queue<T>(array);
    }
}
Strictly speaking this does not mutate the Queue, so re-assignment is required.
[TestMethod]
public void Queue()
{
    var queue = new Queue<int>(new[] { 1, 2, 3, 4 });
    queue = queue.SetFirstTo(9);
    Assert.AreEqual(queue.Peek(), 9);
}
The simple answer is no; it's not part of the API of the Queue object:
http://msdn.microsoft.com/en-us/library/system.collections.queue.aspx
However, anything is possible, of course. You could write an extension method to do this, but it would have to work within the API of the object, and so dequeue/enqueue all the items along with the change whilst preserving the order.
But if you want to do this, you are treating the Queue as a List, so why not use a List?
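For instance, if indexed access is really what is needed (hypothetical values):

List<int> list = new List<int> { 1, 2, 3, 4 };
list[0] = 9;         // mutate the "front" in place
int front = list[0]; // 9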
I have a scenario in which memory conservation is paramount. I am trying to read more than 1 GB of peptide sequences into memory and group together peptide instances that share the same sequence. I am storing the Peptide objects in a HashSet so I can quickly check for duplication, but found out that you cannot access the objects in the set, even after knowing that the set contains such an object.
Memory is really important and I don't want to duplicate data if at all possible. (Otherwise I would have designed my data structure as peptides = Dictionary<string, Peptide>, but that would duplicate the string in both the dictionary and the Peptide class.) Below is code to show you what I would like to accomplish:
public class SomeClass
{
    // Main storage of all the Peptide instances; class provided below
    private HashSet<Peptide> peptides = new HashSet<Peptide>();

    public void SomeMethod(IEnumerable<string> files)
    {
        foreach (string file in files)
        {
            using (PeptideReader reader = new PeptideReader(file))
            {
                foreach (DataLine line in reader.ReadNextLine())
                {
                    Peptide testPep = new Peptide(line.Sequence);
                    if (peptides.Contains(testPep))
                    {
                        // ** Problem is here **
                        // I want to get the Peptide object that is in the HashSet
                        // so I can add the DataLine to it. I don't want to use the
                        // testPep object (even though they are considered "equal").
                        peptides[testPep].Add(line); // I know this doesn't work
                        testPep.Add(line); // THIS IS NO GOOD, since it won't be saved in the HashSet which I use in other methods
                    }
                    else
                    {
                        // The HashSet doesn't contain this peptide, so we can just add it
                        testPep.Add(line);
                        peptides.Add(testPep);
                    }
                }
            }
        }
    }
}

public class Peptide : IEquatable<Peptide>
{
    public string Sequence { get; private set; }
    private int hCode = 0;
    public PsmList PSMs { get; set; }

    public Peptide(string sequence)
    {
        Sequence = sequence.Replace('I', 'L');
        hCode = Sequence.GetHashCode();
    }

    public void Add(DataLine data)
    {
        if (PSMs == null)
        {
            PSMs = new PsmList();
        }
        PSMs.Add(data);
    }

    public override int GetHashCode()
    {
        return hCode;
    }

    public bool Equals(Peptide other)
    {
        return Sequence.Equals(other.Sequence);
    }
}

public class PsmList : List<DataLine> { /* and some other stuff that is not important */ }
Why does HashSet not let me get the object reference that is contained in the HashSet? I know people will try to say that if HashSet.Contains() returns true, your objects are equivalent. They may be equivalent in terms of values, but I need the references to be the same, since I am storing additional information in the Peptide class.
The only solution I came up with is a Dictionary<Peptide, Peptide> in which both the key and the value point to the same reference. But this seems tacky. Is there another data structure that accomplishes this?
Basically you could reimplement HashSet<T> yourself, but that's about the only solution I'm aware of. The Dictionary<Peptide, Peptide> or Dictionary<string, Peptide> solution is probably not that inefficient though - if you're only wasting a single reference per entry, I would imagine that would be relatively insignificant.
In fact, if you remove the hCode member from Peptide, that will save you 4 bytes per object, which is the same size as a reference in x86 anyway... there's no point in caching the hash as far as I can tell, as you'll only compute the hash of each object once, at least in the code you've shown.
If you're really desperate for memory, I suspect you could store the sequence considerably more efficiently than as a string. If you give us more information about what the sequence contains, we may be able to make some suggestions there.
I don't know that there's any particularly strong reason why HashSet doesn't permit this, other than that it's a relatively rare requirement - but it's something I've seen requested in Java as well...
Use a Dictionary<string, Peptide>.
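As a sketch of what that looks like inside the reading loop from the question (note that the dictionary key is the very string reference the Peptide already holds, so the cost is one extra reference per entry, not a second copy of the sequence):

Dictionary<string, Peptide> peptides = new Dictionary<string, Peptide>();

Peptide testPep = new Peptide(line.Sequence);
Peptide existing;
if (peptides.TryGetValue(testPep.Sequence, out existing))
{
    existing.Add(line); // reuse the instance already stored
}
else
{
    testPep.Add(line);
    peptides.Add(testPep.Sequence, testPep);
}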
I have a situation: I need to process an array of 20k registers every time a user presses a key. I have a grid, and while the user is typing the system shows a filtered result in the grid. So I have an array filled with all 20k registers and a list (global to the control) that is cleared every time the user presses a key and filled with just the filtered registers, which are then shown in the grid.
Here is the code.
The model:
public struct PlayerLookUpAdapter
{
    [Browsable(false)]
    public decimal Id { get; set; }

    [DisplayName("Número")]
    public String Number { get; set; }

    [DisplayName("Nombre")]
    public String Name { get; set; }

    [DisplayName("Apellido")]
    public String Surname { get; set; }

    [DisplayName("DNI")]
    public String Document { get; set; }

    [DisplayName("Estado")]
    public String Status { get; set; }
}
private PlayerLookUpAdapter[] _source; // here are the 20k registers
List<PlayerLookUpAdapter> filteredOut = new List<PlayerLookUpAdapter>(); // here the filtered ones
// this code is executed every time the user presses a key
private void tb_nro_KeyUp(object sender, KeyEventArgs e)
{
    if (!(e.KeyCode.Equals(Keys.Enter) || e.KeyCode.Equals(Keys.Down)) && _source != null)
    {
        String text = tb_nro.Text.ToUpper();
        if (String.IsNullOrEmpty(text))
        {
            fg.DataSource = _source;
            fg.Refresh();
            return;
        }

        fg.DataSource = null;
        filteredOut.Clear();

        int length = _source.Length;
        for (int i = 0; i < length; i++)
        {
            PlayerLookUpAdapter cur = _source[i];
            if (cur.Number.ToUpper().StartsWith(text) || cur.Surname.ToUpper().StartsWith(text) || cur.Name.ToUpper().StartsWith(text))
                filteredOut.Add(cur);
        }

        fg.DataSource = filteredOut;
        SetGridColumnsProperties();
        fg.Refresh();
    }
    else
    {
        fg.Focus();
    }
}
Is this a good solution in terms of memory usage and performance? Do you have any advice? How can I gain more speed? It works really well, but what about if I had 100k registers instead of 20k?
Thanks in advance.
I think this should be a prime example for using a tree.
If you lay your data out in a tree (I actually don't know whether C#/.NET ships a suitable tree data structure, or whether you have to get your own hands dirty), the speed of searching will improve compared to searching an array: a balanced tree can be searched in O(log n) rather than the O(n) of a linear scan.
The theory is easy: when a user types in a literal, the tree goes to the node starting with this literal; on this node are all possible further nodes, and so on. For example: the user types in a "t", so you go to the "t" node; then he types in an "e", so you go to the subnode "te"; there are further subnodes like "test", and the system can propose these subnodes to the user.
First of all you could improve your code a bit: the StartsWith method has an overload that takes a StringComparison as well. You could pass StringComparison.OrdinalIgnoreCase to avoid upper-casing all the strings, but I don't think you will gain a lot.
The only other way to speed up your search is to go for a search engine such as Lucene.Net.
http://www.codeproject.com/KB/library/IntroducingLucene.aspx
You want a prefix tree for this.
Here is one implementation:
A Reusable Prefix Tree using Generics in C# 2.0
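In case it helps, here is a minimal generic sketch of such a prefix tree (a simplified design, not the one from the linked article: it stores the matching records at every node along the insertion path, trading memory for O(prefix length) lookups):

using System.Collections.Generic;

class TrieNode<T>
{
    public Dictionary<char, TrieNode<T>> Children = new Dictionary<char, TrieNode<T>>();
    public List<T> Matches = new List<T>();
}

class Trie<T>
{
    private readonly TrieNode<T> _root = new TrieNode<T>();

    public void Add(string key, T value)
    {
        var node = _root;
        foreach (char c in key.ToUpperInvariant())
        {
            TrieNode<T> child;
            if (!node.Children.TryGetValue(c, out child))
                node.Children[c] = child = new TrieNode<T>();
            node = child;
            node.Matches.Add(value); // every prefix of key leads here
        }
    }

    public List<T> Find(string prefix)
    {
        var node = _root;
        foreach (char c in prefix.ToUpperInvariant())
        {
            TrieNode<T> child;
            if (!node.Children.TryGetValue(c, out child))
                return new List<T>(); // no register has this prefix
            node = child;
        }
        return node.Matches;
    }
}

Storing the record list at every node multiplies memory by the key length, so this is a deliberate space-for-speed trade; the leaner alternative is to store records only at terminal nodes and walk the subtree on lookup.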
You could probably use the StringComparison.OrdinalIgnoreCase option on your string comparisons and avoid having to call ToUpper on all your strings 20k times.
Ideally, first you need to decide how slow is too slow, based on your best estimates of typical usage of your program. After all, premature optimisation is the root of all evil.
Precalculate the ToUpper() call so you don't have to do it every time: you could maintain a second list in which all the strings are stored in uppercase.
Secondly, you should search the filtered list (instead of the whole list) whenever a character is appended to the search string. The new (longer) string can never match anything outside of the previous filtered results; see the sketch below.
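A sketch of that second idea (illustrative names; assumes the _source array from the question and System.Linq):

private string _lastText = "";
private List<PlayerLookUpAdapter> _lastResult = new List<PlayerLookUpAdapter>();

private List<PlayerLookUpAdapter> Filter(string text)
{
    // If the new text merely extends the previous one, the matches must
    // be a subset of the previous result, so search that instead.
    IEnumerable<PlayerLookUpAdapter> candidates =
        _lastText.Length > 0 && text.StartsWith(_lastText, StringComparison.OrdinalIgnoreCase)
            ? (IEnumerable<PlayerLookUpAdapter>)_lastResult
            : _source;

    List<PlayerLookUpAdapter> result = candidates.Where(p =>
        p.Number.StartsWith(text, StringComparison.OrdinalIgnoreCase) ||
        p.Surname.StartsWith(text, StringComparison.OrdinalIgnoreCase) ||
        p.Name.StartsWith(text, StringComparison.OrdinalIgnoreCase)).ToList();

    _lastText = text;
    _lastResult = result;
    return result;
}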