I have a data structure of phone calls. For this question there are two fields, CallTime and NumberDialled.
The analysis I want to perform is: "Are there more than two calls to the same number in a 10-second window?" The collection is already sorted by CallTime and is a List<Cdr>.
My solution is
List<Cdr> records = GetRecordsSortedByCallTime();
for (int i = 0; i < records.Count; i++)
{
    var baseRecord = records[i];
    for (int j = i + 1; j < records.Count; j++)
    {
        var comparisonRec = records[j];
        if (comparisonRec.CallTime.Subtract(baseRecord.CallTime).TotalSeconds < 10)
        {
            if (comparisonRec.NumberDialled == baseRecord.NumberDialled)
                ReportProblem(baseRecord, comparisonRec);
        }
        else
        {
            // We're more than 10 seconds away from the base record. Break out of the inner loop.
            break;
        }
    }
}
Which is ugly, to say the least. Is there a better, cleaner and faster way of doing this?
Although I haven't tested this on a large data set, I will be running it on about 100,000 records per hour so there will be a large number of comparisons for each record.
Update: The data is sorted by time, not by number as in an earlier version of the question.
If the phone calls are already sorted by call time, you can do the following:
Initialize a hash table that has a counter for every phone number (the hash table can be first empty and you add elements to it as you go)
Have two pointers into your list; let's call them 'left' and 'right'
Whenever the timestamp between the 'left' and 'right' call is less than 10 seconds, move 'right' forwards by one, and increment the count of the newly encountered phone number by one
Whenever the difference is above 10 seconds, move 'left' forwards by one and decrement the count for the phone number from which 'left' pointer left by one
At any point, if there is a phone number whose counter in the hash table is 3 or more, you have found a phone number that has more than 2 calls within a 10 seconds window
This is a linear-time algorithm that handles all the numbers at once, in a single pass (sketched below).
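A minimal sketch of that sliding window, assuming the asker's Cdr class with a DateTime CallTime and a string NumberDialled, and that the list is already sorted by CallTime:

void FindRepeatedCalls(List<Cdr> records)
{
    var counts = new Dictionary<string, int>(); // calls per number inside the current window
    int left = 0;
    for (int right = 0; right < records.Count; right++)
    {
        // Shrink the window from the left until it spans at most 10 seconds.
        while ((records[right].CallTime - records[left].CallTime).TotalSeconds > 10)
        {
            counts[records[left].NumberDialled]--;
            left++;
        }

        // Count the call entering on the right edge of the window.
        string number = records[right].NumberDialled;
        counts.TryGetValue(number, out int current);
        counts[number] = current + 1;

        // Three or more calls to the same number within the 10-second window.
        if (counts[number] >= 3)
            Console.WriteLine("{0} called {1} times within 10 seconds", number, counts[number]);
        // In the real code this is where you'd call ReportProblem with the offending records.
    }
}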
I didn't know your exact structures, so I created my own for this demonstration:
class CallRecord
{
public long NumberDialled { get; set; }
public DateTime Stamp { get; set; }
}
class Program
{
static void Main(string[] args)
{
var calls = new List<CallRecord>()
{
new CallRecord { NumberDialled=123, Stamp=new DateTime(2011,01,01,10,10,0) },
new CallRecord { NumberDialled=123, Stamp=new DateTime(2011,01,01,10,10,9) },
new CallRecord { NumberDialled=123, Stamp=new DateTime(2011,01,01,10,10,18) },
};
var dupCalls = calls.Where(x => calls.Any(y => y.NumberDialled == x.NumberDialled && (x.Stamp - y.Stamp).TotalSeconds > 0 && (x.Stamp - y.Stamp).TotalSeconds <= 10)).Select(x => x.NumberDialled).Distinct();
foreach (var dupCall in dupCalls)
{
Console.WriteLine(dupCall);
}
Console.ReadKey();
}
}
The LINQ expression loops through all records and finds records which are ahead of the current record (.TotalSeconds > 0) and within the time limit (.TotalSeconds <= 10). This might be a bit of a performance hog because the Any method keeps scanning your whole list, but at least the code is cleaner :)
I recommend you use the Reactive Extensions (Rx) and the Interval method.
The Reactive Extensions (Rx) is a library for composing asynchronous and event-based programs using observable sequences and LINQ-style query operators. Using Rx, developers represent asynchronous data streams with Observables, query asynchronous data streams using LINQ operators, and parameterize the concurrency in the asynchronous data streams using Schedulers
The Interval method returns an observable sequence that produces a value after each period.
Here is a quick example:
var callsPer10Seconds = Observable.Interval(TimeSpan.FromSeconds(10));
var q = from x in callsPer10Seconds
group x by x into g
let count = g.Count()
orderby count descending
select new {Value = g.Key, Count = count};
foreach (var x in q)
{
Console.WriteLine("Value: " + x.Value + " Count: " + x.Count);
}
records.OrderBy(p => p.CallTime)
.GroupBy(p => p.NumberDialled)
.Select(p => new { number = p.Key, cdr = p.ToList() })
.Select(p => new
{
number = p.number,
cdr =
p.cdr.Select((value, index) => index == 0 ? null : (TimeSpan?)(value.CallTime - p.cdr[index - 1].CallTime))
.FirstOrDefault(q => q.HasValue && q.Value.TotalSeconds < 10)
}).Where(p => p.cdr != null);
In two steps:
Generate an enumeration with the call itself and all calls in the interesting span
Filter this list to find consecutive calls
The computation is done in parallel on each record using the AsParallel extension method.
It is also possible to skip the ToArray call at the end and let the computation happen lazily, so other code can execute on the thread instead of being forced to wait for the parallel computation to finish.
var records = new [] {
new { CallTime= DateTime.Now, NumberDialled = 1 },
new { CallTime= DateTime.Now.AddSeconds(1), NumberDialled = 1 }
};
var span = TimeSpan.FromSeconds(10);
// Select for each call itself and all other calls in the next 'span' seconds
var callInfos = records.AsParallel()
.Select((r, i) =>
new
{
Record = r,
Following = records.Skip(i+1)
.TakeWhile(r2 => r2.CallTime - r.CallTime < span)
}
);
// Filter the calls that interest us
var problematic = (from callinfo in callInfos
where callinfo.Following.Any(r => callinfo.Record.NumberDialled == r.NumberDialled)
select callinfo.Record)
.ToArray();
If performance is acceptable (which I think it should be, since 100k records is not particularly many), this approach is (I think) nice and clean:
First we group up the records by number:
var byNumber =
from cdr in calls
group cdr by cdr.NumberDialled into g
select new
{
NumberDialled = g.Key,
Calls = g.OrderBy(cdr => cdr.CallTime)
};
What we do now is Zip (.NET 4) each calls collection with itself-shifted-by-one, to transform the list of call times into a list of gaps between calls. We then look for numbers where there's a gap of at most 10 seconds:
var interestingNumbers =
from g in byNumber
let callGaps = g.Calls.Zip(g.Calls.Skip(1),
(cdr1, cdr2) => cdr2.CallTime - cdr1.CallTime)
where callGaps.Any(ts => ts.TotalSeconds <= 10)
select g.NumberDialled;
Now interestingNumbers is a sequence of the numbers of interest.
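For example, to list them (a trivial usage sketch):

foreach (var number in interestingNumbers)
    Console.WriteLine("Suspicious number: {0}", number);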
I have an issue when using Merge with shared observables. In my project I have several streams that load different data, which must be added in a certain order. So I've made a simple example to find a solution. The issue is that the 2nd merged observable won't get the emitted values, for obvious reasons.
If I remove the Share operator everything is fine, but in that case the root observable executes 2 times.
So another option is to add the Replay operator after Share. But then I have to use Connect somewhere. Unfortunately, in my project the observable is just a small part of a huge loading chain.
And that's where I got stuck.
The following code shows what the problem is. The observableFlatMap variable doesn't emit anything, because every value that sharedObservable emits goes through observableNotEven, and observableFlatMap connects only after every integer has already been emitted.
using System.Linq;
using UniRx;
using UniRx.Diagnostics;
using UnityEngine;
public class Share : MonoBehaviour
{
void Start()
{
PrintNumbers();
}
private void PrintNumbers()
{
System.IObservable<int> sharedObservable = GetObservableInts();
var observableEven = sharedObservable
.Where(x => x % 2 == 0)
.Debug("Even");
var observableNotEven = sharedObservable
.Where(x => x % 2 == 1)
.Debug("NotEven");
var observableFlatMap = observableEven
.Select(x => x * 10);
_ = Observable.Merge(observableNotEven, observableFlatMap)
.Subscribe(_number => Debug.Log(_number))
.AddTo(this);
}
private static System.IObservable<int> GetObservableInts()
{
var count = 10;
var arrayInt = new int[count];
for (int id = 0; id != count; ++id)
arrayInt[id] = id;
var sharedObservable = arrayInt.ToObservable()
.Debug("Array")
.Share();
return sharedObservable;
}
}
I found two similar solutions:
Use the .DelayFrame/Delay or .DelayFrameSubscription/DelaySubscription operators before .Share.
A feature of .Share is that it starts emitting values after the first subscription and keeps emitting until the last subscriber unsubscribes. In my case, after the first subscriber (observableNotEven) arrives, .Share emits every value of the integer array before the next observable (observableFlatMap) is connected (merged) into the overall observable sequence.
var sharedObservable = arrayInt.ToObservable()
.Debug("Array")
.DelayFrameSubscription(1)
.Share();
Update
The answer to my question is to use Publish, even if you already have a shared observable.
private void PrintNumbers()
{
IConnectableObservable<int> sharedObservable = GetObservableInts().Publish();
var observableEven = sharedObservable
.Where(x => x % 2 == 0)
.Debug("Even");
var observableNotEven = sharedObservable
.Where(x => x % 2 == 1)
.Debug("NotEven");
var observableFlatMap = observableEven
.Select(x => x * 10);
_ = Observable.Merge(observableNotEven, observableFlatMap)
.Subscribe(_number => Debug.Log(_number))
.AddTo(this);
// After the observable chain is built, you should call Connect
sharedObservable.Connect().AddTo(this);
}
I have the below snippet, which takes a long time to run as the data increases.
OrderEntityColection is a List of order entities and samplePriceList is a List of prices.
OrderEntityColection = 30k trades
samplePriceList = 1 million prices
It easily takes 10-15 minutes or more to finish.
I have tested this with 1,500 orders and 300k prices, but it still takes around 40-50 seconds, and as the orders increase so do the prices, so it takes even longer.
Can you see how I can improve this? I have already cut it down to these numbers beforehand from a much bigger set.
MarketId = int
Audit = string
foreach (var tradeEntity in OrderEntityColection)
{
Parallel.ForEach(samplePriceList,new ParallelOptions {MaxDegreeOfParallelism = 8}, (price) =>
{
if (price.MarketId == tradeEntity.MarketId)
{
if (tradeEntity.InstructionPriceAuditId == price.Audit)
{
// OrderExportColection.Enqueue(tradeEntity);
count++;
}
}
});
}
So you want to do the work in memory. OK, then you need to be smart about the way you shape the data up front. The first thing is that you're looking prices up by MarketId, so create that lookup first:
var pricesLookupByMarketId = samplePriceList
    .GroupBy(p => p.MarketId)
    .ToDictionary(
        g => g.Key,
        g => g.ToDictionary(p => p.Audit));
Now you have a Dictionary<int, Dictionary<string, Price>> (MarketId is an int and Audit is a string, as stated in the question; if they're other types it still works the same way).
Now your code becomes super simple and a lot faster
foreach (var tradeEntity in OrderEntityColection)
{
if(pricesLookupByMarketId.ContainsKey(tradeEntity.MarketId)
&& pricesLookupByMarketId[tradeEntity.MarketId].ContainsKey(tradeEntity.InstructionPriceAuditId))
{
count++;
}
}
Or, if you're a fan of one long line:
var count = OrderEntityColection.Count(tradeEntity => pricesLookupByMarketId.ContainsKey(tradeEntity.MarketId)
&& pricesLookupByMarketId[tradeEntity.MarketId].ContainsKey(tradeEntity.InstructionPriceAuditId));
As pointed out in the comments, this can be further optimized to stop repeated reads of the dictionaries - but the exact implementation depends on how you want to use this data in the end.
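For instance, a minimal sketch of that optimization using TryGetValue, assuming the lookup built above:

int count = 0;
foreach (var tradeEntity in OrderEntityColection)
{
    // One dictionary read per trade instead of a ContainsKey check followed by an indexer read.
    if (pricesLookupByMarketId.TryGetValue(tradeEntity.MarketId, out var auditLookup)
        && auditLookup.ContainsKey(tradeEntity.InstructionPriceAuditId))
    {
        count++;
    }
}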
In the parallel loop you have cases where you skip the processing for certain items. That's quite expensive, as you rely on that check also happening on a separate thread. I'd just filter out the non-matching items before processing, as follows:
foreach (var tradeEntity in OrderEntityColection)
{
Parallel.ForEach(samplePriceList.Where(item=>item.MarketId == tradeEntity.MarketId && item.Audit == tradeEntity.InstructionPriceAuditId) ,new ParallelOptions {MaxDegreeOfParallelism = 8}, (price) =>
{
// Do whatever processing is required here
Interlocked.Increment(ref count);
});
}
On a side note, it seems like you need to replace count++ with Interlocked.Increment(ref count) to be thread safe.
Managed to do this with the help of my friend:
var samplePriceList = PriceCollection.GroupBy(priceEntity=> priceEntity.MarketId).ToDictionary(g=> g.Key,g=> g.ToList());
foreach (var tradeEntity in OrderEntityColection)
{
var price = samplePriceList[tradeEntity.MarketId].FirstOrDefault(obj => obj.Audit == tradeEntity.InstructionPriceAuditId);
if (price != null)
{
count+=1;
}
}
I would like to do something like this (below) but not sure if there is a formal/optimized syntax to do so?
.OrderBy(i => i.Value1)
.Take("Bottom 100 & Top 100")
.OrderBy(i => i.Value2);
basically, I want to sort by one variable, then take the top 100 and bottom 100, and then sort those results by another variable.
Any suggestions?
var sorted = list.OrderBy(i => i.Value1);
var top100 = sorted.Take(100);
var last100 = sorted.Reverse().Take(100);
var result = top100.Concat(last100).OrderBy(i => i.Value2);
I don't know if you want Concat or Union at the end. Concat will combine all entries of both lists even if there are duplicate entries, which would be the case if your original list contains fewer than 200 entries. Union would only add items from last100 that are not already in top100.
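A tiny illustration of the difference, using made-up values:

var a = new[] { 1, 2, 3 };
var b = new[] { 3, 4 };

var concat = a.Concat(b); // 1, 2, 3, 3, 4  (duplicates kept)
var union = a.Union(b);   // 1, 2, 3, 4     (duplicates removed)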
Some things that are not clear but that should be considered:
If list is an IQueryable to a db, it probably is advisable to use ToArray() or ToList(), e.g.
var sorted = list.OrderBy(i => i.Value).ToArray();
at the beginning. This way only one query to the database is done while the rest is done in memory.
The Reverse method is not optimized the way I hoped for, but it shouldn't be a problem, since ordering the list is the real work here. For the record, though, the Skip approach explained in other answers here is probably a little faster, but it needs to know the number of elements in list.
If list were a LinkedList or another class implementing IList, Reverse could be done in an optimized way.
You can use an extension method like this:
public static IEnumerable<T> TakeFirstAndLast<T>(this IEnumerable<T> source, int count)
{
var first = new List<T>();
var last = new LinkedList<T>();
foreach (var item in source)
{
if (first.Count < count)
first.Add(item);
if (last.Count >= count)
last.RemoveFirst();
last.AddLast(item);
}
return first.Concat(last);
}
(I'm using a LinkedList<T> for last because it can remove items in O(1))
You can use it like this:
.OrderBy(i => i.Value1)
.TakeFirstAndLast(100)
.OrderBy(i => i.Value2);
Note that it doesn't handle the case where there are fewer than 200 items: if that's the case, you will get duplicates. You can remove them using Distinct if necessary.
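For instance, the same pipeline with the duplicates removed (a sketch):

.OrderBy(i => i.Value1)
.TakeFirstAndLast(100)
.Distinct()
.OrderBy(i => i.Value2);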
Take the top 100 and bottom 100 separately and union them:
var tempresults = yourenumerable.OrderBy(i => i.Value1);
var results = tempresults.Take(100);
results = results.Union(tempresults.Skip(tempresults.Count() - 100).Take(100))
.OrderBy(i => i.Value2);
You can do it with in one statement also using this .Where overload, if you have the number of elements available:
var elements = ...
var count = elements.Length; // or .Count for list
var result = elements
.OrderBy(i => i.Value1)
.Where((v, i) => i < 100 || i >= count - 100)
.OrderBy(i => i.Value2)
.ToArray(); // evaluate
Here's how it works:
| first 100 elements | middle elements        | last 100 elements |
|      i < 100       | 100 <= i < count - 100 | i >= count - 100  |
You can write your own extension method, like Take(), Skip() and the other methods from the Enumerable class. It takes the number of elements and the total length of the list as input, and then returns the first and last N elements of the sequence.
var result = yourList.OrderBy(x => x.Value1)
.GetLastAndFirst(100, yourList.Length)
.OrderBy(x => x.Value2)
.ToList();
Here is the extension method:
public static class SOExtensions
{
public static IEnumerable<T> GetLastAndFirst<T>(
this IEnumerable<T> seq, int number, int totalLength
)
{
if (totalLength < number*2)
throw new Exception("List length must be >= (number * 2)");
using (var en = seq.GetEnumerator())
{
int i = 0;
while (en.MoveNext())
{
i++;
if (i <= number || i >= totalLength - number)
yield return en.Current;
}
}
}
}
I have this code below that calls a service:
InstanceCollection instances = this.MyService(typeID, referencesIDs);
My problem here is that when referencesIDs.Count() is greater than a specific count, it throws an error related to SQL.
It was suggested to me to call this.MyService multiple times so that it won't process as many referencesIDs at once.
What is the way to do that? I am thinking of using a while loop like this:
while (referencesIDs.Count() != maxCount)
{
newReferencesIDs = referencesIDs.Take(500).ToArray();
instances = this.MyService(typeID, newReferencesIDs);
maxCount += newReferencesIDs.Count();
}
The problem that I can see here is: how can I remove the first 500 referencesIDs after each call? Because if I don't remove the first 500 after the first loop iteration, it will keep processing the same referencesIDs.
Are you just looking to update the referencesIDs value? Something like this?:
referencesIDs = referencesIDs.Skip(500);
Then the next time you call .Take(500) on referencesIDs it'll get the next 500 values.
Alternatively, without updating the referencesIDs variable, you can include the Skip in your loop. Something like this:
var pageSize = 500;
var skipCount = 0;
while(...)
{
newReferencesIDs = referencesIDs.Skip(skipCount).Take(pageSize).ToArray();
skipCount += pageSize;
...
}
My first choice would be to fix the service, if you have access to it. A SQL-specific error could be a result of an incomplete database configuration, or a poorly written SQL query on the server. For example, Oracle limits IN lists in SQL queries to about 1000 items by default, but your Oracle DBA should be able to re-configure this limit for you. Alternatively, server side programmers could rewrite their query to avoid hitting this limit in the first place.
If this does not work, you could split your list into blocks of a maximum size that does not trigger the error, make multiple calls to the server, and combine the instances on your end, like this:
InstanceCollection instances = referencesIDs
.Select((id, index) => new {Id = id, Index = index})
.GroupBy(p => p.Index / 500) // 500 is the max number of IDs
.SelectMany(g => this.MyService(typeID, g.Select(item => item.Id).ToArray()))
.ToList();
If you want a general way of splitting lists into chunks, you can use something like:
/// <summary>
/// Split a source IEnumerable into smaller (more manageable) lists.
/// </summary>
public static IEnumerable<IList<TSource>>
SplitIntoChunks<TSource>(this IEnumerable<TSource> source, int chunkSize)
{
long i = 1;
var list = new List<TSource>();
foreach (var t in source)
{
list.Add(t);
if (i++ % chunkSize == 0)
{
yield return list;
list = new List<TSource>();
}
}
if (list.Count > 0)
yield return list;
}
And then you can use SelectMany to flatten results:
InstanceCollection instances = referencesIDs
.SplitIntoChunks(500)
.SelectMany(chunk => MyService(typeID, chunk))
.ToList();
I have a loop like the following, can I do the same using multiple SUM?
foreach (var detail in ArticleLedgerEntries.Where(pd => pd.LedgerEntryType == LedgerEntryTypeTypes.Unload &&
pd.InventoryType == InventoryTypes.Finished))
{
weight += detail.GrossWeight;
length += detail.Length;
items += detail.NrDistaff;
}
Technically speaking, what you have is probably the most efficient way to do what you are asking. However, you could create an extension method on IEnumerable<T> called Each that might make it simpler:
public static class EnumerableExtensions
{
public static void Each<T>(this IEnumerable<T> col, Action<T> itemWorker)
{
foreach (var item in col)
{
itemWorker(item);
}
}
}
And call it like so:
// Declare variables in parent scope
double weight = 0;
double length = 0;
int items = 0;
ArticleLedgerEntries
.Where(
pd =>
pd.LedgerEntryType == LedgerEntryTypeTypes.Unload &&
pd.InventoryType == InventoryTypes.Finished
)
.Each(
pd =>
{
// Close around variables defined in parent scope
weight += pd.GrossWeight;
length += pd.Length;
items += pd.NrDistaff;
}
);
UPDATE:
Just one additional note. The above example relies on a closure. The variables weight, length, and items should be declared in a parent scope, allowing them to persist beyond each call to the itemWorker action. I've updated the example to reflect this for clarity's sake.
You can call Sum three times, but it will be slower because it will make three loops.
For example:
var list = ArticleLedgerEntries.Where(pd => pd.LedgerEntryType == LedgerEntryTypeTypes.Unload
&& pd.InventoryType == InventoryTypes.Finished);
var totalWeight = list.Sum(pd => pd.GrossWeight);
var totalLength = list.Sum(pd => pd.Length);
var items = list.Sum(pd => pd.NrDistaff);
Because of delayed execution, it will also re-evaluate the Where call every time, although that's not such an issue in your case. This could be avoided by calling ToArray, but that will cause an array allocation. (And it would still run three loops)
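For illustration, a sketch of that ToArray variant, using the same query as above:

var list = ArticleLedgerEntries
    .Where(pd => pd.LedgerEntryType == LedgerEntryTypeTypes.Unload
              && pd.InventoryType == InventoryTypes.Finished)
    .ToArray(); // the filter runs once; the three Sum calls below iterate the cached array

var totalWeight = list.Sum(pd => pd.GrossWeight);
var totalLength = list.Sum(pd => pd.Length);
var items = list.Sum(pd => pd.NrDistaff);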
However, unless you have a very large number of entries or are running this code in a tight loop, you don't need to worry about performance.
EDIT: If you really want to use LINQ, you could misuse Aggregate, like this:
int totalWeight = 0, totalLength = 0, items = 0;
list.Aggregate(0, (acc, pd) => {
    totalWeight += pd.GrossWeight;
    totalLength += pd.Length;
    items += pd.NrDistaff;
    return acc;
});
This is phenomenally ugly code, but should perform almost as well as a straight loop.
You could also sum in the accumulator (see example below), but this would allocate a temporary object for every item in your list, which is a dumb idea. (Anonymous types are immutable.)
var totals = list.Aggregate(
new { Weight = 0, Length = 0, Items = 0},
(t, pd) => new {
Weight = t.Weight + pd.GrossWeight,
Length = t.Length + pd.Length,
Items = t.Items + pd.NrDistaff
}
);
You could also group everything by a constant (1), which effectively puts all the items into a single group that can then be counted or summed:
var results = from x in ArticleLedgerEntries
group x by 1
into aggregatedTable
select new
{
SumOfWeight = aggregatedTable.Sum(y => y.GrossWeight),
SumOfLength = aggregatedTable.Sum(y => y.Length),
SumOfNrDistaff = aggregatedTable.Sum(y => y.NrDistaff)
};
As far as running time goes, it is almost as good as the loop (with a constant overhead).
You'd be able to do this pivot-style, using the answer in this topic: Is it possible to Pivot data using LINQ?
OK. I realize that there isn't an easy way to do this using LINQ. I'll keep my foreach loop, because I understand that it isn't so bad. Thanks to all of you.