How to determine if threads are being blocked by PLINQ Query? - c#

I am trying to find a Levenshtien match between two datasets. I try to find the edit distances of a ListA which contains thousands of values against another ListB which contains millions of records.
What I have achieved so far is I use a LINQ Projection to get a cross product of batches of values from sourceDataBatch(500 batchsize) and TargetDataBatch(500000 batchsize) and then run a parallel query to get the levenshtien product. I have a 8 core processor and the query works fine when I process a 1.5K List with 4.3 million records. But if I increase the List size to 2K it stops altogether. I suppose the threads get blocked after processing 1.5K records.
Please advise what/s happening in this scenario.
This is my code:
var pairs = (from wordToMatch in currentSourceDataBatch
from similarWord in currentTargetDataBatch//cross product
select new LevenshtienInput { WordToMatch = wordToMatch, SimilarWord = similarWord });
var matches = pairs.
AsParallel().
Where(pair => IsLevenshteinMatch(pair, threshold)).
ToList();
Here is my Levenshtien Method:
private static bool IsLevenshteinMatch(LevenshtienInput pair, double threshold)
{
int leven = Levenshtein.Distance(pair.WordToMatch, pair.SimilarWord);
int length = Math.Max(pair.WordToMatch.Length, pair.SimilarWord.Length);
double similarity = 1.0 - (double)leven / length;
if (similarity >= threshold)
{
pair.Similarity = similarity;
return true;
}
return false;
}

Related

How to find largest property in List?

I have linq expression "Where" that may returns several rows:
var checkedPrices = prices.Where(...).ToList();
As there are several rows, retrieves from db => i want to take the largest string from this list of rows.
Also there is a case when one of the fields may have same lenght, so i tried to find the largest from another field.
int countPrices = checkedPrices.Count();
if (checkedPrices == 0)
{
checkedPrices = null;
}
else if (checkedPrices == 1)
{
checkedPrices = checkedPrices.Take(1).ToList();
}
else if (countFixedPrices > 1)
{
var maxPrices1 = checkedPrices.Max(i => i.Field1.Length);
if (maxPrices1 > 1)
{
var maxPrices2 = checkedPrices.Max(i => i.Field2.Length);
checkedPrices = checkedPrices.IndexOf(maxPrices2 );
}
checkedPrices = checkedPrices .ElementAt(maxPrices2);
}
So, i have an issue in the last "else if".
My logic was to find the max largest of Field1.
If there is the only one largest field - rewrite it to the "Where" expression (checkedPrices).
If there is not only one max largest of Field1 => take the largest from Field2.
The problem of mine is i'm confused how could i take the row data from the largest Field1/Field2.
This part of code is ridiculously bad(doesnt even compile):
if (maxPrices1 > 1)
{
var maxPrices2 = checkedPrices.Max(i => i.Field2.Length);
checkedPrices = checkedPrices.IndexOf(maxPrices2 );
}
checkedPrices = checkedPrices .ElementAt(maxPrices2);
Since it seems you need only one price I would recommend just write correct query to fetch it only. You can order items (with LINQ's OrderByDescending and ThenByDescending) and the take the top one:
var checkedPrice = prices
.Where(...)
.OrderByDescending(c => c.Field1.Length)
.ThenByDescending(c => c.Field2.Length)
.FirstOrDefault();
P.S:
For LINQ-to-Objects this solution can be inefficient for large datasets after filtering due to sorting being O(n * log(n)) operation while finding maximum element is O(n) task.
There can be implementation depended LINQ optimizations for some of cases like combination of OrderBy(Descending) with some overloads of operators like First(OrDefault) and possibly Skip/Take (see one, two, three).

LINQ statements to do a nested list of DynamicTableEntities in chunks of 100

I am attempting to create batches of 100 records i can delete from Azure Table Storage. I found a great article on efficiently creating batches to delete table records here: https://blog.bitscry.com/2019/03/25/efficiently-deleting-rows-from-azure-table-storage/
and have followed this along. the issue i am facing that is different from the example in this blog post is that my deletes will have different partition keys. so rather than simply splitting my results into batches of 100 (as it does in the example) i first need to split them into groups of like partition keys, and THEN examine those lists, and further sub-divide them if the count is greater than 100 (as Azure recommends only batches of 100 records at a time, and they all require the same partition key)
Let me say i am TERRIBLE with enumerable LINQ and the non-query style that is described in this blog post so i'm a bit lost. i have written a small work around that does create these batches by the partition ID, and the code works to delete them, i just am not handling the possibility that there may be more than 100 rows to delete based on the partition key. So the code below is just used as an example to show you how i approached splitting the updates by partition key.
List<string> partitionKeys = toDeleteEntities.Select(x => x.PartitionKey).Distinct().ToList();
List<List<DynamicTableEntity>> chunks = new List<List<DynamicTableEntity>>();
for (int i = 0; i < partitionKeys.Count; ++i)
{
var count = toDeleteEntities.Where(x => x.PartitionKey == partitionKeys[i]).Count();
//still need to figure how to split by groups of 100.
chunks.Add(toDeleteEntities.Distinct().Where(x=>x.PartitionKey == partitionKeys[i]).ToList());
}
i have tried to do multiple groupby statements in a linq function similar to this
// Split into chunks of 100 for batching
List<List<TableEntity>> rowsChunked = tableQueryResult.Result.Select((x, index) => new { Index = index, Value = x })
.Where(x => x.Value != null)
.GroupBy(x => x.Index / 100)
.Select(x => x.Select(v => v.Value).ToList())
.ToList();
but once i add a second set parameters to group by (eg: x=>x.PartitionKey) then my select below starts to go pear shaped. The end result object is a LIST of LISTS that contain DyanmicTableEntities and an index
[0]
[0][Entity]
[1][Entity]
...
[99][Entity]
[1]
[0][Entity]
[1][Entity]
...
i hope this makes sense, if not please feel free to ask for clarification.
Thanks in advance.
EDIT FOR CLARIFICATION:
The idea is simply that i want to group by PARTITION Key AND only take 100 rows before creating another row of the SAME partition key and adding the rest of the rows
thanks,
One useful LINQ method you can use is GroupBy(keySelector). It basically divides your collection into groups based on a selector. So in your case, you'd probably want to group by PartitionKey:
var partitionGroups = toDeleteEntities.GroupBy(d => d.PartitionKey);
When you iterate through this collection, you'll get an IGrouping. Finally, to get the correct batch, you can use Skip(int count) and Take(int count)
foreach (var partitionGroup in partitionGroups)
{
var partitionKey = partitionGroup.Key;
int startPosition = 0;
int count = partitionGroup.Count();
while(count > 0)
{
int batchSize = count % maxBatchSize > 0 ? count % maxBatchSize : maxBatchSize;
var partitionBatch = partitionGroup.Skip(startPosition).Take(batchSize);
// process your batches here
chunks.Add(new List<DynamicTableEntry>(partitionBatch));
startPosition += batchSize;
count = count - batchSize;
}
}

Take specific number of array first then process

I have this code below that:
InstanceCollection instances = this.MyService(typeID, referencesIDs);
My problem here is when the referencesIDs.Count() is greater than a specific count, it throws an error which is related to SQL.
Suggested to me is to call the this.MyService multiple times so it won't process many referencesIDs.
What is the way to do that? I am thinking of using a while loop like this:
while (referencesIDs.Count() != maxCount)
{
newReferencesIDs = referencesIDs.Take(500).ToArray();
instances = this.MyService(typeID, newReferencesIDs);
maxCount += newReferencesIDs.Count();
}
The problem that I can see here is that how can I remove the first 500 referencesIDs on the newReferencesIDs? Because if I won't remove the first 500 after the first loop, it will continue to add the referencesIDs.
Are you just looking to update the referencesIDs value? Something like this?:
referencesIDs = referencesIDs.Skip(500);
Then the next time you call .Take(500) on referencesIDs it'll get the next 500 values.
Conversely, without updating the referencesIDs variable, you can include the Skip in your loop. Something like this:
var pageSize = 500;
var skipCount = 0;
while(...)
{
newReferencesIDs = referencesIDs.Skip(skipCount).Take(pageSize).ToArray();
skipCount += pageSize;
...
}
My first choice would be to fix the service, if you have access to it. A SQL-specific error could be a result of an incomplete database configuration, or a poorly written SQL query on the server. For example, Oracle limits IN lists in SQL queries to about 1000 items by default, but your Oracle DBA should be able to re-configure this limit for you. Alternatively, server side programmers could rewrite their query to avoid hitting this limit in the first place.
If this does not work, you could split your list into blocks of max size that does not trigger the error, make multiple calls to the server, and combine the instances on your end, like this:
InstanceCollection instances = referencesIDs
.Select((id, index) => new {Id = id, Index = index})
.GroupBy(p => p.Index / 500) // 500 is the max number of IDs
.SelectMany(g => this.MyService(typeID, g.Select(item => item.Id).ToArray()))
.ToList();
If you want a general way of splitting lists into chunks, you can use something like:
/// <summary>
/// Split a source IEnumerable into smaller (more manageable) lists.
/// </summary>
public static IEnumerable<IList<TSource>>
SplitIntoChunks<TSource>(this IEnumerable<TSource> source, int chunkSize)
{
long i = 1;
var list = new List<TSource>();
foreach (var t in source)
{
list.Add(t);
if (i++ % chunkSize == 0)
{
yield return list;
list = new List<TSource>();
}
}
if (list.Count > 0)
yield return list;
}
And then you can use SelectMany to flatten results:
InstanceCollection instances = referencesIDs
.SplitIntoChunks(500)
.SelectMany(chunk => MyService(typeID, chunk))
.ToList();

Join large list of Integers into LINQ Query

I have LINQ query that returns me the following error:
"The incoming tabular data stream (TDS) remote procedure call (RPC) protocol stream is incorrect. Too many parameters were provided in this RPC request. The maximum is 2100".
All I need is to count all clients that have BirthDate that I have their ID's in list.
My list of client ID's could be huge (millions of records).
Here is the query:
List<int> allClients = GetClientIDs();
int total = context.Clients.Where(x => allClients.Contains(x.ClientID) && x.BirthDate != null).Count();
When the query is rewritten this way,
int total = context
.Clients
.Count(x => allClients.Contains(x.ClientID) && x.BirthDate != null);
it causes the same error.
Also tried to make it in different way and it eats all memory:
List<int> allClients = GetClientIDs();
total = (from x in allClients.AsQueryable()
join y in context.Clients
on x equals y.ClientID
where y.BirthDate != null
select x).Count();
We ran into this same issue at work. The problem is that list.Contains() creates a WHERE column IN (val1, val2, ... valN) statement, so you're limited to how many values you can put in there. What we ended up doing was in fact do it in batches much like you did.
However, I think I can offer you a cleaner and more elegant piece of code to do this with. Here is an extension method that will be added to the other Linq methods you normally use:
public static IEnumerable<IEnumerable<T>> BulkForEach<T>(this IEnumerable<T> list, int size = 1000)
{
for (int index = 0; index < list.Count() / size + 1; index++)
{
IEnumerable<T> returnVal = list.Skip(index * size).Take(size).ToList();
yield return returnVal;
}
}
Then you use it like this:
foreach (var item in list.BulkForEach())
{
// Do logic here. item is an IEnumerable<T> (in your case, int)
}
EDIT
Or, if you prefer, you can make it act like the normal List.ForEach() like this:
public static void BulkForEach<T>(this IEnumerable<T> list, Action<IEnumerable<T>> action, int size = 1000)
{
for (int index = 0; index < list.Count() / size + 1; index++)
{
IEnumerable<T> returnVal = list.Skip(index * size).Take(size).ToList();
action.Invoke(returnVal);
}
}
Used like this:
list.BulkForEach(p => { /* Do logic */ });
Well as Gert Arnold mentioned before, making query in chunks solves the problem, but it looks nasty:
List<int> allClients = GetClientIDs();
int total = 0;
const int sqlLimit = 2000;
int iterations = allClients.Count() / sqlLimit;
for (int i = 0; i <= iterations; i++)
{
List<int> tempList = allClients.Skip(i * sqlLimit).Take(sqlLimit).ToList();
int thisTotal = context.Clients.Count(x => tempList.Contains(x.ClientID) && x.BirthDate != null);
total = total + thisTotal;
}
As has been said above, your query is probably being translated to:
select count(1)
from Clients
where ClientID = #id1 or ClientID = #id2 -- and so on up to the number of ids returned by GetClientIDs.
You will need to change your query such that you aren't passing so many parameters to it.
To see the generated SQL you can set the Clients.Log = Console.Out which will cause it to be written to the debug window when it is executed.
EDIT:
A possible alternative to chunking would be to send the IDs to the server as a delimited string, and create a UDF in your database which can covert that string back to a list.
var clientIds = string.Jon(",", allClients);
var total = (from client in context.Clients
join clientIds in context.udf_SplitString(clientIds)
on client.ClientId equals clientIds.Id
select client).Count();
There are lots of examples on Google for UDFs that split strings.
Another alternative and probably the fastest at query time is to add your numbers from the CSV file into a temporary table in your database and then do a join query.
Doing a query in chunks means a lot of round-trips between your client and database. If the list of IDs you are interested in is static or changes rarely, I recommend the approach of a temporary table.
If you don't mind moving the work from the database to the application server and have the memory, try this.
int total = context.Clients.AsEnumerable().Where(x => allClients.Contains(x.ClientID) && x.BirthDate != null).Count();

Sliding time window for record analysis

I have a data structure of phone calls. For this question there are two fields, CallTime and NumberDialled.
The analysis I want to perform is "Are there more than two calls to the same number in a 10 second window" The collection is sorted by CallTime already and is a List<Cdr>.
My solution is
List<Cdr> records = GetRecordsSortedByCallTime();
for (int i = 0; i < records.Count; i++)
{
var baseRecord = records[i];
for (int j = i; j < records.Count; j++)
{
var comparisonRec = records[j];
if (comparisonRec.CallTime.Subtract(baseRecord.CallTime).TotalSeconds < 20)
{
if (comparisonRec.NumberDialled == baseRecord.NumberDialled)
ReportProblem(baseRecord, comparisonRec);
}
else
{
// We're more than 20 seconds away from the base record. Break out of the inner loop
break;
}
}
}
Whis is ugly to say the least. Is there a better, cleaner and faster way of doing this?
Although I haven't tested this on a large data set, I will be running it on about 100,000 records per hour so there will be a large number of comparisons for each record.
Update The data is sorted by time not number as in an earlier version of the question
If the phone calls are already sorted by call time, you can do the following:
Initialize a hash table that has a counter for every phone number (the hash table can be first empty and you add elements to it as you go)
Have two pointers to the linked list of yours, let's call them 'left' and 'right'
Whenever the timestamp between the 'left' and 'right' call is less than 10 seconds, move 'right' forwards by one, and increment the count of the newly encountered phone number by one
Whenever the difference is above 10 seconds, move 'left' forwards by one and decrement the count for the phone number from which 'left' pointer left by one
At any point, if there is a phone number whose counter in the hash table is 3 or more, you have found a phone number that has more than 2 calls within a 10 seconds window
This is a linear-time algorithm and processes all the numbers in parallel.
I didn't know you exact structures, so I created my own for this demonstration:
class CallRecord
{
public long NumberDialled { get; set; }
public DateTime Stamp { get; set; }
}
class Program
{
static void Main(string[] args)
{
var calls = new List<CallRecord>()
{
new CallRecord { NumberDialled=123, Stamp=new DateTime(2011,01,01,10,10,0) },
new CallRecord { NumberDialled=123, Stamp=new DateTime(2011,01,01,10,10,9) },
new CallRecord { NumberDialled=123, Stamp=new DateTime(2011,01,01,10,10,18) },
};
var dupCalls = calls.Where(x => calls.Any(y => y.NumberDialled == x.NumberDialled && (x.Stamp - y.Stamp).Seconds > 0 && (x.Stamp - y.Stamp).Seconds <= 10)).Select(x => x.NumberDialled).Distinct();
foreach (var dupCall in dupCalls)
{
Console.WriteLine(dupCall);
}
Console.ReadKey();
}
}
The LINQ expression loops through all records and finds records which are ahead of the current record (.Seconds > 0), and within the time limit (.Seconds <= 10). This might be a bit of a performance hog due to the Any method constantly going over your whole list, but at least the code is cleaner :)
I recommand you to use Rx Extension and the Interval method.
The Reactive Extensions (Rx) is a library for composing asynchronous and event-based programs using observable sequences and LINQ-style query operators. Using Rx, developers represent asynchronous data streams with Observables, query asynchronous data streams using LINQ operators, and parameterize the concurrency in the asynchronous data streams using Schedulers
The Interval method returns an observable sequence that produces a value after each period
Here is quick example :
var callsPer10Seconds = Observable.Interval(TimeSpan.FromSeconds(10));
from x in callsPer10Seconds
group x by x into g
let count = g.Count()
orderby count descending
select new {Value = g.Key, Count = count};
foreach (var x in q)
{
Console.WriteLine("Value: " + x.Value + " Count: " + x.Count);
}
records.OrderBy(p => p.CallTime)
.GroupBy(p => p.NumberDialled)
.Select(p => new { number = p.Key, cdr = p.ToList() })
.Select(p => new
{
number = p.number,
cdr =
p.cdr.Select((value, index) => index == 0 ? null : (TimeSpan?)(value.CallTime - p.cdr[index - 1].CallTime))
.FirstOrDefault(q => q.HasValue && q.Value.TotalSeconds < 10)
}).Where(p => p.cdr != null);
In two steps :
Generate an enumeration with the call itself and all calls in the interesting span
Filter this list to find consecutive calls
The computation is done in parallel on each record using the AsParallel extension method.
It is also possible to not call the ToArray at the end and let the computation be done while other code could execute on the thread instead of forcing it to wait for the parallel computation to finish.
var records = new [] {
new { CallTime= DateTime.Now, NumberDialled = 1 },
new { CallTime= DateTime.Now.AddSeconds(1), NumberDialled = 1 }
};
var span = TimeSpan.FromSeconds(10);
// Select for each call itself and all other calls in the next 'span' seconds
var callInfos = records.AsParallel()
.Select((r, i) =>
new
{
Record = r,
Following = records.Skip(i+1)
.TakeWhile(r2 => r2.CallTime - r.CallTime < span)
}
);
// Filter the calls that interest us
var problematic = (from callinfo in callInfos
where callinfo.Following.Any(r => callinfo.Record.NumberDialled == r.NumberDialled)
select callinfo.Record)
.ToArray();
If performance is acceptable (which I think it should be, since 100k records is not particularly many), this approach is (I think) nice and clean:
First we group up the records by number:
var byNumber =
from cdr in calls
group cdr by cdr.NumberDialled into g
select new
{
NumberDialled = g.Key,
Calls = g.OrderBy(cdr => cdr.CallTime)
};
What we do now is Zip (.NET 4) each calls collection with itself-shifted-by-one, to transform the list of call times into a list of gaps between calls. We then look for numbers where there's a gap of at most 10 seconds:
var interestingNumbers =
from g in byNumber
let callGaps = g.Calls.Zip(g.Calls.Skip(1),
(cdr1, cdr2) => cdr2.CallTime - cdr1.CallTime)
where callGaps.Any(ts => ts.TotalSeconds <= 10)
select g.NumberDialled;
Now interestingNumbers is a sequence of the numbers of interest.

Categories

Resources