I have a LINQ "Where" expression that may return several rows:
var checkedPrices = prices.Where(...).ToList();
Since several rows are retrieved from the database, I want to take the row with the largest string from this list of rows.
There is also a case where that field may have the same length in several rows, so I tried to fall back to the largest value of another field.
int countPrices = checkedPrices.Count();
if (countPrices == 0)
{
checkedPrices = null;
}
else if (countPrices == 1)
{
checkedPrices = checkedPrices.Take(1).ToList();
}
else if (countPrices > 1)
{
var maxPrices1 = checkedPrices.Max(i => i.Field1.Length);
if (maxPrices1 > 1)
{
var maxPrices2 = checkedPrices.Max(i => i.Field2.Length);
checkedPrices = checkedPrices.IndexOf(maxPrices2 );
}
checkedPrices = checkedPrices .ElementAt(maxPrices2);
}
So, I have an issue in the last "else if".
My logic was to find the row with the longest Field1.
If there is only one longest Field1, assign that row back to the "Where" result (checkedPrices).
If more than one row ties for the longest Field1, take the one with the longest Field2.
My problem is that I'm confused about how to take the row data for the largest Field1/Field2.
This part of the code is ridiculously bad (it doesn't even compile):
if (maxPrices1 > 1)
{
var maxPrices2 = checkedPrices.Max(i => i.Field2.Length);
checkedPrices = checkedPrices.IndexOf(maxPrices2 );
}
checkedPrices = checkedPrices .ElementAt(maxPrices2);
Since it seems you need only one price, I would recommend writing a query that fetches just that one. You can order the items (with LINQ's OrderByDescending and ThenByDescending) and then take the top one:
var checkedPrice = prices
.Where(...)
.OrderByDescending(c => c.Field1.Length)
.ThenByDescending(c => c.Field2.Length)
.FirstOrDefault();
P.S.:
For LINQ-to-Objects this solution can be inefficient for large datasets after filtering, since sorting is an O(n * log(n)) operation while finding the maximum element is an O(n) task.
There can be implementation-dependent LINQ optimizations for some cases, such as the combination of OrderBy(Descending) with some overloads of operators like First(OrDefault) and possibly Skip/Take (see one, two, three).
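For the record, here is a rough LINQ-to-Objects sketch of that O(n) alternative, assuming checkedPrices is the already-filtered, non-empty list from the question and that Field1/Field2 are non-null strings (on .NET 6+ the same thing can be written as checkedPrices.MaxBy(p => (p.Field1.Length, p.Field2.Length))):
// Single O(n) pass: keep the "best so far" element, preferring a longer Field1
// and falling back to a longer Field2 on ties.
var best = checkedPrices.Aggregate((max, current) =>
    current.Field1.Length > max.Field1.Length
    || (current.Field1.Length == max.Field1.Length
        && current.Field2.Length > max.Field2.Length)
        ? current
        : max);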
I am attempting to create batches of 100 records that I can delete from Azure Table Storage. I found a great article on efficiently creating batches to delete table records here: https://blog.bitscry.com/2019/03/25/efficiently-deleting-rows-from-azure-table-storage/
and have followed it along. The issue I am facing, which differs from the example in this blog post, is that my deletes will have different partition keys. So rather than simply splitting my results into batches of 100 (as the example does), I first need to split them into groups with like partition keys, and THEN examine those lists and further sub-divide them if the count is greater than 100 (as Azure recommends only batches of 100 records at a time, and they all require the same partition key).
Let me say I am TERRIBLE with enumerable LINQ and the non-query style described in this blog post, so I'm a bit lost. I have written a small workaround that does create these batches by partition key, and the code works to delete them; I'm just not handling the possibility that there may be more than 100 rows to delete for a given partition key. So the code below is just an example to show you how I approached splitting the updates by partition key.
List<string> partitionKeys = toDeleteEntities.Select(x => x.PartitionKey).Distinct().ToList();
List<List<DynamicTableEntity>> chunks = new List<List<DynamicTableEntity>>();
for (int i = 0; i < partitionKeys.Count; ++i)
{
var count = toDeleteEntities.Where(x => x.PartitionKey == partitionKeys[i]).Count();
//still need to figure how to split by groups of 100.
chunks.Add(toDeleteEntities.Distinct().Where(x=>x.PartitionKey == partitionKeys[i]).ToList());
}
I have tried to do multiple GroupBy statements in a LINQ expression similar to this:
// Split into chunks of 100 for batching
List<List<TableEntity>> rowsChunked = tableQueryResult.Result.Select((x, index) => new { Index = index, Value = x })
.Where(x => x.Value != null)
.GroupBy(x => x.Index / 100)
.Select(x => x.Select(v => v.Value).ToList())
.ToList();
but once I add a second set of parameters to group by (e.g. x => x.PartitionKey), my Select below starts to go pear-shaped. The end result object is a LIST of LISTS that contain DynamicTableEntities and an index:
[0]
[0][Entity]
[1][Entity]
...
[99][Entity]
[1]
[0][Entity]
[1][Entity]
...
I hope this makes sense; if not, please feel free to ask for clarification.
Thanks in advance.
EDIT FOR CLARIFICATION:
The idea is simply that I want to group by PARTITION key AND only take 100 rows before creating another batch with the SAME partition key and adding the rest of the rows.
thanks,
One useful LINQ method here is GroupBy(keySelector). It basically divides your collection into groups based on a key selector, so in your case you'd want to group by PartitionKey:
var partitionGroups = toDeleteEntities.GroupBy(d => d.PartitionKey);
When you iterate through this collection, you'll get an IGrouping for each key. Finally, to get the correct batches, you can use Skip(int count) and Take(int count):
const int maxBatchSize = 100;   // Azure Table Storage allows at most 100 operations per batch
var chunks = new List<List<DynamicTableEntity>>();
foreach (var partitionGroup in partitionGroups)
{
    var partitionKey = partitionGroup.Key;   // shared by every entity in this group
    int startPosition = 0;
    int count = partitionGroup.Count();
    while (count > 0)
    {
        // take the remainder first, then full batches of maxBatchSize
        int batchSize = count % maxBatchSize > 0 ? count % maxBatchSize : maxBatchSize;
        var partitionBatch = partitionGroup.Skip(startPosition).Take(batchSize);
        // process your batches here
        chunks.Add(new List<DynamicTableEntity>(partitionBatch));
        startPosition += batchSize;
        count = count - batchSize;
    }
}
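If you prefer to do the whole thing in one LINQ expression, here is a rough sketch of the same idea (assuming DynamicTableEntity and a batch size of 100): group by PartitionKey first, then sub-chunk each group by index.
// Rough sketch: group by PartitionKey, then split each group into chunks of at most 100.
List<List<DynamicTableEntity>> chunks = toDeleteEntities
    .GroupBy(e => e.PartitionKey)
    .SelectMany(group => group
        .Select((entity, index) => new { entity, index })
        .GroupBy(x => x.index / 100)   // 0..99 -> chunk 0, 100..199 -> chunk 1, ...
        .Select(chunk => chunk.Select(x => x.entity).ToList()))
    .ToList();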
I'm creating a report-generating tool that uses custom data types from different sources in our system. The user can create a report schema, and depending on what is asked, the data gets associated based on different index keys, time, time ranges, etc. The project is NOT doing queries against a relational database; it's pure C# code working on collections in RAM.
I'm having a huge performance issue; I've been looking at my code for a few days and am struggling to optimize it.
I stripped the code down to a minimal example of what the profiler points to as the problematic algorithm, but the real version is a bit more complex, with more conditions and work with dates.
In short, this function returns the subset of "values" satisfying conditions that depend on the keys of the values selected from the "index rows".
private List<LoadedDataSource> GetAssociatedValues(IReadOnlyCollection<List<LoadedDataSource>> indexRows, List<LoadedDataSource> values)
{
var checkContainers = ((ValueColumn.LinkKeys & ReportLinkKeys.ContainerId) > 0 &&
values.Any(t => t.ContainerId.HasValue));
var checkEnterpriseId = ((ValueColumn.LinkKeys & ReportLinkKeys.EnterpriseId) > 0 &&
values.Any(t => t.EnterpriseId.HasValue));
var ret = new List<LoadedDataSource>();
foreach (var value in values)
{
var valid = true;
foreach (var index in indexRows)
{
// ContainerId
var indexConservedSource = index.AsEnumerable();
if (checkContainers && index.CheckContainer && value.ContainerId.HasValue)
{
indexConservedSource = indexConservedSource.Where(t => t.ContainerId.HasValue && t.ContainerId.Value == value.ContainerId.Value);
if (!indexConservedSource.Any())
{
valid = false;
break;
}
}
//EnterpriseId
if (checkEnterpriseId && index.CheckEnterpriseId && value.EnterpriseId.HasValue)
{
indexConservedSource = indexConservedSource.Where(t => t.EnterpriseId.HasValue && t.EnterpriseId.Value == value.EnterpriseId.Value);
if (!indexConservedSource.Any())
{
valid = false;
break;
}
}
}
if (valid)
ret.Add(value);
}
return ret;
}
This works for small samples, but as soon as I have thousands of values, and 2-3 index rows with a few dozen values each, it can take hours to generate.
As you can see, I try to break as soon as an index condition fails and move on to the next value.
I could probably do everything in a single "values.Where(####).ToList()", but that condition gets complex fast.
I tried generating an IQueryable around indexConservedSource, but it was even worse. I tried using a Parallel.ForEach with a ConcurrentBag for "ret", and it was also slower.
What else can be done?
What you are doing, in principle, is calculating the intersection of two sequences. You use two nested loops, and that is slow: the time is O(m*n). You have two other options:
sort both sequences and merge them
convert one sequence into a hash table and test the second against it
The second approach seems better for this scenario. Just convert those index lists into HashSets and test the values against them. I added some code for inspiration:
private List<LoadedDataSource> GetAssociatedValues(IReadOnlyCollection<List<LoadedDataSource>> indexRows, List<LoadedDataSource> values)
{
var ret = values;
if ((ValueColumn.LinkKeys & ReportLinkKeys.ContainerId) > 0 &&
ret.Any(t => t.ContainerId.HasValue))
{
var indexes = indexRows
.Where(i => i.CheckContainer)
.Select(i => new HashSet<int>(i
.Where(h => h.ContainerId.HasValue)
.Select(h => h.ContainerId.Value)))
.ToList();
ret = ret.Where(v => v.ContainerId == null
|| indexes.All(i => i.Contains(v.ContainerId.Value)))
.ToList();
}
if ((ValueColumn.LinkKeys & ReportLinkKeys.EnterpriseId) > 0 &&
ret.Any(t => t.EnterpriseId.HasValue))
{
var indexes = indexRows
.Where(i => i.CheckEnterpriseId)
.Select(i => new HashSet<int>(i
.Where(h => h.EnterpriseId.HasValue)
.Select(h => h.EnterpriseId.Value)))
.ToList();
ret = ret.Where(v => v.EnterpriseId == null
|| indexes.All(i => i.Contains(v.EnterpriseId.Value)))
.ToList();
}
return ret;
}
I would like to do something like this (below), but I'm not sure if there is a formal/optimized syntax to do so:
.OrderBy(i => i.Value1)
.Take("Bottom 100 & Top 100")
.OrderBy(i => i.Value2);
basically, I want to sort by one variable, then take the top 100 and bottom 100, and then sort those results by another variable.
Any suggestions?
var sorted = list.OrderBy(i => i.Value1);
var top100 = sorted.Take(100);
var last100 = sorted.Reverse().Take(100);
var result = top100.Concat(last100).OrderBy(i => i.Value2);
I don't know if you want Concat or Union at the end. Concat will combine all entries of both lists even if there are identical entries, which would be the case if your original list contains fewer than 200 entries. Union would only add items from last100 that are not already in top100.
Some things that are not clear but that should be considered:
If list is an IQueryable backed by a database, it is probably advisable to use ToArray() or ToList(), e.g.
var sorted = list.OrderBy(i => i.Value1).ToArray();
at the beginning. This way only one query is sent to the database while the rest is done in memory.
The Reverse method is not optimized the way I hoped for, but it shouldn't be a problem, since ordering the list is the real work here. For the record, though, the Skip method explained in other answers is probably a little bit faster, but it needs to know the number of elements in the list.
If list were a LinkedList or another class implementing IList, Reverse could be done in an optimized way.
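For reference, here is a minimal sketch of that Skip-based variant (it needs the element count, hence the ToArray()):
var sorted = list.OrderBy(i => i.Value1).ToArray();
var top100 = sorted.Take(100);
// Skip to the last 100 elements; Math.Max guards against arrays shorter than 100.
var last100 = sorted.Skip(Math.Max(0, sorted.Length - 100));
var result = top100.Concat(last100).OrderBy(i => i.Value2);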
You can use an extension method like this:
public static IEnumerable<T> TakeFirstAndLast<T>(this IEnumerable<T> source, int count)
{
var first = new List<T>();
var last = new LinkedList<T>();
foreach (var item in source)
{
if (first.Count < count)
first.Add(item);
if (last.Count >= count)
last.RemoveFirst();
last.AddLast(item);
}
return first.Concat(last);
}
(I'm using a LinkedList<T> for last because it can remove items in O(1))
You can use it like this:
.OrderBy(i => i.Value1)
.TakeFirstAndLast(100)
.OrderBy(i => i.Value2);
Note that it doesn't handle the case where there are fewer than 200 items: in that case, you will get duplicates. You can remove them using Distinct if necessary, as in the sketch below.
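For example (just a sketch, assuming the default equality comparer is appropriate for your items; items stands in for your source sequence):
var result = items
    .OrderBy(i => i.Value1)
    .TakeFirstAndLast(100)
    .Distinct()   // drops the duplicates that appear when the source has fewer than 200 items
    .OrderBy(i => i.Value2)
    .ToList();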
Take the top 100 and bottom 100 separately and union them:
var tempresults = yourenumerable.OrderBy(i => i.Value1);
var results = tempresults.Take(100);
results = results.Union(tempresults.Skip(tempresults.Count() - 100).Take(100))
.OrderBy(i => i.Value2);
You can also do it in one statement using the indexed .Where overload, if you have the number of elements available:
var elements = ...
var count = elements.Length; // or .Count for list
var result = elements
.OrderBy(i => i.Value1)
.Where((v, i) => i < 100 || i >= count - 100)
.OrderBy(i => i.Value2)
.ToArray(); // evaluate
Here's how it works:
|  first 100 elements  |  middle elements  |  last 100 elements  |
       i < 100            (filtered out)       i >= count - 100
You can write your own extension method, like Take(), Skip() and the other methods in the Enumerable class. It takes the number of elements and the total length of the list as input, and returns the first and last N elements of the sequence.
var result = yourList.OrderBy(x => x.Value1)
.GetLastAndFirst(100, yourList.Length)
.OrderBy(x => x.Value2)
.ToList();
Here is the extension method:
public static class SOExtensions
{
public static IEnumerable<T> GetLastAndFirst<T>(
this IEnumerable<T> seq, int number, int totalLength
)
{
if (totalLength < number*2)
throw new Exception("List length must be >= (number * 2)");
using (var en = seq.GetEnumerator())
{
int i = 0;
while (en.MoveNext())
{
i++;
if (i <= number || i >= totalLength - number)
yield return en.Current;
}
}
}
}
I have this query that gives the correct results, but it takes about 15 seconds to run:
int Count = P.Pets.Where(c => !P.Pets.Where(a => a.IsOwned == true)
                              .Select(a => a.OwnerName).Contains(c.OwnerName)
                              && c.CreatedDate >= EntityFunctions.AddDays(DateTime.Now, -8))
                  .GroupBy(b => b.OwnerName).Count();
If I remove this part of the LINQ
'&& c.CreatedDate >= EntityFunctions.AddDays(DateTime.Now, -8)'
it only takes about 3 seconds to run. How can I keep the same condition but make it run a lot faster?
I need that date criterion because I don't want any Pets created more than 8 days ago to be included in the count.
Edit
I have a table named People, referred to in this query as P, and I want to return a count of the Pets that do not have an owner, removing from the query anyone who does own a pet, even if they also appear on another Pet record where they are not the owner of that Pet. Meaning: if a person has at least one record in the Pets table marking them as the owner of a pet, then I want to remove every case where that person exists from the returned query, and once that is done, only return the Pets that were created within the last 8 days.
You should cache the date and put that evaluation first (since the DateTime evaluation should be faster than a Contains evaluation). Also avoid recalculating the same query multiple times.
// Cache the cutoff once, computed client-side (eight days, per the question)
DateTime eightDaysOld = DateTime.Now.AddDays(-8);
//calculate these independently from the rest of the query
var ownedPetOwnerNames = P.Pets.Where(a => a.IsOwned == true)
.Select(a => a.OwnerName);
//Evaluate the dates first, it should be
//faster than Contains()
int Count = P.Pets.Where(c => c.CreatedDate >= eightDaysOld &&
//Using the cached result should speed this up
ownedPetOwnerNames.Contains(c.OwnerName))
.GroupBy(b=>b.OwnerName).Count();
That should return the same results. (I hope)
You are losing any ability to use indexes with that snippet, as it recalculates that static date for every row. Declare a DateTime variable before your query, set it to DateTime.Now.AddDays(-8), and use the variable instead of your snippet in the where clause.
Separating the query, calling ToList() on it, and inserting it into the master query made it go 4 times faster:
var ownedPetOwnerNames = P.Pets.Where(a => a.IsOwned == true)
.Select(a => a.OwnerName).ToList();
int Count = P.Pets.Where(c => c.CreatedDate >= Date &&
    ownedPetOwnerNames.Contains(c.OwnerName)).GroupBy(b => b.OwnerName).Count();
You could use (and maybe first create) a navigation property Pet.Owner:
var refDate = DateTime.Today.AddDays(-8);
int Count= P.Pets
.Where(p => !p.Owner.Pets.Any(p1 => p1.IsOwned)
&& p.CreatedDate >= refDate)
.GroupBy(b => b.OwnerName).Count();
This may increase performance because the Contains is gone. At least it scales better than your two-phase query with a Contains involving an unpredictable number of strings.
Of course you also need to make sure there is an index on CreatedDate.
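For illustration, here is a rough sketch of what the entities might look like; apart from the Pet.Owner / Person.Pets navigation properties used above, the names and types are assumptions:
public class Person
{
    public int Id { get; set; }
    public virtual ICollection<Pet> Pets { get; set; }    // pets referencing this person
}

public class Pet
{
    public int Id { get; set; }
    public string OwnerName { get; set; }
    public bool IsOwned { get; set; }
    public DateTime CreatedDate { get; set; }
    public virtual Person Owner { get; set; }             // navigation property used in the query above
}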
Right now, I have 2 lists (_pumpOneSpm and _pumpTwoSpm) that are the result of a database query. I've been using LINQ to populate lists to work with for calculations, something similar to this:
var spmOne = _pumpOneSpm.Where((t, i) => _date[i].Equals(date) &&
DateTime.Compare(_time[i], DateTime.Parse(start)) > 0 &&
DateTime.Compare(_time[i], DateTime.Parse(end)) < 0).ToList();
This has been fine, because I've been pulling data from one list to add specific data into another list. But now, I need to pull data from two lists to add to one list. I'm curious whether there is a way to do this with LINQ, or if I should just iterate through with a for loop. Here's what I've got with the for loop:
for (var i = 0; i < _pumpOneSpm.Count; i++)
{
if (_date[i].Equals(date))
{
if (DateTime.Compare(_time[i], DateTime.Parse(start)) > 0)
{
if (DateTime.Compare(_time[i], DateTime.Parse(end)) < 0)
{
spmOne.Add(_pumpOneSpm[i]);
spmOne.Add(_pumpTwoSpm[i]);
}
}
}
}
This works, and does what I want. But I'm just trying to make it a bit more efficient and faster. I could use two lines similar to the first code block for each list, but that means I'm iterating through twice. So, to reiterate my question, is it possible to use a LINQ command to pull data from two lists at the same time to populate another list, or should I stick with my for loop? Thanks for any and all help.
EDIT
I failed to mention that the purpose of the function using these steps is to find the MAX of _pumpOneSpm & _pumpTwoSpm. Not the max of each, but the max between them. So initially I was adding them into a single list and just calling spmOne.Max(). But with two LINQ queries like the one above, I'm running an if statement to compare the two maxes and then returning the greater of the two. So, technically they don't NEED to be in one list; I just thought it'd be easier to handle if they were.
So the simplest way to do this would be something like this:
var resultList = list1.Concat(list2).ToList();
Note that this will be slightly different than your for loop (the order of the items won't be the same). It will have all items from one list followed by all items from another, rather than having the first of the first, the first of the second, etc. If that's important you could still do it with LINQ, but it would take a few extra method calls.
You've mentioned performance; it's worth noting that you aren't going to get significantly better performance than your for loop. You could get code that's shorter and more readable, and it will perform about the same. There simply isn't much more room for improvement in terms of runtime speed.
It's also worth noting that you can combine the two methods; it's not even a bad idea. You can use LINQ to filter the two input sequences (using Where) and then foreach over the results to add them to the list (or use AddRange).
var query1 = _pumpOneSpm.Where((t, i) => _date[i].Equals(date) &&
DateTime.Compare(_time[i], DateTime.Parse(start)) > 0 &&
DateTime.Compare(_time[i], DateTime.Parse(end)) < 0);
var query2 = ...
List<T> results = new List<T>(query1);
results.AddRange(query2);
So, to reiterate my question, is it possible to use a LINQ command to
pull data from two lists at the same time to populate another list, or
should I stick with my for loop?
Merge the lists:
var mergedList = list1.Union(list2).ToList();
Optionally if you want, then filter:
var filtered = mergedList.Where( p => ... filter ... );
And for the Max:
var max = filtered.Max();
NOTE:
As the OP says, neither ordering nor duplicate handling matters here, so Union will be OK.
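Alternatively, since (per the edit) only the overall maximum is needed, you can skip the merge entirely. A minimal sketch, assuming both filtered sequences are non-empty and numeric (spmOneQuery and spmTwoQuery stand in for the two filtered queries):
// Compare the two maxima directly instead of merging the lists first.
var overallMax = Math.Max(spmOneQuery.Max(), spmTwoQuery.Max());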
You can use the SelectMany method
http://msdn.microsoft.com/en-us/library/bb548748.aspx
You can achieve similar functionality to the for loop with Enumerable.Range. SelectMany then allows you to add multiple items each iteration and then flatten the resulting sequence.
spmOne = Enumerable
.Range(0, _pumpOneSpm.Count)
.Where(index => _date[index].Equals(date) &&
DateTime.Compare(_time[index], DateTime.Parse(start)) > 0 &&
DateTime.Compare(_time[index], DateTime.Parse(end)) < 0)
.SelectMany(index => new [] { _pumpOneSpm[index], _pumpTwoSpm[index] })
.ToList();
Here is the query syntax version:
spmOne = (from index in Enumerable.Range(0, _pumpOneSpm.Count)
where _date[index].Equals(date) &&
DateTime.Compare(_time[index], DateTime.Parse(start)) > 0 &&
DateTime.Compare(_time[index], DateTime.Parse(end)) < 0
from pumpSpm in new [] { _pumpOneSpm[index], _pumpTwoSpm[index] }
select pumpSpm).ToList();
If you want to avoid using indexes, you can Zip everything together:
spmOne = _pumpOneSpm
.Zip(_pumpTwoSpm, (pump1, pump2) => new { pump1, pump2 } )
.Zip(_date, (x, pumpDate) => new { x.pump1, x.pump2, date = pumpDate })
.Zip(_time, (x, time) => new { x.pump1, x.pump2, x.date, time })
.Where(x => x.date.Equals(date) &&
DateTime.Compare(x.time, DateTime.Parse(start)) > 0 &&
DateTime.Compare(x.time, DateTime.Parse(end)) < 0)
.SelectMany(x => new [] { x.pump1, x.pump2 })
.ToList();
There is one small difference in functionality: if any of the sequences are shorter than _pumpOneSpm, the resulting sequence will be the length of the shortest sequence, and no exception will be thrown about an index being out of range.