Using C# & LINQ, how do I speed up a GroupBy then Sum? - c#

Thanks for looking.
I have a data structure that is something like this:
Flight # | Airport | Tickets Sold
---------|---------|-------------
123      | SEA     | 4
432      | SFO     | 2
9875     | SEA     | 1
33       | CLT     | 3
In reality, my table has about 20k records and I need to determine which airport has the highest total tickets.
I am currently using the following code, which works but is EXTREMELY slow:
var mostTickets = flights.GroupBy(g => g.OriginAirport.IATA_Code)
    .Select(s => new { s.Key, Sum = s.Sum(su => su.Tickets), Airport = s.Select(se => se.OriginAirport) })
    .OrderByDescending(o => o.Sum)
    .First();
I have determined that the GroupBy call itself returns quickly, but everything downstream of it, starting with .Select, takes about 15 seconds for about 20,000 records.
Edit: it turns out it is the GroupBy() that is slow after all; because LINQ defers execution, the grouping work doesn't actually run until the later operators enumerate the groups.
Is there some way to speed this up?

Have you thought about setting up a view for that?
http://www.w3schools.com/sql/sql_view.asp
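If flights is actually a database query with lazy-loaded navigation properties, much of the 15 seconds may be spent loading OriginAirport row by row rather than in the grouping itself. Another thing worth trying: once the data is in memory, a single dictionary pass avoids re-enumerating each group. A minimal sketch, assuming Flight exposes OriginAirport.IATA_Code and Tickets as in the question:
// Sketch only: total tickets per airport code in one pass over an in-memory collection.
var totals = new Dictionary<string, int>();
foreach (var f in flights)
{
    var code = f.OriginAirport.IATA_Code;
    totals.TryGetValue(code, out var sum); // sum stays 0 if the code is not present yet
    totals[code] = sum + f.Tickets;
}
// Airport code with the highest total.
var busiest = totals.OrderByDescending(kv => kv.Value).First();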

Related

Get index array of value changes or first distinct values from array of Objects

I have a sorted array of objects, each containing 3 doubles; let's call them x, y and z. I need to create an array of indexes for each transition point. Example:
index | x | y | z
------|---|---|---
  0   | 3 | 2 | 1
  1   | 4 | 3 | 1
  2   | 3 | 1 | 1
  3   | 1 | 3 | 2
  4   | 3 | 1 | 2
  5   | 1 | 3 | 3
  6   | 1 | 3 | 3
This should give me the array {3,5}, as those are the points where z changes. I have tried
var ans = myObjectArr.Select(a => a.z).Distinct().ToList();
but that simply gives me a list of the values themselves, not the indexes they occupy in the object array. The array is very large and I would like to avoid iterating through it by hand. I feel like I'm just missing something silly, and any help would be appreciated. Generating either a list or an array would be acceptable.
EDIT: I was under the impression that using LINQ was faster than doing it iteratively. After reading the comments I find this is not the case, so Charlieface's answer is the best one for my problem.
var lastZ = myObjectArr[0].z;
var zIndexes = new List<int>();
for (var i = 1; i < myObjectArr.Length; i++)
{
    if (myObjectArr[i].z != lastZ)
    {
        lastZ = myObjectArr[i].z;
        zIndexes.Add(i);
    }
}
// or you can use obscure Linq code if it makes you feel better
var zIndexes = myObjectArr
    .Select((o, i) => (o, i))
    .Skip(1)
    .Aggregate(
        new List<int>(),
        (list, tuple) =>
        {
            if (tuple.o.z != lastZ)
            {
                lastZ = tuple.o.z;
                list.Add(tuple.i);
            }
            return list;
        });
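For what it's worth, the same idea can be written as a single Where over the indexes, which reads a bit more directly than the Aggregate version while doing the same O(n) amount of work (like the loop above, this assumes the array is non-empty):
var zIndexes = Enumerable.Range(1, myObjectArr.Length - 1)
    .Where(i => myObjectArr[i].z != myObjectArr[i - 1].z)
    .ToList();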

C# Retrieving the absolute value closest to 0 from a sum in a list

What I'm trying to do is get the Time where Math.Abs(A + B + C) is closest to 0 for each ID. It's quite hefty. I have a list (that I got from a CSV file) that kind of looks like this:
| Time | ID | A | B | C |
|------|----|---|---|---|
| 100  | 1  | 1 | 2 | 2 |
| 100  | 2  | 3 | 4 | 3 |
| 200  | 1  | 1 | 0 | 3 |
| 200  | 2  | 1 | 2 | 0 |
I have the following code, and while it is not complete yet, it technically prints the value that I want. However, when I debug it, it doesn't seem to loop through the IDs; it does everything in one loop. My idea was that I could get all the distinct IDs, then go through the distinct Times for each, put them all in a temporary list, and then just use Aggregate to get the value closest to 0. But I feel there should be a more efficient approach. Is there a quicker LINQ function I can use to achieve what I want?
for (int i = 0; i < ExcelRecords.Select(x => x.Id).Distinct().Count(); i++)
{
    for (int j = 0; j < ExcelRecords.Select(y => y.Time).Distinct().Count(); j++)
    {
        List<double> test = new List<double>();
        var a = ExcelRecords[j].a;
        var b = ExcelRecords[j].b;
        var c = ExcelRecords[j].c;
        test.Add(Math.Abs(a + b + c));
        Console.WriteLine(Math.Abs(a + b + c));
    }
}
Using explicit for loops like this is throwing the power and flexibility of LINQ out the window. LINQ's power comes from enumerators which can be used on their own or in conjunction with a foreach loop.
You are also doing way more work than necessary using Distinct to first get a list and then using that list to selectively group the rows into batches. This is what GroupBy was designed for.
var times = ExcelRecords.GroupBy((row) => row.Id)
    .Select((g) => g.Aggregate((min, row) => Math.Abs(row.a + row.b + row.c) < Math.Abs(min.a + min.b + min.c) ? row : min).Time)
    .ToList();
I think the most readable way would be to group by Id, order each group and select the first Time from each:
var times = ExcelRecords.GroupBy(x => x.Id)
    .Select(grp => grp.OrderBy(item => Math.Abs(item.A + item.B + item.C)))
    .Select(grp => grp.First().Time);
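If you are on .NET 6 or later, MinBy expresses the same idea without sorting each whole group (using the same property names as the snippet above):
var times = ExcelRecords
    .GroupBy(x => x.Id)
    .Select(g => g.MinBy(item => Math.Abs(item.A + item.B + item.C)).Time);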

Query a list object by a list int

I have a list of objects (each with an Id) and a list of ints. What is the best way to query the object list using the int list?
class CommonPartRecord {
    public int Id { get; set; }
    public object Others { get; set; }
}
var listObj = new List<CommonPartRecord>();
// Load listObj
var listId = new List<int>();
// Load listId
Now I want to select the items in listObj whose Id is contained in listId. I currently do it this way:
var filterItems = listObj.Where(x => listId.Contains(x.Id));
What would be a faster way to perform this?
Thanks,
Huy
var tmpList = new HashSet<int>(listId);
var filterItems = listObj.Where(x => tmpList.Contains(x.Id));
This could give you a performance boost or a performance drop, it depends heavily on the size of both listObj and listId.
You will need to profile it to see if you get any improvement from it.
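If it helps, here is a rough sketch of timing both versions on your real data with Stopwatch (from System.Diagnostics); one-off timings like this are only indicative, but they will show which side of the crossover you are on:
var sw = Stopwatch.StartNew();
var viaList = listObj.Where(x => listId.Contains(x.Id)).ToList();
sw.Stop();
Console.WriteLine($"List.Contains:    {sw.ElapsedMilliseconds} ms");

sw.Restart();
var set = new HashSet<int>(listId);           // include the build cost in the measurement
var viaSet = listObj.Where(x => set.Contains(x.Id)).ToList();
sw.Stop();
Console.WriteLine($"HashSet.Contains: {sw.ElapsedMilliseconds} ms");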
Explaining the boost or drop:
I am going to use some really exaggerated numbers to make the math easier. Let's say the following:
listObj.Where(x => listId.Contains(x.Id)); takes 5 seconds per row.
listObj.Where(x => tmpList.Contains(x.Id)) takes 2 seconds per row.
var tmpList = new HashSet<int>(listId); takes 10 seconds to build.
Let's plot out how long it would take to process the data, varying the number of rows in listObj:
+----------------+------------------------------+----------------------------------+
| Number of rows | Seconds to process with list | Seconds to process with hash set |
+----------------+------------------------------+----------------------------------+
| 1 | 5 | 12 |
| 2 | 10 | 14 |
| 3 | 15 | 16 |
| 4 | 20 | 18 |
| 5 | 25 | 20 |
| 6 | 30 | 22 |
| 7 | 35 | 24 |
| 8 | 40 | 26 |
+----------------+------------------------------+----------------------------------+
So you can see that if listObj has 1 to 3 rows, your old way is faster; however, once you have 4 rows or more, the new way is faster.
(Note: I made these numbers up completely. I can guarantee that the per-row cost of the HashSet is lower than the per-row cost of the List, but I cannot tell you how much lower. You will need to test whether the break-even point is at 4 rows or 4,000 rows. The only way to know is to try both ways and measure.)
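Another option worth profiling is Join, which builds a hash-based lookup over listId internally, so you get roughly the HashSet behaviour without creating it yourself (note that duplicate ids in listId would produce duplicate results with this approach):
var filterItems = listObj.Join(listId,
                               x => x.Id,
                               id => id,
                               (x, id) => x);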

Best data structure for caching objects with a composite unique id

I have a slow function that makes an expensive trip to the server to retrieve RecordHdr objects. These objects are sorted by rid first and then by aid. They are then returned in batches of 5.
| rid | aid |
-------------->
| 1 | 1 | >
| 1 | 3 | >
| 1 | 5 | > BATCH of 5 returned
| 1 | 6 | >
| 2 | 2 | >
-------------->
| 2 | 3 |
| 2 | 4 |
| 3 | 1 |
| 3 | 2 |
| 3 | 5 |
| 3 | 6 |
| 4 | 1 |
| 4 | 2 |
| 4 | 5 |
| 4 | 6 |
After I retrieve the objects, I have to wrap them in another class called WrappedRecordHdr. I'm wondering what is the best data structure I can use to maintain a cache of WrappedRecordHdr objects such that if I'm asked for an object by rid and aid, I return a particular object for it. Also if I'm asked for the rid, I should return all objects that have that rid.
So far I have created two structures, one for each scenario (this may not be the best way, but it's what I'm using for now):
// key: (rid, aid)
private CacheMap<int, int, WrappedRecordHdr> m_ridAidCache =
new CacheMap<int, int, WrappedRecordHdr>();
// key: (rid)
private CacheMap<int, WrappedRecordHdr[]> m_ridCache =
new CacheMap<int, WrappedRecordHdr[]>();
Also, I'm wondering if there is a way I can rewrite this to be more efficient. Right now I have to get a number of records that I need to wrap in another object. Then I need to group them in a dictionary by rid so that, if I am asked for a certain rid, I can return all objects that share it. The records are already sorted, so I'm hoping the GroupBy doesn't attempt to sort them beforehand.
RecordHdr[] records = server.GetRecordHdrs(sessId, BATCH_SIZE); // expensive call to server.

// After all RecordHdr objects are retrieved, loop through them.
// For each RecordHdr object a WrappedRecordHdr object has to be created.
WrappedRecordHdr[] wrappedRecords = new WrappedRecordHdr[records.Length];
for (int i = 0; i < wrappedRecords.Length; i++)
{
    if (records[i] == null || records[i].aid == 0 || records[i].rid == 0) continue; // skip invalid results.
    wrappedRecords[i] = new WrappedRecordHdr(AccessorManager, records[i], projectId);
}

// Group all records into a dictionary of rid => array of WrappedRecordHdrs, so that all
// objects associated with a particular rid are returned together.
// (Slots skipped above are still null, so filter them out before grouping.)
Dictionary<int, WrappedRecordHdr[]> dict = wrappedRecords
    .Where(w => w != null)
    .GroupBy(obj => obj.rid)
    .ToDictionary(gdc => gdc.Key, gdc => gdc.ToArray());
m_ridCache = dict;
As to the data structure, I think there are really two different questions here:
What structure to use;
Should there be one or two caches;
It seems to me that you want one cache, typed as a MemoryCache. The key would be the RID, and the value would be a Dictionary, where the key is an AID and the value is the header.
This has the following advantages:
The WrappedRecordHdrs are stored only once;
The MemoryCache already has all of the caching logic implemented, so you don't need to rewrite that;
When provided with only an RID, you know the AID of each WrappedRecordHdr (which you don't get with the array in the initial post);
These things are always compromises, so this has disadvantages too of course:
Cache access (get or set) requires constructing a string each time;
RID + AID lookups require indexing twice (as opposed to writing some fast hashing function that takes an RID and AID and returns a single key into the cache, however that would require that you either have two caches (one RID only, one RID + AID) or that you store the same WrappedRecordHdr twice per AID (once for RID + AID and once for null + AID));
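To make the single-cache suggestion concrete, here is a minimal sketch using System.Runtime.Caching.MemoryCache. The RecordCache name and the 30-minute expiry are made up for illustration, and it assumes WrappedRecordHdr exposes an aid property like the underlying RecordHdr:
// Sketch only: one cache entry per rid; the entry's value is a dictionary keyed by aid.
class RecordCache
{
    private readonly MemoryCache m_cache = MemoryCache.Default;

    public void Store(int rid, IEnumerable<WrappedRecordHdr> records)
    {
        var byAid = records.ToDictionary(r => r.aid);
        m_cache.Set(rid.ToString(), byAid, DateTimeOffset.Now.AddMinutes(30));
    }

    // All records for a given rid, or null if the rid is not cached.
    public ICollection<WrappedRecordHdr> Get(int rid) =>
        (m_cache.Get(rid.ToString()) as Dictionary<int, WrappedRecordHdr>)?.Values;

    // A single record by (rid, aid), or null if it is not cached.
    public WrappedRecordHdr Get(int rid, int aid)
    {
        var byAid = m_cache.Get(rid.ToString()) as Dictionary<int, WrappedRecordHdr>;
        return byAid != null && byAid.TryGetValue(aid, out var hdr) ? hdr : null;
    }
}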

Packing items vertically (with fixed length and horizontal position)

Let's say I have some items that have a defined length and horizontal position (both are constant):
1 : A
2 : B
3 : CC
4 : DDD (item 4 starts at position 1, length = 3)
5 : EE
6 : F
I'd like to pack them vertically, resulting in a rectangle with the smallest possible height.
Until now, I have used a very simple algorithm that loops over the items and checks, row by row, whether placing the item in that row is possible (that is, without colliding with anything else). Sometimes it works perfectly (by chance), but sometimes it results in a non-optimal solution.
Here is what it would give for the above example (step by step):
A     | A   B | ACC B | ACC B | ACC B | ACC B |
      |       |       |  DDD  |  DDD  | FDDD  |
      |       |       |       |    EE |    EE |
While the optimal solution would be:
ADDDB
FCCEE
Note: I have found that sorting items by their length (in descending order) before applying the algorithm gives better results (but it is still not perfect).
Is there any algorithm that would give me the optimal solution in reasonable time? (Trying all possibilities is not feasible.)
EDIT: here is an example that does not work with the sorting trick and does not work with what TylerOhlsen suggested (unless I don't understand his answer):
1 : AA
2 : BBB
3 : CCC
4 : DD
Would give:
AA BBB
CCC
DD
Optimal solution:
DDBBB
AACCC
Just spitballing (off the top of my head, and just pseudocode). This algorithm loops through the positions of the current row, attempts to find the best item to place at each position, and moves on to the next row when the current row completes. The algorithm finishes when all items are used.
The key to the performance of this algorithm is creating an efficient method which finds the longest item at a specific position. This could be done by creating a dictionary (or hash table) of: key=positions, value=sorted list of items at that position (sorted by length descending). Then finding the longest item at a position is as simple as looking up the list of items by position from that hash table and popping the top item off that list.
int cursorRow = 0;
int cursorPosition = 0;
int maxRowLength = 5;
List<Item> items = // fill with item list
Item[][] result = new Item[][];

while (items.Count() > 0)
{
    Item item = FindLongestItemAtPosition(cursorPosition);
    if (item != null)
    {
        result[cursorRow][cursorPosition] = item;
        items.Remove(item);
        cursorPosition += item.Length;
    }
    else // No items remain with this position
    {
        cursorPosition++;
    }

    if (cursorPosition == maxRowLength)
    {
        cursorPosition = 0;
        cursorRow++;
    }
}
This should result in the following steps for Example 1 (at the beginning of each loop)...
Row=0 | Row=0 | Row=0 | Row=1 | Row=1 | Row=1 | Row=2 |
Pos=0 | Pos=1 | Pos=4 | Pos=0 | Pos=1 | Pos=3 | Pos=0 |
      | A     | ADDD  | ADDDB | ADDDB | ADDDB | ADDDB |
      |       |       |       | F     | FCC   | FCCEE |
This should result in the following steps for Example 2 (at the beginning of each loop)...
Row=0 | Row=0 | Row=0 | Row=1 | Row=1 | Row=1 | Row=2 |
Pos=0 | Pos=2 | Pos=4 | Pos=0 | Pos=1 | Pos=3 | Pos=0 |
      | AA    | AACCC | AACCC | AACCC | AACCC | AACCC |
      |       |       |       |       | DD    | DDBBB |
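To make the position lookup described above concrete, here is a small sketch of how FindLongestItemAtPosition could be backed by a position-to-items dictionary. The Item type with Position and Length properties is assumed from the question's description rather than taken from real code:
// Build once: position -> items starting there, longest first.
var byPosition = items
    .GroupBy(it => it.Position)
    .ToDictionary(g => g.Key,
                  g => g.OrderByDescending(it => it.Length).ToList());

// Pop the longest remaining item that starts at 'position', or null if none is left.
Item FindLongestItemAtPosition(int position)
{
    if (byPosition.TryGetValue(position, out var list) && list.Count > 0)
    {
        var longest = list[0];
        list.RemoveAt(0);
        return longest;
    }
    return null;
}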
This is a classic Knapsack Problem. As @amit said, it is NP-Complete. The most efficient solution makes use of Dynamic Programming.
The Wikipedia page is a very good start. I've never implemented an algorithm to solve this problem, but I've studied its relation to the minesweeper game, which is also NP-Complete.
Wikipedia: Knapsack Problem
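For reference, the textbook 0/1 knapsack dynamic program that this answer alludes to looks like the sketch below; the weights, values and capacity are purely illustrative, and mapping the packing problem onto it is left open:
// 0/1 knapsack via dynamic programming: maximize total value within a weight capacity.
int[] weights = { 2, 3, 4, 5 };
int[] values  = { 3, 4, 5, 6 };
int capacity  = 5;

var best = new int[capacity + 1];                  // best[c] = max value achievable with capacity c
for (int i = 0; i < weights.Length; i++)
{
    for (int c = capacity; c >= weights[i]; c--)   // go downwards so each item is used at most once
    {
        best[c] = Math.Max(best[c], best[c - weights[i]] + values[i]);
    }
}
Console.WriteLine(best[capacity]);                 // 7 (take the items of weight 2 and 3)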
