Query a list of objects by a list of ints - C#

I have a list of objects (each with an Id) and a list of ints. What is the best way to query the object list given the list of ints?
class CommonPartRecord {
    public int Id { get; set; }
    public object Others { get; set; }
}
var listObj = new List<CommonPartRecord>();
// Load listObj
var listId = new List<int>();
// Load listId
Now I want to select the items of listObj whose Id is contained in listId. I currently do it this way:
var filterItems = listObj.Where(x => listId.Contains(x.Id));
What would be a faster way to perform this?
Thanks,
Huy

var tmpList = new HashSet<int>(listId);
var filterItems = listObj.Where(x => tmpList.Contains(x.Id));
This could give you a performance boost or a performance drop; it depends heavily on the sizes of both listObj and listId.
You will need to profile it to see if you get any improvement from it.
Explaining the boost or drop:
I am going to use some really exaggerated numbers to make the math easier. Let's say the following:
listObj.Where(x => listId.Contains(x.Id)) takes 5 seconds per row.
listObj.Where(x => tmpList.Contains(x.Id)) takes 2 seconds per row.
var tmpList = new HashSet<int>(listId); takes 10 seconds to build.
Let's plot out how long it would take to process the data, varying the number of rows in listObj:
+----------------+------------------------------+----------------------------------+
| Number of rows | Seconds to process with list | Seconds to process with hash set |
+----------------+------------------------------+----------------------------------+
|              1 |                            5 |                               12 |
|              2 |                           10 |                               14 |
|              3 |                           15 |                               16 |
|              4 |                           20 |                               18 |
|              5 |                           25 |                               20 |
|              6 |                           30 |                               22 |
|              7 |                           35 |                               24 |
|              8 |                           40 |                               26 |
+----------------+------------------------------+----------------------------------+
So you can see that if listObj has 1 to 3 rows, your old way is faster; once you have 4 rows or more, the new way is faster.
(Note: I made these numbers up. I can guarantee that the per-row cost for HashSet is lower than the per-row cost for List, but I cannot tell you how much lower. You will need to test to see whether the break-even point is at 4 rows or at 4,000 rows. The only way to know is to try both ways and measure.)
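To see where that break-even point lies for your data, you can time both queries. Here is a minimal profiling sketch, assuming the CommonPartRecord class from the question; the collection sizes are made-up placeholders, so vary them to find your own crossover point:

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;

// Build sample data; the sizes are arbitrary placeholders.
var listObj = Enumerable.Range(0, 100000)
    .Select(i => new CommonPartRecord { Id = i })
    .ToList();
var listId = Enumerable.Range(0, 1000).ToList();

// Where() is lazy, so ToList() is needed to force the query to actually run.
var sw = Stopwatch.StartNew();
var viaList = listObj.Where(x => listId.Contains(x.Id)).ToList();
Console.WriteLine("List.Contains:    {0} ms", sw.ElapsedMilliseconds);

sw.Restart();
var tmpSet = new HashSet<int>(listId);
var viaSet = listObj.Where(x => tmpSet.Contains(x.Id)).ToList();
Console.WriteLine("HashSet.Contains: {0} ms", sw.ElapsedMilliseconds);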

Related

Get index array of value changes or first distinct values from array of Objects

I have a sorted array of objects that each contain 3 doubles; let's call them x, y and z. I need to create an array of indexes for each transition point. Example:
index   x   y   z
------------------
  0   | 3 | 2 | 1
  1   | 4 | 3 | 1
  2   | 3 | 1 | 1
  3   | 1 | 3 | 2
  4   | 3 | 1 | 2
  5   | 1 | 3 | 3
  6   | 1 | 3 | 3
This should give me the array {3,5}, as those are the indexes where z changes. I have tried
var ans = myObjectArr.Select(a => a.z).Distinct().ToList();
but that simply gives me a list of the values themselves, not the indexes they are at in the object array. The array is very large and I would like to avoid iterating my way through it discretely. I feel like I'm just missing something silly; any help would be appreciated. Generating a list or an array would both be acceptable.
EDIT: I was under the impression that using LINQ was faster than doing it iteratively; after reading the comments I find this not to be the case, and because of this Charlieface's answer is the best one for my problem.
var lastZ = myObjectArr[0].z;
var zIndexes = new List<int>();
for (var i = 1; i < myObjectArr.Length; i++)
{
    if (myObjectArr[i].z != lastZ)
    {
        lastZ = myObjectArr[i].z;
        zIndexes.Add(i);
    }
}
// or you can use obscure Linq code if it makes you feel better
// (reset lastZ to myObjectArr[0].z first if you run this after the loop above)
var zIndexesLinq = myObjectArr
    .Select((o, i) => (o, i))
    .Skip(1)
    .Aggregate(
        new List<int>(),
        (list, tuple) => {
            if (tuple.o.z != lastZ)
            {
                lastZ = tuple.o.z;
                list.Add(tuple.i);
            }
            return list;
        });
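As a quick sanity check of the loop above against the sample data in the question (a minimal sketch; only the z values matter here, so x and y are left out):

using System;
using System.Collections.Generic;

// The z column from the table above: transitions occur at indexes 3 and 5.
var zs = new double[] { 1, 1, 1, 2, 2, 3, 3 };
var lastZ = zs[0];
var zIndexes = new List<int>();
for (var i = 1; i < zs.Length; i++)
{
    if (zs[i] != lastZ)
    {
        lastZ = zs[i];
        zIndexes.Add(i);
    }
}
Console.WriteLine(string.Join(",", zIndexes)); // prints "3,5"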

Why is Thread faster than Parallel.ForEach to open an OracleConnection?

I have two pieces of code as follows, one using Thread and one using Parallel.ForEach.
Thread
foreach (var i in new int [] {0, 1, 2, 3, 4, ..., 24 })
    new Thread(GET_DATA).Start(i);
Parallel
Parallel.ForEach(new int [] {0, 1, 2, 3, 4, ..., 24 }, GET_DATA);
with the method GET_DATA:
void GET_DATA(object state) {
    var x = (int)state;
    using (var conn = new OracleConnection(cs[x])) {
        conn.Open();
        using (var cmd = conn.CreateCommand()) {
            cmd.CommandText = "select * from dual";
            var dt = new DataTable();
            dt.Load(cmd.ExecuteReader());
        }
    }
}
where cs is an array of connection strings for 25 Oracle databases.
I use an Intel(R) Core(TM) i5-6200U CPU @ 2.30GHz with 16GB RAM, and OracleConnection from the Oracle.ManagedDataAccess library on NuGet.
On the first run, when no connection pool has been created yet:
Thread approach: all threads finish 3 to 5 seconds after the start.
Parallel approach: finishes 12 to 15 seconds after the start.
On subsequent runs the results of the two approaches are similar, because the connection pools have already been created.
I would have guessed that Parallel running on a single core should be nearly 4 times slower than threads; can anyone explain this?
Thank you
In your first example you spawn 24 threads, and each of them starts by waiting on an I/O operation (establishing the socket connection to the database) to complete.
In the second example you are using an unknown number of threads (it depends on the number of cores and on the degree-of-parallelism settings) to process a list of 24 items. Each thread processes a subset of the items serially.
Since the operation is not CPU-bound in your process, but rather depends on external resources (I/O to the database, operations on the database, etc.), Parallel.ForEach wastes a lot of time waiting for one item to finish before starting the next.
Example
X marks the completion of an operation; time goes vertically, threads go horizontally.
When using 24 threads:
1 2 3 ... 24 Time
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
X X X X V
When using 4 threads to process 24 items:
1 2 3 4 Time
| | | | |
| | | | |
| | | | |
X X X X |
| | | | |
| | | | |
| | | | |
X X X X |
| | | | |
| | | | |
| | | | |
X X X X |
. . . . |
. . . . |
. . . . V
Please also pass a ParallelOptions to Parallel.ForEach, to make the two samples use the same number of threads:
Parallel.ForEach(new int [] {0, 1, 2, 3, 4, ..., 24 },
    new ParallelOptions { MaxDegreeOfParallelism = 24 },
    GET_DATA);
As for the plain foreach: it does not wait for all the spawned threads to finish before exiting the loop.
foreach (var i in new int [] {0, 1, 2, 3, 4, ..., 24 })
    new Thread(GET_DATA).Start(i);
I believe you will get similar results for the two samples if you wait for all the threads to finish.
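A minimal sketch of doing that, reusing GET_DATA from the question: keep a reference to each thread and Join them all, so the measured time covers the complete work in both samples.

using System.Collections.Generic;
using System.Threading;

var threads = new List<Thread>();
for (int i = 0; i < 25; i++) // one thread per connection string
{
    var t = new Thread(GET_DATA);
    t.Start(i); // i is boxed into the object state parameter
    threads.Add(t);
}
foreach (var t in threads)
    t.Join(); // block until every thread has finished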

Using C# & LINQ, how do I speed up a GroupBy then Sum?

Thanks for looking.
I have a data structure that is something like this:
Flight # | Airport | Tickets Sold
----------------------------------
123      | SEA     | 4
432      | SFO     | 2
9875     | SEA     | 1
33       | CLT     | 3
In reality my table has about 20k records, and I need to determine which airport has the highest total of tickets sold.
I am currently using the following code, which works but is EXTREMELY slow:
var mostTickets = flights.GroupBy(g => g.OriginAirport.IATA_Code)
    .Select(s => new { s.Key, Sum = s.Sum(su => su.Tickets), Airport = s.Select(se => se.OriginAirport) })
    .OrderByDescending(o => o.Sum)
    .First();
I had determined that the GroupBy call was quite fast and that everything downstream of it, starting with the .Select, took about 15 seconds for roughly 20,000 records.
EDIT: It is the GroupBy() that is slow after all; LINQ queries are deferred, so the grouping work only happens once the query is enumerated further down the chain.
Is there some way to speed this up?
Have you thought about setting up a view for that?
http://www.w3schools.com/sql/sql_view.asp
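If the aggregation has to stay in memory rather than move into the database, another option worth measuring is a single pass with a plain dictionary instead of GroupBy. This is only a sketch; it assumes the flight shape implied by the question (an OriginAirport.IATA_Code string and a Tickets count):

using System.Collections.Generic;
using System.Linq;

// One pass over the ~20k records: accumulate ticket totals per airport code.
var totals = new Dictionary<string, int>();
foreach (var f in flights)
{
    var code = f.OriginAirport.IATA_Code;
    totals.TryGetValue(code, out var sum); // sum stays 0 for a new code
    totals[code] = sum + f.Tickets;
}

// Then pick the airport with the highest total.
var best = totals.OrderByDescending(kv => kv.Value).First();
// best.Key is the IATA code, best.Value is the total tickets sold.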

Best data structure for caching objects with a composite unique id

I have a slow function that makes an expensive trip to the server to retrieve RecordHdr objects. These objects are sorted by rid first and then by aid. They are then returned in batches of 5.
| rid | aid |
-------------->
|  1  |  1  | >
|  1  |  3  | >
|  1  |  5  | >  BATCH of 5 returned
|  1  |  6  | >
|  2  |  2  | >
-------------->
|  2  |  3  |
|  2  |  4  |
|  3  |  1  |
|  3  |  2  |
|  3  |  5  |
|  3  |  6  |
|  4  |  1  |
|  4  |  2  |
|  4  |  5  |
|  4  |  6  |
After I retrieve the objects, I have to wrap them in another class called WrappedRecordHdr. I'm wondering what the best data structure is for maintaining a cache of WrappedRecordHdr objects, such that if I'm asked for an object by rid and aid, I return that particular object, and if I'm asked for just a rid, I return all objects that have that rid.
So far I have created two structures, one for each scenario (this may not be the best way, but it's what I'm using for now):
// key: (rid, aid)
private CacheMap<int, int, WrappedRecordHdr> m_ridAidCache =
new CacheMap<int, int, WrappedRecordHdr>();
// key: (rid)
private CacheMap<int, WrappedRecordHdr[]> m_ridCache =
new CacheMap<int, WrappedRecordHdr[]>();
Also, I'm wondering if there is a way to rewrite this to be more efficient. Right now I get a number of records that I need to wrap in another object, then group them in a dictionary by rid so that if I am asked for a certain rid I can return all objects with that rid. The records are already sorted, so I'm hoping the GroupBy doesn't attempt to sort them beforehand.
RecordHdr[] records = server.GetRecordHdrs(sessId, BATCH_SIZE); // expensive call to server.

// After all RecordHdr objects are retrieved, loop through them.
// For each RecordHdr object a WrappedRecordHdr object has to be created.
WrappedRecordHdr[] wrappedRecords = new WrappedRecordHdr[records.Length];
for (int i = 0; i < wrappedRecords.Length; i++)
{
    if (records[i] == null || records[i].aid == 0 || records[i].rid == 0) continue; // skip invalid results.
    wrappedRecords[i] = new WrappedRecordHdr(AccessorManager, records[i], projectId);
}

// Group all records into a dictionary of rid => array of WrappedRecordHdrs,
// so that all objects associated with a particular rid are returned together.
// (Skipped slots are null, so filter them out before grouping.)
Dictionary<int, WrappedRecordHdr[]> dict = wrappedRecords
    .Where(obj => obj != null)
    .GroupBy(obj => obj.rid)
    .ToDictionary(gdc => gdc.Key, gdc => gdc.ToArray());
m_ridCache = dict;
As to the data structure, I think there are really two different questions here:
What structure to use;
Whether there should be one or two caches.
It seems to me that you want one cache, typed as a MemoryCache. The key would be the RID (as a string, since MemoryCache keys are strings), and the value would be a Dictionary where the key is an AID and the value is the header (see the sketch after the lists below).
This has the following advantages:
The WrappedRecordHdrs are stored only once;
The MemoryCache already has all of the caching logic implemented, so you don't need to rewrite that;
When provided with only an RID, you know the AID of each WrappedRecordHdr (which you don't get with the array in the initial post);
These things are always compromises, so this has disadvantages too of course:
Cache access (get or set) requires constructing a string each time;
RID + AID lookups require indexing twice. (The alternative, writing a fast hashing function that turns an RID and an AID into a single cache key, would require either two caches (one RID-only, one RID + AID) or storing the same WrappedRecordHdr twice, once under RID + AID and once under RID alone.)
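A minimal sketch of that single-cache layout, assuming System.Runtime.Caching's MemoryCache and the rid/aid property names from the question (StoreBatch and the lookup helpers are hypothetical names, and the cache policy is left at its defaults):

using System.Collections.Generic;
using System.Linq;
using System.Runtime.Caching;

// One cache: RID (stringified) -> Dictionary of AID -> WrappedRecordHdr.
var cache = MemoryCache.Default;

void StoreBatch(IEnumerable<WrappedRecordHdr> wrapped)
{
    foreach (var group in wrapped.GroupBy(w => w.rid))
    {
        cache.Set(group.Key.ToString(),
                  group.ToDictionary(w => w.aid),
                  new CacheItemPolicy()); // expiration left at defaults
    }
}

// Lookup by rid alone: every header for that rid, keyed by aid.
Dictionary<int, WrappedRecordHdr> GetByRid(int rid) =>
    (Dictionary<int, WrappedRecordHdr>)cache.Get(rid.ToString());

// Lookup by (rid, aid): index into the inner dictionary.
WrappedRecordHdr GetByRidAndAid(int rid, int aid)
{
    var byRid = GetByRid(rid);
    if (byRid == null) return null;
    byRid.TryGetValue(aid, out var hdr);
    return hdr;
}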

Packing items vertically (with fixed length and horizontal position)

Let's say I have some items that have a defined length and horizontal position (both constant):
1 : A
2 : B
3 : CC
4 : DDD (item 4 starts at position 1, length = 3)
5 : EE
6 : F
I'd like to pack them vertically, resulting in a rectangle with the smallest possible height.
Until now I have used a very simple algorithm that loops over the items and checks, row by row, whether placing the item in that row is possible (that is, without colliding with anything else). Sometimes it works perfectly (by chance), but sometimes it produces a non-optimal solution.
Here is what it gives for the above example (step by step):
Step 1: "A"
Step 2: "A   B"
Step 3: "ACC B"
Step 4: "ACC B" / " DDD"
Step 5: "ACC B" / " DDD" / "   EE"
Step 6: "ACC B" / "FDDD" / "   EE"
While the optimal solution would be:
ADDDB
FCCEE
Note: I have found that sorting the items by length (descending) before applying the algorithm gives better results (but it is still not perfect).
Is there any algorithm that would give me the optimal solution in reasonable time? (Trying all possibilities is not feasible.)
EDIT: here is an example that does not work with the sorting trick and does not work with what TylerOhlsen suggested (unless I don't understand his answer):
1 : AA
2 : BBB
3 : CCC
4 : DD
This would give:
AA BBB
CCC
DD
Optimal solution:
DDBBB
AACCC
Just spitballing (off the top of my head, and just pseudocode). This algorithm loops through the positions of the current row, attempts to find the best item to place at each position, and moves on to the next row when the current row is complete. The algorithm completes when all items have been used.
The key to the performance of this algorithm is an efficient method that finds the longest remaining item starting at a specific position. That can be done by building a dictionary (or hash table) keyed by position, whose value is the list of items starting at that position, sorted by length descending. Finding the longest item at a position is then as simple as looking up the list for that position and popping the top item off it (see the sketch after the pseudocode below).
int cursorRow = 0;
int cursorPosition = 0;
int maxRowLength = 5;
List<Item> items = // fill with item list
Item[][] result = // allocate enough rows, each maxRowLength wide
while (items.Count > 0)
{
    Item item = FindLongestItemAtPosition(cursorPosition);
    if (item != null)
    {
        result[cursorRow][cursorPosition] = item;
        items.Remove(item);
        cursorPosition += item.Length;
    }
    else // no items remain with this position
    {
        cursorPosition++;
    }
    if (cursorPosition == maxRowLength)
    {
        cursorPosition = 0;
        cursorRow++;
    }
}
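Here is a sketch of that lookup structure; the Item type with Position and Length properties is an assumption, since the question only implies it:

using System.Collections.Generic;
using System.Linq;

// Build once, up front: position -> unplaced items starting there,
// sorted longest-first.
Dictionary<int, List<Item>> byPosition = items
    .GroupBy(it => it.Position)
    .ToDictionary(g => g.Key,
                  g => g.OrderByDescending(it => it.Length).ToList());

Item FindLongestItemAtPosition(int position)
{
    if (!byPosition.TryGetValue(position, out var candidates)
        || candidates.Count == 0)
        return null;             // no unplaced item starts at this position
    var longest = candidates[0]; // the lists are kept sorted by length
    candidates.RemoveAt(0);      // pop it so it can only be placed once
    return longest;
}

With this helper, the items.Remove(item) call in the pseudocode becomes redundant: popping from the per-position list already marks the item as placed.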
This should result in the following steps for Example 1 (showing the state at the beginning of each loop iteration):

Step 1: Row=0, Pos=0   (empty)
Step 2: Row=0, Pos=1   "A"
Step 3: Row=0, Pos=4   "ADDD"
Step 4: Row=1, Pos=0   "ADDDB"
Step 5: Row=1, Pos=1   "ADDDB" / "F"
Step 6: Row=1, Pos=3   "ADDDB" / "FCC"
Step 7: Row=2, Pos=0   "ADDDB" / "FCCEE"
And in the following steps for Example 2 (again at the beginning of each loop iteration):

Step 1: Row=0, Pos=0   (empty)
Step 2: Row=0, Pos=2   "AA"
Step 3: Row=1, Pos=0   "AACCC"
Step 4: Row=1, Pos=2   "AACCC" / "DD"
Step 5: Row=2, Pos=0   "AACCC" / "DDBBB"
This is a classic Knapsack Problem. As @amit said, it is NP-complete. The most efficient solutions make use of dynamic programming.
The Wikipedia page is a very good start. I've never implemented an algorithm to solve this problem, but I've studied its relation to the Minesweeper game, which is also NP-complete.
Wikipedia: Knapsack Problem
