Choosing the right sorted collection - C#

I am a bit in doubt about which collection to use for our data.
The domain is this (example):
For each supermarket we add a new item to a collection with a timestamp and total amount each time any customer pays at the register.
We currently do this:
We have a Dictionary collection with key = UniqueSupermarketID and value is a List<{timestamp, amount}>
Each time a customer pays we simply add a new item to the collection for the specific supermarket.
We need to extract data from this dictionary in a way that:
For a specified supermarket, get the newest cash register object with timestamp equaling "some timestamp"
We currently do this as:
supermarketDictionary["supermarket_01"]
    .OrderByDescending(i => i.TimeStamp)
    .FirstOrDefault(i => i.TimeStamp == someTimestamp);
Obviously this quickly starts performing like crap, so I am trying to figure out which collection to store the data in instead.
I am considering keeping a normal dictionary for the "supermarket id <-> cash register list" relation, and using a SortedDictionary for the value, with the timestamps as keys and the amounts as values.
Is this the correct approach? I would of course need the timestamp key to implement IComparable correctly for the sorting to work.
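Something like this is what I have in mind (just a sketch; I am assuming the timestamp is a DateTime, which already implements IComparable, and the amount is a decimal):
// Sketch only - the variable names are made up for illustration.
// Supermarket id -> (timestamp -> amount), kept sorted by timestamp.
var supermarkets = new Dictionary<string, SortedDictionary<DateTime, decimal>>();

// Each time a customer pays at a register:
DateTime timestamp = DateTime.UtcNow;   // example payment data
decimal amount = 123.45m;

SortedDictionary<DateTime, decimal> register;
if (!supermarkets.TryGetValue("supermarket_01", out register))
{
    register = new SortedDictionary<DateTime, decimal>();
    supermarkets["supermarket_01"] = register;
}
register[timestamp] = amount;           // O(log n) insert

// Lookup by exact timestamp is then O(log n) instead of a full sort:
decimal paidAmount;
if (supermarkets["supermarket_01"].TryGetValue(timestamp, out paidAmount))
{
    // found the entry for that timestamp
}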
Update 2014-01-03:
There are currently about 7 million rows in the list in question. The usages of the lists in our system have been identified as these:
_states
    .OrderBy(x => x.TimeStamp)
    .FirstOrDefault(x => x.WtgId == wtgId && x.IsAvailable && x.TimeStamp >= timeStamp);

_states
    .Where(x => x.WtgId == wtgId && x.IsAvailable && x.TimeStamp >= timeStamp && x.TimeStamp <= endDateTime)
    .OrderBy(x => x.TimeStamp)
    .ToList();

_states.Remove(state);

if (!_states.Contains(message))
    _states.Add(message);
Thanks,
/Jesper
Copenhagen, Denmark

EDIT: based on the update
All right, seeing what you really need sure helps in making the right decision. If your data already arrives in order, there is no need for a sorted collection, and your four usages can be reduced to one: searching for one item that matches some criteria.
Adding with an existence check: adding is a cheap operation in non-sorted collections, and the existence check is just a search for one item.
Removing by item: at most one pass through the collection plus the remove operation itself, which is also quite cheap (though not in an array, if done many times).
Try using PLINQ and carefully measure how it performs against plain LINQ. With so many entries, the difference should be noticeable.
_states.AsParallel().FirstOrDefault(...);
It will simply create a few threads in the background, each of them searching part of the collection, and at the end the results are merged. The .NET Framework should choose the best number of threads for you, but if you feel like experimenting, append .WithDegreeOfParallelism(x), where x is the number of threads it will use.
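For example, the first usage from your update could look like this (just a sketch; _states, wtgId and timeStamp are your names, and filtering before sorting gives the same element as your original query):
// PLINQ version of the first lookup: among matching items,
// take the one with the smallest TimeStamp.
var state = _states
    .AsParallel()
    //.WithDegreeOfParallelism(4)  // optional: override the default thread count
    .Where(x => x.WtgId == wtgId && x.IsAvailable && x.TimeStamp >= timeStamp)
    .OrderBy(x => x.TimeStamp)
    .FirstOrDefault();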


How do I obtain all the most recent elements from an IEnumerable?

I'm developing a Windows Forms application project for my university and we are using Entity Framework to store things.
It's an e-commerce type of program, and now I'm struggling to find the right way to filter an IEnumerable down to only the most recent entries.
What I want is to obtain all the elements from this table called prices, in which we also store older prices as a history backup.
This table has the ID of the article it refers to, the ID of the corresponding price list, a public and a cost price, and the updated date, which is the moment the row was created/updated.
I have tried many expressions but ultimately failed miserably; sometimes it brought me only the prices within a certain price list, sometimes none at all, or just one.
Again, I need it to work for a function that lets you update your prices based on parameters - for example, all articles and all price lists. For that, I want only the prices that are up to date, so I won't touch the price history.
Thank you very much!
Update: What I tried didn't work; in fact, I couldn't even find sense in the code I wrote, which is why I didn't post it in the first place. I guess this problem fried my brain and I can't think properly anymore.
I tried some answers that I found here. For example:
// This is an IEnumerable of the price DTO class, which has the same properties as the table.
// It contains all the prices, without a filter.
var prices = _priceService.Get();

// Attempt 1
var uptodatePrices = prices.GroupBy(x => x.ArticleId)
    .Select(x => x.OrderByDescending(s => s.Date).FirstOrDefault());

// Attempt 2
uptodatePrices = prices.Select(x => new PriceDto
{
    Date = prices.Where(z => z.Id == x.Id).Max(g => g.Date)
});
OK, it sounds like you want to return the latest price for a combination of price list and article.
You're on the right path with your first attempt, but not quite there. The second attempt looks like pure frustration. :)
I believe the solution you're looking for is to group the products, then take the latest price from each group. To do that, use the values that identify your group as the GroupBy expression, then sort each group's results to take the one you want.
var uptodatePrices = prices.GroupBy(x => new { x.ArticloId, x.ListPrecioId })
    .Select(x => x.OrderByDescending(p => p.Date).First())
    .ToList();
When you do a GroupBy, the value(s) you specify in the GroupBy expression become the "Key" of the result. The result also contains an IEnumerable of the items from the original set (prices) that fall into that group.
This example selects the Price entity; you can change the Select to populate a DTO/ViewModel from the price instead.
In your case you were grouping by just the ArticloId, so you'd get back the latest entry for each article, but not for each combination of article and price list. In the example above I group by both article and price list, then tell it to select, from each group's set, the latest Price record. I use First rather than FirstOrDefault because I am grouping on combinations, so I know there will be at least one entry per combination (or else there would be no combination). Avoid using ...OrDefault unless you're sure a missing result is possible and you are handling it.
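For instance, a sketch of that DTO projection (PriceViewModel and its property names are hypothetical here; ArticloId, ListPrecioId and Date come from the query above):
var latestPrices = prices
    .GroupBy(x => new { x.ArticloId, x.ListPrecioId })
    .Select(g =>
    {
        // take the newest price in this article/price-list group
        var latest = g.OrderByDescending(p => p.Date).First();
        return new PriceViewModel      // hypothetical view model
        {
            ArticleId = latest.ArticloId,
            PriceListId = latest.ListPrecioId,
            Date = latest.Date
        };
    })
    .ToList();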
What you are working with are LINQ queries. If you only need to sort by most recent date, you can do that like this:
prices.OrderByDescending(price => price.FechaActualizacion).ToList();
Make sure your Price model has the FechaActualizacion property.

Find Distinct Count in MongoDB using C# (mongocsharpdriver)

I have a MongoDB collection, and I need to find the distinct count of records after running a filter.
This is what I have right now,
var filter = Builders<Foo>.Filter.Eq("bar", "1");
db.GetCollection<Foo>("FooTable").Distinct<dynamic>("zoo", filter).ToList().Count;
I don't like this solution, as it pulls all the distinct values into memory just to get the count.
Is there a better way to get distinct count directly from db?
The following code will get the job done using the aggregation framework.
var x = db.GetCollection<Foo>("FooTable")
    .Aggregate()
    .Match(foo => foo.bar == 1)
    .Group(foo => foo.zoo,
           grouping => new { DoesNotMatter = grouping.Key })
    .Count()
    .First()
    .Count;
The funky "DoesNotMatter" bit seems to be required (it could have a different name), but the driver won't accept null or anything else... Mind you, it gets optimized away anyway and is not sent to MongoDB.
Also, this code executes entirely on the server. It won't use indexes, though, so it will probably be slower than what you have at this point.
Your current code could be shortened to:
db.GetCollection<Foo>("FooTable").Distinct(d => d.zoo, d => d.bar == 1).ToList().Count;
This will use indexes if available but, yes, it will transfer the list of all distinct values back to the client...

Filter a List via Another

I have a requirement to filter a list of Clients based on whether they have had any jobs booked in the last x months. In my code I have two lists: one is my Clients, and the other is a filtered list of Jobs between today and x months ago. The idea is to filter Clients whose id does not appear in the jobs list. I tried the following:
filteredClients.Where(n => jobsToSearch.Count(j => j.Client == n.ClientID) == 0).ToList();
But I seem to get ALL clients regardless. I could easily do a foreach, but this severely slows down the process. How can I filter the client list based on the job list efficiently?
The main thing you're doing wrong is that you don't assign the result back to anything; that's why your original seemed to keep all clients. But we can still improve on the original:
filteredClients = filteredClients.Where(n => !jobsToSearch.Any(j => j.Client == n.ClientId)).ToList();
The difference between this and your .Count() solution is that .Any() can stop looking at the jobs list with each client as soon as it encounters the first match, so it should run a bit faster. But we're not done yet. We can do even better by narrowing the jobs list down to only distinct clients:
var badClients = jobsToSearch.Select(j => j.Client).Distinct().ToList();
filteredClients = filteredClients.Where(n => !badClients.Any(j => j == n.ClientId)).ToList();
And likely even better still by using a HashSet, which gives O(1) lookups like a Dictionary. Assuming the client ID is an int:
var badClients = new HashSet<int>(jobsToSearch.Select(j => j.Client));
filteredClients = filteredClients.Where(n => !badClients.Contains(n.ClientId)).ToList();
Whether this last option performs better depends on the number of clients that have jobs... if the list is short, the .Distinct() might still do better.
Finally, I don't normally recommend calling .ToList() like this. As much as possible, put off materializing a List, Array, or other collection type until the last possible moment, and keep things as an IEnumerable for as long as possible.
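As an aside, if you're on .NET 6 or later, ExceptBy expresses the same idea in one call (a sketch, assuming int client IDs as above):
// .NET 6+: builds the set of job client ids internally, O(1) checks per client.
// Note: like Except, this also de-duplicates the clients by key.
var clientsWithoutJobs = filteredClients
    .ExceptBy(jobsToSearch.Select(j => j.Client), n => n.ClientId);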
Did you think about using GroupBy (or rather ToLookup, since you want to test keys)?
Without checking the syntax, writing this from memory (I don't have VS available at the moment):
var groupedJobs = jobsearch.ToLookup(job => job.Client);
var itemsWithoutJobs = filteredList.Where(item => !groupedJobs.Contains(item.ClientID));
I can check the syntax tomorrow morning.
The biggest pro of this is that you build a dictionary-like structure, which is much, much faster to search than iterating through lists.
To filter the clients who are in IdList:
List1.Where(x => IdList.Contains(x.ClientId));
To filter the clients who are not in IdList:
List1.Where(x => !IdList.Contains(x.ClientId));

Very poor performance with a simple query in Entity Framework

So I have a very simple structure:
I have Orders that have a unique OrderNumber
Orders have many OrderRows
OrderRows have many RowExtras that have 2 fields, position (the sequence number of the RowExtra within the OrderRow) and Info, which is a string. More often than not, an OrderRow does not have more than one RowExtra.
(Don't mind the silly structure for now, it's just how it is).
So now I get a list of objects that have three properties:
OrderNumber
Position
Info
What I want to do is simply 1) check if the RowExtra with the given OrderNumber/Position -pair exists in the database and if so, 2) update the Info-property.
I have tried a few different ways to accomplish this with very poor results at best. The solutions loop through the list of objects and issue a query such as
myContext.RowExtras.Where(x => x.Position == currentPosition &&
                               x.OrderRow.Order.OrderNumber == currentOrderNumber)
or going from the other side
myContext.Orders.Where(x => x.OrderNumber == currentOrderNumber)
    .SelectMany(x => x.OrderRows)
    .SelectMany(x => x.RowExtras)
    .Where(x => x.Position == currentPosition)
and then check whether the count equals 1; if so, update the property, otherwise proceed to the next item.
I currently have roughly 4000 RowExtras in the database and need to update about half of them. These methods make the procedure take several minutes to complete, which is really not acceptable. What I don't understand is why it takes so long, because the SQL clause that returns the required RowExtra would be quite easy to write manually (just 2 joins and 2 conditions in the WHERE part).
The best performance I managed to achieve was with a compiled query looking like this:
Func<MyContext, int, string, IQueryable<RowExtra>> query =
    CompiledQuery.Compile(
        (MyContext ctx, int position, string orderNumber) =>
            from extra in ctx.RowExtras
            where extra.Position == position &&
                  extra.OrderRow.Order.OrderNumber == orderNumber
            select extra);
and then invoking said query for each object in my list. But even this approach took way over a minute. So how do I actually get this thing to run within a reasonable timeframe?
Also, I'm sorry for the overly long explanation, but hopefully someone can help me!
Try to minimise the number of database calls. As a rule of thumb, each one will take roughly 10ms at least - even one that just returns a scalar.
So, in general, fetch all the data you will need in one go, modify it in code and then save it.
List<Order> orders = myContext.Orders
    .Include("OrderRows.RowExtras")
    .Where( /* ... select all the orders you want, not just one at a time ... */ )
    .ToList();

foreach (Order order in orders)
{
    // ... execute your logic on the in-memory model
}

myContext.SaveChanges();
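Applied to your question, a sketch could look like this (updates stands for your incoming list of OrderNumber/Position/Info objects; the entity names come from your post, everything else is illustrative):
// One round-trip: fetch every RowExtra belonging to the affected orders,
// carrying the order number along so no lazy loading is triggered later.
var orderNumbers = updates.Select(u => u.OrderNumber).Distinct().ToList();

var extras = myContext.RowExtras
    .Where(x => orderNumbers.Contains(x.OrderRow.Order.OrderNumber))
    .Select(x => new { x.OrderRow.Order.OrderNumber, Extra = x })
    .ToList();

// Index in memory by (OrderNumber, Position); a lookup tolerates duplicate keys.
var byKey = extras.ToLookup(
    x => new { x.OrderNumber, x.Extra.Position },
    x => x.Extra);

foreach (var u in updates)
{
    var matches = byKey[new { u.OrderNumber, u.Position }].ToList();
    if (matches.Count == 1)        // same "exactly one match" rule as before
        matches[0].Info = u.Info;
}

myContext.SaveChanges();           // one save for all modified rows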

List<T> FirstOrDefault() bad performance - is dictionary possible in this case?

I have a set of 'codes' Z that are valid in a certain time period.
Since I need them many times in a large loop (a million+ iterations), and every time I have to look up the corresponding code, I cache them in a List<>. After finding the correct codes, I insert a million rows (using SqlBulkCopy).
I look up the id with the following code (l_z is a List<T>):
var z_fk = (from z in l_z
where z.CODE == lookupCode &&
z.VALIDFROM <= lookupDate &&
z.VALIDUNTIL >= lookupDate
select z.id).SingleOrDefault();
In other situations I have used a Dictionary with superb performance, but in those cases I only had to look up the id based on the code.
But now, searching on a combination of fields, I am stuck.
Any ideas? Thanks in advance.
Create a Dictionary that stores a List of items per lookup code - Dictionary<string, List<Code>> (assuming that lookup code is a string and the objects are of type Code).
Then when you need to query based on lookupDate, you can run your query directly off of dict[lookupCode]:
var z_fk = (from z in dict[lookupCode]
where z.VALIDFROM <= lookupDate &&
z.VALIDUNTIL >= lookupDate
select z.id).SingleOrDefault();
Then just make sure that whenever you have a new Code object, it gets added to the List<Code> in the dict corresponding to its lookupCode (and if one doesn't exist, create it).
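A sketch of that build-and-add pattern, using the names from the question:
// Build the dictionary once from the cached list.
var dict = new Dictionary<string, List<Code>>();
foreach (var code in l_z)
{
    List<Code> bucket;
    if (!dict.TryGetValue(code.CODE, out bucket))
    {
        bucket = new List<Code>();   // create the list on first sight of a code
        dict[code.CODE] = bucket;
    }
    bucket.Add(code);
}

// The repeated lookup now only scans the (much smaller) bucket:
var z_fk = (from z in dict[lookupCode]
            where z.VALIDFROM <= lookupDate && z.VALIDUNTIL >= lookupDate
            select z.id).SingleOrDefault();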
A simple improvement would be to use...
// in initialization somewhere
ILookup<string, T> l_z_lookup = l_z.ToLookup(z => z.CODE);

// your repeated code:
var z_fk = (from z in l_z_lookup[lookupCode]
            where z.VALIDFROM <= lookupDate && z.VALIDUNTIL >= lookupDate
            select z.id).SingleOrDefault();
You could go further with a more complex, smarter data structure that stores the dates in sorted order and uses a binary search to find the id, but this may be sufficient. Further, you mention SqlBulkCopy - if you're dealing with a database anyway, perhaps you can execute the query on the database and simply create an appropriate index covering the CODE, VALIDUNTIL and VALIDFROM columns.
I generally prefer using a Lookup over a Dictionary containing Lists since it's trivial to construct and has a cleaner API (e.g. when a key is not present).
We don't have enough information to give very prescriptive advice - but there are some general things you should be thinking about.
What types are the time values? Are you comparing DateTimes or some primitive value (like a time_t)? Think about how your data types affect performance, and choose the best ones.
Should you really be doing this in memory, or should you be putting all these rows into SQL and querying them there? Databases are really good at that.
But let's stick with what you asked about - in memory searching.
When searching takes too long, there is only one solution: search fewer things. You do this by partitioning your data in a way that lets you rule out as many nodes as possible with as few operations as possible.
In your case you have two criteria - a code and a date range. Here are some ideas...
You could partition based on code - i.e. Dictionary<string, List<T>>. If you have many evenly distributed codes, your lists will each be about N/M in size (where N = total event count and M = number of codes), so a million nodes with ten codes now requires searching 100k items rather than a million. You could take that a bit further: each List could itself be sorted by starting time, allowing a binary search to rule out many more nodes very quickly (this of course has a trade-off in the time spent building the collection). This should provide very quick lookups; see the sketch after this list.
You could partition based on date: store all the data in a single list sorted by start date, use a binary search to find the start date, then march forward to find the code. Is there a benefit to this approach over the dictionary? That depends on the rest of your program - maybe being an IList is important; I don't know. You need to figure that out.
You could flip the dictionary model and partition the data by start time rounded to some boundary (depending on the length, granularity and frequency of your events). This is basically bucketing the data into groups with similar start times - e.g., all the events started between 12:00 and 12:01 might be in one bucket, and so on. If you have a very small number of codes and a lot of highly frequent (but not pathologically so) events, this might give you very good lookup performance.
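For the first idea, here is a sketch of the sorted-bucket variant (names follow the earlier question; it assumes the validity periods for any one code do not overlap):
// Build once: code -> entries sorted by VALIDFROM.
var byCode = l_z
    .GroupBy(z => z.CODE)
    .ToDictionary(g => g.Key, g => g.OrderBy(z => z.VALIDFROM).ToList());

// Hypothetical helper: binary-search the bucket for the last entry with
// VALIDFROM <= lookupDate, then check it is still valid on that date.
static int FindId(Dictionary<string, List<Code>> byCode, string lookupCode, DateTime lookupDate)
{
    var bucket = byCode[lookupCode];
    int lo = 0, hi = bucket.Count - 1, candidate = -1;
    while (lo <= hi)
    {
        int mid = lo + (hi - lo) / 2;
        if (bucket[mid].VALIDFROM <= lookupDate) { candidate = mid; lo = mid + 1; }
        else hi = mid - 1;
    }
    return (candidate >= 0 && bucket[candidate].VALIDUNTIL >= lookupDate)
        ? bucket[candidate].id
        : 0; // nothing valid on that date (mirrors SingleOrDefault)
}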
The point? Think about your data. Consider how expensive it should be to add new data and how expensive it should be to query it. Think about how your data types affect those characteristics. Make an informed decision based on that. When in doubt, let SQL do it for you.
This sounds to me like a situation where it could all happen on the database in a single statement. Then you can use indexing to keep the query fast and avoid having to push data over the wire to and from your database.
