How to query for the oldest object from db4o? - c#

I have objects that have a DateTime property. How can I query for the oldest object?
After asking on the db4o forum, I got this answer:
It's quite easy: create a sorted SODA query and take the first/last object from the resulting ObjectSet. Don't iterate the ObjectSet (that way the objects won't be activated); just take the required object directly via ObjectSet.Get(index).
Please note: db4o supports only a limited set of performant sort orders (alphabetical, numbers, object IDs) during query execution, so you may have to store your DateTime as milliseconds to achieve good performance.
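For reference, a minimal sketch of that sorted SODA query, assuming the db4o .NET API; the field name "_dateSignedUp" is hypothetical and must match your persisted field:
IQuery query = container.Query();
query.Constrain(typeof(Customer));
query.Descend("_dateSignedUp").OrderAscending(); // oldest first
IObjectSet result = query.Execute();
if (result.Count > 0)
{
    // Take the first object directly instead of iterating the whole set.
    Customer oldest = (Customer)result[0];
}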

First of all, your object needs to keep track of the time itself, so it depends on your requirements:
class Customer
{
    public DateTime DateSignedUp { get; private set; }
    // ...
}
Now you can query for the object in whatever way you like, using LINQ, SODA, or native queries, e.g.:
IObjectContainer container = ...;
Customer oldestCustomer = container.Query<Customer>().OrderBy(p => p.DateSignedUp).First();
However, there are a few pitfalls:
Don't use DateTime in your persisted objects. I have had massive problems with them. I can't reproduce the problem, so I haven't been able to report it yet, but I personally can't recommend using them. Use a long instead and copy the ticks from the respective DateTime. Store all times in UTC, unless you're explicitly referring to local time, such as in the case of bus schedules.
Put an index on the time field (see the sketch after this list).
The ordering operation can be very, very slow for large numbers of objects because of issue COR-1133. If you have a large number of objects and you know the approximate age of the one you want, try to impose a where constraint, because that will be fast. See also my blog post regarding that performance issue, which can become very annoying already at ~50-100k objects.
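To illustrate the long-ticks pitfall above, here is a hedged sketch; it assumes the db4o embedded configuration API, and the field name "_signedUpTicks" is made up for the example:
class Customer
{
    // Persist the time as raw UTC ticks instead of a DateTime.
    private long _signedUpTicks;

    public DateTime DateSignedUp
    {
        get { return new DateTime(_signedUpTicks, DateTimeKind.Utc); }
        set { _signedUpTicks = value.ToUniversalTime().Ticks; }
    }
}

// Index the tick field before opening the container.
IEmbeddedConfiguration config = Db4oEmbedded.NewConfiguration();
config.Common.ObjectClass(typeof(Customer)).ObjectField("_signedUpTicks").Indexed(true);
IObjectContainer container = Db4oEmbedded.OpenFile(config, "data.db4o");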
Best,
Chris

Related

Using .Where() on a List

Assume the following two possible blocks of code inside a view, with a model passed to it using something like return View(db.Products.Find(id)):
List<UserReview> reviews = Model.UserReviews.OrderByDescending(ur => ur.Date).ToList();
if (myUserReview != null)
    reviews = reviews.Where(ur => ur.Id != myUserReview.Id).ToList();

IEnumerable<UserReview> reviews = Model.UserReviews.OrderByDescending(ur => ur.Date);
if (myUserReview != null)
    reviews = reviews.Where(ur => ur.Id != myUserReview.Id);
What are the performance implications between the two? By this point, is all of the product-related data in memory, including its navigation properties? Does using ToList() in this instance matter at all? If not, is there a better approach to using LINQ queries on a List without having to call ToList() every time, or is this the typical way one would do it?
Read http://blogs.msdn.com/b/charlie/archive/2007/12/09/deferred-execution.aspx
Deferred execution is one of the many marvels intrinsic to LINQ. The short version is that your data is never touched (it remains idle in the source, be that in memory, in a database, or wherever). When you construct a LINQ query, all you are doing is creating an IEnumerable that is 'capable' of enumerating your data. The work doesn't begin until you actually start enumerating; then each piece of data comes all the way from the source, through the pipeline, and is serviced by your code one item at a time. If you break your loop early, you have saved some work - later items are never processed. That's the simple version.
Some LINQ operations cannot work this way. OrderBy is the best example. OrderBy has to know every piece of data, because it is possible that the last piece retrieved from the source very well could be the first piece that you are supposed to get. So when an operation such as OrderBy is in the pipe, it will actually cache your dataset internally. So all data has been pulled from the source and has gone through the pipeline, up to the OrderBy, and then the OrderBy becomes your new temporary data source for any operations that come afterwards in the expression. Even so, OrderBy tries as much as possible to follow the deferred execution paradigm by waiting until the last possible moment to build its cache. Including OrderBy in your query still doesn't do any work immediately; the work begins once you start enumerating.
To answer your question directly: your call to ToList is doing exactly that. OrderByDescending caches the data from your data source => ToList additionally persists it into a variable that you can actually touch (reviews) => Where starts pulling records one at a time from reviews, and if one matches, then your final ToList call stores the results into yet another list in memory.
Beyond the memory implications, ToList additionally thwarts deferred execution because it STOPS the processing of your view at the point of the call to entirely process the whole LINQ expression and build its in-memory representation of the results.
Now none of this is a real big deal if the number of records we're talking about is in the dozens. You'll never notice the difference at runtime because it happens so quickly. But when dealing with large-scale datasets, deferring as much as possible for as long as possible, in hopes that something will let you cancel a full enumeration... in addition to the memory savings... gold.
In your version without ToList: OrderByDescending will still cache a copy of your dataset as processed through the pipeline up to that point, internally, sorted of course. That's OK - you gotta do what you gotta do. But that doesn't happen until you actually try to retrieve your first record later in your view. Once that cache is complete, you get your first record, and each subsequent record is pulled from that cache and checked against the where clause; you get it or you don't based on that where, and you have saved a couple of in-memory copies and a lot of work.
Magically, I bet even your lead-in of db.Products.Find(id) doesn't even start spinning until your view starts enumerating (if you're not using ToList). If db.Products is a LINQ to SQL data source, then every other element you've specified will reduce into SQL verbiage, and your entire expression will be deferred.
Hope this helps! Read further on deferred execution, and if you want to know 'how' it works, look into C# iterators (yield return) - there's a sketch below. There's a page somewhere on MSDN that I'm having trouble finding that lists the common LINQ operations and whether they defer execution or not. I'll update if I can track it down.
/*edit*/ To clarify: all of the above is in the context of raw LINQ, or LINQ to Objects. Until we find that page, common sense will tell you how it works. If you close your eyes and imagine implementing OrderBy, or any other LINQ operation, and you can't think of a way to implement it with yield return, or without caching, then execution is not likely deferred, and a cached copy and/or a full enumeration is likely... OrderBy, Distinct, Count, Sum, etc. LINQ to SQL is a whole other can of worms. Even in that context, ToList will still stop and process the whole expression and store the results, because a list is a list, and is in memory. But LINQ to SQL is uniquely capable of deferring many of those aforementioned clauses, and then some, by incorporating them into the generated SQL that is sent to the SQL server. So even OrderBy can be deferred this way, because the clause is pushed down into your original data source and then ignored in the pipe.
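To make the iterator point concrete, here is a minimal sketch (my addition, not from the original post) of a Where-like operator built on yield return - nothing runs until the caller enumerates, and breaking out early skips all remaining work:
using System;
using System.Collections.Generic;

static class DeferredDemo
{
    public static IEnumerable<T> MyWhere<T>(this IEnumerable<T> source, Func<T, bool> predicate)
    {
        // Runs lazily: each yield return hands one item to the caller,
        // then pauses here until the next item is requested.
        foreach (T item in source)
        {
            if (predicate(item))
                yield return item;
        }
    }
}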
Good luck ;)
Not enough context to know for sure.
But ToList() guarantees that the data has been copied into memory, and your first example does that twice.
The second example could involve queryable data or some other on-demand scenario. Even if the original data was all already in memory and even if you only added a call to ToList() at the end, that would be one less copy in-memory than the first example.
And it's entirely possible that in the second example, by the time you get to the end of that little snippet, no actual processing of the original data has been done at all. In that case, the database might not even be queried until some code actually enumerates the final reviews value.
As for whether there's a "better" way to do it, that's not possible to say without more context; you haven't defined "better". Personally, I tend to prefer the second example... why materialize data before you need it? But in some cases, you do want to force materialization. It just depends on the scenario.

List queries 20 times faster than IQueryable?

Here is a test that I set up this evening. It was made to prove something different, but the outcome was not quite what I expected.
I'm running a test with 10,000 random queries on an IQueryable, and while testing I found out that if I do the same on a List, my test is 20 times faster.
See below. My CarBrandManager.GetList originally returns an IQueryable, but now I first call ToList(), and then it's way faster.
Can anyone tell me why I see this big difference?
var sw = new Stopwatch();
sw.Start();
int queries = 10000;
//IQueryable<Model.CarBrand> carBrands = CarBrandManager.GetList(context);
List<Model.CarBrand> carBrands = CarBrandManager.GetList(context).ToList();
Random random = new Random();
int randomChar = 65;
for (int i = 0; i < queries; i++)
{
    randomChar = random.Next(65, 90);
    Model.CarBrand carBrand = carBrands.Where(x => x.Name.StartsWith(((char)randomChar).ToString())).FirstOrDefault();
}
sw.Stop();
lblStopWatch.Text = String.Format("Queries: {0} Elapsed ticks: {1}", queries, sw.ElapsedTicks);
There are potentially two issues at play here. First: It's not obvious what type of collection is returned from GetList(context), apart from the knowledge that it implements IQueryable. That means when you evaluate the result, it could very well be creating an SQL query, sending that query to a database, and materializing the result into objects. Or it could be parsing an XML file. Or downloading an RSS feed or invoking an OData endpoint on the internet. These would obviously take more time than simply filtering a short list in memory. (After all, how many car brands can there really be?)
But let's suppose that the implementation it returns is actually a List, and therefore the only difference you're testing is whether it's cast as an IEnumerable or as an IQueryable. Compare the method signatures on the Enumerable class's extension methods with those on Queryable. When you treat the list as an IQueryable, you are passing in Expressions, which need to be evaluated, rather than just Funcs which can be run directly.
When you're using a custom LINQ provider like Entity Framework, this gives the framework the ability to evaluate the actual expression trees and produce a SQL query and materialization plan from them. However, LINQ to Objects just wants to evaluate the lambda expressions in-memory, so it has to either use reflection or compile the expressions into Funcs, both of which have a large performance hit associated with them.
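An illustrative sketch (my addition, not from the original answer): the same lambda becomes a ready-to-run Func for Enumerable, but a data-only expression tree for Queryable, which LINQ to Objects would first have to compile:
using System;
using System.Linq.Expressions;

class FuncVsExpression
{
    static void Main()
    {
        // What Enumerable.Where receives: a compiled, directly invokable delegate.
        Func<string, bool> func = name => name.StartsWith("A");

        // What Queryable.Where receives: an expression tree describing the
        // lambda - data, not executable code.
        Expression<Func<string, bool>> expr = name => name.StartsWith("A");

        bool direct = func("Audi");              // direct call, cheap
        bool viaTree = expr.Compile()("Audi");   // must be compiled first, costly
    }
}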
You may be tempted to just call .ToList() or .AsEnumerable() on the result set to force it to use Funcs, but from an information hiding perspective this would be a mistake. You would be assuming that you know that the data returned from the GetList(context) method is some kind of in-memory object. That may be the case at the moment, or it may not. Regardless, it's not part of the contract that is defined for the GetList(context) method, and therefore you cannot assume it will always be that way. You have to assume that the type you get back could very well be something that you can query. And even though there are probably only a dozen car brands to search through at the moment, it's possible that some day there will be thousands (I'm talking in terms of programming practice here, not necessarily saying this is the case with the car industry). So you shouldn't assume that it will always be faster to download the entire list of cars and filter them in memory, even if that happens to be the case right now.
If CarBrandManager.GetList(context) might return an object backed by a custom LINQ provider (like an Entity Framework collection), then you probably want to leave the data cast as an IQueryable: even though your benchmark shows the list being 20 times faster, that difference is so small that no user will ever notice it. You may one day see performance gains of several orders of magnitude by calling .Where().Take().Skip() and only loading the data you really need from the data store, whereas you'd end up loading the whole table into your system's memory if you call .ToList() right off the bat.
However, if you know that CarBrandManager.GetList(context) will always return an in-memory list (as the name implies), it should be changed to return an IEnumerable<Model.CarBrand> instead of an IQueryable<Model.CarBrand>. Or, if you're on .NET 4.5, perhaps an IReadOnlyList<Model.CarBrand> or IReadOnlyCollection<Model.CarBrand>, depending on what contract you're willing to have CarBrandManager abide by.

Is a Cache of type T possible?

Can we avoid casting T to object when placing it in a cache?
WeakReference necessitates the use of object. System.Runtime.Caching.MemoryCache is locked to type object.
Custom dictionaries/collections cause issues with the garbage collector, or you have to run a garbage collector of your own (a separate thread)?
Is it possible to have the best of both worlds?
I know I accepted an answer already, but using WeakReference is now possible! It looks like they snuck it into .NET 4:
http://msdn.microsoft.com/en-us/library/gg712911(v=VS.96).aspx
Here's an old feature request for the same:
http://connect.microsoft.com/VisualStudio/feedback/details/98270/make-a-generic-form-of-weakreference-weakreference-t-where-t-class
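For completeness, a quick sketch of the generic WeakReference<T> referred to above (in the full framework it appears in .NET 4.5; the byte[] payload is just an illustration):
var weak = new WeakReference<byte[]>(new byte[1024]);

byte[] buffer;
if (weak.TryGetTarget(out buffer))
{
    // Still alive: 'buffer' is now a strong reference to the array.
}
else
{
    // The array was garbage collected; recreate or reload it.
}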
There's nothing to stop you writing a generic wrapper around MemoryCache - probably with a constraint to require reference types:
using System.Runtime.Caching;

public class Cache<T> where T : class
{
    // MemoryCache has no parameterless constructor; it needs a name.
    private readonly MemoryCache cache = new MemoryCache(typeof(T).FullName);

    public T this[string key]
    {
        get { return (T) cache[key]; }
        set { cache[key] = value; }
    }

    // etc
}
Obviously it's only worth delegating the parts of MemoryCache you're really interested in.
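Usage of such a wrapper might then look like this (a sketch; Customer is a stand-in type):
var customers = new Cache<Customer>();
customers["alice"] = new Customer();
Customer c = customers["alice"]; // no cast needed at the call site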
So you basically want to dependency-inject a cache provider that only returns certain types?
Isn't that kind of against everything OOP?
The idea of the object type is that anything and everything is an object, so by using a cache that caches instances of type object you are saying you can cache anything.
By building a cache that only caches objects of some predetermined type, you are limiting the functionality of your cache, however...
There is nothing stopping you implementing a custom cache provider that has a generic constraint so it only allows you to cache certain object types, and in theory this would save you about 2 "ticks" (not even a millisecond) per retrieval.
The way to look at this is to ask: what's more important to me?
Good OOP based on best practice, or
about 20 milliseconds over the lifetime of my cache provider?
The other thing is that .NET is already geared to optimise the boxing and unboxing process to the extreme, and at the end of the day, when you "cache" something you are simply putting it somewhere it can be quickly retrieved, storing a pointer to its location for that later retrieval.
I've seen solutions that involve streaming 4GB XML files through a business process using objects that are destroyed and recreated on every call. The point is that the process flow was important, not so much the initialisation and prep work, if that makes sense.
How important is this casting time loss to you?
I would be interested to know more about the scenario that requires such speed.
As a side note:
Another thing I've noticed about newer technologies like LINQ and Entity Framework is that the result of a query is the important thing to cache when the query takes a long time, not so much the side effects on the result.
What this means is that (for example):
If I were to cache a basic "default instance" of an object that takes a complex set of entity queries to create, I wouldn't cache the resulting object but the queries.
With Microsoft already doing the groundwork, I'd ask: what am I caching, and why?

List<> items lost during ToArray()?

A while ago, I coded a system for collecting public transport disruptions. Information about any incident is collected in an MSSQL database. Consumers access these data by calling an .asmx web service. Data are collected from the DB using ADO.NET; each data row then populates a Deviation object and is added to a List. In the service layer, ToArray() is called on the list and the result is returned to the consumer.
So far, so good. But the problem is that in some cases (5% or so), we have become aware that the array is somehow curtailed. Instead of the usual number of 15-20 items, only half of them, or even fewer, are returned. The missing items are always at the end of the original list. And, even more rarely, a couple of items are repeated/shuffled at the beginning of the array.
After doing some testing on the different layers, it seems the curtailing occurs at the end of the process, i.e. during the conversion to an array or the SOAP serialization. But the code seems so innocent, huh?
[WebMethod]
public Deviation[] GetDeviationsByTimeInterval(DateTime from, DateTime to)
{
    return DeviRoutines.GetDeviationsByTimeInterval(from, to).ToArray();
}
I'm not 100% sure the error doesn't occur in the SQL or data access layer, but they proved to do their job during testing. Any help on the subject would be of great help! :)
I'd do something like:
public Deviation[] GetDeviationsByTimeInterval(DateTime from, DateTime to)
{
    var v1 = DeviRoutines.GetDeviationsByTimeInterval(from, to);
    LogMe("v1: " + v1.Count);
    var v2 = v1.ToArray();
    LogMe("v2: " + v2.Length);
    return v2;
}
Proving what you expect usually pays off :-)
From http://msdn.microsoft.com/en-us/library/x303t819.aspx:
Return type: T[], an array containing copies of the elements of the List.
You didn't find a bug in .NET; it's most likely something in your GetDeviationsByTimeInterval.
I'd be willing to bet ToArray is doing exactly what it's told, but either your from or to values are occasionally junk (validation error?) or GetDeviationsByTimeInterval is misinterpreting them for some reason.
Stick some logging into both the web method and GetDeviationsByTimeInterval to see what values get passed in, and the next time it goes pear-shaped you'll be able to diagnose where the problem is.

code performance question

Let's say I have a relatively large list of MyObjectModel objects called MyBigList. One of the properties of MyObjectModel is an int called ObjectID. In theory, I think MyBigList could reach 15-20MB in size. I also have a table in my database that stores some scalars about this list so that it can be recomposed later.
What is going to be more efficient?
Option A:
List<MyObjectModel> MyBigList = null;
MyBigList = GetBigList(some parameters);
int RowID = PutScalarsInDB(MyBigList);
Option B:
List<MyObjectModel> MyBigList = null;
MyBigList = GetBigList(some parameters);
int TheCount = MyBigList.Count();
StringBuilder ListOfObjectID = new StringBuilder(); // must be initialized before Append
foreach (MyObjectModel ThisObject in MyBigList)
{
    ListOfObjectID.Append(ThisObject.ObjectID.ToString());
}
int RowID = PutScalarsInDB(TheCount, ListOfObjectID);
In option A I pass MyBigList to a function that extracts the scalars from the list, stores these in the DB and returns the row where these entries were made. In option B, I keep MyBigList in the page method where I do the extraction of the scalars and I just pass these to the PutScalarsInDB function.
Which is the better option - or is there yet another that's better? I'm concerned about passing around objects of this size and about memory usage.
I don't think you'll see a material difference between these two approaches. From your description, it sounds like you'll be burning the same CPU cycles either way. The things that matter are:
Get the list
Iterate through the list to get the IDs
Iterate through the list to update the database
The order in which these three activities occur, and whether they occur within a single method or a subroutine, doesn't matter. All other activities (declaring variables, assigning results, etc.,) are of zero to negligible performance impact.
Other things being equal, your first option may be slightly more performant because you'll only be iterating once, I assume, both extracting the IDs and updating the database in a single pass. But the cost of iteration will likely be very small compared with the cost of updating the database, so it's not a performance difference you're likely to notice. Note also that passing MyBigList to a method only passes a reference to the list, not a copy, so neither option duplicates the 15-20MB of data.
Having said all that, there are many, many more factors that may impact performance, such as the type of list you're iterating through, the speed of your connection to the database, etc., that could dwarf these considerations. It doesn't look like too much code either way. I'd strongly suggest building both and testing them.
Then let us know your results!
If you want to know which method performs better, you can use the Stopwatch class to measure the time each one needs (a sketch follows below). See here for Stopwatch usage: http://www.dotnetperls.com/stopwatch
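A minimal sketch of such a measurement, reusing names from the question (GetBigList and PutScalarsInDB come from the original post):
var sw = System.Diagnostics.Stopwatch.StartNew();
List<MyObjectModel> MyBigList = GetBigList(/* some parameters */);
int RowID = PutScalarsInDB(MyBigList); // Option A
sw.Stop();
Console.WriteLine("Option A took {0} ms", sw.ElapsedMilliseconds);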
I think there are other issues you need to verify for an ASP.NET application:
From where do you read your list? If you read it from the database, would it be more efficient to do the work within a stored procedure in the database?
Where is it stored? Is it only read and destroyed, or is it stored in session or application state?
