Load Full Collection in RavenDB

Load Full Collection in RavenDB - c#

At the startup of an application that uses RavenDB, I need to load the full collection of documents of a specific type and loop through them. The number of documents should always be small (< 1000).
I can do this via:
session.Query<MyType>();
but I want to ensure that the results I am getting are immediately consistent.
It seems that this case falls between a Load() and a Query(), since (I think) the query is intended to be eventually consistent, and the load is intended to be immediately consistent?
But in this case, there should be no index involved (no filtering or sorting), so will using Query() be immediately consistent?

Query() is always immediately inconsistent. session.SaveChanges() stores only to the document store. The indexes are always updated asynchronously later, although for the most part very very fast!
This is generally a poor modeling design and a code smell with a document database. Since you are mentioning that is at application startup and a relatively small amount, it sounds like reference information that changes infrequently. Could you enclose all of this in a single document that contains a List<MyType> instead?
Failing that, you can try the LoadStartingWith() command, which is still fully ACID, and give it the prefix that fits all the necessary documents, although up to 1000 is still a lot and I don't know if that method will return ALL results or cut it off at some point.
If you must use Query(), you will have to use one of the variations of .WaitForNonStaleResults() (the other variants take into consideration what you are waiting for rather than requiring all indexing to be complete) to get consistent results, but that should be fairly safe if the documents change infrequently. Although I really do hate using this method in nearly all its forms, preferring to use any of the above more preferable methods.

Related

Caching large objects for HA with C# asp.net / Net 4.7

I'm trying to cache a large object (around 25MB) that needs to be available for the user for 15 minutes.
In the beginning, I was using MemoryCache (single server) but now that we are going the HA route, we need it to be available to all the servers.
We tried to replace it with Redis, but it takes around 2 minutes (on localhost), between serializing and unserializing the object and the roundtrip (newtonsoft.json serialization).
So, the question is: How do you share large objects that have a short lifespan between servers in a HA?
Thanks for reading :)

I've had good luck switching from JSON to Protobuf ser/de, using the Protobuf-net package. But, it sounds like even if that cut it down to the oft-repeated 6x faster execution time, a 20 second deserialization time probably still won't cut it in this case - since the whole goal is to cache it for a particular user for a "short" period of time.
This sounds like a classic case of eager vs. lazy loading. Since you're already using Redis, have you considered separately caching each property of the object as a separate key? The more numerous the properties, and therefore the smaller each individual one is, the more beneficial this strategy will be. Of course, I'm assuming a fairly orthogonal set of properties on the object - if many of them have dependencies on each other, then this will likely perform worse. But, if the access patterns tend to not require the entire hydrated object, you may improve responsiveness a lot by fetching the demanded individual property instead of the entire object.
I'm assuming a lot about your object - but the simplest step would be implement each property's get accessor to perform the Redis Get call. This has a lot of other downsides regarding dependency management and multi-threaded access, but might be a simple way to achieve a proof of concept.
Keep in mind that this dramatically complicates the cache invalidation requirements. Even if you can store each property individually in Redis, if you then store that value in variable on each machine after fetching, you quickly run into an unmanaged cache situation where you cannot guarantee synchronized data depending on which machine serves the next request.

Processing large resultset with NHibernate

I have following task to do: calculate interest for all active accounts. In the past I was doing things like this using Ado.Net and stored procedures.
This time I've tried to do that with NHibernate, because it seemed that complex algorithms would be done easier with pure POCO.
So I want to do following (pseudocode):
foreach account in accounts
calculate interest
save account with new interest
I'm aware that NHibernate was not designed for processinge large data volumes. For me it is sufficient to have possibility to organize such a loop without having all accounts in memory at once.
To minimize memory usage I would use IStatelessSession for external loop instead of plain ISession.
I've tried approach proposed by Ayende. There are two problems:
CreateQuery is using "magic strings";
more important: it doesn't work as described.
My program works but after switching on Odbc trace I saw in debugger that all fetches were done before lambda expression in .List was executed for the first time.
I've found myself another solution: session.Query returning .AsEnumerable() which I've used in foreach. Again two problems:
I would prefer IQueryOver over IQueryable
still doesn't work as described (all fetches before first interest calculation).
I don't know why but IQueryOver doesn't have AsEnumerable. It also doesn't have List method with argument (like CreateQuery). I've tried .Future but again:
documentation of Future doesn't describe streaming feature
still doesn't work as I need (all fetches before first interest calculation).
In summary: is there any equivalent in NHibernate to dataReader.Read() from Ado.Net?
My best alternative to pure NHibernate approach would be main loop using dataReader.Read() and then Load account with id from Ado.Net loop. However performance will suffer - reading each account via key is slower than sequence of fetches done in outer loop.
I'm using NHibernate version 4.0.0.4000.

While it is true that NH was not designed with large-valume processing in mind you can always circumvent this restriction with application-layer batch processing. I have found that depending on the size of the object graph of the relevant entity, performance will suffer after a certain amount of objects have been loaded into memory (in one small project I could load 100.000 objects and performance would remain acceptable, in an other with only 1500 objects any additional Load() would crawl).
In the past I have used paging to handle batch processing, when IStatelessSession result sets are too poor (as they don't load proxies etc).
So you make a count query in the beginning, make up some arbitrary batch size and then start doing your work on the batch. This way you can neatly avoid the n+1 select problem, assuming that for each batch you explicitly fetch-join everything needed.
The caveat is that for this to work efficiently you will need to evict the processed entities of each batch from the ISession when you are done. And this means that you will have to commit-transaction on each batch. If you can live with multiple flush+commits then this could work for you.
Else you will have to go by the IStatelessSession although there are no lazy queries there. "from Books" means "select * from dbo.Books" or something equivalent and all results are fetched into memory.

Storing computed values in database when reads are far higher than writes

I find myself faced with a conundrum of which the answer probably falls outside of my expertise. I'm hoping someone can help.
I have an optimised and efficient query for fetching table (and linked) data, the actual contents of which are unimportant. However upon each read that data then needs to be processed to present the data in JSON format. As we're talking typical examples where a few hundred rows could have a few hundred-thousand associated rows this takes time. With multi-threading and a powerful CPU (i7 3960X) this processing is around 400ms - 800ms at 100% CPU. It's not a lot I know but why process it each time in the first place?
In this particular example, although everything I've ever read points to not doing so (as I understood it) I'm considering storing the computed JSON in a VARCHAR(MAX) column for fast reading.
Why? Well the data is read 100 times or more for every single write (change), it seems to me that given those numbers it would be far better to stored the JSON for optimised retrieval and re-compute and update it on the odd occasion the associations are changed - adding perhaps 10 to 20 ms to the time taken to write changes, but improving the reads by some large factor.
Your opinions on this would be much appreciated.

Yes, storing redundant information for performance reasons is pretty common. The first step is to measure the overhead - and it sounds like you've done that already (although I would also ask: what json serializer are you using? have you tried others?)
But fundamentally, yes that's ok, when the situation warrants it. To give an example: stackoverflow has a similar scenario - the markdown you type is relatively expensive to process into html. We could do that on every read, but we have insanely more reads than writes, so we cook the markdown at write, and store the html as well as the source markdown - then it is just a simple "data in, data out" exercise for most of the "show" code.
It would be unusual for this to be a common problem with json, though, since json serialization is a bit simpler and lots of meta-programming optimization is performed by most serializers. Hence my suggestion to try a different serializer before going this route.
Note also that the rendered json may need more network bandwidth that the original source data in TDS - so your data transfer between the db server and the application server may increase; another thing to consider.

LINQ performance FAQ

I am trying to get to grips with LINQ. The thing that bothers me most is that even as I understand the syntax better, I don't want to unwittingly sacrifice performance for expressiveness.
Are they any good centralized repositories of information or books for 'Effective LINQ' ? Failing that, what is your own personal favourite high-performance LINQ technique ?
I am primarily concerned with LINQ to Objects, but all suggestions on LINQ to SQL and LINQ to XML also welcome of course. Thanks.

Linq, as a built-in technology, has performance advantages and disadvantages. The code behind the extension methods has had considerable performance attention paid to it by the .NET team, and its ability to provide lazy evaluation means that the cost of performing most manipulations on a set of objects is spread across the larger algorithm requiring the manipulated set. However, there are some things you need to know that can make or break your code's performance.
First and foremost, Linq doesn't magically save your program the time or memory needed to perform an operation; it just may delay those operations until absolutely needed. OrderBy() performs a QuickSort, which will take nlogn time just the same as if you'd written your own QuickSorter or used List.Sort() at the right time. So, always be mindful of what you're asking Linq to do to a series when writing queries; if a manipulation is not necessary, look to restructure the query or method chain to avoid it.
By the same token, certain operations (sorting, grouping, aggregates) require knowledge of the entire set they are acting upon. The very last element in a series could be the first one the operation must return from its iterator. On top of that, because Linq operations should not alter their source enumerable, but many of the algorithms they use will (i.e. in-place sorts), these operations end up not only evaluating, but copying the entire enumerable into a concrete, finite structure, performing the operation, and yielding through it. So, when you use OrderBy() in a statement, and you ask for an element from the end result, EVERYTHING that the IEnumerable given to it can produce is evaluated, stored in memory as an array, sorted, then returned one element at a time. The moral is, any operation that needs a finite set instead of an enumerable should be placed as late in the query as possible, allowing for other operations like Where() and Select() to reduce the cardinality and memory footprint of the source set.
Lastly, Linq methods drastically increase the call stack size and memory footprint of your system. Each operation that must know of the entire set keeps the entire source set in memory until the last element has been iterated, and the evaluation of each element will involve a call stack at least twice as deep as the number of methods in your chain or clauses in your inline statement (a call to each iterator's MoveNext() or yielding GetEnumerator, plus at least one call to each lambda along the way). This is simply going to result in a larger, slower algorithm than an intelligently-engineered inline algorithm that performs the same manipulations. Linq's main advantage is code simplicity. Creating, then sorting, a dictionary of lists of groups values is not very easy-to-understand code (trust me). Micro-optimizations can obfuscate it further. If performance is your primary concern, then don't use Linq; it will add approximately 10% time overhead and several times the memory overhead of manipulating a list in-place yourself. However, maintainability is usually the primary concern of developers, and Linq DEFINITELY helps there.
On the performance kick: If performance of your algorithm is the sacred, uncompromisable first priority, you'd be programming in an unmanaged language like C++; .NET is going to be much slower just by virtue of it being a managed runtime environment, with JIT native compilation, managed memory and extra system threads. I would adopt a philosophy of it being "good enough"; Linq may introduce slowdowns by its nature, but if you can't tell the difference, and your client can't tell the difference, then for all practical purposes there is no difference. "Premature optimization is the root of all evil"; Make it work, THEN look for opportunities to make it more performant, until you and your client agree it's good enough. It could always be "better", but unless you want to be hand-packing machine code, you'll find a point short of that at which you can declare victory and move on.

Simply understanding what LINQ is doing internally should yield enough information to know whether you are taking a performance hit.
Here is a simple example where LINQ helps performance. Consider this typical old-school approach:
List<Foo> foos = GetSomeFoos();
List<Foo> filteredFoos = new List<Foo>();
foreach(Foo foo in foos)
{
if(foo.SomeProperty == "somevalue")
{
filteredFoos.Add(foo);
}
}
myRepeater.DataSource = filteredFoos;
myRepeater.DataBind();
So the above code will iterate twice and allocate a second container to hold the filtered values. What a waste! Compare with:
var foos = GetSomeFoos();
var filteredFoos = foos.Where(foo => foo.SomeProperty == "somevalue");
myRepeater.DataSource = filteredFoos;
myRepeater.DataBind();
This only iterates once (when the repeater is bound); it only ever uses the original container; filteredFoos is just an intermediate enumerator. And if, for some reason, you decide not to bind the repeater later on, nothing is wasted. You don't even iterate or evaluate once.
When you get into very complex sequence manipulations, you can potentially gain a lot by leveraging LINQ's inherent use of chaining and lazy evaluation. Again, as with anything, it's just a matter of understanding what it is actually doing.

There are various factors which will affect performance.
Often, developing a solution using LINQ will offer pretty reasonable performance because the system can build an expression tree to represent the query without actually running the query while it builds this. Only when you iterate over the results does it use this expression tree to generate and run a query.
In terms of absolute efficiency, running against predefined stored procedures you may see some performance hit, but generally the approach to take is to develop a solution using a system that offers reasonable performance (such as LINQ), and not worry about a few percent loss of performance. If a query is then running slowly, then perhaps you look at optimisation.
The reality is that the majority of queries will not have the slightest problem with being done via LINQ. The other fact is that if your query is running slowly, it's probably more likely to be issues with indexing, structure, etc, than with the query itself, so even when looking to optimise things you'll often not touch the LINQ, just the database structure it's working against.
For handling XML, if you've got a document being loaded and parsed into memory (like anything based on the DOM model, or an XmlDocument or whatever), then you'll get more memory usage than systems that do someting like raising events to indicate finding a start or end tag, but not building a complete in-memory version of the document (like SAX or XmlReader). The downside is that the event-based processing is generally rather more complex. Again, with most documents there won't be a problem - most systems have several GB of RAM, so taking up a few MB representing a single XML document is not a problem (and you often process a large set of XML documents at least somewhat sequentially). It's only if you have a huge XML file that would take up 100's of MB that you worry about the particular choice.
Bear in mind that LINQ allows you to iterate over in-memory lists and so on as well, so in some situations (like where you're going to use a set of results again and again in a function), you may use .ToList or .ToArray to return the results. Sometimes this can be useful, although generally you want to try to use the database's querying rather in-memory.
As for personal favourites - NHibernate LINQ - it's an object-relational mapping tool that allows you to define classes, define mapping details, and then get it to generate the database from your classes rather than the other way round, and the LINQ support is pretty good (certainly better than the likes of SubSonic).

In linq to SQL you don't need to care that much about performance. you can chain all your statements in the way you think it is the most readable. Linq just translates all your statements into 1 SQL statement in the end, which only gets called/executed in the end (like when you call a .ToList()
a var can contain this statement without executing it if you want to apply various extra statements in different conditions. The executing in the end only happens when you want to translate your statements into a result like an object or a list of objects.

There's a codeplex project called i4o which I used a while back which can help improve the performance of Linq to Objects in cases where you're doing equality comparisons, e.g.
from p in People
where p.Age == 21
select p;
http://i4o.codeplex.com/
I haven't tested it with .Net 4 so can't safely say it will still work but worth checking out.
To get it to work its magic you mostly just have to decorate your class with some attributes to specify which property should be indexed. When I used it before it only works with equality comparisons though.

How do I get two thread to insert specific timestamps into a table?

I created two (or more) threads to insert data in a table in database. When inserting, there is a field CreatedDateTime, that of course, stores the datetime of the record creation.
For one case, I want the threads to stay synchronized, so that their CreatedDateTime field will have exactly the same value. When testing with multi threading, usually I've got different milliseconds...
I want to test different scenarios in my system, such as:
1) conflicts inserting record exactly at the same time.
2) problems with ordering/selection of records.
3) Problems with database connection pooling.
4) Problems with multiple users (hundred) accessing at same time.
There may be other test cases I haven't listed here.

Yes, that's what happens. Even if by some freak of nature, your threads were to start at exactly the same time, they would soon get out of step simply because of resource contention between them (at a bare minimum, access to the DB table or DBMS server process).
If they stay mostly in step (i.e., never more than a few milliseconds out), just choose a different "resolution" for your CreatedDateTime field. Put it in to the nearest 10th of a second (or second) rather than millisecond. Or use fixed values in some other way.
Otherwise, just realize that this is perfectly normal behavior.
And, as pointed out by BC in a comment, you may misunderstand the use of the word "synchronized". It's used (in Java, I hope C# is similar) to ensure two threads don't access the same resource at the same time. In actuality, it almost guarantees that threads won't stay synchronized as you understand the term to mean (personally I think your definition is right in terms of English usage (things happening at the same time) but certain computer languages have suborned the definition for their own purposes).
If you're testing what happens when specific timestamps go into the database, you cannot rely on threads "behaving themselves" by being scheduled in a specific order and at specific times. You really need to dummy up the data somehow, otherwise it's like trying to nail jelly to a tree (or training a cat).
One solution is to not use things such as getCurrentTime() or now() but use a specific set of inserts which have known timestamps. Depending on your actual architecture, this may be difficult (for example, if you just call an API which itself gets the current timestamp to millisecond resolution).
If you control the actual SQL that's populating the timestamp column, you need to change that to use pre-calculated values rather the now() or its equivalents.

If you want to have the same timestamps on multiple rows being inserted; you should create a SQL thread which will do a multirow insert in one query which will allow you to get the same timestamps. Other than this, I agree with everyone else, you cannot get an exact timestamp at a huge resolution with multithreads unless you were to insert the timestamp as it is seen in the application and share that timestamp to be inserted. This of course, throws the caveat issues of threads out the window. It's like saying, I'm going to share this data, but I don't want to use mutexes because they stop the other thread from processing once it hits a lock().

Develop Reference

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.