I'm currently using embedded SQLite to store relatively big lists of data (starting from 100,000 rows per table). Queries include only:
paging
sorting by a field
The amount of data per row is relatively small. Performance is really bad, especially for the first query, which is critical for my application. I have already tried all kinds of tuning and pre-caching and have reached the practical limit.
Is there an alternative embedded data store library that can do these simple queries in a very fast and efficient way? There's no requirement for it to support SQL at all.
If it is (predominantly) read-only, consider using memory-mapped views of a file.
It will be possible to achieve maximum performance by rolling your own indexes.
Obviously, rolling your own will also be the most work-intensive and error-prone option.
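For illustration, here's a minimal sketch of the memory-mapped approach, assuming fixed-size records already written to a file in sorted order (the Record layout and "data.bin" are hypothetical); paging then reduces to offset arithmetic:

    using System;
    using System.IO.MemoryMappedFiles;
    using System.Runtime.InteropServices;

    // Hypothetical fixed-size record; the real layout depends on your data.
    [StructLayout(LayoutKind.Sequential, Pack = 1)]
    struct Record
    {
        public int Id;
        public long SortKey;
    }

    class MappedStore
    {
        static void Main()
        {
            int recordSize = Marshal.SizeOf(typeof(Record));
            int page = 5, pageSize = 50;

            // Map the file once; the OS pages data in on demand, so even the
            // first query touches only the pages it actually reads.
            using (var mmf = MemoryMappedFile.CreateFromFile("data.bin"))
            using (var view = mmf.CreateViewAccessor())
            {
                for (int i = 0; i < pageSize; i++)
                {
                    long offset = (long)(page * pageSize + i) * recordSize;
                    Record rec;
                    view.Read(offset, out rec);
                    Console.WriteLine("{0} {1}", rec.Id, rec.SortKey);
                }
            }
        }
    }

Sorting by a different field then amounts to keeping a separate sorted offset index per field, which is the roll-your-own-indexes part.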
May I suggest a traditional RDBMS with good indexes, or perhaps a newfangled NoSQL-style DB that supports your workload?
You can try Lucene.Net. It is blazing fast, does not require any installation, supports paging and sorting by fields, and much more.
http://incubator.apache.org/lucene.net/
With the SimpleLucene wrapper it is also quite easy to use: http://blogs.planetcloud.co.uk/mygreatdiscovery/post/SimpleLucene-e28093-Lucenenet-made-easy.aspx
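As a rough sketch of the paging and sorting part against the Lucene.Net 3.x API (the "name" field and "index" directory are made up, and details vary between versions):

    using Lucene.Net.Documents;
    using Lucene.Net.Index;
    using Lucene.Net.Search;
    using Lucene.Net.Store;

    // Open an existing index read-only and sort all documents by a field.
    var dir = FSDirectory.Open(new System.IO.DirectoryInfo("index"));
    var searcher = new IndexSearcher(IndexReader.Open(dir, true));

    // The sort field must be indexed un-tokenized to be sortable.
    var sort = new Sort(new SortField("name", SortField.STRING));
    int page = 2, pageSize = 50;

    // Fetch enough hits to cover the requested page, then skip to its start.
    TopDocs hits = searcher.Search(new MatchAllDocsQuery(), null,
                                   (page + 1) * pageSize, sort);
    for (int i = page * pageSize; i < hits.ScoreDocs.Length; i++)
    {
        Document doc = searcher.Doc(hits.ScoreDocs[i].Doc);
        // doc.Get("name"), doc.Get("id"), ...
    }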
Related
I'm building a solution where I'm retrieving large amounts of data from the database (5k to 10k records). Our existing data access layer uses Fluent NHibernate, but I'm "scared" that I will incur a large amount of overhead by hydrating object models that represent the database entities.
Can I simply retrieve an ADO dataset?
Yes, you should be concerned about the performance of this. You can look at using the IStatelessSession functionality of NHibernate; however, that probably won't give you the performance you are looking for. While I haven't used NH since 2.1.2GA, I find it unlikely that they've substantially improved its performance when it comes to bulk operations. To put it bluntly, NH just sucks at bulk operations (as do most ORMs in general, for that matter).
Q: Can I simply retrieve an ADO dataset?
Of course you can. Just because you're using NHibernate doesn't mean you can't new up an ADO.NET connection and hit the database in the raw.
As much as I loathe DataTables and DataSets, this is one of the rare cases where you might want to consider using them instead of adding the overhead of mapping / creating the objects associated with your 10K rows of data.
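If you do go that route, filling a DataTable is a couple of lines of plain ADO.NET; a sketch (the connection string and query are placeholders):

    using System.Data;
    using System.Data.SqlClient;

    var table = new DataTable();
    using (var conn = new SqlConnection("Data Source=.;Initial Catalog=MyDb;Integrated Security=true"))
    using (var adapter = new SqlDataAdapter("SELECT Id, Name FROM Person", conn))
    {
        // Fill opens and closes the connection itself.
        adapter.Fill(table);
    }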
Depending on how much performance you need, there are a few options. Nothing will ever beat using a SqlDataReader, as that's what's underneath just about every .NET ORM implementation. In addition to being the fastest, it can take a lot less memory if you don't need to keep a list of all the records after the query.
Now as for your performance worries, 5k-10k records isn't that high. I've pulled upwards of a million rows out of NHibernate before, but obviously the records weren't huge, and it was for a single case. If you're doing this on a high-traffic website, then of course you will have to be more efficient if you're hitting bottlenecks. If you're thinking about DataSets, I'd suggest trying Massive instead, because it should still be more efficient than a DataSet/DataTable and more convenient.
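For comparison, the "firehose" shape looks roughly like this (table and connection string are, again, made up):

    using System.Data.SqlClient;

    using (var conn = new SqlConnection("Data Source=.;Initial Catalog=MyDb;Integrated Security=true"))
    using (var cmd = new SqlCommand("SELECT Id, Name FROM Person", conn))
    {
        conn.Open();
        using (var reader = cmd.ExecuteReader())
        {
            while (reader.Read())
            {
                // Stream one row at a time; nothing is buffered or hydrated.
                int id = reader.GetInt32(0);
                string name = reader.GetString(1);
            }
        }
    }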
You can use "scalar queries": in fact, a native SQL query that returns a list of object[] (one object[] per row):
    // Executes the SQL and returns an IList where each element is an
    // object[] holding the three scalar values for one row.
    var cats = sess.CreateSQLQuery("SELECT * FROM CATS")
        .AddScalar("ID", NHibernateUtil.Int64)
        .AddScalar("NAME", NHibernateUtil.String)
        .AddScalar("BIRTHDATE", NHibernateUtil.Date)
        .List();
Example from NHibernate documentation: http://nhibernate.info/doc/nh/en/index.html#d0e10794
We have a reporting tool that grabs a large number of records, at times up to 1 million. We have been storing these in a DataTable. I wanted to know if there is a better object to store them in. I would need to be able to aggregate the data in various ways.
Update:
Yes, I personally believe we should not be getting that many records; this is not the direction I want to go.
Also, I am using Oracle.
Update 2:
Sorry for the delay, but there are always fires to put out here. The main issue was that they were running out of memory and getting memory errors. They had issues with the DataTable being released from memory and also with binding it to a DataGridView. I guess what I was looking for was a lighter-weight object that wouldn't take as much space.
After thinking about it a little more, it really doesn't make any sense to get that much data, as diagonalbatman mentioned. Furthermore, if we are having these issues with just a few people using it, how is it going to scale?
Unfortunately, I have a boss who doesn't listen and an offshore team with too much of a "yes sir" attitude. They are serializing the raw data (as an XML file) and then releasing the raw-data DataTable, which I think is not a good direction at all.
@diagonalbatman - Out of curiosity, do you have an example of this?
Why do you need to pull down 1 million records into your app?
Can you not do your reporting consolidation / aggregation on the DB? That would make better use of the DB's resources (after all, this is what an RDBMS is designed to do), and then you can focus your app on working with smaller, consolidated sets.
I would recommend you try several options to verify, especially in light of your need to aggregate the data in various ways.
1) Can it be aggregated by proper queries on the data side? This is likely the best solution (see the sketch after this list).
2) If you use POCOs, does LINQ improve upon your current memory and performance characteristics? Does LINQ allow you to do the aggregation you require?
Measure the characteristics you care about and try different options.
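For option 1, the idea is to ship only the aggregate rows across the wire instead of ~1M detail rows. A sketch, shown through plain ODBC for brevity (the DSN, table, and columns are all hypothetical; an Oracle-specific provider works the same way):

    using System;
    using System.Data.Odbc;

    using (var conn = new OdbcConnection("DSN=SalesDsn"))
    using (var cmd = new OdbcCommand(
        "SELECT region, SUM(amount), COUNT(*) FROM sales GROUP BY region", conn))
    {
        conn.Open();
        using (var reader = cmd.ExecuteReader())
        {
            // A handful of summary rows instead of a million-row DataTable.
            while (reader.Read())
                Console.WriteLine("{0}: {1} ({2} rows)",
                                  reader[0], reader[1], reader[2]);
        }
    }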
What you want are data cubes. Depending on the type of database you have, you should look at building some OLAP cubes.
I am working on an application with a kinda simple data model (4 tables: two small ones with around 10 rows each, and two bigger ones with hundreds of rows).
I'm working with C# and currently use an ODBC driver for my data access layer.
I was wondering if there is any difference in terms of performance between this driver and NHibernate.
The application works, but I'd like to know whether installing NHibernate instead of the classic ODBC driver would make it faster. If so, is the difference really worth installing NHibernate, given that I have never used such a technology?
Thanks!
Short answer: no, NHibernate will actually slow your performance in most cases.
Longer answer: NHibernate uses the basic ADO.NET drivers, including OdbcConnection (if there's nothing better), to perform the actual SQL queries. On top of that, it uses no small amount of reflection to digest queries into SQL and to turn SQL results into lists of objects. This extra layer, as flexible and powerful as it is, is going to perform more slowly than a hard-coded "firehose" solution based on a DataReader.
Where NHibernate may get you the APPEARANCE of faster performance is in "lazy loading". Say you have a list of People, each of whom has a list of PhoneNumbers. You are retrieving People from the database just to get their names. A naive DataReader-based implementation might call a stored procedure for the People that includes a join to their PhoneNumbers, which you don't need in this case. Instead, NHibernate will retrieve only the People and put a "proxy" in the reference to the list of PhoneNumbers; when the list needs to be evaluated, the proxy object performs another call. If the phone numbers are never needed, the proxy is never evaluated, saving you the trouble of pulling phone numbers you don't need.
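A sketch of the behavior described, assuming an open ISession and a hypothetical Person entity whose PhoneNumbers collection is mapped lazily (NHibernate's default for collections):

    // Loads only the People rows; each PhoneNumbers list is left as a proxy.
    var people = session.CreateQuery("from Person").List<Person>();

    foreach (var p in people)
        Console.WriteLine(p.Name);          // no phone-number query issued

    // Touching the collection is what triggers the second SELECT.
    int count = people[0].PhoneNumbers.Count;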
NHibernate isn't about making things faster, and it'll always be slower than using the database primitives directly like you are (it uses them "under the hood").
In my opinion, NHibernate is about building a reusable entity layer that can be applied to different applications, or at the very least reused in multiple areas of one medium-to-large application. Therefore, moving your application to NHibernate would be a waste of time (it sounds very small).
You might get better performance by using a driver specific to your database engine.
For the amount of data in your database, it won't make any difference. In general, using NHibernate will slow down application performance but increase development speed; that is true of pretty much all ORMs.
Some hints: NHibernate is not magic; it sits on top of ADO.NET. Want a faster driver? Get one. Why are you using a slow, outdated technology like ODBC anyway? What is your data source? Does it not support any newer standard, like OLE DB?
Is it possible to use Lucene as a full-fledged data store (like other NoSQL variants such as Mongo and Couch)?
I know there are some limitations, like documents newly updated by one indexer not being visible to another indexer, so the indexer has to be restarted to pick up the updates.
But I stumbled upon Solr lately, and it seems these problems are avoided there by some kind of snapshot replication.
So I thought I could use Lucene as a data store, since it manages the same kind of documents that Mongo and Couch use internally (JSON-based), and its proven indexing algorithm fetches records super fast.
But I am curious: has anybody tried that before? If not, what are the reasons for not choosing this approach?
There is also the problem of durability. While a Lucene index should never get corrupted, I've seen it happen. And the approach Lucene takes to repairing a broken index is "throw it away and rebuild from the original data," which makes perfect sense for an indexing tool, but it does require you to have the data stored somewhere else.
I've only worked with Solr, the Lucene derivative (and I would recommend Solr to just about anyone), so my opinion may be a little biased, but it should be possible to use Solr as a data store, yes; however, it wouldn't be very useful without something more permanent in the background.
The problem you may encounter is that entering data into Solr does not guarantee you will get it back when you expect it. Barring the use of pretty strict faceting, you may encounter problems retrieving your data simply because the indexer has decided to lump your results together in a certain way.
I've experimented a little with this approach, but the only real benefit I saw was in situations where you want the search index on the client side, so that the client can search quickly internally and then query the database for extended information.
My suggestion is to use Solr for search, and have it return a short sample of the data you may want, as well as an index for further querying against a traditional data store.
TL;DR: Yes, but I wouldn't recommend it.
The Guardian uses Solr as their data store. You can see some of their reasons in that slideshow.
In any case, their website is very heavily trafficked (certainly more so than anything I work on), so I would feel comfortable saying that Solr will probably work for you, since it scales to their requirements.
Imagine the following:
I have a table of 57,000 items that I regularly use in my application to figure out things like targeting groups, etc.
Instead of querying the database 300,000 times a day for a table that hardly ever changes its data, is there a way to store its information in my application and query it in memory directly? Or do I have to create some sort of custom datatype for each row and iterate through the rows, testing each one for the results I want?
After some googling, the closest thing I could find is an in-memory database.
Thank you,
- theo
SQLite supports in-memory tables.
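With the System.Data.SQLite provider, for example, the whole database lives in RAM for the lifetime of the connection (a minimal sketch; the schema is made up):

    using System.Data.SQLite;

    using (var conn = new SQLiteConnection("Data Source=:memory:"))
    {
        conn.Open();
        using (var cmd = conn.CreateCommand())
        {
            cmd.CommandText = "CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT)";
            cmd.ExecuteNonQuery();
            // Load the 57,000 rows once at startup, then query them here
            // without ever touching the real database again.
        }
    }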
For 57,000 items that you will be querying against (and want immediately available), I would not recommend just implementing simple caching. For that many items, I'd either recommend a distributed memory cache (even if it's only on one machine) such as Memcache, Velocity, etc., or go with your initial idea of using an in-memory database.
Also, if you use any full-fledged ORM such as NHibernate, you can hook it up to clients for these distributed caching tools with almost no work. Many of the major clients have NHibernate providers, including Memcache, Velocity, and some others. This might be a better solution, as it will only cache data that is truly being used rather than everything it might need.
Read up on Caching
It sounds like this is application-level data rather than user-level data, so you should look into "Caching Application Data".
Here are some samples of caching DataTables.
If you only need to find rows using the same key all the time, a simple Dictionary<Key, Value> could very well be all that you need. To me, 57,000 items doesn't really sound like much unless each row contains a huge amount of data. However, if you need to search by different columns, an in-memory database is most likely the way to go.
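A minimal sketch of the dictionary approach, where the Item class and the LoadItemsFromDb() helper are hypothetical placeholders for your own row type and one-time database read:

    using System.Collections.Generic;
    using System.Linq;

    // Hypothetical row type; mirror your table's columns.
    class Item
    {
        public int Id;
        public string Name;
    }

    // Load once at startup, then keep the dictionary for the app's lifetime.
    Dictionary<int, Item> itemsById = LoadItemsFromDb().ToDictionary(i => i.Id);

    // O(1) lookups afterwards, with no database round-trip.
    Item item = itemsById[42];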