I am currently benchmarking two databases, Postgres and MongoDB, on a relatively large data set with equivalent queries. Of course, I am doing my best to put them on equal footing, but I have one dilemma. For Postgres I take the execution time reported by EXPLAIN ANALYZE, and MongoDB has a similar concept via profiling, which reports a millis value (although the two are not strictly equivalent).
However, different times are observed if the queries are executed from, let's say, PgAdmin, the mongo CLI client, or my C# app timed with a stopwatch. That time also includes transfer latency, and probably protocol differences. PgAdmin, for example, seems to completely distort the execution time (it apparently includes the result rendering time).
The question is: is there any sense in actually measuring the time on the "receiving end", since an application actually does consume that data? Or does it include too many variables and contribute nothing to the actual database performance, so that I should stick to the reported DBMS execution times?
The question you'd have to answer first is: why are you benchmarking the databases? If you are benchmarking so you can select one over the other for use in a C# application, then you need to measure the time "on the receiving end". Whatever variables that may include, that is what you need to compare.
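For what it's worth, a minimal sketch of measuring on the receiving end with System.Diagnostics.Stopwatch, assuming a generic ADO.NET connection that is already open (the query text is a placeholder):

using System;
using System.Data.Common;
using System.Diagnostics;

static TimeSpan TimeQuery(DbConnection conn, string sql)
{
    var sw = Stopwatch.StartNew();
    using (var cmd = conn.CreateCommand())
    {
        cmd.CommandText = sql;
        using (var reader = cmd.ExecuteReader())
        {
            // Drain the full result set so transfer time is included.
            while (reader.Read()) { }
        }
    }
    sw.Stop();
    return sw.Elapsed; // end-to-end time as the application sees it
}

You'd want to run this several times and discard the first run, so that connection setup and plan caching don't skew the comparison.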
I have looked around for a simple answer but I haven't found one (though if I am being blind or impatient, I'd be happy for someone to post me the link).
I have the following code in my repository
get
{
    if (context.entity.Local.Count == 0)
    {
        return context.entity;
    }
    return context.entity.Local;
}
Common sense tells me that Local is not querying the database but is getting the result set from memory. However, what I would like to know is: how much faster is fetching the result set from Local than from the database? Is it a huge difference?
I am asking as I would like to speed up my web application so I am trying to find weaknesses in the code.
Thanks
First, be careful with that common sense. What Local means here depends on whoever made the repository: if context.entity is an EF DbSet<T>, then Local is its in-memory collection of tracked entities, but a custom repository could make it refer to something else entirely.
Second - a lot. Easily a factor of 1000. The database is a separate process, and a query involves generating and then parsing SQL, plus two network transfers (or at least network-layer transfers). Compare that to just reading out a property. A factor of 1000 is likely conservative, and note that the time spent in the database itself may be substantial to start with.
It depends on what you DO - but caching in memory and avoiding the database is a valid strategy that can make a lot of difference performance-wise, at the cost of more memory consumption and change-synchronization issues. The latter is not really relevant for some (never-changing) data.
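As a rough illustration, and assuming context.entity really is an EF DbSet<T>, a hedged sketch of the warm-then-read pattern the getter above implies might look like this (AppContext and Customers are made-up names):

using System.Linq;

using (var context = new AppContext())
{
    // First access: materialize the set once; EF tracks the results,
    // which populates the DbSet's Local collection.
    var warmUp = context.Customers.ToList(); // hits the database

    // Subsequent reads come from the tracked, in-memory collection;
    // no SQL is generated for this.
    var cached = context.Customers.Local;
    var count = cached.Count;
}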
I find myself faced with a conundrum whose answer probably falls outside my expertise. I'm hoping someone can help.
I have an optimised and efficient query for fetching table (and linked) data; the actual contents are unimportant. However, upon each read that data then needs to be processed to present it in JSON format. As we're talking typical examples where a few hundred rows can have a few hundred thousand associated rows, this takes time. With multi-threading and a powerful CPU (i7 3960X) this processing is around 400 ms to 800 ms at 100% CPU. It's not a lot, I know, but why process it each time in the first place?
In this particular example, although everything I've ever read advises against it (as I understood it), I'm considering storing the computed JSON in a VARCHAR(MAX) column for fast reading.
Why? Well, the data is read 100 times or more for every single write (change). Given those numbers, it seems to me it would be far better to store the JSON for optimised retrieval and re-compute and update it on the odd occasion the associations change - adding perhaps 10 to 20 ms to the time taken to write changes, but improving the reads by some large factor.
Your opinions on this would be much appreciated.
Yes, storing redundant information for performance reasons is pretty common. The first step is to measure the overhead - and it sounds like you've done that already (although I would also ask: what json serializer are you using? Have you tried others?).
But fundamentally, yes that's ok, when the situation warrants it. To give an example: stackoverflow has a similar scenario - the markdown you type is relatively expensive to process into html. We could do that on every read, but we have insanely more reads than writes, so we cook the markdown at write, and store the html as well as the source markdown - then it is just a simple "data in, data out" exercise for most of the "show" code.
It would be unusual for this to be a common problem with json, though, since json serialization is fairly simple and most serializers perform a lot of meta-programming optimization. Hence my suggestion to try a different serializer before going this route.
Note also that the rendered json may need more network bandwidth than the original source data in TDS - so your data transfer between the db server and the application server may increase; another thing to consider.
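A hedged sketch of the compute-on-write idea, using Json.NET's JsonConvert as a stand-in serializer (the Order type, CachedJson column and SaveOrder method are made-up names):

using Newtonsoft.Json;

public class Order
{
    public int Id { get; set; }
    // ... the real columns ...

    // Denormalized copy, stored in a VARCHAR(MAX) column.
    public string CachedJson { get; set; }
}

public void SaveOrder(Order order)
{
    // Pay the serialization cost once, on the rare write...
    order.CachedJson = JsonConvert.SerializeObject(order);
    // ...then persist the entity including CachedJson (repository call elided).
}

public string ReadOrderJson(Order order)
{
    // ...and reads become a plain string fetch.
    return order.CachedJson;
}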
So I am troubleshooting some performance problems on a legacy application, and I have uncovered a pretty specific problem (there may be others).
Essentially, the application is using an object relational mapper to fetch data, but it is doing so in a very inefficient/incorrect way. In effect, it is performing a series of entity graph fetches to fill a datagrid in the UI, and on databinding the grid (it is ASP.Net Webforms) it is doing additional fetches, which lead to other fetches, etc.
The net effect of this is that many, many tiny queries are being performed. Using SQL Profiler shows that a certain page performs over 10,000 queries to fill a single grid. No query takes over 10 ms to complete, and most register as 0 ms in Profiler. Each query uses and releases one connection, and the series of queries is single-threaded (per HTTP request).
I am very familiar with the ORM, and know exactly how to fix the problem.
My question is: what is the exact effect of having many, many small queries being executed in an application? In what ways does it/can it stress the different components of the system?
For example, what is the effect on the webserver's CPU and memory? Would it flood the connection pool and cause blocking? What would be the impact on the database server's memory, CPU and I/O?
I am looking for relatively general answers, mainly because I want to start monitoring the areas that are likely to be the most affected (I need to measure => fix => re-measure). Concurrent use of the system at peak would likely be around 100-200 users.
It will depend on the database, but generally there is a parse phase for each query. If the query uses bind variables, the plan will probably be cached; if not, you wear the cost of a parse, and that often means short locks on resources - i.e. BAD. In Oracle, CPU use and blocking are much more prevalent at the parse than at the execute; SQL Server less so, but it's worse at the execute. Obviously doing 10K of anything over a network is going to be a terrible solution, especially x 200 users. The volume of data is probably fine, but that frequency will really highlight all the overhead in comms latency and the like. Connection pools generally hold hundreds of connections, not tens of thousands, and now you have tens of thousands of objects all being created, queued, managed, destroyed, garbage collected, etc.
But I'm sure you already know all this deep down. Ditch the ORM for this part, write a stored procedure that executes a single query to return your result set, and bind that to the grid.
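For illustration, a minimal sketch of that single-query approach with plain ADO.NET (the stored procedure name and connection string are placeholders):

using System.Data;
using System.Data.SqlClient;

DataTable LoadGridData(string connectionString)
{
    var table = new DataTable();
    using (var conn = new SqlConnection(connectionString))
    using (var cmd = new SqlCommand("dbo.GetGridData", conn))
    {
        cmd.CommandType = CommandType.StoredProcedure;
        // One round trip instead of 10,000 tiny queries.
        using (var adapter = new SqlDataAdapter(cmd))
        {
            adapter.Fill(table);
        }
    }
    return table;
}

// grid.DataSource = LoadGridData(cs); grid.DataBind();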
We currently use List<T> to store events from a simulation project we are running. We need to optimise memory utilisation and the time it takes to process the events in order to derive certain key metrics.
We thought of moving the event log to a SQL Server Compact database table and then possibly using Linq to calculate the metrics. From your experience, do you think it will be faster to use SQL Server Compact than C#'s built-in data structures, or are we going to run into issues?
Some ideas.
MSMQ (Microsoft Message Queue)
You can have a thread dequeuing off MSMQ and updating metrics on the fly. If you need to store these events for later perusal, you can insert them into the database as you dequeue them. MSMQ scales much better in these scenarios - especially when the publisher and subscriber have asymmetric processing speeds and binary data is being used (SQL can get bogged down allocating space for VARBINARY, or allocating/splitting pages for indexes).
The two other SQL scenarios are complementary to this one - you can still dequeue into SQL, avoiding any hiccups in your simulation while SQL allocates space.
You can side-step what @Aliostad said using this approach, to a certain degree - see the sketch below.
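A hedged sketch of the dequeue-and-update loop with System.Messaging (the queue path, the SimEvent type and the UpdateMetrics method are made-up names):

using System.Messaging;

var queue = new MessageQueue(@".\Private$\simEvents");
queue.Formatter = new XmlMessageFormatter(new[] { typeof(SimEvent) });

while (true) // or until the simulation signals shutdown
{
    // Blocks until a message arrives; the simulation thread that
    // enqueued it has long since moved on.
    Message message = queue.Receive();
    var evt = (SimEvent)message.Body;

    UpdateMetrics(evt); // compute metrics on the fly
    // optionally: insert evt into SQL here for later perusal
}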
OLAP (Online Analytical Processing)
Sounds like you might benefit from OLAP (cubes, etc.). This will increase the overall runtime of your simulation but will improve the value of the data. Unfortunately it means forking out cash for one of the bigger SQL Server editions.
Stored Procedures
While Linq-to-SQL is great for 'your average developer', please keep away from it in scientific projects. There are a host of great tricks you can use in raw T-SQL, in addition to being able to inspect the query plan. If you want the best possible performance, plan your DB carefully and create stored procedures/UDFs to aggregate your data.
If you can only calculate some of the metrics in C#, do as much work as possible in SQL beforehand - and then feel free to use Linq-to-SQL to grab the data.
Also remember that if you are inserting off the end of an MSMQ queue, you can index aggressively, which will speed up your metric calculations without impacting your simulation.
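As a small illustration of "do the work in SQL first", a hedged sketch that pulls a pre-aggregated metric back with a single scalar call (the procedure name is made up, and the connection is assumed to be open):

using System.Data;
using System.Data.SqlClient;

double GetMeanEventValue(SqlConnection conn)
{
    using (var cmd = new SqlCommand("dbo.CalcMeanEventValue", conn))
    {
        cmd.CommandType = CommandType.StoredProcedure;
        // All aggregation happens server-side; only one value comes back.
        return (double)cmd.ExecuteScalar();
    }
}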
I would only involve SQL if there is a real need for better memory utilization (i.e. you are actually running out of it).
Memory Mapped Files
This allows you to offset memory pressure onto disk, at the cost of a performance penalty when data needs to be paged back in.
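For completeness, a hedged sketch with System.IO.MemoryMappedFiles, assuming fixed-size event structs (the SimEvent layout and file path are made-up):

using System.IO.MemoryMappedFiles;
using System.Runtime.InteropServices;

[StructLayout(LayoutKind.Sequential)]
struct SimEvent
{
    public long Timestamp;
    public int Kind;
    public double Value;
}

// Back the event log with a file; the OS pages it in and out as needed.
using (var mmf = MemoryMappedFile.CreateFromFile(
           "events.bin", System.IO.FileMode.OpenOrCreate, "simEvents",
           1L << 30)) // 1 GB capacity, adjust to taste
using (var accessor = mmf.CreateViewAccessor())
{
    var evt = new SimEvent { Timestamp = 1, Kind = 0, Value = 42.0 };
    accessor.Write(0, ref evt);          // write one struct at offset 0
    accessor.Read(0, out SimEvent back); // and read it back
}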
Overall
I would steer clear of Linq for computing the basic metrics - do it in SQL. MSMQ is without a doubt a huge winner in this case. Don't overcomplicate the memory issue, and keep the data in .NET if you are not actually running out of memory.
If you need to process all of the events, a C# List<T> will be faster than SQL Server. A plain array (T[]) will perform even better, especially if the elements are structs rather than classes, since structs are stored inline in the array while class instances are only referenced from it. Having the structs within the array reduces garbage collection pressure and improves cache locality (see the sketch after the list below).
If you only need to process part of the events, I think the solutions rank like this, fastest first:
1. C# data structures, crafted especially for your needs.
2. SQL Server.
3. Naive C# data structures, traversing a list searching for the right elements.
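A minimal sketch of that struct-array layout (SimEvent and its fields are made-up names):

// Struct elements live inline in the array: one contiguous block of
// memory, no per-element object headers, nothing extra for the GC.
struct SimEvent
{
    public long Timestamp;
    public double Value;
}

var events = new SimEvent[1000000];

double total = 0;
for (int i = 0; i < events.Length; i++)
{
    total += events[i].Value; // sequential scan, cache-friendly
}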
It sounds like you're thinking you need to have the events in a database in order to use Linq. That isn't the case: you can use Linq-to-Objects against C#'s built-in structures.
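For instance, a made-up metric computed directly over the existing List<T> (SimEvent, its Kind and Value fields, and GetEvents are placeholder names):

using System.Collections.Generic;
using System.Linq;

List<SimEvent> events = GetEvents(); // the existing in-memory log

// Average value per event kind, computed entirely in memory.
var metrics = events
    .GroupBy(e => e.Kind)
    .Select(g => new { Kind = g.Key, Average = g.Average(e => e.Value) })
    .ToList();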
It depends on what you mean by "faster". If this is about performance of access to the data, it's all about how much data you have; with big data, a DB solution used only for statistical purposes is definitely a good choice.
As the DB for this kind of purpose I would suggest SQLite: a single-file, fully ACID-compliant DB with no service required. But again, this depends on your data size, as SQLite's data size limits are lower than SQL Server's.
Regards.
"We need to optimise memory utilisation"
Use SQL Server CE.
"the time it takes to process the events"
Use Linq-to-Objects.
These two objectives conflict, so you need to choose the one that matters more to you.
I have been given the task of re-writing some libraries written in C# so that there are no allocations once startup is completed.
I just got to one project that does some DB queries over an OdbcConnection every 30 seconds. I've always just used .ExecuteReader() which creates an OdbcDataReader. Is there any pattern (like the SocketAsyncEventArgs socket pattern) that lets you re-use your own OdbcDataReader? Or some other clever way to avoid allocations?
I haven't bothered to learn LINQ since all the DBs at work are Oracle-based and, last I checked, there was no official Linq to Oracle provider. But if there's a way to do this in Linq, I could use one of the third-party providers.
Update:
I don't think I clearly specified the reasons for the no-alloc requirement. We have one critical thread running, and it is very important that it not freeze. This is for a near-real-time trading application, and we do see freezes of up to 100 ms for some Gen 2 collections. (I've also heard of games being written the same way in C#.) There is one background thread that does some compliance checking and runs every 30 seconds. It does a DB query right now. The query is quite slow (approx 500 ms to return with all the data), but that is okay because it doesn't interfere with the critical thread - except that if the worker thread allocates memory, it will cause GCs, which freeze all threads.
I've been told that all the libraries (including this one) cannot allocate memory after startup. Whether I agree with that or not, that's the requirement from the people who sign the checks :).
Now, clearly there are ways that I could get the data into this process without allocations. I could set up another process and connect it to this one using a socket. The new .NET 3.5 sockets were specifically optimized not to allocate at all, using the new SocketAsyncEventArgs pattern. (In fact, we are using them to connect to several systems and never see any GCs from them.) Then read from the socket into a pre-allocated byte array and walk through the data, allocating no strings along the way. (I'm not familiar with other forms of IPC in .NET, so I'm not sure whether memory-mapped files and named pipes allocate or not.)
But if there's a faster way to get this no-alloc query done without going through all that hassle, I'd prefer it.
You cannot reuse IDataReader (or OdbcDataReader or SqlDataReader or any equivalent class). They are designed to be used with a single query only. These objects encapsulate a single record set, so once you've obtained and iterated it, it has no meaning anymore.
Creating a data reader is an incredibly cheap operation anyway, vanishingly small in contrast to the cost of actually executing the query. I cannot see a logical reason for this "no allocations" requirement.
I'd go so far as to say that it's very nearly impossible to rewrite a library so as to allocate no memory. Even something as simple as boxing an integer or using a string variable is going to allocate some memory. Even if it were somehow possible to reuse the reader (which it isn't, as I explained), it would still have to issue the query to the database again, which would require memory allocations in the form of preparing the query, sending it over the network, retrieving the results again, etc.
Avoiding memory allocations is simply not a practical goal. Better to perhaps avoid specific types of memory allocations if and when you determine that some specific operation is using up too much memory.
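That said, if the goal is relaxed to minimizing steady-state allocations rather than eliminating them, a hedged sketch might reuse one command and a pre-allocated buffer across the 30-second runs (the query text, column layout and buffer size are assumptions):

using System.Data.Odbc;

sealed class ComplianceCheck
{
    readonly OdbcCommand _cmd;                    // created once at startup
    readonly char[] _textBuffer = new char[4096]; // reused on every run

    public ComplianceCheck(OdbcConnection conn)
    {
        _cmd = conn.CreateCommand();
        _cmd.CommandText = "SELECT id, payload FROM compliance_items";
    }

    public void Run()
    {
        // The reader itself is still allocated per call - ADO.NET offers
        // no way around that - but per-row string allocations can be avoided.
        using (OdbcDataReader reader = _cmd.ExecuteReader())
        {
            while (reader.Read())
            {
                long id = reader.GetInt64(0);
                // Copy character data into the reused buffer instead of
                // materializing a new string per row.
                long len = reader.GetChars(1, 0, _textBuffer, 0, _textBuffer.Length);
                // ... inspect _textBuffer[0..len) without allocating ...
            }
        }
    }
}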
For such a requirement, are you sure that a high-level language like C# is the right choice?
You cannot say whether the .NET library functions you are using allocate memory internally. The standard doesn't guarantee it, so even if they don't allocate in the current version of the .NET Framework, they may start doing so later.
I suggest you profile the application to determine where the time and/or memory are being spent. Don't guess - you will only guess wrong.