Using: SQL Server 2008, Entity Framework
I am summing columns in a table across a date/time range. The table is straightforward:
DataId bigint IDENTITY(1,1) NOT NULL PRIMARY KEY,
DateCollected datetime NOT NULL,
Length int NULL,
Width int NULL,
-- etc... (several more measurement variables)
Once I have the date/time range, I use LINQ-to-EF to build the query:
var query = _context.Data.Where(d =>
    d.DateCollected > df &&
    d.DateCollected < dt);
I then construct my data structure using the sums of the data elements I'm interested in:
DataRowSum row = new DataRowSum
{
    Length_Sum = query.Sum(d => d.Length.HasValue ? (long)d.Length.Value : 0),
    Width_Sum = query.Sum(d => d.Width.HasValue ? (long)d.Width.Value : 0),
    // etc. (sum up the rest of the measurement variables)
};
While this works, it results in a lot of DB round trips (each call to Sum is a separate query) and is quite slow. Is there a better way to do this? If it means doing it in a stored procedure, that's fine with me. I just need to get the performance better, since we'll only be adding more measurement variables and this performance issue will just get worse.
SQL Server is very good at rolling up summary values. Create a proper stored procedure that calculates the sums for you. This will give you maximum performance, especially if you don't actually need the tabular data in your client program: have SQL Server roll up the summary and send back a whole lot less data. One of the reasons I generally don't like LINQ is that it tempts programmers to do what you are trying to do here (pull a set and do 'something' against every row) instead of taking advantage of the database engine and all its capabilities.
Do this with aggregate functions and grouping in the SQL. LINQ will never figure out how to do this fast.
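For illustration, a minimal sketch of such a procedure, reusing the column names from the question (the procedure and table names are my assumptions):

CREATE PROCEDURE dbo.SumMeasurements
    @df datetime,
    @dt datetime
AS
BEGIN
    SET NOCOUNT ON;
    -- SUM ignores NULLs, which matches the HasValue-else-0 logic in the C# code.
    SELECT
        SUM(CAST(Length AS bigint)) AS Length_Sum,
        SUM(CAST(Width AS bigint)) AS Width_Sum
        -- etc. (add the remaining measurement columns)
    FROM dbo.Data
    WHERE DateCollected > @df
      AND DateCollected < @dt;
END

One round trip returns every sum at once, no matter how many measurement columns you add.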
Just out of curiosity, how exactly does SELECT * FROM table WHERE column = 'something' work?
Is the underlying principle the same as that of a for/foreach loop with an if condition, like:
foreach (var row in table)
{
    if (condition)
        // print the row
}
If I am dealing with, say, 100 records, will there be any considerable performance difference between the two approaches in getting the data I want?
SQL is a fourth-generation language, which makes it very different from ordinary programming languages. Instead of telling the computer how to do something (loop through rows, compare columns), you tell it what you want (the rows matching a condition).
The DBMS may or may not use a loop. It could just as well use hashes and buckets, pre-sort a data set, whatever. It is free to choose.
On the technical side, you can provide an index in the database, so the DBMS can quickly look up keys to access the rows (like quickly finding names in a telephone book). This gives the DBMS an option for how to access the data, but it is still free to use a completely different approach, e.g. read the whole table sequentially.
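For example, a hedged sketch with placeholder names:

CREATE INDEX IX_MyTable_MyColumn ON MyTable (MyColumn);

-- The optimizer *may* now answer this with an index seek, but it is
-- still free to choose a full table scan if it estimates that as cheaper:
SELECT * FROM MyTable WHERE MyColumn = 'something';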
I have a list of 2048 doubles that represent the amplitude of different samples in a signal. I'm constantly re-sampling the signal (say 20 times every second) and would like to store the data in an SQL database.
One complete measurement of the signal looks something like this:
List<double> sineCurve = new List<double>();
for (int i = 0; i < 2048; ++i)
{
sineCurve.Add(50 + 50 * Math.Sin(i / Math.PI));
}
What's the best way to store this data? The simplest way seems to be to create a table with 2049 columns and store each measurement as a new row:
string sql = "create table sample( measurement_id int not null, sample_value1 double, sample_value2 double, ... sample_value2048 double )";
Is this the preferred way of storing this type of data?
Edit: I would like to run a test where data is collected and saved to the database continuously over a few days. Then I would like to process the data by looking for certain patterns in each list of samples, min/max values and so forth.
I would create a table like this, where all the samples of one measurement are stored in a single BLOB entry:
CREATE TABLE measurement (
    id INTEGER PRIMARY KEY,
    generated_at TIMESTAMP,
    sample BLOB,
    SequenceNr INTEGER
);
Then, if you want to look for measurements in a given time range, it would look something like this:
SELECT m.sample FROM measurement m
WHERE m.generated_at BETWEEN startDate AND endDate
ORDER BY m.SequenceNr;
You should know in advance what kinds of queries you wish to run against your data. Extract those relevant values (min/max and so forth) before saving a record and put them in separate fields; otherwise, save the array in a blob field as binary.
If you find you need other queries later, you will have to reprocess the blob values, picking out the values you need.
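For what it's worth, a minimal C# sketch of packing the 2048 samples into a blob and back (the helper names are mine, not from the question):

using System;
using System.Collections.Generic;

static byte[] SamplesToBlob(List<double> samples)
{
    // 8 bytes per double; Buffer.BlockCopy does a raw memory copy.
    byte[] blob = new byte[samples.Count * sizeof(double)];
    Buffer.BlockCopy(samples.ToArray(), 0, blob, 0, blob.Length);
    return blob;
}

static List<double> BlobToSamples(byte[] blob)
{
    double[] samples = new double[blob.Length / sizeof(double)];
    Buffer.BlockCopy(blob, 0, samples, 0, blob.Length);
    return new List<double>(samples);
}

The byte[] goes into the sample column as a parameter, and the later pattern-matching pass reads it back with BlobToSamples.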
I have a problem concerning application performance: I have many tables, each with millions of records. I am performing select statements over them using joins, where clauses and order-by on different criteria (specified by the user at runtime). I want my records paged, but no matter what I do with my SQL statements I cannot reach the performance of serving the pages directly from memory. The problem really starts when I have to filter the records by some dynamic, runtime-specified criteria. I have tried everything: the ROW_NUMBER() function combined with a "WHERE RowNo BETWEEN" clause, CTEs, temp tables, etc. Those SQL solutions perform well only if I don't include the filtering. Keep in mind also that I want my solution to be as generic as possible (imagine that my app has several lists that virtually present paged millions of records, and those records are constructed with very complex SQL statements).
All my tables have a primary key of type INT.
So I came up with an idea: why not create a "server" just for select statements? The server first loads all records from all tables and stores them in HashSets, where each T has an Id property, GetHashCode() returns that Id, and Equals is implemented such that two records are "equal" only if their Ids are equal (don't scream, you will see later why I am not using all the record data for hashing and comparisons).
So far so good, but there's a problem: how can I sync my in-memory collections with the database records? The idea is that I must find a solution that loads only differential changes. So I invented a changelog table for each table that I want to cache. Into this changelog I only insert rows that mark dirty records (updates or deletes) and record newly inserted ids, all of this implemented with triggers. Whenever an in-memory select comes in, I first check whether I must sync something (by querying the changelog). If something must be applied, I load the changelog, apply those changes in memory and finally clear that changelog (or maybe remember the highest changelog id that I've applied...).
In order to apply the changelog in O(N), where N is the changelog size, I use this algorithm (sketched in code after the list):
For each log entry:
identify the in-memory Dictionary<int, T> whose key is the primary key;
if it is a delete log, call dictionary.Remove(id) (O(1));
if it is an update log, call dictionary.Remove(id) (O(1)) and move the id into a "to be inserted" collection;
if it is an insert log, move the id into the "to be inserted" collection;
finally, refresh the cache by selecting all data from the corresponding table where Id IN ("to be inserted").
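A hedged sketch of that loop for one cached table (ChangeLogEntry, ChangeType and the field names are my assumptions, since the real schema isn't shown):

using System.Collections.Generic;

enum ChangeType { Insert, Update, Delete }

class ChangeLogEntry
{
    public int RowId;
    public ChangeType Operation;
}

static void ApplyChangeLog<T>(
    Dictionary<int, T> cache,              // in-memory rows keyed by primary key
    IEnumerable<ChangeLogEntry> logs,
    List<int> toBeInserted)                // ids to re-fetch afterwards
{
    foreach (var log in logs)              // O(N) over the changelog
    {
        switch (log.Operation)
        {
            case ChangeType.Delete:
                cache.Remove(log.RowId);   // O(1)
                break;
            case ChangeType.Update:
                cache.Remove(log.RowId);   // O(1); the new version is re-fetched
                toBeInserted.Add(log.RowId);
                break;
            case ChangeType.Insert:
                toBeInserted.Add(log.RowId);
                break;
        }
    }
    // The caller then runs one SELECT ... WHERE Id IN (toBeInserted)
    // against the corresponding table and re-adds those rows to the cache.
}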
For filtering, I compile expression trees into Func<T, List<FilterCriteria>, bool> functors. Using this mechanism, filtering performs much faster than SQL.
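As an aside, a minimal sketch of compiling one equality criterion into a delegate (the method and its parameter names are illustrative, not the actual code):

using System;
using System.Linq.Expressions;

static Func<T, bool> BuildEqualsPredicate<T>(string propertyName, object value)
{
    ParameterExpression item = Expression.Parameter(typeof(T), "x");
    MemberExpression prop = Expression.Property(item, propertyName);
    // Give the constant the property's type so Expression.Equal type-checks.
    ConstantExpression constant = Expression.Constant(value, prop.Type);
    BinaryExpression body = Expression.Equal(prop, constant);
    return Expression.Lambda<Func<T, bool>>(body, item).Compile();
}

// Usage: var isBob = BuildEqualsPredicate<Person>("Name", "Bob");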
I know that SQL Server 2012 has caching support and the upcoming SQL Server version will support even more, but my client has SQL Server 2005, so I can't benefit from any of that.
My question: what do you think? Is this a bad idea? Is there a better approach?
The developers of SQL Server did a very good job. I think it is nearly impossible to out-perform it this way.
Unless your data has some kind of implicit structure which might help to speed things up and which the optimizer cannot be aware of, such "I do my own speedy trick" approaches won't help, normally...
Performance problems are always best solved where they occur:
the table structures and relations
indexes and statistics
quality of SQL statements
Even many millions of rows are no problem if the design and the queries are good...
If your queries do a lot of computations, or you need to retrieve data out of tricky structures (nested lists with recursive reads, XML...), I'd go the data-warehouse path and write some denormalized tables for quick selects. Of course you will have to deal with the fact that you are reading "old" data. If your data does not change much, you could trigger all changes into the denormalized structure immediately. But this depends on your actual situation.
If you want, you could post one of your poorly performing queries together with the relevant structure details and ask for review. There are dedicated sites on Stack Exchange, such as Code Review. If it's not too big, you might try it here as well...
Here is the situation:
I'm working on a basic search for a somewhat big entity. Right now the number of results is manageable, but I expect a very large amount of data after a year or two of use, so performance is important here.
The object I'm browsing has a DateTime value and I need to be able to output all objects with the same month, regardless of the year. There are multiple search fields that can be combined, but the other fields do not cause a problem here.
I tried this :
if(model.SelectedMonth != null)
{
contribs = contribs.Where(x => x.Date.Value.Month == model.SelectedMonth);
}
model.Contribs = contribs
.Skip(NBRESULTSPERPAGE*(model.CurrentPage - 1))
.Take(NBRESULTSPERPAGE)
.ToList();
So far all I get is "Invalid 'where' condition. An entity member is invoking an invalid property or method." I thought of just invoking ToList(), but that doesn't seem very efficient; again, the entity is quite big. I'm looking for a clean way to make this work.
You said:
The object I'm browsing has a DateTime value and I need to be able to output all objects with the same month, regardless of the year
...
I expect a very large amount of data after a year or two of use, so performance is important here.
Right there, you have a problem. I understand you are using LINQ to CRM, but this problem would actually come up regardless of what technology you're using.
The underlying problem is that date and time is stored in a single field. The year, month, day, hour, minute, seconds, and fractional seconds are all packed into a single integer that represents the number of units since some time. In the case of a DateTime in .NET, that's the number of ticks since 1/1/0001. If the value is stored in a SQL DateTime2 field, it's the same thing. Other data types have different start dates (epochs) and different precisions. But in all cases, there's just a single number internally.
If you're searching for a value that is in a month of a particular year, then you could get decent performance from a range query. For example, give all values >= 2014-01-01 and < 2014-02-01. Those two points can be mapped back to their numeric representation in the database. If the field has an index, then a range query can use that index.
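In SQL, such a sargable range query might look like this (table and column names are placeholders):

-- Both bounds map to fixed points in the index, so an index seek is possible.
SELECT *
FROM dbo.Contribution
WHERE ContribDate >= '2014-01-01'
  AND ContribDate < '2014-02-01';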
But if the value you're looking for is just a month, then any query you provide will require the database to extract that month from each and every value in the table. This is also known as a "table scan", and no amount of indexing will help.
A query that can effectively use an index is known as a sargable query. However, the query you are attempting is non-sargable because it has to evaluate every value in the table.
I don't know how much control over the object and its storage you have in CRM, but I will tell you what I usually recommend for people querying a SQL Server directly:
Store the month in a separate column as a single integer. Sometimes this can be a computed column based on the original datetime, so that you never have to enter it directly.
Create an index that includes this column.
Use this column when querying by month, so that the query is sargable (see the sketch after this list).
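A hedged sketch of all three steps against SQL Server (table and column names are placeholders):

-- MONTH() is deterministic, so the computed column can be persisted and indexed.
ALTER TABLE dbo.Contribution
    ADD ContribMonth AS MONTH(ContribDate) PERSISTED;

CREATE INDEX IX_Contribution_ContribMonth
    ON dbo.Contribution (ContribMonth);

-- Sargable "month of any year" query:
SELECT * FROM dbo.Contribution WHERE ContribMonth = 3;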
This is a guess and really should be a comment, but it's too much code to format well in a comment. If it's not helpful I'll delete the answer.
Try moving model.SelectedMonth to a variable rather than putting it in the Where clause
var selectedMonth = model.SelectedMonth;
if(selectedMonth != null)
{
contribs = contribs.Where(x => x.Date.Value.Month == selectedMonth);
}
You might do the same for CurrentPage as well:
int currentPage = model.CurrentPage;
model.Contribs = contribs
.Skip(NBRESULTSPERPAGE*(currentPage - 1))
.Take(NBRESULTSPERPAGE)
.ToList();
Many query providers work better with variables than properties of non-related objects.
What is the type of model.SelectedMonth?
According to your code logic it is nullable, and it appears that it might be a struct, so does this work?
if (model.SelectedMonth.HasValue)
{
contribs = contribs.Where(x => x.Date.Value.Month == model.SelectedMonth.Value);
}
You may need to create a Month OptionSet Attribute on your contrib Entity, which is populated via a plugin on the create/update of the entity for the Date Attribute. Then you could search by a particular month, and rather than searching a Date field, it's searching an int field. This would also make it easy to search for a particular month in the advanced find.
The LINQ-to-CRM provider isn't a full-fledged version of LINQ. It generally doesn't support any sort of operation on the attribute in your where statement, since the query has to be converted to whatever QueryExpressions support.
I have learned that SQL Server stores DateTime differently than the .NET Framework. This is very unfortunate in the following circumstance: Suppose I have a DataRow filled from my object properties - some of which are DateTime - and a second DataRow filled from data for that object as persisted in SQL Server:
DataTable table = new DataTable();   // DataRow has no public constructor,
DataRow drFromObj = table.NewRow();  // so rows come from a DataTable
drFromObj.ItemArray = itemArrayOfObjProps;
DataRow drFromSQL = /* blah select from SQL Server */;
Using the DataRowComparer on these two DataRows will give an unexpected result:
// This gives false about 100% of the time because SQL Server truncated the more precise
// DateTime to the less precise SQL Server DateTime
DataRowComparer.Default.Equals(drFromObj, drFromSQL);
My question was going to be, 'How do other people deal with this reality in a safe and sane manner?' I was going to rule out converting to strings or writing my own DataRowComparer. I was also going to offer that, in the absence of better advice, I would change the 'set' on all of my DateTime properties to convert to a System.Data.SqlTypes.SqlDateTime and back upon storage, thus:
public Nullable<DateTime> InsertDate
{
    get
    {
        return _InsDate;
    }
    set
    {
        if (value.HasValue)
            // Round-trip through SqlDateTime to truncate the value to
            // SQL Server datetime precision (1/300 of a second).
            _InsDate = (DateTime)new System.Data.SqlTypes.SqlDateTime(value.Value);
        else
            _InsDate = null;
    }
}
I know full well that this would probably get screwed up as I used the _InsDate variable directly somewhere rather than going through the property. So my other suggestion was going to be simply using System.Data.SqlTypes.SqlDateTime for all properties where I might want a DateTime type to round trip to SQL Server (and, happily, SqlDateTime is nullable). This post changed my mind, however, and seemed to fix my immediate problem. My new question is, 'What are the caveats or real world experiences using the SQL Server datetime2(7) data type rather than the good, old datetime data type?'
TL;DR: Comparing dates is actually hard, even though it looks easy because you get away with it most of the time.
You have to be aware of this issue and round both values yourself to the desired precision.
This is essentially the same issue as comparing floating point numbers. If two times differ by four nanoseconds, does it make sense for your application to consider them different, or the same?
For example, if two servers have logged the same event, searching for corresponding records, you wouldn't say "no that can't be the correct event because the time is wrong by 200 nanoseconds". Clocks can differ by that amount on two servers no matter how hard they try to keep their time synchronised. You might accept that an event seen on server A and logged with a time a couple of seconds after the time on server B might have been actually seen simultaneously or the other way around.
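A minimal sketch of both tactics in C# (the helper names are mine):

using System;

// Compare with an explicit tolerance instead of exact equality.
static bool RoughlyEqual(DateTime a, DateTime b, TimeSpan tolerance)
{
    return (a - b).Duration() <= tolerance;   // Duration() is the absolute value
}

// Or truncate both sides to whole seconds before comparing.
static DateTime TruncateToSeconds(DateTime value)
{
    return new DateTime(value.Ticks - value.Ticks % TimeSpan.TicksPerSecond, value.Kind);
}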
Note:
If you are comparing data which is supposed to have made some sort of round-trip out of the database, you may find it has been truncated to the second or minute. (For example if it has been through Excel or an old VB application, or been written to a file and parsed back in.)
Data originating from external sources is generally rounded to the day, the minute, or the second (except sometimes log files, event logs or electronic data loggers, which may have milliseconds or better).
If the data has come from SQL Server, and you are comparing it back to itself (for example to detect changes), you may not encounter any issues as they will be implicitly truncated to the same precision.
Daylight savings and timezones introduce additional problems.
If searching for dates, use a date range. And make sure you write the query in such a way that any index can be used.
Somewhat related:
Why doesn't this sql query return any results comparing floating point numbers?
Identity values increment. Sort by the identity column and you get insert order. You (can) control the insert order.
I seriously doubt the OUTPUT would ever be out of order, but if you don't trust it you can use @SeeMeSorted:
DECLARE @SeeMeSort TABLE
( [ID] [int] IDENTITY(1,1) NOT NULL,
  [Name] [nvarchar](20) NOT NULL);

DECLARE @SeeMeSorted TABLE
( [ID] [int] PRIMARY KEY NOT NULL,
  [Name] [nvarchar](20) NOT NULL);

-- OUTPUT streamed straight back to the client
insert into @SeeMeSort ([Name])
OUTPUT INSERTED.[ID], INSERTED.[name]
values ('fff'), ('hhh'), ('ggg');

-- OUTPUT captured into the second table variable instead
insert into @SeeMeSort ([Name])
OUTPUT INSERTED.[ID], INSERTED.[name]
into @SeeMeSorted
values ('xxx'), ('aaa'), ('ddd');

select * from @SeeMeSorted order by [ID];