Currently, I'm looking for the optimal way to store data points in a SQL Server table and then read large quantities of them from a .NET app (target framework: .NET Core 3.1). Right now I'm storing my data in a table structure like this:
CREATE TABLE [DataPoints](
[Id] [int] NOT NULL,
[DateTime] [datetime] NOT NULL,
[Value] [decimal](19, 9) NOT NULL,
CONSTRAINT [PK_Index] PRIMARY KEY CLUSTERED
(
[DateTime] ASC,
[Id] ASC
)WITH (STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = ON, OPTIMIZE_FOR_SEQUENTIAL_KEY = OFF) ON [PRIMARY]
) ON [PRIMARY]
GO
Id: the TimeSeries id.
DateTime: the value timestamp.
Value: the TimeSeries value.
Now, my main issue is the reading part (storage is done overnight, so consistent write speeds are not important). I'm currently running stress tests that read 5 years' worth of data from this table for at least 500 TimeSeries ids, which comes to about 160,000,000 records. Querying that many records takes on average 7:30 minutes, more or less.
I'm using Entity Framework to retrieve the data, and I've tried different approaches:
Going one TimeSeries id at a time (ranges between 7:20-7:40 minutes)
var dataPoints = context.DataPoints
.AsNoTracking()
.AsQueryable()
.Where(dataPoint => dataPoint.Id == id &&
dataPoint.DateTimeUtc >= startDate &&
dataPoint.DateTimeUtc <= endDate);
Including all ids in the query (ranges between 7:30-8:10 minutes)
List<int> ids = new List<int>() {1, 2, 3, 4, .... 498, 499, 500 };
var dataPoints = context.DataPoints
.AsNoTracking()
.AsQueryable()
.Where(dataPoint => ids.Contains(dataPoint.Id) &&
dataPoint.DateTimeUtc >= startDate &&
dataPoint.DateTimeUtc <= endDate);
Basically I just want to know if there is a better way to read this amount of data using SQL server and improve the times it takes to query.
I've also read about InfluxDB, Timescale and MongoDB, but before moving to those technologies, I wanted to know if what I want is feasible using the current SQL Server database.
That's really not an optimal table design for reading. There's not even an efficient way to seek to a particular Id, so you'll have to scan through all the IDs for the date range.
Try a partitioned columnstore. Columnstores have the best compression and scan speed, and each ~1M-row rowgroup stores min/max values for each column, so rowgroups outside the query's range can be skipped efficiently. Partitioning then breaks the table up further by putting different Ids in different physical data structures.
create partition function pf_tsid(int) as range right for values (0,100,200,300,400,500,600,700)
create partition scheme ps_tsid as partition pf_tsid all to ([Primary])
CREATE TABLE [DataPoints](
[Id] [int] NOT NULL,
[DateTime] [datetime] NOT NULL,
[Value] [decimal](19, 9) NOT NULL,
CONSTRAINT [PK_Index] PRIMARY KEY NONCLUSTERED
(
[DateTime] ASC,
[Id] ASC
) WITH (IGNORE_DUP_KEY = ON)
) on ps_tsid(Id)
create clustered columnstore index cci_DataPoints on DataPoints
You can go as far as to put each time series in its own partition if you like.
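As a rough illustration (the query shape here is assumed, not taken from the question), a typical read against this layout would look like the following; the Id predicate allows partition elimination and the DateTime range allows rowgroup (segment) elimination inside the columnstore:
SELECT [Id], [DateTime], [Value]
FROM DataPoints
WHERE Id IN (1, 2, 3)                 -- touches only the partitions that hold these ids
  AND [DateTime] >= '2017-01-01'      -- rowgroups whose min/max fall outside this range are skipped
  AND [DateTime] <  '2022-01-01';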
I am using a Microsoft SQL Server Web edition instance on Amazon RDS. The system is currently generating timeouts when updating one column, and I am trying to resolve the issue or at least minimize it. The updates occur when a device calls in, and the devices call in a lot, to the point where a device may call back before the web server has finished handling the last call.
Microsoft SQL Server Web (64-bit)
Version 13.0.4422.0
I see a couple of possibilities here. The first is that the device is calling back before the system has finished handling the last call, so the same record is being updated multiple times concurrently. The second is that I am running into a row lock or table lock.
The table has about 3,000 records in total.
Note I am only trying to update one column in one row at a time. The other columns are never updated.
I don't need the last-updated time to be very accurate. Would there be any benefit to changing the code to only update the column if it is, say, more than a few minutes old, or would that just add more load to the server? Any suggestions on how to optimize this? Maybe move it to a function, stored procedure, or something else?
Suggested new code:
UPDATE [Devices] SET [LastUpdated] = GETUTCDATE()
WHERE [Id] = @id AND
([LastUpdated] IS NULL OR DATEDIFF(MI, [LastUpdated], GETUTCDATE()) > 2);
Existing update code:
internal static async Task UpdateDeviceTime(ApplicationDbContext db, int deviceId, DateTime dateTime)
{
var parm1 = new System.Data.SqlClient.SqlParameter("@id", deviceId);
var parm2 = new System.Data.SqlClient.SqlParameter("@date", dateTime);
var sql = "UPDATE [Devices] SET [LastUpdated] = @date WHERE [Id] = @id";
// timeout occurs here.
var cnt = await db.Database.ExecuteSqlCommandAsync(sql, new object[] { parm1, parm2 });
}
Table creation script:
CREATE TABLE [dbo].[Devices](
[Id] [int] IDENTITY(1,1) NOT NULL,
[CompanyId] [int] NOT NULL,
[Button_MAC_Address] [nvarchar](17) NOT NULL,
[Password] [nvarchar](max) NOT NULL,
[TimeOffset] [int] NOT NULL,
[CreationTime] [datetime] NULL,
[LastUpdated] [datetime] NULL,
CONSTRAINT [PK_Devices] PRIMARY KEY CLUSTERED
(
[Id] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY] TEXTIMAGE_ON [PRIMARY]
GO
ALTER TABLE [dbo].[Devices] ADD CONSTRAINT [DF_Devices_CompanyId] DEFAULT ((1)) FOR [CompanyId]
GO
ALTER TABLE [dbo].[Devices] ADD CONSTRAINT [DF_Devices_TimeOffset] DEFAULT ((-5)) FOR [TimeOffset]
GO
ALTER TABLE [dbo].[Devices] ADD CONSTRAINT [DF_Devices_CreationTime] DEFAULT (getdate()) FOR [CreationTime]
GO
You should look into the cause by using a tool such as Profiler or other techniques to detect blocking. I don't see why you would have a problem updating one column in a table with only 3,000 records. It might have something to do with your constraints.
If it really is a timing issue, then you can consider In-Memory OLTP, which is designed to handle this type of scenario.
The last-updated time could also be stored in a transaction-style table with a link back to this table, joined using MAX(UpdatedTime). In that case you would never update, just add new records.
You can then either use partitioning or a cleanup routine to keep the size of this transaction table down.
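A minimal sketch of that append-only approach, assuming a new heartbeat table (all names below are made up for illustration):
CREATE TABLE [dbo].[DeviceHeartbeats](
[DeviceId] [int] NOT NULL,
[UpdatedTime] [datetime] NOT NULL CONSTRAINT [DF_DeviceHeartbeats_UpdatedTime] DEFAULT (getutcdate())
)
GO
CREATE CLUSTERED INDEX [IX_DeviceHeartbeats_DeviceId_UpdatedTime] ON [dbo].[DeviceHeartbeats] ([DeviceId], [UpdatedTime])
GO
-- Each device call-in becomes an INSERT, so no existing row is ever contended:
INSERT INTO [dbo].[DeviceHeartbeats] ([DeviceId]) VALUES (@id);
-- The last-updated time per device is then just MAX(), joined back to Devices:
SELECT d.[Id], MAX(h.[UpdatedTime]) AS [LastUpdated]
FROM [dbo].[Devices] d
LEFT JOIN [dbo].[DeviceHeartbeats] h ON h.[DeviceId] = d.[Id]
GROUP BY d.[Id];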
Programming patterns that In-Memory OLTP will improve include concurrency scenarios, point lookups, workloads where there are many inserts and updates, and business logic in stored procedures.
https://msdn.microsoft.com/library/dn133186(v=sql.120).aspx
I've just taken over a project at work, and my boss has asked me to make it run faster. Great.
So I've identified one of the major bottlenecks to be searching through one particular table from our SQL server, which can take up to a minute, sometimes longer, for a select query with some filters on it to run. Below is the SQL generated by C# Entity Framework (minus all the GO statements):
CREATE TABLE [dbo].[MachineryReading](
[Id] [int] IDENTITY(1,1) NOT NULL,
[Location] [geometry] NULL,
[Latitude] [float] NOT NULL,
[Longitude] [float] NOT NULL,
[Altitude] [float] NULL,
[Odometer] [int] NULL,
[Speed] [float] NULL,
[BatteryLevel] [int] NULL,
[PinFlags] [bigint] NOT NULL, -- Deprecated field, this is now stored in a separate table
[DateRecorded] [datetime] NOT NULL,
[DateReceived] [datetime] NOT NULL,
[Satellites] [int] NOT NULL,
[HDOP] [float] NOT NULL,
[MachineryId] [int] NOT NULL,
[TrackerId] [int] NOT NULL,
[ReportType] [nvarchar](1) NULL,
[FixStatus] [int] NOT NULL,
[AlarmStatus] [int] NOT NULL,
[OperationalSeconds] [int] NOT NULL,
CONSTRAINT [PK_dbo.MachineryReading] PRIMARY KEY NONCLUSTERED
(
[Id] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON)
)
ALTER TABLE [dbo].[MachineryReading] ADD DEFAULT ((0)) FOR [FixStatus]
ALTER TABLE [dbo].[MachineryReading] ADD DEFAULT ((0)) FOR [AlarmStatus]
ALTER TABLE [dbo].[MachineryReading] ADD DEFAULT ((0)) FOR [OperationalSeconds]
ALTER TABLE [dbo].[MachineryReading] WITH CHECK ADD CONSTRAINT [FK_dbo.MachineryReading_dbo.Machinery_MachineryId] FOREIGN KEY([MachineryId])
REFERENCES [dbo].[Machinery] ([Id])
ON DELETE CASCADE
ALTER TABLE [dbo].[MachineryReading] CHECK CONSTRAINT [FK_dbo.MachineryReading_dbo.Machinery_MachineryId]
ALTER TABLE [dbo].[MachineryReading] WITH CHECK ADD CONSTRAINT [FK_dbo.MachineryReading_dbo.Tracker_TrackerId] FOREIGN KEY([TrackerId])
REFERENCES [dbo].[Tracker] ([Id])
ON DELETE CASCADE
ALTER TABLE [dbo].[MachineryReading] CHECK CONSTRAINT [FK_dbo.MachineryReading_dbo.Tracker_TrackerId]
The table has indexes on MachineryId, TrackerId, and DateRecorded:
CREATE NONCLUSTERED INDEX [IX_MachineryId] ON [dbo].[MachineryReading]
(
[MachineryId] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON)
CREATE NONCLUSTERED INDEX [IX_MachineryId_DateRecorded] ON [dbo].[MachineryReading]
(
[MachineryId] ASC,
[DateRecorded] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON)
CREATE NONCLUSTERED INDEX [IX_TrackerId] ON [dbo].[MachineryReading]
(
[TrackerId] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON)
When we select from this table, we are almost always interested in one machinery or tracker, over a given date range:
SELECT *
FROM MachineryReading
WHERE MachineryId = 2127 AND
DateRecorded > '2016-12-08 00:00:10.009' AND DateRecorded < '2016-12-11 18:32:41.734'
As you can see, it's quite a basic setup. The main problem is the sheer amount of data we put into it - about one row every ten seconds per tracker, and we have over a hundred trackers at the moment. We're currently sitting somewhere around 10-15 million rows. So this leaves me with two questions.
Am I thrashing the database if I insert 10 rows per second (without batching them)?
Given that this is historical data, so once it is inserted it will never change, is there anything I can do to speed up read access?
You have too many non-clustered indexes on the table, which increases the size of the database.
If you have an index on (MachineryId, DateRecorded), you don't really need a separate one on MachineryId alone.
With 3 non-clustered indexes, there are 3 more copies of the indexed data to maintain.
Clustered vs. non-clustered
No INCLUDE columns on the non-clustered indexes
When SQL Server executes your query, it first searches the non-clustered index for the required data, then goes back to the base table (a bookmark/key lookup) to fetch the rest of the columns, because you are doing SELECT * and the non-clustered index doesn't contain all the columns. (That is what I think is happening; I can't really tell without the query plan.)
Include columns in non-clustered index: https://stackoverflow.com/a/1308325/1910735
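As a hedged example of what that looks like (the included columns below are purely illustrative; list whichever columns your query actually needs instead of using SELECT *):
CREATE NONCLUSTERED INDEX [IX_MachineryReading_MachineryId_DateRecorded_Covering] ON [dbo].[MachineryReading]
(
[MachineryId] ASC,
[DateRecorded] ASC
)
INCLUDE ([Latitude], [Longitude], [Speed], [OperationalSeconds])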
You should also maintain your indexes by creating a maintenance plan that checks for fragmentation and rebuilds or reorganizes them on a weekly basis.
I really think you should have a clustered index on MachineryId and DateRecorded instead of a non-clustered index. A table can only have one clustered index (this is the order the data is stored on disk), and since most of your queries filter on MachineryId and DateRecorded, it is better to store the rows in that order.
Also, if you really are searching by TrackerId in any query, consider adding it to the same clustered index.
IMPORTANT NOTE: drop the non-clustered indexes in a TEST environment first, before going live.
Create a clustered index in place of your non-clustered index, run different queries, and check the performance by comparing the query plans and STATISTICS IO.
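A sketch of that change (the index names here are mine; try it in the test environment first, as noted above):
DROP INDEX [IX_MachineryId] ON [dbo].[MachineryReading]
DROP INDEX [IX_MachineryId_DateRecorded] ON [dbo].[MachineryReading]
CREATE CLUSTERED INDEX [CIX_MachineryReading_MachineryId_DateRecorded] ON [dbo].[MachineryReading]
(
[MachineryId] ASC,
[DateRecorded] ASC
)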
Some resources for Index and SQL Query help:
Subscribe to the newsletter here and download the first responder kit:
https://www.brentozar.com/?s=first+responder
It is now open source, but I don't know if it still includes the getting-started PDF and help files (subscribe via the link above anyway for the weekly articles/tutorials):
https://github.com/BrentOzarULTD/SQL-Server-First-Responder-Kit
Tuning is per query, but in any case -
I see you have no partitions and no indexes, which means that, no matter what you do, it always results in a full table scan.
For your specific query -
create index ix_MachineryReading_MachineryId_DateRecorded
on MachineryReading (MachineryId, DateRecorded)
First, 10 inserts per second is very feasible under almost any reasonable circumstances.
Second, you need an index. For this query:
SELECT *
FROM MachineryReading
WHERE MachineryId = 2127 AND
DateRecorded > '2016-12-08 00:00:10.009' AND DateRecorded < '2016-12-11 18:32:41.734';
You need an index on MachineryReading(MachineryId, DateRecorded). That will probably solve your performance problem.
If you have similar queries for tracker, then you want an index on MachineryReading(TrackerId, DateRecorded).
These will slightly slow down the inserts, but the overall improvement should be so great that it will be a big win.
I'm currently busy building a small application that monitors a bunch of tables.
Each table contains temperature values punched in by a PLC.
The PLC sends the data every minute.
The client, however, wants to draw reports for a given day or week as a trend graph, and still display a table below the graph with the values depicted in the graph.
So the challenge is that for a day (1,440 readings), and even worse a week (10,080 readings), the report table becomes far too long, and plotting that many points on a trend graph the width of an A4/Letter page (even in landscape) is just silly, to say the least. The client knows this, so he asked if I can plot the values for every half hour instead.
This brings me to my question: how do I write a SELECT statement that returns only the values for every 30 minutes?
The example table has the following fields:
[ID] [int] IDENTITY(1,1) NOT NULL,
[Date] [datetime] NOT NULL CONSTRAINT [DF_SAPO_T3_Date] DEFAULT (getdate()),
[Air1] [decimal](5, 2) NULL,
[Name] [varchar](50) NULL,
[Email] [varchar](100) NULL,
[Number] [nvarchar](16) NULL,
[InAlarm] [bit] NULL,
Any help would be great.
Thanks
SELECT *
FROM tableName
WHERE DATEPART(MI,[DATE]) % 30 = 0
AND [DATE] < maxDate AND [DATE] > minDate
The above SQL will select every entry where the minute is 0 or 30, which, assuming you really are polling every minute, should satisfy what you need. It doesn't aggregate the data in between, though, so if you want an average for each 30-minute interval you'll need a different piece of SQL.
I think what you want to do is calculate the average over each 30 minute period
DECLARE @StartTime AS DATETIME = '20160101';
DECLARE @EndTime AS DATETIME = '20160108';
WITH SLOTS AS (SELECT @StartTime AS T
UNION ALL
SELECT DATEADD(n, 30, T) FROM SLOTS WHERE DATEADD(n, 30, T) < @EndTime)
SELECT S.T, COALESCE(AVG(Y.Air1), 0) AS Av_Temp
FROM SLOTS S
LEFT JOIN YourTable Y ON S.T <= Y.[Date] AND Y.[Date] < DATEADD(n, 30, S.T)
GROUP BY S.T
OPTION (MAXRECURSION 0); -- a week of 30-minute slots needs more than the default limit of 100 recursions
Tweaked it to allow for slots with no readings; possibly those should show N/A or not appear at all.
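An alternative sketch that buckets the rows into 30-minute slots with DATEADD/DATEDIFF arithmetic instead of a recursive CTE (table and column names taken from the question; only slots that actually contain readings will appear):
SELECT DATEADD(MINUTE, (DATEDIFF(MINUTE, 0, [Date]) / 30) * 30, 0) AS SlotStart,
       AVG([Air1]) AS Av_Temp
FROM YourTable
WHERE [Date] >= @StartTime AND [Date] < @EndTime
GROUP BY DATEADD(MINUTE, (DATEDIFF(MINUTE, 0, [Date]) / 30) * 30, 0)
ORDER BY SlotStart;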
I am building a C# application that inserts 2,000 records every second using bulk insert.
The database version is SQL Server 2008 R2.
The application calls a stored procedure that deletes records in chunks using TOP (10000) once they are more than 2 hours old. This is performed after each insert.
The end user selects records to view in a diagram using date ranges and a selection of 2 to 10 parameter ids.
Since the application will run 24/7 with no downtime, I am concerned about performance issues.
Partitioning is not an option since the customer doesn't have Enterprise edition.
Is the clustered index definition good?
Is it necessary to implement any index rebuild/reorganize routine to keep performance up, given that rows are inserted at one end of the table and removed at the other?
What about updating statistics, is that still an issue in 2008 R2?
I use OPTION (RECOMPILE) to avoid using outdated query plans in the select, is that a good approach?
Are there any table hints that can speed up the SELECT?
Any suggestions around locking strategies?
In addition to the scenario above, I have 3 more tables that work in the same way with different timeframes. One inserts every 20 seconds and deletes rows older than 1 week, another inserts every minute and deletes rows older than 6 weeks, and the last inserts every 5 minutes and deletes rows older than 3 years.
CREATE TABLE [dbo].[BufferShort](
[DateTime] [datetime2](2) NOT NULL,
[ParameterId] [int] NOT NULL,
[BufferStateId] [smallint] NOT NULL,
[Value] [real] NOT NULL,
CONSTRAINT [PK_BufferShort] PRIMARY KEY CLUSTERED
(
[DateTime] ASC,
[ParameterId] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
ALTER PROCEDURE [dbo].[DeleteFromBufferShort]
@DateTime datetime,
@BufferSizeInHours int
AS
BEGIN
DELETE TOP (10000)
FROM BufferShort
FROM BufferStates
WHERE BufferShort.BufferStateId = BufferStates.BufferStateId
AND BufferShort.[DateTime] < @DateTime
AND (BufferStates.BufferStateType = 'A' OR BufferStates.Deleted = 'True')
RETURN 0
END
ALTER PROCEDURE [dbo].[SelectFromBufferShortWithParameterList]
@DateTimeFrom Datetime2(2),
@DateTimeTo Datetime2(2),
@ParameterList varchar(max)
AS
BEGIN
SET NOCOUNT ON;
-- Split ParameterList into a temporary table
SELECT * INTO #TempTable FROM dbo.splitString(@ParameterList, ',');
SELECT *
FROM BufferShort Datapoints
JOIN Parameters P ON P.ParameterId = Datapoints.ParameterId
JOIN #TempTable TT ON TT.Token = P.ElementReference
WHERE Datapoints.[DateTime] BETWEEN @DateTimeFrom AND @DateTimeTo
ORDER BY [DateTime]
OPTION (RECOMPILE)
RETURN 0
END
This is a classic case of penny wise, pound foolish. At 2,000 rows per second you are inserting over 170 million records per day, and you are not using Enterprise edition.
The main reason not to use a clustered index is that the machine cannot keep up with the quantity of rows being inserted. Otherwise you should always use a clustered index. The decision of whether to use a clustered index is usually argued between those who believe that every table should have a clustered index and those who believe that perhaps one or two percent of tables should not have one. (I don't have time to engage in a 'religious' debate about this; just research the web.) I always go with a clustered index unless the inserts on a table are failing.
I would not use the STATISTICS_NORECOMPUTE clause; I would only turn statistics recomputation off if inserts are failing. Please see the article by Kimberly Tripp (an MVP and a real SQL Server expert) at http://sqlmag.com/blog/statisticsnorecompute-when-would-anyone-want-use-it.
I would also not use OPTION (RECOMPILE) unless you see queries are not using the right indexes (or join types) in the actual query plan. If your query is executed many times per minute/second this can have an unnecessary impact on the performance of your machine.
The clustered index definition seems good as long as all queries specify at least the leading DateTime column. The index will also maximize insert speed, assuming the times are incremental, as well as reduce fragmentation. You shouldn't need to reorganize/rebuild often.
If you have only the clustered index on this table, I wouldn't expect you need to update stats frequently because there isn't another data access path. If you have other indexes and complex queries, verify the index is branded ascending with the query below. You may need to update stats frequently if it is not branded ascending and you have complex queries:
DBCC TRACEON(2388);
DBCC SHOW_STATISTICS('dbo.BufferShort', 'PK_BufferShort');
DBCC TRACEOFF(2388);
For the @ParameterList, consider a table-valued parameter instead. Specify a primary key on Token in the table type.
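A rough sketch of that change (the type name and the Token column type are assumptions on my part):
CREATE TYPE dbo.ParameterTokenList AS TABLE (Token varchar(100) NOT NULL PRIMARY KEY);
GO
ALTER PROCEDURE [dbo].[SelectFromBufferShortWithParameterList]
@DateTimeFrom Datetime2(2),
@DateTimeTo Datetime2(2),
@ParameterList dbo.ParameterTokenList READONLY
AS
BEGIN
SET NOCOUNT ON;
-- No string splitting or temp table needed; join directly to the table-valued parameter
SELECT Datapoints.*
FROM BufferShort Datapoints
JOIN Parameters P ON P.ParameterId = Datapoints.ParameterId
JOIN @ParameterList TT ON TT.Token = P.ElementReference
WHERE Datapoints.[DateTime] BETWEEN @DateTimeFrom AND @DateTimeTo
ORDER BY Datapoints.[DateTime]
RETURN 0
END
On the C# side the values are passed as a DataTable (or DbDataReader) with SqlDbType.Structured.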
I would suggest you introduce the RECOMPILE hint only if needed; I suspect you will get a stable plan with a clustered index seek without it.
If you have blocking problems, consider altering the database to specify the READ_COMMITTED_SNAPSHOT option so that row versioning instead of blocking is used for read consistency. Note that this will add 14 bytes of row overhead and use tempdb more heavily, but the concurrency benefits might outweigh the costs.
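For reference, that is a one-time database setting (database name assumed; the statement needs a moment with no other active connections in the database to complete):
ALTER DATABASE YourDatabase SET READ_COMMITTED_SNAPSHOT ON;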
I need your help :)
I have a table in a database (SQL Server 2008 R2). Currently there are around 4M rows.
Consumer apps take rows from there (lock them and process).
To protect rows from being taken by more than one consumer, I'm locking them by writing a flag into the appropriate column.
So, to "lock" a record I do
SELECT TOP 1 .....
and then an UPDATE on the record with that specific ID.
This operation takes up to 5 seconds now (I tried in SQL Server Management Studio):
SELECT TOP 1 *
FROM testdb.dbo.myTable
WHERE recordLockedBy is NULL;
How can I speed it up?
Here is the table structure:
CREATE TABLE [dbo].[myTable](
[id] [int] IDENTITY(1,1) NOT NULL,
[num] [varchar](15) NOT NULL,
[date] [datetime] NULL,
[field1] [varchar](150) NULL,
[field2] [varchar](150) NULL,
[field3] [varchar](150) NULL,
[field4] [varchar](150) NULL,
[date2] [datetime] NULL,
[recordLockedBy] [varchar](100) NULL,
[timeLocked] [datetime] NULL,
[field5] [varchar](100) NULL);
Indexes should be placed on any columns you use in your query's where clause. Therefore you should add an index to recordLockedBy.
If you don't know about indexes look here
Quick starter for you:
CREATE NONCLUSTERED INDEX IDX_myTable_recordLockedBy
ON myTable (recordLockedBy);
Does your select statement query by id as well? If so, this should be set as a primary key with a clustered index (the default for PKs, I believe). SQL Server will then be able to jump directly to the record, which should be near instant. Without it, it will do a table scan, looking at every record in the order they appear on disk until it finds the one you're after.
This won't prevent a race condition on the table, though, which can allow the same row to be processed by multiple consumers.
Look at UPDLOCK and READPAST locking hints to handle this case:
http://www.mssqltips.com/sqlservertip/1257/processing-data-queues-in-sql-server-with-readpast-and-updlock/
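A common shape for that pattern, adapted to the table in the question (the @consumer parameter and the ORDER BY are illustrative; see the linked article for the full discussion):
UPDATE t
SET recordLockedBy = @consumer,
    timeLocked = GETUTCDATE()
OUTPUT inserted.id, inserted.num, inserted.[date]
FROM (
    SELECT TOP (1) *
    FROM dbo.myTable WITH (ROWLOCK, READPAST, UPDLOCK)
    WHERE recordLockedBy IS NULL
    ORDER BY id
) AS t;
-- READPAST skips rows already locked by other consumers; UPDLOCK keeps the selected
-- row locked, so the SELECT and UPDATE act as one atomic dequeue.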
If the table is used for job scheduling and processing, perhaps you can use MSMQ to solve this problem. You don't need to worry about locking and things like that. It also scales much better in enterprise scenarios and has many different send/receive modes.
You can learn more about it here:
http://msdn.microsoft.com/en-us/library/windows/desktop/ms711472(v=vs.85).aspx