So here is the deal:
I will be dealing with 1-second data that may accumulate for up to a month, across up to 40 sensors (columns). The data arrives in exact one-second increments, and I need to be able to quickly run logic over these sensor values. Doing the math:
30 days * 24 hours/day * 60 min/hour * 60 sec/min is roughly 2.6 million rows, and
2.6 million rows x 40 columns is over
100 million data points.
Most of the data is of type double, but I will have booleans, ints, and strings as well. I am entirely new to this, so all my ideas on how to handle it are entirely my own; they may be absurd, and I may be missing an obvious solution. Here are some options I have considered:
1) One massive table - I am concerned about the performance of this option.
2) DataTables split by date, reducing an individual table to 86,400 rows.
3) DataTables split by hour, dealing with only 3,600 rows per table; however, at that point I start to have an excessive number of tables.
Let me give a little more detail on the nature of the logic: it is entirely sequential, meaning I will start at the first row and go all the way to the last. I will essentially be making repeated passes through the data with a number of different algorithms in order to produce my desired results. I am running SQL on Azure. My natural inclination is to move fast and make a lot of mistakes, but in this case I think some experience and advice on how I set up this database will pay off. So, any suggestions?
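For context, here is a rough sketch of how I picture consuming option 1 (one table, ordered by the timestamp) from my code. The table and column names (SensorData, ts, sensor01, ...) are just placeholders, not a real schema:

using System;
using System.Data.SqlClient;

class SequentialPass
{
    static void Main()
    {
        using (var conn = new SqlConnection("<azure-sql-connection-string>"))
        {
            conn.Open();
            var cmd = new SqlCommand(
                "SELECT ts, sensor01, sensor02 FROM SensorData ORDER BY ts", conn);
            using (var reader = cmd.ExecuteReader())
            {
                // Stream row by row in timestamp order instead of loading a month into memory.
                while (reader.Read())
                {
                    DateTime ts = reader.GetDateTime(0);
                    double sensor01 = reader.GetDouble(1);
                    double sensor02 = reader.GetDouble(2);
                    // ... per-row logic for the current algorithm goes here ...
                }
            }
        }
    }
}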
Related
I've got a SQL query which returns 20 columns and about 500,000 rows at the moment. The values keep changing because people are working on the data in the database.
Most columns in the query aren't simple selects; there are a lot of 'case when' expressions, and the data is joined from 5 tables.
Is there a way to show the data in a GridView in an efficient manner? Right now I show all the data (500,000 rows) and it takes a long time. I've tried pagination, but when I want to take, for example, 100 rows with an offset of 10 rows, the whole query is executed and it takes too long.
How can I cope with this?
I think you have two separate issues here:
Slow query: be sure to optimize your query. There are literally thousands of articles on the net. My first step is always to check the indexes on the columns I'm joining the tables by. Start by analyzing the execution plan; you'll quickly discover the main problem(s).
The sheer number of records: 500,000 is at least 100 times too many for any human, completely unusable. There are two solutions: limit the number of returned records (add more criteria) or use server mode for the grid.
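For the paging part, one pattern worth trying instead of OFFSET is keyset ("seek") paging, so the server only reads the page you ask for rather than re-running the whole query. A rough sketch, assuming SQL Server and that the result set exposes a unique, indexed key (Id and dbo.MyBigView are placeholders for your actual query):

using System.Data;
using System.Data.SqlClient;

static DataTable GetNextPage(string connectionString, long lastIdShown, int pageSize = 100)
{
    // Remember the last Id the user saw and fetch only the next page after it.
    const string pageSql = @"
        SELECT TOP (@pageSize) Id, Col1, Col2   -- your 20 columns / CASE WHEN expressions
        FROM dbo.MyBigView                      -- the joined 5-table query wrapped in a view
        WHERE Id > @lastId
        ORDER BY Id;";

    using (var conn = new SqlConnection(connectionString))
    using (var cmd = new SqlCommand(pageSql, conn))
    {
        cmd.Parameters.AddWithValue("@pageSize", pageSize);
        cmd.Parameters.AddWithValue("@lastId", lastIdShown);

        var page = new DataTable();
        conn.Open();
        page.Load(cmd.ExecuteReader());   // only pageSize rows cross the wire
        return page;
    }
}

That is roughly what a grid's server mode arranges for you behind the scenes.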
First I build a list (by reading existing files) of approximately 12,000 objects that look like this:
public class Operator
{
    public string identifier; // e.g. "7/1/2017 MN01 Day"
    public string name1;
    public string name2;
    public string id1;
    public string id2;
}
The identifier will be unique within the list.
Next I run a large query (currently about 4 million rows but it could be as large as 10 million, and about 20 columns). Then I write all of this to a CSV line by line using a write stream. For each line I loop over the Operator list to find a match and add those columns.
The problem I am having is with performance. I expect this report to take a long time to run but I've determined that the file writing step is taking especially long (about 4 hours). I suspect that it has to do with looping over the Operator list 4 million times.
Is there any way I can improve the speed of this? Perhaps by doing something when I build the list initially (indexing or sorting, maybe) that will allow searching to be done much faster.
You should be able to greatly speed up your code by building a Dictionary (a hash table):
var items = list.ToDictionary(i => i.identifier, i => i);
You can then index in on this dictionary:
var item = items["7/1/2017 MN01 Day"];
Building the dictionary is an O(n) operation, and a lookup into the dictionary is an O(1) operation. This means the total work becomes roughly linear, O(n + m), rather than the quadratic O(n × m) you get from scanning the whole list for every row.
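Concretely, inside your CSV-writing loop the lookup would look something like this (reader, writer, and BuildBaseCsvLine stand in for your existing data reader, write stream, and per-row formatting):

// One O(1) dictionary lookup per row instead of an O(n) scan over the 12,000-item list.
while (reader.Read())
{
    string key = reader.GetString(0);          // whichever column holds the identifier
    string line = BuildBaseCsvLine(reader);    // your existing per-row CSV formatting

    if (items.TryGetValue(key, out Operator op))
    {
        line += "," + op.name1 + "," + op.name2 + "," + op.id1 + "," + op.id2;
    }

    writer.WriteLine(line);
}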
... but also, "couldn't you somehow put those operators into a database table, so that you could use some kind of JOIN operation in your SQL?"
Another possibility that comes to mind is ... "twenty different queries, one for each symbol." Or, a UNION query with twenty branches. If there is any way for the SQL engine to use indexes, on its side, to speed up that process, you would still come out ahead.
Right now, vast amounts of time might be being wasted packaging up every one of those millions of rows, squirting them through the network wires to your machine, only to discard most of them because they don't match any symbol.
If you control the database and can afford the space, and if, say, most of the rows don't match any symbol, consider a symbols table and a symbols_matched table, the second being a many-to-many join table that pre-identifies which rows match which symbol(s). It might well be worth the space, to save the time. (The process of populating this table could be put to a stored procedure which is TRIGGERed by appropriate insert, update, and delete events ...)
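As a concrete sketch of the "push the operators into the database" idea, assuming SQL Server and a scratch table you are allowed to create (dbo.Operators here is hypothetical), you could bulk-load the 12,000 operators once and let the engine do the join:

using System.Collections.Generic;
using System.Data;
using System.Data.SqlClient;

static void UploadOperators(SqlConnection conn, IEnumerable<Operator> operators)
{
    if (conn.State != ConnectionState.Open)
        conn.Open();

    // Stage the ~12,000 operators in a DataTable and bulk-load them in one round trip.
    var table = new DataTable();
    table.Columns.Add("identifier", typeof(string));
    table.Columns.Add("name1", typeof(string));
    table.Columns.Add("name2", typeof(string));
    table.Columns.Add("id1", typeof(string));
    table.Columns.Add("id2", typeof(string));

    foreach (var op in operators)
        table.Rows.Add(op.identifier, op.name1, op.name2, op.id1, op.id2);

    using (var bulk = new SqlBulkCopy(conn) { DestinationTableName = "dbo.Operators" })
        bulk.WriteToServer(table);

    // The big report query then becomes something like:
    //   SELECT r.*, o.name1, o.name2, o.id1, o.id2
    //   FROM   <your raw data query> r
    //   JOIN   dbo.Operators o ON o.identifier = <whatever expression builds your match key>
    // so only matched, already-joined rows come back over the wire.
}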
It is difficult to tell you how to speed up your file write without seeing any code.
But in general it could be worth considering writing using multiple threads. This SO post has some helpful info, and you could of course Google for more.
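If you do try that, a common shape is a single dedicated writer thread fed by a bounded queue, so the row processing and the disk I/O overlap. Whether it actually helps depends on whether the disk write really is the bottleneck; this is just the pattern, not a tuned solution:

using System.Collections.Concurrent;
using System.IO;
using System.Threading.Tasks;

// One writer task drains the queue while the main loop keeps producing CSV lines.
var queue = new BlockingCollection<string>(boundedCapacity: 10000);

var writerTask = Task.Run(() =>
{
    using (var writer = new StreamWriter("report.csv"))
    {
        foreach (string line in queue.GetConsumingEnumerable())
            writer.WriteLine(line);
    }
});

// Producer side: the existing loop that builds each CSV line calls queue.Add(csvLine).

queue.CompleteAdding();   // no more lines coming
writerTask.Wait();        // let the writer drain the queue and flush the file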
I have an issue with the performance of my program, written in C#.
In the first loop, the table inserts and updates 175,000 records in 54 secs.
In the second loop, with 175,000 records: 1 min 11 secs.
Next, the third loop, with 18,195 records: 1 min 28 secs.
As the loops go on, the time taken keeps growing; even a batch of 125 records can take up to 2 mins.
I am wondering why smaller batches are taking longer to update. Doesn't the number of records being updated affect the time taken to complete the loop?
Can anyone enlighten me on this?
Flow of Program:
Insert into TableA (date,time) select date,time from rawdatatbl where id >= startID AND id <= maxID; -- startID is the ID following the last inserted record
update TableA set columnName = values, columnName1 = values, columnName2 = values, columnName.....
I'm using InnoDB.
The reported behavior is consistent with a growing table and an inefficient query execution plan for the UPDATE statements. The most likely explanation is that the UPDATE is performing a full table scan to locate the rows to be updated, because an appropriate index is not available. As more and more rows are added to the table, that full table scan takes longer and longer.
Quick recommendations:
review the query execution plan (obtained by running EXPLAIN)
verify that suitable indexes are available and are being used
Apart from that, there's tuning of the MySQL instance itself. But that's going to depend on which storage engine the tables are using, MyISAM, InnoDB, et al.
Please provide SHOW CREATE TABLE for both tables, and the actual statements. Here are some guesses...
The target table has indexes. Since the indexes are built as the inserts occur, any "random" indexes will become slower and slower.
innodb_buffer_pool_size was so small that caching became a problem.
The UPDATE seems to be a full table update. Well, the table is larger each time.
How did you get startID from one query before doing the next one (which has id>=startID)? Perhaps that code is slower as you get farther into the table.
You say "in the second loop", where is the "loop"? Or were you referring to the INSERT...SELECT as a "loop"?
I need to get around 50k-100k records from a table. Two of the fields hold very long strings. Field1 is up to 2048 characters and field2 is up to 255.
Getting just these two fields for 50k rows takes around 120 seconds. Is there a way to use compression or somehow optimize the retrieval of this data? I'm using a data adapter to fill a data table.
Note: It's just a select statement, no where clause.
Simple answer: DON'T PULL 50,000 to 100,000 rows. Period. Mass transfers always take time, and compression would put a lot of stress on the CPU. I have yet to come across a case where pulling that much data (outside of pure data transfers) is a worthwhile proposition; most of the time it is a sign of a bad architecture.
I use the following columns stored in a SQL table called tb_player:
Date of Birth (Date), Times Played (Integer), Versions (Integer)
to calculate a "playvalue" (integer) in the following formula:
playvalue = (Today - Date of Birth) * Times Played * Versions
I display up to 100 of these records, with the associated playvalue, on a webpage at any time.
My question is: what is the most efficient way of calculating this playvalue, given that it changes only once a day, because (today - date of birth) changes? The other values (times played and versions) remain the same.
Is there a better way than calculating this on the fly each time for the 100 records? If so, is it more efficient to do the calculation in a stored proc or in VB.NET/C#?
In a property/method on the object, in C#/VB.NET (your .NET code).
The time to execute a simple property like this is nothing compared to the time of an out-of-process call to the database (to fetch the rows in the first place) or the transport time of a web page; you'll never notice it if it's just used for UI display. Plus it runs on your easily scaled-out hardware (the app server), doesn't involve a huge daily update, is only executed for rows that are actually displayed, and only if you actually query this property/method.
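In other words, something along these lines (a sketch; the class shape and property names are made up, and I'm treating (today - date of birth) as whole days):

using System;

public class Player
{
    public DateTime DateOfBirth { get; set; }
    public int TimesPlayed { get; set; }
    public int Versions { get; set; }

    // Recomputed on access; trivial compared to the database round-trip that loaded the row.
    public int PlayValue
    {
        get { return (DateTime.Today - DateOfBirth).Days * TimesPlayed * Versions; }
    }
}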
Are you finding that this is actually causing a performance problem? I don't imagine it would be very bad, since the calculation is pretty straightforward math.
However, if you are actually concerned about it, my approach would be to basically set up a "playvalue cache" column in the tb_player table. This column will store the calculated "playvalue" for each player for the current day. Set up a cronjob or scheduled task to run at midnight every day and update this column with the new day's value.
Then everything else can simply select this column instead of doing the calculation, and you only have to do the calculation once a day.
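The nightly job itself can be a single set-based statement; for example, as a small console app run by the scheduled task (assuming SQL Server, a playvalue_cache column, and guessed column names date_of_birth, times_played, and versions):

using System.Data.SqlClient;

class RefreshPlayValues
{
    static void Main()
    {
        // Recompute the cached playvalue for every player in one UPDATE, once a day.
        const string sql =
            "UPDATE tb_player " +
            "SET playvalue_cache = DATEDIFF(day, date_of_birth, GETDATE()) * times_played * versions;";

        using (var conn = new SqlConnection("<connection string>"))
        using (var cmd = new SqlCommand(sql, conn))
        {
            conn.Open();
            cmd.ExecuteNonQuery();
        }
    }
}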