I am using MariaDB. I have a table that I create for every IoT device at the time of its first insertion, using a stored procedure. In case anyone wonders why I create a new table per device: the devices publish data every 5 seconds, and it is impossible for me to store all of it in a single table.
So, my table structure is like below:
CREATE TABLE IF NOT EXISTS `mqttpacket_',device_serial_number,'`(
`data_type_id` int(11) DEFAULT NULL,
`data_value` int(11) DEFAULT NULL,
`inserted_date` DATE DEFAULT NULL,
`inserted_time` TIME DEFAULT NULL,
FOREIGN KEY(data_type_id) REFERENCES datatypes(id),
INDEX `index_mqttpacket`(`data_type_id`,`inserted_date`)) ENGINE = INNODB;
I have a very long SELECT query like the one below to fetch the data for the selected types between the chosen dates and times.
SELECT mqttpacket_123.data_value, datatypes.data_name, datatypes.value_mult,
CONCAT(mqttpacket_123.inserted_date, ' ',
mqttpacket_123.inserted_time) AS 'inserted_date_time'
FROM mqttpacket_123
JOIN datatypes ON mqttpacket_123.data_type_id = datatypes.id
WHERE mqttpacket_123.data_type_id IN(1,2,3,4,5,6)
AND CASE WHEN mqttpacket_123.inserted_date = '2021-11-08'
THEN mqttpacket_123.inserted_time > '12:25:00'
WHEN mqttpacket_123.inserted_date = '2021-11-15'
THEN mqttpacket_123.inserted_time < '12:25:00'
ELSE (mqttpacket_123.inserted_date BETWEEN '2021-11-08'
AND '2021-11-15')
END;
and this returns around 500k records shaped like the sample below:
| data_value | data_name | value_mult | inserted_date_time |
--------------------------------------------------------------------------------
| 271 | name_1 | 0.1 | 2021-11-08 12:25:04 |
| 106 | name_2 | 0.1 | 2021-11-08 12:25:04 |
| 66 | name_3 | 0.1 | 2021-11-08 12:25:04 |
| 285 | name_4 | 0.1 | 2021-11-08 12:25:04 |
| 61 | name_5 | 0.1 | 2021-11-08 12:25:04 |
| 454 | name_6 | 0.1 | 2021-11-08 12:25:04 |
| 299 | name_7 | 0.1 | 2021-11-08 12:25:04 |
Affected rows: 0 Found rows: 395,332 Warnings: 0 Duration for 1 query: 0.734 sec. (+ 7.547 sec. network)
I keep only the last 2 weeks' data in my tables and clean up the previous data as I have a backup system.
However, loading the query result into a DataTable also takes ~30 sec., which is about 4 times slower than it was on MySQL.
Do you have any suggestions to improve this performance?
PS. I call this query from C# with the following method, which goes through a stored procedure named SP_RunQuery that takes the query text and executes it as-is.
public DataTable CallStoredProcedureRunQuery(string QueryString)
{
DataTable dt = new DataTable();
try
{
using (var conn = new MySqlConnection(_connectionString))
{
conn.Open();
using (var cmd = new MySqlCommand("SP_RunQuery", conn))
{
cmd.CommandType = CommandType.StoredProcedure;
cmd.Parameters.Add("#query_string", MySqlDbType.VarChar).Value = QueryString;
using (MySqlDataAdapter sda = new MySqlDataAdapter(cmd))
{
sda.Fill(dt);
}
}
}
}
catch (Exception ex)
{
IoTemplariLogger.tLogger.EXC("Call Stored Procedure for RunQuery failed.", ex);
}
return dt;
}
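For reference, here is a minimal sketch of running the SELECT text directly with a MySqlCommand instead of going through SP_RunQuery (same connection string as above; whether it helps depends on where the 7.5 sec of network time is actually spent, so treat it as an experiment rather than a fix):
public DataTable RunQueryDirect(string queryString)
{
    var dt = new DataTable();
    using (var conn = new MySqlConnection(_connectionString))
    using (var cmd = new MySqlCommand(queryString, conn))
    {
        cmd.CommandTimeout = 120;              // large result sets; adjust as needed
        conn.Open();
        using (var reader = cmd.ExecuteReader())
        {
            dt.Load(reader);                   // fills the DataTable straight from the reader, no adapter layer
        }
    }
    return dt;
}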
EDIT: My sensors push a single MQTT packet which contains ~50 different data points. There are 12 five-second intervals in a minute, so I receive ~600 rows per minute per device.
Data insertion is done asynchronously through a stored procedure: I push the JSON content along with the device_id, then iterate over the JSON to parse the values and insert them into the table.
PS. The following code is just for clarification. It works fine.
/*Dynamic SQL -- if the device is registered to the system but has no table yet, create it.*/
SET create_table_query = CONCAT('CREATE TABLE IF NOT EXISTS `mqttpacket_',device_serial_number,'`(`data_type_id` int(11) DEFAULT NULL, `data_value` int(11) DEFAULT NULL,`inserted_date` DATE DEFAULT NULL, `inserted_time` TIME DEFAULT NULL, FOREIGN KEY(data_type_id) REFERENCES datatypes(id), INDEX `index_mqttpacket`(`data_type_id`,`inserted_date`)) ENGINE = InnoDB;');
PREPARE stmt FROM create_table_query;
EXECUTE stmt;
DEALLOCATE PREPARE stmt;
/*Loop over the incoming value array, which looks like: $.type_1,$.type_2,$.type_3, so we can iterate the JSON and reach each value like $.type_1*/
WHILE (LOCATE(',', value_array) > 0)
DO
SET arr_data_type_name = SUBSTRING_INDEX(value_array,',',1); /*pick first item of value array*/
SET value_array = SUBSTRING(value_array, LOCATE(',',value_array) + 1); /*remove picked first item from the value_array*/
SELECT JSON_EXTRACT(incoming_data, arr_data_type_name) INTO value_iteration; /*extract value of first item. $.type_1*/
SET arr_data_type_name := SUBSTRING_INDEX(arr_data_type_name, ".", -1); /*Remove the $ and the . to get pure data type name*/
/*Check whether the data type name already exists in the table; if not, insert it, then assign its id to lcl_data_type_id*/
IF (SELECT COUNT(id) FROM datatypes WHERE datatypes.data_name = arr_data_type_name) > 0 THEN
SELECT id INTO lcl_data_type_id FROM datatypes WHERE datatypes.data_name = arr_data_type_name LIMIT 1;
ELSE
SELECT devices.device_type_id INTO lcl_device_type FROM devices WHERE devices.id = lcl_device_id LIMIT 1;
INSERT INTO datatypes (datatypes.data_name,datatypes.description,datatypes.device_type_id,datatypes.value_mult ,datatypes.inserted_time) VALUES(arr_data_type_name,arr_data_type_name,lcl_device_type,0.1,NOW());
SELECT id INTO lcl_data_type_id FROM datatypes WHERE datatypes.data_name = arr_data_type_name LIMIT 1;
END IF;
/*Track which device has which datatypes inserted, so that datatypes are not retrieved unnecessarily for the selected device*/
IF (SELECT COUNT(device_id) FROM devicedatatypes WHERE devicedatatypes.device_id = lcl_device_id AND devicedatatypes.datatype_id = lcl_data_type_id) < 1 THEN
INSERT INTO devicedatatypes (devicedatatypes.device_id, devicedatatypes.datatype_id) VALUES(lcl_device_id,lcl_data_type_id);
END IF;
SET lcl_insert_mqtt_query = CONCAT('INSERT INTO mqttpacket_',device_serial_number,'(data_type_id,data_value,inserted_date,inserted_time) VALUES(',lcl_data_type_id,',',value_iteration,',''',data_date,''',''',data_time,''');');
PREPARE stmt FROM lcl_insert_mqtt_query;
EXECUTE stmt;
SET affected_data_row_count = affected_data_row_count + 1;
END WHILE;
Here and here are additional details about the server and the database, added in response to the comments.
The server has an SSD. Nothing else of importance runs on it besides my dotnet application and the database.
It is usually better to have a DATETIME column instead of splitting it into two (DATE and TIME) columns. That might simplify the WHERE clause.
Having one table per device is usually a bad idea. Instead, add a column for the device_id.
Not having a PRIMARY KEY is a bad idea. Do you ever get two readings in the same second for a specific device? Probably not.
Rolling those together plus some other likely changes, start by changing the table to
CREATE TABLE IF NOT EXISTS `mqttpacket`(
`device_serial_number` SMALLINT UNSIGNED NOT NULL,
`data_type_id` TINYINT UNSIGNED NOT NULL,
`data_value` SMALLINT NOT NULL,
`inserted_at` DATETIME NOT NULL,
FOREIGN KEY(data_type_id) REFERENCES datatypes(id),
PRIMARY KEY(device_serial_number, `data_type_id`,`inserted_at`)
) ENGINE = INNODB;
That PK will make the query faster.
This may be what you are looking for after the change to DATETIME:
AND inserted_at >= '2021-11-08 12:25:00'
AND inserted_at < '2021-11-08 12:25:00' + INTERVAL 7 DAY
To keep 2 weeks' worth of data, DROP PARTITION is an efficient way to do the delete. I would use PARTITION BY RANGE(TO_DAYS(inserted_at)) and have 16 partitions, as discussed in http://mysql.rjweb.org/doc.php/partitionmaint
If you are inserting a thousand rows every 5 seconds -- with table-per-device you would need a thousand threads, each doing one insert, which would be a nightmare for the architecture. With a single table (as I suggest), and if you can gather those 1000 rows together in one process at the same time, you can do one multi-row INSERT every 5 seconds. I discuss other high-speed ingestion techniques elsewhere.
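To illustrate that batching idea from the C# side, here is a rough sketch. It assumes the single mqttpacket table proposed above and a simple in-memory buffer; the names and the buffering strategy are assumptions, not existing code:
// using System; using System.Collections.Generic; using System.Text; using MySql.Data.MySqlClient;
// Each buffered reading is (device_serial_number, data_type_id, data_value, inserted_at).
public static void FlushReadings(List<Tuple<int, int, int, DateTime>> buffer, string connectionString)
{
    if (buffer.Count == 0) return;
    var sql = new StringBuilder("INSERT INTO mqttpacket (device_serial_number, data_type_id, data_value, inserted_at) VALUES ");
    using (var conn = new MySqlConnection(connectionString))
    using (var cmd = conn.CreateCommand())
    {
        for (int i = 0; i < buffer.Count; i++)
        {
            if (i > 0) sql.Append(',');
            sql.AppendFormat("(@d{0}, @t{0}, @v{0}, @at{0})", i);
            cmd.Parameters.AddWithValue("@d" + i, buffer[i].Item1);
            cmd.Parameters.AddWithValue("@t" + i, buffer[i].Item2);
            cmd.Parameters.AddWithValue("@v" + i, buffer[i].Item3);
            cmd.Parameters.AddWithValue("@at" + i, buffer[i].Item4);
        }
        cmd.CommandText = sql.ToString();
        conn.Open();
        cmd.ExecuteNonQuery();                 // one round trip for the whole 5-second batch
    }
    buffer.Clear();
}
If the batches grow large, keep an eye on max_allowed_packet on the server side.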
Rate Per Second = RPS
Suggestions to consider for your instance's [mysqld] section:
innodb_io_capacity=500 # from 200 to use more of available SSD IOPS
innodb_log_file_size=256M # from 48M to reduce log rotation frequency
innodb_log_buffer_size=128M # from 16M to reduce log rotation avg 25 minutes
innodb_lru_scan_depth=100 # from 1024 to conserve 90% CPU cycles used for function
innodb_buffer_pool_size=10G # from 128M to reduce innodb_data_reads 85 RPS
innodb_change_buffer_max_size=50 # from 25 percent to expedite pages created 590 RPhr
One additional observation:
innodb_flush_method=O_DIRECT # from fsync for method typically used on LX systems
You should find that these changes significantly improve task completion performance.
There are additional opportunities to tune Global Variables.
Related
I'm building a function in C# to unpivot a complex table in a CSV file and insert it into a SQL table. The file looks something like this:
| 1/5/2018 | 1/5/2018 | 1/6/2018 | 1/6/2018...
City: | min: | max: | min: | max:
Boston(KBOS) | 1 | 10 | 5 | 12
My goal is to unpivot it like so:
airport_code | localtime | MinTemp | MaxTemp
KBOS | 1/5/2018 | 1 | 10
KBOS | 1/6/2018 | 5 | 12
My strategy is:
Store the first row of dates and the second row of headers into arrays
Use a CSV parser to read each following line and loop through each field
If the date that corresponds to the current field is the same as the previous one, it belongs in the same row. Put the data into the appropriate field.
Since there are only two temperature fields for each row, the row is then complete and can be inserted.
Otherwise, start a new row and put the data into the appropriate field.
However, I'm running into a problem: once insertRow is populated and inserted, I can't overwrite it or null out all its fields and use it again - that throws an error saying the row has already been inserted. I can't move the declaration of insertRow inside the for loop because I need to preserve the data through multiple iterations to completely fill out the row. So instead I tried to declare it outside the loop but only initialize it inside the loop, something like:
if(insertRow == null)
{
insertRow = MyDataSet.tblForecast.NewtblForecastRow();
}
But that throws a "use of unassigned local variable" error. Any ideas about how I can preserve insertRow on some iterations and dispose of it on others? Or, any suggestions about a better way to do what I'm looking for? The relevant portion of the code is below:
using (TextFieldParser csvParser = new TextFieldParser(FileName))
{
csvParser.SetDelimiters(new string[] { "," });
csvParser.ReadLine(); //Skip top line
string[] dateList = csvParser.ReadFields();//Get dates from second line.
string[] fieldNames = csvParser.ReadFields();//Get headers from third line
//Read through file
while (!csvParser.EndOfData)
{
DataSet1.tblForecastRow insertRow = MyDataSet.tblForecast.NewtblForecastRow();
string[] currRec = csvParser.ReadFields();
//Get airport code
string airportCode = currRec[0].Substring(currRec[0].LastIndexOf("(") + 1, 4);
//Unpivot record
DateTime currDate = DateTime.Parse("1/1/1900");//initialize
DateTime prevDate;
for (int i = 1; i<fieldNames.Length; i++) //skip first col
{
prevDate = currDate;//previous date is the prior current date
DateTime.TryParse(dateList[i], out currDate);//set new current date
int val;
int.TryParse(currRec[i], out val);
switch (fieldNames[i].ToLower())
{
case "min:":
insertRow["MinTemp"] = val;
break;
case "max:":
insertRow["MaxTemp"] = val;
break;
}
if (currDate == prevDate)//if same date, at end of row, insert
{
insertRow["airport_code"] = airportCode;
insertRow["localTime"] = currDate;
insertRow["Forecasted_date"] = DateTime.Today;
MyDataSet.tblForecast.AddtblForecastRow(insertRow);
ForecastTableAdapter.Update(MyDataSet.tblForecast);
}
}
}
}
You create a new row once you have finished handling the current one. And you already know where that is:
if (currDate == prevDate)//if same date, at end of row, insert
{
insertRow["airport_code"] = airportCode;
insertRow["localTime"] = currDate;
insertRow["Forecasted_date"] = DateTime.Today;
// we're storing insertRow
MyDataSet.tblForecast.AddtblForecastRow(insertRow);
// now it gets saved (note: calling Update here saves on every row, which is quite often)
ForecastTableAdapter.Update(MyDataSet.tblForecast);
// OKAY, let's create the new insertRow instance
insertRow = MyDataSet.tblForecast.NewtblForecastRow();
// and now on the next time we end up in this if
// the row we just created will be inserted
}
Your initial row can be created outside the loop:
// first row creation
DataSet1.tblForecastRow insertRow = MyDataSet.tblForecast.NewtblForecastRow();
//Read through file
while (!csvParser.EndOfData)
{
// the insertRow creation that used to be here was moved out of the while loop
string[] currRec = csvParser.ReadFields();
I'm running a 3-node Cassandra 3.0.0 cluster on AWS EC2 i3.large instances, and I've been playing around with the C# driver for Cassandra. Executing the following query (which is very simple) takes approximately 300 ms to scan the single partition and return the top 100 rows.
var rs = session.Execute("SELECT col1, col6, col7 FROM breadcrumbs WHERE col1='samplepk' LIMIT 100;");
My data model is:
Column 1 = a 13-character string
Column 2 = a 23-character string
Column 3 = a date/time timestamp
Column 4 = a 4-digit integer
Column 5 = a 3 digit integer
Column 6 = a latitude value
Column 7 = a longitude value
Column 8 = a 15-digit double
Column 9 = a 15-digit double
I defined my primary key as (col1, col2).
My C# driver code is as follows:
Cluster cluster = Cluster.Builder().AddContactPoint(~~~~~IP Here~~~~).Build();
ISession session = cluster.Connect(~~~keyspacename~~~);
long ticks = DateTime.Now.Ticks;
var rs = session.Execute("SELECT col2, col6, col7 FROM breadcrumbs WHERE partitionkey=~targetkey~ LIMIT 100;");
Console.WriteLine((DateTime.Now.Ticks - ticks)/Math.Pow(10,4)+" ms");
Console.ReadKey();
Is that abnormally slow or are my expectations too high? If it is slow, does anyone have any ideas about what's causing it?
If I forgot to provide any pertinent details, please leave a comment :) .
Thanks in advance.
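Before drawing conclusions from that 300 ms figure, it may be worth ruling out measurement overhead: DateTime.Now.Ticks is coarse, and the first Execute also pays prepare and warm-up costs. A sketch of timing a prepared statement with Stopwatch instead (contact point, keyspace, and names are placeholders taken from the question):
var cluster = Cluster.Builder().AddContactPoint("10.0.0.1").Build();   // placeholder IP
var session = cluster.Connect("mykeyspace");                           // placeholder keyspace

var prepared = session.Prepare("SELECT col1, col6, col7 FROM breadcrumbs WHERE col1 = ? LIMIT 100;");
session.Execute(prepared.Bind("samplepk"));      // warm-up round trip, not timed

var sw = System.Diagnostics.Stopwatch.StartNew();
var rs = session.Execute(prepared.Bind("samplepk"));
sw.Stop();
Console.WriteLine(sw.Elapsed.TotalMilliseconds + " ms");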
I'm trying to collect statistics on some SQL queries.
I'm using the RetrieveStatistics() method of the SqlConnection class to get statistics and the ExecuteReader() method of SqlCommand to run the query.
RetrieveStatistics() method returns dictionary filled with statistics on the query executed.
When I run a regular query, the SelectRows property of the dictionary contains the actual number of rows returned by the query. But when I run a stored procedure, SelectRows is always zero, although the reader definitely contains rows.
I call ResetStatistics() before each query and StatisticsEnabled is set to true.
Here's my Powershell code:
### Stored procedure
$cn = New-Object system.data.sqlclient.sqlconnection
$cn.ConnectionString = "Data Source=localhost;Initial Catalog=XXXX;Integrated Security=SSPI"
$cn.StatisticsEnabled = $true
$cmd = $cn.CreateCommand()
$cmd.CommandText = "[dbo].[spGetXXXX]"
$cmd.CommandType = "StoredProcedure"
$cn.Open()
$cmd.ExecuteReader()
# several rows returned
$cn.RetrieveStatistics()
Name Value
---- -----
BytesReceived 300
SumResultSets 1
ExecutionTime 5
Transactions 0
BuffersReceived 1
IduRows 0
ServerRoundtrips 1
PreparedExecs 0
BytesSent 132
SelectCount 1
CursorOpens 0
ConnectionTime 51299
Prepares 0
SelectRows 0
UnpreparedExecs 1
NetworkServerTime 3
BuffersSent 1
IduCount 0
### Regular SQL query
$cn2 = New-Object system.data.sqlclient.sqlconnection
$cn2.ConnectionString = "Data Source=localhost;Initial Catalog=XXXX;Integrated Security=SSPI"
$cn2.StatisticsEnabled = $true
$cmd2 = $cn2.CreateCommand()
$cmd2.CommandText = "SELECT * FROM XXXX"
$cn2.Open()
$cmd2.ExecuteReader()
#rows returned
$cn2.RetrieveStatistics()
Name Value
---- -----
BytesReceived 12357
SumResultSets 1
ExecutionTime 12
Transactions 0
BuffersReceived 2
IduRows 0
ServerRoundtrips 1
PreparedExecs 0
BytesSent 98
SelectCount 1
CursorOpens 0
ConnectionTime 11407
Prepares 0
SelectRows 112
UnpreparedExecs 1
NetworkServerTime 0
BuffersSent 1
IduCount 0
The difference between the two runs is that after the stored procedure call the reader stays open, so the row-count statistic is not updated.
So the correct code would be like:
$connection.ResetStatistics()
$reader = $command.ExecuteReader()
$reader.Close() # !
$connection.RetrieveStatistics()
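The same fix expressed in C# (a sketch assuming System.Data.SqlClient; the key point is closing, or fully consuming, the reader before calling RetrieveStatistics()):
using (var conn = new SqlConnection(connectionString))
using (var cmd = new SqlCommand("[dbo].[spGetXXXX]", conn) { CommandType = CommandType.StoredProcedure })
{
    conn.StatisticsEnabled = true;
    conn.Open();
    conn.ResetStatistics();

    using (var reader = cmd.ExecuteReader())
    {
        while (reader.Read()) { /* consume the rows */ }
    }                                            // reader is closed here

    var stats = conn.RetrieveStatistics();
    Console.WriteLine(stats["SelectRows"]);      // now reflects the rows returned
}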
This is a followup to my first question "Porting “SQL” export to T-SQL".
I am working with a 3rd-party program that I have no control over and cannot change. This program exports its internal database into a set of .sql files, each with a format of:
INSERT INTO [ExampleDB] ( [IntField] , [VarcharField], [BinaryField])
VALUES
(1 , 'Some Text' , 0x123456),
(2 , 'B' , NULL),
--(SNIP, it does this for 1000 records)
(999, 'E' , null);
(1000 , 'F' , null);
INSERT INTO [ExampleDB] ( [IntField] , [VarcharField] , BinaryField)
VALUES
(1001 , 'asdg', null),
(1002 , 'asdf' , 0xdeadbeef),
(1003 , 'dfghdfhg' , null),
(1004 , 'sfdhsdhdshd' , null),
--(SNIP 1000 more lines)
This pattern continues until the .sql file reaches a file size set during the export. The export files are grouped as EXPORT_PATH\%Table_Name%\Export#.sql, where # is a counter starting at 1.
Currently I have about 1.3 GB of data, exported in 1 MB chunks (1407 files across 26 tables; all but 5 tables have only one file, and the largest table has 207 files).
Right now I just have a simple C# program that reads each file into RAM and then calls ExecuteNonQuery. The issue is that I am averaging 60 sec/file, which means it will take about 23 hours to load the entire export.
I assume that if I could somehow format the files to be loaded with BULK INSERT instead of INSERT INTO, it would go much faster. Is there any easy way to do this, or do I have to write some kind of find & replace and keep my fingers crossed that it does not fail on some corner case and blow up my data?
Any other suggestions on how to speed up the insert into would also be appreciated.
UPDATE:
I ended up going with the parse-and-SqlBulkCopy method. It went from 1 file/min to 1 file/sec.
Well, here is my "solution" for helping convert the data into a DataTable or otherwise (run it in LINQPad):
var i = "(null, 1 , 'Some''\n Text' , 0x123.456)";
var pat = @",?\s*(?:(?<n>null)|(?<w>[\w.]+)|'(?<s>.*)'(?!'))";
Regex.Matches(i, pat,
RegexOptions.IgnoreCase | RegexOptions.Singleline).Dump();
The match should be run once per value group (e.g. (a,b,etc)). Parsing of the results (e.g. conversion) is left to the caller and I have not tested it [much]. I would recommend creating the correctly-typed DataTable first -- although it may be possible to pass everything "as a string" to the database? -- and then use the information in the columns to help with the extraction process (possibly using type converters). For the captures: n is null, w is word (e.g. number), s is string.
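A sketch of consuming those captures to build one row of values (valueGroup is a single parenthesised value list, pat is the pattern above; type conversion is still left to the caller, as noted):
var row = new List<object>();
foreach (Match m in Regex.Matches(valueGroup, pat, RegexOptions.IgnoreCase | RegexOptions.Singleline))
{
    if (m.Groups["n"].Success)
        row.Add(DBNull.Value);                                  // SQL NULL
    else if (m.Groups["w"].Success)
        row.Add(m.Groups["w"].Value);                           // number or hex literal, still as text
    else
        row.Add(m.Groups["s"].Value.Replace("''", "'"));        // string, with doubled quotes unescaped
}
// dataTable.Rows.Add(row.ToArray());   // after converting to the column types you need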
Happy coding.
Apparently your data is always wrapped in parentheses and starts with a left parenthesis. You might want to use this rule to Split (with RemoveEmptyEntries) each of those lines and load them into a DataTable. Then you can use SqlBulkCopy to copy everything at once into the database.
This approach would not necessarily be fail-safe, but it would be certainly faster.
Edit: Here's how you could get the schema for every table:
private static DataTable extractSchemaTable(IEnumerable<String> lines)
{
DataTable schema = null;
var insertLine = lines.SkipWhile(l => !l.StartsWith("INSERT INTO [")).Take(1).First();
var startIndex = insertLine.IndexOf("INSERT INTO [") + "INSERT INTO [".Length;
var endIndex = insertLine.IndexOf("]", startIndex);
var tableName = insertLine.Substring(startIndex, endIndex - startIndex);
using (var con = new SqlConnection("CONNECTION"))
{
using (var schemaCommand = new SqlCommand("SELECT * FROM " + tableName, con))
{
con.Open();
using (var reader = schemaCommand.ExecuteReader(CommandBehavior.SchemaOnly))
{
// Load() gives an empty DataTable with the destination table's columns;
// GetSchemaTable() would instead return a table describing those columns.
schema = new DataTable(tableName);
schema.Load(reader);
}
}
}
return schema;
}
Then you simply need to iterate over each line in the file, check whether it starts with (, and split that line with Split(new[] { ',' }, StringSplitOptions.RemoveEmptyEntries). Then you can add the resulting array to the created schema table.
Something like this:
var allLines = System.IO.File.ReadAllLines(path);
DataTable result = extractSchemaTable(allLines);
for (int i = 0; i < allLines.Length; i++)
{
String line = allLines[i];
if (line.StartsWith("("))
{
String data = line.Substring(1, line.Length - (line.Length - line.LastIndexOf(")")) - 1);
var fields = data.Split(new[] { ',' }, StringSplitOptions.RemoveEmptyEntries);
// you might need to parse it to correct DataColumn.DataType
result.Rows.Add(fields);
}
}
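The asker's UPDATE says SqlBulkCopy is what finally worked; for completeness, a minimal sketch of pushing the filled DataTable to SQL Server (connection string and destination table name are placeholders):
// using System.Data.SqlClient;
using (var bulk = new SqlBulkCopy("CONNECTION"))
{
    bulk.DestinationTableName = "ExampleDB";   // or the table name parsed in extractSchemaTable
    bulk.BatchSize = 5000;                     // tune for your data
    bulk.BulkCopyTimeout = 0;                  // no timeout for large files
    bulk.WriteToServer(result);                // 'result' is the DataTable filled above
}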
A little background information: I have a table called table_a, which has 12 columns. I want to insert or update rows with values for 6 of these columns, while not losing the data in the other 6 columns. And I want to do this with a parameterized query in C#.
field1 is Unique.
> SELECT * FROM table_a;
+----+--------+--------+---+---------+---------+
| Id | field1*| field2 |...|field11 | field12 |
+----+--------+--------+---+---------+---------+
| 1 | AA | BB |...| KK | LL |
| 2 | AA | BB |...| KK | LL |
| 3 | AA | BB |...| KK | LL |
| 4 | AA | BB |...| KK | LL |
+----+--------+--------+---+---------+---------+
The problem is, my first thought was to use REPLACE INTO; unfortunately this deletes the 6 untouched values:
> REPLACE INTO table_a (field1, ..., field6) VALUES ('AA', ...);
> REPLACE INTO table_a (field1, ..., field6) VALUES ('AB', ...);
+----+--------+--------+---+---------+---------+
| Id | field1*| field2 |...| field11 | field12 |
+----+--------+--------+---+---------+---------+
| 1 | AA | BB |...| NULL | NULL |
| 2 | AB | BB |...| NULL | NULL |
| 3 | AC | BB |...| KK | LL |
| 4 | AD | BB |...| KK | LL |
+----+--------+--------+---+---------+---------+
My second thought was to use INSERT INTO ... ON DUPLICATE KEY UPDATE, but then I'd have to bind the parameters a second time: once in the INSERT part and again in the UPDATE part, like this:
INSERT INTO table_a (field1, ..., field6)
VALUES(?, ..., ?)
ON DUPLICATE KEY UPDATE
field1 = ?, ..., field6 = ?;
That would preserve my data, but I have to bind the parameters twice.
The third option would be to create another two queries and use the SELECT and INSERT INTO/UPDATE pattern.
So, my question is, how do I do this the smart way?
Your second option sounds like a winner for single row updates.
Your third option is good if you insert/update many rows at once (as it will not matter much that you have two queries then - providing each does only what it is supposed to do).
UPDATE:
Digging through documentation one finds that you can bind once if you wish - you can refer to the originally bound values with VALUES()
UPDATE2:
Well, actually you cannot get at the bound values with VALUES(column), so instead here are two suggestions that might actually help:
did you check using named parameters (then you would not need to bind them twice; see the sketch after these suggestions)?
did you consider stored procedures?
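For the named-parameter suggestion, here is a sketch with a MySQL ADO.NET provider. Table and column names come from the question, and it assumes the provider lets the same named parameter appear in both the INSERT list and the UPDATE clause, which the common MySQL providers generally allow:
// Sketch only: each value is bound exactly once and reused by name.
// field1 is the UNIQUE key, so it is left out of the UPDATE clause.
var sql = @"INSERT INTO table_a (field1, field2, field3, field4, field5, field6)
            VALUES (@f1, @f2, @f3, @f4, @f5, @f6)
            ON DUPLICATE KEY UPDATE
            field2 = @f2, field3 = @f3, field4 = @f4, field5 = @f5, field6 = @f6;";

using (var conn = new MySqlConnection(connectionString))
using (var cmd = new MySqlCommand(sql, conn))
{
    cmd.Parameters.AddWithValue("@f1", "AA");
    cmd.Parameters.AddWithValue("@f2", "BB");
    // ... @f3 to @f6 likewise, each added once
    conn.Open();
    cmd.ExecuteNonQuery();
}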
I think you've listed all the available options, along with the pros/cons of each. As for the third option, you would probably want to wrap your two queries in a transaction to ensure that the operation remains atomic.
Hi, let's say you want to modify field2 through field6.
Why wouldn't you do:
replace into table_a select field1,new_value2,...,new_value6,field7,...,field12 from table_a where field1=filter_field1;
You put in the new values and get the other values by querying the table you're updating.