I'm trying to create a database (using SQLite in C#) with a table to store sentences, a table to store each individual word that is used in those sentences, and a junction table that relates the words to the sentences that they appear in.
I am then trying to populate the database with over 15 million sentences. I estimate that there are about 150 million inserts happening in my code.
Right now, my code is only doing a couple hundred sentences per second, which will take forever to get through this huge data set. How can I make this faster?
I tried putting the ENTIRE thing in a single transaction, but I'm not sure if that can work due to the huge amount of data. So I am using one transaction for each sentence.
Tables:
CREATE TABLE sentences ( sid INTEGER NOT NULL PRIMARY KEY, sentence TEXT );
CREATE TABLE words ( wid INTEGER NOT NULL PRIMARY KEY, dictform TEXT UNIQUE);
CREATE TABLE sentence_words( sid INTEGER NOT NULL, wid INTEGER NOT NULL, CONSTRAINT PK_sentence_words PRIMARY KEY ( sid, wid ), FOREIGN KEY(sid) REFERENCES Sentences(sid), FOREIGN KEY(wid) REFERENCES Words(wid));
Code:
while ((input = sr.ReadLine()) != null) //read in a new sentence
{
tr = m_dbConnection.BeginTransaction();
sql = "INSERT INTO sentences (sentence) VALUES(#sentence)";
cmd = new SQLiteCommand(sql, m_dbConnection);
cmd.Parameters.AddWithValue("#sentence", input);
cmd.ExecuteNonQuery();
dict_words = jutils.onlyDict(input); //convert all words to their 'dictionary form'
var words = dict_words.Split(' ');
foreach (var wd in words) //for each word
{
sql = "INSERT or IGNORE INTO words (dictform) VALUES(#dictform)";
cmd = new SQLiteCommand(sql, m_dbConnection);
cmd.Parameters.AddWithValue("#dictform", wd);
cmd.ExecuteNonQuery();
sql = "INSERT or IGNORE INTO sentence_words (sid, wid) VALUES((SELECT sid FROM sentences WHERE sentence = #sentence), (SELECT wid FROM words WHERE dictform = #dictform))";
cmd = new SQLiteCommand(sql, m_dbConnection);
cmd.Parameters.AddWithValue("#sentence", input);
cmd.Parameters.AddWithValue("#dictform", wd);
cmd.ExecuteNonQuery();
}
tr.Commit();
}
Always, we must avoid the 'one by one' SQL tasks when dealing with such large data.
In my case, (if memory is not burdened), load the data into a DataTable and manipulate(with LINQ) as you need, and finally use SqlBulkCopy at the end.
There's SqlBulkUpdate also but created by private author supporting from SQL 2008.
If under 2008, we can still quickly do this but have to create temporary SQL table and use UPDATE Join command.
SqlBulkCopy is really fast like just seconds.
You can group by a certain number of lines
int lines = -1;
int group = 500;
while ((input = sr.ReadLine()) != null)
{
lines++;
if (lines%group == 0) {
tr = m_dbConnection.BeginTransaction();
}
and commit each group
if (lines%group == group-1) {
tr.Commit();
}
}
if (lines%group >= 0 && lines%group < group-1) {
tr.Commit();
}
I had some old cases where I was forced to use .Net 4.0 and BulkCopy was not an option due to its async implementation.
Just put the entire thing in a single transaction. (The rollback journal stores the old version of any page changed by the transaction, so for an empty database, it cannot become big.)
Furthermore, all the searches for the sentence value are inefficient because the database has to go through the entire table. If that column had an index (explicitly, or implicitly with a UNIQUE constraint), these lookups would be much faster.
Related
I want to create simple database in runtime, fill it with data from internal resource and then read each record through loop. Previously I used LiteDb for that but I couldn't squeeze time anymore so
I choosed SQLite.
I think there are few things to improve I am not aware of.
Database creation process:
First step is to create table
using var create = transaction.Connection.CreateCommand();
create.CommandText = "CREATE TABLE tableName (Id TEXT PRIMARY KEY, Value TEXT) WITHOUT ROWID";
create.ExecuteNonQuery();
Next insert command is defined
var insert = transaction.Connection.CreateCommand();
insert.CommandText = "INSERT OR IGNORE INTO tableName VALUES (#Id, #Record)";
var idParam = insert.CreateParameter();
var valueParam = insert.CreateParameter();
idParam.ParameterName = "#" + IdColumn;
valueParam.ParameterName = "#" + ValueColumn;
insert.Parameters.Add(idParam);
insert.Parameters.Add(valueParam);
Through loop each value is inserted
idParameter.Value = key;
valueParameter.Value = value.ValueAsText;
insert.Parameters["#Id"] = idParameter;
insert.Parameters["#Value"] = valueParameter;
insert.ExecuteNonQuery();
Transaction commit transaction.Commit();
Create index
using var index = transaction.Connection.CreateCommand();
index.CommandText = "CREATE UNIQUE INDEX idx_tableName ON tableName(Id);";
index.ExecuteNonQuery();
And after that i perform milion selects (to retrieve single value):
using var command = _connection.CreateCommand();
command.CommandText = "SELECT Value FROM tableName WHERE Id = #id;";
var param = command.CreateParameter();
param.ParameterName = "#id";
param.Value = id;
command.Parameters.Add(param);
return command.ExecuteReader(CommandBehavior.SingleResult).ToString();
For all select's one connection is shared and never closed. Insert is quite fast (less then minute) but select's are very troublesome here. Is there a way to improve them?
Table is quite big (around ~2 milions records) and Value contains quite heavy serialized objects.
System.Data.SQLite provider is used and connection string contains this additional options: Version=3;Journal Mode=Off;Synchronous=off;
If you go for performance, you need to consider this: each independent SELECT command is a roundtrip to the DB with some extra costs. It's similar to a N+1 select problem in case of parent-child relations.
The best thing you can do is to get a LIST of items (values):
SELECT Value FROM tableName WHERE Id IN (1, 2, 3, 4, ...);
Here's a link on how to code that: https://www.mikesdotnetting.com/article/116/parameterized-in-clauses-with-ado-net-and-linq
You could have the select command not recreated for every Id but created once and only executed for every Id. From your code it seems every select is CreateCommand/CreateParameters and so on. See this for example: https://learn.microsoft.com/en-us/dotnet/api/system.data.idbcommand.prepare?view=net-5.0 - you run .Prepare() once and then only execute (they don't need to be NonQuery)
you could then try to see if you can be faster with ExecuteScalar and not having reader created for one data result, like so: https://learn.microsoft.com/en-us/dotnet/api/system.data.idbcommand.executescalar?view=net-5.0
If scalar will not prove to be faster then you could try to use .SingleRow instead of .SingleResult in your ExecuteReader for possible performance optimisations. According to this: https://learn.microsoft.com/en-us/dotnet/api/system.data.commandbehavior?view=net-5.0 it might work. I doubt that but if first two don't help, why not try it too.
I have been asked to look at finding the most efficient way to take a DataTable input and write it to a SQL Server table using C#. The snag is that the solution must use ODBC Connections throughout, this rules out sqlBulkCopy. The solution must also work on all SQL Server versions back to SQL Server 2008 R2.
I am thinking that the best approach would be to use batch inserts of 1000 rows at a time using the following SQL syntax:
INSERT INTO dbo.Table1(Field1, Field2)
SELECT Value1, Value2
UNION
SELECT Value1, Value2
I have already written the code the check if a table corresponding to the DataTable input already exists on the SQL Server and to create one if it doesn't.
I have also written the code to create the INSERT statement itself. What I am struggling with is how to dynamically build the SELECT statements from the rows in the data table. How can I access the values in the rows to build my SELECT statement? I think I will also need to check the data type of each column in order to determine whether the values need to be enclosed in single quotes (') or not.
Here is my current code:
public bool CopyDataTable(DataTable sourceTable, OdbcConnection targetConn, string targetTable)
{
OdbcTransaction tran = null;
string[] selectStatement = new string[sourceTable.Rows.Count];
// Check if targetTable exists, create it if it doesn't
if (!TableExists(targetConn, targetTable))
{
bool created = CreateTableFromDataTable(targetConn, sourceTable);
if (!created)
return false;
}
try
{
// Prepare insert statement based on sourceTable
string insertStatement = string.Format("INSERT INTO [dbo].[{0}] (", targetTable);
foreach (DataColumn dataColumn in sourceTable.Columns)
{
insertStatement += dataColumn + ",";
}
insertStatement += insertStatement.TrimEnd(',') + ") ";
// Open connection to target db
using (targetConn)
{
if (targetConn.State != ConnectionState.Open)
targetConn.Open();
tran = targetConn.BeginTransaction();
for (int i = 0; i < sourceTable.Rows.Count; i++)
{
DataRow row = sourceTable.Rows[i];
// Need to iterate through columns in row, getting values and data types and building a SELECT statement
selectStatement[i] = "SELECT ";
}
insertStatement += string.Join(" UNION ", selectStatement);
using (OdbcCommand cmd = new OdbcCommand(insertStatement, targetConn, tran))
{
cmd.ExecuteNonQuery();
}
tran.Commit();
return true;
}
}
catch
{
tran.Rollback();
return false;
}
}
Any advice would be much appreciated. Also if there is a simpler approach than the one I am suggesting then any details of that would be great.
Ok since we cannot use stored procedures or Bulk Copy ; when I modelled the various approaches a couple of years ago, the key determinant to performance was the number of calls to the server. So batching a set of MERGE or INSERT statements into a single call separated by semi-colons was found to be the fastest method. I ended up batching my SQL statements. I think the max size of a SQL statement was 32k so I chopped up my batch into units of that size.
(Note - use StringBuilder instead of concatenating strings manually - it has a beneficial effect on performance)
Psuedo-code
string sqlStatement = "INSERT INTO Tab1 VALUES {0},{1},{2}";
StringBuilder sqlBatch = new StringBuilder();
foreach(DataRow row in myDataTable)
{
sqlBatch.AppendLine(string.Format(sqlStatement, row["Field1"], row["Field2"], row["Field3"]));
sqlBatch.Append(";");
}
myOdbcConnection.ExecuteSql(sqlBatch.ToString());
You need to deal with batch size complications, and formatting of the correct field data types in the string-replace step, but otherwise this will be the best performance.
Marked solution of PhillipH is open for several mistakes and SQL injection.
Normally you should build a DbCommand with parameters and execute this instead of executing a self build SQL statement.
The CommandText must be "INSERT INTO Tab1 VALUES ?,?,?" for ODBC and OLEDB, SqlClient needs named parameters ("#<Name>").
Parameters should be added with the dimensions of underlaying column.
I'm building a .NET application that talks to an Oracle 11g database. I am trying to take data from Excel files provided by a third party and upsert (UPDATE record if exists, INSERT if not), but am having some trouble with performance.
These Excel files are to replace tariff codes and descriptions, so there are a couple thousand records in each file.
| Tariff | Description |
|----------------------------------------|
| 1234567890 | 'Sample description here' |
I did some research on bulk inserting, and even wrote a function that opens a transaction in the application, executes a bunch of UPDATE or INSERT statements, then commits. Unfortunately, that takes a long time and prolongs the session between the application and the database.
public void UpsertMultipleRecords(string[] updates, string[] inserts) {
OleDbConnection conn = new OleDbConnection("connection string here");
conn.Open();
OleDbTransaction trans = conn.BeginTransaction();
try {
for (int i = 0; i < updates.Length; i++) {
OleDbCommand cmd = new OleDbCommand(updates[i], conn);
cmd.Transaction = trans;
int count = cmd.ExecuteNonQuery();
if (count < 1) {
cmd = new OleDbCommand(inserts[i], conn);
cmd.Transaction = trans;
}
}
trans.Commit();
} catch (OleDbException ex) {
trans.Rollback();
} finally {
conn.Close();
}
}
I found via Ask Tom that an efficient way of doing something like this is using an Oracle MERGE statement, implemented in 9i. From what I understand, this is only possible using two existing tables in Oracle. I've tried but don't understand temporary tables or if that's possible. If I create a new table that just holds my data when I MERGE, I still need a solid way of bulk inserting.
The way I usually upload my files to merge, is by first inserting into a load table with sql*loader and then executing a merge statement from the load table into the target table.
A temporary table will only retain it's contents for the duration of the session. I expect sql*loader to end the session upon completion, so better use a normal table that you truncate after the merge.
merge into target_table t
using load_table l on (t.key = l.key) -- brackets are mandatory
when matched then update
set t.col = l.col
, t.col2 = l.col2
, t.col3 = l.col3
when not matched then insert
(t.key, t.col, t.col2, t.col3)
values
(l.key, l.col, l.col2, l.col3)
I am using C# and using SqlBulkCopy. I have a problem though. I need to do a mass insert into one table then another mass insert into another table.
These 2 have a PK/FK relationship.
Table A
Field1 -PK auto incrementing (easy to do SqlBulkCopy as straight forward)
Table B
Field1 -PK/FK - This field makes the relationship and is also the PK of this table. It is not auto incrementing and needs to have the same id as to the row in Table A.
So these tables have a one to one relationship but I am unsure how to get back all those PK Id that the mass insert made since I need them for Table B.
Edit
Could I do something like this?
SELECT *
FROM Product
WHERE NOT EXISTS (SELECT * FROM ProductReview WHERE Product.ProductId = ProductReview.ProductId AND Product.Qty = NULL AND Product.ProductName != 'Ipad')
This should find all the rows that where just inserted with the sql bulk copy. I am not sure how to take the results from this then do a mass insert with them from a SP.
The only problem I can see with this is that if a user is doing the records one at a time and a this statement runs at the same time it could try to insert a row twice into the "Product Review Table".
So say I got like one user using the manual way and another user doing the mass way at about the same time.
manual way.
1. User submits data
2. Linq to sql Product object is made and filled with the data and submited.
3. this object now contains the ProductId
4. Another linq to sql object is made for the Product review table and is inserted(Product Id from step 3 is sent along).
Mass way.
1. User grabs data from a user sharing the data.
2. All Product rows from the sharing user are grabbed.
3. SQL Bulk copy insert on Product rows happens.
4. My SP selects all rows that only exist in the Product table and meets some other conditions
5. Mass insert happens with those rows.
So what happens if step 3(manual way) is happening at the same time as step 4(mass way). I think it would try to insert the same row twice causing a primary constraint execption.
In that scenario, I would use SqlBulkCopy to insert into a staging table (i.e. one that looks like the data I want to import, but isn't part of the main transactional tables), and then at the DB to a INSERT/SELECT to move the data into the first real table.
Now I have two choices depending on the server version; I could do a second INSERT/SELECT to the second real table, or I could use the INSERT/OUTPUT clause to do the second insert , using the identity rows from the table.
For example:
-- dummy schema
CREATE TABLE TMP (data varchar(max))
CREATE TABLE [Table1] (id int not null identity(1,1), data varchar(max))
CREATE TABLE [Table2] (id int not null identity(1,1), id1 int not null, data varchar(max))
-- imagine this is the SqlBulkCopy
INSERT TMP VALUES('abc')
INSERT TMP VALUES('def')
INSERT TMP VALUES('ghi')
-- now push into the real tables
INSERT [Table1]
OUTPUT INSERTED.id, INSERTED.data INTO [Table2](id1,data)
SELECT data FROM TMP
If your app allows it, you could add another column in which you store an identifier of the bulk insert (a guid for example). You would set this id explicitly.
Then after the bulk insert, you just select the rows that have that identifier.
I had the same issue where I had to get back ids of the rows inserted with SqlBulkCopy.
My ID column was an identity column.
Solution:
I have inserted 500+ rows with bulk copy, and then selected them back with the following query:
SELECT TOP InsertedRowCount *
FROM MyTable
ORDER BY ID DESC
This query returns the rows I have just inserted with their ids. In my case I had another unique column. So I selected that column and id. Then mapped them with a IDictionary like so:
IDictionary<string, int> mymap = new Dictionary<string, int>()
mymap[Name] = ID
Hope this helps.
My approach is similar to what RiceRiceBaby described, except one important thing to add is that the call to retrieve Max(Id) needs to be a part of a transaction, along with the call to SqlBulkCopy.WriteToServer. Otherwise, someone else may insert during your transaction and this would make your Id's incorrect. Here is my code:
public static void BulkInsert<T>(List<ColumnInfo> columnInfo, List<T> data, string
destinationTableName, SqlConnection conn = null, string idColumn = "Id")
{
NLogger logger = new NLogger();
var closeConn = false;
if (conn == null)
{
closeConn = true;
conn = new SqlConnection(_connectionString);
conn.Open();
}
SqlTransaction tran =
conn.BeginTransaction(System.Data.IsolationLevel.Serializable);
try
{
var options = SqlBulkCopyOptions.KeepIdentity;
var sbc = new SqlBulkCopy(conn, options, tran);
var command = new SqlCommand(
$"SELECT Max({idColumn}) from {destinationTableName};", conn,
tran);
var id = command.ExecuteScalar();
int maxId = 0;
if (id != null && id != DBNull.Value)
{
maxId = Convert.ToInt32(id);
}
data.ForEach(d =>
{
maxId++;
d.GetType().GetProperty(idColumn).SetValue(d, maxId);
});
var dt = ConvertToDataTable(columnInfo, data);
sbc.DestinationTableName = destinationTableName;
foreach (System.Data.DataColumn dc in dt.Columns)
{
sbc.ColumnMappings.Add(dc.ColumnName, dc.ColumnName);
}
sbc.WriteToServer(dt);
tran.Commit();
if(closeConn)
{
conn.Close();
conn = null;
}
}
catch (Exception ex)
{
tran.Rollback();
logger.Write(LogLevel.Error, $#"An error occurred while performing a bulk
insert into table {destinationTableName}. The entire
transaction has been rolled back.
{ex.ToString()}");
throw ex;
}
}
Depending on your needs and how much control you have of the tables, you may want to consider using UNIQUEIDENTIFIERs (Guids) instead of your IDENTITY primary keys. This moves key management outside of the database and into your application. There are some serious tradeoffs to this approach, so it may not meet your needs. But it may be worth considering. If you know for sure that you'll be pumping a lot of data into your tables via bulk-insert, it is often really handy to have those keys managed in your object model rather than your application relying on the database to give you back the data.
You could also take a hybrid approach with staging tables as suggested before. Get the data into those tables using GUIDs for the relationships, and then via SQL statements you could get the integer foreign keys in order and pump data into your production tables.
I would:
Turn on identity insert on the table
Grab the Id of the last row of the table
Loop from (int i = Id; i < datable.rows.count+1; i++)
In the loop, assign the Id property of your datable to i+1.
Run your SQL bulk insert with your keep identity turned on.
Turn identity insert back off
I think that's the safest way to get your ids on an SQL bulk insert because it will prevent mismatched ids that could caused by the application be executed on another thread.
Disclaimer: I'm the owner of the project C# Bulk Operations
The library overcome SqlBulkCopy limitations and add flexible features like output inserted identity value.
Behind the code, it does exactly like the accepted answer but way easier to use.
var bulk = new BulkOperation(connection);
// Output Identity
bulk.ColumnMappings.Add("ProductID", ColumnMappingDirectionType.Output);
// ... Column Mappings...
bulk.BulkInsert(dt);
I am working on a console application to insert data to a MS SQL Server 2005 database. I have a list of objects to be inserted. Here I use Employee class as example:
List<Employee> employees;
What I can do is to insert one object at time like this:
foreach (Employee item in employees)
{
string sql = #"INSERT INTO Mytable (id, name, salary)
values ('#id', '#name', '#salary')";
// replace #par with values
cmd.CommandText = sql; // cmd is IDbCommand
cmd.ExecuteNonQuery();
}
Or I can build a balk insert query like this:
string sql = #"INSERT INTO MyTable (id, name, salary) ";
int count = employees.Count;
int index = 0;
foreach (Employee item in employees)
{
sql = sql + string.format(
"SELECT {0}, '{1}', {2} ",
item.ID, item.Name, item.Salary);
if ( index != (count-1) )
sql = sql + " UNION ALL ";
index++
}
cmd.CommandType = sql;
cmd.ExecuteNonQuery();
I guess the later case is going to insert rows of data at once. However, if I have
several ks of data, is there any limit for SQL query string?
I am not sure if one insert with multiple rows is better than one insert with one row of data, in terms of performance?
Any suggestions to do it in a better way?
Actually, the way you have it written, your first option will be faster.
Your second example has a problem in it. You are doing sql = + sql + etc. This is going to cause a new string object to be created for each iteration of the loop. (Check out the StringBuilder class). Technically, you are going to be creating a new string object in the first instance too, but the difference is that it doesn't have to copy all the information from the previous string option over.
The way you have it set up, SQL Server is going to have to potentially evaluate a massive query when you finally send it which is definitely going to take some time to figure out what it is supposed to do. I should state, this is dependent on how large the number of inserts you need to do. If n is small, you are probably going to be ok, but as it grows your problem will only get worse.
Bulk inserts are faster than individual ones due to how SQL server handles batch transactions. If you are going to insert data from C# you should take the first approach and wrap say every 500 inserts into a transaction and commit it, then do the next 500 and so on. This also has the advantage that if a batch fails, you can trap those and figure out what went wrong and re-insert just those. There are other ways to do it, but that would definately be an improvement over the two examples provided.
var iCounter = 0;
foreach (Employee item in employees)
{
if (iCounter == 0)
{
cmd.BeginTransaction;
}
string sql = #"INSERT INTO Mytable (id, name, salary)
values ('#id', '#name', '#salary')";
// replace #par with values
cmd.CommandText = sql; // cmd is IDbCommand
cmd.ExecuteNonQuery();
iCounter ++;
if(iCounter >= 500)
{
cmd.CommitTransaction;
iCounter = 0;
}
}
if(iCounter > 0)
cmd.CommitTransaction;
In MS SQL Server 2008 you can create .Net table-UDT that will contain your table
CREATE TYPE MyUdt AS TABLE (Id int, Name nvarchar(50), salary int)
then, you can use this UDT in your stored procedures and your с#-code to batch-inserts.
SP:
CREATE PROCEDURE uspInsert
(#MyTvp AS MyTable READONLY)
AS
INSERT INTO [MyTable]
SELECT * FROM #MyTvp
C# (imagine that records you need to insert already contained in Table "MyTable" of DataSet ds):
using(conn)
{
SqlCommand cmd = new SqlCommand("uspInsert", conn);
cmd.CommandType = CommandType.StoredProcedure;
SqlParameter myParam = cmd.Parameters.AddWithValue
("#MyTvp", ds.Tables["MyTable"]);
myParam.SqlDbType = SqlDbType.Structured;
myParam.TypeName = "dbo.MyUdt";
// Execute the stored procedure
cmd.ExecuteNonQuery();
}
So, this is the solution.
Finally I want to prevent you from using code like yours (building the strings and then execute this string), because this way of executing may be used for SQL-Injections.
look at this thread,
I've answered there about table valued parameter.
Bulk-copy is usually faster than doing inserts on your own.
If you still want to do it in one of your suggested ways you should make it so that you can easily change the size of the queries you send to the server. That way you can optimize for speed in your production environment later on. Query times may v ary alot depending on the query size.
The batch size for a SQL Server query is listed at being 65,536 * the network packet size. The network packet size is by default 4kbs but can be changed. Check out the Maximum capacity article for SQL 2008 to get the scope. SQL 2005 also appears to have the same limit.