I have been asked to find the most efficient way to take a DataTable input and write it to a SQL Server table using C#. The snag is that the solution must use ODBC connections throughout, which rules out SqlBulkCopy. The solution must also work on all SQL Server versions back to SQL Server 2008 R2.
I am thinking that the best approach would be to use batch inserts of 1000 rows at a time using the following SQL syntax:
INSERT INTO dbo.Table1(Field1, Field2)
SELECT Value1, Value2
UNION
SELECT Value1, Value2
I have already written the code to check if a table corresponding to the DataTable input already exists on the SQL Server, and to create one if it doesn't.
I have also written the code to create the INSERT statement itself. What I am struggling with is how to dynamically build the SELECT statements from the rows in the data table. How can I access the values in the rows to build my SELECT statement? I think I will also need to check the data type of each column in order to determine whether the values need to be enclosed in single quotes (') or not.
Here is my current code:
public bool CopyDataTable(DataTable sourceTable, OdbcConnection targetConn, string targetTable)
{
    OdbcTransaction tran = null;
    string[] selectStatement = new string[sourceTable.Rows.Count];

    // Check if targetTable exists, create it if it doesn't
    if (!TableExists(targetConn, targetTable))
    {
        bool created = CreateTableFromDataTable(targetConn, sourceTable);
        if (!created)
            return false;
    }

    try
    {
        // Prepare insert statement based on sourceTable
        string insertStatement = string.Format("INSERT INTO [dbo].[{0}] (", targetTable);
        foreach (DataColumn dataColumn in sourceTable.Columns)
        {
            insertStatement += dataColumn.ColumnName + ",";
        }
        insertStatement = insertStatement.TrimEnd(',') + ") ";

        // Open connection to target db
        using (targetConn)
        {
            if (targetConn.State != ConnectionState.Open)
                targetConn.Open();
            tran = targetConn.BeginTransaction();

            for (int i = 0; i < sourceTable.Rows.Count; i++)
            {
                DataRow row = sourceTable.Rows[i];
                // Need to iterate through columns in row, getting values and data types
                // and building a SELECT statement
                selectStatement[i] = "SELECT ";
            }
            insertStatement += string.Join(" UNION ", selectStatement);

            using (OdbcCommand cmd = new OdbcCommand(insertStatement, targetConn, tran))
            {
                cmd.ExecuteNonQuery();
            }
            tran.Commit();
            return true;
        }
    }
    catch
    {
        if (tran != null)
            tran.Rollback();
        return false;
    }
}
Any advice would be much appreciated. Also if there is a simpler approach than the one I am suggesting then any details of that would be great.
OK, since we cannot use stored procedures or bulk copy: when I modelled the various approaches a couple of years ago, the key determinant of performance was the number of calls to the server. Batching a set of MERGE or INSERT statements into a single call, separated by semicolons, was found to be the fastest method, so I ended up batching my SQL statements. I think the maximum size of a SQL statement was 32k, so I chopped up my batch into units of that size.
(Note: use StringBuilder instead of concatenating strings manually; it has a beneficial effect on performance.)
Pseudo-code:
string sqlStatement = "INSERT INTO Tab1 VALUES {0},{1},{2}";
StringBuilder sqlBatch = new StringBuilder();
foreach(DataRow row in myDataTable)
{
sqlBatch.AppendLine(string.Format(sqlStatement, row["Field1"], row["Field2"], row["Field3"]));
sqlBatch.Append(";");
}
myOdbcConnection.ExecuteSql(sqlBatch.ToString());
You need to deal with batch size complications, and formatting of the correct field data types in the string-replace step, but otherwise this will be the best performance.
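For the batch-splitting step, something like this works (a sketch only: the 32k figure is the limit mentioned above rather than a documented constant, and the Tab1/Field names are carried over from the pseudo-code):

using System;
using System.Data;
using System.Data.Odbc;
using System.Text;

static class BatchInsert
{
    const int MaxBatchChars = 32000; // assumed limit; verify for your server/driver

    public static void InsertInBatches(DataTable table, OdbcConnection conn)
    {
        var batch = new StringBuilder();
        foreach (DataRow row in table.Rows)
        {
            // NOTE: naive formatting; real code must escape/format values per column type
            string stmt = string.Format(
                "INSERT INTO Tab1 VALUES ({0},'{1}','{2}');",
                row["Field1"], row["Field2"], row["Field3"]);
            if (batch.Length + stmt.Length > MaxBatchChars)
                Flush(batch, conn);
            batch.AppendLine(stmt);
        }
        Flush(batch, conn); // send the final partial batch
    }

    static void Flush(StringBuilder batch, OdbcConnection conn)
    {
        if (batch.Length == 0) return;
        using (var cmd = new OdbcCommand(batch.ToString(), conn))
            cmd.ExecuteNonQuery();
        batch.Clear();
    }
}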
The marked solution from PhillipH is open to several mistakes and to SQL injection.
Normally you should build a DbCommand with parameters and execute that, instead of executing a self-built SQL statement.
The CommandText must be "INSERT INTO Tab1 VALUES (?,?,?)" for ODBC and OLEDB; SqlClient needs named parameters ("@Name").
Parameters should be added with the dimensions of the underlying columns.
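A minimal sketch of that approach (assuming an open OdbcConnection named conn and the three-column Tab1 from the answer above; ODBC binds parameters positionally via the ? markers, so the names given to the OdbcParameter objects are only labels, and the sizes/types shown are illustrative):

using System.Data;
using System.Data.Odbc;

// Prepare one parameterized command and reuse it for every row.
using (var cmd = new OdbcCommand("INSERT INTO Tab1 VALUES (?, ?, ?)", conn))
{
    // Match types and sizes to the underlying columns.
    cmd.Parameters.Add("@p1", OdbcType.Int);
    cmd.Parameters.Add("@p2", OdbcType.VarChar, 50);
    cmd.Parameters.Add("@p3", OdbcType.VarChar, 50);
    cmd.Prepare();

    foreach (DataRow row in myDataTable.Rows)
    {
        cmd.Parameters[0].Value = row["Field1"];
        cmd.Parameters[1].Value = row["Field2"];
        cmd.Parameters[2].Value = row["Field3"];
        cmd.ExecuteNonQuery();
    }
}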
I want to create a simple database at runtime, fill it with data from an internal resource, and then read each record in a loop. Previously I used LiteDB for that, but I couldn't squeeze any more time out of it, so I chose SQLite.
I think there are a few things to improve that I am not aware of.
Database creation process:
The first step is to create the table:
using var create = transaction.Connection.CreateCommand();
create.CommandText = "CREATE TABLE tableName (Id TEXT PRIMARY KEY, Value TEXT) WITHOUT ROWID";
create.ExecuteNonQuery();
Next, the insert command is defined:
var insert = transaction.Connection.CreateCommand();
insert.CommandText = "INSERT OR IGNORE INTO tableName VALUES (@Id, @Value)";
var idParam = insert.CreateParameter();
var valueParam = insert.CreateParameter();
idParam.ParameterName = "@" + IdColumn;
valueParam.ParameterName = "@" + ValueColumn;
insert.Parameters.Add(idParam);
insert.Parameters.Add(valueParam);
Each value is then inserted in a loop:
idParam.Value = key;
valueParam.Value = value.ValueAsText;
insert.ExecuteNonQuery();
The transaction is then committed with transaction.Commit();
Create index
using var index = transaction.Connection.CreateCommand();
index.CommandText = "CREATE UNIQUE INDEX idx_tableName ON tableName(Id);";
index.ExecuteNonQuery();
And after that I perform a million selects (each retrieving a single value):
using var command = _connection.CreateCommand();
command.CommandText = "SELECT Value FROM tableName WHERE Id = @id;";
var param = command.CreateParameter();
param.ParameterName = "@id";
param.Value = id;
command.Parameters.Add(param);
using var reader = command.ExecuteReader(CommandBehavior.SingleResult);
return reader.Read() ? reader.GetString(0) : null;
One connection is shared by all the selects and never closed. The inserts are quite fast (less than a minute), but the selects are very troublesome here. Is there a way to improve them?
The table is quite big (around 2 million records) and Value contains quite heavy serialized objects.
The System.Data.SQLite provider is used, and the connection string contains these additional options: Version=3;Journal Mode=Off;Synchronous=Off;
If you go for performance, you need to consider this: each independent SELECT command is a round trip to the DB with some extra cost. It's similar to an N+1 select problem in the case of parent-child relations.
The best thing you can do is to get a LIST of items (values):
SELECT Value FROM tableName WHERE Id IN (1, 2, 3, 4, ...);
Here's a link on how to code that: https://www.mikesdotnetting.com/article/116/parameterized-in-clauses-with-ado-net-and-linq
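Roughly like this (a sketch; assumes System.Linq and a hypothetical ids collection, e.g. a List<string>, with _connection being the shared connection from the question):

// Build "... WHERE Id IN (@p0, @p1, ...)" with one parameter per id,
// so the values stay parameterized.
var names = ids.Select((_, i) => "@p" + i).ToArray();
using var cmd = _connection.CreateCommand();
cmd.CommandText =
    $"SELECT Value FROM tableName WHERE Id IN ({string.Join(", ", names)});";
for (int i = 0; i < ids.Count; i++)
{
    var p = cmd.CreateParameter();
    p.ParameterName = names[i];
    p.Value = ids[i];
    cmd.Parameters.Add(p);
}
using var reader = cmd.ExecuteReader();
while (reader.Read())
{
    var value = reader.GetString(0); // one Value per matched Id
}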
You could also avoid recreating the select command for every Id: create it once and only execute it for each Id. From your code it seems every select does CreateCommand/CreateParameter and so on. See this for example: https://learn.microsoft.com/en-us/dotnet/api/system.data.idbcommand.prepare?view=net-5.0 - you run .Prepare() once and then only execute the command (the executions don't need to be NonQuery).
You could then try to see if ExecuteScalar is faster, avoiding the creation of a reader for a single-value result: https://learn.microsoft.com/en-us/dotnet/api/system.data.idbcommand.executescalar?view=net-5.0 - a sketch combining these first two suggestions follows below.
If scalar does not prove to be faster, you could try .SingleRow instead of .SingleResult in your ExecuteReader for a possible performance optimisation. According to https://learn.microsoft.com/en-us/dotnet/api/system.data.commandbehavior?view=net-5.0 it might work. I doubt that, but if the first two don't help, why not try it too.
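The combined sketch (reusing the names from the question; assumes Value is TEXT and that this lives inside a method so the command outlives the loop):

// Created once, before the million-select loop.
using var cmd = _connection.CreateCommand();
cmd.CommandText = "SELECT Value FROM tableName WHERE Id = @id;";
var param = cmd.CreateParameter();
param.ParameterName = "@id";
cmd.Parameters.Add(param);
cmd.Prepare(); // compile the statement once

string Lookup(string id)
{
    param.Value = id;                   // only the value changes per call
    return (string)cmd.ExecuteScalar(); // single value, so no reader is needed
}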
I'm trying to create a database (using SQLite in C#) with a table to store sentences, a table to store each individual word that is used in those sentences, and a junction table that relates the words to the sentences that they appear in.
I am then trying to populate the database with over 15 million sentences. I estimate that there are about 150 million inserts happening in my code.
Right now, my code is only doing a couple hundred sentences per second, which will take forever to get through this huge data set. How can I make this faster?
I tried putting the ENTIRE thing in a single transaction, but I'm not sure if that can work due to the huge amount of data. So I am using one transaction for each sentence.
Tables:
CREATE TABLE sentences ( sid INTEGER NOT NULL PRIMARY KEY, sentence TEXT );
CREATE TABLE words ( wid INTEGER NOT NULL PRIMARY KEY, dictform TEXT UNIQUE);
CREATE TABLE sentence_words( sid INTEGER NOT NULL, wid INTEGER NOT NULL, CONSTRAINT PK_sentence_words PRIMARY KEY ( sid, wid ), FOREIGN KEY(sid) REFERENCES Sentences(sid), FOREIGN KEY(wid) REFERENCES Words(wid));
Code:
while ((input = sr.ReadLine()) != null) //read in a new sentence
{
    tr = m_dbConnection.BeginTransaction();
    sql = "INSERT INTO sentences (sentence) VALUES(@sentence)";
    cmd = new SQLiteCommand(sql, m_dbConnection);
    cmd.Parameters.AddWithValue("@sentence", input);
    cmd.ExecuteNonQuery();

    dict_words = jutils.onlyDict(input); //convert all words to their 'dictionary form'
    var words = dict_words.Split(' ');
    foreach (var wd in words) //for each word
    {
        sql = "INSERT or IGNORE INTO words (dictform) VALUES(@dictform)";
        cmd = new SQLiteCommand(sql, m_dbConnection);
        cmd.Parameters.AddWithValue("@dictform", wd);
        cmd.ExecuteNonQuery();

        sql = "INSERT or IGNORE INTO sentence_words (sid, wid) VALUES((SELECT sid FROM sentences WHERE sentence = @sentence), (SELECT wid FROM words WHERE dictform = @dictform))";
        cmd = new SQLiteCommand(sql, m_dbConnection);
        cmd.Parameters.AddWithValue("@sentence", input);
        cmd.Parameters.AddWithValue("@dictform", wd);
        cmd.ExecuteNonQuery();
    }
    tr.Commit();
}
We should always avoid 'one by one' SQL operations when dealing with data this large.
In my case (if memory is not burdened), I load the data into a DataTable, manipulate it as needed (with LINQ), and finally use SqlBulkCopy at the end.
There is also SqlBulkUpdate, but that is a third-party library, supported from SQL Server 2008 onwards.
Below 2008 we can still do this quickly, but we have to create a temporary SQL table and use an UPDATE with a join (see the sketch below).
SqlBulkCopy is really fast, taking just seconds.
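The pre-2008 pattern referred to above looks roughly like this (table and column names are illustrative only):

-- 1) Bulk-load the incoming rows into a temp table (e.g. with SqlBulkCopy).
CREATE TABLE #incoming (id INT PRIMARY KEY, name NVARCHAR(50), salary INT);

-- 2) Apply everything with one set-based UPDATE joined to the temp table.
UPDATE t
SET    t.name   = i.name,
       t.salary = i.salary
FROM   MyTable AS t
INNER JOIN #incoming AS i ON i.id = t.id;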
You can group the inserts by a certain number of lines:
int lines = -1;
int group = 500;
while ((input = sr.ReadLine()) != null)
{
    lines++;
    if (lines % group == 0)
    {
        tr = m_dbConnection.BeginTransaction();
    }
    // ... the inserts for this sentence ...
and commit each group:
    if (lines % group == group - 1)
    {
        tr.Commit();
    }
}
// commit the final, partially filled group
if (lines >= 0 && lines % group != group - 1)
{
    tr.Commit();
}
I had some old cases where I was forced to use .NET 4.0, and SqlBulkCopy was not an option due to its async implementation.
Just put the entire thing in a single transaction. (The rollback journal stores the old version of any page changed by the transaction, so for an initially empty database it cannot become big.)
Furthermore, all the lookups of the sentence value are inefficient because the database has to scan the entire table. If that column had an index (explicitly, or implicitly through a UNIQUE constraint), those lookups would be much faster.
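For example (an explicit index; making it UNIQUE also lets INSERT OR IGNORE skip duplicate sentences, if those can occur):

CREATE UNIQUE INDEX idx_sentences_sentence ON sentences(sentence);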
I don't know how to do this query in C#.
There are two databases and each one has a table required for this query. I need to take the data from one database table and update the other database table with the corresponding PayrollID.
I have two tables in separate databases: Employee, which is in the techData database, and strStaff, which is in the QLS database. In the Employee table I have StaffID but need to pull the PayrollID from strStaff.
Insert the PayrollID into Employee where the StaffID from strStaff matches the StaffID from Employee.
However, I need to get the StaffID and PayrollID from strStaff before I can do the insert query.
This is what I have got so far, but it won't work:
cn.ConnectionString = ConfigurationManager.ConnectionStrings["PayrollPlusConnectionString"].ConnectionString;
cmd.Connection = cn;
cmd.CommandText = "Select StaffId, PayrollID From [strStaff] Where (StaffID = @StaffID)";
cmd.Parameters.AddWithValue("@StaffID", staffID);
// Open the connection to the database
cn.Open();
// Execute the sql.
dr = cmd.ExecuteReader();
// Read all of the rows generated by the command (in this case only one row).
while (dr.Read()) {
    cmd.CommandText = "Insert into Employee, where StaffID = @StaffID";
}
// Close your connection to the DB.
dr.Close();
cn.Close();
Assuming you want to update data in an existing table, you have to use an UPDATE + SELECT statement (as I mentioned in a comment on the question). It might look like:
UPDATE emp SET payrollID = sta.payrollID
FROM Employee AS emp INNER JOIN strStaff AS sta ON emp.staffID = sta.staffID
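If both databases live on the same SQL Server instance, the join can name the tables directly with three-part names (the database names come from the question; the dbo schema is an assumption):

UPDATE emp
SET    emp.payrollID = sta.payrollID
FROM   techData.dbo.Employee AS emp
INNER JOIN QLS.dbo.strStaff AS sta ON emp.staffID = sta.staffID;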
I have added some clarity to your question: the essential part is that you want to create a C# procedure to accomplish your task (not using SQL Server Management Studio, SSIS, bulk insert, etc). Pertinent to this, there will be 2 different connection objects, and 2 different SQL statements to execute on those connections.
The first task would be retrieving data from the first DB (for certainty let's call it source DB/Table) using SELECT SQL statement, and storing it in some temporary data structure, either per row (as in your code), or the entire table using .NET DataTable object, which will give substantial performance boost. For this purpose, you should use the first connection object to source DB/Table (btw, you can close that connection as soon as you get the data).
The second task would be inserting the data into second DB (target DB/Table), though from your business logic it's a bit unclear how to handle possible data conflicts if records with identical ID already exist in the target DB/Table (some clarity needed). To complete this operation you should use the second connection object and second SQL query.
The sample code snippet to perform the first task, which allows retrieving entire data into .NET/C# DataTable object in a single pass is shown below:
private static DataTable SqlReadDB(string ConnString, string SQL)
{
    try
    {
        using (SqlConnection _connSql = new SqlConnection(ConnString))
        using (SqlCommand _commandSql = new SqlCommand(SQL, _connSql))
        {
            _commandSql.CommandType = CommandType.Text;
            _connSql.Open();
            using (SqlDataReader _dataReaderSql = _commandSql.ExecuteReader(CommandBehavior.CloseConnection))
            {
                DataTable _dt = new DataTable();
                _dt.Load(_dataReaderSql);
                return _dt;
            }
        }
    }
    catch { return null; }
}
The second part (adding data to the target DB/table) you should code based on the clarified business logic (i.e. data-conflict resolution: do you want to update existing records or skip them, etc.). Just iterate through the data rows in the DataTable object and perform either INSERT or UPDATE SQL operations, roughly as sketched below.
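A minimal sketch of that second task (assumes System.Data.SqlClient, that StaffID is the match key and existing rows are simply updated; targetConnString is a hypothetical connection string and dtSource is the DataTable returned by SqlReadDB above):

using (SqlConnection conn = new SqlConnection(targetConnString))
using (SqlCommand cmd = new SqlCommand(
    "UPDATE Employee SET PayrollID = @payrollId WHERE StaffID = @staffId", conn))
{
    // Types are illustrative; match them to the real column types.
    cmd.Parameters.Add("@payrollId", SqlDbType.Int);
    cmd.Parameters.Add("@staffId", SqlDbType.Int);
    conn.Open();
    foreach (DataRow row in dtSource.Rows)
    {
        cmd.Parameters["@payrollId"].Value = row["PayrollID"];
        cmd.Parameters["@staffId"].Value = row["StaffID"];
        cmd.ExecuteNonQuery();
    }
}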
Hope this may help. Kind regards,
I have a simple problem with a not so simple solution... I am currently inserting some data into a database like this:
kompenzacijeDataSet.KompenzacijeRow kompenzacija = kompenzacijeDataSet.Kompenzacije.NewKompenzacijeRow();
kompenzacija.Datum = DateTime.Now;
kompenzacija.PodjetjeID = stranka.id;
kompenzacija.Znesek = Decimal.Parse(tbZnesek.Text);
kompenzacijeDataSet.Kompenzacije.Rows.Add(kompenzacija);
kompenzacijeDataSetTableAdapters.KompenzacijeTableAdapter kompTA = new kompenzacijeDataSetTableAdapters.KompenzacijeTableAdapter();
kompTA.Update(this.kompenzacijeDataSet.Kompenzacije);
this.currentKompenzacijaID = LastInsertID(kompTA.Connection);
The last line is important. Why do I supply a connection? Well there is a SQLite function called last_insert_rowid() that you can call and get the last insert ID. Problem is it is bound to a connection and .NET seems to be reopening and closing connections for every dataset operation. I thought getting the connection from a table adapter would change things. But it doesn't.
Would anyone know how to solve this? Maybe where to get a constant connection from? Or maybe something more elegant?
Thank you.
EDIT:
This is also a problem with transactions: I would need the same connection if I wanted to use transactions, so that is a problem as well...
Using C# (.net 4.0) with SQLite, the SQLiteConnection class has a property LastInsertRowId that equals the Primary Integer Key of the most recently inserted (or updated) element.
The rowID is returned even if the table doesn't have an integer primary key (in that case the rowID column is created automatically).
See https://www.sqlite.org/c3ref/last_insert_rowid.html for more.
As for wrapping multiple commands in a single transaction, any commands entered after the transaction begins and before it is committed are part of one transaction.
long rowID;
using (SQLiteConnection con = new SQLiteConnection([datasource]))
{
    SQLiteTransaction transaction = con.BeginTransaction();
    ... [execute insert statement]
    rowID = con.LastInsertRowId;
    transaction.Commit();
}
select last_insert_rowid();
And you will need to execute it as a scalar query.
string sql = #"select last_insert_rowid()";
long lastId = (long)command.ExecuteScalar(sql); // Need to type-cast since `ExecuteScalar` returns an object.
last_insert_rowid() is part of the solution. It returns a row number, not the actual ID.
cmd = CNN.CreateCommand();
cmd.CommandText = "SELECT last_insert_rowid()";
object i = cmd.ExecuteScalar();
cmd.CommandText = "SELECT " + ID_Name + " FROM " + TableName + " WHERE rowid=" + i.ToString();
i = cmd.ExecuteScalar();
I'm using Microsoft.Data.Sqlite package and I do not see a LastInsertRowId property. But you don't have to create a second trip to database to get the last id. Instead, combine both sql statements into a single string.
string sql = #"
insert into MyTable values (null, #name);
select last_insert_rowid();";
using (var cmd = conn.CreateCommand()) {
cmd.CommandText = sql;
cmd.Parameters.Add("#name", SqliteType.Text).Value = "John";
int lastId = Convert.ToInt32(cmd.ExecuteScalar());
}
There seem to be answers targeting both Microsoft's provider and SQLite's reference provider, and that is the reason some people are getting the LastInsertRowId property to work and others aren't.
Personally I don't use a PK, as it's just an alias for the rowid column. Using the rowid is around twice as fast as a key that you create. If I have a TEXT column for a PK I still use the rowid and just make the text column unique. (This is for SQLite 3 only; you need your own key for v1 & v2, as VACUUM will alter rowid numbers.)
That said, the way to get the information from the record of the last insert is the code below. Since the query refers back to the same table, I LIMIT it to 1 just for speed; even if you don't, there will only be one record from the main SELECT statement.
SELECT my_primary_key_column FROM my_table
WHERE rowid in (SELECT last_insert_rowid() LIMIT 1);
The SQLiteConnection object has a property for that, so there is no need for an additional query.
After the INSERT you can just use the LastInsertRowId property of the SQLiteConnection object that was used for the INSERT command.
The type of the LastInsertRowId property is Int64.
Of course, as you already know, for auto-increment to work the primary key on the table must be an AUTOINCREMENT field, which is another topic.
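For example (System.Data.SQLite; assumes con is the open connection the INSERT runs on, and a hypothetical MyTable):

using (SQLiteCommand cmd = new SQLiteCommand(
    "INSERT INTO MyTable (Name) VALUES (@name)", con))
{
    cmd.Parameters.AddWithValue("@name", "John");
    cmd.ExecuteNonQuery();
}
long id = con.LastInsertRowId; // Int64, read from the same connection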
database = new SQLiteConnection(databasePath);
public int GetLastInsertId()
{
return (int)SQLite3.LastInsertRowid(database.Handle);
}
How about just running the two SQL statements together using ExecuteScalar?
// Person is an object that has an Id and a Name property
var connString = LoadConnectionString(); // get connection string
using (var conn = new SQLiteConnection(connString)) // connect to sqlite
{
    // insert new record and get Id of inserted record
    var sql = @"INSERT INTO People (Name) VALUES (@Name);
                SELECT Id FROM People
                ORDER BY Id DESC";
    var lastId = conn.ExecuteScalar(sql, person); // ExecuteScalar(sql, param) is an extension method (e.g. Dapper's)
}
In EF Core 5 you can get ID in the object itself without using any "last inserted".
For example:
var r = new SomeData() { Name = "New Row", ...};
dbContext.Add(r);
dbContext.SaveChanges();
Console.WriteLine(r.ID);
You get the new ID without having to think about using the correct connection, thread safety, etc.
If you're using the Microsoft.Data.Sqlite package, it doesn't include a LastInsertRowId property in the SqliteConnection class, but you can still call the last_insert_rowid function by using the underlying SQLitePCL library. Here's an extension method:
using System;
using Microsoft.Data.Sqlite;
using SQLitePCL;

public static class SqliteConnectionExtensions // extension methods must live in a static class
{
    public static long GetLastInsertRowId(this SqliteConnection connection)
    {
        var handle = connection.Handle ?? throw new NullReferenceException("The connection is not open.");
        return raw.sqlite3_last_insert_rowid(handle);
    }
}
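Usage, once the connection is open:

long id = connection.GetLastInsertRowId();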
I am working on a console application to insert data to a MS SQL Server 2005 database. I have a list of objects to be inserted. Here I use Employee class as example:
List<Employee> employees;
What I can do is to insert one object at time like this:
foreach (Employee item in employees)
{
string sql = #"INSERT INTO Mytable (id, name, salary)
values ('#id', '#name', '#salary')";
// replace #par with values
cmd.CommandText = sql; // cmd is IDbCommand
cmd.ExecuteNonQuery();
}
Or I can build a bulk insert query like this:
string sql = #"INSERT INTO MyTable (id, name, salary) ";
int count = employees.Count;
int index = 0;
foreach (Employee item in employees)
{
sql = sql + string.format(
"SELECT {0}, '{1}', {2} ",
item.ID, item.Name, item.Salary);
if ( index != (count-1) )
sql = sql + " UNION ALL ";
index++
}
cmd.CommandType = sql;
cmd.ExecuteNonQuery();
I guess the latter case is going to insert all the rows at once. However, if I have several thousand rows of data, is there any limit on the length of the SQL query string?
I am not sure whether one insert with multiple rows is better than multiple inserts with one row each, in terms of performance.
Any suggestions to do it in a better way?
Actually, the way you have it written, your first option will be faster.
Your second example has a problem in it: you are doing sql = sql + etc., which creates a new string object for each iteration of the loop (check out the StringBuilder class). Technically you are creating a new string object in the first example too, but the difference is that it doesn't have to copy all the information from the previous string over.
The way you have it set up, SQL Server will potentially have to evaluate a massive query when you finally send it, which will definitely take some time to parse. I should state that this depends on how many inserts you need to do: if n is small you are probably going to be OK, but as it grows your problem will only get worse.
Bulk inserts are faster than individual ones due to how SQL Server handles batch transactions. If you are going to insert data from C#, you should take the first approach and wrap, say, every 500 inserts in a transaction and commit it, then do the next 500, and so on. This also has the advantage that if a batch fails, you can trap the failure, figure out what went wrong, and re-insert just those rows. There are other ways to do it, but that would definitely be an improvement over the two examples provided.
var iCounter = 0;
IDbTransaction tran = null;
foreach (Employee item in employees)
{
    if (iCounter == 0)
    {
        tran = cn.BeginTransaction(); // cn is the open IDbConnection
        cmd.Transaction = tran;
    }
    string sql = @"INSERT INTO Mytable (id, name, salary)
                   values ('@id', '@name', '@salary')";
    // replace @par with values
    cmd.CommandText = sql; // cmd is IDbCommand
    cmd.ExecuteNonQuery();
    iCounter++;
    if (iCounter >= 500)
    {
        tran.Commit();
        iCounter = 0;
    }
}
if (iCounter > 0)
    tran.Commit();
In MS SQL Server 2008 you can create a table-valued UDT that will contain your table rows:
CREATE TYPE MyUdt AS TABLE (Id int, Name nvarchar(50), salary int)
Then you can use this UDT in your stored procedures and your C# code for batch inserts.
SP:
CREATE PROCEDURE uspInsert
(@MyTvp AS MyUdt READONLY)
AS
INSERT INTO [MyTable]
SELECT * FROM @MyTvp
C# (imagine that the records you need to insert are already contained in the table "MyTable" of the DataSet ds):
using(conn)
{
SqlCommand cmd = new SqlCommand("uspInsert", conn);
cmd.CommandType = CommandType.StoredProcedure;
SqlParameter myParam = cmd.Parameters.AddWithValue
("#MyTvp", ds.Tables["MyTable"]);
myParam.SqlDbType = SqlDbType.Structured;
myParam.TypeName = "dbo.MyUdt";
// Execute the stored procedure
cmd.ExecuteNonQuery();
}
So, this is the solution.
Finally, I want to warn you against using code like yours (building a string and then executing it), because that style of execution is open to SQL injection.
Look at this thread; I've answered there about table-valued parameters.
Bulk-copy is usually faster than doing inserts on your own.
If you still want to do it in one of the ways you suggested, you should make it easy to change the size of the queries you send to the server. That way you can optimize for speed in your production environment later on, since query times may vary a lot depending on the query size.
The batch size for a SQL Server query is listed as 65,536 * the network packet size. The network packet size is 4 KB by default but can be changed, which by default puts the cap at 256 MB per batch. Check out the Maximum Capacity Specifications article for SQL Server 2008 to get the scope; SQL Server 2005 appears to have the same limit.