I have the following (simplified) code which I'd like to optimise for speed:
long inputLen = 50000000; // 50 million
DataTable dataTable = new DataTable();
DataRow dataRow;
object[] objectRow = new object[4]; // reusable buffer for the four column values

while (inputLen-- > 0)
{
    objectRow[0] = ...
    objectRow[1] = ...
    objectRow[2] = ...

    // Generate output for this input
    output = ...

    for (int i = 0; i < outputLen; i++) // outputLen can range from 1 to 20,000
    {
        objectRow[3] = output[i];
        dataRow = dataTable.NewRow();
        dataRow.ItemArray = objectRow;
        dataTable.Rows.Add(dataRow);
    }
}

// Bulk copy
SqlBulkCopy bulkTask = new SqlBulkCopy(connection, SqlBulkCopyOptions.TableLock, null);
bulkTask.DestinationTableName = "newTable";
bulkTask.BatchSize = dataTable.Rows.Count;
bulkTask.WriteToServer(dataTable);
bulkTask.Close();
I'm already using SqlBulkCopy in an attempt to speed things up, but it appears that assigning values to the DataTable itself is slow.
I don't know how DataTables work internally, so I'm wondering whether I'm creating unnecessary overhead by first creating a reusable array, then assigning it to a DataRow, then adding the DataRow to the DataTable. Or is using a DataTable not optimal in the first place? The input comes from a database.
I don't care much about LOC, just speed. Can anyone give some advice on this?
For such a big table, you should instead use the
public void WriteToServer(IDataReader reader)
method.
It may mean you'll have to implement a "fake" IDataReader yourself in your code (if you don't get the data from an existing IDataReader), but this way you'll get "streaming" from end to end and will avoid buffering the 200 million rows in memory.
Instead of holding a huge data table in memory, I would suggest implementing a IDataReader which serves up the data as the bulk copy goes. This will reduce the need to keep everything in memory upfront, and should thus serve to improve performance.
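A minimal sketch of what such a streaming reader could look like, assuming the generation loop from the question can be rewritten as an IEnumerable<object[]>. The ObjectArrayDataReader class and GenerateRows() helper are made-up names for illustration; only the members SqlBulkCopy actually needs do real work, the rest are stubs:
using System;
using System.Collections.Generic;
using System.Data;

// Streams object[] rows into SqlBulkCopy without materializing them all at once.
public sealed class ObjectArrayDataReader : IDataReader
{
    private readonly IEnumerator<object[]> _rows;
    private readonly string[] _columns;
    private object[] _current;

    public ObjectArrayDataReader(IEnumerable<object[]> rows, string[] columns)
    {
        _rows = rows.GetEnumerator();
        _columns = columns;
    }

    // The members SqlBulkCopy relies on:
    public int FieldCount => _columns.Length;
    public bool Read() { if (!_rows.MoveNext()) return false; _current = _rows.Current; return true; }
    public object GetValue(int i) => _current[i];
    public string GetName(int i) => _columns[i];
    public int GetOrdinal(string name) => Array.IndexOf(_columns, name);
    public bool IsDBNull(int i) => _current[i] == null || _current[i] == DBNull.Value;

    // Housekeeping:
    public bool IsClosed => false;
    public int Depth => 0;
    public int RecordsAffected => -1;
    public bool NextResult() => false;
    public DataTable GetSchemaTable() => null;
    public void Close() { }
    public void Dispose() => _rows.Dispose();

    // Remaining IDataRecord members, not needed in this scenario:
    public object this[int i] => GetValue(i);
    public object this[string name] => GetValue(GetOrdinal(name));
    public Type GetFieldType(int i) => typeof(object);         // SqlBulkCopy converts values itself
    public string GetDataTypeName(int i) => GetFieldType(i).Name;
    public int GetValues(object[] values) { _current.CopyTo(values, 0); return _current.Length; }
    public bool GetBoolean(int i) => (bool)_current[i];
    public byte GetByte(int i) => (byte)_current[i];
    public char GetChar(int i) => (char)_current[i];
    public DateTime GetDateTime(int i) => (DateTime)_current[i];
    public decimal GetDecimal(int i) => (decimal)_current[i];
    public double GetDouble(int i) => (double)_current[i];
    public float GetFloat(int i) => (float)_current[i];
    public Guid GetGuid(int i) => (Guid)_current[i];
    public short GetInt16(int i) => (short)_current[i];
    public int GetInt32(int i) => (int)_current[i];
    public long GetInt64(int i) => (long)_current[i];
    public string GetString(int i) => (string)_current[i];
    public long GetBytes(int i, long fieldOffset, byte[] buffer, int bufferOffset, int length) => throw new NotSupportedException();
    public long GetChars(int i, long fieldOffset, char[] buffer, int bufferOffset, int length) => throw new NotSupportedException();
    public IDataReader GetData(int i) => throw new NotSupportedException();
}
Usage would then look roughly like bulkTask.WriteToServer(new ObjectArrayDataReader(GenerateRows(), new[] { "Col1", "Col2", "Col3", "Col4" })), where GenerateRows() is an iterator (yield return) wrapped around the while loop in the question and the column names are placeholders for the destination columns.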
You should not construct the entire DataTable in memory. Use the overload of WriteToServer that takes an array of DataRow, and split your data into chunks.
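A rough sketch of that chunking, reusing a small DataTable as a buffer. The chunk size and the GenerateRows() helper (a hypothetical stand-in for the question's generation loop) are assumptions:
const int chunkSize = 50000;                    // assumed chunk size, tune as needed
DataTable chunk = new DataTable();              // same four columns as the destination table
// ... add the destination columns to "chunk" here ...

using (var bulk = new SqlBulkCopy(connection, SqlBulkCopyOptions.TableLock, null))
{
    bulk.DestinationTableName = "newTable";
    foreach (object[] values in GenerateRows()) // hypothetical stand-in for the while loop above
    {
        chunk.Rows.Add(values);
        if (chunk.Rows.Count == chunkSize)
        {
            DataRow[] rows = new DataRow[chunk.Rows.Count];
            chunk.Rows.CopyTo(rows, 0);
            bulk.WriteToServer(rows);           // write one chunk via the DataRow[] overload...
            chunk.Clear();                      // ...then reuse the same buffer table
        }
    }
    if (chunk.Rows.Count > 0)
    {
        DataRow[] rows = new DataRow[chunk.Rows.Count];
        chunk.Rows.CopyTo(rows, 0);
        bulk.WriteToServer(rows);               // flush the final partial chunk
    }
}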
I am writing a C# script that will be run inside another project that makes use of the C# compiler. This engine does not have System.Data.DataSetExtensions referenced, nor will it be allowed to reference it.
That being said, I still need to take a DataTable of 100,000 rows and break it up into smaller DataSets of 10,000 rows max. Normally I would use DataSetExtensions with something like:
var filteredDataTable = myDataTable.AsEnumerable().Take(10000).CopyToDataTable();
What would be an efficient way to go about this without using DataSetExtensions? Should I be resigned to using a foreach loop and copying over 10,000 rows into new DataTables?
Should I be resigned to using a foreach loop and copying over 10,000 rows into new DataTables?
Yes. You may also consider writing your own extension method to slice the table and reuse it wherever required. Something like below:
public static class DataTableExtensions   // extension methods must live in a static class
{
    public static DataTable SliceTable(this DataTable dt, int rowsCount, int skipRows = 0)
    {
        DataTable dtResult = dt.Clone();  // copies the schema only
        for (int i = skipRows; i < dt.Rows.Count && rowsCount > 0; i++)
        {
            dtResult.ImportRow(dt.Rows[i]);
            rowsCount--;
        }
        return dtResult;
    }
}
Usage:
DataTable myData;                            // original table
var slice1 = myData.SliceTable(1000);        // get a slice of the first 1000 rows
var slice2 = myData.SliceTable(1000, 1000);  // get rows 1001 to 2000
var slice3 = myData.SliceTable(1000, 2000);  // get rows 2001 to 3000
I have a DataSet in C# with DataTables, and a PostgreSQL database with the same tables. I fill the DataTables in my code and want to INSERT them into the PostgreSQL database. I tried inserting with simple SQL queries (INSERT INTO ...), but it's very slow when I have hundreds of tables with thousands of rows. I guess using a DataAdapter will improve performance, but I can't understand how it works. Can you explain it to me with two example cases?
Case 1:
Inserting the DataSet's tables into PostgreSQL with a DataAdapter.
Case 2:
Inserting only unique values from the DataSet into PostgreSQL (when a table in the database has rows with unique keys and the DataTable contains the same).
Or maybe you can suggest what to read to learn about DataAdapters. Anyway, thanks.
With the exception of trivially small datasets, you're going to have a hard time beating the performance of Npgsql's implementation of COPY, which is available via the BeginTextImport method of your NpgsqlConnection object.
So, regardless of how your data exists in your application, if you dump the output via the text import (COPY), it should be very zippy. Here is an example of how you would do that with a DataTable. Bear in mind that the columns in the DataTable and the columns in the database table have to line up; if not, you need to manage that one way or the other.
This presupposes Npgsql 3.1.9 or higher.
object[] outRow = new object[dt.Columns.Count];

using (var writer = conn.BeginTextImport("copy <table> from STDIN WITH NULL AS '' CSV"))
{
    foreach (DataRow rw in dt.Rows)
    {
        for (int col = 0; col < dt.Columns.Count; col++)
            outRow[col] = rw[col];

        writer.WriteLine(string.Join(",", outRow));
    }
}
As far as duplicates... wow, that really depends. Define "duplicates." If it's just a "select distinct," then it also depends on how many duplicates you expect. If it's a small amount, List<T>.Exists would probably be adequate, but if you have a large number of dupes, a Dictionary would make each lookup a lot more efficient. A typical List lookup is O(n), while a Dictionary lookup is O(1).
Here's a pretty brute-force example of a dictionary distinct insert for the above example:
object[] outRow = new object[dt.Columns.Count];
Dictionary<string, bool> already = new Dictionary<string, bool>();
bool test;

using (var writer = conn.BeginTextImport("copy <table> from STDIN WITH NULL AS '' CSV"))
{
    foreach (DataRow rw in dt.Rows)
    {
        for (int col = 0; col < dt.Columns.Count; col++)
            outRow[col] = rw[col];

        string output = string.Join(",", outRow);
        if (!already.TryGetValue(output, out test))
        {
            writer.WriteLine(output);
            already.Add(output, true);
        }
    }
}
Disclaimer: This is a memory pig. If you can manage dupes any other way, or guarantee the ordering of the data, there are numerous other options.
If you can't (or won't) use a bulk copy insert, something that would help performance would be to wrap your inserts in a transaction (NpgsqlTransaction), but for hundreds of thousands of rows, I can't see why you would.
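For completeness, a hedged sketch of that fallback, batching parameterized INSERTs inside one transaction. The table name, column names and Text types below are placeholders, not part of the original answer:
// Sketch only: "my_table", "col1"/"col2" and the Text types are assumptions.
using (var tx = conn.BeginTransaction())
using (var cmd = new NpgsqlCommand("INSERT INTO my_table (col1, col2) VALUES (@p1, @p2)", conn, tx))
{
    var p1 = cmd.Parameters.Add("p1", NpgsqlTypes.NpgsqlDbType.Text);
    var p2 = cmd.Parameters.Add("p2", NpgsqlTypes.NpgsqlDbType.Text);
    cmd.Prepare();                    // prepare once, execute many times

    foreach (DataRow rw in dt.Rows)
    {
        p1.Value = rw[0];
        p2.Value = rw[1];
        cmd.ExecuteNonQuery();
    }
    tx.Commit();                      // a single commit for the whole batch
}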
I made the following code to add an external data table to another table in an MS Word document. It's working fine, but it takes a lot of time when the number of rows is more than 100, and when adding a table with more than 500 rows it fills the Word table really slowly and can't complete the task.
I tried hiding the document and disabling screen updating for the document, but there is still no solution for the slow performance.
// Get the required external data into the DT data table
DataTable DT = XDt.GetData();
Word.Table TB;   // assigned elsewhere to the target table in the document
int X = 1;

foreach (DataRow Rw in DT.Rows)
{
    Word.Row Rn = TB.Rows.Add(TB.Rows[X + 1]);
    for (int i = 0; i <= DT.Columns.Count - 1; i++)
    {
        Rn.Cells[i + 1].Range.Text = Rw[i].ToString();
    }
    X++;
}
So is there a way to make this process go faster?
The most efficient way to add a table to Word is to first concatenate the data in a delimited text string, where "\n" (a newline) must be the symbol for end-of-row (record separator). The end-of-cell (field separator) can be any character you like that's not in the string content that makes up the table.
Assign this string to a Range object, then use the ConvertToTable() method to create the table.
You're retrieving the last row of the current table for the BeforeRow parameter of TB.Rows.Add. This is significantly slower than simply adding the row. You should replace this:
Word.Row Rn = TB.Rows.Add(TB.Rows[X + 1]);
With this:
Word.Row Rn = TB.Rows.Add();
Utilizing parallelization as suggested in the comments might help slightly, but I'm afraid it won't do much good, since the table-add code runs on the main thread, as mentioned in this link.
EDIT:
If performance is still an issue, I'd look into creating the Word table independently of the Word object model by using OpenXML. It's orders of magnitude faster.
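As an illustrative sketch of that direction, here is a helper that builds the table with the Open XML SDK (DocumentFormat.OpenXml package), bypassing the interop object model entirely. The AppendTable name is made up, and the method would live in whatever class is convenient:
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;

static void AppendTable(string path, System.Data.DataTable dt)
{
    using (var doc = WordprocessingDocument.Open(path, true))
    {
        var table = new Table();
        foreach (System.Data.DataRow row in dt.Rows)
        {
            var tr = new TableRow();
            foreach (System.Data.DataColumn col in dt.Columns)
            {
                // Each cell needs a paragraph containing a run with the text.
                tr.Append(new TableCell(new Paragraph(new Run(new Text(row[col].ToString())))));
            }
            table.Append(tr);
        }
        doc.MainDocumentPart.Document.Body.Append(table);
        doc.MainDocumentPart.Document.Save();
    }
}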
The ConvertToTable method is orders of magnitude faster than adding Rows/Cells one at a time:
var items = new List<string>();
object[] values;

while (reader.Read())
{
    values = new object[reader.FieldCount];
    reader.GetValues(values);
    items.Add(String.Join("\t", values));   // one tab-delimited line per record
}

var data = String.Join("\n", items);        // "\n" marks end-of-row for ConvertToTable
var tempDocument = application.Documents.Add();
var range = tempDocument.Range();
range.Text = data;
var tempTable = range.ConvertToTable(Separator: Microsoft.Office.Interop.Word.WdTableFieldSeparator.wdSeparateByTabs,
    NumColumns: reader.FieldCount,
    NumRows: items.Count,
    DefaultTableBehavior: WdDefaultTableBehavior.wdWord9TableBehavior,
    AutoFitBehavior: WdAutoFitBehavior.wdAutoFitWindow);
I want to insert around 1 million records into a database using LINQ in ASP.NET MVC. But when I try the following code it doesn't work: it throws an OutOfMemoryException, and it also spent three days in the loop. Can anyone please help me with this?
db.Database.ExecuteSqlCommand("DELETE FROM [HotelServices]");

DataTable tblRepeatService = new DataTable();
tblRepeatService.Columns.Add("HotelCode", typeof(System.String));
tblRepeatService.Columns.Add("Service", typeof(System.String));
tblRepeatService.Columns.Add("Category", typeof(System.String));

foreach (DataRow row in xmltable.Rows)
{
    string[] servicesarr = Regex.Split(row["PAmenities"].ToString(), ";");
    for (int a = 0; a < servicesarr.Length; a++)
    {
        tblRepeatService.Rows.Add(row["HotelCode"].ToString(), servicesarr[a], "PA");
    }

    string[] servicesarrA = Regex.Split(row["RAmenities"].ToString(), ";");
    for (int b = 0; b < servicesarrA.Length; b++)
    {
        tblRepeatService.Rows.Add(row["HotelCode"].ToString(), servicesarrA[b], "RA");
    }
}
HotelAmenties _hotelamenties;

foreach (DataRow hadr in tblRepeatService.Rows)
{
    _hotelamenties = new HotelAmenties();
    _hotelamenties.Id = Guid.NewGuid();
    _hotelamenties.ServiceName = hadr["Service"].ToString();
    _hotelamenties.HotelCode = hadr["HotelCode"].ToString();
    db.HotelAmenties.Add(_hotelamenties);
}
db.SaveChanges();
tblRepeatService table has around 1 million rows.
Bulk inserts like this are highly inefficient in LINQtoSQL. Every insert creates at least three objects (the DataRow, the HotelAmenities object and the tracking record for it), chewing up memory on objects you don't need.
Given that you already have a DataTable, you can use System.Data.SqlClient.SqlBulkCopy to push the content of the table to a temporary table on the SQL server, then use a single insert statement to load the data into its final destination. This is the fastest way I have found so far to move many thousands of records from memory to SQL.
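A rough sketch of that approach, under the assumption that the destination table is HotelAmenties with Id, HotelCode and ServiceName columns; the connectionString, staging table name and column sizes below are made up for illustration:
using (var conn = new SqlConnection(connectionString))
{
    conn.Open();

    // 1. Create a temporary staging table matching the DataTable's columns.
    new SqlCommand(
        "CREATE TABLE #Staging (HotelCode nvarchar(50), Service nvarchar(200), Category nvarchar(10))",
        conn).ExecuteNonQuery();

    // 2. Push the in-memory rows to the staging table in one streamed operation.
    using (var bulk = new SqlBulkCopy(conn))
    {
        bulk.DestinationTableName = "#Staging";
        bulk.WriteToServer(tblRepeatService);
    }

    // 3. Move the data to its final destination with a single set-based INSERT.
    new SqlCommand(
        @"INSERT INTO HotelAmenties (Id, HotelCode, ServiceName)
          SELECT NEWID(), HotelCode, Service FROM #Staging",
        conn).ExecuteNonQuery();
}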
If performance doesn't matter and this is a one-shot job, you can stick with the approach you're using. Your problem is that you only save at the end, so Entity Framework has to track and generate the SQL for one million operations at once. Modify your code so that you save every 1,000 or so inserts instead of only at the end, and it should work just fine.
int i = 0;
foreach (DataRow hadr in tblRepeatService.Rows)
{
    _hotelamenties = new HotelAmenties();
    _hotelamenties.Id = Guid.NewGuid();
    _hotelamenties.ServiceName = hadr["Service"].ToString();
    _hotelamenties.HotelCode = hadr["HotelCode"].ToString();
    db.HotelAmenties.Add(_hotelamenties);

    i++;
    if ((i % 1000) == 0)
    {
        db.SaveChanges();   // flush every 1,000 inserts instead of all at once
    }
}
db.SaveChanges();           // save whatever is left over
I now have a problem with a very old system of ours. (It is more than 7 years old and I have no budget or resources to make bigger changes to the structure, so the decision is to improve the old logic as much as we can.)
We have a grid control we wrote ourselves. Basically it is like a normal ASP.NET grid: you can add, change, and delete elements.
The problem is that the grid has a BindGrid() method where, for further usage, the rows of the data source table are copied into a DataRow[]. I need to keep the DataRow[], but I would like to implement the best way to copy the source from the table into the array.
The current solution:
DataRow[] rows = DataSource.Select("1=1", SortOrderString);
From what I have experienced so far, if I need a specific sort order, that could be the best way (I'm also interested in whether there is a quicker way or not).
BUT there are some simplified pages where the SortOrder is not needed.
So I could make two methods: one with the sort order and one without.
The real problem is the second one:
DataRow[] rows = DataSource.Select("1=1");
Because it is very slow. I made some tests, and it is roughly 15 times slower than the CopyTo() solution:
DataRow[] rows = new DataRow[DataSource.Rows.Count];
DataSource.Rows.CopyTo(rows,0);
I would like to use the faster way, BUT when I ran the tests some old functions simply crashed. It seems there is another difference, which I only noticed now:
Select() returns the rows as if the row changes had already been accepted.
So if I delete a row and do not call AcceptChanges() (which I unfortunately can't do), then with Select("1=1") the row is in the DataSource but not in the DataRow[].
With a simple .CopyTo() the row is there, and that is bad news for me.
My questions are:
1) Is Select("1=1") the best way to get the rows according to the row changes? (I doubt it a bit, because it is a roughly six-year-old piece of code.)
2) And if it is not, is there a faster way to achieve the same result as .Select("1=1")?
UPDATE:
Here is a very basic test app that I used for speed testing:
DataTable dt = new DataTable("Test");
dt.Columns.Add("Id", typeof(int));
dt.Columns.Add("Name", typeof(string));

for (int i = 0; i < 10000; i++)
{
    DataRow row = dt.NewRow();
    row["Id"] = i;
    row["Name"] = "Name" + i;
    dt.Rows.Add(row);
}
dt.AcceptChanges();

DateTime start = DateTime.Now;

DataRow[] rows = dt.Select();
/*DataRow[] rows = new DataRow[dt.Rows.Count];
dt.Rows.CopyTo(rows, 0);*/

Console.WriteLine(DateTime.Now - start);
You can call Select without an argument: DataRow[] allRows = DataSource.Select(); That is certainly more efficient than "1=1", since the latter applies a pointless RowFilter.
Another way is using Linq-To-DataSet to order and filter the DataTable. That isn't more efficient, but it is more readable and maintainable.
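For illustration, a small sketch of that (it requires a reference to System.Data.DataSetExtensions; the "Id" sort column is a placeholder):
// Assumes: using System.Data; using System.Linq;
DataRow[] orderedRows = DataSource.AsEnumerable()
    .OrderBy(r => r.Field<int>("Id"))   // placeholder sort column
    .ToArray();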
I don't have an example or measurement yet, but it is obvious that a RowFilter of "1=1" is more expensive than none. Select is implemented in this way:
public Select(DataTable table, string filterExpression, string sort, DataViewRowState recordStates)
{
    this.table = table;
    this.IndexFields = table.ParseSortString(sort);
    this.indexDesc = Select.ConvertIndexFieldtoIndexDesc(this.IndexFields);

    // the following would be omitted if you used DataSource.Select() without "1=1"
    if (filterExpression != null && filterExpression.Length > 0)
    {
        this.rowFilter = new DataExpression(this.table, filterExpression);
        this.expression = this.rowFilter.ExpressionNode;
    }

    this.recordStates = recordStates;
}
If you also want to select the rows that are currently not accepted, you can use this overload of Select:
DataRow[] allRows = DataSource.Select("", "", DataViewRowState.CurrentRows | DataViewRowState.Deleted);
This will select all rows, including deleted rows, even if AcceptChanges has not been called yet.