I have a large CSV file... 10 columns, 100 million rows, roughly 6 GB in size on my hard disk.
I want to read this CSV file line by line and then load the data into a Microsoft SQL Server database using SqlBulkCopy.
I have read a couple of threads on here and also on the internet. Most people suggest that reading a CSV file in parallel doesn't buy much in terms of efficiency, since the tasks/threads contend for disk access.
What I'm trying to do is read the CSV line by line and add the rows to a blocking collection with a capacity of 100K rows. Once this collection is full, I spin up a new task/thread to write the data to SQL Server using the SqlBulkCopy API.
I have written this piece of code, but I'm hitting a runtime error that says "Attempt to invoke bulk copy on an object that has a pending operation." This scenario looks like something that can be easily solved using the .NET 4.0 TPL, but I'm not able to get it to work. Any suggestions on what I'm doing wrong?
public static void LoadCsvDataInParalleToSqlServer(string fileName, string connectionString, string table, DataColumn[] columns, bool truncate)
{
    const int inputCollectionBufferSize = 1000000;
    const int bulkInsertBufferCapacity = 100000;
    const int bulkInsertConcurrency = 8;

    var sqlConnection = new SqlConnection(connectionString);
    sqlConnection.Open();

    var sqlBulkCopy = new SqlBulkCopy(sqlConnection.ConnectionString, SqlBulkCopyOptions.TableLock)
    {
        EnableStreaming = true,
        BatchSize = bulkInsertBufferCapacity,
        DestinationTableName = table,
        BulkCopyTimeout = (24 * 60 * 60),
    };

    BlockingCollection<DataRow> rows = new BlockingCollection<DataRow>(inputCollectionBufferSize);
    DataTable dataTable = new DataTable(table);
    dataTable.Columns.AddRange(columns);

    Task loadTask = Task.Factory.StartNew(() =>
    {
        foreach (DataRow row in ReadRows(fileName, dataTable))
        {
            rows.Add(row);
        }
        rows.CompleteAdding();
    });

    List<Task> insertTasks = new List<Task>(bulkInsertConcurrency);
    for (int i = 0; i < bulkInsertConcurrency; i++)
    {
        insertTasks.Add(Task.Factory.StartNew((x) =>
        {
            List<DataRow> bulkInsertBuffer = new List<DataRow>(bulkInsertBufferCapacity);
            foreach (DataRow row in rows.GetConsumingEnumerable())
            {
                if (bulkInsertBuffer.Count == bulkInsertBufferCapacity)
                {
                    SqlBulkCopy bulkCopy = x as SqlBulkCopy;
                    var dataRows = bulkInsertBuffer.ToArray();
                    bulkCopy.WriteToServer(dataRows);
                    Console.WriteLine("Inserted rows " + bulkInsertBuffer.Count);
                    bulkInsertBuffer.Clear();
                }
                bulkInsertBuffer.Add(row);
            }
        },
        sqlBulkCopy));
    }

    loadTask.Wait();
    Task.WaitAll(insertTasks.ToArray());
}

private static IEnumerable<DataRow> ReadRows(string fileName, DataTable dataTable)
{
    using (var textFieldParser = new TextFieldParser(fileName))
    {
        textFieldParser.TextFieldType = FieldType.Delimited;
        textFieldParser.Delimiters = new[] { "," };
        textFieldParser.HasFieldsEnclosedInQuotes = true;

        while (!textFieldParser.EndOfData)
        {
            string[] cols = textFieldParser.ReadFields();
            DataRow row = dataTable.NewRow();
            for (int i = 0; i < cols.Length; i++)
            {
                if (string.IsNullOrEmpty(cols[i]))
                {
                    row[i] = DBNull.Value;
                }
                else
                {
                    row[i] = cols[i];
                }
            }
            yield return row;
        }
    }
}
Don't.
Parallel access may or may not give you a faster read of the file (it won't, but I'm not going to fight that battle...), but what is certain is that parallel writes won't give you a faster bulk insert. That is because minimally logged bulk insert (i.e. the really fast bulk insert) requires a table lock. See Prerequisites for Minimal Logging in Bulk Import:
Minimal logging requires that the target table meets the following conditions:
...
- Table locking is specified (using TABLOCK).
...
Parallel inserts, by definition, cannot obtain concurrent table locks. QED. You are barking up the wrong tree.
Stop getting your information from random findings on the internet. Read The Data Loading Performance Guide; it is the guide to ... performant data loading.
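To make the single-writer approach concrete, here is a minimal sketch (not the original poster's code): it reuses the ReadRows helper and column setup from the question, streams batches through one SqlBulkCopy under TABLOCK, and flushes the final partial batch. The batch size and timeout value are assumptions.

public static void LoadCsvDataSingleWriter(string fileName, string connectionString, string table, DataColumn[] columns)
{
    const int batchSize = 100000; // assumed batch size

    var dataTable = new DataTable(table);
    dataTable.Columns.AddRange(columns);

    // One SqlBulkCopy, one table lock: keeps the load eligible for minimal logging.
    using (var bulkCopy = new SqlBulkCopy(connectionString, SqlBulkCopyOptions.TableLock)
    {
        DestinationTableName = table,
        EnableStreaming = true,
        BatchSize = batchSize,
        BulkCopyTimeout = 0 // 0 = no timeout; a 100-million-row load will not fit in the default 30 seconds
    })
    {
        foreach (DataRow row in ReadRows(fileName, dataTable))
        {
            dataTable.Rows.Add(row);
            if (dataTable.Rows.Count == batchSize)
            {
                bulkCopy.WriteToServer(dataTable);
                dataTable.Clear(); // keep memory bounded between batches
            }
        }
        if (dataTable.Rows.Count > 0)
        {
            bulkCopy.WriteToServer(dataTable); // flush the final partial batch
        }
    }
}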
I would recommend that you stop reinventing the wheel. Use SSIS; this is exactly what it is designed to handle.
http://joshclose.github.io/CsvHelper/
https://efbulkinsert.codeplex.com/
If it is possible for you, I suggest you read your file into a List<T> using the aforementioned CsvHelper and write to your db using bulk insert as you are doing, or EFBulkInsert, which I have used and is amazingly fast.
using CsvHelper;

public static List<T> CSVImport<T, TClassMap>(string csvData, bool hasHeaderRow, char delimiter, out string errorMsg) where TClassMap : CsvHelper.Configuration.CsvClassMap
{
    errorMsg = string.Empty;
    List<T> items = new List<T>();
    using (var memStream = new MemoryStream(Encoding.UTF8.GetBytes(csvData)))
    using (var streamReader = new StreamReader(memStream))
    using (var csvReader = new CsvReader(streamReader))
    {
        csvReader.Configuration.RegisterClassMap<TClassMap>();
        csvReader.Configuration.DetectColumnCountChanges = true;
        csvReader.Configuration.IsHeaderCaseSensitive = false;
        csvReader.Configuration.TrimHeaders = true;
        csvReader.Configuration.Delimiter = delimiter.ToString();
        csvReader.Configuration.SkipEmptyRecords = true;
        try
        {
            items = csvReader.GetRecords<T>().ToList();
        }
        catch (Exception ex)
        {
            // collect the whole exception chain so the caller can see what went wrong
            while (ex != null)
            {
                errorMsg += ex.Message + Environment.NewLine;
                foreach (var val in ex.Data.Values)
                {
                    errorMsg += val + Environment.NewLine;
                }
                ex = ex.InnerException;
            }
        }
    }
    return items;
}
Edit - I don't understand what you are doing with the bulk insert. You want to bulk insert the whole list or DataTable, not row by row.
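For example, a rough sketch of inserting the whole list in a single bulk copy. The Customer class with an Id and Name, the table name, and the connection string are made up for illustration; they are not from the question.

// Hypothetical example only: one SqlBulkCopy call for the entire list instead of row-by-row inserts.
public static void BulkInsertAll(List<Customer> items, string connectionString)
{
    var table = new DataTable("Customers");
    table.Columns.Add("Id", typeof(int));      // illustrative columns
    table.Columns.Add("Name", typeof(string));
    foreach (var item in items)
    {
        table.Rows.Add(item.Id, item.Name);
    }

    using (var bulkCopy = new SqlBulkCopy(connectionString))
    {
        bulkCopy.DestinationTableName = "dbo.Customers";
        bulkCopy.WriteToServer(table); // the whole list goes over in one bulk operation
    }
}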
You can create a stored procedure and pass the file location as shown below.
CREATE PROCEDURE [dbo].[CSVReaderTransaction]
    @Filepath varchar(100) = ''
AS
    -- STEP 1: Start the transaction
    BEGIN TRANSACTION

    -- STEP 2 & 3: checking @@ERROR after each statement
    EXEC ('BULK INSERT Employee FROM ''' + @Filepath
        + ''' WITH (FIELDTERMINATOR = '','', ROWTERMINATOR = ''\n'' )')

    -- Rollback the transaction if there were any errors
    IF @@ERROR <> 0
    BEGIN
        -- Rollback the transaction
        ROLLBACK

        -- Raise an error and return
        RAISERROR ('Error in inserting data into employee Table.', 16, 1)
        RETURN
    END

    COMMIT TRANSACTION
You can also add the BATCHSIZE option in the WITH clause, alongside FIELDTERMINATOR and ROWTERMINATOR.
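Called from C#, the procedure only needs the server-visible file path. A small sketch follows; the connection string and path are placeholders, and note that BULK INSERT reads the file from the SQL Server machine, not from the client.

// Hypothetical caller for the stored procedure above.
using (var connection = new SqlConnection(connectionString))
using (var command = new SqlCommand("[dbo].[CSVReaderTransaction]", connection))
{
    command.CommandType = CommandType.StoredProcedure;
    command.CommandTimeout = 0; // large files can easily exceed the default 30-second timeout
    command.Parameters.Add(new SqlParameter("@Filepath", @"D:\data\employees.csv")); // placeholder path

    connection.Open();
    command.ExecuteNonQuery();
}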
Related
I am using a basic StreamReader to loop through a CSV file of about 65 GB (450 million rows).
using (sr = new StreamReader(currentFileName))
{
    string headerLine = sr.ReadLine(); // skip the headers
    string currentTick;
    while ((currentTick = sr.ReadLine()) != null)
    {
        string[] tickValue = currentTick.Split(',');
        // Ticks are formatted and added to the list in order to insert them afterwards.
    }
}
This creates a list that holds the ticks belonging to a candle and then calls the insertTickBatch function.
private async static Task insertTickBatch(List<Tick> ticks)
{
if (ticks != null && ticks.Any())
{
using (DatabaseEntities db = new DatabaseEntities())
{
db.Configuration.LazyLoadingEnabled = false;
int currentCandleId = ticks.First().CandleId;
var candle = db.Candles.Where(c => c.Id == currentCandleId).FirstOrDefault();
foreach (var curTick in ticks)
{
candle.Ticks.Add(curTick);
}
await db.SaveChangesAsync();
db.Dispose();
Thread.Sleep(10);
}
}
}
This, however, takes about 15 years to complete, and my intention is to speed it up. How do I achieve this?
I am not sure which EF version you are using, but if it is available, try this instead of your foreach loop:
db.Ticks.AddRange(ticks);
Also, CsvHelper is a nice package that can convert your entire file into a Tick object list, and of course the Thread.Sleep has to go. A batched sketch of the AddRange approach follows below.
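A minimal sketch of that AddRange approach with change tracking relaxed; DatabaseEntities and Tick come from the question, while the 10,000 batch size and the disabled-validation settings are assumptions.

// Sketch only: insert ticks in batches with AddRange, recreating the context per batch.
private static async Task InsertTickBatchAsync(List<Tick> ticks)
{
    const int batchSize = 10000; // assumed batch size

    for (int i = 0; i < ticks.Count; i += batchSize)
    {
        using (var db = new DatabaseEntities())
        {
            db.Configuration.AutoDetectChangesEnabled = false; // skip per-entity change scans
            db.Configuration.ValidateOnSaveEnabled = false;
            db.Ticks.AddRange(ticks.Skip(i).Take(batchSize));
            await db.SaveChangesAsync();
        }
    }
}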
I'm working on an importer in our web application. With the code I currently have, it runs fine and within reason when connecting to a local SQL Server. I'm also creating a .sql script that users can download as well.
Example 1
40k records, 8 columns: from 1 minute 30 seconds to 2 minutes
When I move it to production on an Azure App Service, it runs VERY slowly.
Example 2
40k records, 8 columns: from 15 minutes to 18 minutes
The current database is set to: Pricing tier: Standard S2: 50 DTUs
Here is the code:
using (var sqlConnection = new SqlConnection(connectionString))
{
try
{
var generatedScriptFilePathInfo = GetImportGeneratedScriptFilePath(trackingInfo.UploadTempDirectoryPath, trackingInfo.FileDetail);
using (FileStream fileStream = File.Create(generatedScriptFilePathInfo.GeneratedScriptFilePath))
{
using (StreamWriter writer = new StreamWriter(fileStream))
{
sqlConnection.Open();
sqlTransaction = sqlConnection.BeginTransaction();
await writer.WriteLineAsync("/* Insert Scripts */").ConfigureAwait(false);
foreach (var item in trackingInfo.InsertSqlScript)
{
errorSqlScript = item;
using (var cmd = new SqlCommand(item, sqlConnection, sqlTransaction))
{
cmd.CommandTimeout = 800;
cmd.CommandType = CommandType.Text;
await cmd.ExecuteScalarAsync().ConfigureAwait(false);
}
currentRowLine++;
rowsProcessedUpdateEveryXCounter++;
rowsProcessedTotal++;
// append insert statement to the file
await writer.WriteLineAsync(item).ConfigureAwait(false);
}
// write out a couple of blank lines to separate insert statements from post scripts (if there are any)
await writer.WriteLineAsync(string.Empty).ConfigureAwait(false);
await writer.WriteLineAsync(string.Empty).ConfigureAwait(false);
}
}
}
catch (OverflowException exOverFlow)
{
sqlTransaction.Rollback();
sqlTransaction.Dispose();
trackingInfo.IsSuccessful = false;
trackingInfo.ImportMetricUpdateError = new ImportMetricUpdateErrorDTO(trackingInfo.ImportMetricId)
{
ErrorLineNbr = currentRowLine + 1, // add one to go ahead and count the record we are on to sync up with the file
ErrorMessage = string.Format(CultureInfo.CurrentCulture, "{0}", ImporterHelper.ArithmeticOperationOverflowFriendlyErrorText),
ErrorSQL = errorSqlScript,
RowsProcessed = currentRowLine
};
await LogImporterError(trackingInfo.FileDetail, exOverFlow.ToString(), currentUserId).ConfigureAwait(false);
await UpdateImportAfterFailure(trackingInfo.ImportMetricId, exOverFlow.Message, currentUserId).ConfigureAwait(false);
return trackingInfo;
}
catch (Exception ex)
{
sqlTransaction.Rollback();
sqlTransaction.Dispose();
trackingInfo.IsSuccessful = false;
trackingInfo.ImportMetricUpdateError = new ImportMetricUpdateErrorDTO(trackingInfo.ImportMetricId)
{
ErrorLineNbr = currentRowLine + 1, // add one to go ahead and count the record we are on to sync up with the file
ErrorMessage = string.Format(CultureInfo.CurrentCulture, "{0}", ex.Message),
ErrorSQL = errorSqlScript,
RowsProcessed = currentRowLine
};
await LogImporterError(trackingInfo.FileDetail, ex.ToString(), currentUserId).ConfigureAwait(false);
await UpdateImportAfterFailure(trackingInfo.ImportMetricId, ex.Message, currentUserId).ConfigureAwait(false);
return trackingInfo;
}
}
Questions
Is there any way to speed this up on Azure? Or is the only way to upgrade the DTUs?
We are looking into SqlBulkCopy as well. Will this help, or will it still be slow on Azure? https://learn.microsoft.com/en-us/dotnet/api/system.data.sqlclient.sqlbulkcopy?redirectedfrom=MSDN&view=dotnet-plat-ext-5.0
Desired results
Run at the same speed as when running against a local SQL Server database
For now, I updated my code to batch the insert statements based on how many records there are. If the record count is over 10k, it batches them by dividing the total by 10.
This helped performance BIG TIME on our Azure instance. I was able to add 40k records within 30 seconds. I also think part of the issue was how many different slots share our App Service on Azure.
We will also probably move to SqlBulkCopy later on, as users need to import larger Excel files.
Thanks everyone for the help and insights!
// apply the insert SQL scripts if found.
if (string.IsNullOrWhiteSpace(trackingInfo.InsertSqlScript.ToString()) == false)
{
int? updateEveryXRecords = GetProcessedEveryXTimesForApplyingInsertStatementsValue(trackingInfo.FileDetail);
trackingInfo.FileDetail = UpdateImportMetricStatus(trackingInfo.FileDetail, ImportMetricStatus.ApplyingInsertScripts, currentUserId);
int rowsProcessedUpdateEveryXCounter = 0;
int rowsProcessedTotal = 0;
await UpdateImportMetricsRowsProcessed(trackingInfo.ImportMetricId, rowsProcessedTotal, trackingInfo.FileDetail.ImportMetricStatusHistories).ConfigureAwait(false);
bool isBulkMode = trackingInfo.InsertSqlScript.Count >= 10000;
await writer.WriteLineAsync("/* Insert Scripts */").ConfigureAwait(false);
int insertCounter = 0;
int bulkCounter = 0;
int bulkProcessingAmount = 0;
int lastInsertCounter = 0;
if (isBulkMode == true)
{
bulkProcessingAmount = trackingInfo.InsertSqlScript.Count / 10;
}
await LogInsertBulkStatus(trackingInfo.FileDetail, isBulkMode, trackingInfo.InsertSqlScript.Count, bulkProcessingAmount, currentUserId).ConfigureAwait(false);
StringBuilder sbInsertBulk = new StringBuilder();
foreach (var item in trackingInfo.InsertSqlScript)
{
if (isBulkMode == false)
{
errorSqlScript = item;
using (var cmd = new SqlCommand(item, sqlConnection, sqlTransaction))
{
cmd.CommandTimeout = 800;
cmd.CommandType = CommandType.Text;
await cmd.ExecuteScalarAsync().ConfigureAwait(false);
}
currentRowLine++;
rowsProcessedUpdateEveryXCounter++;
rowsProcessedTotal++;
// append insert statement to the file
await writer.WriteLineAsync(item).ConfigureAwait(false);
// Update database with the insert statement created count to alert the user of the status.
if (updateEveryXRecords.HasValue)
{
if (updateEveryXRecords.Value == rowsProcessedUpdateEveryXCounter)
{
await UpdateImportMetricsRowsProcessed(trackingInfo.ImportMetricId, rowsProcessedTotal, trackingInfo.FileDetail.ImportMetricStatusHistories).ConfigureAwait(false);
rowsProcessedUpdateEveryXCounter = 0;
}
}
}
else
{
sbInsertBulk.AppendLine(item);
if (bulkCounter < bulkProcessingAmount)
{
errorSqlScript = string.Format(CultureInfo.CurrentCulture, "IsBulkMode is True | insertCounter = {0}", insertCounter);
bulkCounter++;
}
else
{
// display to the end user
errorSqlScript = string.Format(CultureInfo.CurrentCulture, "IsBulkMode is True | currentInsertCounter value = {0} | lastInsertCounter (insertCounter when the last bulk insert occurred): {1}", insertCounter, lastInsertCounter);
await ApplyBulkInsertStatements(sbInsertBulk, writer, sqlConnection, sqlTransaction, trackingInfo, rowsProcessedTotal).ConfigureAwait(false);
bulkCounter = 0;
sbInsertBulk.Clear();
lastInsertCounter = insertCounter;
}
rowsProcessedTotal++;
}
insertCounter++;
}
// get the remaining records after finishing the forEach insert statement
if (isBulkMode == true)
{
await ApplyBulkInsertStatements(sbInsertBulk, writer, sqlConnection, sqlTransaction, trackingInfo, rowsProcessedTotal).ConfigureAwait(false);
}
}
/// <summary>
/// Applies the bulk insert statements.
/// </summary>
/// <param name="sbInsertBulk">The sb insert bulk.</param>
/// <param name="wrtier">The wrtier.</param>
/// <param name="sqlConnection">The SQL connection.</param>
/// <param name="sqlTransaction">The SQL transaction.</param>
/// <param name="trackingInfo">The tracking information.</param>
/// <param name="rowsProcessedTotal">The rows processed total.</param>
/// <returns>Task</returns>
private async Task ApplyBulkInsertStatements(
StringBuilder sbInsertBulk,
StreamWriter writer,
SqlConnection sqlConnection,
SqlTransaction sqlTransaction,
ProcessImporterTrackingDTO trackingInfo,
int rowsProcessedTotal)
{
var bulkInsertStatements = sbInsertBulk.ToString();
using (var cmd = new SqlCommand(bulkInsertStatements, sqlConnection, sqlTransaction))
{
cmd.CommandTimeout = 800;
cmd.CommandType = CommandType.Text;
await cmd.ExecuteScalarAsync().ConfigureAwait(false);
}
// append insert statement to the file
await writer.WriteLineAsync(bulkInsertStatements).ConfigureAwait(false);
// Update database with the insert statement created count to alert the user of the status.
await UpdateImportMetricsRowsProcessed(trackingInfo.ImportMetricId, rowsProcessedTotal, trackingInfo.FileDetail.ImportMetricStatusHistories).ConfigureAwait(false);
}
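For reference, here is a rough sketch of what the later SqlBulkCopy version could look like. The DataTable of rows, the destination table name, and the batch size are placeholders, and it reuses the already-open connection and transaction from the code above.

// Sketch: one bulk copy call replaces thousands of individual INSERT statements.
private static async Task BulkCopyRowsAsync(DataTable rows, SqlConnection sqlConnection, SqlTransaction sqlTransaction)
{
    using (var bulkCopy = new SqlBulkCopy(sqlConnection, SqlBulkCopyOptions.Default, sqlTransaction))
    {
        bulkCopy.DestinationTableName = "dbo.ImportedRecords"; // placeholder table name
        bulkCopy.BatchSize = 5000;    // placeholder batch size
        bulkCopy.BulkCopyTimeout = 0; // no timeout; DTU throttling can stretch large loads
        await bulkCopy.WriteToServerAsync(rows).ConfigureAwait(false);
    }
}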
I have a WCF service that queries a database and returns a large number of records. There are so many records that the server runs out of memory and fails before it can return.
So I want to send the records back as I fetch them from the database, or a set number at a time.
For additional clarity, I cannot collect all the fetched records into a collection on the server, as the server runs out of memory before I have collected them all. I want to find a way to send them back one by one or in chunks, in one call.
For example, in chunks:
Fetch first 1000 records
Add to collection
Send collection to client
Clear collection
Fetch next 1000 records, and repeat from step 2
So the idea I have how the web service code will look something like this:
public IEnumerable<Customer> GetAllCustomers()
{
// Setup Query
string query = PrepareQuery();
// Create Connection
connection = new SqlConnection(ConnectionString);
connection.Open();
var sqlcommand = connection.CreateCommand();
sqlcommand.CommandText = query.ToString();
// Read Results
var reader = sqlcommand.ExecuteReader();
while (reader.Read())
{
Customer customer = new Customer();
foreach (var column in Columns)
{
int fieldIndex = reader.GetOrdinal(column);
object value = reader.GetValue(fieldIndex);
customer[column.Name] = value;
}
yield return customer;
}
}
I don't want to consider paging as the Order By on the SQL server is slow.
I'm looking for a way to do this in WCF.
I think you've answered your own question. There are two ways to do it: stream or chunk.
You can do streaming in WCF - see https://learn.microsoft.com/en-us/dotnet/framework/wcf/feature-details/large-data-and-streaming
You get a Stream to write to, so you need to handle yourself how you are going to encode your data on that stream, and how you are going to decode it at the client.
The alternative is chunking/paging. You just modify your service so it accepts e.g. a page number or some other way to indicate which page is needed (a minimal sketch of this follows below).
Which one you choose depends on the application, e.g. how much data? What is the nature of the client? Is it possible to use some field to page on? And so on.
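For the chunking route, a minimal sketch of a paged contract. It reuses the Customer type and ConnectionString from the question; the service name, the Id ordering column, and the OFFSET/FETCH query are assumptions.

// Sketch only: a paged operation the client calls repeatedly until it gets an empty page.
[ServiceContract]
public interface ICustomerService
{
    [OperationContract]
    List<Customer> GetCustomersPage(int pageNumber, int pageSize);
}

public class CustomerService : ICustomerService
{
    public List<Customer> GetCustomersPage(int pageNumber, int pageSize)
    {
        var page = new List<Customer>(pageSize);
        using (var connection = new SqlConnection(ConnectionString))
        {
            connection.Open();
            using (var command = connection.CreateCommand())
            {
                // OFFSET/FETCH needs an ORDER BY; an indexed, reasonably unique column keeps it cheap.
                command.CommandText = "SELECT * FROM Customers ORDER BY Id " +
                                      "OFFSET @offset ROWS FETCH NEXT @pageSize ROWS ONLY";
                command.Parameters.AddWithValue("@offset", pageNumber * pageSize);
                command.Parameters.AddWithValue("@pageSize", pageSize);
                using (var reader = command.ExecuteReader())
                {
                    while (reader.Read())
                    {
                        // map the columns into a Customer exactly as in the question, then:
                        // page.Add(customer);
                    }
                }
            }
        }
        return page;
    }
}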
For the streaming route, here is some pseudo code for making a stream that does this on the server side. It is based on the example here: https://learn.microsoft.com/en-us/dotnet/framework/wcf/feature-details/how-to-enable-streaming
I'm not writing the full compilable code for you, but this is the gist of it.
In the server:
public Stream GetBigData()
{
return new BigDataStream();
}
BigDataStream (the non-implemented methods are not shown):
class BigDataStream : Stream
{
public BigDataStream()
{
// open DB connection
// run your query
// get a DataReader
}
// the DataReader opened in the constructor
SqlDataReader _reader;
// you need a buffer to encode your data between calls to Read
List<byte> _encodeBuffer = new List<byte>();
public override int Read(byte[] buffer, int offset, int count)
{
// read from the DataReader and populate the _encodeBuffer
// until the _encodeBuffer contains at least count bytes
// (or until there are no more records)
// for example:
while (_encodeBuffer.Count < count && _reader.Read())
{
// (1)
// encode the record into a byte array. How to do this?
// you can read into a class and then use the data
// contract serialization for example. If you do this, you
// will probably find it easier to prepend an integer which
// specifies the length of the following encoded message.
// This will make it easier for the client to deserialize it.
// (2)
// append the encoded record bytes (plus any length prefix
// etc) to _encodeBuffer
}
// remove up to the first count bytes from _encodeBuffer
// and copy them into buffer at the offset requested
// return the number of bytes added
}
public override void Close()
{
// close the reader + db connection
base.Close();
}
}
Thanks to mikelegg & Reniuz for helping me come to a solution. I wish I could give them the tick for the right answer, but I am afraid the next developer to read this question would not fully benefit. So here is what I ended up with.
Set up the config files for the server and client (follow link: Large Data and Streaming)
Followed this solution, can download source code from here
I had to change the DBRowStream.DBThreadProc method a bit to get it to work, so I post the source code:
DBRowStream Class:
void DBThreadProc(object o)
{
SqlConnection con = null;
SqlCommand com = null;
try
{
con = new System.Data.SqlClient.SqlConnection(/*ConnectionString*/);
com = new SqlCommand();
com.Connection = con;
com.CommandText = PrepareQuery();
con.Open();
SqlDataReader reader = com.ExecuteReader();
int count = 0;
MemoryStream memStream = memStream1;
memStreamWriteStatus = 1;
readyToWriteToMemStream1.WaitOne();
while (reader.Read())
{
// Populate
Customer customer = new Customer();
foreach (var column in Columns)
{
int fieldIndex = reader.GetOrdinal(column);
object value = reader.GetValue(fieldIndex);
customer[column.Name] = value;
}
// Serialize: I used a custom Serializer
// but BinaryFormatter should be fine
DBDataFormatter.Serialize(memStream, customer);
count++;
if (count == PAGESIZE) // const int PAGESIZE = 10000
{
switch (memStreamWriteStatus)
{
case 1: // done writing to stream 1
{
memStream1.Position = 0;
readyToSendFromMemStream1.Set();
// write stream 1 is done...waiting for stream 2
readyToWriteToMemStream2.WaitOne();
memStream = memStream2;
memStream.Position = 0;
memStream.SetLength(0); // Added:To Reset the stream. Else was getting garbage data back
memStreamWriteStatus = 2;
break;
}
case 2: // done writing to stream 2
{
memStream2.Position = 0;
readyToSendFromMemStream2.Set();
// Write on stream 2 is done...waiting for stream 1
readyToWriteToMemStream1.WaitOne();
// done waiting for stream 1
memStream = memStream1;
memStreamWriteStatus = 1;
memStream.Position = 0;
memStream.SetLength(0); // Added: Reset the stream. Else was getting garbage data back
break;
}
}
count = 0;
}
}
if (count > 0)
{
switch (memStreamWriteStatus)
{
case 1: // done writing to stream 1
{
memStream1.Position = 0;
readyToSendFromMemStream1.Set();
// END write stream 1 is done...waiting for stream 2
break;
}
case 2: // done writing to stream 2
{
memStream2.Position = 0;
readyToSendFromMemStream2.Set();
// END write stream 2 is done...waiting for stream 1
break;
}
}
}
bDoneWriting = true;
bCanRead = false;
}
catch
{
throw;
}
finally
{
if (com != null)
{
com.Dispose();
com = null;
}
if (con != null)
{
con.Close();
con.Dispose();
con = null;
}
}
}
And then the Client side:
private static void TestGetRecordsAndDump()
{
const string FILE_NAME = "Records.CSV";
File.Delete(FILE_NAME);
var file = File.AppendText(FILE_NAME);
long count = 0;
try
{
ServiceReference1.ServiceClient service = new ServiceReference1.DataServiceClient();
var stream = service.GetDBRowStream();
Console.WriteLine("Records Retrieved : ");
Console.WriteLine("File Size (MB) : ");
var canDoLastRead = true;
while (stream.CanRead && canDoLastRead)
{
try
{
Customer customer = DBDataFormatter.Deserialize(stream); // Used custom Deserializer, but BinaryFormatter should be fine
file.Write(customer.ToString());
count++;
}
catch
{
canDoLastRead = false; // Bug: stream.CanRead is not set to false at the end of the stream, so I use this trick to know when I have finished returning all records.
}
finally
{
Console.SetCursorPosition("Records Retrieved : ".Length, 0);
Console.Write(string.Format("{0} ", count));
Console.SetCursorPosition("File Size (MB) : ".Length, 1);
Console.Write(string.Format("{0:G} ", file.BaseStream.Length / 1024f / 1024f));
}
}
}
finally
{
file.Close();
}
}
There is a bug I cannot seem to solve: stream.CanRead is not set to false when all the records have been returned. I have not been able to work out why, but at least now I can query large data sets and return all records without the server or client running out of memory.
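One way around that CanRead issue, building on the length-prefix idea from the pseudo code above (a sketch, not the code actually used): treat a zero-byte read of the 4-byte length prefix as end of stream, since Stream.Read returns 0 once the response stream is exhausted.

// Sketch: read the 4-byte length prefix; a zero-byte read means the server has sent everything.
private static bool TryReadRecordLength(Stream stream, out int length)
{
    byte[] prefix = new byte[4];
    int read = 0;
    while (read < prefix.Length)
    {
        int n = stream.Read(prefix, read, prefix.Length - read);
        if (n == 0)
        {
            length = 0;
            return false; // end of stream: stop the client loop here instead of relying on CanRead
        }
        read += n;
    }
    length = BitConverter.ToInt32(prefix, 0);
    return true;
}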
I'm working on a utility to read through a JSON file I've been given and to transfer its contents into SQL Server. My weapon of choice is a .NET Core console app (I'm trying to do all of my new work with .NET Core unless there is a compelling reason not to). I have the whole thing "working", but there is clearly a problem somewhere, because the performance is truly horrifying, almost to the point of being unusable.
The JSON file is approximately 27 MB and contains a main array of 214 elements, and each of those contains a couple of fields along with an array of 150-350 records (that array has several fields and potentially a small <5-record array or two). Total records are approximately 35,000.
In the code below I've changed some names and stripped out a few of the fields to keep it more readable but all of the logic and code that does actual work is unchanged.
Keep in mind, I've done a lot of testing with the placement and number of calls to SaveChanges(), thinking initially that the number of trips to the Db was the problem. Although the version below calls SaveChanges() once for each iteration of the 214-element loop, I've tried moving it outside of the entire looping structure and there is no discernible change in performance. In other words, with zero trips to the Db, this is still SLOW. How slow, you ask? How does > 24 hours to run hit you? I'm willing to try anything at this point and am even considering moving the whole process into SQL Server, but I would much rather work in C# than T-SQL.
static void Main(string[] args)
{
string statusMsg = String.Empty;
JArray sets = JArray.Parse(File.ReadAllText(@"C:\Users\Public\Downloads\ImportFile.json"));
try
{
using (var _db = new WidgetDb())
{
for (int s = 0; s < sets.Count; s++)
{
Console.WriteLine($"{s.ToString()}: {sets[s]["name"]}");
// First we create the Set
Set eSet = new Set()
{
SetCode = (string)sets[s]["code"],
SetName = (string)sets[s]["name"],
Type = (string)sets[s]["type"],
Block = (string)sets[s]["block"] ?? ""
};
_db.Entry(eSet).State = Microsoft.EntityFrameworkCore.EntityState.Added;
JArray widgets = sets[s]["widgets"].ToObject<JArray>();
for (int c = 0; c < widgets.Count; c++)
{
Widget eWidget = new Widget()
{
WidgetId = (string)widgets[c]["id"],
Layout = (string)widgets[c]["layout"] ?? "",
WidgetName = (string)widgets[c]["name"],
WidgetNames = "",
ReleaseDate = releaseDate,
SetCode = (string)sets[s]["code"]
};
// WidgetColors
if (widgets[c]["colors"] != null)
{
JArray widgetColors = widgets[c]["colors"].ToObject<JArray>();
for (int cc = 0; cc < widgetColors.Count; cc++)
{
WidgetColor eWidgetColor = new WidgetColor()
{
WidgetId = eWidget.WidgetId,
Color = (string)widgets[c]["colors"][cc]
};
_db.Entry(eWidgetColor).State = Microsoft.EntityFrameworkCore.EntityState.Added;
}
}
// WidgetTypes
if (widgets[c]["types"] != null)
{
JArray widgetTypes = widgets[c]["types"].ToObject<JArray>();
for (int ct = 0; ct < widgetTypes.Count; ct++)
{
WidgetType eWidgetType = new WidgetType()
{
WidgetId = eWidget.WidgetId,
Type = (string)widgets[c]["types"][ct]
};
_db.Entry(eWidgetType).State = Microsoft.EntityFrameworkCore.EntityState.Added;
}
}
// WidgetVariations
if (widgets[c]["variations"] != null)
{
JArray widgetVariations = widgets[c]["variations"].ToObject<JArray>();
for (int cv = 0; cv < widgetVariations.Count; cv++)
{
WidgetVariation eWidgetVariation = new WidgetVariation()
{
WidgetId = eWidget.WidgetId,
Variation = (string)widgets[c]["variations"][cv]
};
_db.Entry(eWidgetVariation).State = Microsoft.EntityFrameworkCore.EntityState.Added;
}
}
}
_db.SaveChanges();
}
}
statusMsg = "Import Complete";
}
catch (Exception ex)
{
statusMsg = ex.Message + " (" + ex.InnerException + ")";
}
Console.WriteLine(statusMsg);
Console.ReadKey();
}
I had an issue with that kind of code: lots of loops and tons of changing state.
Any change/manipulation you make in the _db context will generate a "trace" of it, and that makes your context slower each time. Read more here.
The fix for me was to create a new EF context (_db) at some key points. It saved me a few hours per run!
You could try to create a new instance of _db on each iteration of this loop (see the sketch below):
contains a main array of 214 elements
If it makes no difference, try adding some stopwatches to get a better idea of what/where it is taking so long.
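A minimal sketch of that suggestion, keeping the loop body from the question unchanged; only the context lifetime moves.

// Sketch only: dispose and recreate the context per set so the change tracker
// never accumulates all ~35,000 tracked entities at once.
for (int s = 0; s < sets.Count; s++)
{
    using (var _db = new WidgetDb())
    {
        // ... build and attach the Set, Widget, WidgetColor, WidgetType and
        // WidgetVariation entities for sets[s] exactly as in the question ...
        _db.SaveChanges(); // each context only ever tracks one set's worth of entities
    }
}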
If you're making thousands of updates then EF is not really the way to go. Something like SqlBulkCopy will do the trick.
You could try the bulkwriter library.
IEnumerable<string> ReadFile(string path)
{
using (var stream = File.OpenRead(path))
using (var reader = new StreamReader(stream))
{
while (reader.Peek() >= 0)
{
yield return reader.ReadLine();
}
}
}
var items =
from line in ReadFile(@"C:\products.csv")
let values = line.Split(',')
select new Product {Sku = values[0], Name = values[1]};
then
using (var bulkWriter = new BulkWriter<Product>(connectionString)) {
bulkWriter.WriteToDatabase(items);
}
I've written a small console app that I point to a folder containing DBF/FoxPro files.
It then creates a table in SQL based on each DBF table and does a bulk copy to insert the data into SQL. It works quite well for the most part, except for a few snags:
1) Some of the FoxPro tables contain 5,000,000+ records and the connection expires before the insert completes.
Here is my connection string:
<add name="SQL" connectionString="data source=source_source;persist security info=True;user id=DBFToSQL;password=DBFToSQL;Connection Timeout=20000;Max Pool Size=200" providerName="System.Data.SqlClient" />
Error message:
"Timeout expired. The timeout period elapsed prior to completion of the operation or the server is not responding."
CODE:
using (SqlConnection SQLConn = new SqlConnection(SQLString))
using (OleDbConnection FPConn = new OleDbConnection(FoxString))
{
ServerConnection srvConn = new Microsoft.SqlServer.Management.Common.ServerConnection(SQLConn);
try
{
FPConn.Open();
string dataString = String.Format("Select * from {0}", tableName);
using (OleDbCommand Command = new OleDbCommand(dataString, FPConn))
using (OleDbDataReader Reader = Command.ExecuteReader(CommandBehavior.SequentialAccess))
{
tbl = new Table(database, tableName, "schema");
for (int i = 0; i < Reader.FieldCount; i++)
{
col = new Column(tbl, Reader.GetName(i), ConvertTypeToDataType(Reader.GetFieldType(i)));
col.Nullable = true;
tbl.Columns.Add(col);
}
tbl.Create();
BulkCopy(Reader, tableName);
}
}
catch (Exception ex)
{
// LogText(ex, @"C:\LoadTable_Errors.txt", tableName);
throw ex;
}
finally
{
SQLConn.Close();
srvConn.Disconnect();
}
}
private DataType ConvertTypeToDataType(Type type)
{
switch (type.ToString())
{
case "System.Decimal":
return DataType.Decimal(18, 38);
case "System.String":
return DataType.NVarCharMax;
case "System.Int32":
return DataType.Int;
case "System.DateTime":
return DataType.DateTime;
case "System.Boolean":
return DataType.Bit;
default:
throw new NotImplementedException("ConvertTypeToDataType Not implemented for type : " + type.ToString());
}
}
private void BulkCopy(OleDbDataReader reader, string tableName)
{
using (SqlConnection SQLConn = new SqlConnection(SQLString))
{
SQLConn.Open();
SqlBulkCopy bulkCopy = new SqlBulkCopy(SQLConn);
bulkCopy.DestinationTableName = "schema." + tableName;
try
{
bulkCopy.WriteToServer(reader);
}
catch (Exception ex)
{
//LogText(ex, @"C:\BulkCopy_Errors.txt", tableName);
}
finally
{
SQLConn.Close();
reader.Close();
}
}
}
My 2nd & 3rd errors are the following:
I understand what the issues are, but I'm not so sure how to rectify them.
2) "The provider could not determine the Decimal value. For example, the row was just created, the default for the Decimal column was not available, and the consumer had not yet set a new Decimal value."
3) SqlDateTime overflow. Must be between 1/1/1753 12:00:00 AM and 12/31/9999 11:59:59 PM.
I found a result on Google that indicated what the issue is [A]... and a possible workaround [B] (but I'd like to keep my decimal values as decimals and my dates as dates, as I'll be doing further calculations against the data).
What I want to do as a solution:
1) Either increase the connection timeout (but I don't think I can increase it any more than I have), or alternatively, is it possible to split the OleDbDataReader's results and do an incremental bulk insert?
2) Is it possible to have the bulk copy ignore rows with errors, or to have the records that do error out logged to a CSV file or something to that extent?
So where you have the "for" statement, I would probably break it up to take so many at a time:
int i = 0;
int MaxCount = 1000;
while (i < Reader.FieldCount)
{
var tbl = new Table(database, tableName, "schema");
for (int j = i; j < MaxCount; j++)
{
col = new Column(tbl, Reader.GetName(j), ConvertTypeToDataType(Reader.GetFieldType(j)));
col.Nullable = true;
tbl.Columns.Add(col);
i++;
}
tbl.Create();
BulkCopy(Reader, tableName);
}
So, "i" keeps track of the overall count, "j" keeps track of the incremental count (ie your max at one time count) and when you have created your 'batch', you create the table and Bulk Copy it.
Does that look like what you would expect?
Cheers,
Chris.
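Worth noting alongside the answer above: the Connection Timeout in the connection string only covers establishing the connection; the bulk copy itself is governed by SqlBulkCopy.BulkCopyTimeout, which defaults to 30 seconds. Here is a sketch of the BulkCopy method with that, a batch size, and progress notification set (the values are assumptions).

private void BulkCopy(OleDbDataReader reader, string tableName)
{
    using (SqlConnection SQLConn = new SqlConnection(SQLString))
    {
        SQLConn.Open();
        using (SqlBulkCopy bulkCopy = new SqlBulkCopy(SQLConn))
        {
            bulkCopy.DestinationTableName = "schema." + tableName;
            bulkCopy.BulkCopyTimeout = 0;  // 0 = wait as long as the copy takes
            bulkCopy.BatchSize = 100000;   // commit in chunks rather than one huge batch
            bulkCopy.NotifyAfter = 100000; // optional progress reporting
            bulkCopy.SqlRowsCopied += (s, e) => Console.WriteLine(e.RowsCopied + " rows copied");
            bulkCopy.WriteToServer(reader);
        }
    }
}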
This is my current attempt at the bulk copy method. It works for about 90% of the tables, but I get an OutOfMemory exception with the bigger tables... I'd like to split the reader's data into smaller sections, without having to pass it into a DataTable and store it in memory first (which is the cause of the OutOfMemory exception on the bigger result sets).
UPDATE
I modified the code below to reflect how it looks in my solution. It ain't pretty, but it works. I'll definitely do some refactoring, and update my answer again.
private void BulkCopy(OleDbDataReader reader, string tableName, Table table)
{
Console.WriteLine(tableName + " BulkCopy Started.");
try
{
DataTable tbl = new DataTable();
List<Type> typeList = new List<Type>();
foreach (Column col in table.Columns)
{
tbl.Columns.Add(col.Name, ConvertDataTypeToType(col.DataType));
typeList.Add(ConvertDataTypeToType(col.DataType));
}
int batch = 1;
int counter = 0;
DataRow tblRow = tbl.NewRow();
while (reader.Read())
{
counter++;
int colcounter = 0;
foreach (Column col in table.Columns)
{
try
{
tblRow[colcounter] = reader[colcounter];
}
catch (Exception)
{
tblRow[colcounter] = GetDefault(typeList[colcounter]); // use the current column's type, not the first column's
}
colcounter++;
}
tbl.LoadDataRow(tblRow.ItemArray, true);
if (counter == BulkInsertIncrement)
{
Console.WriteLine(tableName + " :: Batch >> " + batch);
counter = PerformInsert(tableName, tbl, batch);
batch++;
}
}
if (counter > 0)
{
Console.WriteLine(tableName + " :: Batch >> " + batch);
PerformInsert(tableName, tbl, counter);
}
tbl = null;
Console.WriteLine("BulkCopy Success!");
}
catch (Exception ex)
{
Console.WriteLine("BulkCopy Fail!");
SharedLogger.Write(ex, @"C:\BulkCopy_Errors.txt", tableName);
Console.WriteLine(ex.Message);
}
finally
{
reader.Close();
reader.Dispose();
}
Console.WriteLine(tableName + " BulkCopy Ended.");
Console.WriteLine("*****");
Console.WriteLine("");
}