I'm working on an import from a CSV file to my ASP.NET MVC3/C#/Entity Framework Application.
Currently this is my code, but I'm looking to optimise:
var excel = new ExcelQueryFactory(file);
var data = from c in excel.Worksheet(0)
select c;
var dataList = data.ToList();
List<FullImportExcel> importList = new List<FullImportExcel>();
foreach (var s in dataList.ToArray())
{
if ((s[0].ToString().Trim().Length < 6) && (s[1].ToString().Trim().Length < 7))
{
FullImportExcel item = new FullImportExcel();
item.Carrier = s[0].ToString().Trim();
item.FlightNo = s[1].ToString().Trim();
item.CodeFlag = s[2].ToString().Trim();
//etc etc (50 more columns here)
importList.Add(item);
}
}
PlannerEntities context = null;
context = new PlannerEntities();
context.Configuration.AutoDetectChangesEnabled = false;
int count = 0;
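foreach (var item in importList)
{
++count;
context = AddToFullImportContext(context, item, count, 100, true);
}
// save any rows left over from the last partial batch, then release the context
context.SaveChanges();
context.Dispose();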
private PlannerEntities AddToFullImportContext(PlannerEntities context, FullImportExcel entity, int count, int commitCount, bool recreateContext)
{
context.Set<FullImportExcel>().Add(entity);
if (count % commitCount == 0)
{
context.SaveChanges();
if (recreateContext)
{
context.Dispose();
context = new PlannerEntities();
context.Configuration.AutoDetectChangesEnabled = false;
}
}
return context;
}
This works fine, but isn't as quick as it could be, and the import that I'm going to need to do will be a minimum of 2 million lines every month. Are there any better methods out there for bulk imports?
Am I better off avoiding EF altogether and inserting the rows directly through a SqlConnection?
Thanks
I do like how you're only committing records every X number of records (100 in your case.)
I've recently written a system that once a month, needed to update the status of upwards of 50,000 records in one go - this is updating each record and inserting an audit record for each updated record.
Originally I wrote this with the entity framework, and it took 5-6 minutes to do this part of the task. SQL Profiler showed me it was doing 100,000 SQL queries - one UPDATE and one INSERT per record (as expected I guess.)
I changed this to a stored procedure which takes a comma-separated list of record IDs, the status and user ID as parameters, which does a mass-update followed by a mass-insert. This now takes 5 seconds.
In your case, for this number of records, I'd recommend creating a file for BULK INSERT and passing that over to SQL Server to import.
http://msdn.microsoft.com/en-us/library/ms188365.aspx
For a large number of inserts in SQL Server, bulk copy is the fastest way. You can use the SqlBulkCopy class to access bulk copy from code. You have to create an IDataReader for your List, or you can use this IDataReader for inserting generic Lists that I have written.
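A minimal sketch of the SqlBulkCopy route, assuming the rows are copied into a DataTable rather than wrapped in a custom IDataReader. The destination table name `FullImportExcels`, the connection string, and the three-column shape are assumptions for illustration; the real class in the question has about 50 more columns.

```csharp
using System.Collections.Generic;
using System.Data;
using System.Data.SqlClient;

public class FullImportExcel
{
    public string Carrier { get; set; }
    public string FlightNo { get; set; }
    public string CodeFlag { get; set; }
}

public static class BulkImporter
{
    // Copies the in-memory items into a DataTable whose column names
    // match the destination table's column names.
    public static DataTable ToDataTable(IEnumerable<FullImportExcel> items)
    {
        var table = new DataTable();
        table.Columns.Add("Carrier", typeof(string));
        table.Columns.Add("FlightNo", typeof(string));
        table.Columns.Add("CodeFlag", typeof(string));
        foreach (var item in items)
        {
            table.Rows.Add(item.Carrier, item.FlightNo, item.CodeFlag);
        }
        return table;
    }

    // Streams the DataTable to SQL Server in one bulk copy operation.
    public static void BulkInsert(string connectionString, DataTable table)
    {
        using (var bulkCopy = new SqlBulkCopy(connectionString))
        {
            bulkCopy.DestinationTableName = "FullImportExcels"; // assumed table name
            bulkCopy.BatchSize = 5000;      // rows sent to the server per batch
            bulkCopy.BulkCopyTimeout = 0;   // 0 = no time limit
            foreach (DataColumn column in table.Columns)
            {
                // map by name so column order in the table does not matter
                bulkCopy.ColumnMappings.Add(column.ColumnName, column.ColumnName);
            }
            bulkCopy.WriteToServer(table);
        }
    }
}
```

With 2 million rows a single DataTable holds everything in memory at once, so it may be worth filling and writing it in chunks, or using an IDataReader as suggested above.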
Thanks to Andy for the heads up - this was the code used in SQL, with a little help from the ever helpful, Pinal Dave - http://blog.sqlauthority.com/2008/02/06/sql-server-import-csv-file-into-sql-server-using-bulk-insert-load-comma-delimited-file-into-sql-server/ :)
DECLARE @bulkinsert NVARCHAR(2000)
DECLARE @filepath NVARCHAR(100)
SET @filepath = 'C:\Users\Admin\Desktop\FullImport.csv'
SET @bulkinsert =
N'BULK INSERT FullImportExcel2s FROM ''' +
@filepath +
N''' WITH (FIRSTROW = 2, FIELDTERMINATOR = '','', ROWTERMINATOR = ''\n'')'
EXEC sp_executesql @bulkinsert
Still got a bit of work to do to work it into the code, but we're down to 25 seconds for 50000 rows instead of an hour, so a huge improvement!
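As a rough sketch of how the statement above might be run from the C# side: the class and method names here are hypothetical, and the file path has to be readable by the SQL Server service itself, not just by the web application. BULK INSERT expects the file name as a literal, so the path is embedded in the command text with single quotes doubled.

```csharp
using System.Data.SqlClient;

public static class CsvBulkInsert
{
    // Builds the BULK INSERT text; the path is inlined as a string literal.
    public static string BuildStatement(string tableName, string filePath)
    {
        return "BULK INSERT [" + tableName.Replace("]", "]]") + "] FROM '" +
               filePath.Replace("'", "''") +
               "' WITH (FIRSTROW = 2, FIELDTERMINATOR = ',', ROWTERMINATOR = '\\n')";
    }

    // Executes the statement on the given SQL Server connection string.
    public static void Run(string connectionString, string tableName, string filePath)
    {
        using (var connection = new SqlConnection(connectionString))
        using (var command = new SqlCommand(BuildStatement(tableName, filePath), connection))
        {
            command.CommandTimeout = 0; // large files can exceed the 30-second default
            connection.Open();
            command.ExecuteNonQuery();
        }
    }
}
```

The table name and path should come from trusted configuration rather than user input, since they end up in the SQL text.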
Related
I have a list of rows that I want to insert in one batch (add X rows, single call to SaveChanges). Unfortunately, from time to time, some of the items in the list already exist. Since the insertion process is taking place in a transaction (all or nothing), nothing gets added.
Code to show the idea:
using (var context = new CacheDbContext())
{
context.Counter.Add(new Counter
{
Id = "test-3",
CounterType = "test",
Expiry = DateTime.UtcNow.AddHours(1),
Value = 0
});
context.Counter.Add(new Counter
{
Id = "test-2",
CounterType = "test",
Expiry = DateTime.UtcNow.AddHours(1),
Value = 0
});
await context.SaveChangesAsync().ConfigureAwait(false);
}
My goal is to insert the batch and, if one or more items already exist, ignore them.
The naive solution is to check whether each ID exists before inserting it. This works, but it performs poorly. I want to execute multiple inserts with one call.
I know that something like this is possible in SQL:
INSERT INTO table_name(c1)
VALUES(c1)
ON DUPLICATE KEY UPDATE c1 = VALUES(c1) + 1;
If EF could translate my insert into a statement like this, that would be good enough for me.
Is this possible?
Any other solution will be welcome.
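One option that stays inside EF is to fetch the already-present IDs for the whole batch in a single query and drop those items before calling SaveChanges once. The sketch below shows only the filtering step on plain collections; `CounterBatch` and `ExcludeExisting` are hypothetical names, and the `Counter` class is reduced to the properties used in the question.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public class Counter
{
    public string Id { get; set; }
    public string CounterType { get; set; }
    public DateTime Expiry { get; set; }
    public int Value { get; set; }
}

public static class CounterBatch
{
    // Returns the candidates whose Id is neither in existingIds
    // nor repeated earlier in the same batch.
    public static List<Counter> ExcludeExisting(
        IEnumerable<Counter> candidates, IEnumerable<string> existingIds)
    {
        var seen = new HashSet<string>(existingIds);
        // HashSet.Add returns false when the Id is already present
        return candidates.Where(c => seen.Add(c.Id)).ToList();
    }
}
```

With a DbContext, the existing IDs could be loaded with something like `context.Counter.Where(c => ids.Contains(c.Id)).Select(c => c.Id)`, where `ids` is the list of IDs in the batch. Another writer can still insert the same key between the check and the save, so a duplicate-key error remains possible under concurrency.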
Similar to this question, I am iterating over a DataTable and using its data to fill a new DataSet for the purposes of data migration.
The migration inserts into the DataSet, and every 5000 records the added rows are saved to the database using ErikEJ's SqlCeBulkCopy library.
My problem is that for the first 5000 or so records the average time per record is around 150-200 milliseconds, but it gradually increases. By record 11000 it is around 475 milliseconds.
I have a typed DataSet with EnforceConstraints turned off.
The actual database write always takes less than a second, so I am fairly sure the database itself is not the problem. That leaves the code taking longer on each iteration, which could be down to the code itself or to something I have not realised about DataSets.
Could the DataSet be getting slower because it still maintains indexes or keys that EnforceConstraints = false does not turn off?
One other thought is that I check whether a record exists before inserting it; I have tried both the LINQ methods .Any() and FirstOrDefault() != null.
I iterate through a DataTable; for each record I read some values and then pass them to this method:
private int MigrateItems(string reference, string brand, string captureSite, string captureOperator, DateTime captureDate, DateTime addedDate, DateTime updatedDate, bool retain)
{
//prepare the inputs
reference = reference.Trim();
int brandID = -1, databaseUpdateID = -1, captureID = -1, insertedRowID = -1;
//get the foreign keys
brandID = MigrateBrands(brand);
databaseUpdateID = MigrateDatabaseUpdates(reference);
captureID = MigrateCaptures(captureSite, captureOperator, captureDate);
//if the item doesn't exist then add it
bool exists = dataSet.Item.FirstOrDefault(a => string.Equals(a.Reference, reference, StringComparison.CurrentCultureIgnoreCase)) != null;
if (exists == false)
{
var insertedRow = dataSet.Item.AddItemRow(brandID, databaseUpdateID, captureID, reference, retain, updatedDate, addedDate);
insertedRowID = insertedRow.ID;
}
else insertedRowID = dataSet.Item.Single(a => string.Equals(a.Reference, reference, StringComparison.CurrentCultureIgnoreCase)).ID;
return insertedRowID;
}
Once 5000 records have been iterated or all records have been done then I call this method:
private void BulkInsertData()
{
using (var bulkCopier = new SqlCeBulkCopy(connectionString))
{
bulkCopier.DestinationTableName = dataSet.Brand.TableName;
bulkCopier.WriteToServer(dataSet.Brand.Where(a => a.RowState == DataRowState.Added).AsEnumerable());
//(same code for all the tables)
//change all row states to unchanged
dataSet.AcceptChanges();
}
}
I'm using the following:
C#
Visual Studio 2012
Sql Server Ce 4.0
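One possible cause of the growing per-record time is the existence check itself: FirstOrDefault and Single each scan dataSet.Item from the start, and the table keeps growing because AcceptChanges leaves the rows in place. A sketch of an alternative, assuming references are unique, is to keep a dictionary from reference to row ID alongside the DataSet. `ReferenceIndex` and `GetOrAdd` are hypothetical names, not part of the DataSet API.

```csharp
using System;
using System.Collections.Generic;

public class ReferenceIndex
{
    // Same comparison rule as the string.Equals call in MigrateItems.
    private readonly Dictionary<string, int> idsByReference =
        new Dictionary<string, int>(StringComparer.CurrentCultureIgnoreCase);

    // Returns the ID already recorded for the reference; otherwise calls
    // addRow to create the row and remembers the ID it returns.
    public int GetOrAdd(string reference, Func<int> addRow)
    {
        int id;
        if (!idsByReference.TryGetValue(reference, out id))
        {
            id = addRow();
            idsByReference.Add(reference, id);
        }
        return id;
    }
}
```

In MigrateItems this would replace both LINQ lookups with a call along the lines of `index.GetOrAdd(reference, () => dataSet.Item.AddItemRow(brandID, databaseUpdateID, captureID, reference, retain, updatedDate, addedDate).ID)`, so the cost per record no longer depends on how many rows are already in the table.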
I want to insert around 1 million records into a database using LINQ in ASP.NET MVC. The following code doesn't work: it throws an OutOfMemoryException, and it also spent 3 days in the loop. Can anyone please help me with this?
db.Database.ExecuteSqlCommand("DELETE From [HotelServices]");
DataTable tblRepeatService = new DataTable();
tblRepeatService.Columns.Add("HotelCode",typeof(System.String));
tblRepeatService.Columns.Add("Service",typeof(System.String));
tblRepeatService.Columns.Add("Category",typeof(System.String));
foreach (DataRow row in xmltable.Rows)
{
string[] servicesarr = Regex.Split(row["PAmenities"].ToString(), ";");
for (int a = 0; a < servicesarr.Length; a++)
{
tblRepeatService.Rows.Add(row["HotelCode"].ToString(), servicesarr[a], "PA");
}
String[] servicesarrA = Regex.Split(row["RAmenities"].ToString(), ";");
for (int b = 0; b < servicesarrA.Length; b++)
{
tblRepeatService.Rows.Add(row["hotelcode"].ToString(), servicesarrA[b], "RA");
}
}
HotelAmenties _hotelamenties;
foreach (DataRow hadr in tblRepeatService.Rows)
{
_hotelamenties = new HotelAmenties();
_hotelamenties.Id = Guid.NewGuid();
_hotelamenties.ServiceName = hadr["Service"].ToString();
_hotelamenties.HotelCode = hadr["HotelCode"].ToString();
db.HotelAmenties.Add(_hotelamenties);
}
db.SaveChanges();
tblRepeatService table has around 1 million rows.
Bulk inserts like this are highly inefficient through an ORM such as LINQ to SQL or Entity Framework. Every insert creates at least three objects (the DataRow, the HotelAmenties object and the tracking record for it), chewing up memory on objects you don't need.
Given that you already have a DataTable, you can use System.Data.SqlClient.SqlBulkCopy to push the content of the table to a temporary table on the SQL server, then use a single insert statement to load the data into its final destination. This is the fastest way I have found so far to move many thousands of records from memory to SQL.
If performance doesn't matter and this is a one-off job, you can stick with your current approach. Your problem is that you only save at the end, so Entity Framework has to track all 1 million entities and generate the SQL for them at once. Modify your code so that you save every 1000 or so inserts instead of only at the end, and it should work just fine.
int i = 0;
foreach (DataRow hadr in tblRepeatService.Rows)
{
_hotelamenties = new HotelAmenties();
_hotelamenties.Id = Guid.NewGuid();
_hotelamenties.ServiceName = hadr["Service"].ToString();
_hotelamenties.HotelCode = hadr["HotelCode"].ToString();
db.HotelAmenties.Add(_hotelamenties);
i++;
// save after every 1000th row (counting from 1, so the first row is not saved on its own)
if ((i % 1000) == 0)
{
db.SaveChanges();
}
}
db.SaveChanges();
I have a database of 2.5 million records. One column in the database is empty. Now I want to fill that column with a bulk insert, because a simple update query takes a lot of time. The problem is that the bulk insert adds the records at the end instead of filling the existing rows from the start. Here is my code:
using (SqlBulkCopy s = new SqlBulkCopy(dbConnection))
{
for (int j = 0; j < rows.Length; j++)
{
DataRow dailyProductSalesRow = prodSalesData.NewRow();
string[] temp = ((string)rows[j]["Area"]).Split(new string[] { " ", "A","a" }, StringSplitOptions.None);
if (Int32.TryParse(temp[0], out number))
{
int num1 = Int32.Parse(temp[0]);
dailyProductSalesRow["Area1"] = 1;
}
else
{
dailyProductSalesRow["Area1"] = 0;
Console.WriteLine("Invalid "+temp[0]);
}
prodSalesData.Rows.Add(dailyProductSalesRow);
}
s.DestinationTableName = prodSalesData.TableName;
foreach (var column in prodSalesData.Columns)
{
s.ColumnMappings.Add(column.ToString(), column.ToString());
}
try
{
s.WriteToServer(prodSalesData);
}
catch (Exception e)
{
Console.WriteLine(e.ToString());
}
}
If you need to update, use an UPDATE statement.
BULK INSERT is designed to help make inserting many records faster as INSERT can be very slow and difficult to use when trying to get a large amount of data into a database.
UPDATE is already the fastest way of updating records in a database. In your case I'd guess you want a statement a bit like this based on what your BULK INSERT is apparently trying to do.
UPDATE SalesData
SET Area1 = CONVERT(INT, SUBSTRING(Area, 0, CHARINDEX('A', Area)))
If you want to try and make it faster then try looking at Optimizing Bulk Import Performance. Some of the advice there (e.g. the recovery model used by the database) may also be applicable to UPDATE performance when updating many records.
I'm currently building an application that needs a feature to import a user-supplied CSV file as data into a database. Each "cell" in the CSV will be stored in its own row.
Initially I was using parameterized queries to insert each row one-by-one, but the speed of the operation (520,000 inserts in one example file!) meant I'm having to reconsider that. I'm now parsing the CSV file into an IEnumerable<Answer> and handing it over to the following code to be inserted into the database in batches:
public void AddAnswers(IEnumerable<Answer> answers)
{
const int batchSize = 1000;
var values = new StringBuilder();
var i = 0;
foreach (var answer in answers)
{
if (i++ > 0)
{
values.Append(",");
}
values.AppendFormat("({0},{1},'{2}')", answer.AnswerSetId, answer.QuestionId, answer.Value.Replace("'", "''"));
if (i == batchSize)
{
// We've reached the batch size limit - send what we have so far
SendAnswerBatch(values.ToString());
values.Clear();
i = 0;
}
}
if (i > 0)
{
// Ensure any leftovers that didn't reach the maximum batch size are sent over
SendAnswerBatch(values.ToString());
}
}
private void SendAnswerBatch(string values)
{
var query = String.Format("INSERT INTO Answers (AnswerSetId,QuestionId,Value) VALUES {0}", values);
Context.Database.ExecuteSqlCommand(query);
}
This changed a large set of data from taking over 5 minutes to less than 5 seconds to insert. However, I realise that simply replacing ' with '' is not safe.
Obviously the safest way to insert a single row would be to use a parameterized query but is there a way to make such a thing work with a batch insert like this?
If at all possible, I also need it to be non-database specific - I had already considered SqlBulkCopy but the application needs to support multiple database engines.
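One way to keep the batching while avoiding string escaping is to generate a multi-row INSERT whose values are all parameters on an ADO.NET DbCommand. The sketch below assumes AnswerSetId and QuestionId are integers and uses the `@name` parameter prefix, which not every provider accepts (Oracle's, for instance, uses `:name`). `AnswerBatchInsert` and `Prepare` are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Data.Common;
using System.Text;

public class Answer
{
    public int AnswerSetId { get; set; }
    public int QuestionId { get; set; }
    public string Value { get; set; }
}

public static class AnswerBatchInsert
{
    // Sets command.CommandText to one multi-row INSERT and adds
    // three parameters (@aN, @qN, @vN) for each row N of the batch.
    public static void Prepare(DbCommand command, IList<Answer> batch)
    {
        var sql = new StringBuilder(
            "INSERT INTO Answers (AnswerSetId,QuestionId,Value) VALUES ");
        for (var row = 0; row < batch.Count; row++)
        {
            if (row > 0)
            {
                sql.Append(",");
            }
            sql.AppendFormat("(@a{0},@q{0},@v{0})", row);
            AddParameter(command, "@a" + row, batch[row].AnswerSetId);
            AddParameter(command, "@q" + row, batch[row].QuestionId);
            AddParameter(command, "@v" + row, batch[row].Value);
        }
        command.CommandText = sql.ToString();
    }

    private static void AddParameter(DbCommand command, string name, object value)
    {
        var parameter = command.CreateParameter();
        parameter.ParameterName = name;
        parameter.Value = value ?? DBNull.Value; // null becomes a database NULL
        command.Parameters.Add(parameter);
    }
}
```

The command would come from the provider's own `DbConnection.CreateCommand()`, followed by `ExecuteNonQuery()` once per batch. Engines cap the number of parameters per command: SQL Server, for example, allows at most 2100, so with three parameters per row a batch of 1000 rows would be too large there and something like 500 rows is safer.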
I would suggest using SqlBulkCopy when inserting a lot of values; it has proved really useful to me.
Place your items into a DataTable and let SqlBulkCopy do the rest.
http://msdn.microsoft.com/en-us/library/system.data.sqlclient.sqlbulkcopy.aspx