In timing our code, I noticed that there is a substantial transaction time each time LINQ communicates with SQL Server. In the past, when we used SQL directly, we could batch multiple statements and send them at once. Is there a way to do this in LINQ? In particular, I have two tables, a Log table and a UHAs (userhostaddress) table. If the uha is not already in the UHAs table, it must be inserted, and then the Log entry made with the uhaid. In LINQ, this takes three calls: one to verify that the uha does not exist, one to insert it, and one for the log. Can I do this in one call to the database?
var uha = db.UHAs.Where(u => u.userhostaddress == _userHostAddress).FirstOrDefault(); // 1. First call
if (uha == null)
{
var newUha = new UHA()
{
userhostaddress = _userHostAddress
};
db.UHAs.InsertOnSubmit(newUha);
db.SubmitChanges(); // 2. Second call
uha = newUha;
}
SHA1 sha1 = SHA1.Create();
var newLog = new Log()
{
requested = DateTime.UtcNow,
uhaid = uha.uhaid,
query = _query,
queryhash = sha1.ComputeHash(Encoding.UTF8.GetBytes(_query))
};
db.Log.InsertOnSubmit(newLog);
db.SubmitChanges(); // 3. Third call
Since you are using the same instance of the data context, you only need to call SubmitChanges() once. That will send all of the pending commands together rather than making multiple trips to SQL Server.
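If the UHA is new, its uhaid does not exist until SubmitChanges runs, so the trick is to set the association rather than copying the key value. A minimal sketch, assuming the generated Log entity exposes a UHA association property (all other names are taken from the question):
var uha = db.UHAs.FirstOrDefault(u => u.userhostaddress == _userHostAddress);
if (uha == null)
{
    uha = new UHA { userhostaddress = _userHostAddress };
    db.UHAs.InsertOnSubmit(uha);       // queued, not yet sent
}

SHA1 sha1 = SHA1.Create();
var newLog = new Log
{
    requested = DateTime.UtcNow,
    UHA = uha,                         // assumed association property; LINQ to SQL fills in uhaid when the insert runs
    query = _query,
    queryhash = sha1.ComputeHash(Encoding.UTF8.GetBytes(_query))
};
db.Log.InsertOnSubmit(newLog);

db.SubmitChanges();                    // both inserts go out in this single call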
In the project, I need to call an external API based on time. So, for one day, I may need to call the API 24 times, one call per one-hour period. The API result is an XML file with 6 fields. I will need to insert these data into a table. On average, each hour has about 20,000 rows of data.
The table has these 6 columns:
col1, col2, col3, col4, col5, col6
When all 6 columns are the same, we consider the rows to be the same, and we should not insert duplicates.
I'm using C# and Entity Framework for this:
foreach (XmlNode node in nodes)
{
try
{
count++;
CallData data = new CallData();
...
// get all data and set in 'data'
// check whether in database already
var q = ctx.CallDatas.Where(x => x.col1 == data.col1
&& x.col2 == data.col2
&& x.col3 == data.col3
&& x.col4 == data.col4
&& x.col5 == data.col5
&& x.col6 == data.col6
).Any();
if (q)
{
// exists in database, skip
// log info
}
else
{
string key = $"{data.col1}|{data.col2}|{data.col3}|{data.col4}|{data.col5}|{data.col6}";
// check whether in current chunk already
if (dic.ContainsKey(key))
{
// in current chunk, skip
// log info
}
else
{
// insert
ctx.CallDatas.Add(data);
// update dic
dic.Add(key, true);
}
}
}
catch (Exception ex)
{
// log error
}
}
Logger.InfoFormat("Saving changes ...");
if (ctx.ChangeTracker.HasChanges())
{
await ctx.SaveChangesAsync();
}
Logger.InfoFormat("Saving changes ... Done.");
The code works fine. However, we will need to run this code for the past several months of data. The issue is that the code runs slowly, since for each row it needs to check whether that row already exists.
Are there any suggestions to improve the performance?
Thanks
You don't show the code for when the context is created or its life-cycle. I'm inclined to point you to your indexes on the table. If these columns aren't part of a primary key or index, then you might see the performance issue there: if you are doing full table scans, it will get progressively slower. With that said, there are two separate ways to handle this:
The EF-native way: you can explicitly create a new context for each iteration (avoiding change tracking for all entries, which reduces the progressive slowdown). Also, your save is async but your Any() check is synchronous; using async for that as well might help take some pressure off the current thread while it waits.
// Scope the context closer to the data call: if the loop is long-running you could
// otherwise be building up tracked changes in the cache, and this prevents that
// situation.
using (YourEntity ctx = new YourEntity())
{
CallData data = new CallData();
if (await ctx.CallDatas.Where(x => x.col1 == data.col1
&& x.col2 == data.col2
&& x.col3 == data.col3
&& x.col4 == data.col4
&& x.col5 == data.col5
&& x.col6 == data.col6
).AnyAsync()
)
{
// exists in database, skip
// log info
}
else
{
string key = $"{data.col1}|{data.col2}|{data.col3}|{data.col4}|{data.col5}|{data.col6}";
// check whether in current chunk already
if (dic.ContainsKey(key))
{
// in current chunk, skip
// log info
}
else
{
// insert
ctx.CallDatas.Add(data);
await ctx.SaveChangesAsync();
// update dic
dic.Add(key, true);
}
}
}
Optional way: look into inserting the data using a bulk operation via a stored procedure. 20k rows is trivial, and you can still use Entity Framework for that as well. See https://stackoverflow.com/a/9837927/1558178
I have created my own version of this (customized for my specific needs) and have found that it works well and gives more control over bulk inserts.
I have used this approach to insert 100k records at a time. My logic for checking for duplicates lives in the stored procedure, which gives me better control as well as reducing the over-the-wire traffic to 0 reads and 1 write. This should take just a second or two to execute, assuming your stored procedure is optimized.
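A rough sketch of that idea, assuming a user-defined table type dbo.CallDataTableType and a stored procedure dbo.InsertCallData exist on the server, and (for illustration only) that all six columns are strings; the procedure does the duplicate check and the insert in one round trip:
// Requires System.Data and System.Data.SqlClient.
// Build a DataTable that matches the user-defined table type.
var tvp = new DataTable();
for (int i = 1; i <= 6; i++)
    tvp.Columns.Add("col" + i, typeof(string));

foreach (var data in parsedRows)       // parsedRows: the CallData objects parsed from the XML
    tvp.Rows.Add(data.col1, data.col2, data.col3, data.col4, data.col5, data.col6);

var rowsParam = new SqlParameter("@rows", SqlDbType.Structured)
{
    TypeName = "dbo.CallDataTableType",
    Value = tvp
};

// One write over the wire; the stored procedure skips rows that already exist.
await ctx.Database.ExecuteSqlCommandAsync("EXEC dbo.InsertCallData @rows", rowsParam);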
A different approach:
Save all rows, duplicates included - that should be very efficient.
When you read data from the table, use DISTINCT over all six fields.
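For completeness, a small sketch of what the read side could look like with that approach (entity and column names as in the question):
// Collapse duplicates at query time instead of at insert time.
var distinctRows = ctx.CallDatas
    .Select(x => new { x.col1, x.col2, x.col3, x.col4, x.col5, x.col6 })
    .Distinct()
    .ToList();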
For raw, bulk operations like this I would consider avoiding EF entities and context tracking and merely executing SQL through the context:
var sql = $"IF NOT EXISTS(SELECT 1 FROM CallDates WHERE Col1={data.Col1} AND Col2={data.Col2} AND Col3={data.Col3} AND Col4={data.Col4} AND Col5={data.Col5} AND Col6={data.Col6}) INSERT INTO CallDates(Col1,Col2,Col3,Col4,Col5,Col6) VALUES ({data.Col1},{data.Col2},{data.Col3},{data.Col4},{data.Col5},{data.Col6})";
context.Database.ExeculeSqlCommand(sql);
This does away with the extra checks and logging; it is effectively just raw SQL with duplicate detection.
Below is the code I am using to add records to the database. I know I am calling SaveChanges() every time, which is expensive, but if I call SaveChanges() once after adding everything I might get a duplicate key exception. So I am looking for ideas to improve performance while keeping duplicate records in mind.
using (var db = new dbEntities())
{
for (int i = 0; i < csvCustomers.Count; i++)
{
var csvCustomer = csvCustomers[i];
dbcustomer customer = new dbcustomer() { ADDRESS = csvCustomer.ADDRESS, FIRSTNAME = csvCustomer.FIRSTNAME, LASTNAME = csvCustomer.LASTNAME, PHONE = csvCustomer.PHONE, ZIPCODE = csvCustomer.ZIP };
try
{
dbzipcode z = db.dbzipcodes.FirstOrDefault(x => x.ZIP == customer.ZIPCODE);
//TODO: Handle if Zip Code not Found in DB
if (z == null)
{
db.dbcustomers.Add(customer);
throw new DbEntityValidationException("Zip code not found in database.");
}
customer.dbzipcode = z;
z.dbcustomers.Add(customer);
db.SaveChanges();
}
catch (DbEntityValidationException ex)
{
// log the validation error (e.g. zip code not found) and continue with the next record
}
}
}
One solution I have in mind is to add data in batches and then call db.SaveChanges(), and in case of an exception recursively reduce the batch size for those records.
Using EF to insert huge numbers of records is going to incur a significant cost compared to more direct approaches, but there are a few considerations that can markedly improve performance.
Firstly, batching the requests with one SaveChanges per batch is preferable to saving individual records or attempting to commit all of the changes at once. You will need to deal with exceptions if/when a batch fails (possibly re-committing that batch one record at a time to isolate the duplicate rows).
Next, you can pre-cache your zip codes rather than looking them up on each iteration. Don't load the entire entity; just cache the zip code and the ID in an in-memory list:
(If the zip code entity amounts to little more than this, then just load the entity.)
var zipCodes = db.dbzipcodes.Select(x => new {x.ZIPCODEID, x.ZIP}).ToList();
This will require a bit of extra attention when it comes to associating a zip code with a customer within the batched calls, since the zip code will initially not be known to the DbContext but may become known once a second customer with the same zip code is added.
To associate a zip code without loading it in a DbContext:
var customerZipCode = zipCodes.SingleOrDefault(x => x.ZIP == customer.ZIPCODE);
// + exists check...
var zipCode = new dbzipcode { ZIPCODEID = customerZipCode.ZIPCODEID };
db.dbzipcodes.Attach(zipCode);
customer.dbzipcode = zipCode;
// ...
If you did load the entire zip code entity into the cached list, then the var zipCode = new dbzipcode ... line is not needed; just attach the cached entity.
However, if that zip code has already been associated to the DbContext earlier in the batch, you will get an error (regardless of whether you cached the entity or just the ID/code), so you first need to check the DbContext's in-memory zip codes:
var customerZipCode = zipCodes.SingleOrDefault(x => x.ZIP == customer.ZIPCODE);
// + exists check...
var zipCode = db.dbzipcodes.Local.SingleOrDefault(x => x.ZIPCODEID == customerZipCode.ZIPCODEID)
?? new dbzipcode { ZIPCODEID = customerZipCode.ZIPCODEID };
db.dbzipcodes.Attach(zipCode);
customer.dbzipcode = zipCode;
// ...
Lastly, EF tracks a lot of extra information in memory in the context, so the other consideration alongside batching is to avoid using the same DbContext across all batches and instead open a new DbContext for each batch. As you add items and call SaveChanges on a DbContext, it keeps tracking every entity that has been added. With batches of 1000 or so, the context tracks just those 1000 rather than 1000, then 2000, then 3000, and so on up to 5 million rows.
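Putting those pieces together, a rough sketch (the dbEntities context and entity names come from the question; the batch size, the LoadCsvCustomers helper, and the duplicate handling are assumptions):
// Cache just the zip code and its key once, outside the batches.
var csvCustomers = LoadCsvCustomers();                  // assumed helper returning the parsed CSV rows
Dictionary<string, int> zipCodes;
using (var db = new dbEntities())
{
    zipCodes = db.dbzipcodes.ToDictionary(x => x.ZIP, x => x.ZIPCODEID);
}

const int batchSize = 1000;
for (int start = 0; start < csvCustomers.Count; start += batchSize)
{
    // A fresh context per batch keeps the change tracker small.
    using (var db = new dbEntities())
    {
        foreach (var csvCustomer in csvCustomers.Skip(start).Take(batchSize))
        {
            int zipCodeId;
            if (!zipCodes.TryGetValue(csvCustomer.ZIP, out zipCodeId))
                continue;                               // zip code not found: log and skip

            // Reuse the stub if this batch has already attached the zip code.
            var zipCode = db.dbzipcodes.Local.SingleOrDefault(x => x.ZIPCODEID == zipCodeId)
                          ?? db.dbzipcodes.Attach(new dbzipcode { ZIPCODEID = zipCodeId });

            db.dbcustomers.Add(new dbcustomer
            {
                ADDRESS = csvCustomer.ADDRESS,
                FIRSTNAME = csvCustomer.FIRSTNAME,
                LASTNAME = csvCustomer.LASTNAME,
                PHONE = csvCustomer.PHONE,
                ZIPCODE = csvCustomer.ZIP,
                dbzipcode = zipCode
            });
        }

        db.SaveChanges();   // one commit per batch; on a duplicate key error, retry this batch one record at a time
    }
}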
Azure storage tables all have a Timestamp column. Based on the documentation here, the listed way to delete from a storage table is to select an entity and then delete it.
Does anyone know how to delete entities from a storage table based on a DateTime comparison against the Timestamp value, in code?
EDIT:
Based on the advice given I wrote the following code. However, it throws a Bad Request exception on my table.ExecuteQuery(rangeQuery) call. Any advice?
StorageCredentials creds = new StorageCredentials(logAccountName, logAccountKey);
CloudStorageAccount account = new CloudStorageAccount(creds, useHttps: true);
CloudTableClient client = account.CreateCloudTableClient();
CloudTable table = client.GetTableReference(LogTable);
TableQuery<CloudQuerySummary> rangeQuery = new TableQuery<CloudQuerySummary>()
.Where(TableQuery.GenerateFilterCondition("Timestamp", QueryComparisons.LessThan
, DateTime.Now.AddHours(- DateTime.Now.Hour).ToString()));
TableOperation deleteOperation;
// Loop through the results, displaying information about the entity.
foreach (CloudQuerySummary entity in table.ExecuteQuery(rangeQuery))
{
deleteOperation = TableOperation.Delete(entity);
table.Execute(deleteOperation);
}
EDIT 2
Here is the final working code for anyone who chooses to copy/reference it.
public void DeleteLogsNotFromToday()
{
StorageCredentials creds = new StorageCredentials(logAccountName, logAccountKey);
CloudStorageAccount account = new CloudStorageAccount(creds, useHttps: true);
CloudTableClient client = account.CreateCloudTableClient();
CloudTable table = client.GetTableReference(LogTable);
TableQuery<CloudQuerySummary> rangeQuery = new TableQuery<CloudQuerySummary>()
.Where(TableQuery.GenerateFilterConditionForDate("Timestamp", QueryComparisons.LessThan
, DateTime.Now.AddHours(-DateTime.Now.Hour)));
try
{
TableOperation deleteOperation;
// Loop through the results, displaying information about the entity.
foreach (CloudQuerySummary entity in table.ExecuteQuery(rangeQuery))
{
deleteOperation = TableOperation.Delete(entity);
table.Execute(deleteOperation);
}
}
catch (Exception ex)
{
throw;
}
}
You will have to do a partition scan to do that, as entities are only indexed on their PartitionKey and RowKey.
In the tutorial link you posted, look at the section Retrieve a range of entities in a partition. Once you get the entities you want to delete, you will then execute a table operation to delete them.
If you don't want to delete them one by one, you can create a batch delete operation (provided all entities to delete have the same partition key). The link above also shows how to construct a batch operation.
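A minimal sketch of the batch variant, reusing the table and CloudQuerySummary entity from the question; note that a single batch can hold at most 100 operations and must target one partition key:
// Group the matching entities by partition key, then delete in batches of up to 100.
foreach (var partition in table.ExecuteQuery(rangeQuery).GroupBy(e => e.PartitionKey))
{
    var batch = new TableBatchOperation();
    foreach (CloudQuerySummary entity in partition)
    {
        batch.Delete(entity);
        if (batch.Count == 100)
        {
            table.ExecuteBatch(batch);
            batch = new TableBatchOperation();
        }
    }
    if (batch.Count > 0)
        table.ExecuteBatch(batch);
}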
Alternatively, if you do not want to do a table scan, you can store a date reference yourself (for instance, the date in milliseconds as the RowKey) and then use that to filter the entities you need to delete based on a date-time comparison (something similar to THIS).
UPDATE: I think the problem is in this line:
DateTime.Now.AddHours(- DateTime.Now.Hour).ToString()
As from the documentation:
The Timestamp property is a DateTime value that is maintained on the
server side to record the time an entity was last modified
You are trying to compare a DateTime property with a String. I'm no C# expert, but that does not look like a valid comparison to me.
If you use Slazure, this kind of job becomes easier; the following code should also work with the Light (free) edition.
using SysSurge.Slazure;
using SysSurge.Slazure.Linq;
using SysSurge.Slazure.Linq.QueryParser;
namespace TableOperations
{
public class LogOperations
{
public static void DeleteOldLogEntities()
{
// Get a reference to the table storage, example just uses the development storage
dynamic storage = new QueryableStorage<DynEntity>("UseDevelopmentStorage=true");
// Get a reference to the table named "LogTable"
QueryableTable<DynEntity> logTable = storage.LogTable;
var query = logTable.Where("Timestamp < #0", DateTime.UtcNow.AddDays(-1));
// Delete all returned log entities
foreach (var entity in query)
logTable.Delete(entity.PartitionKey, entity.RowKey);
}
}
}
Full disclosure: I coded Slazure.
Currently, I am struggling with an issue regarding Entity Framework (LINQ to Entities). Most of the time when I execute entity.SaveChanges() everything works fine, but at some points entity.SaveChanges() takes too long and times out. I searched a lot but was unable to find the answer.
(According to company policy, I cannot copy code elsewhere, so I do not have the exact code, but I will try to lay out the basic structure. I hope it helps you figure out the problem; if it doesn't, let me know.)
Task:
My task is to scan the whole network for some specific files, match the content of each file against the content of the database, and, based on the match, either insert into or update the database with the content of the file. There are around 3000 files on the network.
Problem:
public void PerformAction()
{
DbTransaction tran = null;
entity.Connection.Open(); //entity is a global variable declared like myDatabaseEntity entity = new myDatabaseEntity();
tran = entity.Connection.BeginTransaction();
foreach(string path in listOfPaths)
{
//returns 1 - Multiple matching in database OR
// 2 - One matching file in database OR
// 3 - No Matching found.
int returnValue = SearchDatabase();
if(returnValue == 1)
DoSomething(); //All inserts/updates work perfectly. Save changes also works correctly.
else if(returnValue == 2)
DoSomething(); //Again, everything ok. SaveChanges works perfectly here.
else
{
//This function uses some XML file to generate all the queries dynamically
//For example: INSERT INTO TABLEA VALUES (1,2,3);
GenerateInsertQueriesFromXML();
ExecuteQueries();
SaveChanges(); // <---- Problem here. Sometimes takes too much time.
}
//Transaction commit/rollback code here
}
}
public bool ExecuteQueries()
{
int result = 0;
foreach(string query in listOfInsertQueries)
{
result = entity.ExecuteStoreCommand(query); //Execute the insert queries
if(result <=0)
return false;
}
TestEntityA a = new TestEntityA();
a.PropertyA = 123;
a.PropertyB = 345;
//I have around 25 properties here
entity.AddToTestEntityA(a);
return true;
}
Found the issue.
The main table I was inserting all the data into had a trigger on INSERT and DELETE.
So, whenever I inserted new data into the main table, the trigger fired in the backend and that was taking all the time.
Entity Framework is FAST and INNOCENT :D
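For anyone hitting the same symptom, a quick way to check for such triggers is to query sys.triggers. A small sketch using the same ObjectContext-style entity from the question (the table name dbo.TableA is a placeholder):
// List the triggers defined on the table that SaveChanges inserts into.
var triggers = entity.ExecuteStoreQuery<string>(
    "SELECT name FROM sys.triggers WHERE parent_id = OBJECT_ID('dbo.TableA')").ToList();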
Is there a "best practice" way of handling bulk inserts (via LINQ) but discard records that may already be in the table? Or I am going to have to either do a bulk insert into an import table then delete duplicates, or insert one record at a time?
08/26/2010 - EDIT #1:
I am looking at the Intersect and Except methods right now. I am gathering up data from separate sources, converting it into a List, and I want to "compare" it to the target DB and then INSERT just the NEW records.
List<DTO.GatherACH> allACHes = new List<DTO.GatherACH>();
State.IState myState = null;
State.Factory factory = State.Factory.Instance;
foreach (DTO.Rule rule in Helpers.Config.Rules)
{
myState = factory.CreateState(rule.StateName);
List<DTO.GatherACH> stateACHes = myState.GatherACH();
allACHes.AddRange(stateACHes);
}
List<Model.ACH> newRecords = new List<Model.ACH>(); // Create a disconnected "record set"...
foreach (DTO.GatherACH record in allACHes)
{
var storeInfo = dbZach.StoreInfoes.Where(a => a.StoreCode == record.StoreCode && (a.TypeID == 2 || a.TypeID == 4)).FirstOrDefault();
Model.ACH insertACH = new Model.ACH
{
StoreInfoID = storeInfo.ID,
SourceDatabaseID = (byte)sourceDB.ID,
LoanID = (long)record.LoanID,
PaymentID = (long)record.PaymentID,
LastName = record.LastName,
FirstName = record.FirstName,
MICR = record.MICR,
Amount = (decimal)record.Amount,
CheckDate = record.CheckDate
};
newRecords.Add(insertACH);
}
The above code builds the newRecords list. Now, I am trying to get the records from this List that are not in the DB by comparing on the 3-field unique index:
AchExceptComparer myComparer = new AchExceptComparer();
var validRecords = dbZach.ACHes.Intersect(newRecords, myComparer).ToList();
The comparer looks like:
class AchExceptComparer : IEqualityComparer<Model.ACH>
{
public bool Equals(Model.ACH x, Model.ACH y)
{
return (x.LoanID == y.LoanID && x.PaymentID == y.PaymentID && x.SourceDatabaseID == y.SourceDatabaseID);
}
public int GetHashCode(Model.ACH obj)
{
return obj.LoanID.GetHashCode() ^ obj.PaymentID.GetHashCode() ^ obj.SourceDatabaseID.GetHashCode();
}
}
However, I am getting this error:
LINQ to Entities does not recognize the method 'System.Linq.IQueryable`1[MisterMoney.LARS.ZACH.Model.ACH] Intersect[ACH](System.Linq.IQueryable`1[MisterMoney.LARS.ZACH.Model.ACH], System.Collections.Generic.IEnumerable`1[MisterMoney.LARS.ZACH.Model.ACH], System.Collections.Generic.IEqualityComparer`1[MisterMoney.LARS.ZACH.Model.ACH])' method, and this method cannot be translated into a store expression.
Any ideas? And yes, this is completely in line with the original question. :)
You can't do bulk inserts with LINQ to SQL (I presume you were referring to LINQ to SQL when you said "LINQ"). However, based on what you're describing, I'd recommend checking out the new MERGE operator of SQL Server 2008.
Inserting, Updating, and Deleting Data by Using MERGE
Another example here.
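A rough sketch of how MERGE could fit here, assuming the gathered records are first bulk-copied into a staging table (the ACH_Staging table, the connection string, and the newRecordsTable DataTable are assumptions; the key and column names come from the question):
// 1. Push the gathered records into the staging table in one shot (System.Data.SqlClient).
using (var bulk = new SqlBulkCopy(connectionString))
{
    bulk.DestinationTableName = "ACH_Staging";
    bulk.WriteToServer(newRecordsTable);   // a DataTable built from newRecords
}

// 2. Let MERGE insert only rows whose 3-field key is not already present.
const string mergeSql = @"
MERGE INTO ACH AS target
USING ACH_Staging AS source
   ON target.LoanID = source.LoanID
  AND target.PaymentID = source.PaymentID
  AND target.SourceDatabaseID = source.SourceDatabaseID
WHEN NOT MATCHED BY TARGET THEN
    INSERT (StoreInfoID, SourceDatabaseID, LoanID, PaymentID,
            LastName, FirstName, MICR, Amount, CheckDate)
    VALUES (source.StoreInfoID, source.SourceDatabaseID, source.LoanID, source.PaymentID,
            source.LastName, source.FirstName, source.MICR, source.Amount, source.CheckDate);";

dbZach.ExecuteStoreCommand(mergeSql);      // ObjectContext; use ExecuteCommand instead on a LINQ to SQL DataContext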
I recommend you just write the SQL yourself to do the inserting; I find it is a lot faster, and you can get it to work exactly how you want. When I did something similar (just a one-off program), I used a Dictionary to hold the IDs I had already inserted, to avoid duplicates.
I find LINQ to SQL is good for a single record or a small set that spends its entire lifespan in LINQ to SQL.
Or you can try SQL Server 2008's BULK INSERT.
One thing to watch out for: if you queue more than 2000 or so records without calling SubmitChanges(), you can run into trouble. T-SQL has a limit on the number of statements per execution, so you cannot simply queue up every record and then call SubmitChanges() once, as this will throw a SqlException; you need to periodically flush the queue to avoid this.
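A small sketch of that periodic flush (the 1000-record threshold, the recordsToInsert source, and the db DataContext are assumptions):
int queued = 0;
foreach (var record in recordsToInsert)
{
    db.ACHes.InsertOnSubmit(record);
    if (++queued % 1000 == 0)
        db.SubmitChanges();            // flush periodically to stay under the limit
}
db.SubmitChanges();                    // flush whatever is left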