I've been working on improving NHibernate insert performance for a couple of days.
I'd read in many posts (such as this one) that a stateless session can insert around 1,000~2,000 records per second. However, the best I've managed is 1243 records in a little over 9 seconds:
var sessionFactory = new NHibernateConfiguration().CreateSessionFactory();
using (IStatelessSession statelessSession = sessionFactory.OpenStatelessSession())
{
statelessSession.SetBatchSize(adjustmentValues.Count);
foreach (var adj in adjustmentValues)
statelessSession.Insert(adj);
}
The class:
public partial class AdjustmentValue : PersistentObject, IFinancialValue
{
public virtual double Amount { get; set; }
public virtual bool HasManualValue { get; set; }
public virtual bool HasScaleValue { get; set; }
public virtual string Formula { get; set; }
public virtual DateTime IssueDate { get; set; }
public virtual CompanyTopic CompanyTopic { get; set; }
}
Map of the class:
public class AdjustmentValueMap : ClassMap<AdjustmentValue>
{
public AdjustmentValueMap()
{
Id(p => p.Id);
Map(p => p.Amount);
Map(p => p.IssueDate);
Map(p => p.HasManualValue);
Map(p => p.HasScaleValue);
Map(p => p.Formula);
References(p => p.CompanyTopic)
.Fetch.Join();
}
}
Am I missing something?
Any idea how to speed up the inserts?
The generated queries (as captured in NHProf) are individual INSERT statements, each followed by a select SCOPE_IDENTITY() call.
From the looks of your NHProf results you are using identity as your POID, so you cannot take advantage of batched writes; every insert/update/delete is sent as a separate command. That is why it's taking so long.
If you change your POID to hilo, guid, or guid.comb and set the batch size to 500 or 1000, you will see a drastic improvement in write times.
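For illustration, a hedged sketch of what that change could look like with Fluent NHibernate; the hilo settings, connection string variable, and batch size below are assumptions rather than code from the question (guid.comb would be the equivalent choice if the Id property is a Guid):
// Mapping: use hilo (works with integer IDs) instead of identity so NHibernate
// can assign IDs client-side and batch the INSERT statements.
public class AdjustmentValueMap : ClassMap<AdjustmentValue>
{
    public AdjustmentValueMap()
    {
        Id(p => p.Id).GeneratedBy.HiLo("1000"); // "1000" = max_lo; tune to taste
        Map(p => p.Amount);
        Map(p => p.IssueDate);
        Map(p => p.HasManualValue);
        Map(p => p.HasScaleValue);
        Map(p => p.Formula);
        References(p => p.CompanyTopic).Fetch.Join();
    }
}
// Configuration: turn on ADO.NET batching when building the session factory.
var sessionFactory = Fluently.Configure()
    .Database(MsSqlConfiguration.MsSql2008
        .ConnectionString(connectionString)   // assumed to exist in your setup
        .AdoNetBatchSize(1000))
    .Mappings(m => m.FluentMappings.AddFromAssemblyOf<AdjustmentValueMap>())
    .BuildSessionFactory();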
I'm assuming you are using SQL Server 2008.
A few things come to mind. Are you using the identity key (the select SCOPE_IDENTITY() in your sample output) as the primary key for your entities? If so, then I believe NHibernate has to execute the SCOPE_IDENTITY() call for each object before the object is actually saved into the database. So if you are inserting 1000 objects, NHibernate will generate 1000 INSERT statements and 1000 select SCOPE_IDENTITY() statements.
I'm not 100% sure, but it might also break the batching. Since you are using NHProf, what does it say? Does it show that all the statements are batched together, or can you select individual INSERT statements in the NHProf UI? If your inserts are not batched then you will most likely see the "Large number of individual writes" alert in NHProf.
Edit:
If you cannot change your identity generation then you could use SqlBulkCopy. I have used it with NHibernate in a data migration and it works. Ayende Rahien has a sample on his blog that will get you started.
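A minimal sketch of the SqlBulkCopy route, assuming the destination table and columns follow the mapping above; the table name, FK column name, and connection string are guesses, not taken from the question:
using System;
using System.Data;
using System.Data.SqlClient;

// Build a DataTable matching the target table's columns, then bulk-insert it.
var table = new DataTable();
table.Columns.Add("Amount", typeof(double));
table.Columns.Add("IssueDate", typeof(DateTime));
table.Columns.Add("HasManualValue", typeof(bool));
table.Columns.Add("HasScaleValue", typeof(bool));
table.Columns.Add("Formula", typeof(string));
table.Columns.Add("CompanyTopic_id", typeof(int));   // assumed FK column name

foreach (var adj in adjustmentValues)
    table.Rows.Add(adj.Amount, adj.IssueDate, adj.HasManualValue,
                   adj.HasScaleValue, adj.Formula, adj.CompanyTopic.Id);

using (var connection = new SqlConnection(connectionString))
{
    connection.Open();
    using (var bulkCopy = new SqlBulkCopy(connection))
    {
        bulkCopy.DestinationTableName = "AdjustmentValue";  // assumed table name
        bulkCopy.BatchSize = 1000;
        bulkCopy.WriteToServer(table);
    }
}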
Related
I have a huge transactions table in an Azure database, into which we import files with 1+ million objects.
public class Transaction
{
[Key]
public int Id { get; set; }
public int TransactionId { get; set; }
public DateTime Date { get; set; }
public decimal Price { get; set; }
public int UserId { get; set; }
public string Product { get; set; }
public int ClientId { get; set; }
public int Uploaded { get; set; }
public string UniqueId { get; set; }
public string Custom1 { get; set; }
public string Custom2 { get; set; }
public string Custom3 { get; set; }
}
After importing all the new data, I take all the new transaction IDs and all the transaction IDs for that client from the database.
// ids from import
string transactionsString = string.Join(",", transactionIdsCsv);
var result = await _transactionsDataRepository.GetByTransactionIdsAndClientId(transactionIdsCsv.ToArray(), clientId);
// ids from repository
string transactionsDBString = string.Join(",", result.ToList());
// remove rows in db where duplicate transactions ids and clientId=ClientId
but I am struggling to find the most effective way to remove the duplicates. I wanted to do something like
delete from transactions where transactionId IN (transactionsDBString) and clientId = ClientID
but that would delete both rows, and I only want the new value to be deleted (and the old value to stay).
Would that even be a good approach? Even fetching var result = await _transactionsDataRepository... can take a lot of time, since there are millions of rows.
I only want new value to be deleted (and old value to stay)
Since you already know how to identify the transaction IDs you want to delete, you could delete the necessary rows while keeping the oldest, like so (you didn't mention it, but I'm assuming you're using Entity Framework, given your use of the [Key] attribute; correct me if I'm wrong):
var transToRemove = dbContext.Transactions
.Where(t => t.ClientId == clientId && transIds.Contains(t.TransactionId))
.GroupBy(t => t.TransactionId, t => t) // Group transactions with the same TransactionId
.SelectMany(
group => group.OrderBy(t => t.Date) // Order the oldest first
.Skip(1) // Skip the oldest (we want to keep it)
);
dbContext.Transactions.RemoveRange(transToRemove);
dbContext.SaveChanges();
Edit: Included an example that should work for Dapper...
var cn = // Create your DbConnection
// This query should select all transactions you want to delete excluding
// those with the oldest Date. This is just like 'transToRemove' above
var selectQuery = @"
SELECT t1.Id FROM Transactions t1
INNER JOIN (
SELECT
MIN(tInner.Date) AS FirstTransDate,
tInner.TransactionId,
tInner.ClientId
FROM Transactions tInner
WHERE tInner.ClientId = @clientId
AND tInner.TransactionId IN @transIds
GROUP BY tInner.TransactionId, tInner.ClientId
) t2 ON t2.ClientId = t1.ClientId AND t2.TransactionId = t1.TransactionId
WHERE t1.Date != t2.FirstTransDate
";
var idsToDelete = cn.Query<int>(
selectQuery,
new { clientId, transIds }).ToList();
// Delete the whole list in one go
cn.Execute("DELETE FROM Transactions WHERE Id IN @idsToDelete", new { idsToDelete });
(inspiration from here and here)
I haven't tested this using Dapper but the list of idsToDelete should be correct according to this fiddle I made. A couple things to note:
Depending on how long your list of transIds is (I believe those IDs are in result in your own example), you might want to repeat this in smaller batches instead of trying to delete the whole list in one go (see the sketch after these notes).
The SQL query above doesn't account for two "duplicate" transactions having the same "oldest" Date. If that can happen in your table, this query will remove all the "duplicate" rows apart from those two.
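A rough sketch of that batching, reusing the idsToDelete list from above (the batch size of 1,000 is arbitrary):
// Delete in chunks so each DELETE statement stays small (requires System.Linq).
const int batchSize = 1000;
for (int i = 0; i < idsToDelete.Count; i += batchSize)
{
    var batch = idsToDelete.Skip(i).Take(batchSize).ToList();
    cn.Execute("DELETE FROM Transactions WHERE Id IN @batch", new { batch });
}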
Improvements
There are a couple of things that seem a little out of place with your setup that I think you should consider:
even fetching var result = await _transactionsDataRepository... can take a lot of time since there are millions of rows
Millions of rows should not be an issue for any decent database server to handle. It sounds like you are missing some indexes on your table. With proper indexes your queries should be pretty swift as long as you can keep them simple.
but would that be a good way?
Not quite sure what you are referring to as being good or bad here, but I'll interpret a little... Right now you are writing tons of rows to a table that seems to contain duplicate data. When I think of a transaction-based system, no two transactions should share the same ID, meaning that for two different ClientIds there should never be a case where t1.TransactionId == t2.TransactionId. If that holds, you could avoid checking ClientId in my code snippet above.
Since you only want to keep one transaction for each TransactionId, will you ever need two transactions with the same TransactionId? If not, you can go even further, make the TransactionId column unique, and avoid inserting two rows with the same TransactionId in the first place. You can use the Entity Framework [Index(IsUnique = true)] attribute to also create an index that speeds up queries on that column/property, as sketched below.
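A small sketch of what that could look like on the entity (the [Index] attribute requires EF 6.1+; combining it with your existing [Key] model is an assumption):
using System;
using System.ComponentModel.DataAnnotations;
using System.ComponentModel.DataAnnotations.Schema;

public class Transaction
{
    [Key]
    public int Id { get; set; }

    // Unique index prevents duplicate TransactionIds and speeds up lookups.
    [Index(IsUnique = true)]
    public int TransactionId { get; set; }

    public DateTime Date { get; set; }
    public decimal Price { get; set; }
    public int UserId { get; set; }
    public string Product { get; set; }
    public int ClientId { get; set; }
    // ... remaining columns as before
}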
I have been facing this problem for some time and, to be honest, I am confused by it myself, so please excuse me if I don't explain it as well as I should.
I am trying to insert some data into a table called CommunicationAttachment, which has a one-to-many relationship with Communication; every communication can have many attachments.
The thing is that I get:
UpdateException: Invalid Column Name: "Communication_CommunicationId"
when I try to insert the list of attachments.
Please note that I am using the repository pattern, but I tried the normal way as well and the issue wasn't fixed.
I traced the transaction on the database and found that the INSERT statement includes a Communication_CommunicationId column, yet no such column exists and I am pretty sure I never mapped one.
Here is my code (this happens when adding a new Communication). First I fetch the CaseFileAttachments to copy from them; Communications are related to CaseFiles:
public List<CorrespondenceAttachment> GetCaseFileAttachments(List<Guid> CorrespondenceAttachmentIds)
{
List<CorrespondenceAttachment> originalAttachments = new List<CorrespondenceAttachment>();
foreach (var item in CorrespondenceAttachmentIds)
{
var attachment = QueryData.Query<CorrespondenceAttachment>().Where(att => att.CorrespondenceAttachmentID == item).FirstOrDefault();
originalAttachments.Add(attachment);
}
return originalAttachments;
}
Then I copy the CaseFileAttachments and create new CommunicationAttachment objects:
public List<CommunicationAttachment> CopyCaseFileAttachmentsToCommunication(List<CorrespondenceAttachment> originalAttachments,Guid communicationId)
{
var communicationAttachments = new List<CommunicationAttachment>();
if (originalAttachments.Any())
{
foreach (var attachmentRef in originalAttachments)
{
var CommunicationAttachmentId = Guid.NewGuid();
communicationAttachments.Add(new CommunicationAttachment()
{
CommunicationAttachmentId = CommunicationAttachmentId,
DmsFileId = CommunicationAttachmentId,
CommunicationId = communicationId,
AttachmentTitle = attachmentRef.AttachmentTitle,
MimeType = attachmentRef.MimeType,
NewVersionID = null,
UploadDate = DateTime.Now,
Size = attachmentRef.Size,
Version = "0001",
AttachmentsGroupId = attachmentRef.AttachmentsGroupId,
DocumentId = attachmentRef.DocumentId,
RelativePath = attachmentRef.RelativePath,
Extension = attachmentRef.Extension,
AttachmentSubject = attachmentRef?.AttachmentSubject,
ExternalContactID = attachmentRef?.ExternalContactID,
AttachmentNumber = string.IsNullOrEmpty(attachmentRef?.AttachmentNumber) ? null : attachmentRef.AttachmentNumber,
TemplatedmsId = attachmentRef.TemplatedmsId,
State = eSense.Framework.Data.ObjectState.Added,
});
}
}
return communicationAttachments;
}
The methods above are called roughly like this:
public void AddNewCommunication(CommunicationDto communicationDto)
{
    var communication = communicationDto; // mapped to a Communication entity (mapping omitted here)
    var caseFileAttachments = new List<CorrespondenceAttachment>();
    var commAttachments = new List<CommunicationAttachment>();
    if (communicationDto.CommunicationAttachmentIdList.Any())
    {
        caseFileAttachments = GetCaseFileAttachments(communicationDto.CommunicationAttachmentIdList);
        if (caseFileAttachments.Any())
        {
            commAttachments = CopyCaseFileAttachmentsToCommunication(caseFileAttachments, communication.CommunicationId);
        }
    }
    communication.Attachments = commAttachments;
    Save(communication);
}
So what could be causing the wrong column name?
Here is the relation between Communication and CommunicationAttachment.
Note: I included only the important fields, so don't worry if the declarations don't exactly match the entities.
Communication Entity:
public class Communication : BaseEntity
{
public Communication()
{
Attachments = new HashSet<CommunicationAttachment>();
}
[Key]
public Guid CommunicationId { get; set; }
public string Subject { get; set; }
public string CommunicationNumber { get; set; }
public virtual ICollection<CommunicationAttachment> Attachments { get; set; }
public DateTime DateCreated { get; set; }
public Guid? PreviousCommunicationId { get; set; }
[ForeignKey("PreviousCommunicationId")]
public virtual Communication PreviousCommunication { get; set; }
}
CommunicationAttachment Entity:
public class CommunicationAttachment : AttachmentBaseWithDelegation<Guid>
{
public override Guid PrimaryId
{
get
{
return this.CommunicationAttachmentId;
}
}
public CommunicationAttachment()
{
}
[Key]
public Guid CommunicationAttachmentId { get; set; }
private string _attachmentNumber;
public string AttachmentNumber { get; set; }
[ForeignKey("NewVersionID")]
public virtual CommunicationAttachment CaseFileAttachmentNewerVersion { get; set; }
public Guid CommunicationId { get; set; }
[ForeignKey("CommunicationId")]
public virtual Communication Communication { get; set; }
}
Sorry if you found my question hard to understand; I am confused myself!
Thanks in advance.
This is typically a case where a relationship between entities is not set up correctly. It would appear that EF should be resolving this relationship by convention if Communication's PK is "CommunicationId".
I notice that you've commented out a line to set the CommunicationId on the new entity:
//CommunicationId = communicationId,
What fields are in the CommunicationAttachment? Is there a CommunicationId? Is there a Communication navigation property? What configuration settings are you using?
For example, with fluent configuration I would have something like:
(CommunicationEntityConfiguration)
If CommunicationAttachment has a navigation property back to Communication and a FK field called CommunicationId...
HasMany(x => x.CommunicationAttachments)
.WithRequired(x => x.Communication)
.HasForeignKey(x => x.CommunicationId);
If the attachment entity has a navigation property without a mapped FK in the entity...
HasMany(x => x.CommunicationAttachments)
.WithRequired(x => x.Communication)
.Map(x => x.MapKey("CommunicationId"));
If the attachment entity does not have a navigation property, but has a FK in the entity...
HasMany(x => x.CommunicationAttachments)
.WithRequired()
.HasForeignKey(x => x.CommunicationId);
Or lastly if the attachment entity does not have a navigation property nor a mapped FK...
HasMany(x => x.CommunicationAttachments)
.WithRequired()
.Map(x => x.MapKey("CommunicationId"));
I am a big fan of explicit mapping over convention, as it makes very clear what maps to what, and how, which helps resolve potential mapping conflicts. If the rest of the similar relations seem to be working and just this one is playing up, I'd be looking for possible typos in the field names. With a mapped collection like the above, calling Communication.CommunicationAttachments.Add(attachment) should set the FK / related entity on the attachment without you having to set the FK or related entity manually.
One additional note:
From your example I see you are setting primary keys manually client-side using Guid.NewGuid(). It is generally better to allow the database to manage PK generation and let EF manage FK assignment, so that related entities get the FKs of newly inserted rows automatically. Rather than SQL's NewId() or Guid.NewGuid(), it is advisable to use sequential UUIDs. In SQL Server this is NewSequentialId(). For client-side assignment, you can reproduce the sequential UUID pattern either with a system DLL call to get the ID, or with a simple re-hash of the Guid bytes. See: Is there a .NET equivalent to SQL Server's newsequentialid()
The GUIDs still carry the same uniqueness; the bytes are simply arranged to be more sequential and practical for database indexing, which reduces page fragmentation. The downside is that the IDs are more predictable. Depending on your database engine you might want to customize the algorithm based on whether the database indexes on the low-order or high-order bytes.
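For illustration, a hedged sketch of the common client-side "comb" re-hash pattern (packing a timestamp into the last six bytes); this is not the exact code from the linked answer:
using System;

public static class SequentialGuid
{
    // Generates a GUID whose last 6 bytes encode the current date/time,
    // so values sort roughly in insertion order under SQL Server's
    // uniqueidentifier comparison rules.
    public static Guid NewCombGuid()
    {
        byte[] guidBytes = Guid.NewGuid().ToByteArray();

        DateTime baseDate = new DateTime(1900, 1, 1);
        DateTime now = DateTime.UtcNow;

        // Days since the base date, and milliseconds since midnight
        // (divided by 3.333333 because SQL Server datetime resolves to
        // roughly 1/300th of a second).
        TimeSpan days = new TimeSpan(now.Ticks - baseDate.Ticks);
        TimeSpan msecs = now.TimeOfDay;

        byte[] daysBytes = BitConverter.GetBytes(days.Days);
        byte[] msecsBytes = BitConverter.GetBytes((long)(msecs.TotalMilliseconds / 3.333333));

        // Reverse to match SQL Server's byte ordering, then overwrite the
        // last 6 bytes of the random GUID with the timestamp.
        Array.Reverse(daysBytes);
        Array.Reverse(msecsBytes);
        Array.Copy(daysBytes, daysBytes.Length - 2, guidBytes, guidBytes.Length - 6, 2);
        Array.Copy(msecsBytes, msecsBytes.Length - 4, guidBytes, guidBytes.Length - 4, 4);

        return new Guid(guidBytes);
    }
}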
When using GUIDs for database, sequential or otherwise, you should ensure you have a scheduled index maintenance job on the database. With sequential IDs this job will run faster and keep the index tables more compact.
I am using NHibernate and C#.
I have two entities with a many-to-many relationship between them.
When I delete the parent entity I only have its ID, and I delete it with an HQL query.
My problem is that it only deletes the parent entity without deleting its relationships.
My Entities look like this:
public class Entity_A
{
public virtual int Code { get; set; }
public virtual int Id { get; set; }
public virtual ICollection<Entity_B> Entities_B { get; set; }
}
public class Entity_B
{
public virtual int Code { get; set; }
public virtual ICollection<Entity_A> Entities_A { get; set; }
}
Mapping:
public class EntityAMap : ClassMap<Entity_A>
{
public EntityAMap()
{
Table("ENTITY_A");
Id(x=>x.Code).GeneratedBy.Identity();
Map(x => x.Id).Column("A_ID").Not.Nullable();
HasManyToMany(x => x.Entities_B)
.LazyLoad()
.Generic()
.PropertyRef("Id")
.ChildKeyColumn("B_CODE")
.ParentKeyColumn("A_ID")
.Table("ENTITY_A_TO_ENTITY_B")
.Cascade.All();
}
}
public class EntityBMap : ClassMap<Entity_B>
{
public EntityBMap()
{
Table("ENTITY_B");
Id(x=>x.Code).GeneratedBy.Identity();
HasManyToMany(x => x.Entities_A)
.Generic()
.ChildPropertyRef("Code")
.ChildKeyColumn("A_ID")
.ParentKeyColumn("B_CODE")
.Table("ENTITY_A_TO_ENTITY_B")
.Cascade.All()
.Inverse();
}
}
My question is this: what should I change so that when I delete Entity_A with an NHibernate HQL query, it also deletes all of its relationships with Entity_B (from the table ENTITY_A_TO_ENTITY_B)?
In case your code looks like this:
ISession session = sessionFactory.OpenSession();
ITransaction tx = session.BeginTransaction();
String hqlDelete = "delete Entity_A ea where ea.Code = :code";
int deletedEntities = session.CreateQuery( hqlDelete )
.SetString( "code", codeToDelete )
.ExecuteUpdate();
tx.Commit();
session.Close();
(so if the code looks like the above) then:
we are not using HQL as a way to load entities into the session (and let NHibernate do the magic) -
we are using so-called DML-style operations instead
see the doc:
13.3. DML-style operations
As already discussed, automatic and transparent object/relational mapping is concerned with the management of object state. This implies that the object state is available in memory, hence manipulating (using the SQL Data Manipulation Language (DML) statements: INSERT, UPDATE, DELETE) data directly in the database will not affect in-memory state...
Mostly this is the answer: ...will not affect in-memory state...
Simply put, this way we just use DML to issue an efficient WRITE statement, while expressing it in HQL (a query language on top of our entities, not SQL).
SOLUTION:
1) Load the instance into memory. We can use HQL, QueryOver, ICriteria... the important thing is just to LOAD it INTO MEMORY, i.e. into the ISession.
That way, NHibernate can issue all the expected cascades on DELETE...
2) use .CreateSQLQuery() to manually delete the rest as well:
session
.CreateSQLQuery("DELETE FROM ENTITY_A_TO_ENTITY_B WHERE A_ID = :id")
.SetString( "id", idToDelete )
.ExecuteUpdate();
...
session.CreateQuery( hqlDelete )
This (second approach) keeps the SQL statements efficient (no instance is loaded into the session), but requires a bit more coding on our side (NHibernate can only cast its spells through the session).
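For completeness, a minimal sketch of the first approach (loading the entity and letting NHibernate cascade the deletes); the variable names here are illustrative:
using (ISession session = sessionFactory.OpenSession())
using (ITransaction tx = session.BeginTransaction())
{
    // 1) Load the Entity_A instance into the session first
    var entityA = session
        .CreateQuery("from Entity_A ea where ea.Code = :code")
        .SetParameter("code", codeToDelete)
        .UniqueResult<Entity_A>();

    if (entityA != null)
    {
        // 2) Delete the loaded instance; because the collection is now known
        //    to the session, NHibernate also removes the corresponding rows
        //    from ENTITY_A_TO_ENTITY_B via the mapped cascade.
        session.Delete(entityA);
    }

    tx.Commit();
}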
I was hoping I could get some help with a performance problem I'm having in EntityFramework 4.4. I'm working on converting an application that was using EDMX files over to code first and I've run into a problem when running queries with a large number of objects in the "where" clause of the LINQ query.
Here's a short overview of how everything is laid out (Entity doesn't refer to EF, it's the name given to a generic "thing" in our code):
public class ExampleDbContext : DbContext
{
public DbSet<EntityTag> EntityTags { get; set; }
public DbSet<Entity> Entities { get; set; }
public DbSet<Log> Logs { get; set; }
protected override void OnModelCreating(DbModelBuilder modelBuilder)
{
// Fluent mappings added to modelBuilder.Configurations.Add() in here
}
}
public class EntityTag
{
public int Id { get; set; }
public virtual Entity Entity { get; set; }
public int EntityId { get; set; }
public virtual Log Deleted { get; set; }
public int? DeletedId { get; set; }
}
public class Entity
{
public int Id { get; set; }
public byte[] CompositeId { get; set; }
}
// Used to log when an event happens
public class Log
{
public int Id { get; set; }
public string Username { get; set; }
public DateTime Timestamp { get; set; }
}
The query I'm running that causes the problem is:
// Creates an IEnumerable<byte[]> with the keys to find
var computedKeys = CreateCompositeIDs(entityKeys);
// Run the query and find any EntityTag that isn't deleted and is in
// the computedKeys list
var result = from et in Context.EntityTags
where computedKeys.Contains(et.Entity.CompositeId) &&
et.Deleted == null
select et;
var entityTags = result.ToList();
When computedKeys contains only a few Ids (15 for example) the code and query runs quickly. When I have a large number of Ids (1600 is normal at this point and it could get higher) it takes minutes (at 500, I haven't even tried with 1500 yet) to run that query once it's enumerated with ToList(). I've also removed the computedKeys.Contains() (leaving et.Deleted) from the query with a large number of computedKeys and the query ends up running quickly.
Through debugging I've determined that creating the list of keys is fast, so that's not the problem. When I hook a profiler up to MSSQL to see the query that's generated it looks normal in that all of the CompositeId's are included in a WHERE CompositeId IN ( /* List of Ids, could be 1500 of them */) and when the query shows up in the profiler it executes in less than a second so I don't think it's a database optimization thing, either. The profiler will sit there without anything showing up for the entire time it's running aside from the last second or so when it quickly returns a result.
I hooked up dotTrace and it looks like a lot of the time is spent within System.Data.Query.PlanCompiler.JoinGraph.GenerateTransitiveEdge(JoinEdge, JoinEdge) (119,640 ms); System.Collections.Generic.List<T>.Enumerator.MoveNext (54,270 ms) is called within that method twice, I think, based on the total execution time for each of them.
I just can't seem to figure out why it's taking so long to generate the query. It doesn't seem to be any faster the second time it executes after compiling, either, so it doesn't look like it's being cached.
Thanks in advance for the help!
I was able to figure it out. Once I decided not to be held to the original query and reconsidered the result, I rewrote the query to be:
var computedKeys = CreateCompositeIDs(entityKeys);
var entityTags = (from e in Context.Entities
where computedKeys.Contains(e.CompositeId)
from et in e.Tags
select et).Distinct();
entityTags = from et in entityTags
where et.Deleted == null
select et;
return entityTags;
When I started querying the Entities directly, took advantage of the relationship to EntityTag (which I forgot to include in the original question...) via Tags, and then filtered to only the existing EntityTags, the query sped up to the point where it all runs in under one second.
I am running into an interesting performance issue with Entity Framework. I am using Code First.
Here is the structure of my entities:
A Book can have many Reviews.
A Review is associated with a single Book.
A Review can have one or many Comments.
A Comment is associated with one Review.
public class Book
{
public int BookId { get; set; }
// ...
public ICollection<Review> Reviews { get; set; }
}
public class Review
{
public int ReviewId { get; set; }
public int BookId { get; set; }
public Book Book { get; set; }
public ICollection<Comment> Comments { get; set; }
}
public class Comment
{
public int CommentId { get; set; }
public int ReviewId { get; set; }
public Review Review { get; set; }
}
I populated my database with a lot of data and added the proper indexes. I am trying to retrieve a single book that has 10,000 reviews on it using this query:
var bookAndReviews = db.Books.Where(b => b.BookId == id)
.Include(b => b.Reviews)
.FirstOrDefault();
This particular book has 10,000 reviews. The performance of this query is around 4 seconds. Running the exact same query (via SQL Profiler) actually returns in no time at all. I used the same query and a SqlDataAdapter and custom objects to retrieve the data and it happens in under 500 milliseconds.
Using ANTS Performance Profiler, it looks like the bulk of the time is being spent doing a few different things:
The Equals method is being called 50 million times.
Does anyone know why it would need to call this 50 million times and how I could increase the performance for this?
Why is Equals called 50M times?
It sounds quite suspicious. You have 10.000 reviews and 50.000.000 calls to Equals. Suppose this is caused by the identity map implemented internally by EF. The identity map ensures that each entity with a unique key is tracked by the context only once, so if the context already has an instance with the same key as a record loaded from the database, it will not materialize a new instance and instead uses the existing one. Now, how can this coincide with those numbers? My terrifying guess:
=============================================
1st record read | 0 comparisons
2nd record read | 1 comparison
3rd record read | 2 comparisons
...
10.000th record read | 9.999 comparisons
That means each new record is compared with every existing record in the identity map. To compute the total number of comparisons we can use the formula for an arithmetic series:
a(n) = a(n-1) + 1
Sum(n) = (n / 2) * (a(1) + a(n))
Sum(10.000) = 5.000 * (0 + 9.999) = 49.995.000 ≈ 50.000.000
I hope I didn't make a mistake in my assumptions or calculations. Actually, I hope I did, because this doesn't look good.
Try turning off change tracking - hopefully that also turns off the identity map checking.
It can be tricky. Start with:
var bookAndReviews = db.Books.Where(b => b.BookId == id)
.Include(b => b.Reviews)
.AsNoTracking()
.FirstOrDefault();
But there is a big chance that your navigation property will not be populated (because it is handled by change tracking). In such case use this approach:
var book = db.Books.Where(b => b.BookId == id).AsNoTracking().FirstOrDefault();
book.Reviews = db.Reviews.Where(r => r.BookId == id).AsNoTracking().ToList();
Anyway, can you see what object type is passed to Equals? I think it should compare only primary keys, and even 50M integer comparisons should not be such a problem.
As a side note, EF is slow - it is a well-known fact. It also uses reflection internally when materializing entities, so simply materializing 10.000 records can take "some time". Unless you have already done so, you can also turn off dynamic proxy creation (db.Configuration.ProxyCreationEnabled), as sketched below.
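For example, a small sketch of where those switches could live (the context class name below is an assumption, and disabling lazy loading is optional):
public class BookContext : DbContext
{
    public BookContext()
    {
        // Disable dynamic proxy creation (and, if you don't need it, lazy loading)
        // so EF materializes plain POCO instances.
        Configuration.ProxyCreationEnabled = false;
        Configuration.LazyLoadingEnabled = false;
    }

    public DbSet<Book> Books { get; set; }
    public DbSet<Review> Reviews { get; set; }
    public DbSet<Comment> Comments { get; set; }
}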
I know this sounds lame, but have you tried the other way around, e.g.:
var reviewsAndBooks = db.Reviews.Where(r => r.Book.BookId == id)
.Include(r => r.Book);
I have noticed sometimes better performance from EF when you approach your queries this way (but I haven't had the time to figure out why).