I have the following function, using Entity Framework, that checks a db table for expired product data. The entire query seems extremely slow and I am wondering if there is a better / more efficient way of adapting it, as there is always a bulk amount of data to clean, sometimes up to 3k items.
It is purely the selection that can take 4-10 minutes, depending on db size on the day.
Datetime format Example = 2021-03-31T23:59:59.000
using (ProductContext db = new ProductContext())
{
log.Info("Checking for expired data in the Product Data db");
//rule = now - configured hours, default = 2 hours in past.
var checkWindow = Convert.ToInt32(PRODUCTMAPPING_CONFIG.MinusExpiredWindowHours);
var dtCheck = Convert.ToDateTime(DateTime.Now.AddHours(-checkWindow).ToString("s"));
var rowData = db.ProductData.Where(le => Convert.ToDateTime(le.ProductEndDate.Value.ToString().Trim()) < dtCheck).ToList();
rowData.ForEach(i => {log.Debug($"DB Row ID {i.Id} with Product ID Value: {i.ProductUid} has expired with Product End Date: {i.ProductEndDate}, marked for removal."); });
log.Info($"Number of expired Products being removed: {rowData.Count()}");
db.ProductData.RemoveRange(rowData);
db.SaveChanges();
log.Info(rowData.Count == 0
? "No expired data present."
: $"Number of expired assets Successfully removed from database = {rowData.Count}");
}
Thanks in advance
EDIT:
Thanks for all the suggestions. I will be looking at the ORM comments made by Panagiotis Kanavos below regarding direct queries for this type of operation, and I amended the column datatype based on another comment from Panagiotis Kanavos. Finally, the .ToList() comment by jdweng removed the lag immediately, so this at least gets me moving faster for now while I look at the suggestions by Panagiotis Kanavos, as I think that is probably the best way forward.
Faster code is now:
using (ProductContext db = new ProductContext())
{
log.Info("Checking for expired data in the Product Data db");
//rule = now - configured hours, default = 2 hours in past.
var checkWindow = Convert.ToInt32(PRODUCTMAPPING_CONFIG.MinusExpiredWindowHours);
var dtCheck = Convert.ToDateTime(DateTime.Now.AddHours(-checkWindow).ToString("s"));
// Amended DB column from nvarchar to DateTime to allow direct comparison, based on comment by: Panagiotis Kanavos
// Removed the ToList() that materialized the whole result set immediately, based on comment by: jdweng
var rowData = db.ProductData.Where(le => le.ProductEndDate < dtCheck);
log.Info($"Number of expired Products being removed: {rowData.Count()}");
// added print out on debug only.
if(log.IsDebugEnabled)
rowData.ToList().ForEach(i => {log.Debug($"DB Row ID {i.Id} with Product ID Value: {i.ProductUid} has expired with Product End Date: {i.ProductEndDate}, marked for removal."); });
var rowCount = rowData.Count();
db.ProductData.RemoveRange(rowData);
db.SaveChanges();
log.Info(rowCount == 0
? "No expired data present."
: $"Number of expired assets Successfully removed from database = {rowCount}");
}
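For reference, here is a rough sketch of the direct-query route Panagiotis Kanavos suggested (assuming EF6 on SQL Server, and that the table/column names match the entity above) - a single DELETE avoids loading the rows into the context at all:
using (ProductContext db = new ProductContext())
{
    var checkWindow = Convert.ToInt32(PRODUCTMAPPING_CONFIG.MinusExpiredWindowHours);
    var dtCheck = DateTime.Now.AddHours(-checkWindow);
    // One round-trip: delete server-side instead of materialising and removing entities.
    int removed = db.Database.ExecuteSqlCommand(
        "DELETE FROM ProductData WHERE ProductEndDate < @cutoff",
        new System.Data.SqlClient.SqlParameter("@cutoff", dtCheck));
    log.Info($"Number of expired Products removed: {removed}");
}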
I am grateful for all the useful comments below and for the time you all took to respond and help me learn from this.
Related
I am writing a script that returns all unprocessed partitions within a measure group, using the following command:
objMeasureGroup.Partitions.Cast<Partition>().Where(x => x.State != AnalysisState.Processed)
After doing some experiments, it looks like this property only indicates whether the data is processed and says nothing about the indexes.
After searching for hours, I didn't find any method to list the partitions where the data is processed but the indexes are not.
Any suggestions?
Environment:
SQL Server 2014
SSAS multidimensional cube
Scripts are written within an SSIS package / Script Task
First, ProcessIndexes is an incremental operation, so if you run it twice, the second run will be pretty quick because there is nothing to do. I would therefore recommend just running it on the cube and not worrying about whether it was previously run. However, if you do need to analyze the current state, read on.
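For completeness, here is a rough AMO sketch of that "just run it" approach (the server, database and cube names are placeholders to substitute):
// ProcessIndexes is incremental, so re-running it on the whole cube is cheap
// when the indexes are already built. All names below are placeholders.
var server = new Microsoft.AnalysisServices.Server();
server.Connect("Data Source=localhost");
var database = server.Databases.GetByName("YourDatabaseName");
var cube = database.Cubes.GetByName("YourCubeName");
cube.Process(Microsoft.AnalysisServices.ProcessType.ProcessIndexes);
server.Disconnect();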
The best way (only way I know of) to distinguish whether ProcessIndexes has been run on a partition is to study the DISCOVER_PARTITION_STAT and DISCOVER_PARTITION_DIMENSION_STAT DMVs as seen below.
The DISCOVER_PARTITION_STAT DMV returns one row per aggregation with the rowcount. The first row of that DMV has a blank aggregation name and represents the rowcount of the lowest level data processed in that partition.
The DISCOVER_PARTITION_DIMENSION_STAT DMV can tell you whether indexes are processed and which range of values for each dimension attribute is in this partition (by internal IDs, so not super easy to interpret). We assume at least one dimension attribute is set to be optimized, so it will be indexed.
You will also need to add a reference to Microsoft.AnalysisServices.AdomdClient to simplify running these DMVs:
string sDatabaseName = "YourDatabaseName";
string sCubeName = "YourCubeName";
string sMeasureGroupName = "YourMeasureGroupName";
Microsoft.AnalysisServices.Server s = new Microsoft.AnalysisServices.Server();
s.Connect("Data Source=localhost");
Microsoft.AnalysisServices.Database db = s.Databases.GetByName(sDatabaseName);
Microsoft.AnalysisServices.Cube c = db.Cubes.GetByName(sCubeName);
Microsoft.AnalysisServices.MeasureGroup mg = c.MeasureGroups.GetByName(sMeasureGroupName);
Microsoft.AnalysisServices.AdomdClient.AdomdConnection conn = new Microsoft.AnalysisServices.AdomdClient.AdomdConnection(s.ConnectionString);
conn.Open();
foreach (Microsoft.AnalysisServices.Partition p in mg.Partitions)
{
Console.Write(p.Name + " - " + p.State + " - ");
var restrictions = new Microsoft.AnalysisServices.AdomdClient.AdomdRestrictionCollection();
restrictions.Add("DATABASE_NAME", db.Name);
restrictions.Add("CUBE_NAME", c.Name);
restrictions.Add("MEASURE_GROUP_NAME", mg.Name);
restrictions.Add("PARTITION_NAME", p.Name);
var dsAggs = conn.GetSchemaDataSet("DISCOVER_PARTITION_STAT", restrictions);
var dsIndexes = conn.GetSchemaDataSet("DISCOVER_PARTITION_DIMENSION_STAT", restrictions);
if (dsAggs.Tables[0].Rows.Count == 0)
Console.WriteLine("ProcessData not run yet");
else if (dsAggs.Tables[0].Rows.Count > 1)
Console.WriteLine("aggs processed");
else if (p.AggregationDesign == null || p.AggregationDesign.Aggregations.Count == 0)
{
bool bIndexesBuilt = false;
foreach (System.Data.DataRow row in dsIndexes.Tables[0].Rows)
{
if (Convert.ToBoolean(row["ATTRIBUTE_INDEXED"]))
{
bIndexesBuilt = true;
break;
}
}
if (bIndexesBuilt)
Console.WriteLine("indexes have been processed. no aggs defined");
else
Console.WriteLine("no aggs defined. need to run ProcessIndexes on this partition to build indexes");
}
else
Console.WriteLine("need to run ProcessIndexes on this partition to process aggs and indexes");
}
I am posting this answer as additional information to @GregGalloway's excellent answer.
After searching for a while, the only way to know whether partitions are processed is using DISCOVER_PARTITION_STAT and DISCOVER_PARTITION_DIMENSION_STAT.
I found an article posted by Daren Gossbel describing the whole process:
SSAS: Are my Aggregations processed?
In the article above the author provided two methods:
using XMLA
One way is to find it out with an XMLA Discover call to the DISCOVER_PARTITION_STAT rowset, but that returns the results in a big lump of XML, which is not as easy to read as a tabular result set.
example
<Discover xmlns="urn:schemas-microsoft-com:xml-analysis">
<RequestType>DISCOVER_PARTITION_STAT</RequestType>
<Restrictions>
<RestrictionList>
<DATABASE_NAME>Adventure Works DW</DATABASE_NAME>
<CUBE_NAME>Adventure Works</CUBE_NAME>
<MEASURE_GROUP_NAME>Internet Sales</MEASURE_GROUP_NAME>
<PARTITION_NAME>Internet_Sales_2003</PARTITION_NAME>
</RestrictionList>
</Restrictions>
<Properties>
<PropertyList>
</PropertyList>
</Properties>
</Discover>
using DMV queries
If you have SSAS 2008, you can use the new DMV feature to query this same rowset and return a tabular result.
example
SELECT *
FROM SystemRestrictSchema($system.discover_partition_stat
,DATABASE_NAME = 'Adventure Works DW 2008'
,CUBE_NAME = 'Adventure Works'
,MEASURE_GROUP_NAME = 'Internet Sales'
,PARTITION_NAME = 'Internet_Sales_2003')
Similar posts:
How to find out using AMO if aggregation exists on partition?
Detect aggregation processing state with AMO?
I have an export job migrating data from an old database into a new database. The problem I'm having is that the user population is around 3 million and the job takes a very long time to complete (15+ hours). The machine I am using only has 1 processor so I'm not sure if threading is what I should be doing. Can someone help me optimize this code?
static void ExportFromLegacy()
{
var usersQuery = _oldDb.users.Where(x =>
x.status == "active");
int BatchSize = 1000;
var errorCount = 0;
var successCount = 0;
var batchCount = 0;
// Using MoreLinq's Batch for sequences
// https://www.nuget.org/packages/MoreLinq.Source.MoreEnumerable.Batch
foreach (IEnumerable<users> batch in usersQuery.Batch(BatchSize))
{
Console.WriteLine(String.Format("Batch count at {0}", batchCount));
batchCount++;
foreach(var user in batch)
{
try
{
var userData = _oldDb.userData.Where(x =>
x.user_id == user.user_id).ToList();
if (userData.Count > 0)
{
// Insert into table
var newData = new newData()
{
UserId = user.user_id // shortened code for brevity.
};
_db.newUserData.Add(newData);
_db.SaveChanges();
// Insert item(s) into table
foreach (var item in userData)
{
if (!_db.userDataItems.Any(x => x.id == item.id))
{
var newItem = new Item()
{
UserId = user.user_id, // shortened code for brevity.
DataId = newData.id // id from object created above
};
_db.userDataItems.Add(newItem);
}
_db.SaveChanges();
successCount++;
}
}
}
catch(Exception ex)
{
errorCount++;
Console.WriteLine(String.Format("Error saving changes for user_id: {0} at {1}.", user.user_id.ToString(), DateTime.Now));
Console.WriteLine("Message: " + ex.Message);
Console.WriteLine("InnerException: " + ex.InnerException);
}
}
}
Console.WriteLine(String.Format("End at {0}...", DateTime.Now));
Console.WriteLine(String.Format("Successful imports: {0} | Errors: {1}", successCount, errorCount));
Console.WriteLine(String.Format("Total running time: {0}", (exportStart - DateTime.Now).ToString(#"hh\:mm\:ss")));
}
Unfortunately, the major issue is the number of database round-trips.
You make a round-trip:
For every user, to retrieve the user data by user id from the old database
For every user, to save the user data in the new database
For every user data item, to save it in the new database
So if you have 3 million users, and every user has an average of 5 user data items, that means at least 3m + 3m + 15m = 21 million database round-trips, which is insane.
The only way to dramatically improve the performance is by reducing the number of database round-trip.
Batch - Retrieve user by id
You can quickly reduce the number of database round-trips by retrieving all the user data at once; and since you don't need to track those entities, use AsNoTracking() for even more performance gains.
var list = batch.Select(x => x.user_id).ToList();
var userDatas = _oldDb.userData
.AsNoTracking()
.Where(x => list.Contains(x.user_id))
.ToList();
foreach(var userData in userDatas)
{
....
}
You should already have saved a few hours with this change alone.
Batch - Save Changes
Every time you save a user data record or item, you perform a database round-trip.
Disclaimer: I'm the owner of the project Entity Framework Extensions
This library allows you to perform:
BulkSaveChanges
BulkInsert
BulkUpdate
BulkDelete
BulkMerge
You can either call BulkSaveChanges at the end of the batch, or build a list to insert and use BulkInsert directly for even more performance.
You will, however, have to use a relation to the newData instance instead of using the ID directly.
foreach (IEnumerable<users> batch in usersQuery.Batch(BatchSize))
{
// Retrieve all users for the batch at once.
var list = batch.Select(x => x.user_id).ToList();
var userDatas = _oldDb.userData
.AsNoTracking()
.Where(x => list.Contains(x.user_id))
.ToList();
// Create list used for BulkInsert
var newDatas = new List<newData>();
var newDataItems = new List<Item>();
foreach(var userData in userDatas)
{
// newDatas.Add(newData);
// newDataItem.OwnerData = newData;
// newDataItems.Add(newDataItem);
}
_db.BulkInsert(newDatas);
_db.BulkInsert(newDataItems);
}
EDIT: Answer subquestion
One of the properties of a newDataItem is the id of newData (e.g.
newDataItem.newDataId). So newData would have to be saved first in
order to generate its id. How would I BulkInsert if there is a
dependency on another object?
You must use navigation properties instead. With a navigation property, you never have to specify the parent id; you set the parent object instance instead.
public class UserData
{
public int UserDataID { get; set; }
// ... properties ...
public List<UserDataItem> Items { get; set; }
}
public class UserDataItem
{
public int UserDataItemID { get; set; }
// ... properties ...
public UserData OwnerData { get; set; }
}
var userData = new UserData();
var userDataItem = new UserDataItem();
// Use navigation property to set the parent.
userDataItem.OwnerData = userData;
Tutorial: Configure One-to-Many Relationship
Also, I don't see a BulkSaveChanges in your example code. Would that
have to be called after all the BulkInserts?
BulkInsert inserts directly into the database. You don't have to call SaveChanges or BulkSaveChanges; once you invoke the method, it's done ;)
Here is an example using BulkSaveChanges:
foreach (IEnumerable<users> batch in usersQuery.Batch(BatchSize))
{
// Retrieve all users for the batch at once.
var list = batch.Select(x => x.user_id).ToList();
var userDatas = _oldDb.userData
.AsNoTracking()
.Where(x => list.Contains(x.user_id))
.ToList();
// Create list used for BulkInsert
var newDatas = new List<newData>();
var newDataItems = new List<Item>();
foreach(var userData in userDatas)
{
// newDatas.Add(newData);
// newDataItem.OwnerData = newData;
// newDataItems.Add(newDataItem);
}
var context = new UserContext();
context.userDatas.AddRange(newDatas);
context.userDataItems.AddRange(newDataItems);
context.BulkSaveChanges();
}
BulkSaveChanges is slower than BulkInsert because it has to use some internal methods from Entity Framework, but it is still way faster than SaveChanges.
In the example, I create a new context for every batch to avoid memory issues and gain some performance. If you re-use the same context for all batches, you will have millions of tracked entities in the ChangeTracker, which is never a good idea.
Entity Framework is a very bad choice for importing large amounts of data. I know this from personal experience.
That being said, I found a few ways to optimize things when I tried to use it in the same way you are.
The Context will cache objects as you add them, and the more inserts you do, the slower future inserts will get. My solution was to limit each context to about 500 inserts before I disposed of that instance and created a new one. This boosted performance significantly.
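Here is a rough sketch of that rotation pattern (NewDbContext, NewUserData and rowsToImport are placeholders, since the real types aren't shown in the question):
// Dispose and recreate the context every ~500 inserts so the change tracker
// never grows unbounded. NewDbContext / NewUserData / rowsToImport are placeholders.
var context = new NewDbContext();
try
{
    int pending = 0;
    foreach (var row in rowsToImport)
    {
        context.Set<NewUserData>().Add(row);
        if (++pending % 500 == 0)
        {
            context.SaveChanges();
            context.Dispose();
            context = new NewDbContext();
        }
    }
    context.SaveChanges();    // flush the final partial batch
}
finally
{
    context.Dispose();
}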
I was able to make use of multiple threads to increase performance, but you will have to be very careful about resource contention. Each thread will definitely need its own Context, don't even think about trying to share it between threads. My machine had 8 cores, so threading will probably not help you as much; with a single core I doubt it will help you at all.
Turn off automatic change detection with AutoDetectChangesEnabled = false; change detection is incredibly slow. Unfortunately this means you have to modify your code to make all changes directly through the context. No more Entity.Property = "Some Value"; it becomes something along the lines of db.Entry(entity).Property(e => e.Property).CurrentValue = "Some Value" (see the sketch below), which makes the code ugly.
Any queries you do should definitely use AsNoTracking.
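Here is a small EF6-flavoured sketch of the last two points together (the context type, DbSet and property names are placeholders):
// Rough sketch (EF6): disable automatic DetectChanges and read without tracking.
// NewDbContext / OldRows / Imported / Status are placeholders for your own types.
using (var db = new NewDbContext())
{
    db.Configuration.AutoDetectChangesEnabled = false;

    // Queries that only read should not be tracked at all.
    var source = db.OldRows.AsNoTracking()
                   .Where(r => r.Imported == false)
                   .ToList();

    // With automatic detection off, either call db.ChangeTracker.DetectChanges()
    // yourself before SaveChanges, or set values through the entry API so the
    // tracker is informed immediately, e.g.:
    // db.Entry(trackedEntity).Property(e => e.Status).CurrentValue = "Imported";
}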
With all that, I was able to cut a ~20 hour process down to about 6 hours, but I still don't recommend using EF for this. It was an extremely painful project due almost entirely to my poor choice of EF to add data. Please use something else... anything else...
I don't want to give the impression that EF is a bad data access library, it is great at what it was designed to do, unfortunately this is not what it was designed for.
I can think of a few options.
1) A small speed increase can be gained by moving your _db.SaveChanges() below your foreach() closing bracket:
foreach (...){
}
successCount += _db.SaveChanges();
2) Add items to a list, and then to the context:
List<ObjClass> list = new List<ObjClass>();
foreach (...)
{
list.Add(new ObjClass() { ... });
}
_db.newUserData.AddRange(list);
successCount += _db.SaveChanges();
3) If it's a big amount of data, save in bunches:
List<ObjClass> list = new List<ObjClass>();
int cnt=0;
foreach (...)
{
list.Add(new ObjClass() { ... });
if (++cnt % 100 == 0) // bunches of 100
{
_db.newUserData.AddRange(list);
successCount += _db.SaveChanges();
list.Clear();
// Optional if a HUGE amount of data
if (cnt % 1000 == 0)
{
_db = new MyDbContext();
}
}
}
// Don't forget that!
_db.newUserData.AddRange(list);
successCount += _db.SaveChanges();
list.Clear();
4) If it's TOO big, consider using bulk inserts; a minimal sketch follows below. There are a few examples on the internet and a few free libraries around.
Ref: https://blogs.msdn.microsoft.com/nikhilsi/2008/06/11/bulk-insert-into-sql-from-c-app/
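Here is a minimal sketch of option 4 using SqlBulkCopy from System.Data.SqlClient, bypassing EF entirely (the table and column names are only illustrative):
// Stage the rows in a DataTable whose columns match the destination table,
// then push them in one bulk operation. connectionString / rowsToInsert are placeholders.
var table = new System.Data.DataTable();
table.Columns.Add("UserId", typeof(int));
table.Columns.Add("DataId", typeof(int));
foreach (var row in rowsToInsert)
    table.Rows.Add(row.UserId, row.DataId);

using (var bulk = new System.Data.SqlClient.SqlBulkCopy(connectionString))
{
    bulk.DestinationTableName = "dbo.userDataItems";   // destination name is illustrative
    bulk.BatchSize = 5000;                             // flush every 5000 rows
    bulk.WriteToServer(table);
}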
On most of these options you lose some control over error handling, as it is difficult to know which record failed.
Quick disclaimer first: I am a complete noob when it comes to databases. I know how to form an SQL query and... well, that's pretty much it - I figured that'd be enough to start with. Performance optimizations would come later.
'Later' has arrived. I need your help.
I'm doing NLP processing on news articles. The articles are taken from the Internet and stored in a database. Users give me an input period to analyze; I bring up all the articles in this period, analyze them, and show some graphs in return. I currently have a rather naive approach to this - I don't limit the number of articles returned. About 250 articles a day * 6 months is 45,000 records, a rather large number.
I'm experiencing mediocre fetch performance. I'm using C# + SQL CE (an easy DB to start with, with no set-up cost). I tried indexing the database to no avail. I suspect the problem comes from either:
asking for so much data in one single query.
using SQLCE
Am I utterly crazy to try and fetch thousands of records all in one call? Was SQL CE a stupid choice to make? I basically need practical advice on this. Also, if you could point me to good alternatives to solve my problem, that's even more awesome.
Your help is of great value to me - thanks in advance!
EDIT - Below is the command I use to get my articles:
using (SqlCeCommand com1 = new SqlCeCommand(mySqlRequestString, myConnectionString))
{
SqlCeResultSet res = com1.ExecuteResultSet(ResultSetOptions.Scrollable);
if (res.HasRows)
{
//Use the get ordinal method so we don’t have to worry about remembering what order our SQL put the field names in.
int ordGuid = res.GetOrdinal("Id"); int ordUrl = res.GetOrdinal("Url"); int ordPublicationDate = res.GetOrdinal("PublicationDate");
int ordTitle = res.GetOrdinal("Title"); int ordContent = res.GetOrdinal("Content"); int ordSource = res.GetOrdinal("Source");
int ordAuthor = res.GetOrdinal("Author"); int ordComputedKeywords = res.GetOrdinal("ComputedKeywords"); int ordComputedKeywordsDate = res.GetOrdinal("ComputedKeywordsDate");
//Get all the Articles
List<Article> articles = new List<Article>();
if (res.ReadFirst())
{
// ReadFirst() in the if-statement above already positioned us on the first record, so just read its data
Constants.Sources src = (Constants.Sources)Enum.Parse(typeof(Constants.Sources), res.GetString(ordSource));
string[] computedKeywords = res.IsDBNull(ordComputedKeywords)?new string[]{}: res.GetString(ordComputedKeywords).Split(',').ToArray();
DateTime computedKeywordsDate = res.IsDBNull(ordComputedKeywordsDate) ? new DateTime() : res.GetDateTime(ordComputedKeywordsDate);
articles.Add(new Article(res.GetGuid(ordGuid), new Uri(res.GetString(ordUrl)), res.GetDateTime(ordPublicationDate), res.GetString(ordTitle), res.GetString(ordContent), src, res.GetString(ordAuthor), computedKeywords, computedKeywordsDate));
}
// Read the remaining records
while (res.Read())
{
Constants.Sources src = (Constants.Sources)Enum.Parse(typeof(Constants.Sources), res.GetString(ordSource));
string[] computedKeywords = res.IsDBNull(ordComputedKeywords) ? new string[] { } : res.GetString(ordComputedKeywords).Split(',').ToArray();
DateTime computedKeywordsDate = res.IsDBNull(ordComputedKeywordsDate) ? new DateTime() : res.GetDateTime(ordComputedKeywordsDate);
articles.Add(new Article(res.GetGuid(ordGuid), new Uri(res.GetString(ordUrl)), res.GetDateTime(ordPublicationDate), res.GetString(ordTitle), res.GetString(ordContent), src, res.GetString(ordAuthor), computedKeywords, computedKeywordsDate));
}
return articles.ToArray();
}
}
You should only fetch one page of results at a time.
SQL CE is great for testing or very low-usage applications, but you should really consider SQL Server Express or full-blown SQL Server.
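As a rough illustration of the paging suggestion, here is a keyset-paged query against SQL CE (the table and column names are assumptions taken from the ordinals in your code, and System.Data.SqlServerCe is assumed to be referenced):
// Fetch one page of 100 articles at a time, keyed on PublicationDate,
// instead of materialising the whole period in a single call.
const string pageSql =
    "SELECT TOP (100) Id, Url, PublicationDate, Title, Content, Source, Author, " +
    "ComputedKeywords, ComputedKeywordsDate " +
    "FROM Article " +                                                  // table name is an assumption
    "WHERE PublicationDate > @lastSeen AND PublicationDate <= @periodEnd " +
    "ORDER BY PublicationDate";

using (var con = new SqlCeConnection(myConnectionString))
using (var cmd = new SqlCeCommand(pageSql, con))
{
    con.Open();
    cmd.Parameters.AddWithValue("@lastSeen", lastSeenPublicationDate); // highest date already processed
    cmd.Parameters.AddWithValue("@periodEnd", periodEnd);
    using (var reader = cmd.ExecuteReader())
    {
        while (reader.Read())
        {
            // map the row exactly as in the original loop,
            // then remember the last PublicationDate for the next page
        }
    }
}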
I need to insert a large amount of data into an SQLite db.
I use LINQ to Entities.
I have a problem adding a large amount of data (1M+ rows): either not enough memory or a very long runtime.
This code is fast, but requires a lot of memory:
// query - IQueryable of DbfRecord
// db - ObjectContext
int i = 0;
foreach (var item in query) {
db.AddToKladrs(new Kladr() {
Id = item.GetField(0),
ParentId = item.GetField(1),
RegionId = item.GetField(3),
Name = item.GetField(2),
Index = item.GetField(4)
});
if(++i % 4000 == 0)
db.SaveChanges(SaveOptions.AcceptAllChangesAfterSave);
}
This code is not resource-intensive, but very slow:
// query - IQueryable of DbfRecord
// db - ObjectContext
foreach (var item in query) {
db.ExecuteStoreCommand("insert into [Kladr] values({0}, {1}, {2}, {3}, {4})",
item.GetField(0),
item.GetField(1),
item.GetField(3),
item.GetField(2),
item.GetField(4)
);
}
I have omitted the try-catch construction and supporting types.
Help me find the best solution!
You can use SqlBulkCopy for copying large amounts of data. I haven't tried it with SQLite, but it should work.
Link 1
Link 2
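Note that SqlBulkCopy is the SQL Server bulk API, so for SQLite specifically a common substitute (a different technique, not SqlBulkCopy) is to wrap parameterized inserts in a single transaction. Here is a rough sketch using the System.Data.SQLite provider, reusing the field order from the question; connectionString is a placeholder:
// One transaction around many parameterized inserts is dramatically faster in SQLite
// than autocommitting each row. Assumes the System.Data.SQLite ADO.NET provider.
using (var con = new SQLiteConnection(connectionString))
{
    con.Open();
    using (var tx = con.BeginTransaction())
    using (var cmd = new SQLiteCommand(
        "INSERT INTO Kladr (Id, ParentId, RegionId, Name, [Index]) " +
        "VALUES (@id, @parentId, @regionId, @name, @index)", con, tx))
    {
        foreach (var item in query)    // query of DbfRecord, as in the question
        {
            cmd.Parameters.Clear();
            cmd.Parameters.AddWithValue("@id", item.GetField(0));
            cmd.Parameters.AddWithValue("@parentId", item.GetField(1));
            cmd.Parameters.AddWithValue("@regionId", item.GetField(3));
            cmd.Parameters.AddWithValue("@name", item.GetField(2));
            cmd.Parameters.AddWithValue("@index", item.GetField(4));
            cmd.ExecuteNonQuery();
        }
        tx.Commit();
    }
}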
Update:
Here is a good answer by Marc Gravell: how-to-do-a-bulk-insert-linq-to-entities
I am writing a small application that does a lot of feed processing. I want to use LINQ to EF for this since speed is not an issue; it is a single-user app and, in the end, will only be used once a month.
My question revolves around the best way to do bulk inserts using LINQ to EF.
After parsing the incoming data stream I end up with a List of values. Since the end user may end up trying to import some duplicate data I would like to "clean" the data during insert rather than reading all the records, doing a for loop, rejecting records, then finally importing the remainder.
This is what I am currently doing:
DateTime minDate = dataTransferObject.Min(c => c.DoorOpen);
DateTime maxDate = dataTransferObject.Max(c => c.DoorOpen);
using (LabUseEntities myEntities = new LabUseEntities())
{
var recCheck = myEntities.ImportDoorAccess.Where(a => a.DoorOpen >= minDate && a.DoorOpen <= maxDate).ToList();
if (recCheck.Count > 0)
{
foreach (ImportDoorAccess ida in recCheck)
{
DoorAudit da = dataTransferObject.Where(a => a.DoorOpen == ida.DoorOpen && a.CardNumber == ida.CardNumber).FirstOrDefault();
if (da != null)
da.DoInsert = false;
}
}
ImportDoorAccess newIDA;
foreach (DoorAudit newDoorAudit in dataTransferObject)
{
if (newDoorAudit.DoInsert)
{
newIDA = new ImportDoorAccess
{
CardNumber = newDoorAudit.CardNumber,
Door = newDoorAudit.Door,
DoorOpen = newDoorAudit.DoorOpen,
Imported = newDoorAudit.Imported,
RawData = newDoorAudit.RawData,
UserName = newDoorAudit.UserName
};
myEntities.AddToImportDoorAccess(newIDA);
}
}
myEntities.SaveChanges();
}
I am also getting this error:
System.Data.UpdateException was unhandled
Message="Unable to update the EntitySet 'ImportDoorAccess' because it has a DefiningQuery and no element exists in the element to support the current operation."
Source="System.Data.SqlServerCe.Entity"
What am I doing wrong?
Any pointers are welcome.
You can do multiple inserts this way.
I've seen the exception you're getting in cases where the model (EDMX) is not set up correctly. You either don't have a primary key (EntityKey in EF terms) on that table, or the designer has tried to guess what the EntityKey should be. In the latter case, you'll see two or more properties in the EDM Designer with keys next to them.
Make sure the ImportDoorAccess table has a single primary key and refresh the model.