Improve the query - c#

I have two List<T> collections. One holds new objects, let's call it NewCol, and the other holds what I have read out of my database (let's call it CurrentCol). CurrentCol is about 1.2 million rows (and will grow daily), and NewCol can be anything from 1 to thousands of rows at a time. Can anyone please help me improve the following code, as it runs way too slow:
foreach (PortedNumberCollection element in NewCol)
{
    if (CurrentCol.AsParallel().Select(x => x.MSISDN).Contains(element.MSISDN))
    {
        CurrentCol.AsParallel().Where(y => y.MSISDN == element.MSISDN)
            .Select(x =>
            {
                //if the MSISDN exists in both the old and new collection
                //then update the old collection
                x.RouteAction = element.RouteAction;
                x.RoutingLabel = element.RoutingLabel;
                return x;
            });
    }
    else
    {
        //The item does not exist in the old collection so add it
        CurrentCol.Add(element);
    }
}

It's a pretty bad idea to read the whole database into memory and search it there. Searching is what databases are optimized for. The best thing you can do is to move that whole code into the database somehow, for example by executing a MERGE statement for each item in NewCol.
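For example, here is a minimal MERGE sketch with plain ADO.NET. It's only a sketch: it assumes SQL Server, a table named PortedNumbers with MSISDN, RouteAction and RoutingLabel columns, and a connectionString defined elsewhere, so adjust the names to your schema.
// Requires: using System.Data.SqlClient;
const string mergeSql = @"
    MERGE PortedNumbers AS target
    USING (SELECT @MSISDN AS MSISDN) AS source
    ON target.MSISDN = source.MSISDN
    WHEN MATCHED THEN
        UPDATE SET RouteAction = @RouteAction, RoutingLabel = @RoutingLabel
    WHEN NOT MATCHED THEN
        INSERT (MSISDN, RouteAction, RoutingLabel)
        VALUES (@MSISDN, @RouteAction, @RoutingLabel);";

using (var connection = new SqlConnection(connectionString))
{
    connection.Open();
    foreach (var element in NewCol)
    {
        using (var command = new SqlCommand(mergeSql, connection))
        {
            // Each new item is upserted in the database; nothing is loaded into memory.
            command.Parameters.AddWithValue("@MSISDN", element.MSISDN);
            command.Parameters.AddWithValue("@RouteAction", element.RouteAction);
            command.Parameters.AddWithValue("@RoutingLabel", element.RoutingLabel);
            command.ExecuteNonQuery();
        }
    }
}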
However, if you can't do that, the next best thing would be to make CurrentCol a dictionary with MSISDN as the key:
// Somewhere... Only once, not every time you want to add new items.
// In fact, there shouldn't be any CurrentCol, only the dictionary.
var currentItems = CurrentCol.ToDictionary(x => x.MSISDN, x => x);

// ...

foreach (var newItem in NewCol)
{
    PortedNumberCollection existingItem;
    if (currentItems.TryGetValue(newItem.MSISDN, out existingItem))
    {
        existingItem.RouteAction = newItem.RouteAction;
        existingItem.RoutingLabel = newItem.RoutingLabel;
    }
    else
    {
        currentItems.Add(newItem.MSISDN, newItem);
    }
}


How to Optimize Code Performance in .NET [closed]

I have an export job migrating data from an old database into a new database. The problem I'm having is that the user population is around 3 million and the job takes a very long time to complete (15+ hours). The machine I am using only has 1 processor so I'm not sure if threading is what I should be doing. Can someone help me optimize this code?
static void ExportFromLegacy()
{
    var exportStart = DateTime.Now;
    var usersQuery = _oldDb.users.Where(x => x.status == "active");
    int BatchSize = 1000;
    var errorCount = 0;
    var successCount = 0;
    var batchCount = 0;
    // Using MoreLinq's Batch for sequences
    // https://www.nuget.org/packages/MoreLinq.Source.MoreEnumerable.Batch
    foreach (IEnumerable<users> batch in usersQuery.Batch(BatchSize))
    {
        Console.WriteLine(String.Format("Batch count at {0}", batchCount));
        batchCount++;
        foreach (var user in batch)
        {
            try
            {
                var userData = _oldDb.userData.Where(x => x.user_id == user.user_id).ToList();
                if (userData.Count > 0)
                {
                    // Insert into table
                    var newData = new newData()
                    {
                        UserId = user.user_id // shortened code for brevity.
                    };
                    _db.newUserData.Add(newData);
                    _db.SaveChanges();
                    // Insert item(s) into table
                    foreach (var item in userData)
                    {
                        if (!_db.userDataItems.Any(x => x.id == item.id))
                        {
                            var newItem = new Item()
                            {
                                UserId = user.user_id, // shortened code for brevity.
                                DataId = newData.id // id from the object created above
                            };
                            _db.userDataItems.Add(newItem);
                        }
                        _db.SaveChanges();
                        successCount++;
                    }
                }
            }
            catch (Exception ex)
            {
                errorCount++;
                Console.WriteLine(String.Format("Error saving changes for user_id: {0} at {1}.", user.user_id.ToString(), DateTime.Now));
                Console.WriteLine("Message: " + ex.Message);
                Console.WriteLine("InnerException: " + ex.InnerException);
            }
        }
    }
    Console.WriteLine(String.Format("End at {0}...", DateTime.Now));
    Console.WriteLine(String.Format("Successful imports: {0} | Errors: {1}", successCount, errorCount));
    Console.WriteLine(String.Format("Total running time: {0}", (DateTime.Now - exportStart).ToString(@"hh\:mm\:ss")));
}
Unfortunately, the major issue is the number of database round-trips.
You make a round-trip:
For every user, to retrieve the user data by user id from the old database
For every user, to save the user data in the new database
For every user, to save the user data items in the new database
So if you have 3 million users, and every user has an average of 5 user data items, you make at least 3m + 3m + 15m = 21 million database round-trips, which is insane.
The only way to dramatically improve performance is by reducing the number of database round-trips.
Batch - Retrieve user by id
You can quickly reduce the number of database round-trips by retrieving all the user data at once, and since you don't have to track the entities, use AsNoTracking() for even more performance gains.
var list = batch.Select(x => x.user_id).ToList();
var userDatas = _oldDb.userData
    .AsNoTracking()
    .Where(x => list.Contains(x.user_id))
    .ToList();

foreach (var userData in userDatas)
{
    // ...
}
This change alone should already save you a few hours.
Batch - Save Changes
Every time you save a user data row or an item, you perform a database round-trip.
Disclaimer: I'm the owner of the project Entity Framework Extensions
This library allows you to perform:
BulkSaveChanges
BulkInsert
BulkUpdate
BulkDelete
BulkMerge
You can either call BulkSaveChanges at the end of the batch, or build a list to insert and use BulkInsert directly for even more performance.
You will, however, have to set a relationship to the newData instance instead of using the ID directly.
foreach (IEnumerable<users> batch in usersQuery.Batch(BatchSize))
{
    // Retrieve all user data for the batch at once.
    var list = batch.Select(x => x.user_id).ToList();
    var userDatas = _oldDb.userData
        .AsNoTracking()
        .Where(x => list.Contains(x.user_id))
        .ToList();

    // Create lists used for BulkInsert
    var newDatas = new List<newData>();
    var newDataItems = new List<Item>();

    foreach (var userData in userDatas)
    {
        // newDatas.Add(newData);
        // newDataItem.OwnerData = newData;
        // newDataItems.Add(newDataItem);
    }

    _db.BulkInsert(newDatas);
    _db.BulkInsert(newDataItems);
}
EDIT: Answering the subquestion
One of the properties of a newDataItem is the id of newData (e.g.
newDataItem.newDataId). So newData would have to be saved first in
order to generate its id. How would I BulkInsert if there is a
dependency on another object?
You must use navigation properties instead. By using a navigation property, you never have to specify the parent id; you set the parent object instance instead.
public class UserData
{
    public int UserDataID { get; set; }
    // ... properties ...
    public List<UserDataItem> Items { get; set; }
}

public class UserDataItem
{
    public int UserDataItemID { get; set; }
    // ... properties ...
    public UserData OwnerData { get; set; }
}

var userData = new UserData();
var userDataItem = new UserDataItem();

// Use the navigation property to set the parent.
userDataItem.OwnerData = userData;
Tutorial: Configure One-to-Many Relationship
Also, I don't see a BulkSaveChanges in your example code. Would that
have to be called after all the BulkInserts?
BulkInsert inserts directly into the database. You don't have to call SaveChanges or BulkSaveChanges; once you invoke the method, it's done ;)
Here is an example using BulkSaveChanges:
foreach (IEnumerable<users> batch in usersQuery.Batch(BatchSize))
{
    // Retrieve all user data for the batch at once.
    var list = batch.Select(x => x.user_id).ToList();
    var userDatas = _oldDb.userData
        .AsNoTracking()
        .Where(x => list.Contains(x.user_id))
        .ToList();

    // Create lists used for BulkSaveChanges
    var newDatas = new List<newData>();
    var newDataItems = new List<Item>();

    foreach (var userData in userDatas)
    {
        // newDatas.Add(newData);
        // newDataItem.OwnerData = newData;
        // newDataItems.Add(newDataItem);
    }

    var context = new UserContext();
    context.userDatas.AddRange(newDatas);
    context.userDataItems.AddRange(newDataItems);
    context.BulkSaveChanges();
}
BulkSaveChanges is slower than BulkInsert because it has to use some internal methods from Entity Framework, but it is still way faster than SaveChanges.
In the example, I create a new context for every batch to avoid memory issues and gain some performance. If you re-use the same context for all batches, you will end up with millions of tracked entities in the ChangeTracker, which is never a good idea.
Entity Framework is a very bad choice for importing large amounts of data. I know this from personal experience.
That being said, I found a few ways to optimize things when I tried to use it in the same way you are.
The Context will cache objects as you add them, and the more inserts you do, the slower future inserts will get. My solution was to limit each context to about 500 inserts before I disposed of that instance and created a new one. This boosted performance significantly.
I was able to make use of multiple threads to increase performance, but you will have to be very careful about resource contention. Each thread will definitely need its own Context; don't even think about trying to share it between threads. My machine had 8 cores, so threading will probably not help you as much; with a single core I doubt it will help you at all.
Turn off change tracking with AutoDetectChangesEnabled = false; change tracking is incredibly slow. Unfortunately this means you have to modify your code to make all changes directly through the context. No more Entity.Property = "Some Value"; it becomes Context.Entity(e => e.Property).SetValue("Some Value") (or something like that, I don't remember the exact syntax), which makes the code ugly.
Any queries you do should definitely use AsNoTracking.
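Putting those points together, here is a rough sketch of the pattern (a sketch only, assuming EF6; NewDbContext, newUserData and newData are placeholders modelled on the question's code, and 500 is just an illustrative threshold):
// Recycle the context every N inserts so the ChangeTracker never grows too large,
// keep automatic change detection off, and read the source data without tracking.
const int contextResetThreshold = 500;
var context = new NewDbContext();
context.Configuration.AutoDetectChangesEnabled = false;
int pendingInserts = 0;

foreach (var user in _oldDb.users.AsNoTracking().Where(x => x.status == "active"))
{
    context.newUserData.Add(new newData { UserId = user.user_id }); // shortened, as in the question
    pendingInserts++;

    if (pendingInserts % contextResetThreshold == 0)
    {
        context.SaveChanges();
        context.Dispose();                    // throw away the tracked entities
        context = new NewDbContext();
        context.Configuration.AutoDetectChangesEnabled = false;
    }
}

context.SaveChanges();                        // flush the final partial batch
context.Dispose();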
With all that, I was able to cut a ~20 hour process down to about 6 hours, but I still don't recommend using EF for this. It was an extremely painful project, due almost entirely to my poor choice of EF for adding data. Please use something else... anything else...
I don't want to give the impression that EF is a bad data access library; it is great at what it was designed to do. Unfortunately, this is not what it was designed for.
I can think of a few options.
1) You could get a little speed increase by moving your _db.SaveChanges() below the closing bracket of your foreach()
foreach (...)
{
    ...
}
successCount += _db.SaveChanges();
2) Add items to a list, and then add the list to the context
List<ObjClass> list = new List<ObjClass>();
foreach (...)
{
    list.Add(new ObjClass() { ... });
}
_db.newUserData.AddRange(list);
successCount += _db.SaveChanges();
3) If it's a big amount of data, save in bunches
List<ObjClass> list = new List<ObjClass>();
int cnt = 0;
foreach (...)
{
    list.Add(new ObjClass() { ... });
    if (++cnt % 100 == 0) // bunches of 100
    {
        _db.newUserData.AddRange(list);
        successCount += _db.SaveChanges();
        list.Clear();
        // Optional if a HUGE amount of data
        if (cnt % 1000 == 0)
        {
            _db = new MyDbContext();
        }
    }
}
// Don't forget that!
_db.newUserData.AddRange(list);
successCount += _db.SaveChanges();
list.Clear();
4) If it's TOOOO big, consider using bulk inserts. There are a few examples on the internet and a few free libraries around.
Ref: https://blogs.msdn.microsoft.com/nikhilsi/2008/06/11/bulk-insert-into-sql-from-c-app/
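For illustration, here is a bare-bones SqlBulkCopy sketch along the lines of that article. The destination table newUserData, its columns and the rowsToInsert collection are assumptions; map them to your real schema.
// Requires: using System.Data; and using System.Data.SqlClient;
// Build a DataTable in memory and push it to the server in one bulk operation.
var table = new DataTable();
table.Columns.Add("UserId", typeof(int));
table.Columns.Add("DataId", typeof(int));

foreach (var row in rowsToInsert)   // rowsToInsert: whatever you collected in your loop
{
    table.Rows.Add(row.UserId, row.DataId);
}

using (var connection = new SqlConnection(connectionString))
{
    connection.Open();
    using (var bulkCopy = new SqlBulkCopy(connection))
    {
        bulkCopy.DestinationTableName = "newUserData";
        bulkCopy.BatchSize = 10000;
        bulkCopy.WriteToServer(table);
    }
}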
With most of these options you lose some control over error handling, as it is difficult to know which record failed.

c# linq list with varying where conditions

private void getOrders()
{
    try
    {
        //headerFileReader is assigned with a CSV file (not shown here).
        while (!headerFileReader.EndOfStream)
        {
            headerRow = headerFileReader.ReadLine();
            getOrderItems(headerRow.Substring(0, 8));
        }
    }
    catch (Exception ex)
    {
        //handle/log errors here
    }
}

private void getOrderItems(string ordNum)
{
    // lines is an array assigned with a CSV file...not shown here.
    var sorted = lines.Skip(1)
        .Select(line => new
        {
            SortKey = (line.Split(delimiter)[1]),
            Line = line
        })
        .OrderBy(x => x.SortKey)
        .Where(x => x.SortKey == ordNum);
    //Note ordNum is different every time it is passed.
    foreach (var orderItems in sorted)
    {
        //Process each line here.
    }
}
Above is my code. For every order number from the header file, I process the detail lines. I would like to search only for the lines specific to that order number. The logic above works fine, but it re-reads and filters all the lines with the Where clause for every order number, which simply isn't required and delays the process.
I basically want getOrderItems to look something like the code below, but I can't get there because sorted can't be passed in, though I think it should be possible:
private void getOrderItems(string ordNum)
{
    // I would like to have sorted loaded with data elsewhere, pass it to this
    // function and reference it by other means, but I am not able to get it.
    var newSorted = sorted.Where(x => x.SortKey == docNum);
    foreach (var orderItems in newSorted)
    {
        //Process each line here.
    }
}
Please suggest.
UPDATE: Thanks for the responses & improvements, but my main question is that I don't want to create the list every time (as shown in my code). What I want is to create the list once and then only search within it for a particular value (here docNum, as shown). Please suggest.
It might be a good idea to preprocess your input lines and build a dictionary, where each distinct sort key maps to a list of lines. Building the dictionary is O(n), and after that you get constant time O(1) lookups:
// these are your unprocessed file lines
private string[] lines;

// this dictionary will map each `string` key to a `List<string>`
private Dictionary<string, List<string>> groupedLines;

// this is the method where you are loading your files (you didn't include it)
void PreprocessInputData()
{
    // you already have this part somewhere
    lines = LoadLinesFromCsv();

    // after loading, group the lines by `line.Split(delimiter)[1]`
    groupedLines = lines
        .Skip(1)
        .GroupBy(line => line.Split(delimiter)[1])
        .ToDictionary(x => x.Key, x => x.ToList());
}

private void ProcessOrders()
{
    while (!headerFileReader.EndOfStream)
    {
        var headerRow = headerFileReader.ReadLine();
        List<string> itemsToProcess = null;
        if (groupedLines.TryGetValue(headerRow, out itemsToProcess))
        {
            // if you are here, then
            // itemsToProcess contains all lines where
            // (line.Split(delimiter)[1]) == headerRow
        }
        else
        {
            // no such key in the dictionary
        }
    }
}
The following will get you what you want and is also more efficient.
var sorted = lines.Skip(1)
    .Where(line => line.Split(delimiter)[1] == ordNum)
    .Select(line => new
    {
        SortKey = (line.Split(delimiter)[1]),
        Line = line
    })
    .OrderBy(x => x.SortKey);

Compare one value to several possible list matches

I'm back to haunt your dreams! I'm working on comparing some values in a complex loop. List 1 is a list of questions/answers, and List 2 is also a list of questions/answers. I want to compare List 1 to List 2 and have duplicates removed from List 1 before merging it with List 2. My problem is that, in the current seed data, the two items in List 1 match against List 2, but only one is removed instead of both.
I've been at this a couple days and my head is ready to explode, so I hope I can find some help!
Here's code for you:
//Fetching questions/answers which do not have an attempt
//Get questions, which automatically pull associated answers thanks to the model
List<QuizQuestions> notTriedQuestions = await db.QuizQuestions.Where(x => x.QuizID == report.QuizHeader.QuizID).ToListAsync();

//Compare to existing attempt data and remove duplicate questions
int i = 0;
while (i < notTriedQuestions.Count)
{
    var originalAnswersCount = notTriedQuestions.ElementAt(i).QuizAnswers.Count;
    int j = 0;
    while (j < originalAnswersCount)
    {
        var comparedID = notTriedQuestions.ElementAt(i).QuizAnswers.ElementAt(j).AnswerID;
        if (report.QuizHeader.QuizQuestions.Any(item => item.QuizAnswers.Any(x => x.AnswerID == comparedID)))
        {
            notTriedQuestions.RemoveAt(i);
            //Trip the while value and break out of the loop, otherwise you end up in the catch
            j = originalAnswersCount;
        }
        else
        {
            j++;
        }
    }
    i++;
}

//Add filtered list to master list
foreach (var item in notTriedQuestions)
{
    report.QuizQuestions.Add(item);
}
Try List.Union. It is meant for exactly this sort of thing.
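A sketch of what that could look like; it assumes the questions can be matched on a QuestionID property (your real duplicate check keys off AnswerID inside QuizAnswers, so the comparer would need adjusting):
// Hypothetical comparer: assumes QuizQuestions exposes a QuestionID that identifies duplicates.
class QuestionIdComparer : IEqualityComparer<QuizQuestions>
{
    public bool Equals(QuizQuestions a, QuizQuestions b)
    {
        return a.QuestionID == b.QuestionID;
    }

    public int GetHashCode(QuizQuestions q)
    {
        return q.QuestionID.GetHashCode();
    }
}

// Union keeps everything from the first sequence and only the items from the second
// sequence that are not already present, so the duplicates fall away in one call.
var merged = report.QuizHeader.QuizQuestions
    .Union(notTriedQuestions, new QuestionIdComparer())
    .ToList();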

Newbie performance issue with foreach ...need advice

This section simply reads from an excel spreadsheet. This part works fine with no performance issues.
IEnumerable<ImportViewModel> so = data.Select(row => new ImportViewModel
{
    PersonId = (row.Field<string>("person_id")),
    ValidationResult = ""
}).ToList();
Before I pass the model to a View I want to set ValidationResult, so I have this piece of code. If I comment it out, the model is passed to the view quickly. When I use the foreach it takes over a minute. If I hardcode a value for item.PersonId then it runs quickly. I know I'm doing something wrong; I'm just not sure where to start and what best practice I should be following.
foreach (var item in so)
{
    if (db.Entity.Any(w => w.ID == item.PersonId))
    {
        item.ValidationResult = "Successful";
    }
    else
    {
        item.ValidationResult = "Error: ";
    }
}
return View(so.ToList());
You are now performing a database call per item in your list. This is really hard on your database and thus your performance. Try to iterate through your Excel result, gather all the users, and select them in one query. Make a list from this query result (otherwise the query is executed every time you access it). Then perform a match between the result list and your Excel data.
You need to do something like this:
var ids = so.Select(i => i.PersonId).Distinct().ToList();
// Hitting the database just this once to get all the user ids
var usersIds = db.Entity.Where(u => ids.Contains(u.ID)).Select(u => u.ID).ToList();
foreach (var item in so)
{
    if (usersIds.Contains(item.PersonId))
    {
        item.ValidationResult = "Successful";
    }
    else
    {
        item.ValidationResult = "Error: ";
    }
}
return View(so.ToList());

Compare adjacent list items

I'm writing a duplicate file detector. To determine if two files are duplicates I calculate a CRC32 checksum. Since this can be an expensive operation, I only want to calculate checksums for files that have another file with matching size. I have sorted my list of files by size, and am looping through to compare each element to the ones above and below it. Unfortunately, there is an issue at the beginning and end since there will be no previous or next file, respectively. I can fix this using if statements, but it feels clunky. Here is my code:
public void GetCRCs(List<DupInfo> dupInfos)
{
    var crc = new Crc32();
    for (int i = 0; i < dupInfos.Count(); i++)
    {
        if (dupInfos[i].Size == dupInfos[i - 1].Size || dupInfos[i].Size == dupInfos[i + 1].Size)
        {
            dupInfos[i].CheckSum = crc.ComputeChecksum(File.ReadAllBytes(dupInfos[i].FullName));
        }
    }
}
My question is:
How can I compare each entry to its neighbors without the out of bounds error?
Should I be using a loop for this, or is there a better LINQ or other function?
Note: I did not include the rest of my code to avoid clutter. If you want to see it, I can include it.
Compute the CRCs first:
// It is assumed that DupInfo.CheckSum is nullable
public void GetCRCs(List<DupInfo> dupInfos)
{
    var crc = new Crc32();
    dupInfos[0].CheckSum = null;
    for (int i = 1; i < dupInfos.Count(); i++)
    {
        dupInfos[i].CheckSum = null;
        if (dupInfos[i].Size == dupInfos[i - 1].Size)
        {
            if (dupInfos[i - 1].CheckSum == null)
                dupInfos[i - 1].CheckSum = crc.ComputeChecksum(File.ReadAllBytes(dupInfos[i - 1].FullName));
            dupInfos[i].CheckSum = crc.ComputeChecksum(File.ReadAllBytes(dupInfos[i].FullName));
        }
    }
}
After having sorted your files by size and CRC, identify the duplicates:
public void GetDuplicates(List<DupInfo> dupInfos)
{
    // loop is inverted to allow list item deletion
    for (int i = dupInfos.Count() - 1; i > 0; i--)
    {
        if (dupInfos[i].Size == dupInfos[i - 1].Size &&
            dupInfos[i].CheckSum != null &&
            dupInfos[i].CheckSum == dupInfos[i - 1].CheckSum)
        {
            // i is a duplicate of i-1
            ... // your code here
            ... // eventually, dupInfos.RemoveAt(i);
        }
    }
}
I have sorted my list of files by size, and am looping through to
compare each element to the ones above and below it.
The next logical step is to actually group your files by size. Comparing consecutive files will not always be sufficient if you have more than two files of the same size. Instead, you will need to compare every file to every other same-sized file.
I suggest taking this approach:
Use LINQ's .GroupBy to create a collection of file-size groups. Then use .Where to keep only the groups with more than one file.
Within those groups, calculate the CRC32 checksum and add it to a collection of known checksums, comparing against the previously calculated ones. If you need to know which files specifically are duplicates, you could use a dictionary keyed by this checksum (you can achieve this with another GroupBy); otherwise a simple list will suffice to detect any duplicates.
The code might look something like this:
var filesSetsWithPossibleDupes = files.GroupBy(f => f.Length)
    .Where(group => group.Count() > 1);

foreach (var grp in filesSetsWithPossibleDupes)
{
    var checksums = new List<CRC32CheckSum>(); //or whatever type
    foreach (var file in grp)
    {
        var currentCheckSum = crc.ComputeChecksum(file);
        if (checksums.Contains(currentCheckSum))
        {
            //Found a duplicate
        }
        else
        {
            checksums.Add(currentCheckSum);
        }
    }
}
Or if you need the specific objects that could be duplicates, the inner foreach loop might look like
var filesSetsWithPossibleDupes = files.GroupBy(f => f.FileSize)
    .Where(grp => grp.Count() > 1);

var masterDuplicateDict = new Dictionary<DupStats, IEnumerable<DupInfo>>();
//A dictionary keyed by the basic duplicate stats,
//and whose value is a collection of the possible duplicates

foreach (var grp in filesSetsWithPossibleDupes)
{
    var likelyDuplicates = grp.GroupBy(dup => dup.Checksum)
        .Where(g => g.Count() > 1);
    //Same GroupBy logic, but applied to the checksum (instead of file size)

    foreach (var dupGrp in likelyDuplicates)
    {
        //Create the key for the dictionary (your code is likely different)
        var sample = dupGrp.First();
        var key = new DupStats() { FileSize = sample.FileSize, Checksum = sample.Checksum };
        masterDuplicateDict.Add(key, dupGrp);
    }
}
A demo of this idea.
I think the for loop should be: for (int i = 1; i < dupInfos.Count() - 1; i++)
var grps = dupInfos.GroupBy(d => d.Size);
grps.Where(g => g.Count() > 1).ToList().ForEach(g =>
{
    // ...
});
Can you do a union between your two lists? If you have a list of filenames and do a union it should result in only a list of the overlapping files. I can write out an example if you want but this link should give you the general idea.
https://stackoverflow.com/a/13505715/1856992
Edit: Sorry, for some reason I thought you were comparing file names, not sizes.
So here is an actual answer for you.
using System;
using System.Collections.Generic;
using System.Linq;

public class ObjectWithSize
{
    public int Size { get; set; }

    public ObjectWithSize(int size)
    {
        Size = size;
    }
}

public class Program
{
    public static void Main()
    {
        Console.WriteLine("start");
        var list = new List<ObjectWithSize>();
        list.Add(new ObjectWithSize(12));
        list.Add(new ObjectWithSize(13));
        list.Add(new ObjectWithSize(14));
        list.Add(new ObjectWithSize(14));
        list.Add(new ObjectWithSize(18));
        list.Add(new ObjectWithSize(15));
        list.Add(new ObjectWithSize(15));

        var duplicates = list.GroupBy(x => x.Size)
                             .Where(g => g.Count() > 1);

        foreach (var dup in duplicates)
            foreach (var objWithSize in dup)
                Console.WriteLine(objWithSize.Size);
    }
}
This will print out
14
14
15
15
Here is a .NET Fiddle for that.
https://dotnetfiddle.net/0ub6Bs
Final note: I actually think your answer looks better and will run faster. This was just an implementation in LINQ.
