Keeping track of a long repetitive operation - C#

In my code I need to perform an operation on lots of different postcodes. Each operation can take up to a few seconds, and a single run can include thousands of postcodes that need to be processed. I need to be able to keep track of the run and restart it if it is terminated for whatever reason.
I had the idea to create a search entity like so:
public class Search
{
public int ID { get; set; }
public virtual ICollection<PostCode> PostCodes { get; set; }
}
public class PostCode
{
public int ID { get; set; }
public string Value { get; set; }
}
I am having trouble working out how to keep track of it. My first thought was to remove each PostCode from the collection in Search after it was successfully processed and save the change, so that each time I load the object it only contains the unprocessed postcodes, but when I do this it throws an exception.
Like so:
using (var db = new MyDbContext())
{
foreach (var pc in search.PostCodes)
{
DoSomeStuff(pc);
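// removing an item from the collection we are currently enumerating is what
// throws (an InvalidOperationException on the next iteration)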
search.PostCodes.Remove(pc);
db.SaveChanges();
}
}
I understand this is because I can't modify the collection I am enumerating over, but for some reason I can't think of a simple, non-convoluted way to keep track of the search in the database. Also, because there are many thousands of postcodes, I am concerned about performance.
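For what it's worth, enumerating a snapshot of the collection avoids the exception itself, though it still leaves the tracking and performance questions open:
using (var db = new MyDbContext())
{
    // ToList() copies the collection, so removing items from the real
    // navigation property no longer invalidates the enumeration
    foreach (var pc in search.PostCodes.ToList())
    {
        DoSomeStuff(pc);
        search.PostCodes.Remove(pc);
        db.SaveChanges();
    }
}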
Can anyone suggest how I should be keeping track of the search?
EDIT
So if I keep track of the number processed in a Completed property, will this work properly?
using (var db = new MyDbContext())
{
foreach (var pc in search.PostCodes.Skip(search.Completed))
{
DoSomeStuff(pc);
search.Completed++;
db.SaveChanges();
}
}
If I save and then load an entity with Entity Framework, does it always keep the IEnumerable in the same order?
Edit #2
This is where in my actual code I load the search object:
rs = db.RadarSearches.Include("PostCodes").FirstOrDefault(x => x.Keyword.Value == keyword && x.Complete == false);
I guess if I enforce ordering here then I can make it consistent. How do I order the nested PostCodes collection?
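For illustration, a rough sketch of what I have in mind, combining an explicit order with the Completed counter from the edit above:
rs = db.RadarSearches
    .Include("PostCodes")
    .FirstOrDefault(x => x.Keyword.Value == keyword && x.Complete == false);

// Ordering the loaded collection explicitly means the Skip(Completed) resume
// point no longer depends on whatever order EF happened to return the rows in.
foreach (var pc in rs.PostCodes.OrderBy(p => p.ID).Skip(rs.Completed))
{
    DoSomeStuff(pc);
    rs.Completed++;
    db.SaveChanges();
}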

Store the result as an array and save that array to a file so that you can resume (assuming the search result may vary at different times). Alongside the array, save the index of the last successfully processed item; when your process resumes, load the array and that index from the file and carry on from there.
You could store the whole object and remove successfully processed items, but for performance's sake only save the file every 5th or 10th item processed.
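A minimal sketch of the resume-by-index part (the progress file name, the CheckpointEvery value and the postcodes array are placeholders, and persisting the array itself is left out):
// requires System.IO; 'postcodes' stands for the array captured at the start
// of the run and DoSomeStuff for the per-item operation from the question
const int CheckpointEvery = 10;
string progressFile = "search-progress.txt"; // placeholder path

// resume from the last checkpoint if one exists, otherwise start at 0
int startIndex = File.Exists(progressFile)
    ? int.Parse(File.ReadAllText(progressFile))
    : 0;

for (int i = startIndex; i < postcodes.Length; i++)
{
    DoSomeStuff(postcodes[i]);

    // persist progress every few items instead of after every single one
    if (i % CheckpointEvery == 0 || i == postcodes.Length - 1)
    {
        File.WriteAllText(progressFile, (i + 1).ToString());
    }
}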


Why the process memory increases when fetching a lot of data from DB via entity framework in buckets

In the following code in each iteration of the loop I do the following things:
Fetch 5000 entities from DB
According to the logic in FilterRelationsToDelete I decide which entities to delete
I add the ids of entities to delete to a collection
After I finish the loop, I delete the entities from the DB according to the idsToDelete collection.
I saw in the Visual Studio Diagnostic Tools that the process memory rises at the beginning of each loop iteration and drops by about half after the iteration finishes. My problem is that sometimes it rises to 800MB and drops to 400MB, sometimes it stays steady at 200MB, and sometimes it goes over 1GB, drops to 500MB, and stays steady there.
I am not sure why my process memory is not steady at 200MB with small spikes when the data arrives from the DB. What might be the reasons for that? Maybe Entity Framework does not free all the memory it used? Maybe the GC I invoke here on purpose does not clean up all the memory as I expected? Maybe I have a bug here that I am not aware of?
The memory of the list of longs that I accumulate in idsToDelete is almost zero, so this is not the problem.
Is there any way to write this code better?
private static void PlayWithMemory()
{
int bucketSize = 5000;
List<long> idsToDelete = new List<long>();
for (int i = 0; i < 500; i++)
{
System.GC.Collect();//added just for this example
using (var context = new PayeeRelationsContext())
{
int toSkip = i * bucketSize;
List<PayeeRelation> dbPayeeRelations = GetDBRelations(context, toSkip, bucketSize);
var relationsToDelete = FilterRelationsToDelete(dbPayeeRelations);
List<long> ids = relationsToDelete.Select(x => x.id).ToList();
idsToDelete.AddRange(ids);
Console.WriteLine($"i = {i}, toSkip = {toSkip}, dbPayeeRelations.Count = {dbPayeeRelations.Count}");
}
}
}
private static List<PayeeRelation> GetDBRelations(PayeeRelationsContext context, int toSkip,
int bucketSize)
{
return context.PayeeRelations
.OrderBy(x => x.id)
.Include(x => x.PayeeRelation_PayeeVersion)
.Skip(toSkip)
.Take(bucketSize)
.AsNoTracking()
.ToList();
}
I don't see anything inherently wrong with your code to indicate a memory leak. I believe what you are observing is simply that the garbage collection does not fully "release" memory as soon as the references are deemed unused or out of scope.
If memory use/allocation is a concern then you should consider projecting down to the minimal viable data you need to validate in order to identify which IDs need to be deleted. For example, if you need the ID and Field1 from the PayeeRelations, plus Field2 and Field3 from the related PayeeVersion:
private class RelationValidationDetails
{
public long PayeeRelationId { get; set; }
public string Field1 { get; set; }
public string Field2 { get; set; }
public DateTime Field3 { get; set; }
}
...then in your query:
var validationData = context.PayeeRelations
.OrderBy(x => x.id)
.Select(x => new RelationValidationDetails
{
PayeeRelationId = x.id,
Field1 = x.Field1,
Field2 = x.PayeeRelation_PayeeVersion.Field2,
Field3 = x.PayeeRelation_PayeeVersion.Field3
}).Skip(toSkip)
.Take(bucketSize)
.ToList();
Then your validation just takes the above collection of validation details to determine which IDs need to be deleted (assuming it bases this decision on Fields 1-3). This ensures that your query returns exactly the data needed to ultimately get the IDs to delete, minimizing memory growth.
There could be an argument that if a "Field4" is later required for the validation, you would have to update this object definition and revise the query, which is extra work when you could just use the entity. However, Field4 might not come from PayeeRelations or the PayeeVersion; it might come from a different related entity which currently isn't eager loaded. That would add the cost of eager loading another table for every caller of that wrapped GetPayeeRelations call, whether they need that data or not. The alternatives are risking performance hits from lazy loading (by removing the AsNoTracking()) or introducing conditional complexity to tell GetPayeeRelations which relationships need to be eager loaded. Trying to predict this possibility is really just an example of YAGNI.
I generally don't recommend hiding EF queries behind getter methods (such as generic repositories), simply because these tend to form a lowest common denominator while chasing DRY or SRP. The reality is that they end up being single points that are inefficient in many cases, because if any one consumer needs a relationship eager loaded, all consumers get it eager loaded. It's generally far better to let your consumers project down to exactly what they need rather than worry that similar (rather than identical) queries might appear in multiple places.

C# - Sort items of linked list by date

I have a Node class
public class Node
{
public Node(Pupil pupil)
{
Data = pupil;
Next = null;
}
public Pupil Data { get; set; }
public Node Next { get; set; }
}
which holds a Pupil with the student's date of birth. I insert objects into the list like this:
public void Insert(Pupil pupil)
{
if (_head == null)
{
_head = new Node(pupil);
return;
}
Node current = _head;
while (current.Next != null)
{
current = current.Next;
}
current.Next = new Node(pupil);
}
But now I have the problem of sorting the pupils by date of birth (pupil.DateOfBirth). Since I am new to lists etc. I don't have an idea how to do this. Since corona is here, we basically have to teach it to ourselves at university.
Help would be really appreciated.
Suppose you have an object called unorderedList (containing your pupils in an unordered fashion) and an object called orderedList, which initially is empty. Then you can do the following:
1. Take an element of unorderedList. It is not important which element you take. For example, you can always take the first one (which is also a good idea if the list is already partially ordered). Let's call this element temp.
2. Add temp to the correct position of orderedList.
3. Remove temp from unorderedList.
Do steps 1-3 until unorderedList is empty.
This algorithm is called insertion sort. The Wikipedia article on insertion sort also contains an implementation in C. You could take a look at it to get an idea how you could do it in C#.
Insertion sort is a rather primitive sorting approach. Other approaches (such as quicksort, mergesort, etc.) may be faster, but are also harder to implement.
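For a concrete starting point, here is a rough C# sketch of step 2 ("add temp to the correct position"), assuming Pupil exposes a DateTime DateOfBirth property and reusing the _head field from the Insert method above. To sort an existing list, you would repeatedly detach the head of the unordered list and feed each Pupil to this method on a fresh, initially empty list.
public void InsertSorted(Pupil pupil)
{
    Node newNode = new Node(pupil);

    // The new node belongs at the front: the list is empty or the new
    // date of birth is earlier than the head's.
    if (_head == null || pupil.DateOfBirth < _head.Data.DateOfBirth)
    {
        newNode.Next = _head;
        _head = newNode;
        return;
    }

    // Walk the list until the next node's date of birth is later than the
    // new pupil's, then splice the new node in after 'current'.
    Node current = _head;
    while (current.Next != null && current.Next.Data.DateOfBirth <= pupil.DateOfBirth)
    {
        current = current.Next;
    }

    newNode.Next = current.Next;
    current.Next = newNode;
}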

Find the most common property of a model using iterative approach and LINQ

I have the following model class and method stub:
public class Student
{
public string Name { get; set; }
public string Major { get; set; }
public int Age { get; set; }
}
public string GetPrimaryMajor(List<Student> students)
{
...
}
How can I implement the method GetPrimaryMajor() to determine the most commonly occurring Major in the students parameter, using both an iterative and a LINQ approach?
Since this is clearly homework, I'll give you the easy/easier one and you can figure out the iterative approach from there.
public string GetPrimaryMajor(List<Student> students)
{
var mostCommonMajor = students
.GroupBy(m => m.Major)
.OrderByDescending(gp => gp.Count())
.FirstOrDefault()? // null-conditional operator
.Select(s => s.Major)
.FirstOrDefault();
return mostCommonMajor;
}
For the iterative approach, consider the following pseudo-code as one potential simple, iterative (potentially poor performing) algorithm:
// Iterate students
// Track count of each major
// Keep track of most commonly occurring major by comparing count of
// currently iterated value vs current most commonly occurring value count
// Return the most commonly occurring major at the end of the loop.
An iterative approach using a Dictionary, with plenty of comments inside to explain each step.
Finding the max in the second half is certainly much easier using LINQ, but #DavidL has already provided an excellent LINQ answer, so I thought I'd go the other way and use no LINQ whatsoever.
public string GetPrimaryMajor(List<Student> students)
{
//Create a dictionary of string and int. These are our major names and the count of students in that major respectively
Dictionary<string, int> MajorCounts = new Dictionary<string, int>();
//Iterate through all students
foreach (Student stu in students)
{
//Check if we have already found a student with that major
if (MajorCounts.ContainsKey(stu.Major))
{
//If yes add one to the count of students with that major
MajorCounts[stu.Major]++;
}
else
{
//If no create a key for that major, start at one to count the student we just found
MajorCounts.Add(stu.Major, 1);
}
}
//Now that we have our majors and the number of students in each major we need to find the highest one
//Track the best major and its count as we go
string HighestMajor = null;
int HighestCount = 0;
//iterate through all the majors
foreach (KeyValuePair<string, int> majorCount in MajorCounts)
{
//If we find a major with a higher student count, replace our current highest major
if (majorCount.Value > HighestCount)
{
HighestMajor = majorCount.Key;
HighestCount = majorCount.Value;
}
}
//Return the highest major
return HighestMajor;
}
Basically, populate a dictionary using the Major as the string key, increasing the int value by one each time a Student has that major. Then do a basic iteration through the dictionary to find the key with the highest value.

Linq progressive state based query

Firstly apologies for the poor title. Absolutely no idea how to describe this question!
I have a "Relationship" entity that defines a relationship between 2 users.
public class Relationship{
public User User1 {get;set;}
public User User2 {get;set;}
public DateTime StateChangeDate {get;set;}
//RelationshipState is an Enum with int values
public RelationshipState State {get;set;}
}
Relationship state example.
public enum RelationshipState{
state1 = 1,
state2 = 2,
state3 = 3,
state4 = 4
}
A Relationship entity is created each time the RelationshipState changes. So for any pair of users, there will be many Relationship objects. With the most recent being current.
I'm trying to query for any Relationship object that represents a REDUCTION in RelationshipState for a particular pair of users.
That is, across all the Relationship objects for all the users, I want the ones that have a later date than another object for the same pair of users with a higher RelationshipState.
I'm finding it very hard to figure out how to accomplish this without iterating over the entire Relationship table.
First, create a query that returns all the combinations of users together with a child collection listing all their status changes. For more information, google LINQ GroupBy.
Then, using your collection, filter out all the ones you don't want by looking at the last two status changes and seeing whether the state has gone down.
Here's an example, tested in LinqPad as a C# Program:
public enum RelationshipState {
state1 = 1,
state2 = 2,
state3 = 3,
state4 = 4
}
public class User {
public int id {get;set;}
}
public class Relationship{
public User User1{get;set;}
public User User2{get;set;}
public DateTime StateChangeDate {get;set;}
//RelationshipState is an Enum with int values
public RelationshipState State {get;set;}
}
void Main()
{
var rs=new List<Relationship>() {
new Relationship{ User1=new User{id=1},User2=new User{id=2},StateChangeDate=DateTime.Parse("1/1/2013"),State=RelationshipState.state2},
new Relationship{ User1=new User{id=1},User2=new User{id=2},StateChangeDate=DateTime.Parse("1/2/2013"),State=RelationshipState.state3},
new Relationship{ User1=new User{id=1},User2=new User{id=3},StateChangeDate=DateTime.Parse("1/1/2013"),State=RelationshipState.state2},
new Relationship{ User1=new User{id=1},User2=new User{id=3},StateChangeDate=DateTime.Parse("1/2/2013"),State=RelationshipState.state1},
new Relationship{ User1=new User{id=2},User2=new User{id=3},StateChangeDate=DateTime.Parse("1/2/2013"),State=RelationshipState.state1}
};
var result=rs.GroupBy(cm=>new {id1=cm.User1.id,id2=cm.User2.id},(key,group)=>new {Key1=key,Group1=group.OrderByDescending(g=>g.StateChangeDate)})
.Where(r=>r.Group1.Count()>1) // Remove Entries with only 1 status
//.ToList() // This might be needed for Linq-to-Entities
.Where(r=>r.Group1.First().State<r.Group1.Skip(1).First().State) // Only keep relationships where the state has gone down
.Select(r=>r.Group1.First()) //Turn this back into Relationship objects
;
// Use this instead if you want to know if state ever had a higher state than it is currently
// var result=rs.GroupBy(cm=>new {id1=cm.User1.id,id2=cm.User2.id},(key,group)=>new {Key1=key,Group1=group.OrderByDescending(g=>g.StateChangeDate)})
// .Where(r=>r.Group1.First().State<r.Group1.Max(g=>g.State))
// .Select(r=>r.Group1.First())
// ;
result.Dump();
}
Create a stored procedure in the database that uses a cursor to iterate the items and pair each one off with the item before it (and then filter to decreasing state).
Barring that, you can perform an inner query that finds the previous value for each item:
from item in table
let previous =
(from innerItem in table
where innerItem.Date < item.Date
orderby innerItem.Date descending
select innerItem)
.FirstOrDefault()
where previous != null && previous.State > item.State
select item
As inefficient as that seems, it might be worth a try. Perhaps, with the proper indexes, a good query optimizer, and a sufficiently small set of data, it won't be that bad. If it's unacceptably slow, then trying a stored proc with a cursor is most likely going to be the best option.

Multiple Fields Indexed Object Array

public class User
{
public int Id { get; set; }
public int Age { get; set; }
public string Name { get; set; }
}
I have 100k users.
Query: get users whose Name is "Rafael" and whose Age is between 40 and 50.
With LINQ to Objects: users.Where(p => p.Name == "Rafael" && p.Age >= 40 && p.Age <= 50).ToArray();
Is there any alternative implementation with better performance? (Read-only, thread-safe)
(MultiIndexed User Array)
I've tested its performance: for 1000k users it takes 30-50 ms. That seems unimportant, but it is important because I can get 50 requests per second.
With dharnitski's solution it takes 0 ms. :)
But is there any framework that does this transparently? Something like:
public class FastArray<T>
You cannot get the result you want without a full dataset scan if your data is not prepared.
Prepare the data in advance, when time is not critical, and work with sorted data when you need a short response time.
There is an analogy for this in the database world.
There is a table with 100K records, and somebody wants to run a SELECT query with a WHERE clause that filters the data by something other than the primary key. It will always be a slow "table scan" operation in the execution plan unless an index is implemented.
Sample of code that implements indexing using ILookup<TKey, TValue>:
//not sorted array of users - raw data
User[] originalUsers;
//Prepare data in advance (create one index).
//Field with the best distribution should be used as key
ILookup<string, User> preparedUsers = originalUsers.ToLookup(u => u.Name, u => u);
//run this code when you need subset
//search by key is optimized by .NET class
//"where" clause works with small set of data
preparedUsers["Rafael"].Where(p=> p.Age>=40 && p.Age<=50).ToArray();
This code is not as powerful as database indexes (for example it does not support substrings) but it shows the idea.
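Taking the same idea a step further (just a sketch, not an existing framework): if the Age range is also a hot filter, each name bucket can be pre-sorted by Age so the range check only walks a small, ordered slice instead of the whole bucket.
//Prepare once: group by Name, then sort each bucket by Age
Dictionary<string, User[]> usersByNameSortedByAge = originalUsers
    .GroupBy(u => u.Name)
    .ToDictionary(g => g.Key, g => g.OrderBy(u => u.Age).ToArray());

//Query: look up the name bucket, then take only the relevant age slice
User[] bucket;
User[] matches;
if (usersByNameSortedByAge.TryGetValue("Rafael", out bucket))
{
    // the bucket is sorted by Age, so skip everyone below the range and
    // stop as soon as we pass the upper bound
    matches = bucket
        .SkipWhile(u => u.Age < 40)
        .TakeWhile(u => u.Age <= 50)
        .ToArray();
}
else
{
    matches = new User[0];
}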
