public class User
{
public int Id { get; set; }
public int Age { get; set; }
public string Name { get; set; }
}
I have 100k users.
Query: Get Users Whose Name is "Rafael" AND whose age is between 40 and 50
Using LINQ to Objects: users.Where(p => p.Name == "Rafael" && p.Age >= 40 && p.Age <= 50).ToArray();
Is there an alternative implementation with better performance? (Read-only, thread-safe.)
(MultiIndexed User Array)
I've tested its performance. For 1000k users it takes 30-50 ms. That may not seem significant, but it is,
because I can get 50 requests per second.
With dharnitski's solution it takes 0 ms. :)
But is there any framework that makes this transparent?
public class FastArray<T>
You cannot get the result you want without a full dataset scan if your data is not prepared.
Prepare the data in advance when time is not critical, and work with the prepared (indexed) data when you need a short response time.
There is an analogy for this in the database world.
There is a table with 100K records. Somebody wants to run a SELECT query with a WHERE clause that filters by a non-primary-key column. It will always be a slow "table scan" operation in the execution plan unless an index is created.
Sample of code that implements indexing using ILookup<TKey, TValue>:
//not sorted array of users - raw data
User[] originalUsers;
//Prepare data in advance (create one index).
//Field with the best distribution should be used as key
ILookup<string, User> preparedUsers = originalUsers.ToLookup(u => u.Name, u => u);
//run this code when you need subset
//search by key is optimized by .NET class
//"where" clause works with small set of data
preparedUsers["Rafael"].Where(p=> p.Age>=40 && p.Age<=50).ToArray();
This code is not as powerful as database indexes (for example it does not support substrings) but it shows the idea.
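If you want the index hidden behind a reusable type (closer to the FastArray&lt;T&gt; idea in the question), a minimal sketch might look like the following. IndexedUsers and FindByName are made-up names for illustration, not part of an existing framework; the assumption is that the data is loaded once and then only read, which is what makes the lookup safe to share between threads.
using System.Collections.Generic;
using System.Linq;

public class IndexedUsers
{
    // Built once up front; ILookup is immutable afterwards,
    // so concurrent readers need no locking.
    private readonly ILookup<string, User> _byName;

    public IndexedUsers(IEnumerable<User> users)
    {
        _byName = users.ToLookup(u => u.Name);
    }

    public User[] FindByName(string name, int minAge, int maxAge)
    {
        // Key lookup is roughly O(1); the age filter only scans the small bucket for that name.
        return _byName[name]
            .Where(u => u.Age >= minAge && u.Age <= maxAge)
            .ToArray();
    }
}
Usage: build var indexed = new IndexedUsers(originalUsers); once, then indexed.FindByName("Rafael", 40, 50) replaces the full-scan Where.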
In the following code in each iteration of the loop I do the following things:
Fetch 5000 entities from DB
According to the logic in FilterRelationsToDelete I decide which entities to delete
I add the ids of entities to delete to a collection
After I finish the loop I delete from DB the entities according to idsToDelete collection.
I saw in the Visual Studio Diagnostic Tools that the process memory rises at the beginning of each loop iteration and decreases by about half after the iteration finishes. My problem is that sometimes it rises to 800 MB and drops to 400 MB, sometimes it is steady at 200 MB, and sometimes it goes over 1 GB, drops to 500 MB, and stays steady there.
I am not sure why my process memory is not steady at 200 MB with small spikes when the data arrives from the DB. What might be the reasons for that? Maybe Entity Framework does not free all the memory it used? Maybe the GC I invoke here on purpose does not clean all the memory as I expected? Maybe I have a bug here that I am not aware of?
The memory held by the list of longs I accumulate in idsToDelete is close to zero; that is not the problem.
Is there any way to write this code better?
private static void PlayWithMemory()
{
    int bucketSize = 5000;
    List<long> idsToDelete = new List<long>();
    for (int i = 0; i < 500; i++)
    {
        System.GC.Collect(); // added just for this example
        using (var context = new PayeeRelationsContext())
        {
            int toSkip = i * bucketSize;
            List<PayeeRelation> dbPayeeRelations = GetDBRelations(context, toSkip, bucketSize);
            var relationsToDelete = FilterRelationsToDelete(dbPayeeRelations);
            List<long> ids = relationsToDelete.Select(x => x.id).ToList();
            idsToDelete.AddRange(ids);
            Console.WriteLine($"i = {i}, toSkip = {toSkip}, dbPayeeRelations.Count = {dbPayeeRelations.Count}");
        }
    }
}
private static List<PayeeRelation> GetDBRelations(PayeeRelationsContext context, int toSkip,
int bucketSize)
{
return context.PayeeRelations
.OrderBy(x => x.id)
.Include(x => x.PayeeRelation_PayeeVersion)
.Skip(toSkip)
.Take(bucketSize)
.AsNoTracking()
.ToList();
}
I don't see anything inherently wrong with your code that would indicate a memory leak. I believe what you are observing is simply that garbage collection does not fully "release" memory as soon as references are deemed unused or out of scope.
If memory use/allocation is a concern, then you should consider projecting down to the minimal viable data you need in order to identify which IDs should be deleted. For example, if you need the ID and Field1 from the PayeeRelations, and Field2 and Field3 from the related PayeeVersion:
private class RelationValidationDetails
{
public long PayeeRelationId { get; set; }
public string Field1 { get; set; }
public string Field2 { get; set; }
public DateTime Field3 { get; set; }
}
....then in your query:
var validationData = context.PayeeRelations
.OrderBy(x => x.id)
.Select(x => new RelationValidationDetails
{
PayeeRelationId = x.id,
Field1 = x.Field1,
Field2 = x.PayeeRelation_PayeeVersion.Field2,
Field3 = x.PayeeRelation_PayeeVersion.Field3
}).Skip(toSkip)
.Take(bucketSize)
.ToList();
Then your validation just takes the above collection of validation details to determine which IDs need to be deleted (assuming it bases this decision on Fields 1-3). This ensures that the query returns exactly the data needed to ultimately get the IDs to delete, minimizing memory growth.
There could be an argument that if a "Field4" is later required for the validation, you would have to update this object definition and revise the query, which is extra work compared to just using the entity. However, Field4 might not come from PayeeRelations or the PayeeVersion; it might come from a different related entity which currently isn't eager loaded. That would mean adding the cost of eager loading another table for every caller of a wrapped GetPayeeRelations call, whether they need that data or not; or risking performance hits from lazy loading (removing the AsNoTracking()); or introducing conditional complexity to tell GetPayeeRelations which relationships need to be eager loaded. Trying to predict this possibility is really just an example of YAGNI.
I generally don't recommend hiding EF queries behind getter methods (such as generic repositories), simply because these tend to form a lowest common denominator while chasing DRY or SRP. The reality is that they end up being single points that are inefficient in many cases, because if any one consumer needs a relationship eager loaded, all consumers get it eager loaded. It's generally far better to allow your consumers to project down to exactly what they need rather than worry that similar (rather than identical) queries might appear in multiple places.
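For illustration, a minimal sketch of how the original loop might consume such a projection. GetValidationDetails (a wrapper around the query above) and ShouldDelete (the validation predicate on Fields 1-3) are hypothetical names standing in for your FilterRelationsToDelete logic:
private static List<long> CollectIdsToDelete()
{
    const int bucketSize = 5000;
    var idsToDelete = new List<long>();

    for (int i = 0; i < 500; i++)
    {
        using (var context = new PayeeRelationsContext())
        {
            // Only the fields needed for validation are materialized,
            // so each bucket allocates far less than full tracked entities would.
            List<RelationValidationDetails> bucket =
                GetValidationDetails(context, toSkip: i * bucketSize, bucketSize: bucketSize);

            idsToDelete.AddRange(
                bucket.Where(ShouldDelete)
                      .Select(x => x.PayeeRelationId));
        }
    }

    return idsToDelete;
}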
In my code I need to perform an operation on lots of different postcodes; each operation can take up to a few seconds, and there can be thousands of postcodes to process. I need to be able to keep track of it and restart it if it is terminated for whatever reason.
I had the idea to create a search entity like so:
public class Search
{
public int ID { get; set; }
public virtual ICollection<PostCode> PostCodes { get; set; }
}
public class PostCode
{
public int ID { get; set; }
public string Value { get; set; }
}
I am having trouble working out how to keep track of it. My first thought was, after each successful operation, to remove the PostCode from the collection in Search and save it, so that each time I load the object it has only the unprocessed postcodes. But when I do this it throws an exception.
Like so:
using (var db = new MyDbContext())
{
foreach (var pc in search.PostCodes)
{
DoSomeStuff(pc);
search.PostCodes.Remove(pc);
db.SaveChanges();
}
}
I understand this is because I can't change the collection I am enumerating over, but for some reason I can't think of a simple, non-convoluted way to keep track of the search in the database. Also, since there are many thousands of postcodes, I am concerned about performance.
Can anyone suggest how I should be keeping track of the search?
EDIT
So if I keep track of the number processed in a Completed property, will this work properly?
using (var db = new MyDbContext())
{
foreach (var pc in search.PostCodes.Skip(search.Completed))
{
DoSomeStuff(pc);
search.Completed ++;
db.SaveChanges();
}
}
If I save and then load an entity with Entity Framework, does it always keep the IEnumerable in the same order?
Edit #2
This is where in my actual code I load the search object:
rs = db.RadarSearches.Include("PostCodes").FirstOrDefault(x => x.Keyword.Value == keyword && x.Complete == false);
I guess if I enforce ordering here then I can make it consistent. How do I order the nested PostCodes collection?
Store the result as an array and save that array to a file so that you can resume (assuming the search result may vary at different times). Along with the file, save the index of where you have got to. When your process resumes, load the array from the file together with the last successfully processed index.
You could store the whole object and remove the successfully processed ones, but for performance's sake only save the file on every 5th or 10th item.
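A minimal sketch of that idea; the file names, the snapshot format, and the doSomeStuff delegate are assumptions made for illustration:
using System;
using System.IO;

public static class PostCodeProcessor
{
    private const string DataFile = "postcodes.txt";
    private const string IndexFile = "postcodes.index";

    public static void Run(string[] postCodes, Action<string> doSomeStuff)
    {
        // First run: persist a snapshot of the postcodes we are going to process.
        if (!File.Exists(DataFile))
        {
            File.WriteAllLines(DataFile, postCodes);
            File.WriteAllText(IndexFile, "0");
        }

        string[] snapshot = File.ReadAllLines(DataFile);
        int start = int.Parse(File.ReadAllText(IndexFile));

        for (int i = start; i < snapshot.Length; i++)
        {
            doSomeStuff(snapshot[i]);

            // Checkpoint every 10th item (and at the end) to limit file I/O.
            if (i % 10 == 0 || i == snapshot.Length - 1)
            {
                File.WriteAllText(IndexFile, (i + 1).ToString());
            }
        }
    }
}
On a restart you simply call Run again with the same arguments; it picks up from the last saved index.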
Firstly apologies for the poor title. Absolutely no idea how to describe this question!
I have a "Relationship" entity that defines a relationship between 2 users.
public class Relationship{
    User User1 {get;set;}
    User User2 {get;set;}
    DateTime StateChangeDate {get;set;}
    //RelationshipState is an Enum with int values
    RelationshipState State {get;set;}
}
Relationship state example.
public enum RelationshipState{
state1 = 1,
state2 = 2,
state3 = 3,
state4 = 4
}
A Relationship entity is created each time the RelationshipState changes, so for any pair of users there will be many Relationship objects, with the most recent being current.
I'm trying to query for any Relationship object that represents a REDUCTION in RelationshipState for a particular pair of users.
That is, out of all the Relationship objects for all the users, the ones that have a later date than one with a higher RelationshipState.
I'm finding it very hard to figure out how to accomplish this without iterating over the entire Relationship table.
First, create a query that returns all the combinations of users, each with a child collection listing all the status changes. For more information, google LINQ Group By.
Then, using that collection, filter out the ones you don't want by looking at the last two status changes and seeing whether the state has gone down.
Here's an example, tested in LinqPad as a C# Program:
public enum RelationshipState {
state1 = 1,
state2 = 2,
state3 = 3,
state4 = 4
}
public class User {
public int id {get;set;}
}
public class Relationship{
public User User1{get;set;}
public User User2{get;set;}
public DateTime StateChangeDate {get;set;}
//RelationshipState is an Enum with int values
public RelationshipState State {get;set;}
}
void Main()
{
var rs=new List<Relationship>() {
new Relationship{ User1=new User{id=1},User2=new User{id=2},StateChangeDate=DateTime.Parse("1/1/2013"),State=RelationshipState.state2},
new Relationship{ User1=new User{id=1},User2=new User{id=2},StateChangeDate=DateTime.Parse("1/2/2013"),State=RelationshipState.state3},
new Relationship{ User1=new User{id=1},User2=new User{id=3},StateChangeDate=DateTime.Parse("1/1/2013"),State=RelationshipState.state2},
new Relationship{ User1=new User{id=1},User2=new User{id=3},StateChangeDate=DateTime.Parse("1/2/2013"),State=RelationshipState.state1},
new Relationship{ User1=new User{id=2},User2=new User{id=3},StateChangeDate=DateTime.Parse("1/2/2013"),State=RelationshipState.state1}
};
var result=rs.GroupBy(cm=>new {id1=cm.User1.id,id2=cm.User2.id},(key,group)=>new {Key1=key,Group1=group.OrderByDescending(g=>g.StateChangeDate)})
.Where(r=>r.Group1.Count()>1) // Remove Entries with only 1 status
//.ToList() // This might be needed for Linq-to-Entities
.Where(r=>r.Group1.First().State<r.Group1.Skip(1).First().State) // Only keep relationships where the state has gone down
.Select(r=>r.Group1.First()) //Turn this back into Relationship objects
;
// Use this instead if you want to know if state ever had a higher state than it is currently
// var result=rs.GroupBy(cm=>new {id1=cm.User1.id,id2=cm.User2.id},(key,group)=>new {Key1=key,Group1=group.OrderByDescending(g=>g.StateChangeDate)})
// .Where(r=>r.Group1.First().State<r.Group1.Max(g=>g.State))
// .Select(r=>r.Group1.First())
// ;
result.Dump();
}
Create a stored procedure in the database that uses a cursor to iterate the items, pairing each one with the item before it (and then filter to decreasing state).
Barring that, you can use an inner query that finds the previous item for each item:
from item in table
let previous =
    (from innerItem in table
     where innerItem.User1.id == item.User1.id
        && innerItem.User2.id == item.User2.id
        && innerItem.StateChangeDate < item.StateChangeDate
     orderby innerItem.StateChangeDate descending
     select innerItem)
    .FirstOrDefault()
where previous != null && previous.State > item.State
select item
As inefficient as that seems, it might be worth a try. Perhaps, with the proper indexes, a good query optimizer, and a sufficiently small set of data, it won't be that bad. If it's unacceptably slow, then a stored proc with a cursor is most likely going to be the best option.
I'm pretty sure the title sounds kind of weird but I hope this is a valid question :)
I have a class, let's call it Employee:
class Employee
{
int employeeid { get; set; }
String employeename { get; set; }
String comment { get; set; }
}
I will fill a List from a database. An employeeid can have X number of comments, thus leaving the ratio 1:X. And there can of course be Y number of employeeid as well.
I want to create a List of all the employee objects that have, for example, employeeid = 1, and another list for employeeid = 2.
I can sort the original List by employeeid, loop through it, and create a new list each time I hit a new employeeid. However, I feel the performance could be better.
Is there a way to split the original List into X number of lists depending on X number of distinct employeeids?
It's as simple as:
var query = data.GroupBy(employee => employee.employeeid);
Note that the performance is much better than that of the algorithm you described. GroupBy uses a hash-based data structure for the IDs, meaning the entire operation is effectively a single pass performing a constant-time operation on each item.
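If you actually need separate List&lt;Employee&gt; instances per id, a minimal sketch (variable names are illustrative):
// One List<Employee> per distinct employeeid, keyed by the id.
Dictionary<int, List<Employee>> listsById = data
    .GroupBy(employee => employee.employeeid)
    .ToDictionary(g => g.Key, g => g.ToList());

// e.g. all rows for employee 1:
List<Employee> employee1Rows = listsById[1];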
Sure, LINQ's GroupBy should make this a breeze. Try something like this:
var answer = myEmployeeList.GroupBy( emp=>emp.employeeid );
I wanted to generate a unique identifier for the results of a LINQ query I ran on some data.
Initially I thought of using a Guid for that, but after stumbling upon this problem I had to improvise.
However, I'd like to see if anyone has a solution using Guid, so here we go.
Imagine we have:
class Query
{
public class Entry
{
public string Id { get; set; }
public int Value { get; set; }
}
public static IEnumerable<Entry> GetEntries( IEnumerable<int> list)
{
var result =
from i in list
select new Entry
{
Id = System.Guid.NewGuid().ToString("N"),
Value = i
};
return result;
}
}
Now we want Id to be unique for each entry, but we need this value to be the same for each traversal of the IEnumerable we get from GetEntries. This means that we want the following code:
List<int> list = new List<int> { 1, 2, 3, 4, 5 };
IEnumerable<Query.Entry> entries = Query.GetEntries(list);
Console.WriteLine("first pass");
foreach (var e in entries) { Console.WriteLine("{0} {1}", e.Value, e.Id); }
Console.WriteLine("second pass");
foreach (var e in entries) { Console.WriteLine("{0} {1}", e.Value, e.Id); }
to give us something like:
first pass
1 47f4a21a037c4ac98a336903ca9df15b
2 f339409bde22487e921e9063e016b717
3 8f41e0da06d84a58a61226a05e12e519
4 013cddf287da46cc919bab224eae9ee0
5 6df157da4e404b3a8309a55de8a95740
second pass
1 47f4a21a037c4ac98a336903ca9df15b
2 f339409bde22487e921e9063e016b717
3 8f41e0da06d84a58a61226a05e12e519
4 013cddf287da46cc919bab224eae9ee0
5 6df157da4e404b3a8309a55de8a95740
However we get:
first pass
1 47f4a21a037c4ac98a336903ca9df15b
2 f339409bde22487e921e9063e016b717
3 8f41e0da06d84a58a61226a05e12e519
4 013cddf287da46cc919bab224eae9ee0
5 6df157da4e404b3a8309a55de8a95740
second pass
1 a9433568e75f4f209c688962ee4da577
2 2d643f4b58b946ba9d02b7ba81064274
3 2ffbcca569fb450b9a8a38872a9fce5f
4 04000e5dfad340c1887ede0119faa16b
5 73a11e06e087408fbe1909f509f08d03
Now, taking a second look at my code above, I realized where my error was:
The assignment of Id to Guid.NewGuid().ToString("N") gets called every time we traverse the collection, and thus is different every time.
So what should I do then?
Is there a way to make sure that I get only one copy of the collection every time?
Is there a way to make sure that I won't get new instances of the result of the query?
Thank you for your time in advance :)
This is inherent to LINQ queries: being repeatable is coincidental, not guaranteed.
You can solve it with .ToList(), like:
IEnumerable<Query.Entry> entries = Query.GetEntries(list).ToList();
Or better, move the .ToList() inside GetEntries()
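For example, GetEntries could stay as it is apart from the last line; a minimal sketch:
public static IEnumerable<Entry> GetEntries(IEnumerable<int> list)
{
    var result =
        from i in list
        select new Entry
        {
            Id = System.Guid.NewGuid().ToString("N"),
            Value = i
        };

    // Materialize once: the GUIDs are generated here and never again,
    // no matter how many times the caller enumerates the return value.
    return result.ToList();
}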
Perhaps you need to produce the list of entries once, and return the same list each time in GetEntries.
Edit:
Ah no, you get a different list each time! Well, then it depends on what you want. If you want the same Id for each specific Value, possibly across different lists, you need to cache the Ids: keep a Dictionary<int, Guid> where you store the already allocated GUIDs. If you want your GUIDs to be unique per source list, you would perhaps need to cache the returned IEnumerables keyed by the input list, and always check whether that input list has already been returned or not.
Edit:
If you don't want to share the same GUIDs between different runs of GetEntries, you should just "materialize" the query (replacing return result; with return result.ToList();, for example), as was suggested in the comments to your question.
Otherwise the query runs each time you traverse the result. This is called lazy (deferred) evaluation. Lazy evaluation is usually not a problem, but in your case it leads to recalculating the GUIDs on each run of the query (i.e., each loop over the result sequence).
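A minimal sketch of the Dictionary<int, Guid> caching idea mentioned above; the GuidCache class and its name are assumptions made for illustration:
using System;
using System.Collections.Generic;

// Hands out the same GUID for the same value, no matter how often it is asked.
public class GuidCache
{
    private readonly Dictionary<int, Guid> _ids = new Dictionary<int, Guid>();

    public Guid GetOrCreate(int value)
    {
        Guid id;
        if (!_ids.TryGetValue(value, out id))
        {
            id = Guid.NewGuid();
            _ids[value] = id;
        }
        return id;
    }
}

// Usage inside the query:  Id = cache.GetOrCreate(i).ToString("N")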
Any reason you have to use LINQ? The following seems to work for me:
public static IEnumerable<Entry> GetEntries(IEnumerable<int> list)
{
List<Entry> results = new List<Entry>();
foreach (int i in list)
{
results.Add(new Entry() { Id = Guid.NewGuid().ToString("N"), Value = i });
}
return results;
}
That's because of the way LINQ works. When you return just the LINQ query, it is executed every time you enumerate over it. Therefore, for each list item, Guid.NewGuid will be executed as many times as you enumerate over the query.
Try adding an item to the list after you have iterated over the query once, and you will see that when iterating a second time the newly added item is also in the result set. That's because the LINQ query holds a reference to your list, not an independent copy.
To always get the same result, return an array or a list instead of the LINQ query; change the return line of the GetEntries method to something like this:
return result.ToArray();
This forces immediate execution, which also happens only once.
Best Regards,
Oliver Hanappi
You might consider not using Guid, at least not with "new".
Calling GetHashCode() gives values that don't change when you traverse the list multiple times (though they are not guaranteed to be unique).
The problem is that your list is an IEnumerable<int>, so the hash code of each item coincides with its value.
You should re-evaluate your approach and use a different strategy. One thing that comes to mind is to use a pseudo-random number generator initialized with the hash code of the collection: it will always return the same sequence of numbers when initialized with the same seed. But, again, forget Guid.
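A minimal sketch of that idea. Note that deriving the seed from the list contents (rather than from List&lt;T&gt;.GetHashCode(), which is reference-based) is an assumption made here so that the same input always yields the same seed:
using System;
using System.Collections.Generic;
using System.Linq;

public static class StableIds
{
    public static IEnumerable<Query.Entry> GetEntries(IEnumerable<int> list)
    {
        var values = list.ToList();

        // Content-derived seed: the same input values always give the same seed.
        int seed = values.Aggregate(17, (acc, v) => unchecked(acc * 31 + v));
        var rng = new Random(seed);

        // Materialized once, so the ids are generated exactly one time per call,
        // and the same input produces the same ids across calls.
        return values.Select(v => new Query.Entry
        {
            Id = rng.Next().ToString("x8"),
            Value = v
        }).ToList();
    }
}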
One suggestion (I don't know whether this applies to your case or not):
If you want to save the entries in a database, try assigning your entry's primary key a Guid at the database level. This way, each entry will have a unique, persisted Guid as its primary key. Check out this link for more info.
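In Entity Framework Code First, for example, a Guid key can be marked as database-generated; a minimal sketch, under the assumption that the database (e.g. SQL Server with a NEWSEQUENTIALID()/NEWID() default) supplies the value on insert, and with PersistedEntry as an illustrative name:
using System;
using System.ComponentModel.DataAnnotations;
using System.ComponentModel.DataAnnotations.Schema;

public class PersistedEntry
{
    [Key]
    [DatabaseGenerated(DatabaseGeneratedOption.Identity)]
    public Guid Id { get; set; } // assigned by the database when the row is inserted

    public int Value { get; set; }
}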