We're planning to add a Redis cache to an existing solution.
We have a core entity which is fetched a lot, several times per session. The entity consists of 13 columns, most of which are less than 20 characters. Typically it's retrieved by parent id, but sometimes a subset is fetched by a list of ids. The list is typically around 400 items, but in some cases it can be up to 3000. To solve this we're thinking of implementing the solution below, but the question is whether it's a good idea.
We would store the instances in the list with this key pattern: EntityName:{ParentId}:{ChildId}, where ParentId and ChildId are ints.
Then, to retrieve the list by ParentId, we would call the method below with EntityName:{ParentId}:* as the value of the pattern argument:
public async Task<List<T>> GetMatches<T>(string pattern)
{
    var keys = _multiPlexer.GetServer(_multiPlexer.GetEndPoints(true)[0]).Keys(pattern: pattern).ToArray();
    var values = await Db.StringGetAsync(keys: keys);

    var result = new List<T>();
    foreach (var value in values.Where(x => x.HasValue))
    {
        result.Add(JsonSerializer.Deserialize<T>(value));
    }
    return result;
}
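For example, fetching everything for one parent would look like this (parentId and ChildEntity are just illustrative names, not part of the actual code):

// All cached children of a parent, via key-pattern matching.
var children = await GetMatches<ChildEntity>($"EntityName:{parentId}:*");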
And to retrieve a specific list of items we would call the below method with a list of exact keys:
public async Task<List<T>> GetList<T>(string[] keys)
{
    var values = await Db.StringGetAsync(keys: keys.Select(x => (RedisKey)x).ToArray());

    var result = new List<T>();
    foreach (var value in values.Where(x => x.HasValue))
    {
        result.Add(JsonSerializer.Deserialize<T>(value));
    }
    return result;
}
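For the subset case the exact keys are built from the known ids (again, parentId, childIds and ChildEntity are placeholders):

// A specific subset, fetched with a single multi-get over exact keys.
var keys = childIds.Select(id => $"EntityName:{parentId}:{id}").ToArray();
var subset = await GetList<ChildEntity>(keys);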
The obvious worry here is the number of objects to deserialize and the performance of System.Text.Json.
An alternative would be to store the data twice, both as a list and on its own, but that would only help when fetching by ParentId. We could also store the data only as a list and retrieve the whole list every time, even when we only need a subset.
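For illustration, the "store the whole list under one key" variant could look roughly like this (the :all suffix is just a hypothetical naming choice):

// Hypothetical: cache the entire child list under a single key per parent.
var listKey = $"EntityName:{parentId}:all";
await Db.StringSetAsync(listKey, JsonSerializer.Serialize(children));

// Reading it back always returns the full list, even when only a subset is needed.
var cached = await Db.StringGetAsync(listKey);
var wholeList = JsonSerializer.Deserialize<List<ChildEntity>>((string)cached);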
Is there a better way to tackle this?
All input is greatly appreciated! Thanks!
Edit
I wrote a small console application to load test the alternatives: fetching 2000 items 100 times took 2020ms with the pattern matching, while fetching the list took 1568ms. I think we can live with that difference and go with the pattern matching.
It seems like @Xerillio was right. I did some load testing using hosted services, and then it was almost three times slower to fetch the list using the pattern matching, slower than retrieving the list directly from SQL. So, to answer my own question whether it's a good idea: no, it isn't. The majority of the added time was not due to deserialization but to fetching the keys via the pattern matching.
Here's the result from fetching 2000 items 100 times in a loop:
Fetch directly from db = 8625ms
Fetch using list of exact keys = 5663ms
Fetch using match = 13098ms
Fetch full list = 5352ms
Related
I have documents with a huge number of fields (7,500 fields in each), but the field values are simple data (numbers only). When I query the collection it works great (looking at the Mongo profiler, it uses the indexes correctly), but it takes too long to iterate the cursor (to receive the data).
The count of resulting documents is ~450, yet it takes about 2 minutes to receive all of them.
I have already updated MongoDB to the latest version, updated the MongoDB driver (for .NET), and recreated the indexes, but nothing helps.
P.S. The connection is not slow (the DB server is on my local network, 100Base-T/Fast Ethernet).
The query code is below:
var builder = Builders<BsonDocument>.Filter;
var filter = builder.Eq("OrgID", orgID);
filter = filter & builder.Eq("DateDeleted", (DateTime?)null);

var collection = GetCollection("NameOfCollection");
var result = collection.Find(filter);

using (var cursor = result.ToCursor())
{
    while (cursor.MoveNext())
    {
        var batch = cursor.Current;
        foreach (var document in batch)
        {
            yield return document;
        }
    }
}
I have separate indexes for each of those fields, and also a composite index with both fields in one index.
The same code works great with collections that have many more documents but fewer fields (~20 fields per document).
Why are you using a cursor? It is my understanding that it fetches each record individually. I bet if you iterated through ToList() instead you'd get better performance, because it'd fetch all the data in a single call.
foreach (var document in collection.Find(filter).ToList())
{
    // your other code here
}
Also, you are yielding the results, which means this is wrapped in an IEnumerable, and whatever you are doing between these retrieve calls could be slowing down the process, but you left that code out, so it's hard to say.
This one is all about performance. I have two major lists of objects (here, I'll use PEOPLE/PERSON as the stand-in). First, I need to filter one list by the First_Name property. Then I need to create two filtered lists from each master list based on shared dates: one list with only one name, the other with every name, but with both lists containing only matching date entries (no date in one list that doesn't exist in the other). I've written pseudo-code below to reduce the issue to the core question. Please understand while reading that BIRTHDAY wasn't the best choice, as there are multiple date entries per person, so please pretend that each person has about 5,000 "birthdays" when reading the code below:
public class Person
{
    public string first_Name;
    public string last_Name;
    public DateTime birthday;
}

public class filter_People
{
    List<Person> Group_1 = new List<Person>(); // filled from DB table "1982 Graduates"; contains all names and all dates
    List<Person> Group_2 = new List<Person>(); // filled from DB table "1983 Graduates"; contains all names and all dates

    public void filter(List<Person> group_One, List<Person> group_Two)
    {
        Group_1 = group_One;
        Group_2 = group_Two;

        // Create a list of distinct first names from Group_1
        List<string> distinct_Group_1_Name = Group_1.Select(p => p.first_Name).Distinct().ToList();

        // Compare each first name in Group_1 to EVERY first name in Group_2, using only records with matching birthdays
        Parallel.For(0, distinct_Group_1_Name.Count, dI =>
        {
            // Step 1 - create a list of people from Group_1 that match the first name being iterated
            List<Person> first_Name_List_1 = Group_1.Where(m => m.first_Name == distinct_Group_1_Name[dI]).ToList();
            // first_Name_List_1 now contains everyone named X (Tom). We need to find people from Group_2 who match Tom's birthdays - regardless of name

            // Step 2 - find matching birthdays by JOINing the filtered name list against Group_2
            DateTime[] merged_Dates = first_Name_List_1.Join(Group_2, d => d.birthday, b => b.birthday, (d, b) => b.birthday).ToArray();

            // Step 3 - create filtered lists where Filtered_Group_1 contains ONLY people named Tom, and Filtered_Group_2 contains people with ANY name sharing Tom's birthday. No duplicates, no missing dates.
            List<Person> Filtered_Group_1 = first_Name_List_1.Where(p => p.birthday.In(merged_Dates)).ToList();
            List<Person> Filtered_Group_2 = Group_2.Where(p => p.birthday.In(merged_Dates)).ToList();

            // Step 4 - move on and process the two filtered lists (outside the scope of the question)
            // Each name in Group_1 will then be compared to EVERY name in Group_2 sharing the same birthday
            // compare_Groups(Filtered_Group_1, Filtered_Group_2)
        });
    }
}

public static class Extension
{
    public static bool In<T>(this T source, params T[] list)
    {
        return list.Contains(source);
    }
}
Here, the idea is to take two different master name lists from the DB and create sub-lists where dates match (one with only one name, and the other with all names) allowing for a one-to-many comparison based on datasets of the same length with matching date indices. Originally, the idea was to simply load the lists from the DB, but the lists are long and loading all name data and using SELECT/WHERE/JOIN is much faster. I say "much faster" but that's relative.
I've tried converting Group_1 and Group_2 to dictionaries and matching dates by using keys. Not much improvement. Group_1 has about 12 million records (about 4800 distinct names with multiple dates each), and Group_2 has about the same, so the input here is 12 million records and the output is a bazillion records. Even though I'm running this method as a separate Task and queuing the results for another thread to process, it's taking forever to split these lists and keep up.
Also, I realize this code doesn't make much sense using class Person, but it's only a representative of the problem essentially using pseudocode. In reality, this method sorts multiple datasets on date and compares one to many for correlation.
Any help on how to accomplish filtering this one to many comparison in a more productive way would be greatly appreciated.
Thanks!
With the code in its current form, I see too many issues for it to perform well with the kind of data you have mentioned. Parallelism is no magic pill for a poor choice of algorithm and data structures.
Currently every comparison does a linear search, O(N), which makes it M*O(N) for M operations. If we make those lookups O(log N), or better O(1), there will be a drastic improvement in execution time.
Instead of taking Distinct and then searching inside the Parallel loop with a Where clause, use GroupBy to aggregate/group the records and create a Dictionary in the same operation, which gives you an easy lookup of the records for a given name:
var nameGroupList = Group_1.GroupBy(p => p.first_Name).ToDictionary(p => p.Key, p => p);
This lets you get rid of the following two operations in the original code (one of them, being inside the Parallel loop, is repeated work that hurts performance big time):
List<string> distinct_Group_1_Name = Group_1.Select(p => p.first_Name).Distinct().ToList();
List<Person> first_Name_List_1 = Group_1.Where(m => m.first_Name == distinct_Group_1_Name[dI]).ToList();
The dictionary will effectively be of type Dictionary<string, IEnumerable<Person>>, so you get the list of Person objects for a given name in O(1) time, with no repeated search. This also removes another issue of the original code: recreating a list each time it searches through the original data.
The next part that needs to be handled, because it is hurting performance, is code like this:
p.birthday.In(merged_Dates)
since the extension method runs list.Contains, an O(N) operation, every time, which kills performance. The following are the possible options:
Take the following operation out of the Parallel loop as well:
DateTime[] merged_Dates = first_Name_List_1.Join(Group_2, d => d.birthday, b => b.birthday, (d, b) => b.birthday).ToArray();
Instead, create another dictionary of type Dictionary<string, HashSet<DateTime>> by intersecting the data from the Dictionary<string, IEnumerable<Person>> created earlier with the data from Group_2 (using an appropriate IEqualityComparer for DateTime). That way a ready-made set of dates is available and doesn't need to be created every time:
personDictionary["PersonCode"].Intersect(Group2,IEqualityComparer(using Date))
For the final result, note that you should store the dates in a HashSet instead of a List. The benefit is that Contains becomes an O(1) operation on average instead of O(N), which makes it much faster (a structure like Dictionary<string, Dictionary<DateTime, DateTime>> keyed on the date gives the same O(1) lookup).
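Putting these points together, a minimal sketch could look like this; it reuses Person, Group_1 and Group_2 from the question, and the exact matching rules are my assumptions rather than a drop-in replacement:

// Group Group_1 once by first name: O(1) lookup per name afterwards.
var peopleByName = Group_1
    .GroupBy(p => p.first_Name)
    .ToDictionary(g => g.Key, g => g.ToList());

// Precompute the distinct birthdays present in Group_2 once, as a HashSet for O(1) Contains.
var group2Dates = new HashSet<DateTime>(Group_2.Select(p => p.birthday));

Parallel.ForEach(peopleByName, pair =>
{
    List<Person> first_Name_List_1 = pair.Value;

    // Birthdays this name group shares with Group_2.
    var sharedDates = new HashSet<DateTime>(
        first_Name_List_1.Select(p => p.birthday).Where(group2Dates.Contains));

    // Same filtering as Steps 3-4 of the original, but with O(1) date membership checks.
    var Filtered_Group_1 = first_Name_List_1.Where(p => sharedDates.Contains(p.birthday)).ToList();
    var Filtered_Group_2 = Group_2.Where(p => sharedDates.Contains(p.birthday)).ToList();

    // compare_Groups(Filtered_Group_1, Filtered_Group_2);
});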
Try these points and suggest if there's any improvement in the working of the code.
I'm consuming a stream of semi-random tokens. For each token, I'm maintaining a lot of data (including some sub-collections).
The number of unique tokens is unbounded but in practice tends to be on the order of 100,000-300,000.
I started with a list and identified the appropriate token object to update using a Linq query.
public class Model
{
    public List<State> States { get; set; }
    ...
}
var match = model.States.Where(x => x.Condition == stateText).SingleOrDefault();
Over the first ~30k unique tokens, I was able to find and update ~1,100 tokens/sec.
Performance analysis shows that 85% of the total CPU cycles are being spent on the Where(...).SingleOrDefault() (which makes sense; lists are an inefficient way to search).
So, I switched the list over to a HashSet and profiled again, confident that HashSet would be able to random seek faster. This time, I was only processing ~900 tokens/sec. And a near-identical amount of time was spent on the Linq (89%).
So... first up, am I misusing the HashSet? (Is using Linq forcing a conversion to IEnumerable and then an enumeration, or something similar?)
If not, what's the best pattern to implement myself? I was under the impression that HashSet already does a binary seek, so I assume I'd need to build some sort of tree structure and have smaller sub-sets?
To answer some questions from the comments: the condition is unique (if I get the same token twice, I want to update the same entry), and the HashSet is the stock .NET implementation (System.Collections.Generic.HashSet<T>).
A wider view of the code is...
var state = new RollingList(model.StateDepth); // Tracks last n items and drops older ones. (Basically an array and an index that wraps around
var tokens = tokeniser.Tokenise(contents); // Iterator
foreach (var token in tokens) {
var stateText = StateToString(ref state);
var match = model.States.Where(x => x.Condition == stateText).FirstOrDefault();
// ... update the match as appropriate for the token
}
var match = model.States.Where(x => x.Condition == stateText).SingleOrDefault();
If you're doing that exact same thing with a hash set, that's no savings. Hash sets are optimized for quickly answering the question "is this member in the set?" not "is there a member that makes this predicate true in the set?" The latter is linear time whether it is a hash set or a list.
Possible data structures that meet your needs:
Make a dictionary mapping from text to state, and then do a search in the dictionary on the text key to get the resulting state. That's O(1) for searching and inserting in theory; in practice it depends on the quality of the hash.
Make a sorted dictionary mapping from text to state. Again, search on text. Sorted dictionaries keep the keys sorted in a balanced tree, so that's O(log n) for searching and inserting.
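For illustration, a minimal sketch of the first option, assuming State has a settable string Condition property as implied by the question:

// Dictionary keyed by the condition text: O(1) average lookup and insert.
var statesByCondition = new Dictionary<string, State>();

// ... while consuming tokens:
if (!statesByCondition.TryGetValue(stateText, out var match))
{
    match = new State { Condition = stateText };
    statesByCondition.Add(stateText, match);
}
// ... update 'match' as appropriate for the token

// The sorted alternative has the same usage, with O(log n) lookups and keys kept in order:
var sortedStates = new SortedDictionary<string, State>();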
30k is not that much, so if the state is unique you can do something like this.
Dictionary access is much faster.
var statesDic = model.States.ToDictionary(x => x.Condition, x => x);
var match = statesDic.ContainsKey(stateText) ? statesDic[stateText] : default(State);
Quoting MSDN:
The Dictionary generic class provides a mapping from a set of keys to a set of values. Each addition to the dictionary consists of a value and its associated key. Retrieving a value by using its key is very fast, close to O(1), because the Dictionary class is implemented as a hash table.
You can find more info about Dictionaries here.
Also be aware that dictionaries trade memory space for performance. You can do a quick test for 300k items and see what kind of space I'm talking about, like this:
var memoryBeforeDic = GC.GetTotalMemory(true);
var dic = new Dictionary<string,object>(300000);
var memoryAfterDic = GC.GetTotalMemory(true);
Console.WriteLine("Memory: {0}", memoryAfterDic - memoryBeforeDic);
I have the following code. The dictionary _ItemsDict contains millions of records, and this code takes a lot of time adding items to the associatedItemslst list. Is there a way to speed up this process?
foreach (var obj in lst)
{
    foreach (var item in _ItemsDict.Where(ikey => ikey.Key.StartsWith(obj))
                                   .Select(ikey => ikey.Value))
    {
        var aI = new AssociatedItem
        {
            associatedItemCode = item.ItemCode
        };
        associatedItemslst.Add(aI);
    }
}
Instead of using a Dictionary<TKey, TValue> you may want to implement a Trie/Radix Tree/Prefix Tree.
Quoted from wikipedia:
A common application of a trie is storing a predictive text or autocomplete dictionary, such as found on a mobile telephone.
(snip)
Tries are also well suited for implementing approximate matching algorithms,[6] including those used in spell checking and hyphenation[2] software.
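To make the idea concrete, here is a minimal, unoptimized prefix-tree sketch; the class and member names are my own invention, not from any library:

using System.Collections.Generic;

// Minimal trie: maps string keys to values and returns all values under a given prefix.
public class Trie<TValue>
{
    private class Node
    {
        public Dictionary<char, Node> Children = new Dictionary<char, Node>();
        public List<TValue> Values = new List<TValue>();
    }

    private readonly Node _root = new Node();

    public void Add(string key, TValue value)
    {
        var node = _root;
        foreach (var ch in key)
        {
            if (!node.Children.TryGetValue(ch, out var child))
            {
                child = new Node();
                node.Children[ch] = child;
            }
            node = child;
        }
        node.Values.Add(value);
    }

    // All values whose key starts with the given prefix.
    public IEnumerable<TValue> StartsWith(string prefix)
    {
        var node = _root;
        foreach (var ch in prefix)
        {
            if (!node.Children.TryGetValue(ch, out node))
                yield break; // no key has this prefix
        }
        foreach (var value in Collect(node))
            yield return value;
    }

    // Depth-first collection of every value stored at or below this node.
    private static IEnumerable<TValue> Collect(Node node)
    {
        foreach (var value in node.Values)
            yield return value;
        foreach (var child in node.Children.Values)
            foreach (var value in Collect(child))
                yield return value;
    }
}

Add and the prefix descent in StartsWith cost O(length of the string), independent of how many entries the structure holds, whereas running StartsWith over every dictionary key is linear in the number of entries.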
You can divide the time by a factor of 5 or 6 by using Parallel.ForEach():
String obj = "42";
var bag = new ConcurrentBag<AssociatedItem>(); // thread-safe collection to gather results from the parallel loop

Parallel.ForEach(_ItemsDict,
    new ParallelOptions { MaxDegreeOfParallelism = Environment.ProcessorCount },
    (i) =>
    {
        if (i.Key.StartsWith(obj))
            bag.Add(new AssociatedItem() { associatedItemCode = i.Value });
    });
But it seems there's definitely an architectural issue. A trie is one way to go. Or you can use a
Dictionary<String, List<TValue>>
where you store the associated objects under every prefix of every string, and then look them up directly.
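A rough sketch of that approach, where ItemValue is a placeholder for whatever type _ItemsDict's values are (assumed, as in the question's snippet, to expose an ItemCode) and the keys are assumed short enough that indexing every prefix is affordable:

// Build once: map every prefix of every key to the values that share it.
var byPrefix = new Dictionary<string, List<ItemValue>>();
foreach (var pair in _ItemsDict)
{
    for (int len = 1; len <= pair.Key.Length; len++)
    {
        var prefix = pair.Key.Substring(0, len);
        if (!byPrefix.TryGetValue(prefix, out var list))
        {
            list = new List<ItemValue>();
            byPrefix[prefix] = list;
        }
        list.Add(pair.Value);
    }
}

// Query: a single O(1) lookup instead of scanning the whole dictionary.
if (byPrefix.TryGetValue(obj, out var matches))
{
    foreach (var item in matches)
        associatedItemslst.Add(new AssociatedItem { associatedItemCode = item.ItemCode });
}

This trades a lot of memory for lookup speed; a trie answers the same StartsWith queries without storing every prefix as a separate string.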
Last but not least, if your data comes from a database, SQL Server is very efficient at searching the start of a varchar column with a clause such as:
WHERE ValueColumn LIKE '42%' -- equivalent of StartsWith("42")
I do not think using a dictionary will help you make this code faster. The reason is that dictionaries are good at matching the complete key, not a partial key; in your case the code actually goes through every key in the dictionary to find the result. I would suggest using another data structure to get the result faster, one of them being the TRIE data structure. I have posted a blog about auto-complete using a TRIE here: https://devesh4blog.wordpress.com/2013/11/16/real-time-auto-complete-using-trie-in-c/
I'm having a problem finding the answer to this question. I have only found one other question here, PagedList with Entity Framework getting all records, but it never received a reply, and I have looked through https://stackoverflow.com/search?q=pagedlist.
So my question is the same: does PagedList return all the records and then skip and take the required number by default? For example, if the database has 1000 records, will it return all records and then take, say, the first 10 for page 1, etc.?
From my own debugging, it does appear that way, but I'm looking for some clarification.
Thanks
George
-----------------Extra Code -------------------
Hi Maarten
Below is how I have my paging set up:
var model = new DisplayMemberForumRepliesViewModel
{
    DisplayMemberForumReplyDetails = _imf.RepliesToForumPost(postId).ToPagedList(page, _numberOfRecordsPerPage)
};
View Model
public class DisplayMemberForumRepliesViewModel
{
    public IPagedList<MembersForumProperties> DisplayMemberForumReplyDetails { get; set; }
    public IEnumerable<MembersForumProperties> SelectForumPostReplies { get; set; }
}
As mentioned earlier, it seems to return all records, then selects the records paged.
Can you see what I'm doing wrong? I'm getting the data from a SQL stored procedure, which I have added below.
SELECT a.[MemberUsername] AS ForumMember,
       a.[MemberID] AS ForumMemberID,
       a.[MemberAvatarLocation] AS ForumMemberAvatar,
       b.[ForumPostID] AS ForumPostID,
       -- b.[ForumPostReplyID] AS ForumPostReplyID,
       b.[ForumPostReplyMessage] AS ForumReplyMessage,
       b.[ForumPostReplyDateTime] AS ForumRelyDateTimePosted,
       b.[ForumPostReplyMessage] AS ForumPostReply,
       c.[ForumPostTitle] AS ForumPostTitle
FROM [WebsiteMembership].[dbo].[tblMemberProfile] a
INNER JOIN [Website].[dbo].[tblForumMembersPostReplies] b ON a.[MemberID] = b.[ForumPostReplyMemberID]
INNER JOIN [Website].[dbo].[tblForumMembersPost] c ON a.[MemberID] = c.[ForumMemberID]
WHERE b.[ForumPostID] = @ForumPostID
ORDER BY b.[ForumReplyTableID] DESC
Thanks
Old question but I wanted to answer it anyway.
PagedList will query the number of rows you specified in the pageSize parameter. The thing is, you didn't provide the implementation of _imf.RepliesToForumPost(postId).
I'd guess your code inside RepliesToForumPost is getting all replies for that post and only then doing pagination with PagedList, after all the data has already come back from the database, which is not what you want.
Just to clarify, I'm gonna provide two possible implementations as example.
This first one executes the query immediately (it'll hit the database and get all records before PagedList is involved):
public IEnumerable<MembersForumProperties> RepliesToForumPost(int postId)
{
    using (MyContext db = new MyContext())
    {
        return db.MembersForum.Where(c => c.PostId == postId).ToList();
    }
}
Now, this second one just returns the IQueryable<> to serve as the query, and PagedList is the one that executes it, limited to the number of records you want:
public IQueryable<MembersForumProperties> RepliesToForumPost(int postId)
{
    // The context must stay alive until PagedList executes the query, so it can't be
    // disposed inside this method; _db here stands for a longer-lived (e.g. injected) MyContext.
    return _db.MembersForum.Where(c => c.PostId == postId).AsQueryable();
}
Also, you should take note PagedList will always execute two methods:
One call to Count() on your IEnumerable to find the total number of records (whether your IEnumerable is a materialized list of objects or just an IQueryable expression)
Another to get the desired records (again, whether from a materialized list or not)
This means that if you're working, for example, with Entity Framework and using a non-materialized IEnumerable (the IQueryable from my second implementation), PagedList's call to Count() will make Entity Framework run a query to retrieve the total number of rows, which could be 1000 (it's a simple COUNT in the database; it wouldn't fetch 1000 records at all), and a second call to retrieve your actual records/objects according to the pageSize you asked for (for example, 10 records).
If I had a materialized list of 1000 objects because my code called ToList() as in my first example, PagedList would still see that the IEnumerable (a list) had 1000 items and would paginate your 10 records without running any database queries (since the list is already materialized). However, your method would have pulled 1000 rows from the database anyway, which is worse than doing two queries (one to get 10 records and a count to learn that the query would return 1000).
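In effect, with an IQueryable source the paging boils down to something like this simplified sketch (not PagedList's actual internals; page, pageSize and the ordering property are assumptions):

// 'query' is the IQueryable<MembersForumProperties> returned by RepliesToForumPost.
var totalCount = query.Count();                          // translated to a single COUNT query

var pageItems = query
    .OrderByDescending(r => r.ForumRelyDateTimePosted)   // paging needs a stable order
    .Skip((page - 1) * pageSize)
    .Take(pageSize)
    .ToList();                                           // second query: only pageSize rows come back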
No, EF doesn't retrieve all records if it is queried correctly. Since you don't show the specific implementation of your ToPagedList method, we can't tell.
Check my answer on your mentioned question.