This seems like a common use case... but somehow I cannot get it working.
I'm attempting to use MongoDB as an enumeration store with unique items. I've created a collection with a byte[] Id (the unique ID) and a timestamp (a long, used for enumeration). The store is quite big (terabytes) and distributed among different servers. I am able to re-build the store from scratch currently, since I'm still in the testing phase.
What I want to do is two things:
Create a unique id for each item that I insert. This basically means that if I insert the same ID twice, MongoDB will detect this and give an error. This approach seems to work fine.
Continuously enumerate the store for new items from other processes. The approach I took was to add a second index on InsertId, which holds a high-precision timestamp combined with the server id and a counter (just to make it unique and ascending).
In the best scenario this would mean that the enumerator would keep track of an index cursor for every server. From what I've learned from mongodb query processing I expected this behavior. However, when I try to execute the code (below) it seems to take forever to get anything.
long lastid = 0;
while (true)
{
    DateTime first = DateTime.UtcNow;
    foreach (var item in collection.FindAllAs<ContentItem>().OrderBy((a)=>(a.InsertId)).Take(100))
    {
        lastid = item.InsertId;
    }
    Console.WriteLine("Took {0:0.00} for 100", (DateTime.UtcNow - first).TotalSeconds);
}
I've read about cursors, but am unsure if they fulfill the requirements when new items are inserted into the store.
As I said, I'm not bound to any table structure or anything like that... the only thing that matters is that I can get new items over time without getting duplicates.
-Stefan.
Somehow I figured it out... more or less...
I created the query manually and ended up with something like this:
db.documents.find({ "InsertId" : { "$gt" : NumberLong("2020374866209304106") } }).limit(10).sort({ "InsertId" : 1 });
The LINQ query I put in the question doesn't generate this query. After some digging in the code I found that it should be this LINQ query:
foreach (var item in collection.AsQueryable().Where((a)=>(a.InsertId > lastid)).OrderBy((a) => (a.InsertId)).Take(100))
The AsQueryable() seems to be the key to execute the rewriting of LINQ to MongoDB queries.
This gives results, but still they appeared to be slow (4 secs for 10 results, 30 for 100). However, when I added 'explain()' I noticed '0 millis' in the query execution.
I stopped the process doing bulk inserts and tada, it works, and fast. In other words: the issues I was having were due to the locking behavior of MongoDB, and due to the way I interpreted the linq implementation. Since the former is the result of initial bulk-filling the data store, this means that the problem is solved.
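The polling pattern behind that query can be sketched as follows. This is only an illustration under assumptions: an in-memory list of InsertId values stands in for the collection, and the batch size of 2 and the Drain helper are made up for the example. The real code would go through collection.AsQueryable() as shown above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// In-memory stand-in for the collection: just the InsertId values (hypothetical data).
var store = new List<long> { 5, 1, 9, 3, 7 };

long lastId = 0;               // high-water mark: largest InsertId handed out so far
var seen = new List<long>();   // everything the enumerator has returned

// Read batches strictly after lastId, in ascending order, until nothing new is left.
void Drain()
{
    while (true)
    {
        var batch = store.Where(id => id > lastId).OrderBy(id => id).Take(2).ToList();
        if (batch.Count == 0) return;      // a real poller would sleep and retry here
        seen.AddRange(batch);
        lastId = batch[batch.Count - 1];   // advance the high-water mark
    }
}

Drain();
store.Add(11);   // a new item arrives after the first pass
Drain();         // only the new item is returned; nothing is repeated
Console.WriteLine(string.Join(", ", seen)); // prints "1, 3, 5, 7, 9, 11"
```

Because each pass filters on InsertId > lastId, an item is returned at most once as long as new items always receive a larger InsertId than the ones already read.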
On the 'negative' part of the solution: I would have preferred a solution that involved serializable cursors or something like that... this 'take' solution has to iterate the b-tree over and over again. If someone has an answer for this, please let me know.
-Stefan.
Currently I have 7,000 video entries and I have a hard time optimizing it to search for Tags and Actress.
This is the code I am trying to modify. I tried using HashSet; it is my first time using it, and I don't think I am doing it right.
Dictionary dictTag = JsonPairtoDictionary(tagsId,tagsName);
Dictionary dictActresss = JsonPairtoDictionary(actressId, actressName);
var listVid = new List<VideoItem>(db.VideoItems.ToList());
HashSet<VideoItem> lll = new HashSet<VideoItem>(listVid);
foreach (var tags in dictTag)
{
    lll = new HashSet<VideoItem>(lll.Where(q => q.Tags.Exists(p => p.Id == tags.Key)));
}
foreach (var actress in dictActresss)
{
    listVid = listVid.Where(q => q.Actress.Exists(p => p.Id == actress.Key)).ToList();
}
First, I get all the videos in the database by using db.VideoItems.ToList().
Then it goes through a loop to check whether a tag exists.
Each VideoItem has a List<Tags>, and I use Exists to check whether a tag matches.
Then I do the same thing with Actress.
I am not sure if it's because I am in Debug mode and Application Insights is active, but it is slow. I also get about 10-15 events per second with baseType:RemoteDependencyData, and I am not sure whether that means it is still connected to the database (it should not be, since I should only be working with a new list of all videos) or something else.
After 7 minutes it is still processing, and that's the longest I have waited.
I am afraid to put this on my live site since it will eat up my resources like candy.
Instead of optimizing the linq you should optimize your database query.
Databases are great at optimized searches and creating subsets, and will most likely be faster than anything you write. If you need to create a subset based on more than one database parameter, I would recommend creating some indexes and using those.
Edit:
Example of a db query that would eliminate the first for loop (which is actually multiple nested loops, and is where the time delay comes from):
select * from videos where tag in (list of tags)
Edit2
To make sure this is most efficient, require the database to index on the TAGS column. To create the index:
CREATE INDEX video_tags_idx ON videos (tag)
Use 'explain' to see if the index is being used automatically (it should be):
explain select * from videos where tag in (list of tags)
If it doesn't show your index as being used you can look up the syntax to force the use of it.
The problem was not optimization but the way I was using Microsoft SQL, or rather my ApplicationDbContext.
I realized this when I found http://www.albahari.com/nutshell/predicatebuilder.aspx
The issue is that a keyword search can involve multiple keywords, and the code I wrote above does not run the filtering in SQL, which caused the long execution time.
Using the predicate builder, it is possible to create dynamic conditions in LINQ.
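As a rough illustration of what "dynamic conditions" means here, the sketch below ANDs one condition per keyword into a single expression tree. The And helper is a hypothetical stand-in written for this example, not the PredicateBuilder API itself, and the data is an in-memory list (a KeyValuePair of title and tag ids replaces VideoItem). Against a real database, whether the Invoke node translates to SQL depends on the provider, so treat this as a sketch only.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Linq.Expressions;
// Stand-in for VideoItem: Key is the title, Value is the list of tag ids.
using Video = System.Collections.Generic.KeyValuePair<string, System.Collections.Generic.List<int>>;

// Hypothetical helper: combines two predicates with AND into one expression tree.
static Expression<Func<T, bool>> And<T>(
    Expression<Func<T, bool>> left, Expression<Func<T, bool>> right)
{
    // Call the right-hand predicate with the left-hand predicate's parameter.
    var invoked = Expression.Invoke(right, left.Parameters.Cast<Expression>());
    return Expression.Lambda<Func<T, bool>>(
        Expression.AndAlso(left.Body, invoked), left.Parameters);
}

var videos = new List<Video>
{
    new Video("a", new List<int> { 1, 2 }),
    new Video("b", new List<int> { 2, 3 }),
    new Video("c", new List<int> { 1, 2, 3 }),
    new Video("d", new List<int> { 1, 3, 4 }),
};
var requiredTagIds = new[] { 1, 3 };

// Start from "always true" and add one condition per requested tag.
Expression<Func<Video, bool>> filter = v => true;
foreach (var tagId in requiredTagIds)
{
    filter = And(filter, v => v.Value.Contains(tagId));
}

// A single Where call with the combined predicate, instead of one pass per tag.
var titles = videos.AsQueryable().Where(filter).Select(v => v.Key).ToList();
Console.WriteLine(string.Join(", ", titles)); // prints "c, d"
```

The point of building one expression is that the whole filter is handed to the query provider in a single query, rather than loading every row first and filtering in a loop.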
I am developing an application that tracks changes to an object's properties. Each time an object's properties change, I create a new row in the table with the updated property values and an incremented revision.
I have a table that has a structure like the following:
Id (primary key, system generated)
UserFriendlyId (generated programmatically, it is the Id the user sees in the UI, it stays the same regardless of how many revisions an object goes through)
.... (misc properties)
Revision (int, incremented when an object's properties are changed)
To get the maximum revision for each UserFriendlyId, I do the following:
var latestIdAndRev = context.Rows.GroupBy(r => r.UserFriendlyId).Select(latest => new { UserFriendlyId = latest.Key, Revision = latest.Max(r=>r.Revision)}).ToList();
Then in order to get a collection of the Row objects, I do the following:
var latestRevs = context.Rows.Where(r => latestIdAndRev.Contains( new {UserFriendlyId=r.UserFriendlyId, Revision=r.Revision})).ToList();
Even though my table only has ~3K rows, the performance of the latestRevs statement is horrible (it takes several minutes to finish, if it doesn't time out first).
Any idea on what I might do differently to get better performance retrieving the latest revision for a collection of userfriendlyids?
To increase the performance of your query you should try to make the entire query run on the database. You have divided the query into two parts, and in the first query you pull all the revisions to the client side into latestIdAndRev. The second query, .Where(r => latestIdAndRev.Contains( ... )), will then translate into a SQL statement that is something like WHERE ... IN followed by a list of all the IDs that you are looking for.
You can combine the queries into a single query where you group by UserFriendlyId and then for each group select the row with the highest revision simply ordering the rows by Revision (descending) and picking the first row:
var latestRevs = context.Rows.GroupBy(
    r => r.UserFriendlyId,
    (key, rows) => rows.OrderByDescending(r => r.Revision).First()
).ToList();
This should generate pretty efficient SQL even though I have not been able to verify this myself. To further increase performance you should have a look at indexing the UserFriendlyId and the Revision columns but your results may vary. In general adding an index increases the time it takes to insert a row but may decrease the time it takes to find a row.
(General advice: Watch out for .Where(row => clientSideCollectionOfIds.Contains(row.Id)) because all the IDs will have to be included in the query. This is not a fault of the OR mapper.)
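To show what the grouped query returns, here is a small sketch that runs the same GroupBy shape over an in-memory array. The sample rows are made up, and tuples stand in for the Row entity. It says nothing about the SQL a particular provider would generate, which, as noted above, has not been verified.

```csharp
using System;
using System.Linq;

// Hypothetical sample rows: (Id, UserFriendlyId, Revision).
var rows = new[]
{
    (Id: 1, UserFriendlyId: "A", Revision: 1),
    (Id: 2, UserFriendlyId: "A", Revision: 2),
    (Id: 3, UserFriendlyId: "B", Revision: 1),
    (Id: 4, UserFriendlyId: "A", Revision: 3),
    (Id: 5, UserFriendlyId: "B", Revision: 2),
};

// Group by UserFriendlyId and keep, per group, the row with the highest Revision.
var latestRevs = rows
    .GroupBy(
        r => r.UserFriendlyId,
        (key, group) => group.OrderByDescending(r => r.Revision).First())
    .OrderBy(r => r.UserFriendlyId)   // only to make the printed order stable
    .ToList();

foreach (var r in latestRevs)
{
    Console.WriteLine($"{r.UserFriendlyId}: Id={r.Id}, Revision={r.Revision}");
}
// prints:
// A: Id=4, Revision=3
// B: Id=5, Revision=2
```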
There are a couple of things to look at, as you are likely ending up with serious recursion. If this is SQL Server, open profiler and start a profile on the database in question and then fire off the command. Look at what is being run, examine the execution plan, and see what is actually being run.
From this you MIGHT be able to use the index wizard to create a set of indexes that speeds things up. I say might, as the recursive nature of the query may not be easily solved.
If you want something that recurses to be wicked fast, invest in learning Window Functions. A few years back, we had a query that took up to 30 seconds reduced to milliseconds by heading that direction. NOTE: I am not stating this is your solution, just stating it is worth looking into if indexes alone do not meet your Service Level Agreements (SLAs).
I have a collection "collection_Save" in MongoDB that contains documents used to save the operations performed on other documents in another collection. They are listed in order of creation in the database.
In order to reverse those operations I need to run through the collection from the end to the start.
This is what I can't figure out how to do, since MongoCollection doesn't have an equivalent of a "Reverse" method.
I tried to create an index using the following code
collection_Save.CreateIndex(IndexKeys<SaveMongo>.Ascending(_ => _._id));
but I can't figure out how to use it (or if it is really helpful in my case).
I did find something that might be useful : MongoRestore, skip n first documents
However, the solutions there do not work in C#, and my low reputation prevents me from commenting on the post.
Do you know how to run through a collection in "reverse mode" ?
Just create an index on your date field like this:
db.collection_Save.createIndex({date: -1})
then you can query your collection in this way:
db.collection_Save.find().sort({date: -1}).skip(last_n).limit(1)
where last_n is the number (counted from the end) of the document you want to get.
I have the following query:
if (idUO > 0)
{
    query = query.Where(b => b.Product.Center.UO.Id == idUO);
}
else if (dependencyId > 0)
{
    query = query.Where(b => b.DependencyId == dependencyId );
}
else
{
    var dependencyIds = dependencies.Select(d => d.Id).ToList();
    query = query.Where(b => dependencyIds.Contains(b.DependencyId.Value));
}
[...] <- Other filters...
if (specialDateId != 0)
{
    query = query.Where(b => b.SpecialDateId == specialDateId);
}
So, I have other filters in this query, but at the end, I process the query in the database with:
return query.OrderBy(b => b.Date).Skip(20 * page).Take(20).ToList(); // the returned object is a Ticket object, that has 23 properties, 5 of them are relationships (FKs) and i fill 3 of these relationships with lazy loading
When I access the first page, it's OK: the query takes less than 1 second. But when I try to access page 30000, the query takes more than 20 seconds. Is there a way to improve the performance within the LINQ query, or only at the database level? And at the database level, what is the best way to improve performance for this kind of query?
There is not much room here, imo, to make things better (at least judging by the code provided).
When you're trying to achieve good performance with numbers like these, I would recommend not using LINQ at all, or at least using it only where the amount of data accessed is smaller.
What you can do here is introduce paging of that data at the database level, with a stored procedure, and invoke it from your C# code.
1- Create a view in DB which orders items by date including all related relationships, like Products etc.
2- Create a stored procedure querying this view with related parameters.
I would recommend that you pull up SQL Server Profiler, and run a profile on the server while you run the queries (both the fast and the slow).
Once you've done this, you can pull it into the Database Engine Tuning Advisor to get some tips about indexes that you should add. This has had great effect for me in the past. Of course, if you know what indexes you need, you can just add them without running the Advisor :)
I think you'll find that the bottleneck is occurring at the database. Here's why:
query.
You have your query, and the criteria. It goes to the database with a pretty ugly, but not too terrible select statement.
.OrderBy(b => b.Date)
Now you're ordering this giant recordset by date, which probably isn't a terrible hit because it's (hopefully) indexed on that field, but that does mean the entire set is going to be brought into memory and sorted before any skipping or taking occurs.
.Skip(20 * page).Take(20)
Ok, here's where it gets rough for the poor database. Entity is pretty awful at this sort of thing for large recordsets. I dare you to open sql profiler and view the random mess of sql it's sending over.
When you start skipping and taking, Entity usually sends queries that coerce the database into scanning the entire giant recordset until it finds what you are looking for. If that's the first ordered records in the recordset, say page 1, it might not take terribly long. By the time you're picking out page 30,000 it could be scanning a lot of data due to the way Entity has prepared your statement.
I highly recommend you take a look at the following link. I know it says 2005, but it's applicable to 2008 as well.
http://www.codeguru.com/csharp/.net/net_data/article.php/c19611/Paging-in-SQL-Server-2005.htm
Once you've read that link, you might want to consider how you can create a stored procedure to accomplish what you're going for. It will be more lightweight, have cached execution plans, and is pretty well guaranteed to return the data much faster for you.
Barring that, if you want to stick with LINQ, read up on Compiled Queries and make sure you're setting MergeOption.NoTracking for read-only operations. You should also try returning an Object Query with explicit Joins instead of an IQueryable with deferred loading, especially if you're iterating through the results and joining to other tables. Deferred Loading can be a real performance killer.
I have a process I've inherited that I'm converting to C# from another language. Numerous steps in the process loop through what can be a lot of records (100K-200K) to do calculations. As part of those processes it generally does a lookup into another list to retrieve some values. I would normally move this kind of thing into a SQL statement (and we have where we've been able to) but in these cases there isn't really an easy way to do that. In some places we've attempted to convert the code to a stored procedure and decided it wasn't working nearly as well as we had hoped.
Effectively, the code does this:
var match = cost.Where(r => r.ryp.StartsWith(record.form.TrimEnd()) &&
r.year == record.year &&
r.period == record.period).FirstOrDefault();
cost is a local List type. If I was doing a search on only one field I'd probably just move this into a Dictionary. The records aren't always unique either.
Obviously, this is REALLY slow.
I ran across the open source library I4O which can build indexes, however it fails for me in various queries (and I don't really have the time to attempt to debug the source code). It also doesn't work with .StartsWith or .Contains (StartsWith is much more important since a lot of the original queries take advantage of the fact that doing a search for "A" would find a match in "ABC").
Are there any other projects (open source or commercial) that do this sort of thing?
EDIT:
I did some searching based on the feedback and found Power Collections which supports dictionaries that have keys that aren't unique.
I tested ToLookup() which worked great - it's still not quite as fast as the original code, but it's at least acceptable. It's down from 45 seconds to 3-4 seconds. I'll take a look at the Trie structure for the other look ups.
Thanks.
Looping through a list of 100K-200K items doesn't take very long. Finding matching items within the list by using nested loops (n^2) does take long. I infer this is what you're doing (since you have assignment to a local match variable).
If you want to quickly match items together, use .ToLookup.
var lookup = cost.ToLookup(r => new {r.year, r.period, form = r.ryp});
foreach(var group in lookup)
{
    // do something with items in group.
}
Your StartsWith criterion is troublesome for key-based matching. One way to approach that problem is to ignore it when generating keys.
var lookup = cost.ToLookup(r => new {r.year, r.period });
var key = new {record.year, record.period};
string lookForThis = record.form.TrimEnd();
var match = lookup[key].FirstOrDefault(r => r.ryp.StartsWith(lookForThis));
Ideally, you would create the lookup once and reuse it for many queries. Even if you didn't... even if you created the lookup each time, it will still be faster than n^2.
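Put together, the idea might look like the sketch below. The field names (ryp, form, year, period) follow the question; the amount field, the sample values, and the use of tuples instead of the real record types are assumptions made for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical cost rows; amount is an invented payload field.
var cost = new[]
{
    (ryp: "ABC", year: 2020, period: 1, amount: 10),
    (ryp: "ABD", year: 2020, period: 1, amount: 20),
    (ryp: "ABC", year: 2020, period: 2, amount: 30),
    (ryp: "XYZ", year: 2021, period: 1, amount: 40),
};
// Hypothetical records to match against the cost rows.
var records = new[]
{
    (form: "ABD  ", year: 2020, period: 1),
    (form: "A", year: 2020, period: 2),
    (form: "Q", year: 2021, period: 1),
};

// Build the lookup once, keyed only on the exact-match fields.
var lookup = cost.ToLookup(r => (r.year, r.period));

var results = new List<string>();
foreach (var record in records)
{
    string lookForThis = record.form.TrimEnd();
    // A missing key yields an empty sequence, so indexing the lookup is safe.
    var match = lookup[(record.year, record.period)]
        .FirstOrDefault(r => r.ryp.StartsWith(lookForThis, StringComparison.Ordinal));
    // FirstOrDefault returns a default tuple (ryp == null) when nothing matches.
    results.Add(match.ryp == null
        ? $"{lookForThis}: no match"
        : $"{lookForThis}: {match.amount}");
}
Console.WriteLine(string.Join("; ", results)); // prints "ABD: 20; A: 30; Q: no match"
```

The prefix test still runs linearly, but only over the rows that share the same year and period rather than over the whole list.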
Certainly you can do better than this. Let's start by considering that dictionaries are not useful only when you want to query one field; you can very easily have a dictionary where the key is an immutable value that aggregates many fields. So for this particular query, an immediate improvement would be to create a key type:
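// Immutable composite key. Equals and GetHashCode give it value semantics
// so it can serve as a dictionary key (IEquatable<T> lives in System).
readonly struct Key : IEquatable<Key>
{
    public readonly int year;
    public readonly int period;
    public Key(int year, int period) { this.year = year; this.period = period; }
    public bool Equals(Key other) => year == other.year && period == other.period;
    public override bool Equals(object obj) => obj is Key other && Equals(other);
    public override int GetHashCode() => (year * 397) ^ period;
}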
and then package your data into an IDictionary<Key, ICollection<T>> or similar where T is the type of your current list. This way you can cut down heavily on the number of rows considered in each iteration.
The next step would be to use not an ICollection<T> as the value type but a trie (this looks promising), which is a data structure tailored to finding strings that have a specified prefix.
Finally, a free micro-optimization would be to take the TrimEnd out of the loop.
Now certainly all of this only applies to the specific example given and may need to be revisited due to other specifics of your situation, but in any case you should be able to extract practical gain from this or something similar.