Run Through a MongoDB collection in reverse mode

Run Through a MongoDB collection in reverse mode - c#

I have a collection "collection_Save" in mongoDB that contains documents
that are used to save the operations that occur on others documents in an other collection. They are listed by order of creation in the database.
In order to reverse those operations I need to run through the collection from the end to the start.
This is where I can't figure out how to do. Since MongoCollection doesn't have the equivalent of a "Reverse" method.
I tried to create an index using the following code
collection_Save.CreateIndex(IndexKeys<SaveMongo>.Ascending(_ => _._id));
but I can't figure out how to use it (or if it is really helpful in my case).
I did find something that might be useful : MongoRestore, skip n first documents
However they are not working in c# and my low reputation prevents me from commenting the post.
Do you know how to run through a collection in "reverse mode" ?

Just create an index on your date field like this:
db.collection_Save.createIndex({date: -1})
then you can query your collection in this way:
db.collection_Save.find().sort({date: -1}).skip(last_n).limit(1)
where last_n is the number (counted from the end) of the document you want to get.

Related

Getting a list of objects from mongodb with a filter

I am rather new to c# and MongoDb in particular. I have a method GetAllTickets() that returns a List<Ticket> with all entries of a MongoDb collection called "tickets". I want to be able to only get the entries with a certain filter from the collection in a different method called GetTicketsWithStatus(Status status) (where Status is an enum). How can I do this?
I have tried List<Ticket> tickets = collectionOfTickets.Find(x => x.ticketStatus == status).ToList<Ticket>(); (where collectionOfTickets is an IMongoCollection<Ticket> but that doesnt get any objects in the List. Any help will be appreciated.
P.S.: GetAllTickets() uses List<Ticket> tickets = collectionOfTickets.AsQueryable().ToList<Ticket>(); and it fills in the List properly. It's only when I try to use a filter that everything breaks and the list is not filled in.

I managed to solve this, if someone else has this problem in the future.
First of all, my Ticket model had enums in them, which at first I had (wrongly) saved as strings in mongoDb, then saved those strings in the model, then used methods to transform those strings into enums. I just turned all strings in mongodb into ints and saved the numerical position of the enum in the mongodb doc. Much easier to translate into enum, and makes the Linq expression above work, so it got solved.
Second of all, I tried using Builders to create a filter variable, but for some reason my Find() method would not accept that filter.
Overall, the problem lied in the way I had stored enum values in mongoDb, and how i was translating them on my model

LINQ Optimization for searching a if an object exist in a list within a list

Currently I have 7,000 video entries and I have a hard time optimizing it to search for Tags and Actress.
This is my code I am trying to modify, I tried using HashSet. It is my first time using it but I don't think I am doing it right.
Dictionary dictTag = JsonPairtoDictionary(tagsId,tagsName);
Dictionary dictActresss = JsonPairtoDictionary(actressId, actressName);
var listVid = new List<VideoItem>(db.VideoItems.ToList());
HashSet<VideoItem> lll = new HashSet<VideoItem>(listVid);
foreach (var tags in dictTag)
{
lll = new HashSet<VideoItem>(lll.Where(q => q.Tags.Exists(p => p.Id == tags.Key)));
}
foreach (var actress in dictActresss)
{
listVid = listVid.Where(q => q.Actress.Exists(p => p.Id == actress.Key)).ToList();
}
First part I get all the Videos in Db by using db.VideoItems.ToList()
Then it will go through a loop to check if a Tag exist
For each VideoItem it has a List<Tags> and I use 'exist' to check if a tag is match.
Then same thing with Actress.
I am not sure if its because I am in Debug mode and ApplicationInsight is active but it is slow. And I will get like 10-15 events per second with baseType:RemoteDependencyData which I am not sure if it means it still connected to database (should not be since I only should only be messing with the a new list of all videos) or what.
After 7 mins it is still processing and that's the longest time I have waited.
I am afraid to put this on my live site since this will eat up my resource like candy

Instead of optimizing the linq you should optimize your database query.
Databases are great at optimized searches and creating subsets and will most likely be faster than anything you write. If you have need to create a subset based on more than on database parameter I would recommend looking into creating some indexes and using those.
Edit:
Example of db query that would eliminate first for loop (which is actually multiple nested loops and where the time delay comes from):
select * from videos where tag in [list of tags]
Edit2
To make sure this is most efficient, require the database to index on the TAGS column. To create the index:
CREATE INDEX video_tags_idx ON videos (tag)
Use 'explains' to see if the index is being used automatically (it should be)
explain select * from videos where tag in [list of tags]
If it doesn't show your index as being used you can look up the syntax to force the use of it.

The problem was not optimization but it was utilization of the Microsoft SQL or my ApplicationDbContext.
I found this when I realize that http://www.albahari.com/nutshell/predicatebuilder.aspx
Because the problem with Keyword search, there can be multiple keywords, and the code I made above doesn't utilize the SQL which made the long execution time.
Using the predicate builder, it will be possible to create dynamic conditions in LINQ

How to use MongoDB as unique/enumeration store

This seems to be like a common use case... but somehow I cannot get it working.
I'm attempting to use MongoDB as an enumeration store with unique items. I've created a collection with a byte[] Id (the unique ID) and a timestamp (a long, used for enumeration). The store is quite big (terabytes) and distributed among different servers. I am able to re-build the store from scratch currently, since I'm still in the testing phase.
What I want to do is two things:
Create a unique id for each item that I insert. This basically means that if I insert the same ID twice, MongoDB will detect this and give an error. This approach seems to work fine.
Continuously enumerate the store for new items by other processes. The approach I took was to add a second index to InsertID and used a high precision timestamp on this along with the server id and a counter (just to make it unique and ascending).
In the best scenario this would mean that the enumerator would keep track of an index cursor for every server. From what I've learned from mongodb query processing I expected this behavior. However, when I try to execute the code (below) it seems to take forever to get anything.
long lastid = 0;
while (true)
{
DateTime first = DateTime.UtcNow;
foreach (var item in collection.FindAllAs<ContentItem>().OrderBy((a)=>(a.InsertId)).Take(100))
{
lastid = item.InsertId;
}
Console.WriteLine("Took {0:0.00} for 100", (DateTime.UtcNow - first).TotalSeconds);
}
I've read about cursors, but am unsure if they fulfill the requirements when new items are inserted into the store.
As I said, I'm not bound to any table structure or something like that... the only things that are important is that I can get new items over time and without getting duplicate items.
-Stefan.

Somehow I figured it out... more or less...
I created the query manually and ended up with something like this:
db.documents.find({ "InsertId" : { "$gt" : NumberLong("2020374866209304106") } }).limit(10).sort({ "InsertId" : 1 });
The LINQ query I put in the question doesn't generate this query. After some digging in the code I found that it should be this LINQ query:
foreach (var item in collection.AsQueryable().Where((a)=>(a.InsertId > lastid)).OrderBy((a) => (a.InsertId)).Take(100))
The AsQueryable() seems to be the key to execute the rewriting of LINQ to MongoDB queries.
This gives results, but still they appeared to be slow (4 secs for 10 results, 30 for 100). However, when I added 'explain()' I noticed '0 millis' in the query execution.
I stopped the process doing bulk inserts and tada, it works, and fast. In other words: the issues I was having were due to the locking behavior of MongoDB, and due to the way I interpreted the linq implementation. Since the former is the result of initial bulk-filling the data store, this means that the problem is solved.
On the 'negative' part of the solution: I would have preferred a solution that involved serializable cursors or something like that... this 'take' solution has to iterate the b-tree over and over again. If someone has an answer for this, please let me know.
-Stefan.

What is the fastest way to search a List<T> across multiple properties?

I have a process I've inherited that I'm converting to C# from another language. Numerous steps in the process loop through what can be a lot of records (100K-200K) to do calculations. As part of those processes it generally does a lookup into another list to retrieve some values. I would normally move this kind of thing into a SQL statement (and we have where we've been able to) but in these cases there isn't really an easy way to do that. In some places we've attempted to convert the code to a stored procedure and decided it wasn't working nearly as well as we had hoped.
Effectively, the code does this:
var match = cost.Where(r => r.ryp.StartsWith(record.form.TrimEnd()) &&
r.year == record.year &&
r.period == record.period).FirstOrDefault();
cost is a local List type. If I was doing a search on only one field I'd probably just move this into a Dictionary. The records aren't always unique either.
Obviously, this is REALLY slow.
I ran across the open source library I4O which can build indexes, however it fails for me in various queries (and I don't really have the time to attempt to debug the source code). It also doesn't work with .StartsWith or .Contains (StartsWith is much more important since a lot of the original queries take advantage of the fact that doing a search for "A" would find a match in "ABC").
Are there any other projects (open source or commercial) that do this sort of thing?
EDIT:
I did some searching based on the feedback and found Power Collections which supports dictionaries that have keys that aren't unique.
I tested ToLookup() which worked great - it's still not quite as fast as the original code, but it's at least acceptable. It's down from 45 seconds to 3-4 seconds. I'll take a look at the Trie structure for the other look ups.
Thanks.

Looping through a list of 100K-200K items doesn't take very long. Finding matching items within the list by using nested loops (n^2) does take long. I infer this is what you're doing (since you have assignment to a local match variable).
If you want to quickly match items together, use .ToLookup.
var lookup = cost.ToLookup(r => new {r.year, r.period, form = r.ryp});
foreach(var group in lookup)
{
// do something with items in group.
}
Your startswith criteria is troublesome for key-based matching. One way to approach that problem is to ignore it when generating keys.
var lookup = cost.ToLookup(r => new {r.year, r.period });
var key = new {record.year, record.period};
string lookForThis = record.form.TrimEnd();
var match = lookup[key].FirstOrDefault(r => r.ryp.StartsWith(lookForThis))
Ideally, you would create the lookup once and reuse it for many queries. Even if you didn't... even if you created the lookup each time, it will still be faster than n^2.

Certainly you can do better than this. Let's start by considering that dictionaries are not useful only when you want to query one field; you can very easily have a dictionary where the key is an immutable value that aggregates many fields. So for this particular query, an immediate improvement would be to create a key type:
// should be immutable, GetHashCode and Equals should be implemented, etc etc
struct Key
{
public int year;
public int period;
}
and then package your data into an IDictionary<Key, ICollection<T>> or similar where T is the type of your current list. This way you can cut down heavily on the number of rows considered in each iteration.
The next step would be to use not an ICollection<T> as the value type but a trie (this looks promising), which is a data structure tailored to finding strings that have a specified prefix.
Finally, a free micro-optimization would be to take the TrimEnd out of the loop.
Now certainly all of this only applies to the specific example given and may need to be revisited due to other specifics of your situation, but in any case you should be able to extract practical gain from this or something similar.

Fastest way to find objects from a collection matched by condition on string member

Suppose I have a collection (be it an array, generic List, or whatever is the fastest solution to this problem) of a certain class, let's call it ClassFoo:
class ClassFoo
{
public string word;
public float score;
//... etc ...
}
Assume there's going to be like 50.000 items in the collection, all in memory.
Now I want to obtain as fast as possible all the instances in the collection that obey a condition on its bar member, for example like this:
List<ClassFoo> result = new List<ClassFoo>();
foreach (ClassFoo cf in collection)
{
if (cf.word.StartsWith(query) || cf.word.EndsWith(query))
result.Add(cf);
}
How do I get the results as fast as possible? Should I consider some advanced indexing techniques and datastructures?
The application domain for this problem is an autocompleter, that gets a query and gives a collection of suggestions as a result. Assume that the condition doesn't get any more complex than this. Assume also that there's going to be a lot of searches.

With the constraint that the condition clause can be "anything", then you're limited to scanning the entire list and applying the condition.
If there are limitations on the condition clause, then you can look at organizing the data to more efficiently handle the queries.
For example, the code sample with the "byFirstLetter" dictionary doesn't help at all with an "endsWith" query.
So, it really comes down to what queries you want to do against that data.
In Databases, this problem is the burden of the "query optimizer". In a typical database, if you have a database with no indexes, obviously every query is going to be a table scan. As you add indexes to the table, the optimizer can use that data to make more sophisticated query plans to better get to the data. That's essentially the problem you're describing.
Once you have a more concrete subset of the types of queries then you can make a better decision as to what structure is best. Also, you need to consider the amount of data. If you have a list of 10 elements each less than 100 byte, a scan of everything may well be the fastest thing you can do since you have such a small amount of data. Obviously that doesn't scale to a 1M elements, but even clever access techniques carry a cost in setup, maintenance (like index maintenance), and memory.
EDIT, based on the comment
If it's an auto completer, if the data is static, then sort it and use a binary search. You're really not going to get faster than that.
If the data is dynamic, then store it in a balanced tree, and search that. That's effectively a binary search, and it lets you keep add the data randomly.
Anything else is some specialization on these concepts.

var Answers = myList.Where(item => item.bar.StartsWith(query) || item.bar.EndsWith(query));
that's the easiest in my opinion, should execute rather quickly.

Not sure I understand... All you can really do is optimize the rule, that's the part that needs to be fastest. You can't speed up the loop without just throwing more hardware at it.
You could parallelize if you have multiple cores or machines.

I'm not up on my Java right now, but I would think about the following things.
How you are creating your list? Perhaps you can create it already ordered in a way which cuts down on comparison time.
If you are just doing a straight loop through your collection, you won't see much difference between storing it as an array or as a linked list.
For storing the results, depending on how you are collecting them, the structure could make a difference (but assuming Java's generic structures are smart, it won't). As I said, I'm not up on my Java, but I assume that the generic linked list would keep a tail pointer. In this case, it wouldn't really make a difference. Someone with more knowledge of the underlying array vs linked list implementation and how it ends up looking in the byte code could probably tell you whether appending to a linked list with a tail pointer or inserting into an array is faster (my guess would be the array). On the other hand, you would need to know the size of your result set or sacrifice some storage space and make it as big as the whole collection you are iterating through if you wanted to use an array.
Optimizing your comparison query by figuring out which comparison is most likely to be true and doing that one first could also help. ie: If in general 10% of the time a member of the collection starts with your query, and 30% of the time a member ends with the query, you would want to do the end comparison first.

For your particular example, sorting the collection would help as you could binarychop to the first item that starts with query and terminate early when you reach the next one that doesn't; you could also produce a table of pointers to collection items sorted by the reverse of each string for the second clause.
In general, if you know the structure of the query in advance, you can sort your collection (or build several sorted indexes for your collection if there are multiple clauses) appropriately; if you do not, you will not be able to do better than linear search.

If it's something where you populate the list once and then do many lookups (thousands or more) then you could create some kind of lookup dictionary that maps starts with/ends with values to their actual values. That would be a fast lookup, but would use much more memory. If you aren't doing that many lookups or know you're going to be repopulating the list at least semi-frequently I'd go with the LINQ query that CQ suggested.

You can create some sort of index and it might get faster.
We can build a index like this:
Dictionary<char, List<ClassFoo>> indexByFirstLetter;
foreach (var cf in collection) {
indexByFirstLetter[cf.bar[0]] = indexByFirstLetter[cf.bar[0]] ?? new List<ClassFoo>();
indexByFirstLetter[cf.bar[0]].Add(cf);
indexByFirstLetter[cf.bar[cf.bar.length - 1]] = indexByFirstLetter[cf.bar[cf.bar.Length - 1]] ?? new List<ClassFoo>();
indexByFirstLetter[cf.bar[cf.bar.Length - 1]].Add(cf);
}
Then use the it like this:
foreach (ClasssFoo cf in indexByFirstLetter[query[0]]) {
if (cf.bar.StartsWith(query) || cf.bar.EndsWith(query))
result.Add(cf);
}
Now we possibly do not have to loop through as many ClassFoo as in your example, but then again we have to keep the index up to date. There is no guarantee that it is faster, but it is definately more complicated.

Depends. Are all your objects always going to be loaded in memory? Do you have a finite limit of objects that may be loaded? Will your queries have to consider objects that haven't been loaded yet?
If the collection will get large, I would definitely use an index.
In fact, if the collection can grow to an arbitrary size and you're not sure that you will be able to fit it all in memory, I'd look into an ORM, an in-memory database, or another embedded database. XPO from DevExpress for ORM or SQLite.Net for in-memory database comes to mind.
If you don't want to go this far, make a simple index consisting of the "bar" member references mapping to class references.

If the set of possible criteria is fixed and small, you can assign a bitmask to each element in the list. The size of the bitmask is the size of the set of the criteria. When you create an element/add it to the list, you check which criteria it satisfies and then set the corresponding bits in the bitmask of this element. Matching the elements from the list will be as easy as matching their bitmasks with the target bitmask. A more general method is the Bloom filter.

Develop Reference

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.