Please see the following post for some background: MongoDB C# Driver - Return last modified rows only
After almost two years of running this code, we've recently been experiencing performance problems. As much as I keep saying that the code is not the issue, Infrastructure insists it's because I'm doing full table scans.
The thing is that the problem is environment specific. Our QA environment runs like a dream all the time, but Dev and Prod are very slow at times and fine at others - it's very erratic. They have the same data and code, but Dev and Prod have another app that also runs against the database.
My data has an Id as well as an _id (or AuditId) - I group the data by Id and then return the last _id for that record where it was not deleted. We have multiple historic records for the same ID and I would like to return the last one (see original post).
So I have the following method:
private static FilterDefinition<T> ForLastAuditIds<T>(IMongoCollection<T> collection) where T : Auditable, IMongoAuditable
{
var pipeline = new[]
{
    new BsonDocument
    {
        { "$group", new BsonDocument
            {
                { "_id", "$Id" },
                { "LastAuditId", new BsonDocument { { "$max", "$_id" } } }
            }
        }
    }
};
var lastAuditIds = collection.Aggregate<Audit>(pipeline)
    .ToListAsync().Result
    .Select(_ => _.LastAuditId)
    .ToList();
var forLastAuditIds = Builders<T>.Filter.Where(_ => lastAuditIds.Contains(_.AuditId) && _.Status != "DELETE");
return forLastAuditIds;
}
This method is called by the one below, which accepts an Expression that it appends to the FilterDefinition created by ForLastAuditIds.
protected List<T> GetLatest<T>(IMongoCollection<T> collection,
Expression<Func<T, bool>> filter, ProjectionDefinition<T, T> projection = null,
bool disableRoleCheck = false) where T : Auditable, IMongoAuditable
{
var forLastAuditIds = ForLastAuditIds(collection);
var limitedList = (
projection != null
? collection.Find(forLastAuditIds & filter, new FindOptions()).Project(projection)
: collection.Find(forLastAuditIds & filter, new FindOptions())
).ToListAsync().Result.ToList();
return limitedList;
}
Now, all of this works really well and is re-used by all of my code that calls Collections, but this specific collection is a lot bigger than the others and we are getting slowdowns just on that one.
My question is: Is there a way for me to take the aggregate and Filter Builder and combine them to return a single FilterDefinition that I could use without running the full table scan first?
I really hope I am making sense.
Assuming I fully understand what you want, this should be as easy as the following:
First, put a descending index on the LastAuditId field:
db.collection.createIndex({ "LastAuditId": -1 /* for sorting */ })
Or even extend the index to cover other fields that you have in your filter:
db.collection.createIndex({ "Status": 1, "LastAuditId": -1 /* for sorting */ })
Make sure, however, that you understand how indexes can/cannot support certain queries. And always use explain() to see what's really going on.
The next step is to realize that you must always filter as much as possible as the very first step to reduce the amount of sorting required.
So, if you need to e.g. filter by Name, then by all means do it as the very first step if your business requirements permit it. Be careful, however: filtering at the start changes your semantics, in the sense that you will get the last modified document per each Id that passed the preceding $match stage, as opposed to the last document per each Id that happens to also pass a subsequent $match stage.
Anyway, most importantly, once you've got a sorted set, you can easily and quickly get the latest full document by using $group with $first which - with the right index in place - will not do a collection scan anymore (it'll be an index scan for now and hence way faster).
Finally, you want to run the equivalent of the following MongoDB query through C# leveraging the $$ROOT variable in order to avoid a second query (I can put the required code together for you once you post your Audit, Auditable and IMongoAuditable types as well as any potential serializers/conventions):
db.getCollection('collection').aggregate({
$match: {
/* some criteria that you currently get in the "Expression<Func<T, bool>> filter" */
}
}, {
$sort: {
"ModifiedDate": -1 // this will use the index!
}
}, {
$group: {
"_id": "$Id",
"document": { $first: "$$ROOT" } // no need to do a separate subsequent query or a $max/$min across the entire group because we're sorted!
}
}, {
$match: { // some additional filtering depending on your needs
"document.Status": { $ne: "DELETE" }
}
})
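In the meantime, here is a rough sketch of what that pipeline could look like through the C# driver using raw BsonDocument stages. The field names and the DELETE status are taken from the examples above and your question, so treat this as a starting point rather than a drop-in:

// A sketch only - swap the empty $match for your dynamically built filter.
var pipeline = new[]
{
    new BsonDocument("$match", new BsonDocument()),                  // your criteria here
    new BsonDocument("$sort", new BsonDocument("ModifiedDate", -1)), // uses the index
    new BsonDocument("$group", new BsonDocument
    {
        { "_id", "$Id" },
        { "document", new BsonDocument("$first", "$$ROOT") }         // full latest document
    }),
    new BsonDocument("$match",
        new BsonDocument("document.Status", new BsonDocument("$ne", "DELETE")))
};

// One round trip; each result carries the latest document under "document".
var latest = collection.Aggregate<BsonDocument>(pipeline).ToList();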
Lastly, kindly note that it might be a good idea to move to the latest version of MongoDB because they are currently putting a lot of effort into optimizing aggregation cases like yours, e.g. this one: https://jira.mongodb.org/browse/SERVER-9507
I am writing a weather app and need to go through two nested loops. For the return value I want to iterate over the first list, looking at the corresponding second list's data. When the data in the second list matches the condition, I need to get data from the corresponding first list. I think my code works... but I would like to ask whether this is a good way to do it. I am also not sure whether this LINQ query will work in general, with even more nested lists. Here's my approach in LINQ:
public static async Task<string> UpdateWeather(string lat, string lon)
{
WeatherObject weather = await WeatherAPI.GetWeatherAsync(lat, lon);
var first = (from l in weather.list
from w in l.weather
where w.id == 800
select l.city.name).First();
return first;
}
Your code is OK - it is a valid LINQ query. But one more thing: use FirstOrDefault() instead of First(). First() will throw an exception if no matching element is found, whereas FirstOrDefault() will return the element or the type's default value.
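For example, with an empty list:

// First() throws when the sequence is empty; FirstOrDefault() does not.
var empty = new List<string>();
var safe = empty.FirstOrDefault(); // null (the default value for a reference type)
var boom = empty.First();          // throws InvalidOperationException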
You can also write in LINQ Method syntax if you prefer this.
public static async Task<string> UpdateWeather(string lat, string lon)
{
WeatherObject weather = await WeatherAPI.GetWeatherAsync(lat, lon);
var first = weather.list.Where(l => l.weather.Any(w => w.id == 800))
.Select(l => l.city.name)
.FirstOrDefault();
return first;
}
I believe your query should work, and it should generally work with more nested lists following a similar structure. As to whether it is a good way to do this - it depends on the data structure and any data constraints.
For example, if two elements in weather.list each have a nested weather element with the same id, then your code will only return the first one - which may not be correct.
e.g. in json:
[
    {
        "city": {
            "name": "Chicago"
        },
        "weather": [
            { "id": 799 },
            { "id": 800 }
        ]
    },
    {
        "city": {
            "name": "New York"
        },
        "weather": [
            { "id": 800 },
            { "id": 801 }
        ]
    }
]
For this dataset, your code will return "Chicago", but "New York" also matches. This may not be possible with the data API you are accessing, but given that there are no data constraints to ensure exclusivity of the nested lists, you might want to defensively check that only 0 or 1 elements match the expected criteria (see the sketch below).
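For instance, a defensive variant that materializes the matches first (a sketch reusing the names from the question):

var matches = weather.list
    .Where(l => l.weather.Any(w => w.id == 800))
    .Select(l => l.city.name)
    .ToList();

if (matches.Count > 1)
{
    // More than one city matched - log, throw, or decide explicitly
    // instead of silently taking the first.
}

var name = matches.FirstOrDefault();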
Another suggestion
On another note, not strictly an answer to your question - if you think your code will work but aren't sure, write a unit test. In this case, you'd wrap the call to WeatherAPI in a class that implements an interface you define. Update your method to call the method on a reference to the interface.
For your real application, ensure that an instance of the wrapper/proxy class is set on the reference.
For the unit test, use a framework like Moq to create a mock implementation of the interface that returns a known set of data and use that instead. You can then define a suite of unit tests that use mocks that return different data structures and ensure your code works under all expected structures.
This will be a lot easier if your method is not static, and if you can use dependency injection (Ninject, Autofac, or one of many others...) to manage injecting the appropriate implementation of the service.
Further explanations of unit testing, dependency injection and mocking will take more than I can write in this answer, but I recommend reading up on it - you'll never find yourself thinking "I think this code works" again!
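To make that concrete, here is a minimal sketch of such a test. The IWeatherService interface and WeatherUpdater class are hypothetical names introduced for illustration; they wrap the static WeatherAPI call as described above:

using System.Linq;
using System.Threading.Tasks;
using Moq;

// Hypothetical abstraction over the static WeatherAPI call.
public interface IWeatherService
{
    Task<WeatherObject> GetWeatherAsync(string lat, string lon);
}

// The method under test now depends on the interface, not the static class.
public class WeatherUpdater
{
    private readonly IWeatherService _weatherService;

    public WeatherUpdater(IWeatherService weatherService)
    {
        _weatherService = weatherService;
    }

    public async Task<string> UpdateWeather(string lat, string lon)
    {
        WeatherObject weather = await _weatherService.GetWeatherAsync(lat, lon);
        return weather.list.Where(l => l.weather.Any(w => w.id == 800))
                           .Select(l => l.city.name)
                           .FirstOrDefault();
    }
}

// In a test, Moq returns a known WeatherObject instead of hitting the real API:
var mock = new Mock<IWeatherService>();
mock.Setup(s => s.GetWeatherAsync(It.IsAny<string>(), It.IsAny<string>()))
    .ReturnsAsync(knownWeatherObject); // build a WeatherObject covering the case under test

var result = await new WeatherUpdater(mock.Object).UpdateWeather("0", "0");
// assert result equals the expected city name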
I have been trying to pass in a dynamic list of Expressions to a MongoDB C# Driver query using Linq ... This method works for me with regular Linq queries against an ORM, for example, but results in an error when applied to a MongoDB query ... (FYI: I am also using LinqKit's PredicateBuilder)
//
// I create a List of Expressions which I can then add individual predicates to on an
// "as-needed" basis.
var filters = new List<Expression<Func<Session, Boolean>>>();
//
// If the Region DropDownList returns a value then add an expression to match it.
// (the WebFormsService is a home built service for extracting data from the various
// WebForms Server Controls... in case you're wondering how it fits in)
if (!String.IsNullOrEmpty(WebFormsService.GetControlValueAsString(this.ddlRegion)))
{
String region = WebFormsService.GetControlValueAsString(this.ddlRegion).ToLower();
filters.Add(e => e.Region.ToLower() == region);
}
//
// If the StartDate has been specified then add an expression to match it.
if (this.StartDate.HasValue)
{
Int64 startTicks = this.StartDate.Value.Ticks;
filters.Add(e => e.StartTimestampTicks >= startTicks);
}
//
// If the EndDate has been specified then add an expression to match it.
if (this.EndDate.HasValue)
{
Int64 endTicks = this.EndDate.Value.Ticks;
filters.Add(e => e.StartTimestampTicks <= endTicks);
}
//
// Pass the Expression list to the method that executes the query
var data = SessionMsgsDbSvc.GetSessionMsgs(filters);
The GetSessionMsgs() method is defined in a Data services class ...
public class SessionMsgsDbSvc
{
public static List<LocationOwnerSessions> GetSessionMsgs(List<Expression<Func<Session, Boolean>>> values)
{
//
// Using the LinqKit PredicateBuilder I simply add the provided expressions
// into a single "AND" expression ...
var predicate = PredicateBuilder.True<Session>();
foreach (var value in values)
{
predicate = predicate.And(value);
}
//
// ... and apply it as I would to any Linq query, in the Where clause.
// Additionally, using the Select clause I project the results into a
// pre-defined data transfer object (DTO) and only the DISTINCT DTOs are returned
var query = ApplCoreMsgDbCtx.Sessions.AsQueryable()
.Where(predicate)
.Select(e => new LocationOwnerSessions
{
AssetNumber = e.AssetNumber,
Owner = e.LocationOwner,
Region = e.Region
})
.Distinct();
var data = query.ToList();
return data;
}
}
Using the LinqKit PredicateBuilder I simply add the provided expressions into a single "AND" expression ... and apply it as I would to any Linq query, in the Where() clause. Additionally, using the Select() clause I project the results into a pre-defined data transfer object (DTO) and only the DISTINCT DTOs are returned.
This technique typically works when I am going against my Telerik ORM Context Entity collections ... but when I run this against the Mongo document collection I get the following error ...
Unsupported filter: Invoke(e => (e.Region.ToLower() == "central"),
{document})
There is certainly something going on beneath the covers that I am unclear on. In the C# Driver for MongoDB documentation I found the following NOTE ...
"When projecting scalars, the driver will wrap the scalar into a
document with a generated field name because MongoDB requires that
output from an aggregation pipeline be documents"
But honestly I am not sure what that necessarily means or whether it's related to this problem. The appearance of "{document}" in the error suggests that it might be relevant though.
Any additional thoughts or insight would be greatly appreciated though. Been stuck on this for the better part of 2 days now ...
I did find this post but so far am not sure how the accepted solution is much different than what I have done.
I'm coming back to revisit this after 4 years because, while my original supposition did work, it worked the wrong way: it was pulling back all the records from Mongo and then filtering them in memory, and to compound matters it was making a synchronous call into the database, which is always a bad idea.
The magic happens in LinqKit's Expand() extension method, which flattens the invocation expression tree into something the Mongo driver can understand and thus act upon:

.Where(predicate.Expand())
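Putting it together, a sketch of the corrected method - async this time, per the above. This assumes Sessions.AsQueryable() yields the MongoDB driver's IMongoQueryable (using MongoDB.Driver.Linq; using LinqKit;):

public static async Task<List<LocationOwnerSessions>> GetSessionMsgs(
    List<Expression<Func<Session, Boolean>>> values)
{
    var predicate = PredicateBuilder.True<Session>();
    foreach (var value in values)
    {
        predicate = predicate.And(value);
    }

    // Expand() flattens the Invoke() nodes PredicateBuilder emits into plain
    // expression trees the MongoDB LINQ provider can translate server side.
    return await ApplCoreMsgDbCtx.Sessions.AsQueryable()
        .Where(predicate.Expand())
        .Select(e => new LocationOwnerSessions
        {
            AssetNumber = e.AssetNumber,
            Owner = e.LocationOwner,
            Region = e.Region
        })
        .Distinct()
        .ToListAsync();
}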
I'm using the C# driver (v1.8.3 from NuGet), and having a hard time determining if an $addToSet/upsert operation actually added a NEW item into the given array, or if the item already existed.
Adding a new item could fall into two cases, either the document didn't exist at all and was just created by the upsert, or the document existed but the array didn't exist or didn't contain the given item.
The reason I need to do this, is that I have large sets of data to load into MongoDB, which may (shouldn't, but may) break during processing. If this happens, I need to be able to start back up from the beginning without doing duplicate downstream processing (keep processing idempotent). In my flow, if an item is determined to be newly added, I queue up downstream processing of that given item, if it is determined to already have been added in the doc, then no more downstream work is required. My issue is that the result always returns saying that the call modified one document, even if the item was already existing in the array and nothing was actually modified.
Based on my understanding of the C# driver api, I should be able to make the call with WriteConcern.Acknowledged, and then check the WriteConcernResult.DocumentsAffected to see if it indeed updated a document or not.
My issue is that in all cases, the write concern result is returning back that 1 document was updated. :/
Here is an example document that my code is calling $addToSet on, which may or may not have this specific item in the "items" list to start with:
{
    "_id" : "some-id-that-we-know-wont-change",
    "items" : [
        {
            "s" : 4,
            "i" : "some-value-we-know-is-static"
        }
    ]
}
My query always uses an _id value which is known based on the processing metadata:
var query = new QueryDocument
{
{"_id", "some-id-that-we-know-wont-change"}
};
My update is as follows:
var result = mongoCollection.Update(query, new UpdateDocument()
{
{
"$addToSet", new BsonDocument()
{
{ "items", new BsonDocument()
{
{ "s", 4 },
{ "i", "some-value-we-know-is-static" }
}
}
}
}
}, new MongoUpdateOptions() { Flags = UpdateFlags.Upsert, WriteConcern = WriteConcern.Acknowledged });
if(result.DocumentsAffected > 0 || result.UpdatedExisting)
{
//DO SOME POST PROCESSING WORK THAT SHOULD ONLY HAPPEN ONCE PER ITEM
}
If I run this code one time on an empty collection, the document is added and the response is as expected (DocumentsAffected = 1, UpdatedExisting = false). If I run it again (any number of times), the document remains unchanged, but the result is now unexpected (DocumentsAffected = 1, UpdatedExisting = true).
Shouldn't this be returning DocumentsAffected = 0 if the document is unchanged?
As we need to do many millions of these calls a day, I'm hesitant to turn this logic into multiple calls per item (first checking if the item exists in the given documents array, and then adding/queuing or just skipping) if at all possible.
Is there some way to get this working in a single call?
Of course, what you are doing here is checking the response, which does indicate whether a document was updated or inserted, or in fact whether neither operation happened. That is your best indicator, since for $addToSet to have performed an update, the document must actually have been modified.
The $addToSet operator itself cannot produce duplicates, that is the nature of the operator. But you may indeed have some problems with your logic:
{
"$addToSet", new BsonDocument()
{
{ "items", new BsonDocument()
{
{ "id", item.Id },
{ "v", item.Value }
}
}
}
}
So clearly you are showing that an item in your "set" is composed of two fields, so if that content varies in any way (i.e. same id but different value) then the item is actually a "unique" member of the set and will be added. There is no way, for instance, for the $addToSet operator to skip adding new values purely based on the "id" as a unique identifier. You would have to actually roll that in code, as sketched below.
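One way to do that, sketched against the 1.8.x API used in the question (item stands for the same two-field element as above):

// Set semantics on "id" alone: the $ne clause makes this a no-op when an
// element with that id is already present, whatever its "v" value is.
var query = Query.And(
    Query.EQ("_id", "some-id-that-we-know-wont-change"),
    Query.NE("items.id", item.Id));

var update = Update.Push("items", new BsonDocument
{
    { "id", item.Id },
    { "v", item.Value }
});

// Deliberately NOT an upsert: when the element already exists the query
// matches nothing, and an upsert would then collide on _id.
var result = mongoCollection.Update(query, update, WriteConcern.Acknowledged);
// result.DocumentsAffected == 1 only when the element was actually added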
A second possibility here for a form of duplicate is that your query portion is not correctly finding the document that has to be updated. The result of this would be creating a new document that contains only the newly specified member in the "set". So a common usage mistake is something like this:
db.collection.update(
    {
        "id": ABC,
        "items": {
            "$elemMatch": { "id": 123, "v": 10 }
        }
    },
    {
        "$addToSet": {
            "items": { "id": 123, "v": 10 }
        }
    },
    { "upsert": true }
)
The result of that sort of operation would always create a new document because the existing document did not contain the specified element in the "set". The correct implementation is to not check for the presence of the "set" member and allow $addToSet to do the work.
If indeed you do have true duplicate entries occurring in the "set" where all elements of the sub-document are exactly the same, then it has been caused by some other code either present or in the past.
Where you are sure there are new entries being created, look through the code for instances of $push or indeed any array manipulation that seems to be acting on the same field.
But if you are using the operator correctly then $addToSet does exactly what it is intended to do.
When attempting to perform an upsert operation in Mongo, I'd like to have it generate a GUID for the ID instead of an Object ID. In this case, I'm checking to make sure an object with specific properties doesn't already exist and actually throwing an exception if the update occurs.
Here's a stub of the class definition:
public class Event
{
[BsonId(IdGenerator = typeof(GuidGenerator) )]
[BsonRepresentation(BsonType.String)]
[BsonIgnoreIfDefault]
public Guid Id { get; set; }
// ... more properties and junk
}
And here is how we are performing the upsert operation:
// query to see if there are any pending operations
var keyMatchQuery = Query<Event>.In(r => r.Key, keyList);
var statusMatchQuery = Query<Event>.EQ(r => r.Status, "pending");
var query = Query.And(keyMatchQuery, statusMatchQuery);
var updateQuery = new UpdateBuilder();
var bson = request.ToBsonDocument();
foreach (var item in bson)
{
updateQuery.SetOnInsert(item.Name, item.Value);
}
var fields = Fields<Request>.Include(req => req.Id);
var args = new FindAndModifyArgs()
{
Fields = fields,
Query = query,
Update = updateQuery,
Upsert = true,
VersionReturned = FindAndModifyDocumentVersion.Modified
};
// Perform the upsert
var result = Collection.FindAndModify(args);
Doing it this way will generate the ID as an ObjectID rather than a GUID.
I can definitely get the behavior I want as a two step operation by performing a .FindOne first, and if it fails, doing a direct insert:
var existingItem = Collection.FindOneAs<Event>(query);
if (existingItem != null)
{
throw new PendingException(string.Format("Event already pending: id={0}", existingItem.Id));
}
var result = Collection.Insert(mongoRequest);
In this case, it correctly sets the GUID for the new item, but the operation is non-atomic. I was searching for a way to set the default ID generation mechanism at the driver level, and thought this would do it:
BsonSerializer.RegisterIdGenerator(typeof(Guid), GuidGenerator.Instance);
...but to no avail, and I assume that's because for the upsert, the ID field can't be included so there is no serialization happening and Mongo is doing all of the work. I also looked into implementing a convention, but that didn't make sense since there are separate generation mechanisms to handle that. Is there a different approach I should be looking at for this and/or am I just missing something?
I do realize that GUIDs are not always ideal in Mongo, but we are exploring using them due to compatibility with another system.
What's happening is that only the server knows whether the FindAndModify is going to end up being an upsert or not, and as currently written it is the server that is automatically generating the _id value, and the server can only assume that the _id value should be an ObjectId (the server knows nothing about your class declarations).
Here's a simplified example using the shell showing your scenario (minus all the C# code...):
> db.test.drop()
> db.test.find()
> var query = { x : 1 }
> var update = { $setOnInsert : { y : 2 } }
> db.test.findAndModify({ query: query, update : update, new : true, upsert : true })
{ "_id" : ObjectId("5346c3e8a8f26cfae50837d6"), "x" : 1, "y" : 2 }
> db.test.find()
{ "_id" : ObjectId("5346c3e8a8f26cfae50837d6"), "x" : 1, "y" : 2 }
>
We know this was an upsert because we ran it on an empty collection. Note that the server used the query as an initial template for the new document (that's where the "x" came from), applied the update specification (that's where the "y" came from), and because the document had no "_id" it generated a new ObjectId for it.
The trick is to generate the _id client side in case it turns out to be needed, but to put it in the update specification in such a way that it only applies if it's a new document. Here's the previous example using $setOnInsert for the _id:
> db.test.drop()
> db.test.find()
> var query = { x : 1 }
> var update = { $setOnInsert : { _id : "E3650127-9B23-4209-9053-1CD989AE62B9", y : 2 } }
> db.test.findAndModify({ query: query, update : update, new : true, upsert : true })
{ "_id" : "E3650127-9B23-4209-9053-1CD989AE62B9", "x" : 1, "y" : 2 }
> db.test.find()
{ "_id" : "E3650127-9B23-4209-9053-1CD989AE62B9", "x" : 1, "y" : 2 }
>
Now we see that the server used the _id we supplied instead of generating an ObjectId.
In terms of your C# code, simply add the following to your updateQuery:
updateQuery.SetOnInsert("_id", Guid.NewGuid().ToString());
You should consider renaming your updateQuery variable to updateSpecification (or just update) because technically it's not a query.
There's a catch though... this technique is only going to work against the current 2.6 version of the server. See: https://jira.mongodb.org/browse/SERVER-9958
You seem to be following the recommended practice for this, but possibly this is bypassed with "upserts" somehow. The general problem seems to be that the operation does not actually know which "class" it is actually dealing with and has no way of knowing that it needs to call the custom Id generator.
Any value that you pass in to MongoDB for the _id field will always be honored in place of generating the default ObjectID. Therefore if that field is included in the update "document" portion of the statement it will be used.
Probably the safest way to do this when expecting "upsert" behavior is to use the $setOnInsert modifier. Anything specified in here will only be set when an insert occurs from a related "upsert" operation. So in general terms:
db.collection.update(
{ "something": "matching" },
{
// Only on insert
"$setOnInsert": {
"_id": 123
},
// Always applied on update
"$set": {
"otherField": "value"
}
},
{ upsert: true }
)
So anything within the $set ( or other valid update operators ) will always be "updated" when the matching "query" condition is found. The $setOnInsert fields will be applied when the "insert" actually occurs due to no match. Naturally any literal conditions used in the query portion to "match" are also set so that future "upserts" will issue an "update" instead.
So as long as you structure your "update" BSON document to include your newly generated GUID in this way then you will always get the correct value in there.
Much of your code is on the right track, but you will need to invoke the method from your generator class and place the value in the $setOnInsert portion of the statement, which you are already using - you are just not including that _id value yet.
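For example, reusing the generator that the Event class already declares (a sketch; the cast and ToString() mirror the BsonId/BsonRepresentation attributes on the Id property):

// GuidGenerator is the IIdGenerator named in the [BsonId] attribute;
// GenerateId returns object, so cast it and match the string representation.
var newId = (Guid)GuidGenerator.Instance.GenerateId(Collection, request);
updateQuery.SetOnInsert("_id", newId.ToString());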
I have a mongo database with documents that look like this:
{
PublishedDate: [date],
PublishedDateOverride: [NullableDate],
...
}
The reason I have the override as a separate field is that it is important to know the original published date as well as the overridden one.
When I get these documents back I want to sort them by their "apparent" published date. That is, if there is an override it should use that; otherwise use the original.
Our current system just sorts by PublishedDateOverride and then by PublishedDate which of course groups all of those with a null override together.
For a concrete example take the following four documents:
A = {
PublishedDate: 2014-03-14,
PublishedDateOverride: 2014-03-24,
...
}
B = {
PublishedDate: 2014-01-21,
PublishedDateOverride: 2014-02-02,
...
}
C = {
PublishedDate: 2014-03-01,
PublishedDateOverride: null,
...
}
D = {
PublishedDate: 2014-03-27,
PublishedDateOverride: null,
...
}
The desired sort order would be D (2014-03-27), A (2014-03-14), C (2014-03-01), B (2014-02-02).
I need to be able to do this in the database since I am also paging this data so I can't just sort after getting it out of the database.
So the question:
What is the best way to achieve this goal? Is there a way to sort by an expression? Is there a way to have a calculated field such that whenever I update a document it will put the appropriate date in there to sort on?
I'm doing this in C# in case that is relevant but I would assume any solution would be a mongo one, not in my client code.
If you want a projection of only the valid and greater date, then use aggregate with the $cond operator and the $gt operator. A basic shell example for translation (which is not hard):
db.collection.aggregate([
{ "$project": {
"date": { "$cond": [
{ "$gt": [
"$PublishedDate",
"$PublishedDateOverride"
]},
"$PublishedDate",
"$PublishedDateOverride"
]}
}},
{ "$sort": { "date": -1 } }
])
So that basically breaks down your documents to having the "date" field set to whichever of those two fields had the greater value. Then you can sort on the result. All processed server side.
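Translated to C#, a raw-BsonDocument sketch (written here against the 2.x driver's Aggregate API; on the 1.x driver the pipeline documents are the same, only the Aggregate call differs):

// "date" becomes whichever of the two fields is greater, as in the shell example.
var project = new BsonDocument("$project", new BsonDocument
{
    { "date", new BsonDocument("$cond", new BsonArray
        {
            new BsonDocument("$gt", new BsonArray { "$PublishedDate", "$PublishedDateOverride" }),
            "$PublishedDate",
            "$PublishedDateOverride"
        })
    }
});

// Newest apparent date first, matching the desired D, A, C, B order.
var sort = new BsonDocument("$sort", new BsonDocument("date", -1));

var ordered = collection.Aggregate<BsonDocument>(new[] { project, sort }).ToList();

Since you are paging, you can append $skip and $limit stages to the same pipeline so the paging also happens server side.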
Try:

datesCollection.OrderByDescending(d => d.PublishedDateOverride != null ? d.PublishedDateOverride : d.PublishedDate)
Use Indexes to Sort Query Results - to sort on multiple fields, create a compound index.
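For instance, a 2.x driver sketch (note that a compound index on the two stored fields supports the existing two-field sort, not a sort on the computed "apparent" date from the aggregation above):

// Descending compound index: override first, then original date.
collection.Indexes.CreateOne(
    new BsonDocument { { "PublishedDateOverride", -1 }, { "PublishedDate", -1 } });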