I am trying to run a cross-partition query on Azure Cosmos DB without a partition key. The throughput is set to 4000 RU/s, which gives me 250 RU/s per partition key range.
My Cosmos DB collection has about 1 million documents and is 70 GB in total. They are spread evenly across approximately 40,000 logical partitions, and the JSON documents are on average 100 KB in size. This is what the structure of my JSON documents looks like:
{
    "ArrayOfObjects": [
        {
            // other properties omitted for brevity
            "SubId": "ed2a49fb-51d4-45b4-9690-df0721d6a32f"
        },
        {
            "SubId": "35c87833-9bea-4151-86da-4d9c482ae1fe"
        }
    ],
    "ParitionKey": "b42"
}
This is how I am querying currently without a partition key:
public async Task<ResponseModel> GetBySubId(string subId)
{
    var collectionId = _cosmosClient.CollectionId;
    var query = $@"SELECT * FROM {collectionId} c
        WHERE ARRAY_CONTAINS(c.ArrayOfObjects, {{'SubId': '{subId}'}}, true)";
    var feedOptions = new FeedOptions { EnableCrossPartitionQuery = true };
    var docQuery = _cosmosClient.Client.CreateDocumentQuery(
            _collectionUri,
            query,
            feedOptions)
        .AsDocumentQuery();
    var results = new List<ResponseModel>();
    // Drain every page (all continuation tokens) before deciding the result.
    while (docQuery.HasMoreResults)
    {
        var executedQuery = await docQuery.ExecuteNextAsync<ResponseModel>();
        if (executedQuery.Count != 0)
        {
            results.AddRange(executedQuery.ToList());
        }
    }
    if (results.Count == 0)
    {
        return null;
    }
    return results.FirstOrDefault();
}
I expect to be able to retrieve the document via one of its SubIds right after inserting it. What actually happens is that the query cannot find the document and returns null, even after it finishes execution by draining all continuation tokens. The issue is intermittent: sometimes the document can be retrieved right after it is inserted, other times not.
For those documents that are failing to be retrieved after being inserted, if you wait some time (a couple of minutes usually) and repeat the query with the same SubId it is able to then retrieve the document. There seems to be a delay.
I have checked the Cosmos DB metrics in the Azure portal. They indicate that I have not exceeded the provisioned RU/s per partition at all and that none of my requests have been rate limited (HTTP 429).
Given the above, why am I still seeing issues with cross-partition querying even when there is enough throughput provisioned?
I'm using MongoDB.Driver for .NET to query MongoDB (version 3.0.11). This is my code to query a field and limit the query to 200 documents:
BsonDocument bson = new BsonDocument();
bson.Add("Field", "Value");
BsonDocumentFilterDefinition<ResultClass> filter = new BsonDocumentFilterDefinition<ResultClass>(bson);
FindOptions queryOptions = new FindOptions() { BatchSize = 200 };
List<ResultClass> result = new List<ResultClass>();
result.AddRange(myCollection.Find<ResultClass>(filter, queryOptions).Limit(200).ToList());
My issue is that when I check the database's current operations, the operation's query field shows only:
{ Field : "Value" }
This is different from the query using "AsQueryable" below:
List<ResultClass> result = myCollection.AsQueryable<ResultClass>().Where(t => t.Field == "Value").Take(200).ToList();
Query operation using "AsQueryable":
{ aggregate: "CollectionName", pipeline: [ { $match: { Field: "Value" } }, { $limit: 200 } ], cursor: {} }
Why can't I see the limit in the query using Find? Is the limit being handled on the client side instead of the server?
I need to apply this limit on the server side, but I can't use the second query because the field searched needs to be a string, which can't be done using AsQueryable.
Calling Limit in the first piece of code applies the limit to a cursor object, which stays on the server until you actually request the documents by invoking ToList(). At that point only 200 documents will go over the wire to your application.
It looks like AsQueryable executes an aggregation pipeline, which shows up in currentOp, but both are essentially the same.
I'm not sure whether there is a performance impact for either one, though.
I am writing a script that returns all unprocessed partitions within a measure group using the following command:
objMeasureGroup.Partitions.Cast<Partition>().Where(x => x.State != AnalysisState.Processed)
After some experiments, it looks like this property only indicates whether the data is processed and says nothing about the indexes.
After searching for hours, I didn't find any method to list the partitions where data is processed but indexes are not.
Any suggestions?
Environment:
SQL Server 2014
SSAS multidimensional cube
Scripts are written within an SSIS package (Script Task)
First, ProcessIndexes is an incremental operation. So if you run it twice the second time will be pretty quick because there is nothing to do. So I would recommend just running it on the cube and not worrying about whether it was previously run. However if you do need to analyze the current state then read on.
The best way (only way I know of) to distinguish whether ProcessIndexes has been run on a partition is to study the DISCOVER_PARTITION_STAT and DISCOVER_PARTITION_DIMENSION_STAT DMVs as seen below.
The DISCOVER_PARTITION_STAT DMV returns one row per aggregation with the rowcount. The first row of that DMV has a blank aggregation name and represents the rowcount of the lowest level data processed in that partition.
The DISCOVER_PARTITION_DIMENSION_STAT DMV can tell you about whether indexes are processed and which range of values by each dimension attribute are in this partition (by internal IDs, so not super easy to interpret). We assume at least one dimension attribute is set to be optimized so it will be indexed.
You will need to add a reference to Microsoft.AnalysisServices.AdomdClient also to simplify running these DMVs:
string sDatabaseName = "YourDatabaseName";
string sCubeName = "YourCubeName";
string sMeasureGroupName = "YourMeasureGroupName";
Microsoft.AnalysisServices.Server s = new Microsoft.AnalysisServices.Server();
s.Connect("Data Source=localhost");
Microsoft.AnalysisServices.Database db = s.Databases.GetByName(sDatabaseName);
Microsoft.AnalysisServices.Cube c = db.Cubes.GetByName(sCubeName);
Microsoft.AnalysisServices.MeasureGroup mg = c.MeasureGroups.GetByName(sMeasureGroupName);
Microsoft.AnalysisServices.AdomdClient.AdomdConnection conn = new Microsoft.AnalysisServices.AdomdClient.AdomdConnection(s.ConnectionString);
conn.Open();
foreach (Microsoft.AnalysisServices.Partition p in mg.Partitions)
{
    Console.Write(p.Name + " - " + p.State + " - ");
    var restrictions = new Microsoft.AnalysisServices.AdomdClient.AdomdRestrictionCollection();
    restrictions.Add("DATABASE_NAME", db.Name);
    restrictions.Add("CUBE_NAME", c.Name);
    restrictions.Add("MEASURE_GROUP_NAME", mg.Name);
    restrictions.Add("PARTITION_NAME", p.Name);
    var dsAggs = conn.GetSchemaDataSet("DISCOVER_PARTITION_STAT", restrictions);
    var dsIndexes = conn.GetSchemaDataSet("DISCOVER_PARTITION_DIMENSION_STAT", restrictions);
    if (dsAggs.Tables[0].Rows.Count == 0)
        Console.WriteLine("ProcessData not run yet");
    else if (dsAggs.Tables[0].Rows.Count > 1)
        Console.WriteLine("aggs processed");
    else if (p.AggregationDesign == null || p.AggregationDesign.Aggregations.Count == 0)
    {
        bool bIndexesBuilt = false;
        foreach (System.Data.DataRow row in dsIndexes.Tables[0].Rows)
        {
            if (Convert.ToBoolean(row["ATTRIBUTE_INDEXED"]))
            {
                bIndexesBuilt = true;
                break;
            }
        }
        if (bIndexesBuilt)
            Console.WriteLine("indexes have been processed. no aggs defined");
        else
            Console.WriteLine("no aggs defined. need to run ProcessIndexes on this partition to build indexes");
    }
    else
        Console.WriteLine("need to run ProcessIndexes on this partition to process aggs and indexes");
}
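To make the decision logic in the loop above easier to follow, here is a minimal sketch of the same branching as a pure function. It is written in Python only so that it can be run without an SSAS instance. The function name, parameter names and status strings are my own (hypothetical): stat_rows stands for the number of rows returned by DISCOVER_PARTITION_STAT, aggregations_defined for the number of aggregations in the partition's aggregation design (0 if there is none), and attribute_indexed for the ATTRIBUTE_INDEXED values from DISCOVER_PARTITION_DIMENSION_STAT.

```python
from typing import Sequence


def partition_status(stat_rows: int,
                     aggregations_defined: int,
                     attribute_indexed: Sequence[bool]) -> str:
    """Classify a partition the same way the C# loop above does (hypothetical helper)."""
    if stat_rows == 0:
        # No rows at all: ProcessData has not been run on the partition.
        return "data not processed"
    if stat_rows > 1:
        # More than the single lowest-level row: aggregation rows exist.
        return "aggregations processed"
    if aggregations_defined == 0:
        # Only the lowest-level row and no aggregation design:
        # fall back to the per-attribute index flags.
        if any(attribute_indexed):
            return "indexes processed, no aggregations defined"
        return "indexes not processed, no aggregations defined"
    # Aggregations are designed but only the lowest-level row exists.
    return "aggregations and indexes not processed"
```

Like the original, this assumes at least one dimension attribute is optimized and therefore indexed; otherwise the index flags cannot distinguish the two "no aggregations" cases.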
I am posting this answer as additional information to @GregGalloway's excellent answer.
After searching for a while, the only way I found to know whether partitions are processed is to use DISCOVER_PARTITION_STAT and DISCOVER_PARTITION_DIMENSION_STAT.
I found an article posted by Darren Gosbell describing the whole process:
SSAS: Are my Aggregations processed?
In the article above, the author provides two methods:
Using XMLA
One way in which you can find it out is with an XMLA discover call to the DISCOVER_PARTITION_STAT rowset, but that returns the results in a big lump of XML which is not as easy to read as a tabular result set.
Example:
<Discover xmlns="urn:schemas-microsoft-com:xml-analysis">
<RequestType>DISCOVER_PARTITION_STAT</RequestType>
<Restrictions>
<RestrictionList>
<DATABASE_NAME>Adventure Works DW</DATABASE_NAME>
<CUBE_NAME>Adventure Works</CUBE_NAME>
<MEASURE_GROUP_NAME>Internet Sales</MEASURE_GROUP_NAME>
<PARTITION_NAME>Internet_Sales_2003</PARTITION_NAME>
</RestrictionList>
</Restrictions>
<Properties>
<PropertyList>
</PropertyList>
</Properties>
</Discover>
Using DMV queries
If you have SSAS 2008, you can use the new DMV feature to query this same rowset and return a tabular result.
Example:
SELECT *
FROM SystemRestrictSchema($system.discover_partition_stat
,DATABASE_NAME = 'Adventure Works DW 2008'
,CUBE_NAME = 'Adventure Works'
,MEASURE_GROUP_NAME = 'Internet Sales'
,PARTITION_NAME = 'Internet_Sales_2003')
Similar posts:
How to find out using AMO if aggregation exists on partition?
Detect aggregation processing state with AMO?
I have a CosmosDB collection that is partitioned and where throughput is set to 10,000 RU/s (the problem does not occur when throughput is below 6100 RU/s).
Now I issue an arbitrary document query (for example to retrieve all documents in the collection) with a variable pageSize and a continuationToken (initially set to null):
var q = DocumentClient.CreateDocumentQuery<T>(CollectionUri,
new FeedOptions
{
MaxItemCount = pageSize,
EnableCrossPartitionQuery = true,
RequestContinuation = continuationToken
});
Now if I call
FeedResponse<T> response = await q.ExecuteNextAsync<T>();
I would expect the response to be paged according to the specified pageSize. In particular, if pageSize = -1 or pageSize = int.MaxValue, I want only exactly one page with all results to be returned. However, the resulting pages are fragmented along the partitions.
For example, with pageSize = -1 or pageSize = int.MaxValue I would get a page with 18 objects from the first partition, and only when ExecuteNextAsync is called a second time, I would get the remaining 35 objects from the other two partitions.
With pageSize = 17 I would first get a page with 17 objects on the first call of ExecuteNextAsync, then a page with 1 object on the next call, and then another page with 17 objects!
But this renders paging (almost) completely useless! Or is there a way to implement paging properly (even when throughput is above 6000 RU/s)?
Based on Nick Chapsas' information that ExecuteNextAsync may return fewer than MaxItemCount items even if more are available, I am using the following workaround:
List<T> result = new List<T>();
string continuationToken = null;
IDocumentQuery<T> docQuery = queryable.AsDocumentQuery();
// ugly hack to get the feed options using reflection
FeedOptions feedOptions = docQuery.GetNonPublicProperty<FeedOptions>("feedOptions");
while (docQuery.HasMoreResults && (pageSize <= 0 || result.Count < pageSize))
{
    if (feedOptions != null && pageSize > 0)
    {
        // only ask for as many items as are still missing from this page
        feedOptions.MaxItemCount = pageSize - result.Count;
    }
    FeedResponse<T> response = await docQuery.ExecuteNextAsync<T>();
    result.AddRange(response.ToList());
    continuationToken = response.ResponseContinuation;
}
return (result, continuationToken);
Getting the private property using reflection is not very nice, but there doesn't seem to be any other way to get hold of the query's FeedOptions. In particular, the FeedOptions used for calling DocumentClient.CreateDocumentQuery<T> are cloned internally, so it's really a private instance.
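To illustrate why shrinking MaxItemCount to the number of items still missing gives evenly sized pages, here is a small simulation. It is only a sketch in Python (so it runs without a Cosmos DB account), and it is not the SDK: FakePartitionedQuery, max_item_count, has_more_results, execute_next and read_page are hypothetical names, and the only behavior modelled is that a single response never spans two partitions. Continuation tokens are left out.

```python
from typing import List


class FakePartitionedQuery:
    """Stand-in for a cross-partition query: one response never spans two partitions."""

    def __init__(self, partitions: List[List[int]]):
        self._partitions = partitions
        self._part = 0      # index of the partition currently being read
        self._offset = 0    # position inside that partition
        self.max_item_count = -1  # <= 0 means "no limit"

    @property
    def has_more_results(self) -> bool:
        return self._part < len(self._partitions)

    def execute_next(self) -> List[int]:
        items = self._partitions[self._part]
        remaining = len(items) - self._offset
        take = remaining if self.max_item_count <= 0 else min(remaining, self.max_item_count)
        page = items[self._offset:self._offset + take]
        self._offset += take
        if self._offset >= len(items):
            # partition drained: the next call continues with the next partition
            self._part += 1
            self._offset = 0
        return page


def read_page(query: FakePartitionedQuery, page_size: int) -> List[int]:
    """Keep calling the query until page_size items are collected or it is drained."""
    result: List[int] = []
    while query.has_more_results and (page_size <= 0 or len(result) < page_size):
        if page_size > 0:
            # ask only for what is still missing, so nothing is fetched and thrown away
            query.max_item_count = page_size - len(result)
        result.extend(query.execute_next())
    return result
```

With three partitions of 18, 22 and 13 items and page_size = 17, each call to read_page fills a whole page across partition boundaries, and only the last page is shorter.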
MaxItemCount represents the maximum number of items that a single request to a partition will return. A response is not guaranteed to contain that many items, and sometimes it will even be empty.
For that reason, you should leave MaxItemCount out of your pagination logic, as it has nothing to do with what you're trying to achieve.
Instead, what you really want is something like the following.
Here's an implementation with a pageSize and nextPageToken combination. The continuation token is passed in through the FeedOptions of the query:
var results = new List<T>();
var nextPageToken = string.Empty;
while (query.HasMoreResults)
{
    if (results.Count == pageSize)
        break;
    var items = await query.ExecuteNextAsync<T>(cancellationToken);
    nextPageToken = items.ResponseContinuation;
    foreach (var item in items)
    {
        results.Add(item);
        if (results.Count == pageSize)
            break;
    }
}
return (results, nextPageToken);
For this to work at any RU/s setting, you will need to either wrap your query.ExecuteNextAsync<T>(cancellationToken); call in a retry wrapper or simply ramp up the DocumentClient's retry options.
For further implementation details you can take a look at how Cosmonaut handles pagination and solves this issue, and more specifically here. (Full disclosure: I am the creator of this library, but I don't want to paste the full implementation here.)
I have more than 15,000 POCO elements stored in a Redis list. I'm using ServiceStack to save and get them. However, I'm not pleased with the response times I get when loading them into a grid. From what I have read, it would be better to store these objects in a hash, but unfortunately I could not find any good example for my case :(
This is the method I use to get them into my grid:
public IEnumerable<BookingRequestGridViewModel> GetAll()
{
    try
    {
        var redisManager = new RedisManagerPool(Global.RedisConnector);
        using (var redis = redisManager.GetClient())
        {
            var redisEntities = redis.As<BookingRequestModel>();
            var result = redisEntities.Lists["BookingRequests"].GetAll().Select(z => new BookingRequestGridViewModel
            {
                CreatedDate = z.CreatedDate,
                DropOffBranchName = z.DropOffBranch != null ? z.DropOffBranch.Name : string.Empty,
                DropOffDate = z.DropOffDate,
                DropOffLocationName = z.DropOffLocation != null ? z.DropOffLocation.Name : string.Empty,
                Id = z.Id.Value,
                Number = z.Number,
                PickupBranchName = z.PickUpBranch != null ? z.PickUpBranch.Name : string.Empty,
                PickUpDate = z.PickUpDate,
                PickupLocationName = z.PickUpLocation != null ? z.PickUpLocation.Name : string.Empty
            }).OrderBy(z => z.Id);
            return result;
        }
    }
    catch (Exception ex)
    {
        return null;
    }
}
Note that I use redisEntities.Lists["BookingRequests"].GetAll() which is causing performance issues (I would like to use just redisEntities.Lists["BookingRequests"] but I lose last updates from grid - after editing)
I would like to know if saving them into list is a good approach as for me it's very important to have a fast grid (I have now 1 second at paging which is huge).
Please advise!
Firstly, you should not create a new Redis client manager such as a RedisManagerPool instance each time. There should only be a singleton instance of RedisManagerPool in your app, which all clients are resolved from.
But otherwise I would rethink your data access strategy, downloading 15K items in a batch is not an ideal strategy. You can create indexes by storing ids in Sets or you could store items in a sorted set with a value that you can page against like an incrementing id, e.g:
var redisEntities = redis.As<BookingRequestModel>();
var bookings = redisEntities.SortedSets["bookings"];
foreach (var item in new BookingRequestModel[0])
{
redisEntities.AddItemToSortedSet(bookings, item, item.Id);
}
That way you will be able to fetch them in batches, e.g:
var batch = bookings.GetRangeByLowestScore(fromId, toId, skip, take);
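To show the idea behind paging against a score without needing a Redis server, here is a minimal in-memory sketch in Python. It is not ServiceStack or Redis code: SortedSetSketch and its methods are hypothetical names, and the score range is treated as inclusive at both ends, which is my assumption about how the range call above behaves.

```python
import bisect
from typing import Any, List


class SortedSetSketch:
    """In-memory stand-in for a sorted set: items are kept ordered by a numeric score."""

    def __init__(self) -> None:
        self._scores: List[float] = []
        self._items: List[Any] = []

    def add(self, item: Any, score: float) -> None:
        # insert at the position that keeps the scores sorted
        i = bisect.bisect_right(self._scores, score)
        self._scores.insert(i, score)
        self._items.insert(i, item)

    def range_by_lowest_score(self, from_score: float, to_score: float,
                              skip: int, take: int) -> List[Any]:
        # select the inclusive score window, then apply skip/take inside it
        lo = bisect.bisect_left(self._scores, from_score)
        hi = bisect.bisect_right(self._scores, to_score)
        return self._items[lo:hi][skip:skip + take]
```

A grid page then only needs one small batch, for example range_by_lowest_score(1, 15000, skip=40, take=20) for the third page of 20 rows, instead of downloading the whole list and sorting it on the client.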
I have a stored procedure which gives me a document count (count.js on github). I have partitioned my collection. Due to this, I now have to pass the partition key in as an option to run the stored procedure.
Can I enable cross-partition queries in the stored procedure (i.e., collection(EnableCrossPartitionQuery = true)) so that I don't have to specify the partition key, and if so, how?
There is no way to do fan-out stored procedure execution in DocumentDB. They run against a single partition. I ran into this dilemma when trying to switch to partitioned collections and had to make some adjustments. Here are some options:
Download a 1 for every record and sum/count them client-side
Rerun the stored procedure for each unique partition key. In my case, this was not as bad as it sounds since the partition key is a tenantID and I only have a dozen of those and only expect a few hundred max.
I'm not sure about this one since I haven't tried it with partitioned collections, but each query now returns the resource usage of the collection in the x-ms-resource-usage header. That header has a documentsSize sub-header. You could use that divided by the average size of your documents to get an approximate count. There may even be a count record in that header information by now.
Also, there is an x-ms-item-count header, but I'm not sure how that behaves. If you send a query for all the records in the entire partitioned collection and set the max-item-count to 1, you'll only get back one record and it shouldn't cost you a lot in RUs, but I don't know how that header behaves. Does it return a 1 in that case? Or does it return the total number of documents that all the pages of the query would eventually return if you bothered to request every page? A quick experiment should confirm this.
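Two of the options above boil down to simple arithmetic, sketched here in Python so it can be run without a database. All function names are hypothetical. The header layout (semicolon-separated key=value pairs) and the sample string used with it are assumptions for illustration, not captured output, and the size and the average must be in the same unit.

```python
from typing import Callable, Dict, Iterable


def parse_resource_usage(header: str) -> Dict[str, int]:
    """Split an 'a=1;b=2' style header value into a dict (assumed layout)."""
    usage: Dict[str, int] = {}
    for part in header.split(";"):
        if "=" in part:
            key, value = part.split("=", 1)
            usage[key.strip()] = int(value)
    return usage


def estimate_count(documents_size: float, average_document_size: float) -> int:
    """Approximate document count: total size divided by average document size."""
    return round(documents_size / average_document_size)


def count_all(partition_keys: Iterable[str],
              count_in_partition: Callable[[str], int]) -> int:
    """Client-side fan-out: run a per-partition count once per key and add the results."""
    return sum(count_in_partition(key) for key in partition_keys)
```

In the fan-out case, count_in_partition would be whatever executes the count stored procedure for one partition key; with a dozen tenant IDs that means a dozen calls.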
Below you can find some example code that should allow you to read all records cross-partition. The magic is inside the doForAll function, and at the top you can see how it is called.
// SAMPLE STORED PROCEDURE
function sample(prefix) {
var share = { counter: 0, hasEntityName : 0, isXXX: 0, partitions: {}, prefix };
doForAll({
filter: function limiter(record){
if (record && record.entityName === 'XXX') return true;
else return false;
},
callback: function handleRecord(record) {
//Keep track of this partition...
let partitionKey = record.partitionKey;
if (share.partitions[partitionKey])
share.partitions[partitionKey]++;
else
share.partitions[partitionKey] = 1;
//update some counters...
share.counter++;
if (record.entityName !== undefined) share.hasEntityName++;
if (record.entityName === 'XXX') share.isXXX++;
},
finaly: function whenAllIsDone() {
console.log("counter = " + share.counter + ". ");
console.log("has entity name: "+ share.hasEntityName+ ". ")
console.log("is XXX: " + share.isXXX+ ". ")
var parts = Object.getOwnPropertyNames(share.partitions)
console.log("partition keys: " + parts.length + " ...");
getContext()
.getResponse()
.setBody(share);
}
});
//The magic function...
//also see: https://azure.github.io/azure-cosmosdb-js-server/Collection.html
function doForAll(task, ctoken) {
if (!task) throw "Expected one parameter of type: { filter?: (rec?)=>boolean, callback?: (rec?) => void, finaly?: () => void }";
//Note:
//the "__" symbol is an alias for getContext().getCollection(), so it could be used here instead
var result = getContext()
.getCollection()
.chain()
.filter(task.filter || function (rec) { return true; })
.map(task.callback || function (rec) { return undefined; })
.value({ continuation: ctoken }, function afterBatchCallback (err, feed, options) {
if (err) throw err;
if (options.continuation)
doForAll(task, options.continuation);
else if (task.finaly)
task.finaly();
});
if (!result.isAccepted)
throw "catastrophic failure";
}
}
PS: it may help to know what the data used in the example looks like.
This is an example of such a document:
{
"id": "123",
"partitionKey": "PART_1",
"entityName": "EXAMPLE_ENTITY",
"veryInterestingInfo": "The 'id' property is also the collections id, the 'partitionKey' property happens to be the collections partition key, and all the records in this collection have a 'entityName' property which contains a (non-unique) string"
}