We have had an issue in a service utilizing Azure Table Storage where some queries occasionally take multiple seconds (3 to 30 seconds). This happens daily, but only for some of the queries. We do not have a huge load on the service or the table storage (some hundreds of calls per hour), yet the table storage is still not performing.
The slow queries are all filter queries that should return at most 10 rows. I have structured the filters so that there is always a partition key and row key joined by an and, followed by the next partition/row key pair after an or operator:
(partitionKey1 and rowKey1) or (partitionKey2 and rowKey2) or (partitionKey3 and rowKey3)
So currently I am working on the premise that I need to split the query into separate queries. This was somewhat verified with a Python script I wrote: when I repeat the same query either as a single combined query (joined with or's and expecting multiple rows as the result) or split into multiple queries executed in separate threads, I see the combined query slow down every now and then.
import time
import threading
from azure.cosmosdb.table.tableservice import TableService

############################################################################
# Script for querying data from Azure Table Storage or Cosmos DB Table API.
# A SAS token needs to be generated to use this script, and a table with
# data needs to exist.
#
# Warning: extensive use of this script may burden the table performance,
# so use with care.
#
# PIP requirements:
# - requires azure-cosmosdb-table to be installed
#   * run: 'pip install azure-cosmosdb-table'
############################################################################

dateTimeSince = '2019-06-12T13:16:45.446Z'
sasToken = 'SAS_TOKEN_HERE'
tableName = 'TABLE_NAME_HERE'
table_service = TableService(account_name="ACCOUNT_NAME_HERE", sas_token=sasToken)
tableFilter = "(PartitionKey eq 'http%3a%2f%2fsome_website.azurewebsites.net%2fApiName_ed6d31b0') and (RowKey eq 'ed6d31b0-d2a3-4f18-9d16-7f72cbc88cb3') or (PartitionKey eq 'http%3a%2f%2fsome_website.azurewebsites.net%2fApiName_9be86f34') and (RowKey eq '9be86f34-865b-4c0f-8ab0-decf928dc4fc') or (PartitionKey eq 'http%3a%2f%2fsome_website.azurewebsites.net%2fApiName_97af3bdc') and (RowKey eq '97af3bdc-b827-4451-9cc4-a8e7c1190d17') or (PartitionKey eq 'http%3a%2f%2fsome_website.azurewebsites.net%2fApiName_9d557b56') and (RowKey eq '9d557b56-279e-47fa-a104-c3ccbcc9b023') or (PartitionKey eq 'http%3a%2f%2fsome_website.azurewebsites.net%2fApiName_e251a31a') and (RowKey eq 'e251a31a-1aaa-40a8-8cde-45134550235c')"
resultDict = {}

def runQueryPrintResult(filter):
    # Run one filter as its own query and store the returned row by RowKey.
    result = table_service.query_entities(table_name=tableName, filter=filter)
    item = result.items[0]
    resultDict[item.RowKey] = item

# Split the combined filter into the individual point-query filters.
filters = tableFilter.split(" or ")

# Loop where:
# - Step 1: the test is run with the tableFilter query split across multiple
#   threads (each query returns a single row)
# - Step 2: the tableFilter query is run as a single query
# - Press enter to repeat the two query tests
while True:
    # Reset state from the previous round so the verification stays valid.
    resultDict = {}
    threads = []

    # Do separate queries
    start2 = time.time()
    for filter in filters:
        x = threading.Thread(target=runQueryPrintResult, args=(filter,))
        x.start()
        threads.append(x)
    for x in threads:
        x.join()
    end2 = time.time()
    print("Time elapsed with multi threaded implementation: {}".format(end2 - start2))

    # Do single query
    start1 = time.time()
    listGenerator = table_service.query_entities(table_name=tableName, filter=tableFilter)
    end1 = time.time()
    print("Time elapsed with single query: {}".format(end1 - start1))

    # Verify that both approaches returned the same rows.
    counter = 0
    allVerified = True
    for item in listGenerator:
        if resultDict.get(item.RowKey):
            counter += 1
        else:
            allVerified = False
    if len(listGenerator.items) != len(resultDict):
        allVerified = False
    print("table item count since x: " + str(counter))
    if allVerified:
        print("Both queries returned same amount of results")
    else:
        print("Result count does not match, single threaded count={}, "
              "multithreaded count={}".format(len(listGenerator.items), len(resultDict)))
    input('Press enter to retry test!')
Here is an example output from the Python code:
Time elapsed with multi threaded implementation: 0.10776209831237793
Time elapsed with single query: 0.2323908805847168
table item count since x: 5
Both queries returned same amount of results
Press enter to retry test!
Time elapsed with multi threaded implementation: 0.0897986888885498
Time elapsed with single query: 0.21547174453735352
table item count since x: 5
Both queries returned same amount of results
Press enter to retry test!
Time elapsed with multi threaded implementation: 0.08280491828918457
Time elapsed with single query: 3.2932426929473877
table item count since x: 5
Both queries returned same amount of results
Press enter to retry test!
Time elapsed with multi threaded implementation: 0.07794523239135742
Time elapsed with single query: 1.4898555278778076
table item count since x: 5
Both queries returned same amount of results
Press enter to retry test!
Time elapsed with multi threaded implementation: 0.07962584495544434
Time elapsed with single query: 0.20011520385742188
table item count since x: 5
Both queries returned same amount of results
Press enter to retry test!
The service we have problems with is implemented in C#, though, and I have yet to reproduce the results from the Python script on the C# side. There I seem to get worse performance when splitting the query into multiple separate queries vs. using a single filter query (returning all the required rows).
So doing the following multiple times and awaiting them all to complete seems to be slower:
TableOperation getOperation =
TableOperation.Retrieve<HqrScreenshotItemTableEntity>(partitionKey, id.ToString());
TableResult result = await table.ExecuteAsync(getOperation);
than doing it all in a single query:
private IEnumerable<MyTableEntity> GetBatchedItemsTableResult(Guid[] ids, string applicationLink)
{
    var table = InitializeTableStorage();

    TableQuery<MyTableEntity> itemsQuery =
        new TableQuery<MyTableEntity>().Where(TableQueryConstructor(ids, applicationLink));

    IEnumerable<MyTableEntity> result = table.ExecuteQuery(itemsQuery);

    return result;
}
public string TableQueryConstructor(Guid[] ids, string applicationLink)
{
    var fullQuery = new StringBuilder();

    // Encode the link before using it as a partition key, as REST GET
    // requests do not accept unencoded URL params by default.
    string partitionKey = HttpUtility.UrlEncode(applicationLink);

    foreach (var id in ids)
    {
        // Create a query for a single row in the requested partition.
        string queryForRow = TableQuery.CombineFilters(
            TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.Equal, partitionKey),
            TableOperators.And,
            TableQuery.GenerateFilterCondition("RowKey", QueryComparisons.Equal, id.ToString()));

        if (fullQuery.Length == 0)
        {
            // Append the query for the first row.
            fullQuery.Append(queryForRow);
        }
        else
        {
            // Append queries for subsequent rows with the or operator
            // to make the queries independent of each other.
            fullQuery.Append($" {TableOperators.Or} ");
            fullQuery.Append(queryForRow);
        }
    }

    return fullQuery.ToString();
}
The test case used with the C# code is quite different from the Python test, though. In C# I am querying 2000 rows out of roughly 100,000 rows. If the data is queried in batches of 50 rows, the latter filter query beats the single-row queries run in 50 tasks.
Maybe I should just repeat the test I did in Python as a C# console app to see whether the .NET client API behaves the same way as the Python one performance-wise.
I think you should use the multi-threaded implementation, since it consists of multiple point queries. Doing it all in a single query probably results in a table scan. As the official doc mentions:
Using an "or" to specify a filter based on RowKey values results in a partition scan and is not treated as a range query. Therefore, you should avoid queries that use filters such as: $filter=PartitionKey eq 'Sales' and (RowKey eq '121' or RowKey eq '322')
You might think the example above is two Point Queries, but it actually results in a Partition Scan.
To me the answer here seems to be that query execution on table storage has not been optimized to work with the OR operator as you would expect. A query is not handled as a point query when it combines point queries with the OR operator.
This can be reproduced in Python, C# and Azure Storage Explorer, in all of which combining point queries with OR can be 10x slower (or even more) than doing separate point queries that each return only one row.
So the most efficient way to get a number of rows with known partition and row keys is to fetch them all with separate async queries using TableOperation.Retrieve (in C#). Using TableQuery is highly inefficient and does not produce results anywhere near what the performance scalability targets for Azure Table Storage would lead you to expect. The scalability targets say, for example: "Target throughput for a single table partition (1 KiB-entities): up to 2,000 entities per second". And here I was not even able to be served 5 rows per second, although all the rows were in different partitions.
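For illustration, a minimal sketch of that pattern - assuming a CloudTable named table and a hypothetical collection keys of known partition/row key pairs:
// Sketch only: one TableOperation.Retrieve per known (PartitionKey, RowKey)
// pair, fired concurrently and awaited together instead of sequentially.
var tasks = keys.Select(k =>
    table.ExecuteAsync(
        TableOperation.Retrieve<HqrScreenshotItemTableEntity>(k.PartitionKey, k.RowKey)));
TableResult[] results = await Task.WhenAll(tasks);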
This limitation in query performance is not stated very clearly anywhere in the documentation or performance optimization guides, but it can be understood from these lines in the Azure storage performance checklist:
Querying
This section describes proven practices for querying the table service.
Query scope
There are several ways to specify the range of entities to query. The following is a discussion of the uses of each.
In general, avoid scans (queries larger than a single entity), but if you must scan, try to organize your data so that your scans retrieve the data you need without scanning or returning significant amounts of entities you don't need.
Point queries
A point query retrieves exactly one entity. It does this by specifying both the partition key and row key of the entity to retrieve. These queries are efficient, and you should use them wherever possible.
Partition queries
A partition query is a query that retrieves a set of data that shares a common partition key. Typically, the query specifies a range of row key values or a range of values for some entity property in addition to a partition key. These are less efficient than point queries, and should be used sparingly.
Table queries
A table query is a query that retrieves a set of entities that does not share a common partition key. These queries are not efficient and you should avoid them if possible.
So "A point query retrieves exactly one entity" and "Use point queries when ever possible". Since I had split the data to partitions, it may have been handled as table query: "A table query is a query that retrieves a set of entities that does not share a common partition key". This although the query combined set of point queries as it listed both partition and row keys for all entities that were expected. But since the combined query was not retriewing only one query it cannot be expected to perform as point query (or set of point queries).
Posting as an answer since it was getting too big for comments.
Can you try changing your query to something like the following:
(PartitionKey eq 'http%3a%2f%2fsome_website.azurewebsites.net%2fApiName_ed6d31b0' and RowKey eq 'ed6d31b0-d2a3-4f18-9d16-7f72cbc88cb3') or (PartitionKey eq 'http%3a%2f%2fsome_website.azurewebsites.net%2fApiName_9be86f34' and RowKey eq '9be86f34-865b-4c0f-8ab0-decf928dc4fc') or (PartitionKey eq 'http%3a%2f%2fsome_website.azurewebsites.net%2fApiName_97af3bdc' and RowKey eq '97af3bdc-b827-4451-9cc4-a8e7c1190d17') or (PartitionKey eq 'http%3a%2f%2fsome_website.azurewebsites.net%2fApiName_9d557b56' and RowKey eq '9d557b56-279e-47fa-a104-c3ccbcc9b023') or (PartitionKey eq 'http%3a%2f%2fsome_website.azurewebsites.net%2fApiName_e251a31a' and RowKey eq 'e251a31a-1aaa-40a8-8cde-45134550235c')
Related
I'm using Azure.Data.Tables (12.6.1) and I need to query a single record from multiple partitions of a single table (so the result would be multiple records, one from each partition). Each entity needs to be looked up by its partition key and row key - for a single TableClient.GetEntity() call this would be a point query.
After reading the documentation, I'm confused about whether it's efficient to call TableClient.QueryAsync() with multiple partition key / row key pairs, and the search results I found provide contradictory suggestions.
Is it efficient to do this (for a number of partition key / row key combinations, up to ~50), or is it better to just call GetEntity() one by one for each entity?
var filter = "(PartitionKey eq 'p1' And RowKey eq 'r1') Or " +
"(PartitionKey eq 'p2' And RowKey eq 'r2') Or ...";
var results = await tableClient.QueryAsync(filter, 500, null, cancelToken);
I don't know if there is a definitive answer here, as it probably depends on your specific requirements. I would suggest testing the different options and tuning accordingly.
Just for reference, here is a general overview of query performance for tables: https://learn.microsoft.com/azure/storage/tables/table-storage-design-for-query
I settled on parallelizing point queries for this scenario, and it has given good results. I have heavy-burst read scenarios where I may have tens or hundreds of thousands of lookups to do against hundreds of millions of records. I prefer that over a query with a series of ORs, as those tended to give worse throughput (I don't have any stats to hand right now...).
For me, parallelization happens through two means:
1. lower level: awaiting a batch of Tasks, each making an individual point query (see the sketch below)
2. higher level: architecting a particularly heavy workload to scale out over multiple instances, each making parallel queries via 1)
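As a rough illustration of 1) with Azure.Data.Tables - a sketch only, where tableClient is a TableClient and keyPairs is a hypothetical list of partition/row key tuples:
// Lower-level parallelization sketch: one GetEntityAsync point query per
// known key pair, fired concurrently and awaited as a batch.
// 'keyPairs' is a hypothetical collection; adjust the entity type as needed.
var tasks = keyPairs.Select(k =>
    tableClient.GetEntityAsync<TableEntity>(k.PartitionKey, k.RowKey));
Response<TableEntity>[] responses = await Task.WhenAll(tasks);
In practice you would likely cap the batch size rather than firing tens of thousands of tasks at once.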
Assume we have a dataset like this:
Name             Population  Capital
London           8,799,800   true
Barcelona        1,620,343   false
Luxembourg City  128,512     true
When working with SQL databases, the order in which you type your ORDER BY, LIMIT and WHERE clauses matters. For instance:
SELECT *
FROM Cities
WHERE Capital = true
ORDER BY Population DESC
LIMIT 2
would return London and Luxembourg City. Whereas
SELECT *
FROM Cities
ORDER BY Population DESC
LIMIT 2
WHERE Capital = true
would return London
(the second statement is probably illegal but that's a different thing)
In Firestore (I'm using the C# SDK), we can set the query arguments and then get the snapshot. We could do:
Query query = citiesRef
.WhereEqualTo("Capital", true)
.OrderByDescending("Population")
.Limit(2);
QuerySnapshot querySnapshot = await query.GetSnapshotAsync();
We could also run this:
Query query = citiesRef
.OrderByDescending("Population")
.Limit(2)
.WhereEqualTo("Capital", true);
QuerySnapshot querySnapshot = await query.GetSnapshotAsync();
If Firestore were to behave the way a SQL database does, these two statements would potentially return different results. However, when we run them, they seem to return the same results.
Are the two Firestore queries equivalent?
Yes, they are the same. When you specify a limit, you are always limiting the final set of results - after all filters and orders have been applied - that will be returned to the client, so that the results can be paginated by the client. It never limits the data set before the other operations are applied.
I have a database table on SQL Server 2019 containing a time series of prices collected at multiple frequencies (daily, weekly or monthly), which I query using EF Core 3.1.
I'm trying to extract these prices aggregated by month, but without losing the information about the collection frequency.
From the following set of data:
I'm trying to get this one, which contains the aggregate average value of the prices, grouped by Month, and with the frequencies of the raw records.
This could be easily solved by using
string.Join(",", s.Select(innerSel => innerSel.OriginalFrequency).Distinct())
but unfortunately I can't use that, as I need to work on IQueryable objects and trigger execution of the LINQ query only at the end, when I take a subset of the data based on the page size; converting the query to a List before grouping would mean pulling several thousand records from the DB.
I was trying to use a combination of SUM and COUNT of the frequencies, so that multiplying these two values would easily reveal the original combination (see the schema below), but for that the COUNT and SUM would have to count only distinct values; otherwise it doesn't work.
Is there a way to not lose this information, without overloading the database server by requesting unnecessary data or making multiple requests?
This is the code where I'm stuck:
var aggregatedMonthlyPrices = prices.GroupBy(g => new
{
    g.DateMonth,
    g.DateYear
}).Select(s => new
{
    DateMonth = s.Key.DateMonth,
    DateYear = s.Key.DateYear,
    Price = s.Average(avg => avg.Price),
    FrequencySum = s.Sum(sum => sum.DataCollectionFrequencyId),
    FrequencyCount = s.Count(),
});
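For illustration only, a hedged sketch of one possible direction - assuming DataCollectionFrequencyId takes a small known set of values (hypothetically 1 = daily, 2 = weekly, 3 = monthly): a conditional Max per frequency should translate to MAX(CASE WHEN ...) and keep everything on the IQueryable:
// Hedged sketch, not a verified solution: one conditional Max per known
// frequency id records whether that frequency occurred in the group.
// The ids 1/2/3 are assumptions for illustration.
var aggregatedMonthlyPrices = prices.GroupBy(g => new { g.DateMonth, g.DateYear })
    .Select(s => new
    {
        s.Key.DateMonth,
        s.Key.DateYear,
        Price = s.Average(p => p.Price),
        HasDaily = s.Max(p => p.DataCollectionFrequencyId == 1 ? 1 : 0),
        HasWeekly = s.Max(p => p.DataCollectionFrequencyId == 2 ? 1 : 0),
        HasMonthly = s.Max(p => p.DataCollectionFrequencyId == 3 ? 1 : 0),
    });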
Using an ADO.NET entity data model, I've constructed the two queries below against a table of 1800 records with just over 30 fields, and they yield staggeringly different results.
// Executes slowly, over 6000 ms
int count = context.viewCustomers.AsNoTracking()
.Where(c => c.Cust_ID == _custID).Count();
// Executes instantly, under 20 ms
int count = context.viewCustomers.AsNoTracking()
.Where(c => c.Cust_ID == 625).Count();
From the database log that Entity Framework provides, I see that the queries are almost identical, except that the filter portion uses a parameter. Copying this query into SSMS and declaring and setting the parameter there results in a near-instant query, so the problem doesn't appear to be on the database end of things.
Has anyone encountered this and can explain what's happening? I'm at the mercy of a third-party control that adds this command to the query in an attempt to limit the number of rows returned; getting the count is a must. This is used for several queries, so a generic solution is needed. It is unfortunate that it doesn't work as advertised: it seems to only make the query take 5-10 times as long as it would if I just loaded the entire view into memory. When no filter is used, however, it works like a dream.
Use of these components includes the source code, so I can change this behavior, but I need to consider which approaches can provide a reusable solution.
You did not mention the design details of your model, but if you only want a count of records based on a condition, this can be optimized by counting the result set based on only one column. For example,
int count = context.viewCustomers.AsNoTracking().Where(c => c.Cust_ID == _custID).Count();
If your design has 10 columns and, based on the above statement, let's say 100 records are returned, then for every record the result set contains 10 columns' worth of data that is of no use.
You can optimize this by counting the result set based on a single column:
int count = context.viewCustomers.AsNoTracking().Where(c => c.Cust_ID == _custID).Select(x => new { x.column }).Count();
Other optimization methods, like the async variant of Count (CountAsync), can also be used.
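A minimal sketch of that, assuming the EF async extensions are available for your EF version:
// Async variant of the count; frees the calling thread while the query runs.
int count = await context.viewCustomers.AsNoTracking()
    .Where(c => c.Cust_ID == _custID)
    .CountAsync();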
I am using System.Data.SQLite and SQLiteDataReader in my C# project. I am facing performance issues when getting the results of a query with attached databases.
Here is an example of a query to search text into two databases :
ATTACH "db2.db" as db2;
SELECT MainRecord.RecordID,
((LENGTH(MainRecord.Value) - LENGTH(REPLACE(UPPER(MainRecord.Value), UPPER("FirstValueToSearch"), ""))) / 18) AS "FirstResultNumber",
((LENGTH(DB2Record.Value) - LENGTH(REPLACE(UPPER(DB2Record.Value), UPPER("SecondValueToSearch"), ""))) / 19) AS "SecondResultNumber"
FROM main.Record MainRecord
JOIN db2.Record DB2Record ON DB2Record.RecordID BETWEEN (MainRecord.PositionMin) AND (MainRecord.PositionMax)
WHERE FirstResultNumber > 0 AND SecondResultNumber > 0;
DETACH db2;
When executing this query with SQLiteStudio or SQLiteAdmin, it works fine: I get the results in a few seconds (the Record table can contain hundreds of thousands of records; the query returns 36,000 records).
When executing this query in my C# project, the execution takes a few seconds too, but it then takes hours to run through all the results.
Here is my code:
// Attach databases

SQLiteDataReader data = null;
using (SQLiteCommand command = this.m_connection.CreateCommand())
{
    command.CommandText = "SELECT...";
    data = command.ExecuteReader();
}

if (data.HasRows)
{
    while (data.Read())
    {
        // Do nothing, just iterate all results
    }
}

data.Close();

// Detach databases
Calling the Read method of the SQLiteDataReader just once can take more than 10 seconds! I guess this is because the SQLiteDataReader is lazily loaded (so it doesn't fetch the whole rowset before you start reading the results), am I right?
EDIT 1:
I don't know if this has something to do with lazy loading, like I said initially, but all I want is to be able to get ALL the results as soon as the query has finished. Isn't that possible? In my opinion, it is really strange that it takes hours to get the results of a query that executes in a few seconds...
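Something like this is what I mean - a sketch using DataTable.Load to pull everything in one go (it still runs the same underlying query, so it would not by itself fix a slow plan):
// Sketch: materialize the entire result set in one go with DataTable.Load
// instead of stepping through the reader row by row.
var resultTable = new DataTable();
using (SQLiteCommand command = this.m_connection.CreateCommand())
{
    command.CommandText = "SELECT...";
    using (SQLiteDataReader reader = command.ExecuteReader())
    {
        resultTable.Load(reader);
    }
}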
EDIT 2:
I just added a COUNT(*) to my select query to see if I could get the total number of results on the first data.Read(), just to be sure that it was only the iteration of the results that was taking so long. And I was wrong: this new request executes in a few seconds in SQLiteAdmin / SQLiteStudio but takes hours in my C# project. Any idea why the same query takes so much longer to execute in my C# project?
EDIT 3:
Thanks to EXPLAIN QUERY PLAN, I noticed that there is a slight difference in the execution plan for the same query between SQLiteAdmin / SQLiteStudio and my C# project. In the second case, it uses an AUTOMATIC PARTIAL COVERING INDEX on DB2Record instead of the primary key index. Is there a way to ignore / disable the use of automatic partial covering indexes? I know they are meant to speed up queries, but in my case it's rather the opposite that happens...
Thank you.
Besides finding matching records, it seems that you're also counting the number of times the strings matched, and the result of this count is used in the WHERE clause.
You want the number of matches, but the exact number of matches does not matter in the WHERE clause - you could try changing the WHERE clause to:
WHERE MainRecord.Value LIKE '%FirstValueToSearch%' AND DB2Record.Value LIKE '%SecondValueToSearch%'
It might not result in any difference though - especially if there's no index on the Value columns - but it's worth a shot. Indexes on text columns require a lot of space, so I wouldn't blindly recommend that.
If you haven't done so yet, place an index on DB2's RecordID column.
You can use EXPLAIN QUERY PLAN SELECT ... to make SQLite spit out what it does to try to make your query perform; the output of that might help diagnose the problem.
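For example, from the C# side (a sketch; the plan's column layout can vary slightly across SQLite versions):
// Dump SQLite's query plan for the problematic statement so the plan seen
// by the C# app can be compared with the one in SQLiteStudio.
using (SQLiteCommand command = this.m_connection.CreateCommand())
{
    command.CommandText = "EXPLAIN QUERY PLAN SELECT ..."; // same SELECT as above
    using (SQLiteDataReader reader = command.ExecuteReader())
    {
        while (reader.Read())
        {
            // The last column ("detail") describes each step of the plan.
            Console.WriteLine(reader[reader.FieldCount - 1]);
        }
    }
}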
Are you sure you use the same version of SQLite in System.Data.SQLite, SQLiteStudio and SQLiteAdmin?
There can be huge differences between versions.
One more typical reason why a SQL query can take a different amount of time when executed with ADO.NET versus a native utility (like SQLiteAdmin) is command parameters used in the CommandText (it is not clear from your code whether parameters are used or not). Depending on the ADO.NET provider implementation, the following identical CommandText values:
SELECT * FROM sometable WHERE somefield = ? // assume parameter is '2'
and
SELECT * FROM sometable WHERE somefield='2'
may lead to completely different execution plans and query performance.
Another suggestion: you may disable the journal (by specifying "Journal Mode=Off;" in the connection string) and synchronous mode ("Synchronous=Off;"), as these options may also affect query performance in some cases.
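For example (a sketch; "main.db" is a placeholder):
// Example connection string with journal and synchronous writes disabled.
// This trades durability for speed, so use with care.
var connectionString = "Data Source=main.db;Journal Mode=Off;Synchronous=Off;";
using (var connection = new SQLiteConnection(connectionString))
{
    connection.Open();
    // ... run queries ...
}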