We are using an Azure Search index to power one of our APIs. The index is populated by Azure Functions that pull data from the database. We can see that the number of records in the database and in the Search Service is different. Is there any way to get the list of keys in the Search Service so that we can compare it with the database and see which keys are missing?
Regards,
John
The Azure Search query API is designed for search/filter scenarios; it doesn't offer an efficient way to traverse all documents.
That said, you can do this reasonably well by scanning the keys in order: if you have a field in your index (the key field or another one) that's both filterable and sortable, you can use $select to pull only the keys for each document, 1000 at a time, ordered by that field. After you retrieve the first 1000, don't use $skip (which will limit you to 100,000 documents); instead, use a greater-than filter against that field, passing the highest value you saw in the previous response. This lets you traverse the whole set with reasonable performance, although doing it 1000 documents at a time will take a while.
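As a rough illustration, here is a minimal sketch of that loop using the older Microsoft.Azure.Search .NET SDK (the same SDK used in the facet example further down). The service name, index name, API key, and the key field name ("id") are placeholders; substitute whatever your index actually uses, and note the key field must be marked filterable and sortable.

using System;
using System.Collections.Generic;
using System.Linq;
using Microsoft.Azure.Search;
using Microsoft.Azure.Search.Models;

// Sketch: pull only the keys, 1000 at a time, ordered by the key field,
// using a greater-than filter instead of $skip to avoid the 100,000 skip limit.
var client = new SearchIndexClient("your-search-service", "your-index-name",
    new SearchCredentials("your-query-key"));

var keys = new List<string>();
string lastKey = null;

while (true)
{
    var parameters = new SearchParameters()
    {
        Select = new[] { "id" },                                  // the key field only
        OrderBy = new[] { "id asc" },                             // must be sortable
        Top = 1000,
        Filter = lastKey == null ? null : $"id gt '{lastKey}'"    // must be filterable; assumes keys contain no quotes
    };

    var page = client.Documents.Search("*", parameters);
    if (page.Results.Count == 0) break;

    keys.AddRange(page.Results.Select(r => (string)r.Document["id"]));
    lastKey = keys[keys.Count - 1];                               // highest key seen so far
}

// keys can now be compared against the database to find which ones are missing.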
Alternatively, you can search for "*" and use $orderby and $filter to page through all the data, following the example below. I use the metadata_storage_last_modified field as the filter.
offset  | skip    | time
0       | 0       |
100,000 | 100,000 | getLastTime
101,000 | 0       | useLastTime
200,000 | 99,000  | useLastTime
201,000 | 100,000 | useLastTime & getLastTime
202,000 | 0       | useLastTime
Because the $skip limit is 100k, we can calculate skip as:
AzureSearchSkipLimit = 100k
AzureSearchTopLimit = 1k
skip = offset % (AzureSearchSkipLimit + AzureSearchTopLimit)
If the total search count is larger than AzureSearchSkipLimit, then apply:
orderby = "metadata_storage_last_modified desc"
When skip reaches AzureSearchSkipLimit, get the metadata_storage_last_modified time from the end of the data and use it as the filter for the next 100k of the search:
filter = metadata_storage_last_modified lt ${metadata_storage_last_modified}
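To make the table and formula above concrete, here is a small C# sketch of that calculation (the names are just illustrative, matching the pseudo-code above):

static class SearchPaging
{
    const int AzureSearchSkipLimit = 100_000;
    const int AzureSearchTopLimit = 1_000;

    // Given a logical offset into the full result set, work out the $skip to send,
    // whether the saved metadata_storage_last_modified value should be used as a filter,
    // and whether this page is the one whose last timestamp must be remembered.
    public static (int Skip, bool UseLastTime, bool GetLastTime) PageParams(int offset)
    {
        int skip = offset % (AzureSearchSkipLimit + AzureSearchTopLimit);
        bool useLastTime = offset >= AzureSearchSkipLimit + AzureSearchTopLimit;
        bool getLastTime = skip == AzureSearchSkipLimit;
        return (skip, useLastTime, getLastTime);
    }
}

// When UseLastTime is true the request adds:
//   $orderby = metadata_storage_last_modified desc
//   $filter  = metadata_storage_last_modified lt <saved timestamp>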
I am using Azure Search indexing for creating a faceted search of products. I have around 5 facets to aid in filtering the list of displayed products.
One thing I have noticed is that when there are quite a lot of products to filter using facets, facet values with only a few matching products do not get returned from the index.
For example (in its simplicity), if my index had the following car manufacturers listed within a facet:
Audi (312)
BMW (203)
Volvo (198)
Skoda (4)
I would find that Skoda would not get returned, since there are so few search results linked to that manufacturer.
I can see this is the case when I search the index directly within the Azure Portal by using this query: facet=<facet-field-name>
After some research I came across the following explanation:
Facet counts can be inaccurate due to the sharding architecture. Every search index has multiple shards, and each shard reports the top N facets by document count, which is then combined into a single result. If some shards have many matching values, while others have fewer, you may find that some facet values are missing or under-counted in the results.
Although this behavior could change at any time, if you encounter this behavior today, you can work around it by artificially inflating the count: to a large number to enforce full reporting from each shard. If the value of count: is greater than or equal to the number of unique values in the field, you are guaranteed accurate results. However, when document counts are high, there is a performance penalty, so use this option judiciously.
Based on the above quote, how do I artificially inflate the count to get around this issue? Or does anyone know a better approach?
The default facet count is 10. You can specify a larger count using the count parameter as part of the facet expression. For example, assuming you're using the REST API with an HTTP GET request:
facet=myfield,count:100
If you're using the .NET SDK:
var parameters = new SearchParameters()
{
    Facets = new[] { "myfield,count:100" }
};

var results = indexClient.Documents.Search("*", parameters);
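To read the counts back with that same SDK, the facet values and their document counts come back on the response; roughly (assuming the myfield facet from the example above):

// Each facet value is returned together with its document count.
foreach (var facet in results.Facets["myfield"])
{
    Console.WriteLine($"{facet.Value}: {facet.Count}");
}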
You can find more details about the facet expression syntax in the Azure Search REST API Reference.
Is it possible to "page" the results from the Analytics API?
If I use the following query (via HTTP POST):
{
"query":"customEvents | project customDimensions.FilePath, timestamp
| where timestamp > now(-100d) | order by timestamp desc | limit 25"
}
I get up to 10,000 results back in one result set. Is there any way to use something similar to the $skip that they have for the events API? Like "SKIP 75 TAKE 25" or something to get the 4th page of results.
[Edit: this answer is now out of date; a row_number function has since been added to the query language. This answer is left for historical purposes in case anyone runs into strange queries that look like the one below.]
Not easily.
If you can use the /events OData query path instead of the /query path, that supports paging, but it doesn't really support custom queries like the one you have.
To get something like paging, you need to write a complicated query: use summarize and makelist to invent a rowNum field, then use mvexpand to re-expand the lists, and finally filter by rowNum. It's pretty complicated and unintuitive, something like:
customEvents
| project customDimensions.FilePath, timestamp
| where timestamp > now(-100d)
| order by timestamp desc
// squishes things down to 1 row where each column is huge list of values
| summarize filePath=makelist(customDimensions.FilePath, 1000000)
, timestamp=makelist(timestamp, 1000000)
// make up a row number, not sure this part is correct
, rowNum = range(1,count(strcat(filePath,timestamp)),1)
// expands the single rows into real rows
| mvexpand filePath,timestamp,rowNum limit 1000000
| where rowNum > 0 and rowNum <= 100 // you'd change these values to page
I believe there's already a request on the Application Insights UserVoice to support paging operators in the query language.
The other assumption here is that the data isn't changing in the underlying table while you're paging. If new data appears between your calls, like:
1. give me rows 0-99
2. 50 new rows appear
3. give me rows 100-199
then step 3 is actually giving you back 50 duplicate rows that you just got in step 1.
There's a more correct way to do this now, using new operators that were added to the query language since my previous answer.
The two operators are serialize and row_number().
serialize ensures the data is in a shape and order that works with row_number(). Some of the existing operators like order by already create serialized data.
There are also prev() and next() operators that can get values from the previous or next rows in a serialized result.
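For example, a paged version of the query from the question might look roughly like this (an untested sketch: 25 rows per page, fetching the 4th page):

customEvents
| project FilePath = customDimensions.FilePath, timestamp
| where timestamp > now(-100d)
| order by timestamp desc
| extend rowNum = row_number()        // order by already serializes the rows
| where rowNum > 75 and rowNum <= 100 // rows 76-100, i.e. the 4th page of 25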
When I order by a datetime column where many rows share the same value, I get different results on different hits from LINQ to SQL.
Let's say some 15 records have the same datetime in one of their fields, and those records are paginated with a per-page limit of 10. On the first run, 10 of those records come back for page 1; then for page 2 I don't get the remaining 5 records, but instead 5 of the records that were already on page 1.
Question:
How do the OrderBy, Skip, and Take functions work here, and why is there this discrepancy in the results?
LINQ plays no role in how the ordering is applied to the underlying data source; LINQ itself is simply an enumerating extension. Per the comment on your question, what you are really asking is how MS SQL applies ordering in a query.
In MSSQL (and most other RDBMS), the ordering on identical values is dependent on the underlying implementation and configuration of the RDBMS. The ordered result for such values can be perceived as random, and can change between identical queries. This does not mean you will see a difference, but you cannot rely on the data to be returned in a specific order.
This has been asked and answered before on SO, here.
This is also described in the community addon comments in this MSDN article.
No ordering is applied beyond that specified in the ORDER BY clause. If all rows have the same value, they can be returned in whatever order is fastest. That's especially evident when a query is executed in parallel.
This means that you can't use paging on results ordered by non-unique values. Each time you make a call the order can change.
In such cases you need to add tie-breaker columns that ensure unique ordering values, e.g. the ID of a product: ORDER BY Date, ProductID.
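In LINQ terms that tie-breaker is just an extra ThenBy; a rough sketch, assuming a hypothetical Products table with Date and ProductID columns and placeholder paging variables:

// Date alone is not unique, so ProductID is added as a tie-breaker to make
// the ordering, and therefore the paging, deterministic across calls.
var page = context.Products
    .OrderBy(p => p.Date)
    .ThenBy(p => p.ProductID)
    .Skip(pageIndex * pageSize)
    .Take(pageSize)
    .ToList();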
I want to get items 50-100 back from a result set in LINQ. How do I do that?
Situation:
I get back the index of the last item of the last result set, and I want to then grab the next 50. I do not have the ID of the last result, only its index number.
You have to order it by something, otherwise you really can't. So it would be something like:
mycontext.mytable
    .OrderBy(item => item.PropertyYouWantToOrderBy)
    .Skip(HowManyYouWantToSkip)
    .Take(50);
LINQ is based on the concept of one-way enumeration, so queries all start at the beginning. To implement paging you'll have to use the Skip and Take extensions to isolate the items you're interested in:
int pageSize = 50;
// 0-based page number
int pageNum = 2;
var pageContent = myCollection.Skip(pageSize * pageNum).Take(pageSize);
Of course this just sets up an IEnumerable<T> that, when enumerated, will step through myCollection 100 times, then start returning data for 50 steps before closing.
Which is fine if you're working with something that can be enumerated multiple times, but not if you're working with a source that will only enumerate once. In that case you can't realistically implement paging on that sort of enumeration anyway; you need intermediate storage for at least the portion of it that you've already consumed.
In LINQ to SQL this will result in a query that attempts to select only the 50 records you've asked for, which from memory is going to be based on taking numSkip + numTake records, reversing the sort order, taking numTake records, and reversing again. Depending on the sort order you've set up and the size of the numbers involved, this could be a much more expensive operation than simply pulling a bunch of data back and filtering it in memory.
I'm using Entity Framework 6 and I make LINQ queries from an ASP.NET server to an Azure SQL database.
I need to retrieve the latest 20 rows that satisfy a certain condition.
Here's a rough example of my query:
using (PostHubDbContext postHubDbContext = new PostHubDbContext())
{
    DbGeography location = DbGeography.FromText(string.Format("POINT({1} {0})", latitude, longitude));

    IQueryable<Post> postQueryable =
        from postDbEntry in postHubDbContext.PostDbEntries
        orderby postDbEntry.Id descending
        where postDbEntry.OriginDbGeography.Distance(location) < (DistanceConstant)
        select new Post(postDbEntry);

    postQueryable = postQueryable.Take(20);

    IOrderedQueryable<Post> postOrderedQueryable = postQueryable.OrderBy(Post => Post.DatePosted);

    return postOrderedQueryable.ToList();
}
The question is, what if I literally have a billion rows in my database? Will that query brutally select the millions of rows that meet the condition and then take 20 of them, or will it be smart enough to realise that I only want 20 rows and select only those?
Basically, how do I make this query work efficiently against a database that has a billion rows?
According to http://msdn.microsoft.com/en-us/library/bb882641.aspx, the Take() function has deferred streaming execution, as does the select statement. This means it should be equivalent to TOP 20 in SQL, and SQL Server will fetch only 20 rows from the database.
This link, http://msdn.microsoft.com/en-us/library/bb399342(v=vs.110).aspx, shows that Take has a direct translation in LINQ to SQL.
So the only performance gains to be made are in the database. As @usr suggested, you can use indexes to increase performance. Storing the table in sorted order also helps a lot (which is likely your case, as you sort by Id).
Why not try it? :) You can inspect the SQL it generates and see what it looks like, then look at the execution plan for that SQL and see if it scans the entire table.
Check out this question for more details: How do I view the SQL generated by the Entity Framework?
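With EF6 specifically (which the question uses), one quick way to see the generated SQL is the Database.Log hook; a sketch using the DbContext from the question:

using (var postHubDbContext = new PostHubDbContext())
{
    // EF6: every SQL command EF sends is written to the debug output.
    postHubDbContext.Database.Log = sql => System.Diagnostics.Debug.WriteLine(sql);

    // ... run the query from the question here, then check the Output window to see
    // whether it contains TOP (20) and how the spatial filter was translated.
}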
This will be hard to get really fast. You want an index to give you the sort order on Id but you want a different (spatial) index to provide you with efficient filtering. It is not possible to create an index that fulfills both goals efficiently.
Assume both indexes exist:
If the filter is very selective, expect SQL Server to "select" all rows where the filter is true, then sort them, then give you the top 20. Imagine there are only 21 rows that pass the filter; then this strategy is clearly very efficient.
If the filter is not at all selective, SQL Server will instead traverse the table ordered by Id, test each row it comes across, and output the first 20. Imagine that the filter applies to all rows; then SQL Server can just output the first 20 rows it sees. Very fast.
So for 100% or 0% selectivity the query will be fast. In between there are nasty mixtures. If that's your situation, this question requires further thought. You probably need more than a clever indexing strategy; you need app changes.
Btw, we don't need an index on DatePosted. The sorting by DatePosted is only done after limiting the set to 20 rows. We don't need an index to sort 20 rows.