I have two separate systems: a document management system and a SharePoint search server.
Both systems have an API that I can use to search the data within them. The same data may exist in both systems, so we must search both.
What is the most efficient way (speed is very important) to search both APIs at the same time and merge the results together?
Is the following idea bad/good/slow/fast?
user enters search terms
the API for each system is called on its own thread
the results from each API are placed in a common IEnumerable of the same type
when both threads have executed, LINQ is used to join the two IEnumerable result objects together
results are passed to the view
The application is ASP.NET MVC C#.
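The steps above can be sketched roughly as follows. The SearchResult type and the two adapter delegates are placeholders for whatever your real adapters return; only the "call both, wait, merge with LINQ" shape comes from the question itself:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

public class SearchResult
{
    public string Id { get; set; }
    public string Title { get; set; }
    public string Source { get; set; }
}

public static class FederatedSearch
{
    public static async Task<IEnumerable<SearchResult>> SearchAsync(
        string terms,
        Func<string, Task<IEnumerable<SearchResult>>> searchDms,
        Func<string, Task<IEnumerable<SearchResult>>> searchSharePoint)
    {
        // Kick off both calls at once; total latency is the slower
        // of the two calls, not the sum.
        var dmsTask = searchDms(terms);
        var spTask = searchSharePoint(terms);
        await Task.WhenAll(dmsTask, spTask);

        // Merge and de-duplicate, since the same document
        // may exist in both systems.
        return dmsTask.Result
            .Concat(spTask.Result)
            .GroupBy(r => r.Id)
            .Select(g => g.First())
            .ToList();
    }
}
```

In a modern ASP.NET MVC action you would typically use async/await and Tasks as above rather than raw threads; the effect is the same as the thread-per-API idea in the question.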
Your solution looks alright - you're using the adapter pattern to convert the two different result feeds into your required format, and the overall design is a facade pattern. From a design point of view, your solution is valid.
If you wanted to make things even better, you could display the results as soon as they arrive and display a notification that results are still loading until all APIs have returned a value. If your document management system was significantly faster or slower than sharepoint, it would give the user information faster that way.
I don't see anything wrong with the way you are doing it. A search could run forever chasing perfect results, so you need to strike a balance. Any optimisation would have to be done in the search algorithm itself (or rather in document indexing). You would still have to compromise on how many hits are good enough for the user by limiting the duration of your thread execution.
I want to create a search box that will show results relating to the typed text. I am using .NET MVC and I have been stuck on this for a while. I want to use the AlphaVantage API search endpoint to create this.
It would look like this. I just don't know what component to use or how to implement it.
Since we don't know the amount of data or the possible stack/budget in your project, autocompletion/autosuggestion could be implemented in several ways:
In memory (you break your word into all possible prefixes and map them to your entity through a dictionary; this can be optimized, like so - https://github.com/omerfarukz/autocomplete). The limit is around 10 million entries, with a lot of memory consumption. It also supports some storage mechanics, but I don't think it is more powerful than a fully fledged Lucene.
In a Lucene index (Lucene.Net (4.8) AutoComplete / AutoSuggestion). Limited to 2 billion entries, very optimized memory usage, stored on a hard drive or anywhere else. Hard to work with, because it exposes low-level micro-optimizations on indexes and the overall tokenization/indexing pipeline.
In an Elasticsearch cluster (https://www.elastic.co/guide/en/elasticsearch/client/net-api/current/suggest-usage.html). Unlimited, uses Lucene indexes as sharding units. Same as Lucene, but every cloud infrastructure provides it for a pretty penny.
In SQL using a full text index (SQL Server Full Text Catalog and Autocomplete). Limited by database providers such as SQLite/MSSQL/Oracle/etc., cheap, easy to use; it usually consumes CPU like there is no tomorrow, but hey, it is relational, so you can join any data to the results.
As to how to use it: basically you send a request to the desired framework instance and retrieve the first N results, which you then serve in some REST GET response.
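The first (in-memory) option above can be sketched in a few lines: map every prefix of every entry to the entries it matches. This is only a toy version of what libraries like the one linked above do, with no memory optimization at all:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public class PrefixIndex
{
    private readonly Dictionary<string, List<string>> _byPrefix =
        new Dictionary<string, List<string>>(StringComparer.OrdinalIgnoreCase);

    public void Add(string entry)
    {
        // Index every prefix of the entry ("tesla" -> "t", "te", "tes", ...).
        for (int len = 1; len <= entry.Length; len++)
        {
            string prefix = entry.Substring(0, len);
            if (!_byPrefix.TryGetValue(prefix, out var bucket))
                _byPrefix[prefix] = bucket = new List<string>();
            bucket.Add(entry);
        }
    }

    // Return up to `limit` suggestions for what the user has typed so far.
    public IEnumerable<string> Suggest(string typed, int limit = 10) =>
        _byPrefix.TryGetValue(typed, out var bucket)
            ? bucket.Take(limit)
            : Enumerable.Empty<string>();
}
```

This is why the memory cost is high: an entry of length n creates n dictionary references, which is exactly the trade-off the in-memory option makes for O(1) lookups.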
You'll have to make a request with HttpClient to the API that returns your data. You'll also need to provide any required authorization information (whether headers or keys). The call should be async, or possibly run in a background worker, so that it doesn't block your thread. The request needs to fire whenever there's a change in your search box.
You can probably find details on how to do the request here.
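A hedged sketch of the HttpClient call: the URL shape, parameter names, and use of GET here are assumptions modelled on AlphaVantage's symbol search; verify the exact endpoint, parameters, and HTTP verb against their documentation before relying on it:

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;

public static class SymbolSearchClient
{
    // HttpClient is intended to be reused, not created per request.
    private static readonly HttpClient Http = new HttpClient();

    // Hypothetical URL shape -- check the provider's docs for the real one.
    public static string BuildUrl(string keywords, string apiKey) =>
        "https://www.alphavantage.co/query?function=SYMBOL_SEARCH" +
        "&keywords=" + Uri.EscapeDataString(keywords) +
        "&apikey=" + Uri.EscapeDataString(apiKey);

    public static async Task<string> SearchAsync(string keywords, string apiKey)
    {
        using (var response = await Http.GetAsync(BuildUrl(keywords, apiKey)))
        {
            response.EnsureSuccessStatusCode();
            return await response.Content.ReadAsStringAsync(); // raw JSON payload
        }
    }
}
```

In practice you would also debounce the keystrokes (wait a few hundred milliseconds after the last change) so you don't fire a request on every single character.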
I'm currently looking at putting together a .NET role provider for Couchbase for use in a project. I'm looking to model this as two document types.
The first document type is a simple list of all of the roles available in the application. This makes it easy to support Add, Delete, GetAllRoles, etc. This document would have a fixed key per application, e.g. "applicationname_roles", so it is well known from the code's point of view and quickly retrievable.
The second document maps a user to a role, so for example
{
    "type": "roleprovider.user-role",
    "user": "user1",
    "role": "role1",
    "application": "app1"
}
The key for this document type would be of the format "applicationname_roles_username_rolename", making the most common operation of testing if a user is in a particular role trivial and quick.
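The key scheme above can be captured in a tiny helper; the class and method names here are hypothetical, only the "applicationname_roles_username_rolename" format comes from the design described:

```csharp
public static class RoleKeys
{
    // Well-known key for the per-application roles list document.
    public static string AllRoles(string application) =>
        application + "_roles";

    // Key for a single user-to-role mapping document.
    public static string UserRole(string application, string user, string role) =>
        application + "_roles_" + user + "_" + role;
}
```

IsUserInRole then becomes a single key lookup against the bucket (e.g. a get/exists call on RoleKeys.UserRole("app1", "user1", "role1")), with no view involved.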
To support the GetRolesForUser or GetUsersInRole methods of the .NET role provider, I'm looking at using a view of the form:
function (doc, meta) {
    if (meta.type != 'json') {
        return;
    }
    if (doc.type == "roleprovider.user-role") {
        if (doc.application && doc.user && doc.role) {
            emit([doc.application, "user:" + doc.user, doc.role], null);
            emit([doc.application, "role:" + doc.role, doc.user], null);
        }
    }
}
So for every user-to-role mapping we get two rows emitted into the view. The first allows us to query the view for which roles a user is in; the second for which users are in a role. The .NET provider simply needs to prefix either "user:" or "role:" depending on whether it's serving GetRolesForUser or GetUsersInRole, to filter down to what it needs.
So now to the question. This all seems reasonably trivial and logical; however, it's the first time I've worked with Couchbase and I wondered if I was falling into any traps with this. An obvious alternative approach would be to use two views, but in my reading I've seen it mentioned that it's best to keep the number of design documents down, and the number of views within those down as well; see Perry Krug's reply in the Couchbase views-per-bucket discussion, in which he mentions trying to 'generate multiple different queries off of one index'. So basically I'm wondering if the approach I've described above follows what Perry is saying, or if I'm just tricking myself and going to cause myself pain down the line.
Thanks for any pointers.
(Note: resurrecting an ancient question because it's not been answered yet, and it might interest someone else.)
Your approach in general is sound. But, unless you're actually experiencing performance issues, I would stick to one view per query type in this case. While combining multiple queries into a single view will reduce the amount of work Couchbase needs to do to build the views, it will increase the cost of each query, as it will have to scan an index that's twice as large. If you're not using many other views at the same time, I would keep the views separate. In fact, I would even put them in different design documents, so that Couchbase will process them concurrently by different indexer threads. There's no need to prematurely optimize for a performance problem that doesn't exist yet; go with the straightforward solution first, then optimize if necessary.
If you do run into a performance problem with query reads, you may need to consider moving from views to a key/value based approach. Specifically, storing a separate document for every application-user and application-role pair, and appending a list of roles to the former and users to the latter. This means that you essentially end up maintaining the indexes yourself, but it will give you an order of magnitude improvement in read latency. Take a look at this blog about maintaining sets with append operations.
Yes, I do see the irony in advocating against premature optimization in one paragraph and suggesting a performance optimization in the next. So I'd like to emphasize that you should first test whether the view approach gives you acceptable performance - and for most reasonable applications it will - and if it does, stick to that because it's much more straightforward. If you discover that you need better performance after all, then start looking into the second alternative.
I wonder if somebody could point me in the right direction. I've recently started playing with LINQ to SQL and love the strongly typed data objects, etc.
I'm just struggling to understand the impact on database performance etc. For example, say I was developing a simple user profile page. The page shows basic information about the user, some information on their recent activity, and a list of unread notifications.
If I was developing a stored procedure for this page, I could create a single SP which returns multiple data tables covering all of the required information - resulting in a single DB call.
However, using LINQ to SQL, this could result in many calls - one for user info, at least one for activity, at least one for notifications; if I then want further info on the notifications, this may result in further calls - multiple DB calls.
Should I be worried about the number of DB calls happening as a result of using this design pattern? I.e., are the multiple DB handshakes etc. going to degrade my DB?
I'd appreciate your thoughts on this!
Thanks
David
LINQ to SQL can consume multiple results from a stored proc if you need to go that route. Unfortunately the designer has problems mapping them correctly, so you will probably need to create your mapping manually. See http://www.thinqlinq.com/Default/Using-LINQ-to-SQL-to-return-Multiple-Results.aspx.
You can configure LINQ to SQL to eagerly load the child records if you know that you're going to need them for every parent record. Use the DataLoadOptions and .LoadWith to configure it.
You can also project an object graph with multiple child collections in the Select clause of a LINQ query to reduce the number of DB hits that you make.
Ultimately, you need to check a number of options to determine which route is the best performance for your situation. It's not a one size fits all scenario.
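The DataLoadOptions/LoadWith option mentioned above looks roughly like this. The DataContext subclass and the User/Notifications/ActivityItems entity names are placeholders for whatever your DBML generates; only DataLoadOptions, LoadWith, and the LoadOptions property are real LINQ to SQL API:

```csharp
using System.Data.Linq;

public static class ProfilePageData
{
    // Assumes a DBML-generated context with a User entity that has
    // Notifications and ActivityItems child collections (placeholder names).
    public static void ConfigureEagerLoading(DataContext db)
    {
        var options = new DataLoadOptions();
        // Fetch child rows in the same round trip as each user row,
        // instead of lazily issuing one extra query per collection.
        options.LoadWith<User>(u => u.Notifications);
        options.LoadWith<User>(u => u.ActivityItems);
        db.LoadOptions = options; // must be set before the first query runs
    }
}
```

Note that LoadOptions can only be assigned before the context has executed any query, so this belongs right after you construct the DataContext.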
Is it worse from a performance standpoint? Yes, it should be. Multiple round trips are usually worse than a single one.
The real question is, do you mind? Is your application going to receive enough visits to warrant the added complexity of a stored procedure? Or do you value the simplicity of future modifications over raw performance?
In any case, if you need the performance, you can create a stored procedure and map it on your context. This will give you one single call, but still return the data as objects.
Here is an article explaining a bit about that option:
linq-to-sql-returning-multiple-result-sets
I've been implementing MS Search Server 2010 and so far it's really good. I'm doing the search queries via their web service, but due to the inconsistent results, I'm thinking about caching the results instead.
The site is a small intranet (500 employees), so it shouldn't be any problem, but I'm curious what approach you would take if it was a bigger site.
I've googled a bit, but haven't really come across anything specific. So, a few questions:
What other approaches are there? And why are they better?
How much does it cost to store a dataview of 400-500 rows? What sizes are feasible?
Other points you should take into consideration.
Any input is welcome :)
You need to employ many techniques to pull this off successfully.
First, you need some sort of persistence layer. If you are using a plain old website, then the user's session would be the most logical layer to use. If you are using web services (meaning session-less) and just making calls through a client, well then you still need some sort of application layer (sort of a shared session) for your services. Why? This layer will be home to your database result cache.
Second, you need a way of caching your results in whatever container you are using (session or the application layer of web services). You can do this a couple of ways... If the query is something that any user can do, then a simple hash of the query will work, and you can share this stored result among other users. You probably still want some sort of GUID for the result, so that you can pass this around in your client application, but having a hash lookup from the queries to the results will be useful. If these queries are unique then you can just use the unique GUID for the query result and pass this along to the client application. This is so you can perform your caching functionality...
The caching mechanism can incorporate some sort of fixed length buffer or queue... so that old results will automatically get cleaned out/removed as new ones are added. Then, if a query comes in that is a cache miss, it will get executed normally and added to the cache.
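The fixed-length buffer/queue idea above can be sketched like this: a dictionary keyed by a normalized form of the query, plus a queue that evicts the oldest entry once the cache is full. Names and the normalization scheme are my own; adapt to your result type:

```csharp
using System;
using System.Collections.Generic;

public class BoundedResultCache<TResult>
{
    private readonly int _capacity;
    private readonly Dictionary<string, TResult> _byQuery = new Dictionary<string, TResult>();
    private readonly Queue<string> _insertionOrder = new Queue<string>();

    public BoundedResultCache(int capacity) { _capacity = capacity; }

    public bool TryGet(string query, out TResult result) =>
        _byQuery.TryGetValue(Normalize(query), out result);

    public void Add(string query, TResult result)
    {
        string key = Normalize(query);
        if (_byQuery.ContainsKey(key)) return;
        if (_byQuery.Count >= _capacity)
            _byQuery.Remove(_insertionOrder.Dequeue()); // evict the oldest entry
        _byQuery[key] = result;
        _insertionOrder.Enqueue(key);
    }

    // Identical queries should hit the same slot regardless of case/whitespace.
    private static string Normalize(string query) => query.Trim().ToLowerInvariant();
}
```

In a real application you would wrap this in locks (or use a concurrent collection) since multiple requests will hit it at once, and you might evict by last access rather than insertion order.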
Third, you are going to want some way to page your result object... the Iterator pattern works well here, though probably something simpler might work... like fetch X amount of results starting at point Y. However the Iterator pattern would be better as you could then remove your caching mechanism later and page directly from the database if you so desired.
Fourth, you need some sort of pre-fetch mechanism (as others suggested). You should launch a thread that will do the full search, and in your main thread just do a quick search with the top X number of items. Hopefully by the time the user tries paging, the second thread will be finished and your full result will now be in the cache. If the result isn't ready, you can just incorporate some simple loading screen logic.
This should get you some of the way... let me know if you want clarification/more details about any particular part.
I'll leave you with some more tips...
You don't want to be sending the entire result to the client app (if you are using Ajax or something like an iPhone app). Why? Well, because that is a huge waste. The user likely isn't going to page through all of the results... now you've just sent over 2MB of result fields for nothing.
JavaScript is an awesome language, but remember it is still a client-side scripting language... you don't want to slow the user experience down too much by sending massive amounts of data for your Ajax client to handle. Just send the prefetched result to your client, and additional page results as the user pages.
Abstraction, abstraction, abstraction... you want to abstract away the cache, the querying, the paging, the prefetching... as much of it as you can. Why? Well, let's say you want to switch databases, or you want to page directly from the database instead of using a result object in cache... if you do it right, this is much easier to change later on. Also, if you're using web services, many other applications can make use of this logic later on.
Now, I probably suggested an over-engineered solution for what you need :). But, if you can pull this off using all the right techniques, you will learn a ton and have a very good base in case you want to extend functionality or reuse this code.
Let me know if you have questions.
It sounds like the slow part of the search is the full-text searching, not the result retrieval. How about caching the resulting resource record IDs? Also, since it might be true that search queries are often duplicated, store a hash of the search query, the query, and the matching resources. Then you can retrieve the next page of results by ID. Works with AJAX too.
Since it's an intranet and you may control the searched resources, you could even pre-compute a new or updated resource's match to popular queries during idle time.
I have to admit that I am not terribly familiar with MS Search Server so this may not apply. I have often had situations where an application had to search through hundreds of millions of records for result sets that needed to be sorted, paginated and sub-searched in a SQL Server though. Generally what I do is take a two step approach. First I grab the first "x" results which need to be displayed and send them to the browser for a quick display. Second, on another thread, I finish the full query and move the results to a temp table where they can be stored and retrieved quicker. Any given query may have thousands or tens of thousands of results but in comparison to the hundreds of millions or even billions of total records, this smaller subset can be manipulated very easily from the temp table. It also puts less stress on the other tables as queries happen. If the user needs a second page of records, or needs to sort them, or just wants a subset of the original query, this is all pulled from the temp table.
Logic then needs to be put into place to check for outdated temp tables and remove them. This is simple enough, and I let SQL Server handle that functionality. Finally, logic has to be put into place for when the original query changes (significant parameter changes) so that a new data set can be pulled and placed into a new temp table for further querying. All of this is relatively simple.
Users are so used to split second return times from places like google and this model gives me enough flexibility to actually achieve that without needing the specialized software and hardware that they use.
Hope this helps a little.
Tim's answer is a great way to handle things if you have the ability to run the initial query in a second thread and the logic (paging/sorting/filtering) to be applied to the results requires action on the server... otherwise...
If you can use AJAX, a 500 row result set could be called into the page and paged or sorted on the client. This can lead to some really interesting features .... check out the datagrid solutions from jQueryUI and Dojo for inspiration!
And for really intensive features like arbitrary regex filters and drag-and-drop column re-ordering you can totally free the server.
Loading the data to the browser all at once also lets you call in supporting data (page previews etc) as the user "requests" them ....
The main issue is limiting the data you return per result to what you'll actually use for your sorts and filters.
The possibilities are endless :)
I've been tasked with setting up a search service on an eCommerce site.
Currently, it uses full text indexing on SQL Server, which isn't ideal, as it's slow and not all that flexible.
How would you suggest I approach changing this over to Lucene?
By that I mean: how would I initially load all the data into the indexes, and how would it be maintained? In my "insert product" methods, would I also have them insert the product into the index?
Any information is of great help!
I'm currently using Solr, which is built on top of Lucene, as the search engine for one of my e-commerce projects. It works great.
http://lucene.apache.org/solr/
Also as far as keeping the products in sync between the DB and Solr, you can either build your own "sweeper" or implement the DataImportHandler in Solr.
http://wiki.apache.org/solr/DataImportHandler
We built our own sweeper that reads a DB view at some interval and checks whether there are new products or any product data has been updated. It's a brute-force method, and I wish I had known about the DataImportHandler before.
Facets are also a really powerful part of Solr. I highly recommend using them.
If you do decide to use Lucene.NET for your search you need to do some of the following:
create your initial index by iterating through all your records and writing the data that you want searched into your index
if the amount of records and data that you are writing to your indexes makes for large indexes, then consider splitting them into multiple indexes (this means you will have to write a more complex search program, as you need to search each index and then merge the results!)
when a product is updated or created, you need to update your index (there is a process here to create additional index parts and then merge the indexes)
if you have a high-traffic site and there is the possibility of multiple searches occurring at the exact same moment, then you need to create a wrapper that can do the search for you across multiple duplicate indexes (or sets of indexes) (think singleton pattern here), as the index can only be accessed (opened) for one search at a time
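The initial-index step above looks roughly like this with the Lucene.Net 3.x line of the API (class and member names changed in 4.8, and the Product type here is a placeholder for your own entity), so treat it as a sketch rather than drop-in code:

```csharp
using System.Collections.Generic;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Store;

public class Product
{
    public int Id { get; set; }
    public string Name { get; set; }
    public string Description { get; set; }
}

public static class ProductIndexer
{
    public static void RebuildIndex(IEnumerable<Product> products, string indexPath)
    {
        var analyzer = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30);
        using (var directory = FSDirectory.Open(indexPath))
        using (var writer = new IndexWriter(directory, analyzer,
                   true /* create a fresh index */, IndexWriter.MaxFieldLength.UNLIMITED))
        {
            foreach (var product in products)
            {
                var doc = new Document();
                // The id is stored but not tokenized, so it can be used
                // later to update/delete this product's index entry.
                doc.Add(new Field("id", product.Id.ToString(),
                    Field.Store.YES, Field.Index.NOT_ANALYZED));
                doc.Add(new Field("name", product.Name ?? "",
                    Field.Store.YES, Field.Index.ANALYZED));
                doc.Add(new Field("description", product.Description ?? "",
                    Field.Store.NO, Field.Index.ANALYZED));
                writer.AddDocument(doc);
            }
        }
    }
}
```

For the "product updated or created" step, the usual pattern is writer.UpdateDocument(new Term("id", id), doc), which deletes any existing document with that id and adds the new one in a single operation.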
This is a great platform. We initially tried to use the free-text search and found it to be a pain to create, update, and manage the indexes. The searches were not that much faster than a standard SQL search. It did provide some flexibility in the search query... but even this pales in comparison to the power of Lucene!