Storing search results for paging and sorting - C#

I've been implementing MS Search Server 2010 and so far it's really good. I'm doing the search queries via their web service, but due to the inconsistent results, I'm thinking about caching the result instead.
The site is a small intranet (500 employees), so it shouldn't be any problem, but I'm curious what approach you would take if it were a bigger site.
I've googled a bit, but haven't really come across anything specific. So, a few questions:
What other approaches are there? And why are they better?
How much does it cost to store a dataview of 400-500 rows? What sizes are feasible?
What other points should I take into consideration?
Any input is welcome :)

You need to employ many techniques to pull this off successfully.
First, you need some sort of persistence layer. If you are using a plain old website, then the user's session would be the most logical layer to use. If you are using web services (meaning session-less) and just making calls through a client, well then you still need some sort of application layer (sort of a shared session) for your services. Why? This layer will be home to your database result cache.
Second, you need a way of caching your results in whatever container you are using (session or the application layer of web services). You can do this a couple of ways... If the query is something that any user can do, then a simple hash of the query will work, and you can share this stored result among other users. You probably still want some sort of GUID for the result, so that you can pass this around in your client application, but having a hash lookup from the queries to the results will be useful. If these queries are unique then you can just use the unique GUID for the query result and pass this along to the client application. This is so you can perform your caching functionality...
The caching mechanism can incorporate some sort of fixed length buffer or queue... so that old results will automatically get cleaned out/removed as new ones are added. Then, if a query comes in that is a cache miss, it will get executed normally and added to the cache.
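As a rough illustration of that buffer idea, here is a minimal sketch of a bounded cache keyed by a query hash (the class and member names are made up, and DataView is just a stand-in for whatever result type you end up storing):

using System.Collections.Generic;
using System.Data;

public class BoundedResultCache
{
    private readonly int _capacity;
    private readonly Dictionary<string, DataView> _results = new Dictionary<string, DataView>();
    private readonly Queue<string> _insertionOrder = new Queue<string>();
    private readonly object _sync = new object();

    public BoundedResultCache(int capacity) { _capacity = capacity; }

    public bool TryGet(string queryHash, out DataView result)
    {
        lock (_sync) { return _results.TryGetValue(queryHash, out result); }
    }

    public void Add(string queryHash, DataView result)
    {
        lock (_sync)
        {
            if (_results.ContainsKey(queryHash)) return;
            if (_results.Count >= _capacity)
                _results.Remove(_insertionOrder.Dequeue());   // evict the oldest entry
            _results[queryHash] = result;
            _insertionOrder.Enqueue(queryHash);
        }
    }
}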
Third, you are going to want some way to page your result object... the Iterator pattern works well here, though probably something simpler might work... like fetch X amount of results starting at point Y. However the Iterator pattern would be better as you could then remove your caching mechanism later and page directly from the database if you so desired.
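A minimal sketch of that paging abstraction, assuming the results are already materialized in the cache (the interface and names are illustrative, not an existing API):

using System.Collections.Generic;
using System.Linq;

public interface IResultPager<T>
{
    int TotalCount { get; }
    IEnumerable<T> GetPage(int offset, int count);
}

// Cache-backed implementation: paging is just Skip/Take over the stored list.
// A database-backed implementation could satisfy the same interface later on.
public class CachedResultPager<T> : IResultPager<T>
{
    private readonly IList<T> _results;
    public CachedResultPager(IList<T> results) { _results = results; }
    public int TotalCount { get { return _results.Count; } }
    public IEnumerable<T> GetPage(int offset, int count)
    {
        return _results.Skip(offset).Take(count);
    }
}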
Fourth, you need some sort of pre-fetch mechanism (as others suggested). You should launch a thread that will do the full search, and in your main thread just do a quick search with the top X number of items. Hopefully by the time the user tries paging, the second thread will be finished and your full result will now be in the cache. If the result isn't ready, you can just incorporate some simple loading screen logic.
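A rough sketch of the prefetch idea as a single method fragment, assuming SearchQuick and SearchFull wrap your calls to the search web service, ComputeQueryHash is a helper that hashes the normalized query, and _cache is something like the bounded cache above (all of these names are assumptions):

public DataView Search(string query)
{
    string key = ComputeQueryHash(query);   // hypothetical helper

    // Kick off the full search on a worker thread; it lands in the cache when it finishes.
    ThreadPool.QueueUserWorkItem(delegate
    {
        DataView full = SearchFull(query);
        _cache.Add(key, full);
    });

    // Meanwhile, give the user the top results right away.
    return SearchQuick(query, 20);
}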
This should get you some of the way... let me know if you want clarification/more details about any particular part.
I'll leave you with some more tips...
You don't want to be sending the entire result to the client app (if you are using Ajax or something like an iPhone app). Why? Because that is a huge waste. The user likely isn't going to page through all of the results... you've just sent over 2MB of result fields for nothing.
JavaScript is an awesome language, but remember it is still a client-side scripting language... you don't want to slow the user experience down too much by sending massive amounts of data for your Ajax client to handle. Just send the prefetched result to your client, and additional page results as the user pages.
Abstraction, abstraction, abstraction... you want to abstract away the cache, the querying, the paging, the prefetching... as much of it as you can. Why? Well, let's say you want to switch databases, or you want to page directly from the database instead of using a result object in cache... if you do it right, this is much easier to change later on. Also, if you are using web services, many other applications can make use of this logic later on.
Now, I probably suggested an over-engineered solution for what you need :). But, if you can pull this off using all the right techniques, you will learn a ton and have a very good base in case you want to extend functionality or reuse this code.
Let me know if you have questions.

It sounds like the slow part of the search is the full-text searching, not the result retrieval. How about caching the resulting resource record IDs? Also, since it might be true that search queries are often duplicated, store a hash of the search query, the query, and the matching resources. Then you can retrieve the next page of results by ID. Works with AJAX too.
Since it's an intranet and you may control the searched resources, you could even pre-compute a new or updated resource's match to popular queries during idle time.
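A minimal sketch of that idea, caching only the matching record IDs per query (ComputeHash, RunFullTextSearch, LoadResourcesByIds and Resource are all hypothetical placeholders):

using System.Collections.Generic;
using System.Linq;

private static readonly Dictionary<string, List<int>> IdCache = new Dictionary<string, List<int>>();

public IEnumerable<Resource> GetPage(string query, int pageIndex, int pageSize)
{
    string key = ComputeHash(query);              // e.g. a hash of the normalized query text
    List<int> ids;
    if (!IdCache.TryGetValue(key, out ids))
    {
        ids = RunFullTextSearch(query);           // the expensive part, done once per distinct query
        IdCache[key] = ids;
    }
    var pageIds = ids.Skip(pageIndex * pageSize).Take(pageSize).ToList();
    return LoadResourcesByIds(pageIds);           // cheap keyed lookup, fine to repeat per page
}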

I have to admit that I am not terribly familiar with MS Search Server so this may not apply. I have often had situations where an application had to search through hundreds of millions of records for result sets that needed to be sorted, paginated and sub-searched in a SQL Server though. Generally what I do is take a two step approach. First I grab the first "x" results which need to be displayed and send them to the browser for a quick display. Second, on another thread, I finish the full query and move the results to a temp table where they can be stored and retrieved quicker. Any given query may have thousands or tens of thousands of results but in comparison to the hundreds of millions or even billions of total records, this smaller subset can be manipulated very easily from the temp table. It also puts less stress on the other tables as queries happen. If the user needs a second page of records, or needs to sort them, or just wants a subset of the original query, this is all pulled from the temp table.
Logic then needs to be put into place to check for outdated temp tables and remove them. This is simple enough, and I let SQL Server handle that functionality. Finally, logic has to be put into place for when the original query changes (significant parameter changes) so that a new data set can be pulled and placed into a new temp table for further querying. All of this is relatively simple.
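A hedged sketch of that second pass (the table, column and parameter names are made up, and System.Data.SqlClient is assumed): the full, ordered result set is snapshotted into a per-search results table so later paging, sorting and sub-searching run against a few thousand rows instead of the full tables:

using (var conn = new SqlConnection(connectionString))
using (var cmd = conn.CreateCommand())
{
    cmd.CommandText = @"
        INSERT INTO SearchResultCache (SearchId, RowNum, RecordId)
        SELECT @searchId,
               ROW_NUMBER() OVER (ORDER BY r.ModifiedOn DESC),
               r.RecordId
        FROM   Records r
        WHERE  r.Body LIKE '%' + @terms + '%';";
    cmd.Parameters.AddWithValue("@searchId", searchId);
    cmd.Parameters.AddWithValue("@terms", terms);
    conn.Open();
    cmd.ExecuteNonQuery();
}

// Later pages, sorts and sub-searches then read from the small table, e.g.:
// SELECT RecordId FROM SearchResultCache
// WHERE SearchId = @searchId AND RowNum BETWEEN @first AND @last;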
Users are so used to split second return times from places like google and this model gives me enough flexibility to actually achieve that without needing the specialized software and hardware that they use.
Hope this helps a little.

Tim's answer is a great way to handle things if you have the ability to run the initial query in a second thread and the logic (paging / sorting / filtering) to be applied to the results requires action on the server ..... otherwise ....
If you can use AJAX, a 500 row result set could be called into the page and paged or sorted on the client. This can lead to some really interesting features .... check out the datagrid solutions from jQueryUI and Dojo for inspiration!
And for really intensive features like arbitrary regex filters and drag-and-drop column re-ordering you can totally free the server.
Loading the data to the browser all at once also lets you call in supporting data (page previews etc) as the user "requests" them ....
The main issue is limiting the data you return per result to what you'll actually use for your sorts and filters.
The possibilities are endless :)


server-side caching strategy for user filtered results

I have a situation at www.zipstory.com (beta) where I have n-permutations of feeds coming from the same database. For example, someone can get a feed for whatever city they are interested in, and combine as many of these cities as they like, all sorted by most recent or most votes.
How should I cache things for each user without completely maxing out the available memory when there are thousands of users at the same time?
My only guess is: don't. I could come up with a client-side caching strategy where I sort out the city results on the client, but that way I could still cache in a one-size-fits-all strategy, by city.
What approaches do you suggest? I'm on unfamiliar ground at this point and could use a good strategy. I noticed this website does not do that, but Facebook does. They must be pulling from a pool of cached user feeds and plucking them in client-side. Not sure; again, I'm not smart enough to figure this out just yet.
In other words...
Each city has its own feed. Each user has an n-permutation of city feeds combined.
I would like to see possible solutions to this problem using C# and ASP.NET.
Adding to this February 28th, 2013. Here's what I did based on your comments, so THANKS!...
For every user logging in, I cache their preferred city list
The top 10 post results are cached per city and stored in a LINQ-based object
When a user comes in and has x cities as feeds, I loop through their city list and check whether each city's postings are in the cache; if not, I get them from the DB and populate the individual posting's HTML into the cache along with other sorting elements
I recombine the list of cities into one feed for the user, and since I have some sorting elements on the LINQ object, I can re-sort them into the proper order and give the feed back to the user
This does mean there is some CPU work every time regardless, as I have to combine the city lists into a single list, but it avoids going to the database every time and everyone benefits from faster page response times. The main drawback is that since I'm no longer doing a single UNION query across cities, an uncached city costs one query each; but because each city is checked against the cache individually, 10 queries for 10 cities would only happen if the site is a dead zone.
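A minimal sketch of that flow using the built-in ASP.NET cache (Posting, LoadTopPostingsFromDb, the cache key format and the 5-minute expiration are all illustrative assumptions):

using System;
using System.Collections.Generic;
using System.Linq;
using System.Web;
using System.Web.Caching;

public List<Posting> BuildUserFeed(IEnumerable<string> cityList)
{
    var combined = new List<Posting>();
    foreach (string city in cityList)
    {
        string key = "feed:" + city;
        var postings = HttpRuntime.Cache[key] as List<Posting>;
        if (postings == null)
        {
            postings = LoadTopPostingsFromDb(city, 10);              // one query per uncached city
            HttpRuntime.Cache.Insert(key, postings, null,
                DateTime.UtcNow.AddMinutes(5), Cache.NoSlidingExpiration);
        }
        combined.AddRange(postings);
    }
    // CPU-only work: merge and re-sort the per-city lists for this particular user.
    return combined.OrderByDescending(p => p.PostedOn).ToList();
}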
Judge the situation based on critical chain points.
If memory is not a problem, consider having the whole feed cached and retrieving items from there. For this you can use distributed cache solutions; some of them are even free. Start with memcached, http://memcached.org/ . People refer to this approach as Load Ahead.
Sometimes memory is a problem if you want to use the ASP.NET cache with expiration and priorities: the cache can evict entries at any point when memory becomes scarce. In that case you load the data again on demand (often called Load Through), which costs bandwidth, so your code has to be smarter to cope. If this is an option for you, try to cache as little as possible, e.g. cache the loaded items individually, and when a user requests a feed, check whether all of its items are present in the cache; if not, you will have to fetch either all of them or just the missing ones again. I have done something similar in the past, but cannot provide the code. The key point is: cache entities, and then cache feeds as lists of references (IDs) to those entities. When a particular feed is requested, you check that all of its references are still valid in the cache. BTW, ASP.NET provides cache dependencies for such scenarios, so read about that too; it may be helpful.
In any case, keep the Decorator design pattern in mind when implementing the data access layer. It lets you (1) postpone the caching concerns to a later development phase, and (2) switch between the two approaches described above depending on how things go. I would start with the simpler (and cheaper) built-in solution, and then switch to a distributed cache solution when it is really needed.
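A minimal sketch of that Decorator idea (IFeedRepository, ICache and Posting are assumed abstractions, not existing APIs): the caching concern wraps a plain repository, so it can be added later or swapped for a distributed cache without touching callers:

using System.Collections.Generic;

public interface IFeedRepository
{
    IList<Posting> GetCityFeed(string city);
}

public class CachingFeedRepository : IFeedRepository
{
    private readonly IFeedRepository _inner;   // the real, database-backed repository
    private readonly ICache _cache;            // ASP.NET cache, memcached wrapper, etc.

    public CachingFeedRepository(IFeedRepository inner, ICache cache)
    {
        _inner = inner;
        _cache = cache;
    }

    public IList<Posting> GetCityFeed(string city)
    {
        var cached = _cache.Get("feed:" + city) as IList<Posting>;
        if (cached != null) return cached;

        var feed = _inner.GetCityFeed(city);   // fall through to the decorated repository
        _cache.Put("feed:" + city, feed);
        return feed;
    }
}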
Only cache the minimal amount of different information per user that you need.
For example, if it fits in memory, cache the complete set of feeds and only store, per user, the IDs of the feeds they are interested in.
When they request their feeds, just get those out of memory.
Have you considered caching generic feeds and tagging them? Then, per user, you just store a reference to that tag/keyword.
Another possibility could be to store the generic feed and then filter on the client. This will increase your bandwidth, but save cost on the cache.
And if you are on HTML5, use Local Storage to hold user preferences.

LinqToSQL Design Query / Worry

I wonder if somebody could point me in the right direction. I've recently started playing with LinqToSQL and love the strongly typed data objects etc.
I'm just struggling to understand the impact on database performance etc. For example, say I was developing a simple user profile page. The page shows basic information about the user, some information on their recent activity, and a list of unread notifications.
If I was developing a stored procedure for this page, I could create a single SP which returns multiple datatables covering all of the required information - resulting in a single db call.
However, using LinqToSQL, this could result in many calls: one for user info, at least one for activity, at least one for notifications, and if I then want further info on the notifications this may result in further calls still - multiple DB calls.
Should I be worried about the number of DB calls happening as a result of using this design pattern? I.e., are the multiple DB handshakes etc. going to degrade my DB?
I'd appreciate your thoughts on this!
Thanks
David
LINQ to SQL can consume multiple results from a stored proc if you need to go that route. Unfortunately the designer has problems mapping them correctly, so you will probably need to create your mapping manually. See http://www.thinqlinq.com/Default/Using-LINQ-to-SQL-to-return-Multiple-Results.aspx.
You can configure LINQ to SQL to eagerly load the child records if you know that you're going to need them for every parent record. Use the DataLoadOptions and .LoadWith to configure it.
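For example, assuming a User entity with Activities and Notifications child tables (these names are placeholders for your actual model), something like this eagerly loads both when a User is materialized:

using System.Data.Linq;

var options = new DataLoadOptions();
options.LoadWith<User>(u => u.Activities);
options.LoadWith<User>(u => u.Notifications);

using (var db = new MyDataContext(connectionString))
{
    db.LoadOptions = options;   // must be assigned before the first query is executed
    var user = db.Users.Single(u => u.UserId == userId);
}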
You can also project an object graph with multiple child collections in the Select clause of a LINQ query to reduce the number of DB hits that you make.
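A rough sketch of that projection approach, using the same hypothetical model as above; whether this collapses into fewer round trips depends on the query shape, so check the generated SQL:

var profile = (from u in db.Users
               where u.UserId == userId
               select new
               {
                   u.Name,
                   RecentActivity = u.Activities
                                     .OrderByDescending(a => a.CreatedOn)
                                     .Take(10),
                   UnreadNotifications = u.Notifications.Where(n => !n.IsRead)
               }).Single();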
Ultimately, you need to check a number of options to determine which route is the best performance for your situation. It's not a one size fits all scenario.
Is it worse from a performance standpoint? Yes, it probably is; multiple round trips are usually worse than a single one.
The real question is, do you mind? Is your application going to receive enough visits to warrant the added complexity of a stored procedure? Or do you value the simplicity of future modifications over raw performance?
In any case, if you need the performance, you can create a stored procedure and map it on your context. This will give you a single call, but still return the data as objects.
Here is an article explaining a bit about that option:
linq-to-sql-returning-multiple-result-sets

database, requests, performance, cache

I need some input on how to design a database layer.
In my application I have a List of T. T contains information from multiple database tables.
There are of course multiple ways to do this.
Two ways that I can think of are:
Chatty database layer, but cacheable:
List<SomeX> list = new List<SomeX>();
foreach (DataRow dataRow in table.Rows)   // table: the DataTable returned by the initial query
{
    list.Add(new SomeX()
    {
        prop1 = (int)dataRow["someId1"],
        // per-row lookup: may be served from the cache or fall back to a database call
        prop2 = GetSomeValueFromCacheOrDb((int)dataRow["someId2"])
    });
}
The problem that I see with the above is that if we want a list of 500 items, it could potentially make 500 database requests. With all the network latency and that.
Another problem is that the users could have been deleted after we got the list from the database but before we try to get them from the cache/DB, which means that we will have null problems, which we have to handle manually.
The good thing is that it's highly cacheable.
Non-chatty, but not cacheable:
List<SomeX> list = new List<SomeX>();
foreach (DataRow dataRow in table.Rows)   // a single joined query already returns everything we need
{
    list.Add(new SomeX()
    {
        prop1 = (int)dataRow["someId1"],
        prop2 = (string)dataRow["someValue"]   // comes straight from the join, no extra lookup
    });
}
The problem that I see with the above is that it's hard to cache, since potentially all users have unique lists. The other problem is that it requires a lot of joins, which could result in a lot of reads against the database.
The good thing is that we know for sure that all information exists after the query is run (inner join etc)
Not so chatty, but still cacheable:
A third option could be to first loop through the data rows and collect all the necessary someId2 values, and then make one more database request to fetch all of those values at once.
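As a rough sketch of that third option (GetSomeValuesFromDb and the column types are assumptions), it keeps the per-item cacheability while paying for only one extra round trip:

// First pass: collect the ids we still need to resolve.
var ids = new List<int>();
foreach (DataRow dataRow in table.Rows)
    ids.Add((int)dataRow["someId2"]);

// One extra round trip, e.g. SELECT Id, Value FROM SomeTable WHERE Id IN (...),
// possibly merged with whatever is already in the cache.
Dictionary<int, string> valuesById = GetSomeValuesFromDb(ids);

// Second pass: build the list purely in memory.
var list = new List<SomeX>();
foreach (DataRow dataRow in table.Rows)
{
    list.Add(new SomeX()
    {
        prop1 = (int)dataRow["someId1"],
        prop2 = valuesById[(int)dataRow["someId2"]]
    });
}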
"The problem that I see with the above is that if we want a list of 500 items, it could potentially make 500 database requests. With all the network latency and that."
True. Could also create unnecessary contention and consume server resources maintaining locks as you iterate over a query.
"Another problem is that the users could have been deleted after we got the list from the database but before we are trying to get it from cache/db, which means that we will have null-problems."
If I take that quote, then this quote:
"The good thing is that it's highly cacheable."
Is not true, because you've cached stale data. So strike off the only advantage so far.
But to directly answer your question, the most efficient design, which seems to be what you are asking, is to use the database for what it is good for, enforcing ACID compliance and various constraints, most notably pk's and fk's, but also for returning aggregated answers to cut down on round trips and wasted cycles on the app side.
This means you either put SQL into your app code, which has been ruled to be Infinite Bad Taste by the Code Thought Police, or go to sprocs. Either one works. Putting the code into the App makes it more maintainable, but you'll never be invited to any more elegant OOP parties.
Some suggestions:
SQL is a set-based language, so don't design things around iterating over loops. Even with stored procedures, I still see cursors now and then where a set-based query would solve the issue. So, always try to get the information with one query. Sometimes this isn't possible, but in the majority of cases it will be. You can also design views to make your querying easier if you have a schema with many tables, so the information that is needed can be pulled with one statement.
Use proxies. Let's say I have an object with 50 properties. At first you display a list of objects to the user. In this case, I would create a proxy of the most important properties and display that, maybe two or three important ones like name, ID, etc. This cuts down on the amount of information sent initially. When the user actually wants to edit or change the object, then make a second query to get the "full" object. Only get what you need. This is especially important over the web when serializing XML between the layers.
Come up with a paging strategy. Most systems work fine until they get a lot of data, and then the query grinds to a halt because it is returning thousands of rows/records. Page early and often. If you are building a web application, paging directly in the database will probably be the most performant, because only the paged data is sent between the layers (see the sketch after these suggestions).
Data caching depends on the data. For highly volatile data (changing all the time) caching isn't worth it. But for semi-volatile or non-volatile data, caching can be worth it, though you have to manage the cache either directly or indirectly if you are using a built-in framework.
A good place to use a cache is, say, a zip codes table. Certainly, those don't change that often, and you could cache them to boost performance if you had a zip code drop-down in your application. This is just an example, but caching IMO depends on the type of data.
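As promised above, a hedged sketch of paging directly in the database (SQL Server 2005+ syntax; the Orders table, its columns and the cmd SqlCommand variable are made up), so that only the requested page ever crosses the wire:

cmd.CommandText = @"
    WITH Numbered AS (
        SELECT o.OrderId, o.CustomerName, o.PlacedOn,
               ROW_NUMBER() OVER (ORDER BY o.PlacedOn DESC) AS RowNum
        FROM   Orders o
    )
    SELECT OrderId, CustomerName, PlacedOn
    FROM   Numbered
    WHERE  RowNum BETWEEN @first AND @last;";
cmd.Parameters.AddWithValue("@first", pageIndex * pageSize + 1);
cmd.Parameters.AddWithValue("@last", (pageIndex + 1) * pageSize);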

How can I cache LINQ to SQL results safely?

I have an ASP.Net MVC web app that includes a set of Forums. In order to maintain flexible security, I have chosen an access-control-list style of security.
However, this is getting to be a pretty heavy chunk of data to retrieve every time somebody views the forum index.
I am using the EnterpriseLibrary.Caching functionality to cache various non-LINQ items on the site. (The BBCode interpreter, the Skins, and etc.)
My question is this:
What is the safest and most elegant way to cache a LINQ result?
Essentially, I would like to keep a copy of the ACL for each forum in memory to prevent the database hit. That way, for each person that hits the site, at most I would have to fetch group membership information.
All in all, I'm really looking for a way to cache large amounts of LINQ data effectively, not just these specific rows.
If you've already got a caching system for general objects, all you should need is this:
var whatever = linkQuery.ToList();
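The key point is that ToList() runs the query immediately, while the DataContext is still alive, so what ends up in the cache is a plain List<T> rather than a lazy IQueryable that would try to hit a disposed context later. A minimal sketch, assuming a ForumAcl entity, a ForumDataContext, and a _cache wrapper with Add/GetData methods in the spirit of the EnterpriseLibrary caching block (all of these names are illustrative):

public List<ForumAcl> GetForumAcl(int forumId)
{
    string key = "forum-acl:" + forumId;
    var cached = _cache.GetData(key) as List<ForumAcl>;
    if (cached != null) return cached;

    using (var db = new ForumDataContext())
    {
        List<ForumAcl> acl = db.ForumAcls
                               .Where(a => a.ForumId == forumId)
                               .ToList();      // executes now, while the context is still open
        _cache.Add(key, acl);
        return acl;
    }
}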

Joining results from separate APIs together

I have 2 separate systems - a document management system and a sharepoint search server.
Both systems have an API that I can use to search the data within them. The same data may exist in both systems, so we must search both.
What is the most efficient way (speed is very important) to search both APIs at the same time and merge the results together?
Is the following idea bad/good/slow/fast:
user enters search terms
the API for each system is called on its own thread
the results from each API are placed in a common IEnumerable class of the same type
when both threads have completed, LINQ is used to join the two IEnumerable result objects together
results are passed to view
The application is ASP.NET MVC C#.
Your solution looks alright - you're using the adapter pattern to convert the two different result feeds into your required format, and the overall design is a facade pattern. From a design point of view, your solution is valid.
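A minimal sketch of that flow (SearchDms, SearchSharePoint, SearchHit, DocumentKey and Score are assumptions standing in for your two API adapters and common result type):

List<SearchHit> dmsHits = null;
List<SearchHit> spHits = null;

// Query both systems in parallel, one thread each.
var dmsThread = new Thread(() => dmsHits = SearchDms(terms));
var spThread = new Thread(() => spHits = SearchSharePoint(terms));
dmsThread.Start();
spThread.Start();
dmsThread.Join();
spThread.Join();

// Merge, de-duplicate documents that exist in both systems, and order for the view.
var merged = dmsHits.Concat(spHits)
                    .GroupBy(h => h.DocumentKey)
                    .Select(g => g.OrderByDescending(h => h.Score).First())
                    .OrderByDescending(h => h.Score)
                    .ToList();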
If you wanted to make things even better, you could display the results as soon as they arrive and display a notification that results are still loading until all APIs have returned a value. If your document management system was significantly faster or slower than sharepoint, it would give the user information faster that way.
I don't see anything wrong with the way you are doing it. Your algorithm could take forever to produce the perfect results, but you need to strike a balance. Any optimisation would have to be done in the search algorithm (or rather document indexing). You would still have to compromise on how many hits are good enough for the user by limiting the duration of your thread execution.
