I'm basically looking for a 'google type' search of my database.
I'm currently creating an application which stores books (and authors), games, movies (and more in the future). The application, obviously, also needs to be able to quickly search the database for any of these items.
Of course simply splitting up the games, books and movie searches is no problem, though I would really find it awesome if I had 1 search field for everything, mainly because I sometimes confuse books with movies xD
Now at first I thought this would be a nice way to go about it (simply only searching for books):
List<Book> books = (from b in le.Book
where (b.Title + " " + b.Author.FirstName + " " +
b.Author.Surname).Contains(search)
select b).OrderBy(b => b.Title).ToList();
This is easy, and works fine with a small database and when you type the search in the right order.
So using this, a search would look like:
The fault in our stars John Green
but if someone were to type:
John Green The fault in our stars
The fault in our stars - John Green
or whatever variation you can come up with, it would fail.
I did find quite a nice example for a SQL query here: MYSQL search fields method, but it's in SQL and I do not know how to rewrite it in LINQ. The database will (eventually) contain thousands of records, so I can't just do:
var total = (from b in le.Book
             select new { b.ID, FullDescription = (b.Title + " " +
                 b.Author.FirstName + " " + b.Author.Surname) });

string[] searchArr = search.Split(' ');
List<int> ids = new List<int>();
foreach (string s in searchArr)
{
    ids.AddRange((from t in total
                  where t.FullDescription.Contains(s)
                  select t.ID).ToList());
}
The foreach loop would slow it down too much (I know there must be a better way to create a variable number of where statements but I don't know how to do that either).
But yeah, the var total would become huge.
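As an illustration of what a variable number of where clauses could look like (a sketch only, assuming le.Book is an Entity Framework IQueryable as in the snippet above), you can chain one Where call per search term so the provider composes them into a single query instead of looping in memory:

// Require every search term to appear somewhere in the combined
// title + author string; each Where is translated into SQL by the provider.
string[] terms = search.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);

IQueryable<Book> query = le.Book;
foreach (string term in terms)
{
    string t = term; // copy for the closure (older C# versions capture the loop variable)
    query = query.Where(b =>
        (b.Title + " " + b.Author.FirstName + " " + b.Author.Surname).Contains(t));
}

List<Book> books = query.OrderBy(b => b.Title).ToList();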
Then of course there is the part of making it a live search, so it updates the list view every time a character is typed. If I type "jo" I get a list with results, which I can then narrow down by typing "joh". But would it be better to query the list of results I got from the last query, or to re-query the whole database?
Also I need to take Backspace into account: if someone typed "jo" but wanted "ja", I need to re-query the entire database anyway, right?
So what is the best practice for doing this? I've found quite a few examples like the one mentioned, but I'm looking for the fastest and most "user-proof" approach (meaning no matter how strange the search, it still needs to come up with the right result).
My database model (only containing books, authors)
P.S. I am not the best database designer, so if you find something you would do differently, let me know (still got a lot to learn).
You are asking an incredibly deep question, and I don't think there is a "right" answer but I do think there are "good" and "bad" approaches given your requirements and assumptions.
Fundamentally you are trying to accomplish the following:
Given a particular query string, you want to determine an ordering on your data row R
This ordering should be deterministic
This ordering should be easy to calculate
This ordering should reflect similarity or relevance between your search string and the members of R
You must first accept that unless we define the problem better, this is more of an art than a science. "Relevance" here is not well-defined. However, we can make some common-sense assumptions about what might be relevant. For instance, we might say that a relevant result has these qualities:
The search string is included in the members of R
More members of R with the search string indicates a more relevant result
Certain members of R are more important than others
We should allow for typos/mistakes - i.e., a partial match is worth something
Then we can determine a "score" for a row R as follows:
Each member of R gets a "weight" with a minimum value of 1 and no maximum value
The score for R is equal to the sum of the weight of each member divided by the "distance" between the member and the query string
The distance is defined in terms of a well-known string distance metric like Levenshtein or Soundex
For instance, if your R has members Name, Description, and URL, you might weight these 100, 10, and 1, respectively, and apply the Levenshtein metric.
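As a rough sketch of that scoring idea (the member names and the LevenshteinDistance helper below are assumptions for illustration, not anything from the question):

// Hypothetical scoring sketch: each member contributes its weight divided by
// (1 + edit distance), so exact matches contribute fully and distant matches very little.
static double Score(string query, string name, string description, string url)
{
    var members = new[]
    {
        new { Text = name ?? "",        Weight = 100.0 },
        new { Text = description ?? "", Weight = 10.0  },
        new { Text = url ?? "",         Weight = 1.0   },
    };

    // LevenshteinDistance(a, b) is assumed to exist; any string distance metric would do.
    return members.Sum(m => m.Weight / (1 + LevenshteinDistance(query, m.Text)));
}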
This is not even close to the tip of the iceberg; on its own, such a naive algorithm would be so poor as to be useless. Better approaches include cross-referencing members of your data row, looking up members against known dictionaries, and developing an evidence-based model for scoring results.
But scoring is a valid way to reduce the problem into an easier-to-state one.
Related
I am implementing a "deep search" feature within my MVC5 application. It queries many EF entities for keyword (string) matches.
It works great (thanks Linq!) except I need to perform some advanced string manipulation now.
The issue is that my "item.Description" field can be rather large sometimes. I just need to grab the first sentence from the description that contains the matching keyword.
I need to limit the result of that to 250 characters, so the UI doesn't look crazy. My controller returns an IQueryable object, which is just a class with 4 string properties.
Here are some examples of what the Item Description may look like:
"Represent clients in criminal and civil litigation and other legal proceedings, draw up legal documents, or manage or advise clients on legal transactions. May specialize in a single area or may practice broadly in many areas of law."
"Teach courses in criminal justice, corrections, and law enforcement administration. Includes both teachers primarily engaged in teaching and those who do a combination of teaching and research."
"Teach courses in law. Includes both teachers primarily engaged in teaching and those who do a combination of teaching and research."
Here is my LINQ code I am using to filter down my result set:
var keyword = "law";
var itemsByTitle = db.Items
    .Where(item => item.Description.Contains(keyword))
    .Select(item => new GlobalSearchResult
    {
        Id = item.Id,
        Match = item.Description,
        Title = item.Title,
        Type = "item"
    });
I need to expand on item.Description to do this advanced string/sentence manipulation.
The ideal results would/should be:
"May specialize in a single area or may practice broadly in many areas of law."
"Teach courses in criminal justice, corrections, and law enforcement administration."
"Teach courses in law."
I saw this example but it is for Linq to Objects, and I believe I am using Linq to Entities? The syntax is a bit different, although related.
https://msdn.microsoft.com/en-us/library/mt693048.aspx
Is this advanced string/sentence manipulation possible in a single LINQ Entity query? If so, how can I get started? I am unsure if I should use regex here? Or a 2nd object to intersect on?
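For what it's worth, here is a rough in-memory sketch of the sentence extraction being described (switching to Linq to Objects after the database filter, so it is not a single Linq to Entities query; the split-on-periods rule is an assumption):

var keyword = "law";

var itemsByTitle = db.Items
    .Where(item => item.Description.Contains(keyword))   // filter in the database
    .AsEnumerable()                                       // continue in memory (Linq to Objects)
    .Select(item =>
    {
        // take the first sentence that contains the keyword, falling back to the full description
        var sentence = item.Description
            .Split('.')
            .Select(s => s.Trim())
            .FirstOrDefault(s => s.IndexOf(keyword, StringComparison.OrdinalIgnoreCase) >= 0)
            ?? item.Description;

        // cap the snippet at 250 characters for the UI
        if (sentence.Length > 250)
            sentence = sentence.Substring(0, 250);

        return new GlobalSearchResult { Id = item.Id, Match = sentence, Title = item.Title, Type = "item" };
    });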
Let me try to explain this.
I have two lists:
1. a list of employee objects
2. a list of department objects (each of which has a list of the employees who can work in that department)
I want to be able to add an employee to a department's list of employees,
but I am getting a null reference error.
int empsize = allemployees.Count;
int Maxdepartment = 0;
foreach (employee employeeitem in allemployees)
{
    Maxdepartment = employeeitem.alloweddepartments.Count;
    for (int i = 0; i < Maxdepartment; i++)
    {
        // FindIndex returns -1 when no department name matches
        int index = alldepartments.FindIndex(x => x.Name == employeeitem.alloweddepartments[i].ToString());
        // throws if index is -1 or if earlyshift was never initialized
        alldepartments[index].earlyshift.Add(employeeitem);
    }
}
This looks like a very complex problem to me. It's for sure an optimization problem with many constraints, so I hope you are good at math ;-).
I would suggest having a look at the Simplex Algorithm, which will work very well for your problem if you have the mathematical know-how to use it. There are some variations of the simplex method too which may also work well.
There is another way too, where you just use the power of your computer to solve the problem. You could write a function which rates a solution and gives you some kind of benchmark score.
For instance, you can rate the difference between the hours provided and the hours needed: every hour short of what is needed counts as -2, every hour over what is needed counts as -1. That way you can get a score for an employee assignment.
With this function you can start to randomly assign employees to departments (of course according to the min/max employees for each department) and then rate each solution using your function, so you can find a solution with a good score (if your defined function works well).
Most random assignments will be stupid of course, but your computer generates millions of solutions in seconds, so chances are good that it will generate a good one after some time (I don't think time is a big criterion here, because once you have a solution it won't change very often).
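A minimal sketch of that random-search idea (the types, helpers, and exact scoring rule below are assumptions based on the description, not working scheduling code):

// Hypothetical: rate a solution by comparing the hours a department gets with the
// hours it needs: -2 per missing hour, -1 per surplus hour (higher score = better).
static int Rate(Dictionary<string, int> providedHours, Dictionary<string, int> neededHours)
{
    int score = 0;
    foreach (var dept in neededHours)
    {
        int provided;
        if (!providedHours.TryGetValue(dept.Key, out provided))
            provided = 0;
        int diff = provided - dept.Value;
        score += diff < 0 ? 2 * diff : -diff;   // shortfall counts double
    }
    return score;
}

// Generate random assignments and keep the best one found.
var random = new Random();
int bestScore = int.MinValue;
Dictionary<string, string> bestAssignment = null;            // employee -> department
for (int attempt = 0; attempt < 1000000; attempt++)
{
    var assignment = RandomAssignment(random);               // assumed helper respecting min/max per department
    int score = Rate(HoursProvidedBy(assignment), needed);   // assumed helpers
    if (score > bestScore)
    {
        bestScore = score;
        bestAssignment = assignment;
    }
}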
I have a process I've inherited that I'm converting to C# from another language. Numerous steps in the process loop through what can be a lot of records (100K-200K) to do calculations. As part of those processes it generally does a lookup into another list to retrieve some values. I would normally move this kind of thing into a SQL statement (and we have where we've been able to) but in these cases there isn't really an easy way to do that. In some places we've attempted to convert the code to a stored procedure and decided it wasn't working nearly as well as we had hoped.
Effectively, the code does this:
var match = cost.Where(r => r.ryp.StartsWith(record.form.TrimEnd()) &&
r.year == record.year &&
r.period == record.period).FirstOrDefault();
cost is a local List type. If I was doing a search on only one field I'd probably just move this into a Dictionary. The records aren't always unique either.
Obviously, this is REALLY slow.
I ran across the open source library I4O, which can build indexes; however, it fails for me in various queries (and I don't really have the time to attempt to debug the source code). It also doesn't work with .StartsWith or .Contains (StartsWith is much more important, since a lot of the original queries take advantage of the fact that a search for "A" would find a match in "ABC").
Are there any other projects (open source or commercial) that do this sort of thing?
EDIT:
I did some searching based on the feedback and found Power Collections which supports dictionaries that have keys that aren't unique.
I tested ToLookup() which worked great - it's still not quite as fast as the original code, but it's at least acceptable. It's down from 45 seconds to 3-4 seconds. I'll take a look at the Trie structure for the other look ups.
Thanks.
Looping through a list of 100K-200K items doesn't take very long. Finding matching items within the list by using nested loops (n^2) does take long. I infer this is what you're doing (since you have assignment to a local match variable).
If you want to quickly match items together, use .ToLookup.
var lookup = cost.ToLookup(r => new {r.year, r.period, form = r.ryp});
foreach(var group in lookup)
{
// do something with items in group.
}
Your StartsWith criterion is troublesome for key-based matching. One way to approach that problem is to ignore it when generating keys.
var lookup = cost.ToLookup(r => new {r.year, r.period });
var key = new {record.year, record.period};
string lookForThis = record.form.TrimEnd();
var match = lookup[key].FirstOrDefault(r => r.ryp.StartsWith(lookForThis));
Ideally, you would create the lookup once and reuse it for many queries. Even if you didn't... even if you created the lookup each time, it will still be faster than n^2.
Certainly you can do better than this. Let's start by considering that dictionaries are not useful only when you want to query one field; you can very easily have a dictionary where the key is an immutable value that aggregates many fields. So for this particular query, an immediate improvement would be to create a key type:
// should be immutable, GetHashCode and Equals should be implemented, etc etc
struct Key
{
public int year;
public int period;
}
and then package your data into an IDictionary<Key, ICollection<T>> or similar where T is the type of your current list. This way you can cut down heavily on the number of rows considered in each iteration.
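A rough sketch of that packaging step (the element type CostRecord and the member names are borrowed from the question's example and otherwise assumed):

// Build the (year, period) buckets once, up front.
IDictionary<Key, List<CostRecord>> byKey = cost
    .GroupBy(r => new Key { year = r.year, period = r.period })
    .ToDictionary(g => g.Key, g => g.ToList());

// Per record, only the matching bucket is scanned for the prefix.
string prefix = record.form.TrimEnd();   // hoisted out of the hot loop
List<CostRecord> bucket;
CostRecord match = byKey.TryGetValue(new Key { year = record.year, period = record.period }, out bucket)
    ? bucket.FirstOrDefault(r => r.ryp.StartsWith(prefix))
    : null;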
The next step would be to use not an ICollection<T> as the value type but a trie (this looks promising), which is a data structure tailored to finding strings that have a specified prefix.
Finally, a free micro-optimization would be to take the TrimEnd out of the loop.
Now certainly all of this only applies to the specific example given and may need to be revisited due to other specifics of your situation, but in any case you should be able to extract practical gain from this or something similar.
Is there a ready-made data structure in .NET 3.5 to do the following?
store values sorted by a decimal key, duplicates allowed
get the next value (via an enumerator) closest to a given key, to the left and to the right
An example:
a car dealer has cars; a client asks to find the most expensive car that is cheaper than $1000
You are looking for a binary tree allowing duplicate keys (aka multi-set). There is no such thing in the .NET library, but they are easy to implement (and freely available, eg here or here).
See also Are there any implementations of multiset for .Net?
You can just use a List to store the car prices and sort it every time you add a new price.
List<int> PriceList= new List<int>();
PriceList.Sort();
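For example, finding the most expensive price below a limit on that sorted list could look like this (a sketch using List<T>.BinarySearch):

List<int> priceList = new List<int> { 500, 750, 999, 1200, 3000 };
priceList.Sort();

int limit = 1000;
// BinarySearch returns the index of the value, or the bitwise complement of the
// index of the next larger element when the value is not present.
int index = priceList.BinarySearch(limit);
if (index < 0)
    index = ~index;                              // first price >= limit
while (index > 0 && priceList[index - 1] >= limit)
    index--;                                     // skip any prices equal to the limit
int? bestPrice = index > 0 ? (int?)priceList[index - 1] : null;  // 999 here; null if nothing is cheaper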
I recommend using SQLite for such queries.
And your query will be:
from car in cars
where car.Price < 1000
orderby car.Price descending
select car
I know that you're looking for ready-made structures, but one option that might be worth exploring is a van Emde Boas tree. This structure gives you lookup, find-successor, find-predecessor, and delete, all in O(lg lg n) time, which is exponentially faster than a balanced search tree. I'm not familiar with any implementations of this structure in C#, but it's probably the asymptotically optimal way of solving your problem. If you're storing a large number of integers, it can also be very space-efficient.
I have a set of 'codes' Z that are valid in a certain time period.
Since I need them many times in a large loop (a million+ iterations) and every time I have to look up the corresponding code, I cache them in a List<>. After finding the correct codes, I'm inserting a million rows (using SqlBulkCopy).
I look up the id with the following code (l_z is a List<T>):
var z_fk = (from z in l_z
where z.CODE == lookupCode &&
z.VALIDFROM <= lookupDate &&
z.VALIDUNTIL >= lookupDate
select z.id).SingleOrDefault();
In other situations I have used a Dictionary with superb performance, but in those cases I only had to look up the id based on the code.
But now with searching on the combination of fields, I am stuck.
Any ideas? Thanks in advance.
Create a Dictionary that stores a List of items per lookup code - Dictionary<string, List<Code>> (assuming that lookup code is a string and the objects are of type Code).
Then when you need to query based on lookupDate, you can run your query directly off of dict[lookupCode]:
var z_fk = (from z in dict[lookupCode]
where z.VALIDFROM <= lookupDate &&
z.VALIDUNTIL >= lookupDate
select z.id).SingleOrDefault();
Then just make sure that whenever you have a new Code object, that it gets added to the List<Code> collection in the dict corresponding to the lookupCode (and if one doesn't exist, then create it).
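A sketch of building that dictionary from the existing l_z list up front (assuming the cached items are of a type named Code with a CODE property):

// Build the per-code buckets once, before the large loop.
Dictionary<string, List<Code>> dict = l_z
    .GroupBy(z => z.CODE)
    .ToDictionary(g => g.Key, g => g.ToList());

// Later, when a new Code object appears (newCode is a hypothetical newly created item),
// add it to its bucket, creating the bucket if it doesn't exist yet.
List<Code> bucket;
if (!dict.TryGetValue(newCode.CODE, out bucket))
{
    bucket = new List<Code>();
    dict[newCode.CODE] = bucket;
}
bucket.Add(newCode);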
A simple improvement would be to use...
//in initialization somewhere
ILookup<string, T> lookup = l_z.ToLookup(z => z.CODE);
//your repeated code:
var z_fk = (from z in lookup[lookupCode]
where z.VALIDFROM <= lookupDate && z.VALIDUNTIL >= lookupDate
select z.id).SingleOrDefault();
You could further use a more complex, smarter data structure storing dates in sorted fashion and use a binary search to find the id, but this may be sufficient. Further, you speak of SqlBulkCopy - if you're dealing with a database, perhaps you can execute the query on the database, and then simply create the appropriate index including columns CODE, VALIDUNTIL and VALIDFROM.
I generally prefer using a Lookup over a Dictionary containing Lists since it's trivial to construct and has a cleaner API (e.g. when a key is not present).
We don't have enough information to give very prescriptive advice - but there are some general things you should be thinking about.
What types are the time values? Are you comparing DateTimes or some primitive value (like a time_t)? Think about how your data types affect performance. Choose the best ones.
Should you really be doing this in memory, or should you be putting all these rows into SQL and letting it be queried there? It's really good at that.
But let's stick with what you asked about - in memory searching.
When searching is taking too long there is only one solution - search fewer things. You do this by partitioning your data in a way that allows you to easily rule out as many nodes as possible with as few operations as possible.
In your case you have two criteria - a code and a date range. Here are some ideas...
You could partition based on code, i.e. a Dictionary keyed by code whose value is a List of events. If you have many evenly distributed codes, your list sizes will each be about N/M in size (where N = total event count and M = number of codes). So a million nodes with ten codes now requires searching 100k items rather than a million. But you could take that a bit further: the List could itself be sorted by starting time, allowing a binary search to rule out many other nodes very quickly (this of course has a trade-off in the time spent building the collection). This should provide very quick lookups; a minimal sketch of this idea follows the list of options below.
You could partition based on date and just store all the data in a single list sorted by start date and use a binary search to find the start date then march forward to find the code. Is there a benefit to this approach over the dictionary? That depends on the rest of your program. Maybe being an IList is important. I don't know. You need to figure that out.
You could flip the dictionary model and partition the data by start time, rounded to some boundary (depending on the length, granularity and frequency of your events). This is basically bucketing the data into groups that have similar start times. E.g., all the events that were started between 12:00 and 12:01 might be in one bucket, etc. If you have a very small number of events and a lot of highly frequent (but not pathologically so) events, this might give you very good lookup performance.
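As a minimal sketch of the first idea (the Event type, its members, and the variable names are assumptions): partition by code, keep each bucket sorted by start time, and binary search within the bucket:

// Build once: partition events by code and sort each bucket by start time.
Dictionary<string, List<Event>> byCode = events
    .GroupBy(e => e.Code)
    .ToDictionary(g => g.Key, g => g.OrderBy(e => e.Start).ToList());

// Query: events with the given code starting in [from, to].
List<Event> bucket;
if (byCode.TryGetValue(code, out bucket))
{
    var comparer = Comparer<Event>.Create((a, b) => a.Start.CompareTo(b.Start));
    int i = bucket.BinarySearch(new Event { Start = from }, comparer);
    if (i < 0) i = ~i;                            // first event starting at or after 'from'
    while (i > 0 && bucket[i - 1].Start >= from)
        i--;                                      // step back over duplicates with the same start
    for (; i < bucket.Count && bucket[i].Start <= to; i++)
    {
        // bucket[i] is a candidate in the requested range
    }
}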
The point? Think about your data. Consider how expensive it should be to add new data and how expensive it should be to query the data. Think about how your data types affect those characteristics. Make an informed decision based on that data. When in doubt let SQL do it for you.
This to me sounds like a situation where this could all happen on the database via a single statement. Then you can use indexing to keep the query fast and avoid having to push data over the wire to and from your database.