I am new to Lucene, so maybe I have misunderstood something about how it works.
I have indexed a few hundred thousand documents with many string fields. For example, suppose we have 5 string fields (named A, B, C, D, E) and the first 3 (A, B, C) are indexed, leaving the last two (D, E) unindexed, only stored in the document. Values in each field may be duplicated; for example, assume that field A is used to store names, and the name 'Richard' appears many times.
When I run a query, I look for each term in each field. Now suppose, for example, that I get 3K documents matching my query.
Is it possible to get a list of the unique (distinct) values of each field without scanning and grouping the results? I am particularly interested in this because I apply a limit to the documents I actually read, but I would like a complete list of the unique values in each field of the matching documents (even from the documents I don't read).
If this is possible, can I apply the same logic to the unindexed fields (D, E)?
When you run the search, it will return all the documents that match the query conditions. On that result you can apply highlighting (which will slow the process down), and you can use something like pagination to return the results in pages if you want.
The highlighter has many methods you can use (depending on which version of Lucene you are using; I am talking here about the latest version, 4.8.0), like GetBestTextFragments(), which takes a parameter called maxNumFragments. If you set that parameter to 1, it will return only one fragment from that particular field, even if multiple values match the query.
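For illustration, a minimal sketch of that call in Lucene.NET 4.8 (Lucene.Net.Search.Highlight); the searcher, analyzer, query and the field name "A" are assumptions carried over from the question, not code from the original poster:

    using Lucene.Net.Analysis;
    using Lucene.Net.Documents;
    using Lucene.Net.Search;
    using Lucene.Net.Search.Highlight;

    // Assumed context: an open IndexSearcher, the Analyzer used at index time,
    // the parsed Query, and the doc id of one hit; "A" is a stored field.
    static TextFragment[] BestFragmentsForFieldA(
        IndexSearcher searcher, Analyzer analyzer, Query query, int docId)
    {
        Document doc = searcher.Doc(docId);
        string text = doc.Get("A");
        var highlighter = new Highlighter(new QueryScorer(query));
        using (TokenStream stream = analyzer.GetTokenStream("A", text))
        {
            // the last argument is maxNumFragments: with 1, only the single
            // best-matching fragment of this field comes back
            return highlighter.GetBestTextFragments(stream, text, false, 1);
        }
    }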
I am not sure if that answers your question, but I hope it helps. Regarding the unindexed fields, I don't think you can do that (although I have never tried it).
I have a customers table with 2 columns for the name (firstname, lastname) and it contains around 100k records.
I have a scenario where I have to import new customers, but their names come as a single column. Most names are simple (first name and surname), but some have double first names (with a space or hyphen), double surnames (with a space or hyphen), or even both.
Does it make sense to use an ML.NET classification algorithm to split the full name, based on a model trained on the 100k records?
I think it would be unnecessary to use machine learning methods for such a problem. You should try a rule-based method here.
Assuming the data comes in 1 column:
For example: after splitting the text by spaces, is the word count equal to 2? If so, the 1st word is the first name and the 2nd word is the surname.
Example 2: does the text contain a hyphen or not? If it does, what should be done? How do you determine the first name and surname then?
1) What you need to do here is create training, validation and test sets for yourself.
2) Write code implementing the rules you extract from the data in the training set. (Here you need to make clever deductions by examining the data.)
3) Determine the most suitable rules using the validation data.
Finally, you should evaluate your work by getting results on the test set with the rules you found most suitable (a minimal sketch of such rules follows below).
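A minimal C# sketch of what such rules might look like, assuming only the simple cases described in the question; every heuristic is an assumption to be tuned against the 100k records:

    using System;

    // A rule-based splitter sketch; every heuristic below is an assumption
    // to be validated against the 100k existing records.
    static Tuple<string, string> SplitFullName(string fullName)
    {
        string[] parts = fullName.Trim().Split(
            new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);

        // Rule 1: exactly two tokens -> first name + surname. A hyphenated
        // part ("Anna-Maria Smith") already lands here, because the hyphen
        // keeps it as a single token.
        if (parts.Length == 2)
            return Tuple.Create(parts[0], parts[1]);

        // Example rule for three tokens: treat the last two as a double
        // surname ("Maria Garcia Lopez"). Whether that is right for your
        // data is exactly what the validation set should tell you.
        if (parts.Length == 3)
            return Tuple.Create(parts[0], parts[1] + " " + parts[2]);

        return null; // ambiguous: route to a manual-review queue
    }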
This might seem subjective, but I'm looking for answers from those who like to set, or at least be a part of setting, coding standards.
In C#, what type of result should you expect when searching for a single record by a non-primary-key index?
If you run:
select * from tablename where fieldname = @fieldname
As a matter of practice, should you code the logic to expect an IEnumerable list or a single record?
If you really expect only one record, should the SQL use TOP 1, like below?
select TOP 1 * from tablename where fieldname = @fieldname
I think rather than worrying about what you expect, a better way to look at this is to construct your query so that you get what you want. If you are only interested in the zero or one potential matches, then TOP (1) certainly works, although I'd likely add some kind of ORDER BY clause.
However, if you want zero or more, then the first approach is better.
Any time you query based on a non-unique value, you have the possibility of returning more than one record. Sure, today that query only returns one. However, at some point in the future an unforeseen change will occur, and all of a sudden you will get multiple rows back.
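As a hedged C# sketch of the zero-or-one reading (MyRecord, MapRecord and the id ordering column are hypothetical placeholders, not part of the question):

    using Microsoft.Data.SqlClient; // or System.Data.SqlClient

    // "MyRecord" and "MapRecord" are hypothetical stand-ins for your own
    // row type and mapping code; "id" is an assumed ordering column.
    static MyRecord FindByField(SqlConnection conn, string value)
    {
        using (var cmd = new SqlCommand(
            "SELECT TOP (1) * FROM tablename " +
            "WHERE fieldname = @fieldname ORDER BY id", conn))
        {
            cmd.Parameters.AddWithValue("@fieldname", value);
            using (var reader = cmd.ExecuteReader())
            {
                return reader.Read() ? MapRecord(reader) : null; // zero or one
            }
        }
    }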
I am wondering which is the best way to store a list of integers in a SQL column,
e.g. "1,2,3,4,6,7".
EDIT: These values represent IDs in other SQL tables. The row would look like:
id | listOfOtherIDs
1  | "1,2,3,4,6,7"
The choices I have researched so far are:
A varchar of separated values that are "explode-able", i.e. by commas or tabs
An XML containing all the values individually
Using individual rows for each value.
Which method is the best method to use?
A single element of a record can only refer to one value; it's a basic database design principle.
You will have to change the database's design: use a single row for each value.
You might want to read up on normalization.
As is shown here in the description of the first normal form:
First normal form states that at every row and column intersection in the table, there exists a single value, and never a list of values. For example, you cannot have a field named Price in which you place more than one price. If you think of each intersection of rows and columns as a cell, each cell can hold only one value.
While Jeroen's answer is valid for "multi-valued" attributes, there are genuine situations where multiple comma-separated values may actually represent one large value. Things like path data (on a map), integer sequences, lists of prime factors and many more could well be stored in a comma-separated varchar. I think it is better to explain what exactly you are storing and how you need to retrieve and use that value.
EDIT:
Looking at your edit, if by IDs you mean the PK of another table, then this sounds like a genuine M:N relationship between this table and the one whose IDs you're storing. This should really be stored in a separate junction table (sometimes called a gerund or associative entity), which BTW is a table that has the PK of each of the two tables as FKs, thus linking the related rows of both tables. So Jeroen's answer suits your situation very well.
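For illustration, a minimal sketch of that junction-table design as the DDL a C# app might carry in a migration; every table and column name here is made up for the example:

    // Hypothetical names: Item owns the rows that used to carry the
    // "listOfOtherIDs" column; OtherItem holds the rows those IDs point at.
    const string CreateJunctionTable = @"
        CREATE TABLE ItemOtherItem (
            itemId      INT NOT NULL REFERENCES Item(id),
            otherItemId INT NOT NULL REFERENCES OtherItem(id),
            PRIMARY KEY (itemId, otherItemId)  -- one row per link
        );";

    // Reading the list back becomes a join instead of string parsing:
    const string SelectLinkedIds =
        "SELECT otherItemId FROM ItemOtherItem WHERE itemId = @id;";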
I have approximately 10,000 records. Each record has 2 fields: one field is a string up to 300 characters in length and the other is a decimal value. This is like a product catalog with product names and the price of each product.
What I need to do is allow the user to type any word and display all products containing that word together with their prices in a listbox. That's all.
What type of collection is best for this scenario?
If I need to sort based on either product name or price, will the choice still be the same?
Right now I am using an XML file, but I thought using a collection, so that I can embed all the values in the code, would be simpler. Thanks for your suggestions.
A Dictionary will do the job. However, if you are doing rapid partial matches (e.g. search as the user types) you may get better performance by creating multiple keys which point to the same item. For example, the word "Apple" could be located with "Ap", "App", "Appl", and "Apple".
I have used this approach on a similar number of records with very good results. I have turned my 10K source items into about 50K unique keys. Each of these Dictionary entries points to a list containing references to all matches for that term. You can then search this much smaller list more efficiently. Despite the large number of lists this creates, the memory footprint is quite reasonable.
You can also make up your own keys if desired, to redirect common misspellings or point to related items. This also eliminates most of the issues with key uniqueness, because each key points to a list. A single item may be classified under each of the words in its name; this is extremely useful if you have long product names with multiple words in them. When classifying your items, each word in the name can be mapped to one or more keys.
I should also point out that building and classifying 10K items shouldn't take long if done correctly (couple hundred milliseconds is reasonable). The results can be cached for as long as you want using Application, Cache, or static members.
To summarize, the resulting structure is a Dictionary<string, List<T>> where the string is a short (2-6 characters works well) but unique key. Each key points to a List<T> (or other collection, if you are so inclined) of items which match that key. When a search is performed, you locate the key which matches the term provided by the user. Depending on the length of your keys, you may truncate the user's search to your maximum key length. After locating the correct child collection, you then search that collection for a complete or partial match using whatever methodology you wish.
Lastly, you may wish to create a lightweight structure for each item in the list so that you can store additional information about the item. For example, you might create a small Product class which stores the name, price, department, and popularity of the product. This can help you refine the results you show to the user.
All-in-all, you can perform intelligent, detailed, fuzzy searches in real-time.
The aforementioned structures should provide functionality roughly equivalent to a trie.
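A condensed sketch of the structure described above, with a hypothetical Product class standing in for your items and the 2-6 character key range from the answer:

    using System;
    using System.Collections.Generic;
    using System.Linq;

    // "Product" is a hypothetical item type standing in for your records.
    public class Product
    {
        public string Name { get; set; }
        public decimal Price { get; set; }
    }

    public class PrefixIndex
    {
        private const int MaxKeyLength = 6; // "2-6 characters works well"
        private readonly Dictionary<string, List<Product>> _buckets =
            new Dictionary<string, List<Product>>(StringComparer.OrdinalIgnoreCase);

        public void Add(Product product)
        {
            // Classify the item under every word of its name...
            foreach (string word in product.Name.Split(
                new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries))
            {
                // ...and under every prefix of each word: "Ap", "App", "Appl", ...
                for (int len = 2; len <= Math.Min(MaxKeyLength, word.Length); len++)
                {
                    string key = word.Substring(0, len);
                    List<Product> bucket;
                    if (!_buckets.TryGetValue(key, out bucket))
                        _buckets[key] = bucket = new List<Product>();
                    if (!bucket.Contains(product))
                        bucket.Add(product);
                }
            }
        }

        public IEnumerable<Product> Search(string term)
        {
            if (term.Length < 2) return Enumerable.Empty<Product>();

            // Truncate the user's input to the maximum key length, then finish
            // the match inside the much smaller bucket.
            string key = term.Substring(0, Math.Min(MaxKeyLength, term.Length));
            List<Product> bucket;
            if (!_buckets.TryGetValue(key, out bucket))
                return Enumerable.Empty<Product>();
            return bucket.Where(p =>
                p.Name.IndexOf(term, StringComparison.OrdinalIgnoreCase) >= 0);
        }
    }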
10K records is not that much.
A Dictionary<string, decimal> would fit the bill. You can sort by key or by value using LINQ, as well as do searches.
This assumes that product names are unique.
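For example, a minimal sketch with made-up sample data and search term:

    using System;
    using System.Collections.Generic;
    using System.Linq;

    var catalog = new Dictionary<string, decimal>
    {
        ["Apple"] = 0.50m,
        ["Apple Pie"] = 3.25m,
        ["Banana"] = 0.30m,
    };

    string term = "app"; // whatever the user typed

    // all products whose name contains the term, with their prices
    var hits = catalog
        .Where(kv => kv.Key.IndexOf(term, StringComparison.OrdinalIgnoreCase) >= 0)
        .ToList();

    var byName  = hits.OrderBy(kv => kv.Key);   // sort by product name
    var byPrice = hits.OrderBy(kv => kv.Value); // sort by price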
Is it possible to find out why/how a given row returned by an FTS query was matched (or which substring caused the row to match)?
For example, consider the simplest table with id and text columns, with an FTS index on the latter.
SELECT * FROM Example
WHERE CONTAINS(text, 'FORMSOF(INFLECTIONAL, jump)');
This example query could return, say, the row {1, 'Jumping Jack'}.
Now, is it possible to somehow learn that this very row was matched because of the word 'Jumping'? It doesn't even have to be exact information; just which substring caused the row to match.
Why I'm asking: I have a C# app that builds up these queries based on user input (keywords to search for), and I need this very basic information about why/how a row was matched, to use further in the C# code.
If it's not possible, any alternatives?
EDIT in regard to Mike Burton's and LesterDove's replies:
The above example was trivial for obvious reasons, and your solutions are fine with that in mind. However, FTS queries might return results where regex or simple string matching (e.g. LIKE) won't cut it. Consider:
A search for bind returns bound (past form).
A search for extraordinary returns amazing (synonym).
Both are valid matches.
I've been looking for solutions to this problem and found this: NHunspell. However, I already have FTS and valid results using SQL Server; duplicating a similar mechanism (building extra indexes, storing additional word/thesaurus files, etc.) doesn't look good.
LesterDove's answer, however, gave me some ideas: perhaps I could indeed split the original string into a temporary table and run the original FTS query on the split result. While this might work in my case (where the DB is fairly small and the queries are not very complex), in the general case this approach might be out of the question.
1/ Use a SPLIT function (many variations can be Googled) on your original substring, which will dump the individual substrings into a temp table of some sort, with one row per substring snippet.
2/ EDIT: You need to use CROSS APPLY to join to a table-valued function:
SELECT * FROM Example E CROSS APPLY dbo.Split(E.text, ' ') AS S
WHERE CONTAINS(E.text, 'FORMSOF(INFLECTIONAL, jump)') AND S.String LIKE '%jump%';
NOTE: You need to forage for your own user-defined Split function. I used this one and applied the first commenter's edit to allow for the space character as a delimiter.
So, E is your Example table. You're still FT searching on the text field for the word 'jump'. And now you're "joining" to a table comprised of the individual substring values of your text field. Finally, you're matching that against the word 'jump' by using LIKE or Instr.
One simple post-processing method would be to generate an equivalent regular expression for each WHERE clause term and use it to discover, after the fact, how the found data matches the specified pattern.
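A quick C# sketch of that idea, assuming you already know the expansion terms for the search word (the hard-coded list below is an assumption; the next answer shows how to get the real one from SQL Server):

    using System;
    using System.Linq;
    using System.Text.RegularExpressions;

    // Assumed expansion list for FORMSOF(INFLECTIONAL, jump).
    string[] expansions = { "jump", "jumps", "jumped", "jumping" };

    var pattern = new Regex(
        @"\b(" + string.Join("|", expansions.Select(Regex.Escape)) + @")\b",
        RegexOptions.IgnoreCase);

    Match m = pattern.Match("Jumping Jack");
    if (m.Success)
        Console.WriteLine("matched because of '" + m.Value + "'"); // -> 'Jumping'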
You can get SQL to tell you how it interpreted your query, including how it transformed your input.
SELECT occurrence, special_term, display_term, expansion_type, source_term
FROM sys.dm_fts_parser('FORMSOF(INFLECTIONAL, bind)', 1033, 0, 0)
returns
occurrence  special_term  display_term  expansion_type  source_term
1           Exact Match   binds         2               bind
1           Exact Match   binding       2               bind
1           Exact Match   bound         2               bind
1           Exact Match   bind          0               bind
This isn't precisely what you asked for, but it's a start. You could search your results for anything in the display_term column and probably figure out why it matched.
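A sketch of using that from the C# side; it assumes an open SqlConnection and that dm_fts_parser accepts the query text as a parameter. Note also that sys.dm_fts_parser requires sysadmin membership, per the SQL Server docs:

    using System.Collections.Generic;
    using Microsoft.Data.SqlClient; // or System.Data.SqlClient

    // Asks SQL Server how it expands a CONTAINS term; 1033 is the English
    // LCID used in the query above.
    static List<string> GetExpansionTerms(SqlConnection conn, string ftsQuery)
    {
        var terms = new List<string>();
        using (var cmd = new SqlCommand(
            "SELECT display_term FROM sys.dm_fts_parser(@q, 1033, 0, 0)", conn))
        {
            cmd.Parameters.AddWithValue("@q", ftsQuery);
            using (var reader = cmd.ExecuteReader())
            {
                while (reader.Read())
                    terms.Add(reader.GetString(0));
            }
        }
        return terms;
    }

    // GetExpansionTerms(conn, "FORMSOF(INFLECTIONAL, bind)")
    //   -> "binds", "binding", "bound", "bind"
    // Check which of these appears in each returned row to explain the match.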