Super fuzzy name checking?


I'm working on some stuff for an in-house CRM. The company's current frontend allows for lots of duplicates. I'm trying to stop end-users from putting in the same person because they searched for 'Bill Johnson' and not 'William Johnson.' So the user will put in some information about their new customer and we'll find the similar names (including fuzzy names) and match them against what is already in our database and ask if they meant those things... Does such a database or technology exist?

I implemented this kind of functionality on one website. I use double_metaphone() + levenshtein() in PHP. I precalculate a double_metaphone() value for each entry in the database, which I look up using a SELECT on the first x chars of the 'metaphoned' search term.
Then I sort the returned results by their Levenshtein distance. double_metaphone() is not part of any PHP library (last time I checked), so I borrowed a PHP implementation I found somewhere on the net a long while ago (site no longer online). I should post it somewhere I suppose.
EDIT: The website is still in archive.org:
http://web.archive.org/web/20080728063208/http://swoodbridge.com/DoubleMetaPhone/
or Google cache:
http://webcache.googleusercontent.com/search?q=cache:Tr9taWl9hMIJ:swoodbridge.com/DoubleMetaPhone/+Stephen+Woodbridge+double_metaphon
which leads to many other useful links with source code for double_metaphone(), including one in Javascript on github: http://github.com/maritz/js-double-metaphone
EDIT: I went through my old code, and here are roughly the steps of what I do, pseudo-coded to keep it clear:
1) Precompute a double_metaphone() for every word in the database, i.e., $word='blahblah'; $soundslike=double_metaphone($word);
2) At lookup time, compute the same key for the word being fuzzy-searched: $soundslike = double_metaphone($word)
3) SELECT * FROM table WHERE soundslike LIKE $soundslike (if you have levenshtein available as a stored procedure, much better: SELECT * FROM table WHERE levenshtein(soundslike,$soundslike) < mythreshold ORDER BY levenshtein(word,$word) ASC LIMIT ... etc.)
It has worked well for me, although I can't use a stored procedure, since I have no control over the server and it's using MySQL 4.20 or something.
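The steps above can be sketched roughly as follows, in Python rather than PHP. This is only an illustration under loud assumptions: `crude_key()` is a hypothetical, much cruder stand-in for double_metaphone() (it is not the real algorithm), the "database" is an in-memory dictionary, and the prefix length of 3 is an arbitrary choice.

```python
from collections import defaultdict


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def crude_key(word: str) -> str:
    """Hypothetical stand-in for double_metaphone(): keep the first letter,
    drop later vowels plus H/W/Y, and collapse repeated letters."""
    letters = [c for c in word.upper() if c.isalpha()]
    if not letters:
        return ""
    key = [letters[0]]
    for c in letters[1:]:
        if c in "AEIOUYHW" or c == key[-1]:
            continue
        key.append(c)
    return "".join(key)


def build_index(names, prefix_len=3):
    """Step 1: precompute the phonetic key prefix for every stored name."""
    index = defaultdict(list)
    for name in names:
        index[crude_key(name)[:prefix_len]].append(name)
    return index


def fuzzy_lookup(index, query, prefix_len=3, limit=5):
    """Steps 2 and 3: fetch candidates by key prefix, rank by edit distance."""
    candidates = index.get(crude_key(query)[:prefix_len], [])
    ranked = sorted(candidates,
                    key=lambda n: levenshtein(n.lower(), query.lower()))
    return ranked[:limit]
```

The phonetic prefix keeps the candidate set small, so the more expensive edit-distance comparison only runs on a handful of rows.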

I asked a similar question once: "Name Hypocorism List". I never did get around to doing anything with it, but the problem has come up again at work, so I might write and open-source a .NET library for doing some matching.
Update:
I ported the Perl module I mentioned there to C# and put it up on GitHub: http://github.com/stimms/Nicknames

Implement the Levenshtein distance:
http://en.wikipedia.org/wiki/Levenshtein_distance
This can be written as a SQL Function and queried many different ways.
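As a rough illustration of the "SQL function" idea, here is a sketch using Python's built-in sqlite3 module, which lets a Python function be registered as a SQL function. SQLite is only a stand-in for whatever database the CRM really uses, and the `customers` table, its rows and the threshold of 3 are made up for the example.

```python
import sqlite3


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + cost))
        previous = current
    return previous[-1]


conn = sqlite3.connect(":memory:")
# Expose the Python function to SQL under the name levenshtein(a, b).
conn.create_function("levenshtein", 2, levenshtein)

conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.executemany(
    "INSERT INTO customers (name) VALUES (?)",
    [("William Johnson",), ("Wiliam Jonson",), ("Bill Johnson",), ("Sarah Connor",)],
)


def similar_names(conn, query, threshold=3):
    """Return (name, distance) pairs within the threshold, closest first."""
    return conn.execute(
        "SELECT name, levenshtein(lower(name), lower(?)) AS dist "
        "FROM customers "
        "WHERE levenshtein(lower(name), lower(?)) <= ? "
        "ORDER BY dist, name",
        (query, query, threshold),
    ).fetchall()
```

Note that plain edit distance finds the misspelled "Wiliam Jonson" but not "Bill Johnson", which is why a nickname list is still needed for the original question.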

Well SSIS has some fuzzy logic tasks we use to find duplicates after the fact.
I think though you need to have your logic look at more than just the name for best results. If they are putting in address, email or phone information, perhaps you could look for people with the same last name with one or more of those other matches and ask if one of them will do. You could also make a table of nicknames for various names and match on that. You won't get all of them, but you could get some of the most common in your country at least.
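A minimal sketch of that idea, assuming a tiny hand-made nickname table and a made-up `Customer` record (neither comes from a real data set): candidates must share the last name, and are flagged when either the first names agree after nickname normalisation or a contact field matches.

```python
from dataclasses import dataclass

# Hypothetical nickname table; a real one would be much larger and regional.
NICKNAMES = {
    "bill": "william",
    "will": "william",
    "bob": "robert",
    "liz": "elizabeth",
}


@dataclass
class Customer:
    first: str
    last: str
    email: str = ""
    phone: str = ""


def canonical_first(name: str) -> str:
    """Map a nickname to its full form; unknown names pass through."""
    key = name.strip().lower()
    return NICKNAMES.get(key, key)


def possible_duplicates(new: Customer, existing: list) -> list:
    """Existing customers with the same last name and a matching
    first name (after nickname lookup), email or phone."""
    matches = []
    for c in existing:
        if c.last.lower() != new.last.lower():
            continue
        same_first = canonical_first(c.first) == canonical_first(new.first)
        same_email = bool(new.email) and new.email.lower() == c.email.lower()
        same_phone = bool(new.phone) and new.phone == c.phone
        if same_first or same_email or same_phone:
            matches.append(c)
    return matches
```

The returned list is what the UI would show under a "did you mean one of these?" prompt before creating the new record.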

You can use SOUNDEX to get similar-sounding names. However, it won't match William with Bill, for example.
Try this in SQL as an example.
SELECT SOUNDEX('John'), SOUNDEX('Jon')
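To see why the two spellings collide while William and Bill do not, here is a small Python sketch of the classic American Soundex rules. It is an illustration only; a database's built-in SOUNDEX may differ in edge cases.

```python
# Letter groups of classic Soundex; vowels and H, W, Y carry no code.
_CODES = {}
for _letters, _digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(name: str) -> str:
    """Return the 4-character Soundex code, e.g. 'Robert' -> 'R163'."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0]
    prev = _CODES.get(letters[0], "")
    for c in letters[1:]:
        code = _CODES.get(c, "")
        if code and code != prev:
            result += code
        if c not in "HW":  # H and W do not separate equal codes
            prev = code
    return (result + "000")[:4]
```

Both "John" and "Jon" reduce to the same code because the O and H are ignored, whereas "William" and "Bill" already differ in their first letter, so no phonetic code can link them.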

There is some built-in "sounds like" functionality in SQL Server; see SOUNDEX: http://msdn.microsoft.com/en-us/library/aa259235%28SQL.80%29.aspx
As for full name / nickname searching, there isn't anything built in that I am aware of. Nicknames vary by region and it's a lot of information to keep track of. There might be a database linking full names to nicknames that you could leverage in your own application.

Related

Is it possible for Lucene to monitor a Sql Table and keep itself updated?

I am trying to understand some basics of Lucene, the full text search engine. More specifically I am looking at Lucene.Net.
Today I have an old legacy .NET 4.8 web app. Some of it is MVC, but the newer parts follow a pretty nice API-first pattern. The app holds a lot of records (approximately half a million) with tons of different fields. The search functionality there is outdated, to say the least. It is a ton of old Linq2SQL queries that fan out into LIKE queries.
I would like to introduce a new and better way to search records, so I started looking at Lucene.Net. But I am trying to understand one key concept, and I can't seem to find the answer anywhere, and I think it might be because it cannot be done, but I would like to make sure.
Is it possible to set up Lucene to monitor a SQL table or view, so I don't have to maintain the Lucene index from within my code? The code of this app does not lend itself to easily keeping a Lucene index updated when things are added, changed or deleted. But the database is a good source of truth. I can live with a small delay before the index is up to date. Basically, I would like to define for each business model which fields are part of the index and what the id is, and then be able to query that index from the C# server-side code of my web app.
Is such a scenario even possible or am I asking too much?

It's totally possible, but not out of the box. You have to implement it if you want it. Fundamentally you need to implement three things.
1) A way to know every time a piece of relevant data in the SQL database changes.
2) A place to capture information about that change; call it a change log.
3) A routine that reads the change log, applies those changes to the Lucene.Net index and then marks the record in the change log as processed.
There are of course lots of different ways to handle each of these.
This SO answer, "Lucene.Net index updates, when a manual change is done in SQL Database", provides more details on one way this can be accomplished.
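One possible shape for those three pieces, sketched in Python with SQLite only so that it runs standalone. Everything here is an assumption for illustration: the `records` and `change_log` tables, the trigger names, and above all the plain dictionary standing in for the Lucene.Net index. In SQL Server the triggers would be written in T-SQL (or replaced by a built-in change tracking feature), and the sync routine would call the real index writer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE records (id INTEGER PRIMARY KEY, title TEXT);

-- Piece 2: the change log.
CREATE TABLE change_log (
    seq       INTEGER PRIMARY KEY AUTOINCREMENT,
    record_id INTEGER NOT NULL,
    op        TEXT    NOT NULL,           -- 'upsert' or 'delete'
    processed INTEGER NOT NULL DEFAULT 0
);

-- Piece 1: triggers notice every insert, update and delete.
CREATE TRIGGER records_ai AFTER INSERT ON records
BEGIN INSERT INTO change_log (record_id, op) VALUES (NEW.id, 'upsert'); END;
CREATE TRIGGER records_au AFTER UPDATE ON records
BEGIN INSERT INTO change_log (record_id, op) VALUES (NEW.id, 'upsert'); END;
CREATE TRIGGER records_ad AFTER DELETE ON records
BEGIN INSERT INTO change_log (record_id, op) VALUES (OLD.id, 'delete'); END;
""")

# Stand-in for the search index: record id -> indexed text.
search_index = {}


def sync_index(conn, index):
    """Piece 3: apply unprocessed changes to the index, then mark them done.
    Meant to be run periodically; returns how many log entries were handled."""
    pending = conn.execute(
        "SELECT seq, record_id, op FROM change_log "
        "WHERE processed = 0 ORDER BY seq"
    ).fetchall()
    for seq, record_id, op in pending:
        row = conn.execute(
            "SELECT title FROM records WHERE id = ?", (record_id,)
        ).fetchone()
        if op == "delete" or row is None:
            index.pop(record_id, None)   # remove the document
        else:
            index[record_id] = row[0]    # add or replace the document
        conn.execute("UPDATE change_log SET processed = 1 WHERE seq = ?", (seq,))
    conn.commit()
    return len(pending)
```

Because the routine always re-reads the current row, replaying a log entry twice is harmless, which keeps the polling job simple at the cost of the small delay the question says is acceptable.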

Filtering a collection of tags-based huge strings with AND, OR, *

I'm stuck on something that looks easy... but apparently is not.
We've got a "big" (3000 records and counting) collection of words (extracted from résumés) stored in a database: one record per candidate, with one string property holding every word separated by a space.
One of my colleagues asked me if it would be possible for him to write search strings like "A" AND ("B" OR "C*") AND ("D" OR ("E*" AND "*F")).
When he asked, it looked easy, but I'm stuck on this, not even knowing where to start. From what I've gathered by looking online, well... it seems I'm the only human being with this particular need.
I've read that I probably need tree-like filters, but if someone has a bit of code to get me started, that would be greatly appreciated ;)
If you know a C# library that does it, that would be perfection.
I guess I could migrate my database to store each word in a separate table with foreign keys, if that would help...
Thanks anyway!

I ended up using Lucene.Net, as it already has everything I need implemented.
Thanks for your answers.
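For readers who want to see what the "tree-like filter" from the question could look like without Lucene.Net, here is a small recursive-descent sketch in Python. It is not how Lucene.Net works internally; the grammar (quoted terms, AND binding tighter than OR, parentheses, shell-style `*` wildcards via `fnmatch`) is an assumption based on the example query in the question.

```python
import re
from fnmatch import fnmatchcase

# A token is a parenthesis, a double-quoted term, or the keywords AND / OR.
_TOKEN = re.compile(r'\s*(\(|\)|"[^"]*"|AND\b|OR\b)', re.IGNORECASE)


def tokenize(query):
    query = query.strip()
    tokens, pos = [], 0
    while pos < len(query):
        m = _TOKEN.match(query, pos)
        if not m:
            raise ValueError(f"unexpected input at position {pos}")
        tokens.append(m.group(1))
        pos = m.end()
    return tokens


def parse(query):
    """Compile a query into a predicate taking a set of lowercase words."""
    tokens = tokenize(query)
    pos = 0

    def peek():
        return tokens[pos].upper() if pos < len(tokens) else None

    def advance():
        nonlocal pos
        pos += 1
        return tokens[pos - 1]

    def factor():
        if peek() is None:
            raise ValueError("unexpected end of query")
        tok = advance()
        if tok == "(":
            node = expr()
            if peek() != ")":
                raise ValueError("missing ')'")
            advance()
            return node
        if tok.startswith('"'):
            pattern = tok[1:-1].lower()
            return lambda words: any(fnmatchcase(w, pattern) for w in words)
        raise ValueError(f"unexpected token {tok!r}")

    def term():  # factor (AND factor)*
        left = factor()
        while peek() == "AND":
            advance()
            left = (lambda l, r: lambda ws: l(ws) and r(ws))(left, factor())
        return left

    def expr():  # term (OR term)*
        left = term()
        while peek() == "OR":
            advance()
            left = (lambda l, r: lambda ws: l(ws) or r(ws))(left, term())
        return left

    tree = expr()
    if pos != len(tokens):
        raise ValueError("unexpected trailing tokens")
    return tree


def matches(record_text, predicate):
    """Apply a compiled query to one candidate's space-separated words."""
    return predicate(set(record_text.lower().split()))
```

With only a few thousand records, compiling the query once and running `matches()` over every row in memory is probably fast enough; a dedicated search library becomes more attractive as the data grows.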

ASP.NET C# Search in a SQL Server Database Table

I'm building a portal, which isn't a blogging engine, but is quite similar to one. In SQL Server, I have a database containing a table that is the basis for the "posts". This Posts table includes the following columns:
ID
Author(s)
Tags
Title
Markdown post content
This is my first time building such a portal, and I'd like to implement some sort of ASP.NET search over these rows, preferably using all of the properties (columns), except for the ID one. Also, in the long run, I'm considering the possibility of implementing a search of this and comments to those posts, which would be stored in a different table.
Are there any open-source implementations or example code online for accomplishing this search? If not, how can I get started? Could you point me towards some tutorials with sample code on how to accomplish this with ASP.NET and C#? Also, has Google (or some other company) created any tools for this?
I hope my question isn't too broad or vague. Thanks in advance!

Are you using SQL Server 2008? If so, you could leverage the full-text search features built right into it. Then it would be as simple as passing the user's (sanitized) input into a SQL query as a parameter.
For example, here's a query that would search the Author, Title and PostContent fields for the user's inputted text.
SELECT Author, Title FROM Posts
WHERE CONTAINS((Author, Title, PostContent), @userInput);
SQL Server 2008 supports different search methods too, like simple token, weighted word values, synonym, proximity and prefix searches... it's pretty awesome.
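The "sanitized" part deserves a concrete illustration, because CONTAINS expects a search condition with its own syntax rather than free text. The helper below is a hypothetical sketch (its name and behaviour are not part of any library): it keeps only word characters, quotes each word, and joins them with AND, optionally as prefix terms. The result should still be passed as a query parameter, never concatenated into the SQL text.

```python
import re


def to_contains_condition(user_input: str, prefix: bool = False) -> str:
    """Turn free text into a CONTAINS-style condition such as
    '"sql" AND "server"'. Returns "" when no usable words remain,
    in which case the caller should skip the full-text query."""
    # \w+ drops quotes, parentheses and other operator characters.
    words = re.findall(r"\w+", user_input)
    if not words:
        return ""
    suffix = "*" if prefix else ""
    return " AND ".join(f'"{word}{suffix}"' for word in words)
```

Quoting every word also means that input such as AND or OR is treated as a plain term rather than as an operator.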

Have you thought of implementing search through the use of Full-Text Search? It seems like a great scenario for it. This link might provide useful information on architecture and development of Full-Text Search. http://msdn.microsoft.com/en-us/library/ms142571.aspx

If this is publicly exposed, you can always have Google index it. Google has an appliance which you can purchase to do this as well.
If you don't want to roll your own, you could look at:
SharePoint (regular [comes with Win 2003+] or portal server)
SOLR (Apache Lucene project)
If you want to roll your own, I suggest looking into SQL Server Analysis Services to build the search indexes for you on a regular basis.

I agree with Womp - Full text index is the way to go. You could also look at NLucene if you need to index more than just the database table.

I think you should try Lucene, which you can use to index all your data and build a really good search engine.
I can give more information about that if you like.

Methods for storing searchable data in C#

In a desktop application, I need to store a 'database' of patient names with simple information, which can later be searched through. I'd expect on average around 1,000 patients total. Each patient will have to be linked to test results as well, although these can/will be stored separately from the patients themselves.
Is a database the best solution for this, or overkill? In general, we'll only be searching based on a patient's first/last name, or ID numbers. All data will be stored with the application, and not shared outside of it.
Any suggestions on the best method for keeping all such data organized? The method for storing the separate test data is what seems to stump me when not using databases, while keeping it linked to the patient.
Off the top of my head, given a List<Patient>, I can imagine several LINQ commands to make searching a breeze, although with a list of 1,000 - 10,000 patients, I'm unsure if there are any performance concerns.

Use a database. Mainly because what you expect and what you get (especially over the long term) tend to be two totally different things.

This is completely unrelated to your question on a technical level, but are you doing this for a company in the United States? What kind of patient data are you storing?
Have you looked into HIPAA requirements and checked to see if you're a covered entity? Be sure that you're complying with all legal regulations and requirements!

I think 1000 is too much to try to store in XML. I'd go with a simple db type, like Access or SQLite. Yes, as a matter of fact, I'd probably use SQLite. SQL Server Express is probably overkill for it. http://sqlite.phxsoftware.com/ is the .NET provider.
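To show how little is needed to keep test results linked to patients in SQLite, here is a sketch using Python's built-in sqlite3 module; the same schema would work from C# through a .NET provider. The table and column names are invented for the example.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) the database file and make sure the schema exists."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS patients (
            id         INTEGER PRIMARY KEY,
            first_name TEXT NOT NULL,
            last_name  TEXT NOT NULL
        );
        -- Each result row points back at its patient.
        CREATE TABLE IF NOT EXISTS test_results (
            id         INTEGER PRIMARY KEY,
            patient_id INTEGER NOT NULL REFERENCES patients(id),
            test_name  TEXT NOT NULL,
            value      TEXT NOT NULL
        );
        CREATE INDEX IF NOT EXISTS idx_patients_name
            ON patients (last_name, first_name);
    """)
    return conn


def add_patient(conn, first_name, last_name):
    cur = conn.execute(
        "INSERT INTO patients (first_name, last_name) VALUES (?, ?)",
        (first_name, last_name),
    )
    return cur.lastrowid


def add_result(conn, patient_id, test_name, value):
    conn.execute(
        "INSERT INTO test_results (patient_id, test_name, value) VALUES (?, ?, ?)",
        (patient_id, test_name, value),
    )


def find_patients(conn, name_part):
    """Search on first or last name (LIKE ignores ASCII case in SQLite)."""
    like = f"%{name_part}%"
    return conn.execute(
        "SELECT id, first_name, last_name FROM patients "
        "WHERE first_name LIKE ? OR last_name LIKE ? "
        "ORDER BY last_name, first_name",
        (like, like),
    ).fetchall()


def results_for(conn, patient_id):
    return conn.execute(
        "SELECT test_name, value FROM test_results "
        "WHERE patient_id = ? ORDER BY id",
        (patient_id,),
    ).fetchall()
```

At a few thousand rows, queries like these should return effectively instantly, and the whole database stays a single file that ships with the application.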

I would recommend a database. You can use SQL Server Express for something like that. Trying to use XML or something similar would probably get out of hand with that many rows.
For smaller databases/apps like this I've yet to notice any performance hits from using LINQ to SQL or Entity Framework.

I would use SQL Server Express because it has the best tool support (IDE integration) from Microsoft. I don't see any reason to consider it overkill.
Here's an article on how to embed it directly in your application (no separate installation needed).

If you had read-only files provided by another party in some kind of standard format which were meant to be used by the application, then I would consider simply indexing them according to your use cases and running your searches and UI against that. But that's still some customized work.
Relational databases are great for storing data in tables, and for representing the relationships between tables. Typically there are also good tools for getting the data in and out.
There are other systems you could use to store your data, but none which would so quickly be mapped to your input (you didn't mention how your data would get into this system) and then be queryable against with least effort.
Now, which database to choose...

Use a database... but maybe just SQLite, instead of a fully fledged database like MS SQL (Express).

How can I search data in a database efficiently without using full-text search?

I want to search for a sentence (a combination of words) in some table or view of a DB. I don't want to use the full-text search feature of the DB. Is there any alternative, efficient way?

Without the use of an index, a database has to perform a "full table scan". This is rather like you looking through a book one page at a time to find what you need.
That being said, computers are a lot faster than humans. It really depends on how much load your system has. Using MySQL we successfully implemented a search system on a table of lead information. The nature of the problem was one that could not be solved by normal indexes (including full text). So we designed it to be powered using a full table scan.
That involved creating tables as narrow as possible with the search data, and joining them to a larger table with related, but non-search data.
At the time (4 years ago), 100,000 records could be scanned in .06 seconds. 1,000,000 records took about .6 seconds. The system is still in heavy production use with millions of records.
If your data needs exceed 6 digits of records, you may want to re-evaluate using a full text index, or do some research on inverted indexes.
Please comment if you would like any more info.
Edit: The search tables were kept as narrow as possible. Ideally 50-100 bytes per record. ENUMS and TINYINT are great space savers if you can use them to "map" to string values another way.
The search queries were generated using a PHP class. They were simply:
-- MainTable is the big table that holds all of the data
-- SearchTable is the narrow table that holds the bits of searchable data
SELECT
    MainTable.ID,
    MainTable.Name,
    MainTable.Whatever
FROM
    MainTable, SearchTable
WHERE
    MainTable.ID = SearchTable.ID
    AND SearchTable.State IN ('PA', 'DE')
    AND SearchTable.Age < 40
    AND SearchTable.Status = 3
Essentially, the two tables were joined on a primary key (fast), and the filtering was done by full table scan on the SearchTable (pretty fast). We were using MySQL.
We found that by having the record format == "FIXED" in the MyISAM tables, we could increase performance by 3x. This meant no blobs, no varchars, etc...
Let me know if this helps.

None as efficient as full-text search.
Basically it boils down to WHERE clauses with LIKE variants, and since indexes go unused in most of those scenarios, it becomes a very expensive query.

If you are using Java, have a look at Lucene.
If you are using .NET, you can have a look at Lucene.Net; it will minimize the calls to the database for the search queries.
The following is from http://incubator.apache.org/lucene.net/
Lucene.Net is a source code, class-per-class, API-per-API and algorithmatic port of the Java Lucene search engine to the C# and .NET platform utilizing Microsoft .NET Framework.
Lucene.Net sticks to the APIs and classes used in the original Java implementation of Lucene. The API names as well as class names are preserved with the intention of giving Lucene.Net the look and feel of the C# language and the .NET Framework. For example, the method Hits.length() in the Java implementation now reads Hits.Length() in the C# port.
In addition to the APIs and classes port to C#, the algorithm of Java Lucene is ported to C# Lucene. This means an index created with Java Lucene is back-and-forth compatible with the C# Lucene; both at reading, writing and updating. In fact a Lucene index can be concurrently searched and updated using Java Lucene and C# Lucene processes.

You could break up the text into individual words, stick them in a separate table, and use that to find PK IDs that have all the words in your search sentence [i.e. but not necessarily in the right order], and then search just those rows for the sentence. Should avoid having to do a table scan every time.
Please ask if you need me to explain further
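Here is a rough in-memory sketch of that idea in Python. In a real database the word-to-ID mapping would be the separate table described above; the dictionaries and function names here are made up for illustration.

```python
from collections import defaultdict


def build_word_index(rows):
    """rows maps primary key -> text. Returns word -> set of primary keys,
    i.e. the contents of the separate 'words' table."""
    index = defaultdict(set)
    for row_id, text in rows.items():
        for word in text.lower().split():
            index[word].add(row_id)
    return index


def search_sentence(rows, index, sentence):
    """Find rows containing the sentence as consecutive whole words."""
    words = sentence.lower().split()
    if not words:
        return []
    # Step 1: rows that contain every word, in any order.
    candidates = set.intersection(*(index.get(w, set()) for w in words))
    # Step 2: check only those rows for the words in the right order.
    needle = " " + " ".join(words) + " "
    return sorted(
        row_id for row_id in candidates
        if needle in " " + " ".join(rows[row_id].lower().split()) + " "
    )
```

The expensive substring check only runs on rows that already contain all the words, which is what avoids scanning the whole table on every search.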
