Fuzzy pattern matching from emails in C#

I'm looking for a way to extract bits of data from emails. I'm primarily looking at subject lines and the email body, and extracting customer and order reference numbers.
Imagine I'm a company where customers can email info@mydomain.com, and they might add a specific customer number or order reference in the subject line or body of the email. However, they might not always provide these references in the optimal format. I want to extract that data and return a probability of how likely it is to be valid.
Is there some kind of technique I can use to scan an email and return a probable customer number and/or order reference with a degree of probability (a bit like Bayesian spam filtering)?
I was considering some kind of regular expression engine, but that seemed too rigid. I was also looking at NUML.net and wondering if it could help me, but I'm a little out of my depth, since I'm not entirely sure what I need. I've come across the Levenshtein algorithm, but that seems to be matching two fixed strings, rather than a fixed string and a pattern.
I'm imagining an API that looks a little like this:
// emailMessage is a Mandrill inbound object, in case anybody wonders
EmailScanResult results = EmailScanner.Scan(emailMessage, new[] { ScanType.CustomerNo, ScanType.OrderReference });
foreach (var result in results)
{
var scanType = result.Type; // I.e. ScanType.CustomerNo
var score = result.Score; // e.g. 1.2
var value = result.Value; // CU-233454345-2321
}
Possible inputs for this are varied; e.g., for the same customer number:
DF-232322-AB2323
df-232322-AB2323
232322-ab2323
232322AB2323
What kinds of algorithms would be useful for such a task? Are there any recommended .NET libraries for this, and do you know of any appropriate examples?

If I got it right, you could use a regular expression with no problem. For example, with the input samples you gave, you could use a regex like:
([A-Za-z]{2}-)?\d{6}-?[A-Za-z]{2}\d{4}
The first part matches the DF- or df- prefix, which may or may not occur: ([A-Za-z]{2}-)?
The second part matches the first group of digits: \d{6}
Then we allow an optional dash: -?
Finally, we match the trailing letters and digits: [A-Za-z]{2}\d{4}
This would cover the values you provided as samples, but you could also write other expressions to fetch other values.
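Here is a minimal C# sketch of how such a regex could be wired up with a crude confidence score (the scanner class and the scoring rules are purely illustrative, not part of any existing library):
using System;
using System.Text.RegularExpressions;

class CustomerNumberScanner
{
    // Pattern for the sample customer numbers; you would add one per reference type.
    private static readonly Regex CustomerNo =
        new Regex(@"([A-Za-z]{2}-)?\d{6}-?[A-Za-z]{2}\d{4}");

    static void Main()
    {
        string body = "Hi, my customer number is 232322ab2323, please check my order.";

        foreach (Match m in CustomerNo.Matches(body))
        {
            // Crude confidence: matches that already contain the prefix and the dash
            // look more like the canonical format, so score them higher.
            double score = 0.5;
            if (m.Value.Contains("-")) score += 0.25;
            if (char.IsLetter(m.Value[0])) score += 0.25;

            Console.WriteLine($"{m.Value} (score {score:0.00})");
        }
    }
}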
Or, maybe, you could use something like Lucene.net. From what I know, this could help you too.
http://pt.slideshare.net/nitin_stephens/lucene-basics
http://jsprunger.com/getting-started-with-lucene-net/

Related

How to find minimum replacement strings or regex to convert string to another string

OK, the title may not be correct, but it's the best I could come up with.
My question is this:
Example 1
see , saw
I can convert see to saw by
replacing ee with aw:
string srA = "see";
string srB = "saw";
srA = srA.Replace("ee", "aw"); // srA is now "saw", i.e. equal to srB
Or let's say
show , shown
add n to original string
Now what I want is, with a minimum amount of code, to generate such conversion procedures for any two compared strings.
I'm looking for your ideas on how I can do this. Can I generate regexes automatically to apply and convert?
Check diffplex and see if it is what you need. If you want to create a custom algorithm instead of using a 3rd-party library, just go through the code; it's open source.
You might also want to check this work for optimizations, but it might get complicated.
Then there's also Diff.NET.
Also this blog post is part of a series in implementing a diff tool.
If you're simply interested in learning more about the subject, your googling efforts should be directed to the Levenshtein algorithm.
I can only assume what your end goal is, and the time you're willing to invest in this, but I believe the first library should be enough for most needs.
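If you end up rolling your own comparison instead of using a library, a bare-bones Levenshtein edit distance in C# looks roughly like this (the standard dynamic-programming formulation; nothing here is specific to diffplex or Diff.NET):
using System;

static class Levenshtein
{
    // Classic dynamic-programming edit distance: the number of single-character
    // insertions, deletions, and substitutions needed to turn a into b.
    public static int Distance(string a, string b)
    {
        int[,] d = new int[a.Length + 1, b.Length + 1];

        for (int i = 0; i <= a.Length; i++) d[i, 0] = i;
        for (int j = 0; j <= b.Length; j++) d[0, j] = j;

        for (int i = 1; i <= a.Length; i++)
        {
            for (int j = 1; j <= b.Length; j++)
            {
                int cost = a[i - 1] == b[j - 1] ? 0 : 1;
                d[i, j] = Math.Min(
                    Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1), // deletion, insertion
                    d[i - 1, j - 1] + cost);                    // substitution
            }
        }
        return d[a.Length, b.Length];
    }
}

// Levenshtein.Distance("see", "saw") == 2; Levenshtein.Distance("show", "shown") == 1.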

Is it possible to guide a Markov chain toward certain keywords?

I'm writing a chat bot for a software engineering course in C#.
I'm using Markov chains to generate text, using Wikipedia articles as the corpus. I want it to respond to user input in an (at least slightly) intelligent way, based on their input, but I'm not sure how to do it.
My current thinking is that I'd try to extract keywords from the user's input, then use those to guide the sentence generation. But because of the Markov property, the keywords would have to be the first words in the sentence, which might look silly. Also, for an nth-order chain, I'd have to extract exactly n keywords from the user every time.
The data for the generator is a dictionary, where the keys are lists of words, and the values are lists of words combined with a weight depending on how often the word appears after the words in the key. So like:
{[word1, word2, ..., wordn]: [(word, weight), (word, weight), ...]}
It works in a command-line test program, but I'm just providing an n word seed for each bit of text it generates.
I'm hoping there's some way I can make the chain prefer words which are nearby words that the user used, rather than seeding it with the first/last n words in the input, or n keywords, or whatever. Is there a way to do that?
One way to make your chatbot smarter is to identify the topic from the user's input. Assume you have your Markov brain conditioned on different topics as well. Then, to construct your answer, you refer to the dictionary below:
{([word1, word2, ..., wordn], topic): [(word, weight), (word, weight), ...]}
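A rough C# sketch of that topic-conditioned table with a weighted next-word pick might look like the following (all type and member names are made up for illustration):
using System;
using System.Collections.Generic;
using System.Linq;

class TopicMarkovChain
{
    // Key: (n-gram of preceding words, topic); value: candidate next words with weights.
    private readonly Dictionary<(string ngram, string topic), List<(string word, double weight)>> table
        = new Dictionary<(string, string), List<(string, double)>>();

    private readonly Random rng = new Random();

    public void Add(string ngram, string topic, string nextWord, double weight)
    {
        var key = (ngram, topic);
        if (!table.TryGetValue(key, out var candidates))
            table[key] = candidates = new List<(string, double)>();
        candidates.Add((nextWord, weight));
    }

    // Weighted random pick of the next word for the given context and topic.
    public string Next(string ngram, string topic)
    {
        if (!table.TryGetValue((ngram, topic), out var candidates) || candidates.Count == 0)
            return null;

        double roll = rng.NextDouble() * candidates.Sum(c => c.weight);
        foreach (var (word, weight) in candidates)
        {
            roll -= weight;
            if (roll <= 0) return word;
        }
        return candidates[candidates.Count - 1].word;
    }
}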
To find the topics, you can start with WikipediaMiner. For instance, below are the topics and their corresponding weights found by wikify api against the sentence:
Statistics is so hard. Do you have some good tutorial of probability theory for a beginner?
[{'id': 23542, 'title': 'Probability theory', 'weight': 0.9257584778725553},
{'id': 30746, 'title': 'Theory', 'weight': 0.7408577501980528},
{'id': 22934, 'title': 'Probability', 'weight': 0.7089442931022307},
{'id': 26685, 'title': 'Statistics', 'weight': 0.7024251356953044}]
Those identified keywords would probably also be good to use as seeds.
However, question answering is not so simple. This Markov-based sentence generation has no ability to understand the question at all; the best it can do is provide related content. Just my 2 cents.

Finding string segments in a string

I have a list of segments (15,000+), and I want to find the occurrences of those segments in a given string. A segment can be a single word or multiple words, and I cannot assume space is a delimiter in the string.
e.g.
String "How can I download codec from internet for facebook, Professional programmer support"
[the string above may not make any sense, but I am using it for illustration purposes]
segment list
Microsoft word
Microsoft excel
Professional Programmer.
Google
Facebook
Download codec from internet.
Output:
Download codec from internet
facebook
Professional programmer
Basically I am trying to do query reduction.
I want to achieve this in less than O(list length + string length) time.
As my list has more than 15,000 segments, it would be time-consuming to search the entire list against the string.
The segments are prepared manually and placed in a txt file.
Regards
~Paul
You basically want a string-search algorithm like Aho-Corasick string matching. It constructs a state machine for processing bodies of text to detect matches, effectively searching for all patterns at the same time. Its runtime is on the order of the length of the text plus the total length of the patterns.
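For reference, a bare-bones Aho-Corasick automaton in C# looks roughly like this (a sketch only: everything is lower-cased for case-insensitive matching, and it reports which patterns occur but not their positions):
using System;
using System.Collections.Generic;

class AhoCorasick
{
    private class Node
    {
        public Dictionary<char, Node> Next = new Dictionary<char, Node>();
        public Node Fail;
        public List<string> Outputs = new List<string>();
    }

    private readonly Node root = new Node();

    public void AddPattern(string pattern)
    {
        var node = root;
        foreach (char c in pattern.ToLowerInvariant())
        {
            if (!node.Next.TryGetValue(c, out var child))
                node.Next[c] = child = new Node();
            node = child;
        }
        node.Outputs.Add(pattern);
    }

    // Build failure links with a breadth-first pass over the trie.
    public void Build()
    {
        var queue = new Queue<Node>();
        foreach (var child in root.Next.Values)
        {
            child.Fail = root;
            queue.Enqueue(child);
        }
        while (queue.Count > 0)
        {
            var node = queue.Dequeue();
            foreach (var kv in node.Next)
            {
                var child = kv.Value;
                var fail = node.Fail;
                while (fail != null && !fail.Next.ContainsKey(kv.Key))
                    fail = fail.Fail;
                child.Fail = fail == null ? root : fail.Next[kv.Key];
                child.Outputs.AddRange(child.Fail.Outputs);
                queue.Enqueue(child);
            }
        }
    }

    // Returns every pattern found anywhere in the text, in one pass.
    public IEnumerable<string> Search(string text)
    {
        var node = root;
        foreach (char c in text.ToLowerInvariant())
        {
            while (node != root && !node.Next.ContainsKey(c))
                node = node.Fail;
            if (node.Next.TryGetValue(c, out var next))
                node = next;
            foreach (var match in node.Outputs)
                yield return match;
        }
    }
}
Usage would be: add each of the 15,000 segments with AddPattern, call Build once, then call Search on each incoming string.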
In order to do efficient searches, you will need an auxiliary data structure in the form of some sort of index. Here, a great place to start would be to look at a KWIC index:
http://en.wikipedia.org/wiki/Key_Word_in_Context
http://www.cs.duke.edu/~ola/ipc/kwic.html
What you're basically asking how to do is write a custom lexer/parser.
Some good background on the subject would be the Dragon Book or something on lex and yacc (flex and bison).
Take a look at this question:
Poor man's lexer for C#
Now of course, a lot of people are going to say "just use regular expressions". Perhaps. The problem with using regexes in this situation is that your execution time will grow linearly as a function of the number of tokens you are matching against. So, if you end up needing to "segment" more phrases, your execution time will get longer and longer.
What you need to do is have a single pass, popping words on to a stack and checking if they are valid tokens after adding each one. If they aren't, then you need to continue (disregard the token like a compiler disregards comments).
Hope this helps.

Approximate string matching

I know this question has been asked a lot of times.
I want a suggestion on which algorithm is suitable for approximate string matching.
The application is specifically for company name matching only and nothing else.
The biggest challenge is probably the company name ending (the suffix part) and shortened/abbreviated names.
Example:
1. companyA pty ltd vs companyA pty. ltd. vs companyA
2. WES Engineering vs W.E.S. Engineering (extremely rare occurrence)
Do you think Levenshtein Edit Distance is adequate?
I'm using C#
Regards,
Max
There are various string distance metrics you could use.
I would recommend Jaro-Winkler. Unlike edit-distance where the result of a comparison is in discrete units of edits, JW gives you a 0-1 score. It is especially suited for proper names. Also look at this nice tutorial and this SO question.
I haven't worked with C# but here are some implementations of JW I found online:
Impl 1 (They have a DOT NET version too if you look at the file list)
Impl 2
If you want to do a bit more sophisticated matching, you can try to do some custom normalization of word forms commonly occurring in company names such as ltd/limited, inc/incorporated, corp/corporation to account for case insensitivity, abbreviations etc. This way if you compute
distance(normalize("foo corp."), normalize("FOO CORPORATION"))
you should get the result to be 0 rather than 14 (which is what you would get if you computed the Levenshtein edit distance on the raw strings).
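A small C# sketch of that normalization idea (the suffix list is just an example; you would extend it for your own data):
using System.Text.RegularExpressions;

static class CompanyNameNormalizer
{
    public static string Normalize(string name)
    {
        string s = name.ToLowerInvariant();
        s = Regex.Replace(s, @"[^a-z0-9 ]", " ");          // drop punctuation
        s = Regex.Replace(s, @"\bcorp\b", "corporation");  // expand common suffixes
        s = Regex.Replace(s, @"\binc\b", "incorporated");
        s = Regex.Replace(s, @"\bltd\b", "limited");
        return Regex.Replace(s, @"\s+", " ").Trim();       // collapse whitespace
    }
}

// Normalize("foo corp.") and Normalize("FOO CORPORATION") both yield "foo corporation",
// so they compare equal before any fuzzy matching is even needed.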
Yes, Levenshtein distance is suitable for this. It will work for all those you have listed at least.
You could also possibly use Soundex, but I don't think you'll need it.
In these simple examples, just removing all non-alphanumeric characters gives you a match, and it is the easiest approach: you can pre-compute the normalized value on each side and then do a straight equality comparison, which will be a lot faster than comparing every pair and calculating the edit distance.
I have provided my answer already in another question.
https://stackoverflow.com/a/30120166/2282794
I have worked on really large scale system with similar name matching requirements that you have talked about.
Name matching is not very straightforward and the order of first and last names might be different.
Simple fuzzy name matching algorithms fail miserably in such scenarios.
If we just want to talk about approximate string matching algorithms, there are many. A few of them are: Jaro-Winkler, edit distance (Levenshtein), Jaccard similarity, Soundex/phonetics-based algorithms, etc. A simple Google search will give you all the details.
You can implement all of them in C#.
The irony is that they work when you are matching two given input strings. That is fine in theory, and for demonstrating how fuzzy or approximate string matching works.
However, the grossly understated point is how to use them in a production setting. Not everybody I know who was scouting for an approximate string matching algorithm knew how to solve the problem in a production environment.
In that answer I talked about Lucene, which is specific to Java, but there is Lucene for .NET as well.
https://lucenenet.apache.org/

Fuzzy data matching for personal demographic information

Let's say I have a database filled with people with the following data elements:
PersonID (meaningless surrogate autonumber)
FirstName
MiddleInitial
LastName
NameSuffix
DateOfBirth
AlternateID (like an SSN, Military ID, etc.)
I get lots of data feeds in from all kinds of formats with every reasonable variation on these pieces of information you could think of. Some examples are:
FullName, DOB
FullName, Last 4 SSN
First, Last, DOB
When this data comes in, I need to write something to match it up. I don't need, or expect, to get more than an 80% match rate. After the automated match, I'll present the uncertain matches on a web page for someone to manually match.
Some of the complexities are:
Some data matches are better than others, and I would like to assign weight to those. For example, if the SSN matches exactly but the name is off because someone goes by their middle name, I would like to assign a much higher confidence value to that match than if the names match exactly but the SSNs are off.
The name matching has some difficulties. John Doe Jr is the same as John Doe II, but not the same as John Doe Sr., and if I get John Doe and no other information, I need to be sure the system doesn't pick one because there's no way to determine who to pick.
First name matching is really hard. You have Bob/Robert, John/Jon/Jonathon, Tom/Thomas, etc.
Just because I have a feed with FullName+DOB doesn't mean the DOB field is filled for every record. I don't want to miss a linkage just because the unmatched DOB kills the matching score. If a field is missing, I want to exclude it from the elements available for matching.
If someone manually matches, I want their match to affect all future matches. So, if we ever get the same exact data again, there's no reason not to automatically match it up next time.
I've seen that SSIS has fuzzy matching, but we don't use SSIS currently, and I find it pretty kludgy and nearly impossible to version control so it's not my first choice of a tool. But if it's the best there is, tell me. Otherwise, are there any (preferably free, preferably .NET or T-SQL based) tools/libraries/utilities/techniques out there that you've used for this type of problem?
There are a number of ways you can go about this, but having done this type of thing before, I will say up front that you run a lot of risk of producing "incorrect" matches between people.
Your input data is very sparse, and given what you have, it isn't very unique if not all values are present.
For example, with your First Name, Last Name, DOB situation: if you have all three parts for all records, then the matching gets a lot easier to work with. If not, you expose yourself to a lot of potential for issues.
One approach you might take, on the more "crude" side of things, is to simply create a process using a series of queries that identifies and classifies matching entries.
For example, first check for an exact match on name and SSN; if that is there, flag it, note it as 100%, and move on to the next set. Then you can explicitly define where you are fuzzy, so you know the potential ramifications of your matching.
In the end you would have a list with flags indicating the match type, if any for that record.
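As a concrete sketch of that flag/confidence idea in C# (the field names and weights below are invented for illustration; you would tune them for your data):
using System;
using System.Collections.Generic;

class PersonMatchScorer
{
    // Higher weight = stronger evidence when the field matches exactly.
    private static readonly Dictionary<string, double> Weights = new Dictionary<string, double>
    {
        ["Ssn"] = 0.6,
        ["DateOfBirth"] = 0.2,
        ["LastName"] = 0.15,
        ["FirstName"] = 0.05
    };

    // Returns a 0-1 confidence, ignoring fields that are missing on either side.
    public static double Score(IDictionary<string, string> candidate, IDictionary<string, string> incoming)
    {
        double possible = 0, earned = 0;
        foreach (var kv in Weights)
        {
            if (!candidate.TryGetValue(kv.Key, out var a) || string.IsNullOrWhiteSpace(a)) continue;
            if (!incoming.TryGetValue(kv.Key, out var b) || string.IsNullOrWhiteSpace(b)) continue;

            possible += kv.Value;
            if (string.Equals(a.Trim(), b.Trim(), StringComparison.OrdinalIgnoreCase))
                earned += kv.Value;
        }
        return possible == 0 ? 0 : earned / possible;
    }
}
Missing fields simply drop out of the denominator, which matches the requirement that an absent DOB should not kill an otherwise strong match.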
This is a problem called record linkage.
While it's for a python library, the documentation for dedupe gives a good overview of how to approach the problem comprehensively.
Take a look at the Levenshtein algorithm, which gives you the distance between two strings; dividing that distance by the length of the string gives you a percentage match.
http://en.wikipedia.org/wiki/Levenshtein_distance
I have previously implemented this to great success. It was a provider portal for a healthcare company, and providers registered themselves on the site. The matching was to take their portal registration and find the corresponding record in the main healthcare system. The processors who attended to this were presented with the most likely matches, ordered by percentage descending, and could easily choose the right account.
If false positives don't bug you and your data is primarily English, you can try algorithms like Soundex. SQL Server has it as a built-in function. Soundex isn't the best, but it does do fuzzy matching and is popular. Another alternative is Metaphone.
