What is the best way to get numbers from a sentence? - c#

I'm just trying to get the numbers that are in a sentence.
They can be currency, regular numbers, + and -.
Example would be:
Gym membership 7 months @ $20 per month $140.00
Gym membership refund $-100.00
I've been using this expression:
\$?(\d{1,3},?(\d{3},?)*\d{3}(.\d{0,2})?|\d{1,3}(.\d{2})?)
I've been using this website to test it.
http://rubular.com/
The only problem is it doesn't pick up $-100.00, it only picks up 100.00.
I'm also interested in whether there is a better way to do this, or if this is the only way.

\$?-?(\d{1,3},?(\d{3},?)*\d{3}(.\d{0,2})?|\d{1,3}(.\d{2})?)
Just add a -? there.
Whether there is a better way depends on what your requirements are. If this is working fine for you and does everything you want, I see no reason to use something else.
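A minimal C# sketch with the amended pattern; note the dots are escaped here as well, since an unescaped `.` in the original pattern matches any character:

```csharp
using System;
using System.Text.RegularExpressions;

class Program
{
    static void Main()
    {
        // Amended pattern: optional $, optional -, dots escaped so they
        // only match a literal decimal point.
        var pattern = @"\$?-?(\d{1,3},?(\d{3},?)*\d{3}(\.\d{0,2})?|\d{1,3}(\.\d{2})?)";
        var input = "Gym membership 7 months @ $20 per month $140.00; refund $-100.00";

        foreach (Match m in Regex.Matches(input, pattern))
            Console.WriteLine(m.Value);
        // Matches 7, $20, $140.00 and $-100.00 (bare numbers match too)
    }
}
```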

Related

How to create unit tests for a fair distribution algorithm?

I have the following algorithm:
Given a list of accounts, I have to divide them fairly between system users.
Now, in order to ease the workload on the users I have to split them over days.
So, if an account has service orders they must be inserted to the list that will be distributed over 547 days (a year and a half). If an account has no service orders they must be inserted to the list that will be distributed over 45 days (a month and a half).
I am using the following LINQ extension from a question I asked before:
public static IEnumerable<IGrouping<TBucket, TSource>> DistributeBy<TSource, TBucket>(
    this IEnumerable<TSource> source, IList<TBucket> buckets)
{
    var tagged = source.Select((item, i) => new { item, tag = i % buckets.Count });
    var grouped = from t in tagged
                  group t.item by buckets[t.tag];
    return grouped;
}
and I can guarantee that it works.
My problem is that I don't really know how to unit test all those cases.
I might have 5 users and 2 accounts which might not be enough to be split over a year and a half or even a month and a half.
I might have exactly 547 accounts and 547 system users so each account will be handled by each system user each day.
Basically, I don't know what kind of datasets I should create, because there seem to be too many options, or what I should assert on, because I have no idea what the distribution will look like.
Start with boundary conditions (natural limits on the input of the method) and any known corner cases (places where the method behaves in an unexpected manner and you need special code to account for this).
For example:
How does the method behave when there are zero accounts?
Zero users?
Exactly one account and user
547 accounts and 547 users
It sounds like you already know a lot of the expected boundary conditions here. The corner cases will be harder to think of initially, but you will probably have hit some of them as you developed the method. They will also naturally come through manual testing: each time you find a bug, that's a new necessary test.
Once you have tested boundary conditions and corner cases, you should also look at a "fair" sample of other situations, like your 5 users and 2 accounts example. You can't (or at least, arguably don't need to) test all possible inputs to the method, but you can test a representative sample of inputs, for things like uneven division of accounts.
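As a sketch, two of those boundary checks against the question's DistributeBy extension (re-declared here so the example is self-contained; plain throws rather than any particular test framework). Note the empty-bucket list is a genuine corner case, since `i % buckets.Count` throws a DivideByZeroException on enumeration:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class BoundaryTests
{
    // Re-declared from the question so the example is self-contained.
    public static IEnumerable<IGrouping<TBucket, TSource>> DistributeBy<TSource, TBucket>(
        this IEnumerable<TSource> source, IList<TBucket> buckets)
    {
        var tagged = source.Select((item, i) => new { item, tag = i % buckets.Count });
        return from t in tagged group t.item by buckets[t.tag];
    }

    static void Main()
    {
        // Zero accounts: expect no groups at all.
        var none = Enumerable.Empty<string>().DistributeBy(new List<int> { 1, 2, 3 });
        if (none.Any()) throw new Exception("expected no groups for an empty source");

        // Zero buckets: i % 0 throws on enumeration -- a corner case the
        // method currently leaves to the caller to guard against.
        try
        {
            new[] { "acct" }.DistributeBy(new List<int>()).ToList();
            throw new Exception("expected DivideByZeroException for empty buckets");
        }
        catch (DivideByZeroException) { /* expected */ }

        Console.WriteLine("Boundary checks passed");
    }
}
```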
I think part of your issue is that your code describes how you are solving your problem, but not what problem you are trying to solve. Your current implementation is deterministic and gives you an answer, but you could easily swap it with another "fair allocator", which would give you an answer that would differ in the details (maybe different users would be allocated different accounts), but satisfy the same requirements (allocation is fair).
One approach would be to focus on what "fairness" means. Rather than checking the exact output of your current implementation, maybe reframe it so that at a high level, it looks like:
public interface IAllocator
{
    IAllocation Solve(IEnumerable<User> users, IEnumerable<Account> accounts);
}
and write tests which verify not the specific assignment of accounts to users, but that the allocation is fair, for some definition to be agreed on; something along the lines of "every user should have the same number of accounts allocated, plus or minus one". Defining what fair means, or what the exact goal of your algorithm is, should help you identify the corner cases of interest. Working off a higher-level objective (the allocation should be fair, rather than one specific allocation) should also let you swap implementations easily and still verify that the allocator is doing its job.
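A sketch of that property-based angle, checking the fairness property rather than a specific assignment (the question's DistributeBy is re-declared so the example runs standalone; "sizes differ by at most one" is the assumed definition of fair):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class FairnessTest
{
    // Re-declared from the question so the example is self-contained.
    public static IEnumerable<IGrouping<TBucket, TSource>> DistributeBy<TSource, TBucket>(
        this IEnumerable<TSource> source, IList<TBucket> buckets)
    {
        var tagged = source.Select((item, i) => new { item, tag = i % buckets.Count });
        return from t in tagged group t.item by buckets[t.tag];
    }

    static void Main()
    {
        // Property under test: whatever the allocator assigns, no two
        // buckets may differ in size by more than one item.
        var accounts = Enumerable.Range(1, 17);
        var users = Enumerable.Range(1, 5).ToList();

        var sizes = accounts.DistributeBy(users).Select(g => g.Count()).ToList();
        if (sizes.Max() - sizes.Min() > 1)
            throw new Exception("allocation is not fair");
        Console.WriteLine(string.Join(", ", sizes)); // 4, 4, 3, 3, 3
    }
}
```

The same assertion works unchanged if you later swap in a different "fair allocator", which is the point of testing the property instead of the exact output.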

Recognize Dates In A String

I want a class something like this:
public interface IDateRecognizer
{
DateTime[] Recognize(string s);
}
The dates might exist anywhere in the string and might be any format. For now, I could limit to U.S. culture formats. The dates would not be delimited in any way. They might have arbitrary amounts of whitespace between parts of the date. The ideas I have are:
ANTLR
Regex
Hand rolled
I have never used ANTLR, so I would be learning from scratch. I wonder if there are libraries or code samples out there that do something similar that could jump start me. Is ANTLR too heavy for such a narrow use?
I have used Regex a lot before, but I hate it for all the reasons that most people hate it.
I could certainly hand roll it but I'd rather not re-solve a solved problem.
Suggestions?
UPDATE: Here is an example. Given this input:
This is a date 11/3/63. Here is
another one: November 03, 1963; and
another one Nov 03, 63 and some
more (11/03/1963). The dates could be
in any U.S. format. They might have
dashes like 11-3-1963 or weird extra
whitespace inside like this:
Nov 3, 1963,
and even maybe the comma is missing
like [Nov 3 63] but that's an edge
case.
The output should be an array of seven DateTimes. Each date would be the same: 11/03/1963 00:00:00.
UPDATE: I totally hand rolled this, and I am happy with the result. Instead of using Regex, I ended up using DateTime.TryParse with a custom DateTimeFormatInfo, which allows you to very easily fine tune what formats are allowed and also handling of 2 digit years. Performance is quite acceptable given that this is handled async. The tricky part was tokenizing and testing sets of adjacent tokens in an efficient way.
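A minimal sketch of that token-window idea; the format list, window size, and punctuation trimming below are illustrative assumptions, not the poster's actual code:

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

static class DateRecognizer
{
    public static List<DateTime> Recognize(string s)
    {
        var results = new List<DateTime>();
        // Illustrative format list; extend as needed.
        var formats = new[] { "M/d/yy", "M/d/yyyy", "M-d-yy", "M-d-yyyy",
                              "MMMM d, yyyy", "MMM d, yy", "MMM d, yyyy" };
        var tokens = s.Split(new[] { ' ', '\t', '\r', '\n', '(', ')', '[', ']' },
                             StringSplitOptions.RemoveEmptyEntries);

        for (int i = 0; i < tokens.Length; i++)
        {
            // Test windows of up to 3 adjacent tokens, longest first.
            for (int len = Math.Min(3, tokens.Length - i); len >= 1; len--)
            {
                var candidate = string.Join(" ", tokens, i, len).TrimEnd('.', ',', ';');
                if (DateTime.TryParseExact(candidate, formats,
                        CultureInfo.GetCultureInfo("en-US"), DateTimeStyles.None, out var dt))
                {
                    results.Add(dt);
                    i += len - 1; // skip the tokens this date consumed
                    break;
                }
            }
        }
        return results;
    }

    static void Main()
    {
        var input = "This is a date 11/3/63. Here is another one: November 03, 1963.";
        foreach (var d in Recognize(input))
            Console.WriteLine(d.ToString("M/d/yyyy", CultureInfo.InvariantCulture));
        // Both dates resolve to 11/3/1963 (two-digit years pivot via TwoDigitYearMax).
    }
}
```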
I'd go for a hand-rolled solution that chops the input string into manageable pieces and lets some regexes do the work. This also seems like a great case to start with unit testing.
I'd suggest you go with regex. I'd put one regex (matching one date) into a string, and multiple of them into an array, then build the full regex at runtime. This makes the system more flexible. Depending on what you need, you could consider putting the different date regexes into an (XML) file or a DB.
Recognising dates seems to be a straightforward task for regex; I cannot understand why you are trying to avoid it.
ANTLR is just overkill for a case like this, where you have a very limited set of semantics.
Performance could be a potential concern, but I really doubt the other options would give you better performance.
So I would go with regex.

Approximate string matching

I know this question has been asked many times.
I want a suggestion on which algorithm is suitable for approximate string matching.
The application is specifically for company name matching only and nothing else.
The biggest challenges are probably the company-name suffixes and shortened forms.
Example:
1. companyA pty ltd vs companyA pty. ltd. vs companyA
2. WES Engineering vs W.E.S. Engineering (extremely rare occurrence)
Do you think Levenshtein Edit Distance is adequate?
I'm using C#
There are various string distance metrics you could use.
I would recommend Jaro-Winkler. Unlike edit-distance where the result of a comparison is in discrete units of edits, JW gives you a 0-1 score. It is especially suited for proper names. Also look at this nice tutorial and this SO question.
I haven't worked with C# but here are some implementations of JW I found online:
Impl 1 (They have a DOT NET version too if you look at the file list)
Impl 2
If you want to do a bit more sophisticated matching, you can try to do some custom normalization of word forms commonly occurring in company names such as ltd/limited, inc/incorporated, corp/corporation to account for case insensitivity, abbreviations etc. This way if you compute
distance(normalize("foo corp."), normalize("FOO CORPORATION"))
you should get the result to be 0 rather than 14 (which is what you would get if you computed levenshtein edit-distance).
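A sketch of what that normalization might look like; the abbreviation table here is an illustrative assumption, to be extended with whatever forms your data actually contains:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

static class CompanyNames
{
    // Illustrative abbreviation table; extend with the forms in your data.
    static readonly Dictionary<string, string> Abbreviations = new Dictionary<string, string>
    {
        { "ltd", "limited" }, { "inc", "incorporated" },
        { "corp", "corporation" }, { "pty", "proprietary" }
    };

    public static string Normalize(string name)
    {
        // Lowercase, strip punctuation, expand known abbreviations.
        var words = Regex.Replace(name.ToLowerInvariant(), @"[^a-z0-9\s]", "")
                         .Split(new[] { ' ', '\t' }, StringSplitOptions.RemoveEmptyEntries)
                         .Select(w => Abbreviations.TryGetValue(w, out var full) ? full : w);
        return string.Join(" ", words);
    }

    static void Main()
    {
        Console.WriteLine(Normalize("foo corp.") == Normalize("FOO CORPORATION")); // True
    }
}
```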
Yes, Levenshtein distance is suitable for this. It will work for all those you have listed at least.
You could also possibly use Soundex, but I don't think you'll need it.
In these simple examples, just removing all non-alphanumeric characters gives you a match, and it is the easiest approach: you can pre-compute the normalized form on each side, then do a straight equality match, which will be a lot faster than cross-multiplying and calculating the edit distance.
I have provided my answer already in another question.
https://stackoverflow.com/a/30120166/2282794
I have worked on really large scale system with similar name matching requirements that you have talked about.
Name matching is not very straightforward and the order of first and last names might be different.
Simple fuzzy name matching algorithms fail miserably in such scenarios.
If we just want to talk about approximate string matching algorithms, there are many. A few of them are: Jaro-Winkler, edit distance (Levenshtein), Jaccard similarity, Soundex/phonetics-based algorithms, etc. A simple googling would give us all the details.
You can implement all of them in C#.
The irony is, they work when you try to match two given input strings; that is fine in theory, and for demonstrating how fuzzy or approximate string matching works.
However, the grossly understated point is how to use the same in production settings. Not everybody I know of who was scouting for an approximate string matching algorithm knew how to solve this in a production environment.
In the linked answer I talked about Lucene, which is specific to Java, but there is Lucene for .NET also.
https://lucenenet.apache.org/

Fuzzy data matching for personal demographic information

Let's say I have a database filled with people with the following data elements:
PersonID (meaningless surrogate autonumber)
FirstName
MiddleInitial
LastName
NameSuffix
DateOfBirth
AlternateID (like an SSN, Military ID, etc.)
I get lots of data feeds in from all kinds of formats with every reasonable variation on these pieces of information you could think of. Some examples are:
FullName, DOB
FullName, Last 4 SSN
First, Last, DOB
When this data comes in, I need to write something to match it up. I don't need, or expect, to get more than an 80% match rate. After the automated match, I'll present the uncertain matches on a web page for someone to manually match.
Some of the complexities are:
Some data matches are better than others, and I would like to assign weight to those. For example, if the SSN matches exactly but the name is off because someone goes by their middle name, I would like to assign a much higher confidence value to that match than if the names match exactly but the SSNs are off.
The name matching has some difficulties. John Doe Jr is the same as John Doe II, but not the same as John Doe Sr., and if I get John Doe and no other information, I need to be sure the system doesn't pick one because there's no way to determine who to pick.
First name matching is really hard. You have Bob/Robert, John/Jon/Jonathon, Tom/Thomas, etc.
Just because I have a feed with FullName+DOB doesn't mean the DOB field is filled for every record. I don't want to miss a linkage just because the unmatched DOB kills the matching score. If a field is missing, I want to exclude it from the elements available for matching.
If someone manually matches, I want their match to affect all future matches. So, if we ever get the same exact data again, there's no reason not to automatically match it up next time.
I've seen that SSIS has fuzzy matching, but we don't use SSIS currently, and I find it pretty kludgy and nearly impossible to version control so it's not my first choice of a tool. But if it's the best there is, tell me. Otherwise, are there any (preferably free, preferably .NET or T-SQL based) tools/libraries/utilities/techniques out there that you've used for this type of problem?
There are a number of ways you can go about this, but having done this type of thing before, I will say up front that you run a lot of risk of "incorrect" matches between people.
Your input data is very sparse, and what you have isn't the most unique if not all values are present.
For example, with your First Name, Last Name, DOB situation, if you have all three parts for ALL records, then the matching gets a LOT easier to work with. If not, though, you expose yourself to a lot of potential issues.
One approach you might take, on the more "crude" side of things is to simply create a process using a series of queries that simply identifies and classifies matching entries.
For example, first check for an exact match on name and SSN; if it is there, flag it, note it as 100%, and move on to the next set. Then you can explicitly define where you are fuzzy, so you know the potential ramifications of your matching.
In the end you would have a list with flags indicating the match type, if any for that record.
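A sketch of that flag-and-weight idea, also skipping missing fields so an unmatched blank doesn't drag the score down (the field set, weights, and the exact-equality comparison are illustrative assumptions; real name comparison would use fuzzier logic):

```csharp
using System;
using System.Collections.Generic;

class PersonRecord
{
    public string FirstName, LastName, Ssn;   // null = not supplied by the feed
    public DateTime? DateOfBirth;
}

static class Matcher
{
    // Returns a 0-1 confidence; fields missing on either side are excluded
    // from both the score and the attainable maximum.
    public static double Confidence(PersonRecord a, PersonRecord b)
    {
        double score = 0, possible = 0;
        void Compare<T>(T x, T y, double weight)
        {
            if (x == null || y == null) return;   // missing: exclude from matching
            possible += weight;
            if (EqualityComparer<T>.Default.Equals(x, y)) score += weight;
        }

        Compare(a.Ssn, b.Ssn, 5.0);               // exact SSN match carries most weight
        Compare(a.DateOfBirth, b.DateOfBirth, 3.0);
        Compare(a.LastName, b.LastName, 2.0);
        Compare(a.FirstName, b.FirstName, 1.0);

        return possible == 0 ? 0 : score / possible;
    }

    static void Main()
    {
        var feed = new PersonRecord { FirstName = "Bob", LastName = "Doe", Ssn = "1234" };
        var db   = new PersonRecord { FirstName = "Robert", LastName = "Doe", Ssn = "1234",
                                      DateOfBirth = new DateTime(1963, 11, 3) };
        // SSN and last name match, first name differs, DOB missing from the feed:
        // 7 of 8 available points -> 0.875
        Console.WriteLine(Matcher.Confidence(feed, db));
    }
}
```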
This is a problem called record linkage.
While it's for a python library, the documentation for dedupe gives a good overview of how to approach the problem comprehensively.
Take a look at the Levenshtein algorithm, which gives you 'the distance between two strings'; that distance can then be divided by the length of the string to get a percentage match.
http://en.wikipedia.org/wiki/Levenshtein_distance
I have previously implemented this to great success. It was a provider portal for a healthcare company, and providers registered themselves on the site. The matching was to take their portal registration and find the corresponding record in the main healthcare system. The processors who attended to this were presented with the most likely matches, ordered by percentage descending, and could easily choose the right account.
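A minimal sketch of that percentage idea; dividing the distance by the longer string's length is one common convention, assumed here:

```csharp
using System;

static class Fuzzy
{
    // Classic dynamic-programming Levenshtein edit distance.
    public static int Levenshtein(string a, string b)
    {
        var d = new int[a.Length + 1, b.Length + 1];
        for (int i = 0; i <= a.Length; i++) d[i, 0] = i;
        for (int j = 0; j <= b.Length; j++) d[0, j] = j;
        for (int i = 1; i <= a.Length; i++)
            for (int j = 1; j <= b.Length; j++)
                d[i, j] = Math.Min(
                    Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),          // delete / insert
                    d[i - 1, j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1));   // substitute
        return d[a.Length, b.Length];
    }

    // 1.0 = identical, 0.0 = nothing in common.
    public static double Similarity(string a, string b)
    {
        int max = Math.Max(a.Length, b.Length);
        return max == 0 ? 1.0 : 1.0 - (double)Levenshtein(a, b) / max;
    }

    static void Main()
    {
        Console.WriteLine(Fuzzy.Levenshtein("kitten", "sitting")); // 3
        Console.WriteLine(Fuzzy.Similarity("John Doe", "Jon Doe")); // 0.875
    }
}
```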
If false positives don't bug you and your language is primarily English, you can try algorithms like Soundex. SQL Server has it as a built-in function. Soundex isn't the best, but it does do fuzzy matching and is popular. Another alternative is Metaphone.

System constant for the number of days in a week (7)

Can anyone find a constant in the .NET framework that defines the number of days in a week (7)?
DateTime.DaysInAWeek // Something like this???
Of course I can define my own, but I'd rather not if it's somewhere in there already.
Update:
I am looking for this because I need to allow the user to select a week (by date, rather than week number) from a list in a DropDownList.
You could probably use System.Globalization.DateTimeFormatInfo.CurrentInfo.DayNames.Length.
I think it's OK to hardcode this one. I don't think it will change any time soon.
Edit: It depends where you want to use this constant. Inside some calendar-related algorithm it is obvious what 7 means. On the other hand, a named constant sometimes makes code much more readable.
Try this:
Enum.GetNames(typeof(System.DayOfWeek)).Length
If you look at the IL code for Calendar.AddWeeks you will see that Microsoft itself uses a hardcoded 7 in the code.
Also the rotor source uses a hardcoded 7.
Still, I would suggest to use a const.
I used this:
public static readonly int WeekNumberOfDays = Enum.GetNames(typeof(DayOfWeek)).Length;
I don't believe there is one. TimeSpan defines constants for the number of ticks per milli/second/minute/hour/day, but nothing at the level of a week.
I ran a query across the standard libraries for symbols (methods/constants/fields/etc) containing the word 'Week'. Nothing came back. FYI, I ran this query using ReSharper.
I'm not sure what exactly you're looking for, but you can try DateHelper (CODE.MSDN). It's a library I put together for typical date needs. You might be able to use the week methods or the list methods.
Edit: no more MSDN Code; it's now on GitHub as part of this library: https://github.com/tbasallo/toolshed
Do you mean calendar weeks or just common weeks?
Obviously, there are calendar weeks that might be shorter than seven days. The last calendar week of the year is usually shorter, and depending on your definition of calendar week, the first week might be shorter as well.
In that case, I'm afraid you will have to roll your own week-length function. It's not really hard to do with the DateTime class; I have done it before, so if you need more help, let me know.
GregorianCalendar has AddWeeks(1) which will add 7 days to a date.
