Recognize Dates In A String

I want a class something like this:
public interface IDateRecognizer
{
DateTime[] Recognize(string s);
}
The dates might exist anywhere in the string and might be any format. For now, I could limit to U.S. culture formats. The dates would not be delimited in any way. They might have arbitrary amounts of whitespace between parts of the date. The ideas I have are:
ANTLR
Regex
Hand rolled
I have never used ANTLR, so I would be learning from scratch. I wonder if there are libraries or code samples out there that do something similar that could jump start me. Is ANTLR too heavy for such a narrow use?
I have used Regex a lot before, but I hate it for all the reasons that most people hate it.
I could certainly hand roll it but I'd rather not re-solve a solved problem.
Suggestions?
UPDATE: Here is an example. Given this input:
This is a date 11/3/63. Here is
another one: November 03, 1963; and
another one Nov 03, 63 and some
more (11/03/1963). The dates could be
in any U.S. format. They might have
dashes like 11-2-1963 or weird extra
whitespace inside like this:
Nov 3, 1963,
and even maybe the comma is missing
like [Nov 3 63] but that's an edge
case.
The output should be an array of seven DateTimes. Each date would be the same: 11/03/1963 00:00:00.
UPDATE: I totally hand rolled this, and I am happy with the result. Instead of using Regex, I ended up using DateTime.TryParse with a custom DateTimeFormatInfo, which allows you to very easily fine tune what formats are allowed and also handling of 2 digit years. Performance is quite acceptable given that this is handled async. The tricky part was tokenizing and testing sets of adjacent tokens in an efficient way.
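The tokenize-and-test idea from the update could look roughly like the sketch below. This is not the original code: the format list, the token regex, the three-token window, and the two-digit-year pivot of 2029 are all assumptions for illustration, and it uses DateTime.TryParseExact with an explicit format list rather than TryParse. It does not handle whitespace around the / or - separators.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text.RegularExpressions;

public static class DateScanner
{
    // Hypothetical whitelist of accepted formats; extend as needed.
    private static readonly string[] Formats =
    {
        "M/d/yyyy", "M/d/yy", "M-d-yyyy", "M-d-yy",
        "MMMM d yyyy", "MMMM d yy", "MMM d yyyy", "MMM d yy"
    };

    // A token is a word, or a digit run that may contain '/' or '-' separators.
    // Commas, brackets, and extra whitespace are dropped by not matching them.
    private static readonly Regex Token = new Regex(@"[A-Za-z]+|\d+(?:[/-]\d+)*");

    public static DateTime[] Recognize(string s)
    {
        var culture = new CultureInfo("en-US");
        // Assumed pivot: two-digit years map into 1930..2029.
        culture.DateTimeFormat.Calendar.TwoDigitYearMax = 2029;

        var tokens = new List<string>();
        foreach (Match m in Token.Matches(s)) tokens.Add(m.Value);

        var found = new List<DateTime>();
        int i = 0;
        while (i < tokens.Count)
        {
            int consumed = 0;
            // Try the longest window of adjacent tokens first (3, then 2, then 1).
            for (int len = Math.Min(3, tokens.Count - i); len >= 1; len--)
            {
                string candidate = string.Join(" ", tokens.GetRange(i, len));
                if (DateTime.TryParseExact(candidate, Formats, culture,
                                           DateTimeStyles.None, out DateTime date))
                {
                    found.Add(date);
                    consumed = len;
                    break;
                }
            }
            // Skip past a recognized date, or advance by one token otherwise.
            i += Math.Max(consumed, 1);
        }
        return found.ToArray();
    }
}
```

Calling DateScanner.Recognize on the sample text above should yield one DateTime per date-like run of tokens. Because the formats are matched exactly, widening or narrowing what counts as a date is a matter of editing the Formats array.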

I'd go for a hand-rolled solution that chops the input string into manageable pieces and lets a few regexes do the work. This also seems like a great problem for getting started with unit testing.

I'd suggest you go with regex. I'd put each regex (matching one date format) into its own string and collect them in an array, then build the full regex at runtime. This makes the system more flexible. Depending on what you need, you could consider storing the individual date regexes in an (XML) file or a database.
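A minimal sketch of that idea, assuming two hypothetical patterns (numeric dates and month-name dates); the pattern strings and the class name are made up for illustration and will over-match in some cases (for example, any word starting with "Mar" or "May" followed by numbers):

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class DateRegexBuilder
{
    // One pattern per date shape; these could equally be loaded from a file or db.
    private static readonly string[] Patterns =
    {
        // 11/3/63, 11-03-1963, with optional whitespace around the separators
        @"\d{1,2}\s*[/-]\s*\d{1,2}\s*[/-]\s*(?:\d{4}|\d{2})",
        // Nov 3, 1963 / November 03 63, comma optional
        @"(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*\.?\s+\d{1,2}\s*,?\s*(?:\d{4}|\d{2})"
    };

    // Combine the individual patterns into a single alternation at runtime.
    public static Regex Build(IEnumerable<string> patterns)
    {
        string alternation = string.Join("|", patterns.Select(p => "(?:" + p + ")"));
        return new Regex(@"\b(?:" + alternation + @")\b", RegexOptions.IgnoreCase);
    }

    // Returns the raw substrings that look like dates; parsing them is a separate step.
    public static List<string> FindCandidates(string input)
    {
        return Build(Patterns).Matches(input).Cast<Match>().Select(m => m.Value).ToList();
    }
}
```

The matches are only candidates; each one still has to be handed to DateTime.TryParse or similar to get an actual DateTime and to reject things like month 13.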

Recognising dates seems to be a straightforward and easy task for Regex. I cannot understand why you are trying to avoid it.
ANTLR for this case where you have a very limited set of semantics is just overkill.
Performance could be an issue, but I really doubt that the other options would perform any better.
So I would go with Regex.

Related

How to find minimum replacement strings or regex to convert string to another string

OK, the title may not be accurate, but it is the best I could come up with.
My question is this:
Example 1
see , saw
I can convert see to saw
by replacing ee with aw:
string srA = "see";
string srB = "saw";
srA = srA.Replace("ee", "aw");
Or let's say
show , shown
add n to the original string
What I want is to generate such procedures for any pair of compared strings, with a minimum length of code.
How can I do this? Can I generate regexes automatically to apply the conversion?
c# 6

Check DiffPlex and see if it is what you need. If you want to create a custom algorithm instead of using a 3rd party library, just go through the code; it's open source.
You might also want to check this work for optimizations, but it might get complicated.
Then there's also Diff.NET.
Also this blog post is part of a series in implementing a diff tool.
If you're simply interested in learning more about the subject, your googling efforts should be directed to the Levenshtein algorithm.
I can only assume what your end goal is, and the time you're willing to invest in this, but I believe the first library should be enough for most needs.
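For reference, the Levenshtein distance itself is short to write. The sketch below is a plain two-row dynamic programming version (the class and method names are my own); it only returns the number of single-character edits, not the edit script, so it tells you how far apart two strings are rather than which replacements to apply.

```csharp
using System;

public static class Levenshtein
{
    // Minimum number of single-character insertions, deletions,
    // and substitutions needed to turn a into b (case-sensitive).
    public static int Distance(string a, string b)
    {
        var prev = new int[b.Length + 1];
        var curr = new int[b.Length + 1];

        // Turning an empty prefix of a into the first j characters of b takes j insertions.
        for (int j = 0; j <= b.Length; j++) prev[j] = j;

        for (int i = 1; i <= a.Length; i++)
        {
            curr[0] = i; // deleting the first i characters of a
            for (int j = 1; j <= b.Length; j++)
            {
                int cost = a[i - 1] == b[j - 1] ? 0 : 1;
                curr[j] = Math.Min(
                    Math.Min(curr[j - 1] + 1,   // insertion
                             prev[j] + 1),      // deletion
                    prev[j - 1] + cost);        // substitution or match
            }
            // Reuse the two rows instead of allocating a full matrix.
            var tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.Length];
    }
}
```

For the examples in the question, Levenshtein.Distance("see", "saw") is 2 (two substitutions) and Levenshtein.Distance("show", "shown") is 1 (one insertion).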

Methodologies or algorithms for filling in missing data

I am dealing with datasets with missing data and need to be able to fill forward, backward, and gaps. So, for example, if I have data from Jan 1, 2000 to Dec 31, 2010, and some days are missing, when a user requests a timespan that begins before, ends after, or encompasses the missing data points, I need to "fill in" these missing values.
Is there a proper term to refer to this concept of filling in data? Imputation is one term, don't know if it is "the" term for it though.
I presume there are multiple algorithms and methodologies for filling in missing data (using the last measured value, or the median, average, or moving average between 2 known numbers, etc.).
Anyone know the proper term for this problem, any online resources on this topic, or ideally links to open source implementations of some algorithms (C# preferably, but any language would be useful)

The term you're looking for is interpolation. (obligatory wiki link)
You're asking for a C# solution with datasets but you should also consider doing this at the database level like this.
A simple, brute-force approach in C# could be to build an array of consecutive dates with your beginning and ending values as the min/max values. Then use that array to merge "interpolated" date values into your data set by inserting rows where there is no matching date for your date array in the dataset.
Here is an SO post that gets close to what you need: interpolating missing dates with C#. There is no accepted solution, but reading the question and the attempts at answers may give you an idea of what you need to do next. E.g. use the DateTime data in terms of Ticks (long value type) and then use an interpolation scheme on that data. Then convert the interpolated long values back to DateTime values.
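A rough sketch of the brute-force approach, assuming daily data keyed by date at midnight and a "use last measured value" rule (with the earliest known value used for days before the first observation). The class and method names are hypothetical, and the fill rule is just one of the options the question lists.

```csharp
using System;
using System.Collections.Generic;

public static class GapFiller
{
    // Returns one value per calendar day in [start, end].
    // Missing days take the latest known value on or before that day (fill forward);
    // days before the first known value take the earliest known value (fill backward).
    public static SortedDictionary<DateTime, double> FillDaily(
        IDictionary<DateTime, double> known, DateTime start, DateTime end)
    {
        if (known.Count == 0) throw new ArgumentException("No known values to fill from.");

        var sorted = new SortedDictionary<DateTime, double>(known);
        var keys = new List<DateTime>(sorted.Keys); // ascending order
        var result = new SortedDictionary<DateTime, double>();

        int idx = 0; // number of known dates that are <= the current day
        for (DateTime day = start.Date; day <= end.Date; day = day.AddDays(1))
        {
            while (idx < keys.Count && keys[idx] <= day) idx++;
            DateTime source = idx > 0 ? keys[idx - 1] : keys[0];
            result[day] = sorted[source];
        }
        return result;
    }
}
```

With known values of 10 on Jan 2 and 40 on Jan 5, a request for Jan 1 through Jan 6 comes back as 10, 10, 10, 10, 40, 40.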

The algorithm you use will depend a lot on the data itself, the size of the gaps compared to the available data, and its predictability based on existing data. It could also incorporate other information you might know about what's missing, as is common in statistics, when your actual data may not reflect the same distribution as the universe across certain categories.
Linear and cubic interpolation are typical algorithms that are not difficult to implement; try googling those.
Here's a good primer with some code:
http://paulbourke.net/miscellaneous/interpolation/
The context of the discussion in that link is graphics but the concepts are universally applicable.
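Linear interpolation over dates is only a few lines if you work in Ticks. The helper below is a sketch with hypothetical names; it assumes you already know the two measured points on either side of the gap.

```csharp
using System;

public static class TimeSeriesInterpolation
{
    // Linearly interpolates the value at time t between the known points
    // (t0, v0) and (t1, v1). Times are compared via their tick counts.
    public static double Linear(DateTime t0, double v0, DateTime t1, double v1, DateTime t)
    {
        if (t1 == t0) throw new ArgumentException("The two known points must have different times.");

        // Fraction of the way from t0 to t1 (0 at t0, 1 at t1).
        double fraction = (double)(t - t0).Ticks / (t1 - t0).Ticks;
        return v0 + fraction * (v1 - v0);
    }
}
```

For example, with 10 measured on Jan 2 and 40 on Jan 5, the estimate for Jan 3 is one third of the way between them, i.e. about 20. Values of t outside [t0, t1] extrapolate along the same line, which may or may not be what you want.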

For the purpose of feeding statistical tests, a good search term is imputation - e.g. http://en.wikipedia.org/wiki/Imputation_%28statistics%29

Approximate string matching

I know this question has been asked many times.
I want a suggestion on which algorithm is suitable for approximate string matching.
The application is specifically for company name matching only and nothing else.
The biggest challenge is probably the suffix at the end of the company name and abbreviated names.
Example:
1. companyA pty ltd vs companyA pty. ltd. vs companyA
2. WES Engineering vs W.E.S. Engineering (extremely rare occurrence)
Do you think Levenshtein Edit Distance is adequate?
I'm using C#
Regards,
Max

There are various string distance metrics you could use.
I would recommend Jaro-Winkler. Unlike edit-distance where the result of a comparison is in discrete units of edits, JW gives you a 0-1 score. It is especially suited for proper names. Also look at this nice tutorial and this SO question.
I haven't worked with C# but here are some implementations of JW I found online:
Impl 1 (They have a DOT NET version too if you look at the file list)
Impl 2
If you want to do a bit more sophisticated matching, you can try to do some custom normalization of word forms commonly occurring in company names such as ltd/limited, inc/incorporated, corp/corporation to account for case insensitivity, abbreviations etc. This way if you compute
distance (normalize("foo corp."),
normalize("FOO CORPORATION") )
you should get the result to be 0 rather than 14 (which is what you would get if you computed levenshtein edit-distance).
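Since I haven't worked with C#, take this as a sketch only: a normalize step along those lines, assuming a small hand-written abbreviation table (the entries and names here are illustrative, not a complete list) and ASCII-only names.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class CompanyName
{
    // Hypothetical abbreviation table; extend with whatever forms occur in your data.
    private static readonly Dictionary<string, string> Canonical =
        new Dictionary<string, string>
        {
            { "ltd", "limited" },
            { "inc", "incorporated" },
            { "corp", "corporation" },
            { "pty", "proprietary" }
        };

    public static string Normalize(string name)
    {
        // Case-insensitive comparison: lower-case everything first.
        string lowered = name.ToLowerInvariant();
        // Drop punctuation so "pty." equals "pty" and "w.e.s." equals "wes".
        // Note: this also drops non-ASCII letters, which is an assumption of this sketch.
        string stripped = Regex.Replace(lowered, @"[^a-z0-9\s]", "");
        // Split on any whitespace and expand known abbreviations.
        IEnumerable<string> words = stripped
            .Split((char[])null, StringSplitOptions.RemoveEmptyEntries)
            .Select(w => Canonical.TryGetValue(w, out var full) ? full : w);
        return string.Join(" ", words);
    }
}
```

Both "foo corp." and "FOO CORPORATION" normalize to "foo corporation", so any distance metric applied afterwards sees identical strings.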

Yes, Levenshtein distance is suitable for this. It will work for all those you have listed at least.
You could also possibly use Soundex, but I don't think you'll need it.

In these simple examples, just removing all non-alpha-numeric characters gives you a match, and is the easiest to do as you can pre-compute the data on each side, then do a straight equals match which will be a lot faster than cross multiplying and calculating the edit distance.

I have provided my answer already in another question.
https://stackoverflow.com/a/30120166/2282794
I have worked on a really large-scale system with name matching requirements similar to the ones you describe.
Name matching is not straightforward, and the order of first and last names might differ.
Simple fuzzy name matching algorithms fail miserably in such scenarios.
If we just want to talk about approximate string matching algorithms, there are many. A few of them are: Jaro-Winkler, edit distance (Levenshtein), Jaccard similarity, Soundex/phonetics-based algorithms, etc. A quick search will give you all the details.
You can implement all of them in C#.
The irony is that they work well when you match two given input strings, which is fine in theory and for demonstrating how fuzzy or approximate string matching works.
However, the grossly understated point is how to use them in production settings. Not everybody I know who was scouting for an approximate string matching algorithm knew how to apply it in a production environment.
In my other answer I talked about Lucene, which is specific to Java, but there is also Lucene for .NET.
https://lucenenet.apache.org/

Should we store format strings in resources?

For the project that I'm currently on, I have to deliver specially formatted strings to a 3rd party service for processing. And so I'm building up the strings like so:
string someString = string.Format("{0}{1}{2}: Some message. Some percentage: {3}%", token1, token2, token3, number);
Rather than hardcode the string, I was thinking of moving it into the project resources:
string someString = string.Format(Properties.Resources.SomeString, token1, token2, token3, number);
The second option is, in my opinion, not as readable as the first one, i.e. the person reading the code would have to pull up the string resources to work out what the final result should look like.
How do I get around this? Is the hardcoded format string a necessary evil in this case?

I do think this is a necessary evil, one I've used frequently. Something smelly that I do, is:
// "{0}{1}{2}: Some message. Some percentage: {3}%"
string someString = string.Format(Properties.Resources.SomeString
,token1, token2, token3, number);
..at least until the code is stable enough that I might be embarrassed having that seen by others.

There are several reasons that you would want to do this, but the only great reason is if you are going to localize your application into another language.
If you are using resource strings there are a couple of things to keep in mind.
Include format strings whenever possible in the set of resource strings you want localized. This will allow the translator to reorder the position of the formatted items to make them fit better in the context of the translated text.
Avoid having strings in your format tokens that are in your language. It is better to use these for numbers. For instance, the message:
"The value you specified must be between {0} and {1}"
is great if {0} and {1} are numbers like 5 and 10. If you are formatting in strings like "five" and "ten" this is going to make localization difficult.
You can get around the readability problem you are talking about by simply naming your resources well.
string someString = string.Format(Properties.Resources.IntegerRangeError, minValue, maxValue );
Evaluate whether you are generating user-visible strings at the right abstraction level in your code. In general I tend to group all the user-visible strings in the code as close to the user interface as possible. If some low-level file I/O code needs to report errors, it should do so with exceptions, which you handle in your application and provide consistent error messages for. This will also consolidate all of your strings that require localization instead of having them peppered throughout your code.
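To illustrate the point about reordering: because the placeholders are numbered, a translated template can put the arguments in a different order without any change to the calling code. In this sketch the two templates are passed in as plain strings and the second one is only a stand-in for a real translation; in a real project they would come from your resource files.

```csharp
using System.Globalization;

public static class ProgressMessage
{
    // The caller always passes (done, total) in the same order;
    // the template alone decides where each value appears.
    public static string Render(string template, int done, int total)
    {
        return string.Format(CultureInfo.InvariantCulture, template, done, total);
    }
}
```

Render("{0} of {1} files copied", 3, 10) gives "3 of 10 files copied", while Render("Out of {1} files, {0} were copied", 3, 10) gives "Out of 10 files, 3 were copied".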

One thing you can do to help add hard coded strings or even speed up adding strings to a resource file is to use CodeRush Xpress which you can download for free here: http://www.devexpress.com/Products/Visual_Studio_Add-in/CodeRushX/
Once you write your string you can access the CodeRush menu and extract to a resource file in a single step. Very nice.
Resharper has similar functionality.

I don't see why including the format string in the program is a bad thing. Unlike traditional undocumented magic numbers, it is quite obvious what it does at first glance. Of course, if you are using the format string in multiple places it should definitely be stored in an appropriate read-only variable to avoid redundancy.
I agree that keeping it in the resources is unnecessary indirection here. A possible exception would be if your program needs to be localized, and you are localizing through resource files.

Yes, you can.
Now let's see how:
String.Format(Resource_en.PhoneNumberForEmployeeAlreadyExist,letterForm.EmployeeName[i])
This gives me a dynamic message every time.
By the way, I'm using ResXManager.

System constant for the number of days in a week (7)

Can anyone find a constant in the .NET framework that defines the number of days in a week (7)?
DateTime.DaysInAWeek // Something like this???
Of course I can define my own, but I'd rather not if it's somewhere in there already.
Update:
I am looking for this because I need to allow the user to select a week (by date, rather than week number) from a list in a DropDownList.

You could probably use System.Globalization.DateTimeFormatInfo.CurrentInfo.DayNames.Length.

I think it's OK to hardcode this one. I don't think it will change any time soon.
Edit: It depends on where you want to use this constant. Inside a calendar-related algorithm it is obvious what 7 means. On the other hand, a named constant sometimes makes code much more readable.

Try this:
Enum.GetNames(typeof(System.DayOfWeek)).Length

If you look at the IL code for Calendar.AddWeeks you will see that Microsoft itself uses a hardcoded 7 in the code.
Also the rotor source uses a hardcoded 7.
Still, I would suggest using a const.

I used this:
public static readonly int WeekNumberOfDays = Enum.GetNames(typeof(DayOfWeek)).Length;

I don't believe there is one. TimeSpan defines constants for the number of ticks per milli/second/minute/hour/day, but nothing at the level of a week.
I ran a query across the standard libraries for symbols (methods/constants/fields/etc) containing the word 'Week'. Nothing came back. FYI, I ran this query using ReSharper.
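If you do end up defining your own, it fits naturally next to the existing TimeSpan constants. A small sketch (the class, the constant names, and the StartOfWeek helper are my own, not part of the framework), with a helper for the "select a week by date" case from the question:

```csharp
using System;

public static class WeekMath
{
    public const int DaysPerWeek = 7;

    // Built on the framework's own TimeSpan.TicksPerDay constant.
    public const long TicksPerWeek = TimeSpan.TicksPerDay * DaysPerWeek;

    // Returns midnight on the first day of the week containing `date`.
    // Monday as the first day is an assumption; pass another DayOfWeek if needed.
    public static DateTime StartOfWeek(DateTime date, DayOfWeek firstDay = DayOfWeek.Monday)
    {
        int offset = ((int)date.DayOfWeek - (int)firstDay + DaysPerWeek) % DaysPerWeek;
        return date.Date.AddDays(-offset);
    }
}
```

For a DropDownList of weeks you could start from WeekMath.StartOfWeek(someDate) and keep calling AddDays(WeekMath.DaysPerWeek) to get each following week's start date.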

I'm not sure what exactly you're looking for, but you can try DateHelper (CODE.MSDN). It's a library I put together for typical date needs. You might be able to use the week methods or the List methods. method list
Edit - no more MSDN code, now on GitHub as part of lib: https://github.com/tbasallo/toolshed

Do you mean calendar weeks or just common weeks?
Obviously, there are calendar weeks that might be shorter than seven days. The last calendar week of the year is usually shorter, and depending on your definition of calendar week, the first week might be shorter as well.
In that case, I'm afraid you will have to roll your own week length function. It's not really hard to do with the DateTime class. I have done it before; if you need more help, let me know.

GregorianCalendar has AddWeeks, which adds 7 days to a date for each week added, e.g. AddWeeks(date, 1).
