First of all, I'd like to mention that I'm new to programming and to this site, so I'm still an infant in this world. However, I have a problem.
I have to make code that can compare two strings but the second string (from a file) will have unique identifiers within it. For example:
first string:
I have 10 cats and their fur is #000000
Second string from a file:
I have <d> cats and their fur is <h>
Although I probably don't need to explain, 'd' is for numbers or decimal and 'h' for hex. There are also 's' and 'a' associated to ASCII.
What's supposed to happen is that the first string can contain any number (of any length) and/or any hex value when the data comes in, while the rest of the message stays the same, e.g.
I have 1500 cats and their fur is #000000
The code should still treat the two strings as a match, as it effectively ignores anything that is an int or a hex value. (These identifiers are user-defined, so they can appear anywhere in any string.)
The end game is that if it finds a match, the code will change the colour of the text in the app, among other things. It's basically to highlight errors in a log file.
I've searched high and low on Stack Overflow and looked into regex and string comparisons. I'm about to make a start on the code, but would like some input/help first.
Obviously I'm not asking for something to be written for me, just to be pointed in the right direction so I can learn.
Many thanks in advance! And apologies if there is a similar post out there, but alas I couldn't find it if there is.
If I understand correctly, I would solve this by replacing the <d> etc. with a regular expression, then using that regex to replace the values with an empty string. That way you can compare the strings without the values.
Hope that makes sense. I didn't include any code because you asked for just some directions.
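For later readers, here is a minimal sketch of a closely related variant: instead of stripping the values and comparing, it turns the template itself into one anchored regex and tests the incoming line against it. The class name TemplateMatcher and the regex chosen for each placeholder are assumptions for illustration (in particular, <s> and <a> are treated as "any run of characters", since the question does not define them precisely).

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TemplateMatcher
{
    // Assumed meaning of each placeholder; adjust to the real definitions.
    private static readonly Dictionary<string, string> Placeholders =
        new Dictionary<string, string>
        {
            ["d"] = @"\d+(?:\.\d+)?",   // integer or decimal number
            ["h"] = @"#?[0-9A-Fa-f]+",  // hex value with optional leading '#'
            ["s"] = @".+?",             // assumption: any text
            ["a"] = @".+?",             // assumption: any text
        };

    public static Regex ToRegex(string template)
    {
        // Escape the literal text first. '<' and '>' are not regex
        // metacharacters, so the placeholders survive Regex.Escape.
        string escaped = Regex.Escape(template);
        string pattern = Regex.Replace(
            escaped, "<([dhsa])>", m => Placeholders[m.Groups[1].Value]);
        return new Regex("^" + pattern + "$");
    }

    public static bool Matches(string template, string line)
    {
        return ToRegex(template).IsMatch(line);
    }
}
```

Calling TemplateMatcher.Matches("I have <d> cats and their fur is <h>", "I have 1500 cats and their fur is #000000") should report a match, while a line with different literal text should not.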
I am currently working on a little dictionary app for Korean in C# (which I am trying to learn). I would like to add a feature where a conjugation chart is given with all basic verb forms for a certain verb. To ensure the verbs are conjugated correctly I have to check whether a verb is irregular. To do this I have to check whether a verb stem ends with a certain character or not.
The problem is, however, that a computer sees an entire syllable of a Korean word as a character, not the individual 2 or 3 letters that form that specific syllable, but I need to compare the final letter of a syllable to do it correctly.
For example the Korean verb 춥다 is an irregular verb and we can tell because the verb stem 춥 has ㅂ as the final letter. Yet 춥 is the char, not ㅂ in the case of the verb stem. So this does not work:
verbStem = "춥";
verbStem.EndsWith("ㅂ");
I am currently a bit puzzled on how to make this work and thus I would be quite happy if I could get some directions.
Using the popular Korean Q&A service 지식IN (link to original answer) I was able to find the answer to my question. I am so grateful to.
The first step is to separate the individual letters by normalizing the string. This is done using the Normalize method:
string a = "안녕";
string b = a.Normalize( System.Text.NormalizationForm.FormKD);
When the Normalize method is used on a Korean string, each syllable is split into its individual component Unicode characters.
However, the extremely helpful answer at 지식IN did not stop there. It pointed out that, even after the split, a letter has a different Unicode code point depending on whether it is in the initial position or not, so I have to compare against the appropriate one ('ᄋ' is different from 'ᆼ'). The code points are listed at Hangul Jamo (Unicode block).
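Putting the two points together, a minimal sketch of the check could look like this. The class and member names (HangulJamo, EndsWithFinal, FinalPieup) are made up for illustration; U+11B8 is the final-position form of ㅂ in the Hangul Jamo block, which is not the same character as the standalone ㅂ typed in the question.

```csharp
using System;
using System.Text;

public static class HangulJamo
{
    // U+11B8 HANGUL JONGSEONG PIEUP: the final-position (batchim) form of ㅂ.
    public const char FinalPieup = '\u11B8';

    // True when the last syllable of 'stem' ends with the given final jamo.
    public static bool EndsWithFinal(string stem, char finalJamo)
    {
        if (string.IsNullOrEmpty(stem))
        {
            return false;
        }

        // Decompose each syllable block into its component jamo.
        string decomposed = stem.Normalize(NormalizationForm.FormKD);
        return decomposed[decomposed.Length - 1] == finalJamo;
    }
}
```

With this, HangulJamo.EndsWithFinal("춥", HangulJamo.FinalPieup) should be true, whereas a syllable with no final consonant such as "가" should give false.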
I am so glad someone managed to answer this question for me, but I felt I ought to write out the answer at Stackoverflow as well since you might never know someone else might want to learn how to do something similar.
I'm currently working on a program that loads up a text file, searches through it to find a specific structure, and then replaces a certain part of that structure with different text.
The structure I need to find and extract is "N"(N) where N is any character. For example. Lets say I had a text file like this:
Everyone knows the saying "Do not do more than you can do" (Jim Doe).
Well, I'm here to tell you that this saying is awesome. Here is
another, "The sky is blue and the sun is bright" (Sally Wantsmore).
I would want to be able to match the text "Do not do more than you can do" (Jim Doe) along with "The sky is blue and the sun is bright" (Sally Wantsmore).
To the best of my knowledge, there isn't really a way to do this with a regular expression. I've been trying for the last few days and can't get it to work; it's a recursive pattern by nature. My question is: how would I go about writing C# code to parse through the text and find these patterns? I would like to find the position within the string and the length, so that I can then extract the match from the string.
EDIT
I need to be able to capture all characters in the quote. This means that there could also be another set of quotes within the quote and even another set of parentheses. This means that the structure could also contain a match within itself.
I'm now trying to use this expression because I need to be able to capture all characters within a quote: \"(.+)\" \(([\w ]+)\)
The listed answers below both work. However, I've discovered a limitation. There is a possible recursive structure to this. The problem I am currently having is when there is a "N"(N) inside of a "N"(N). For example:
"Random quote" (random person) Here is a fun saying, "The sky is blue and
the sun is bright, some even say "really bright" (others)" (Sally
Wantsmore).
This presents many problems. There is only one match, because the expression takes the very first ", then finds the last " just after (others), then finds the set of parentheses for (Sally Wantsmore), and reports only that match. However, I want it to find all the matches: the first one and the last one separately, and even the inner quote. Is this possible with regular expressions? If not, how do I go about solving this with recursive C# code?
The following regex should find the two things you're looking for:
\"([\w ]+)\" \(([\w ]+)\)
In C# you can use Regex.Match to retrieve the two items in brackets.
An example on how you could have it in C#:
var quotes = Regex.Matches(@"Everyone knows the saying ""Do not do more than you can do"" (Jim Doe). Well, I'm here to tell you that this saying is awesome. Here is another, ""The sky is blue and the sun is bright"" (Sally Wantsmore).",
"(?<Quotes>\"(?<Text>[\\w ]+)\\\" \\((?<Author>[\\w ]+)\\))", RegexOptions.Singleline);
foreach (Match quote in quotes)
{
var text = quote.Groups["Text"].Value;
var author = quote.Groups["Author"].Value;
Console.WriteLine($"Text: {text}, Author: {author}");
}
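Since the question also asks for the position and length of each match, here is a small sketch that reports them through Match.Index and Match.Length. The QuoteFinder class is hypothetical, and the pattern is a variation on the ones above: it accepts any character except a double quote inside the quote, and any character except parentheses inside the brackets.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class QuoteFinder
{
    // Text: anything except a double quote. Author: anything except parentheses.
    private static readonly Regex Pattern =
        new Regex("\"(?<Text>[^\"]+)\" \\((?<Author>[^()]+)\\)");

    public static List<(int Index, int Length, string Text, string Author)> Find(string input)
    {
        var results = new List<(int Index, int Length, string Text, string Author)>();
        foreach (Match m in Pattern.Matches(input))
        {
            // Index and Length locate the whole "quote" (author) span in the input.
            results.Add((m.Index, m.Length, m.Groups["Text"].Value, m.Groups["Author"].Value));
        }
        return results;
    }
}
```

Because the quote text may not itself contain a double quote, this finds an innermost quote such as "really bright" (others) but not an outer quote that wraps it, so it does not solve the nested case from the edit on its own.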
I'm creating a program that reads a scanned handwritten document and converts it to text. The recognized words must come from a dictionary of about 300 words that I create. As an example, if the handwritten word is recognized as "heilo", but my dictionary only contains "hello" and "world", it should convert it to "hello". However, if it recognized it as "planet", it shouldn't match it to anything. I think a possible approach would be to create a score of how closely the recognized word matches each word in the dictionary. If it doesn't get a minimum score, then no match is found.
I'm writing the application in C#. Are there any libraries/examples available that do something like this, or would I have to code everything from scratch?
Thanks
There is nothing in the standard libraries to compute the distance between words, but there are plenty of examples you can find on the internet: look up "edit distance" or "Levenshtein distance". The idea is to measure similarity as the number of single-character changes needed to turn the first string into the second. The distance between "heil" and "hello" is 2, because you need to replace "i" with "l" (first edit), and then append an "o" (the second edit).
When looking for an implementation or implementing your own, avoid the trivial implementation with a 2D array, because it's not memory-efficient. Use the modification with O(min(m,n)) memory requirements instead of the "naive" O(m*n).
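As a rough illustration of the memory-saving variant, the sketch below keeps only two rows of the table, sized by the shorter string. WordMatcher, Distance, ClosestMatch and the maxDistance threshold are names invented for this example; the threshold value that works for real handwriting output is something you would have to tune.

```csharp
using System;
using System.Collections.Generic;

public static class WordMatcher
{
    // Levenshtein distance using two rows of length min(m, n) + 1.
    public static int Distance(string a, string b)
    {
        if (a.Length < b.Length)
        {
            (a, b) = (b, a); // make b the shorter string
        }

        var prev = new int[b.Length + 1];
        var curr = new int[b.Length + 1];
        for (int j = 0; j <= b.Length; j++)
        {
            prev[j] = j; // distance from the empty prefix of a
        }

        for (int i = 1; i <= a.Length; i++)
        {
            curr[0] = i;
            for (int j = 1; j <= b.Length; j++)
            {
                int cost = a[i - 1] == b[j - 1] ? 0 : 1;
                curr[j] = Math.Min(
                    Math.Min(curr[j - 1] + 1, prev[j] + 1), // insert, delete
                    prev[j - 1] + cost);                    // substitute
            }
            (prev, curr) = (curr, prev); // reuse the two buffers
        }
        return prev[b.Length];
    }

    // Returns the closest dictionary word, or null if none is within maxDistance.
    public static string ClosestMatch(string word, IEnumerable<string> dictionary, int maxDistance)
    {
        string best = null;
        int bestDistance = maxDistance + 1;
        foreach (string candidate in dictionary)
        {
            int d = Distance(word, candidate);
            if (d < bestDistance)
            {
                best = candidate;
                bestDistance = d;
            }
        }
        return best;
    }
}
```

With a dictionary of "hello" and "world" and a maxDistance of 2, "heilo" should resolve to "hello", while "planet" is too far from both words and should give null.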
I have no lib at hand to do what you need but searching the web knowing that you want to calculate the Levenshtein Distance might help you in your search.
Perhaps you should start with a spell checker - there are a number of libraries available that do this.
There are a few c# snippets online that will get the ball rolling:
Levenshtein:
http://www.dotnetperls.com/levenshtein
Boyer-Moore:
http://www-igm.univ-mlv.fr/~lecroq/string/node15.html#SECTION00150
Based on those, you can easily implement your own Word Matcher module.
I have a digital check scanner that is able to capture the MICR line from the check. It will return the MICR line in raw format as a string, with delimiters to separate the account number, routing number, and check number. However, each bank formats this MICR line differently, so there's no standard way to parse this data.
Some companies I have tried are Inlite Research Inc and Accusoft Pegasus. The API from Inlite Research works for some banks, but cannot read Bank of America checks correctly. I'm still testing out the API from Accusoft.
What I am asking is whether anyone knows of an API that will accurately parse the MICR line into its different components. Is there an API that will let me add new check format definitions if I encounter a check that the API cannot handle correctly? Or does anyone know how to write, or has anyone written, a routine to parse the MICR line?
I would appreciate any help I can get. Thank you.
Sorry for the late reply. I didn't see any answers to the question so I thought nobody responded.
To answer the questions above, I found a solution after thinking the problem over and talking with various vendors. The check scanner that I'm using is already able to read the MICR line. The problem lies in parsing the MICR line for relevant information such as the routing transit number, account number, check/serial number, and amount (if there is one). After speaking with a handful of 3rd party companies and trying out the available trial versions of MICR parsers, I came to the conclusion that there is no universal parser out there. I'm still faced with the problem of the non-conforming On-Us field. Each bank formats this field differently. Sometimes the symbols are arranged differently as well. So, I decided to write my own parser. I think this is the most logical way to proceed, as I've been informed by these 3rd party vendors that they each roll their own parsing software.
The way I wrote the parser was I kept a table of MICR line patterns. Each time I encounter a new MICR line format, I will update this table. My parser will match any check scanned against this table and if it finds a match, it will use that pattern to parse the relevant information.
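To give a rough idea of what such a table could look like, here is a minimal sketch using regular expressions with named groups. Everything specific in it is an assumption: the raw format (with 'T' standing for the transit symbol and 'U' for the on-us symbol), the two sample patterns, and the names MicrFields and MicrPatternTable are hypothetical and would need to be replaced with whatever your scanner actually returns.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public sealed class MicrFields
{
    public string Routing = "";
    public string Account = "";
    public string CheckNumber = "";
}

public static class MicrPatternTable
{
    // Hypothetical raw format: 'T' = transit symbol, 'U' = on-us symbol.
    private static readonly List<Regex> Patterns = new List<Regex>
    {
        // Assumed business layout: U<check>U T<routing>T <account>U
        new Regex(@"^U(?<check>\d+)U\s*T(?<routing>\d{9})T\s*(?<account>[\d -]+?)U$"),
        // Assumed personal layout: T<routing>T <account>U <check>
        new Regex(@"^T(?<routing>\d{9})T\s*(?<account>[\d -]+?)U\s*(?<check>\d+)$"),
    };

    // Called whenever a new, previously unseen format is encountered.
    public static void AddPattern(string pattern)
    {
        Patterns.Add(new Regex(pattern));
    }

    // Returns the parsed fields from the first matching pattern, or null.
    public static MicrFields Parse(string micr)
    {
        string line = micr.Trim();
        foreach (Regex pattern in Patterns)
        {
            Match m = pattern.Match(line);
            if (m.Success)
            {
                return new MicrFields
                {
                    Routing = m.Groups["routing"].Value,
                    Account = m.Groups["account"].Value.Trim(),
                    CheckNumber = m.Groups["check"].Value,
                };
            }
        }
        return null; // unknown format: time to add a new pattern
    }
}
```

A null result is the signal to inspect the line and register a new pattern with AddPattern.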
I hope my experience and the solution I came up with will also help those who ran across the same issue.
Thank you for all those who responded and good luck.
The basic pattern of a MICR:
xxxxxxxxxxx /rrrrrrrrr/ ooooooooooo baaaaaaaaaab
where 'x' is AuxOnUs, 'r' is the routing number, 'o' is OnUs, and 'a' is the amount; 'b' and '/' are special MICR symbols.
A minimal MICR line is just:
/rrrrrrrrr/ ooooooooo
AuxOnUs is generally only used by business checks, and it pretty much always means there is a serial number.
Routing number is always consistent, it's the only part of the MICR that is universal.
Amount is generally not encoded in the MICR, but sometimes it is.
OnUs is the tricky part. It normally consists of the check serial number and the account, but each bank handles it differently. Usually the serial number will be 4 digits, but it may be 5 or more. If there's an AuxOnUs field, you can be pretty sure the OnUs is just the account number.
The OnUs can contain spaces and dashes. It would be nice if there were a consistent way they were divided, but I've seen so many variations, I think it's better to just leave it as an "OnUs" field instead of separating it into serial and account, unless you're the paying bank, in which case you should know what format your own checks are.
This should be the correct answer based on my research as well. MICR patterns are too varied to reliably parse without having a collection of regex matching patterns to pull the relevant information. What would be nice is to see the collection of regex patterns you have come up with with group names such as:
<(?<checkNumber>[0-9\s]*)<[0-9\s]*:[0-9\s]*:.*
It is 6 years after this question was originally asked, and I have run across it numerous times in the past 2 weeks. I finally found an ACTUAL solution for how to properly parse a MICR line. I've written some code to do so and it works on 99.9% of the checks I've scanned thus far, so I have to share it and make sure people understand how this should be done.
For 11 years I have done this job. We have always used Magtek check scanners. Recently I decided to move to an imaging scanner so we could get scans of all our checks. I went with Panini check scanners. Unfortunately, their API doesn't break apart the MICR line, but our Magtek scanners were programmable to give us whatever we wanted. I created a basic string that could be matched with a pattern every time. It would always come out as: <aaaaaaaaa/bbbbbbbb/ccc> where a is route number, b is account number, and c is check number. Over and over I keep wondering how the scanner, just a simple serial device, can figure it out and get it right EVERY SINGLE TIME for a decade.
I started by using Patrick's own answer, sort of, to build a table of MICR patterns I hadn't seen before. Problem is that I ran to a point where one pattern would get a close match to another check and the data would be off slightly. I then tried doing it based on route number until I ran across two checks from BofA that had identical route numbers and completely different MICR lines. I was so disappointed that my face met my desk in frustration.
After much more research, the proper way is right-to-left parsing of the MICR line. MICR lines are read right-to-left, and of course the field giving us the most trouble is the on-us field. All my example snippets are C# code.
Start by looping through the string backwards:
for (int i = micr.Length - 1; i >= 0; i--)
Evaluate each character as you loop. If your first character is the amount character, it's a business check. Read until you get another amount character, then save that value. If the next character is the on-us symbol, assume that the check number is at the far left of the on-us field. If the next character is a digit, keep reading and filling a buffer (REMEMBER YOU ARE WORKING BACKWARDS!) with the digits until you reach the on-us character. If your buffer contains only digits, that's your check number. If it's empty, just move on and collect the entire on-us field in a buffer until you reach the transit character. Once you reach the transit character, keep reading and filling your buffer until you reach the next transit character. Your buffer is now your routing number. If it's a business check, you still have more characters to read. Keep reading until you reach ANOTHER on-us character. You've now reached the auxiliary on-us field, which should be the check number. Read until you reach the next on-us character and that should be the end of your string. You now have your check number.
Now, look at the value you stripped from the regular on-us field. If you have a check number, then that's your account number. If you DO NOT have a check number, then you should split the on-us field by spaces, and assume that your far left set (array element 0) of digits are your check number. HOWEVER, if after splitting by space you only have ONE element in the array, that means the on-us field likely contains dashes separating the items. Split the on-us field by dashes and assume that your far left array element is the check number and the rest are your account number. I've seen some that have as many as 3 dashes in the on-us field, like this: nnnn-1234-56-7, where nnnn is the check number and the rest is the account number.
Once you've got your account number separated from check number, strip any miscellaneous characters (spaces, dashes, etc.) from it and you're done.
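The on-us splitting step from the last two paragraphs could be sketched roughly as follows. OnUsSplitter is a made-up name, and the fallback for a field with neither spaces nor dashes (treat everything as the account number) is my own assumption, since that case isn't covered above.

```csharp
using System;
using System.Linq;

public static class OnUsSplitter
{
    // Splits an on-us field that still contains both check and account numbers.
    public static (string CheckNumber, string AccountNumber) Split(string onUs)
    {
        // First try spaces; the far-left element is assumed to be the check number.
        string[] parts = onUs.Trim().Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);

        // Only one element: the items are probably separated by dashes instead.
        if (parts.Length == 1)
        {
            parts = parts[0].Split(new[] { '-' }, StringSplitOptions.RemoveEmptyEntries);
        }

        // Assumption: with nothing to split on, treat the field as account only.
        if (parts.Length < 2)
        {
            return ("", DigitsOnly(onUs));
        }

        // Everything after the first element is the account number,
        // with spaces, dashes and other stray characters removed.
        return (parts[0], DigitsOnly(string.Concat(parts.Skip(1))));
    }

    private static string DigitsOnly(string s)
    {
        return new string(s.Where(c => c >= '0' && c <= '9').ToArray());
    }
}
```

For an input shaped like the nnnn-1234-56-7 example, say "1001-1234-56-7", this should give check number "1001" and account number "1234567".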
This is my solution to all my MICR problems. Hopefully it helps someone else.
Thanks goes, partially, to this document: http://www.transact-tech.com/uploads/printers/files/100-9094-Rev-C-MICR-Programmers-Guide.pdf
I'd like to give users the ability to search through a large list of businesses, but still find near matches.
Does anyone have any recommendations on how best to go about this when you're not targeting simple dictionary words, but instead complex names like ABC Business Name?
Regards.
Check out the Wikipedia article on Levenshtein distance. It's a fairly simple concept to wrap your head around, and it is pretty easy to implement in whichever language you are using (in your case, C#).
I found an example in C# for you here.
Also, here is an example of a spelling corrector from Peter Norvig of Google. It was said on the SO podcast a few episodes ago that Jon Skeet attempted a rewrite of this same algorithm in C#. Not sure if he completed it and/or made it publicly available though.
Consider using keyword matching together with edit-distance-based similarity. You might also combine this with a mapping from what was 'originally searched' to what was 'actually clicked'.
This is probably a crazy solution but could you split the business name by space and then search either all the items or maybe the first couple.
So you might search on 'ABC' and 'Business' but leave out 'Name' as this might take too long.
You might even check to see if the string is of a certain length, then trim and just search on the first say 5 letters.
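A minimal sketch of the token idea, assuming the business names are already loaded in memory: it splits the query on spaces and keeps any business that contains one of the first couple of tokens, ignoring case. BusinessSearch and the maxTokens parameter are names invented for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class BusinessSearch
{
    // Returns businesses containing any of the first 'maxTokens' query words.
    public static List<string> Search(IEnumerable<string> businesses, string query, int maxTokens = 2)
    {
        List<string> tokens = query
            .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
            .Take(maxTokens) // e.g. search 'ABC' and 'Business', skip 'Name'
            .ToList();

        return businesses
            .Where(b => tokens.Any(t => b.IndexOf(t, StringComparison.OrdinalIgnoreCase) >= 0))
            .ToList();
    }
}
```

Searching for "ABC Business Name" would then return both "ABC Business Name" and something like "Abc Plumbing", but not "XYZ Corp".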
Have you had a look at "soundex" as a way of searching through your businesses? Again, I think you'd need to split the name by space.
You might check out the SQL Server SOUNDEX and DIFFERENCE functions. SOUNDEX converts a sequence of characters (such as a word) into a 4-character code which will be the same for similar-sounding words. DIFFERENCE gives a number which represents how "different" two strings are based on sound.
You could, for example, create a computed column based on the SOUNDEX function and match on that column later. Or you could use DIFFERENCE in a WHERE clause.