How to split a string into meaningful words - c#

I want to split a string into its possible word sequences. What approach should I follow?
Given string: thisisapineapple
Solution 1: this is a pineapple
Solution 2: this is a pine apple
Please suggest and explain possible algorithms for getting the solutions above.
Thanks :)

To answer your question, the Knuth-Morris-Pratt algorithm is powerful and not terribly difficult to implement.
Use strings from /usr/share/dict/words or /usr/dict/words as the patterns.

You need a scannerless GLR parser. Such parsers can handle words run together like this and can return ambiguous results. My own NLP library (AboditNLP) does this. WordNet is a good source for the words.
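Neither answer shows code, so here is a minimal sketch of a simpler technique than the ones named above: plain dictionary lookup with backtracking and memoization. It assumes the word list fits in a HashSet<string>; the names WordSplitter and Split are made up for illustration.

```csharp
using System;
using System.Collections.Generic;

public static class WordSplitter
{
    // Returns every way to split `text` into words found in `dictionary`.
    public static List<string> Split(string text, ISet<string> dictionary)
    {
        return SplitFrom(text, 0, dictionary, new Dictionary<int, List<string>>());
    }

    private static List<string> SplitFrom(
        string text, int start, ISet<string> dictionary,
        Dictionary<int, List<string>> memo)
    {
        // Reuse the answer if this suffix was already solved.
        if (memo.TryGetValue(start, out var cached)) return cached;

        var results = new List<string>();
        if (start == text.Length)
        {
            results.Add(string.Empty); // exactly one way to split the empty suffix
        }
        else
        {
            for (int end = start + 1; end <= text.Length; end++)
            {
                string word = text.Substring(start, end - start);
                if (!dictionary.Contains(word)) continue;

                // Combine this word with every split of the remaining text.
                foreach (string rest in SplitFrom(text, end, dictionary, memo))
                    results.Add(rest.Length == 0 ? word : word + " " + rest);
            }
        }

        memo[start] = results;
        return results;
    }
}
```

With a dictionary containing this, is, a, pine, apple and pineapple, Split("thisisapineapple", dictionary) yields both "this is a pine apple" and "this is a pineapple". An input with no valid split yields an empty list.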

Related

Anomaly in text

Let me explain with an example.
We have the following text:
"Comme Il Faut was founded in 1927. The tobacco company is most well known for its reputation of producing customized private label brands for its partners worldwide".
This is normal text. But the following text:
"CommeIlFautwasfounded in 1927. The tobacco companyi most wellknown foritsreputation of producing customizedprivatelabelbrands foritspartners worldwide"
This text contains anomalies: typos, words run together without spaces, and maybe other things.
How can I search for such anomalies?
What (statistical) algorithms are there for this?
Ideally the result would be a percentage, for example 80% anomalous.
Thanks.

Construct a trie with all the known words in the dictionary.
Take each word that appears in your text and try to find it in the trie. If you don't find it, try to match a prefix of length k. If you find a match, apply the same procedure to the remaining characters. The procedure is recursive, so it can catch more than two concatenated words.
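The procedure above could be sketched roughly like this. The Trie class and its Add and Decompose members are hypothetical names, and the sketch assumes the caller has already lowercased both the dictionary words and the token.

```csharp
using System.Collections.Generic;

public sealed class Trie
{
    private sealed class Node
    {
        public readonly Dictionary<char, Node> Children = new Dictionary<char, Node>();
        public bool IsWord;
    }

    private readonly Node _root = new Node();

    public void Add(string word)
    {
        Node node = _root;
        foreach (char c in word)
        {
            if (!node.Children.TryGetValue(c, out Node next))
            {
                next = new Node();
                node.Children[c] = next;
            }
            node = next;
        }
        node.IsWord = true;
    }

    // Returns the known words that `token` breaks into, or null if it cannot
    // be broken into known words. A one-element list means the token is
    // itself a known word; more elements mean words were run together.
    public List<string> Decompose(string token, int start = 0)
    {
        if (start == token.Length) return new List<string>();

        Node node = _root;
        for (int i = start; i < token.Length; i++)
        {
            // No dictionary word continues with this character: give up here.
            if (!node.Children.TryGetValue(token[i], out node)) return null;
            if (!node.IsWord) continue;

            // A known prefix ends at i; recurse on the remaining characters.
            List<string> rest = Decompose(token, i + 1);
            if (rest != null)
            {
                rest.Insert(0, token.Substring(start, i - start + 1));
                return rest;
            }
        }
        return null;
    }
}
```

A token that decomposes into two or more words is a likely missing-space anomaly, and a token for which Decompose returns null is a likely typo. Counting both against the total number of tokens gives a rough percentage.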

Another simple method is to use the edit distance algorithm. This algorithm calculates the minimum number of edit operations (insert, delete or replace) that have to be performed to transform one string into the other. With some additional logic you can easily get this algorithm to output the operations as well.
This, however, assumes you have both the correct and the broken string. If you only have the broken string, this gets a lot harder. In that case I would suggest you either try the trie approach mentioned before or use an external library such as ispell to handle this logic. You could have a look at the code for ispell or its variants to see how complicated such a task might get.

A couple of links that could be helpful:
http://www.codeproject.com/KB/cs/spellcheckdemo.aspx
http://www.codeproject.com/KB/recipes/spellcheckparser.aspx
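For reference, a minimal sketch of the edit distance (Levenshtein) calculation described in this answer. The EditDistance class name is an assumption, and the sketch returns only the count, not the list of operations.

```csharp
using System;

public static class EditDistance
{
    // Minimum number of single-character inserts, deletes and replacements
    // needed to turn `source` into `target`.
    public static int Levenshtein(string source, string target)
    {
        // Only two rows of the usual distance table are kept in memory.
        var previous = new int[target.Length + 1];
        var current = new int[target.Length + 1];
        for (int j = 0; j <= target.Length; j++) previous[j] = j;

        for (int i = 1; i <= source.Length; i++)
        {
            current[0] = i;
            for (int j = 1; j <= target.Length; j++)
            {
                int replace = previous[j - 1] + (source[i - 1] == target[j - 1] ? 0 : 1);
                int delete = previous[j] + 1;
                int insert = current[j - 1] + 1;
                current[j] = Math.Min(replace, Math.Min(delete, insert));
            }
            // The row just filled becomes the previous row for the next pass.
            (previous, current) = (current, previous);
        }
        return previous[target.Length];
    }
}
```

For example, Levenshtein("wellknown", "well known") is 1, because a single inserted space repairs the string.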

Regular expression in C# , is this possible?

I have never used regular expressions before. I plan to use them to solve my problem, but I'm not sure whether they can help me.
I need to store a rule or formula for building string values, like the following examples, in a database field, and then retrieve the rule and build the string value from it.
FacilityCode + Left(ModelNO,2)
Right(PO,3) + Left(Serial,2)
Is this achievable using .NET regular expressions? Are there any good tutorials or simple examples for this problem?

Regexp : http://msdn.microsoft.com/en-us/library/2k3te2cs(VS.80).aspx
http://msdn.microsoft.com/en-us/library/system.text.regularexpressions.regex.aspx
But it doesn't seem to fit :)

It might be better to code some random string generator. Regex is for searching data not creating data.
The thing to remember about regex is that it is like an aircraft carrier; it does one thing very very well, it does not do other jobs very well at all.
An aircraft carrier moves planes very well on the ocean; it does not make a cheese sandwich well AT ALL!!
That is to say, if you use regex when you shouldn't, you will almost certainly use far more processing power than if you used another tool for that job. HTML parsing comes to mind.

Regex is provided as part of System.Text.RegularExpressions, but you can't rely exclusively on it. It'll let you search existing strings, but you'll need to implement your own logic for building new strings based on what you find in the existing data.
Also, keep in mind that System.Text.RegularExpressions works differently from regexp in Perl and other implementations. For example, it doesn't recognize POSIX character class definitions.
Since you're new to regex, you might want to check out the "Regular Expressions User Guide" on zytrax.com. It's not as comprehensive as an O'Reilly manual, but it'll do as a start.
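To make the division of labour concrete, here is a rough sketch in which a regex only recognizes each term of a stored rule and ordinary code builds the string. It assumes rules consist solely of field names and Left/Right calls joined by +, as in the question; RuleBuilder, Build and the fields dictionary are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;

public static class RuleBuilder
{
    // One term: either Left(field,n) / Right(field,n) or a bare field name.
    private static readonly Regex Term = new Regex(
        @"^\s*(?:(?<func>Left|Right)\s*\(\s*(?<arg>\w+)\s*,\s*(?<count>\d+)\s*\)|(?<name>\w+))\s*$",
        RegexOptions.IgnoreCase);

    public static string Build(string rule, IDictionary<string, string> fields)
    {
        var result = new StringBuilder();
        foreach (string part in rule.Split('+'))
        {
            Match m = Term.Match(part);
            if (!m.Success)
                throw new FormatException("Unrecognized term: " + part.Trim());

            bool isCall = m.Groups["func"].Success;
            string value = fields[isCall ? m.Groups["arg"].Value : m.Groups["name"].Value];

            if (isCall)
            {
                // Never take more characters than the field value has.
                int count = Math.Min(int.Parse(m.Groups["count"].Value), value.Length);
                bool left = m.Groups["func"].Value.Equals("Left", StringComparison.OrdinalIgnoreCase);
                value = left ? value.Substring(0, count) : value.Substring(value.Length - count);
            }
            result.Append(value);
        }
        return result.ToString();
    }
}
```

With FacilityCode = "FAC" and ModelNO = "M1234", the rule FacilityCode + Left(ModelNO,2) builds "FACM1". A field name missing from the dictionary throws KeyNotFoundException.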

c# parse a string that contains conditions , key=value

I am given a string that contains several different combinations of data.
For example:
string data = "(age=20&gender=male) or (city=newyork)";
string data1 = "(job=engineer&gender=female)";
string data2 = "(foo =1 or foo = 2) & (bar =1)";
I need to parse this string, build a structure out of it, and evaluate it as a condition against another object, e.g. if the object has these properties, do something, otherwise skip it.
What are the best practices for doing this?
Should I use a parser generator such as ANTLR to tokenize the string?
Note: the string can be built from many combinations, but the operators are always and/or.

Something like ANTLR is probably overkill for this.
A simple implementation of the shunting-yard algorithm would probably do the trick quite nicely.
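A minimal sketch of what that could look like, using the evaluating form of shunting-yard (a value stack plus an operator stack). It assumes well-formed input, treats & and "and" as binding tighter than "or", and compares values as case-insensitive strings; ConditionEvaluator and the properties dictionary are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ConditionEvaluator
{
    // Tokens: parentheses, the operators &, and, or, and key=value tests.
    private static readonly Regex Token = new Regex(
        @"(?<paren>[()])|(?<op>&|\band\b|\bor\b)|(?<key>\w+)\s*=\s*(?<value>\w+)",
        RegexOptions.IgnoreCase);

    public static bool Evaluate(string expression, IDictionary<string, string> properties)
    {
        var values = new Stack<bool>();
        var operators = new Stack<string>();

        foreach (Match m in Token.Matches(expression))
        {
            if (m.Groups["key"].Success)
            {
                // Operand: test the key=value pair against the object right away.
                values.Push(properties.TryGetValue(m.Groups["key"].Value, out string actual)
                    && string.Equals(actual, m.Groups["value"].Value, StringComparison.OrdinalIgnoreCase));
            }
            else if (m.Value == "(")
            {
                operators.Push("(");
            }
            else if (m.Value == ")")
            {
                while (operators.Peek() != "(") Apply(values, operators.Pop());
                operators.Pop(); // discard the matching "("
            }
            else
            {
                string op = m.Value.Equals("or", StringComparison.OrdinalIgnoreCase) ? "or" : "and";
                // Apply pending operators of equal or higher precedence first.
                while (operators.Count > 0 && operators.Peek() != "("
                       && Precedence(operators.Peek()) >= Precedence(op))
                    Apply(values, operators.Pop());
                operators.Push(op);
            }
        }

        while (operators.Count > 0) Apply(values, operators.Pop());
        return values.Pop();
    }

    private static int Precedence(string op) => op == "and" ? 2 : 1;

    private static void Apply(Stack<bool> values, string op)
    {
        bool right = values.Pop();
        bool left = values.Pop();
        values.Push(op == "and" ? left && right : left || right);
    }
}
```

For an object with age = 20, gender = female and city = newyork, the first example string evaluates to true because the city test succeeds. Characters the token regex does not recognize are silently skipped, so real code would want stricter validation.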

Regular expressions may work if the input is very simple, but they will more likely lead to code that is impossible to maintain. Some other approach to parsing seems like a better idea.
I would take a look at NCalc. It is mainly focused on parsing mathematical expressions, but it seems to be quite customizable (you can specify your own functions and constants), so it may work in your scenario as well.
If this is too complex for your purpose, you can use any parser generator for C#. ANTLR is one great option; the "Five minute introduction to ANTLR" shows how to start writing something like your example.
You could also try F#, which is a great language for this kind of problem. See for example the FsLex Sample by Chris Smith, which shows a simple mathematical evaluator. Processing the parsed expression in F# would be a lot easier than in C#. In F#, you could also use FParsec, which is very lightweight but may be a bit difficult to follow if you're not used to F#.

I suggest you have a look at regular expressions: http://www.codeproject.com/KB/dotnet/regextutorial.aspx

ANTLR is a great tool, but you can probably do this with regular expressions. One of the nice things about the .NET regex engine is its support for nested constructs. See
http://retkomma.wordpress.com/2007/10/30/nested-regular-expressions-explained/
and this SO post.

Seems like you might want to use Regular Expressions to do this.
Read up a little bit on Regular Expressions in .NET. Here are some good articles:
http://msdn.microsoft.com/en-us/library/hs600312.aspx
http://www.regular-expressions.info/dotnet.html
When it comes time to write and test your regular expression, I would highly recommend using RegExLib.com's regex tester.

Ideas for creating a "Did you mean XYZ" feature into website

I'd like to give users the ability to search through a large list of businesses, but still find near matches.
Does anyone have any recommendations on how best to go about this when you're not targeting simple dictionary words, but instead complex names like ABC Business Name?
Regards.

Check out the wikipedia article on Levenshtein distance. It's a fairly simple concept to wrap your head around and pretty easy to implement an algorithm in whichever language you are using, in your case, C#.
I found an example in C# for you here.
Also, here is an example of a spelling corrector from Peter Norvig of Google. It was said on the SO podcast a few episodes ago that Jon Skeet attempted a rewrite of this same algorithm in C#. Not sure if he completed it and/or made it publicly available though.

Consider using keyword matching and edit-distance-based similarity. You might combine this with data that maps what was originally searched for to what was actually clicked.

This is probably a crazy solution, but could you split the business name on spaces and then search on either all of the words or maybe just the first couple?
So you might search on 'ABC' and 'Business' but leave out 'Name', as searching on every word might take too long.
You might even check whether the string is over a certain length, then trim it and search on just the first, say, 5 letters.
Have you had a look at "soundex" as a way of searching through your businesses? Again, I think you'd need to split the name on spaces.

You might check out the SQL Server SOUNDEX and DIFFERENCE functions. SOUNDEX converts a sequence of characters (such as a word) into a 4-character code which will be the same for similar-sounding words. DIFFERENCE compares the SOUNDEX codes of two strings and returns a number from 0 to 4, where 4 means the two sound most alike.
You could, for example, create a computed column based on the SOUNDEX function and match on that column later. Or you could use DIFFERENCE in a WHERE clause.
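If you would rather compute the code in C# than in the database, here is a rough sketch of the classic American Soundex rules. Treat it as an approximation: SQL Server's SOUNDEX may differ in edge cases, so codes computed here should not be mixed with codes computed by the database without checking. The Phonetic class name is made up.

```csharp
using System.Text;

public static class Phonetic
{
    // Digit for each letter A..Z; '0' marks letters that get no digit
    // (the vowels plus H, W and Y).
    private const string Codes = "01230120022455012623010202";

    public static string Soundex(string word)
    {
        var result = new StringBuilder();
        char lastCode = '0';

        foreach (char raw in word.ToUpperInvariant())
        {
            if (raw < 'A' || raw > 'Z') continue; // ignore anything but A-Z

            char code = Codes[raw - 'A'];
            if (result.Length == 0)
            {
                result.Append(raw); // the first letter is kept as a letter
            }
            else if (code != '0' && code != lastCode)
            {
                result.Append(code);
                if (result.Length == 4) break;
            }

            // H and W do not separate letters with equal codes; vowels do.
            if (raw != 'H' && raw != 'W') lastCode = code;
        }

        // Pad short codes with zeros, e.g. "Lee" becomes "L000".
        return result.Length == 0 ? string.Empty : result.ToString().PadRight(4, '0');
    }
}
```

Both "Robert" and "Rupert" map to R163, which is what makes the code usable for near-match lookups. For multi-word business names you would apply it to each word after splitting on spaces.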

Regular expression for validating a url

I'm a beginner with regexes. I need to validate URLs ranging from simple ones to ones with query strings, square brackets, etc. For example:
www.test.com?waa=[sample data]
The regex that I wrote only works for simple URLs. It fails for the one with square brackets. Any ideas?

Do you really need to use regex?
bool isUri = Uri.IsWellFormedUriString("http://...", UriKind.RelativeOrAbsolute);
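A slightly fuller sketch along the same lines, using Uri.TryCreate and restricting the result to http and https. It assumes that input without a scheme, such as www.test.com, should be treated as http; UrlCheck and IsHttpUrl are made-up names.

```csharp
using System;

public static class UrlCheck
{
    public static bool IsHttpUrl(string input)
    {
        if (string.IsNullOrWhiteSpace(input)) return false;

        // Assumption: a missing scheme means the user meant http.
        string candidate = input.Contains("://") ? input : "http://" + input;

        // Let the framework parse the URL, then accept only web schemes.
        return Uri.TryCreate(candidate, UriKind.Absolute, out Uri uri)
            && (uri.Scheme == Uri.UriSchemeHttp || uri.Scheme == Uri.UriSchemeHttps);
    }
}
```

Whether a URL with unescaped square brackets or spaces passes depends on how the runtime's Uri parser treats those characters, so that case is worth testing against your target framework rather than assuming.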

I would suggest taking a better look at the following site
http://www.regular-expressions.info/dotnet.html
Without actually seeing the regex you're using, I can't provide much insight. And giving you the answer wouldn't really teach you much either. Give a man a regex and you help him for a bit. Teach him regex and he's good for life.

Take a look at the following:
http://www.geekzilla.co.uk/view2D3B0109-C1B2-4B4E-BFFD-E8088CBC85FD.htm

Thanks a lot for the replies.
This is what I wrote. It works for query strings too, but it fails when I add []:
/^(https?|ftp)://(?#)(([a-z0-9$.+!*\'(),;\?&=-]|%[0-9a-f]{2})+(?#)(:([a-z0-9$.+!*\'(),;\?&=-]|%[0-9a-f]{2})+)?(?#)#)?(#)((([a-z0-9][a-z0-9-][a-z0-9].)(#)[a-z]{2}[a-z0-9-]a-z0-9|(\d|[1-9]\d|1\d{2}|2[0-4][0-9]|25[0-5].){3}(?#)(\d|[1-9]\d|1\d{2}|2[0-4][0-9]|25[0-5])(?#))(:\d+)?(?#))(((/+([a-z0-9$_.+!*\'(),;:#&=-]|%[0-9a-f]{2}))(?# )(\?([a-z0-9$_.+!*\'(),;:#&=-]|%[0-9a-f]{2}))(?#)?)?)?(?#)(#([a-z0-9$_.+!*\'(),;:#&=-]|%[0-9a-f]{2})*)?(?#)$/i

Use this if you want the URL to include http:
http(s)?://([\w-]+\.)+[\w-]+(/[\w- ./?%&=]*)?
If you don't want http in the URL, then go for:
([\w-]+\.)+[\w-]+(/[\w- ./?%&=]*)?
