I have an XML file with two properties: word and link.
How can I replace the words in a text with a link, using the XML information?
Ex.:
XML
<word>dog</word>
<link>http://www.dog.com</link>
Text: The dog is nice.
Result: The <a href="http://www.dog.com">dog</a> is nice.
That case works fine.
The problems:
1- If the text contains the word "dogs", the result is incorrect because of the trailing "s".
2- I've tried splitting the text on spaces to fix this, but if the word is compound, like "new year", the result is incorrect again.
Does anyone have suggestions for doing this while handling these problems (plurals and compound words)?
Thanks for the help.
You can use Lucene.Net's contrib package Snowball for stemming (words -> word, came -> come, having -> have, etc.). But you will still have trouble with compound words.
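As a sketch, assuming the EnglishStemmer class from the Snowball contrib (namespace SF.Snowball.Ext in the 2.x/3.x contrib packages; verify the names against your version):

using SF.Snowball.Ext;

var stemmer = new EnglishStemmer();
stemmer.SetCurrent("dogs");          // load the word
stemmer.Stem();                      // apply the stemming rules in place
string root = stemmer.GetCurrent();  // "dog"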
If you roll your own solution, I have had good success with the .NET pluralization capabilities:
http://msdn.microsoft.com/en-us/library/system.data.entity.design.pluralizationservices.pluralizationservice.aspx
Essentially, you can pass a word in its plural form and receive a singular version and vice versa.
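A minimal sketch (PluralizationService lives in the System.Data.Entity.Design assembly and only ships rules for English):

using System.Data.Entity.Design.PluralizationServices;
using System.Globalization;

PluralizationService ps =
    PluralizationService.CreateService(CultureInfo.GetCultureInfo("en-US"));

string singular = ps.Singularize("dogs"); // "dog"
string plural = ps.Pluralize("dog");      // "dogs"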
This could be fairly processing-intensive depending on how often the content changes; i.e., this wouldn't be a good choice for searching thousands of words in real time.
Assuming that you can pre-process/cache the results or that the source file is small, you could:
Run Once
Identify all candidate words from the source file.
Parse/split phrases and pass them through the pluralization libraries to determine their plural counterparts.
Generate (and precompile) simple regular expressions to locate the words that you do want to match. For example, if you want to match "dog" but not "dogs" you could create a regex like dog[^s] which could then be executed against the text.
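As a sketch of that step (the word list here is hypothetical), note that \b word boundaries also handle a word at the very end of the text, which dog[^s] would miss:

using System.Collections.Generic;
using System.Text.RegularExpressions;

string[] words = { "dog", "new year" }; // gathered from the XML source

var patterns = new List<Regex>();
foreach (string w in words)
{
    // \bdog\b matches "dog" and "dog." but not "dogs".
    // RegexOptions.Compiled precompiles the pattern for repeated use.
    patterns.Add(new Regex($@"\b{Regex.Escape(w)}\b",
                           RegexOptions.Compiled | RegexOptions.IgnoreCase));
}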
Run Whenever a Search/Replace is Needed
Run your list of source expressions against the text in question. I would suggest ordering the expressions from shortest to longest (otherwise a short expression may replace a word that was just parsed by a longer expression).
Again, this would be processor intensive to run in real-time (most solutions will be). As always, if you are parsing HTML, you should use an HTML parser, not a regular expression. In this case, you might use a proper parser to locate all text nodes and then perform the search/replace on them.
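For instance, with the HtmlAgilityPack library (one such parser; ReplaceKeywords is a hypothetical helper standing in for the regex pass described above):

using HtmlAgilityPack;

var doc = new HtmlDocument();
doc.LoadHtml(html); // html: the page markup

// Only text nodes are rewritten, so tags and attributes stay intact.
var textNodes = doc.DocumentNode.SelectNodes("//text()");
if (textNodes != null)
    foreach (HtmlTextNode node in textNodes)
        node.Text = ReplaceKeywords(node.Text); // hypothetical helper

string result = doc.DocumentNode.OuterHtml;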
An alternative solution would be to put the text and keyword list into a database and use SQL Server Full Text Indexing which tends to be pretty smart about these things and supports intelligent match predicates. You could even combine this with a CLR stored procedure to handle things that .NET excels at (like string parsing).
Regardless of the approach, this will not be an exact science.
You're likely going to need a dictionary. Create a text file/XML file that contains both the singular and plural forms of the words you want. At runtime, load them into a Dictionary<String, String>. Then look up the value of <word/> in the dictionary and extract its singular value.
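A minimal sketch of that lookup (the word pairs are hypothetical):

using System.Collections.Generic;

var singulars = new Dictionary<string, string>
{
    { "dogs", "dog" },
    { "new years", "new year" } // compound phrases map as a whole
};

string candidate = "dogs";
string singular;
if (singulars.TryGetValue(candidate, out singular))
{
    // "dogs" resolves to "dog", whose <link> entry supplies the URL
}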
I would like to parse OpenOffice-compatible, Hunspell-formatted .aff and .dic files.
English .aff and .dic files can be downloaded from here, for example: http://extensions.openoffice.org/en/project/english-dictionaries-apache-openoffice
I want to scan each line of a given .dic file and generate every possible word for that line using the provided .aff file.
How can I do that?
I have installed the NHunspell framework, but it does not have that feature: https://www.nuget.org/packages/NHunspell/
For example, for the English language, let's consider:
make/UAGS
"make" can become make, made, makes, making, etc.
Now I need a parser to give me all these combinations. How can I obtain them? Thank you very much.
So basically, I want to scan each line of the dictionary and generate all possible words from that line's word, and I don't know how to do that.
I could also write my own parser, but the rules seem pretty complex, and there is no detailed, approachable documentation on them.
Here is basically what I want:
Given analyze/ADSG in en.dic, together with the en.aff file, obtain all of the following words:
analyze, analyzes, analyzing, analyzed, reanalyze, reanalyzes, reanalyzing, reanalyzed
If you want the entire database, you may execute unmunch:
unmunch dictionary.dic dictionary.aff
Note that the current implementation of unmunch in Hunspell limits the maximum number of words and affixes and the length of generated words, so unmunch may fail if the target language exceeds those limits.
If you want just the list of possible words that can be generated from an entry, you may use wordforms:
wordforms dictionary.aff dictionary.dic word
I have a lot of text data with different structures. I need to extract parts of these texts based on some text-based rules. I would use regular expressions, but unfortunately the people who use the application have never heard of them.
Basically, the app does the following:
Load the data into a textbox
Type the structure of the output as a simple set of rules into another textbox
Receive the results in a 3rd textbox
Examples of data structures (I have megabytes of this data):
Label1: value1, measurement
Label2; value2; something else
Nr, value3 (comment)
...
I need some other approach that I could use instead of regular expressions. It can be extremely simple because all I need is one value from every row.
From the example above I have to obtain the following structure:
"value1, value2, value3"
Is there a simpler alternative to regex? Has someone already implemented something like this?
I can also imagine that I am approaching the problem from the wrong angle, e.g. by forcing a non-technical user to write data-extraction rules. In that case the question becomes more generic: "How can I build an application that lets a very non-technical user extract data from arbitrary texts?"
Edit:
I have implemented the simplest possible matching for them:
File content:
"Strain at break Ax2";"Unknown"
"Strain at break Ax1";"Unknown"
"Strain at break";"Unknown"
"Yield point strain";"Unknown"
"Uniform elongation";25.4087;"%"
"Tensile strength";261.323;"MPa"
"End test phase Yield point";1;"%"
"Maximum tensile force";5.22647;"kN"
Pattern:
"Tensile strength";(?<value>[^;\n]*);
"Maximum tensile force";(?<value>[^;\n]*);
Still too complex. The problem is that if I start replacing the ugly part with another string, to obtain for example:
"Tensile strength", [First value after]
I lose all the generic nature of the extraction, because every file looks different from this one.
Take a look at the FileHelpers library. It allows runtime generation of file layouts and I think the one that would help in your example is the DelimitedClassBuilder.
In your case, I'd probably use FileHelpers to parse the record definitions into the DelimitedClassBuilder and then use the result to parse your records.
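A minimal sketch of that idea (class and field names here are hypothetical; quoted values and rows with fewer fields need extra configuration, so treat this as a starting point rather than a drop-in implementation):

using FileHelpers;
using FileHelpers.Dynamic; // FileHelpers.RunTime in older versions

var cb = new DelimitedClassBuilder("Measurement", ";");
cb.AddField("Label", typeof(string));
cb.AddField("Value", typeof(string)); // keep as string; not every row is numeric
cb.AddField("Unit", typeof(string));

var engine = new FileHelperEngine(cb.CreateRecordClass());
object[] records = engine.ReadString(fileContent); // fileContent: the raw text above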
I solved the issue by defining the rules as regular expressions. Once the rules were defined, I added a wrapper rule set that was easier for the users to read.
For example, to extract a value from the line
Maximum amount of Sheet Drawing Force= 35.659695[kN]
I defined the regular expression
{0}=\s*(?<value>[^[\n\r]*)
then let the user define the name of the field. The {0} placeholder was then replaced with the name of the field and the regular expression applied.
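A minimal sketch of that substitution and extraction:

using System;
using System.Text.RegularExpressions;

string template = @"{0}=\s*(?<value>[^[\n\r]*)";
string fieldName = "Maximum amount of Sheet Drawing Force"; // user-supplied
string pattern = string.Format(template, Regex.Escape(fieldName));

string line = "Maximum amount of Sheet Drawing Force= 35.659695[kN]";
Match m = Regex.Match(line, pattern);
if (m.Success)
    Console.WriteLine(m.Groups["value"].Value); // "35.659695"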
I'm going to write a program that takes a URL and counts the occurrences of EVERY 1-word, 2-word, and 3-word phrase in the webpage (and possibly x-word phrases).
Here's the best algorithm I could come up with:
1) Strip HTML tags.
2) Make everything lowercase.
3) Split the text on spaces and put the words into an array.
4) Iterate over the words; at each position i, put word[i], word[i] + word[i+1], and word[i] + word[i+1] + word[i+2] into a hashtable.
Every time a key is already present, increase the count for that word or 2- or 3-word phrase.
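A minimal sketch of steps 2-4, with a Dictionary standing in for the hashtable:

using System;
using System.Collections.Generic;

static Dictionary<string, int> CountPhrases(string text, int maxWords)
{
    var counts = new Dictionary<string, int>();
    string[] words = text.ToLowerInvariant()
        .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);

    for (int i = 0; i < words.Length; i++)
    {
        string phrase = null;
        for (int n = 1; n <= maxWords && i + n <= words.Length; n++)
        {
            phrase = n == 1 ? words[i] : phrase + " " + words[i + n - 1];
            int c;
            counts.TryGetValue(phrase, out c); // c stays 0 when absent
            counts[phrase] = c + 1;            // a "collision" means seen before
        }
    }
    return counts;
}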
My questions are:
1) Can anyone provide any more efficient solutions in terms of space and runtime?
2) Are there any easy ways to do #1 in C#?
I can probably use a DOM parser and extract all the inner text, maybe.
Depending on your case, you might be oversimplifying the problem, and/or you may end up putting a lot of effort into implementing functionality that already exists in libraries. So this will not be a direct answer, but a suggestion on what path to take in tackling this problem.
The process you want to implement is called information retrieval. It is very broad and complex, but luckily there is a lot of research in this area. Part of it is extracting word n-grams (an n-gram is a sequence of n consecutive letters or words).
Let me show you some additional problems you should think about ahead of time:
is the capitalization of letters in a word important?
is a period the only character you want to treat as marking the end of a sentence?
do you want to exclude stop words? Stop words are words you don't want to include in phrases, like 'a', 'the', 'I', 'my', and so on.
do you want to stem words? That is, convert words from their original form to their root form, e.g. plural to singular: basketballs -> basketball.
And for extracting pure text from HTML:
extract only the text shown on the page?
extract tooltips too (like those shown when hovering the mouse over a picture)?
any other non-visible text (meta tags and so on)?
There are libraries that perform searching and information extraction from raw material. "Raw material" means that you have to process a document (HTML, DOC, PDF, image, ...) and turn it into text in order for a search engine to index it (extract phrases, for instance). Once a document is indexed, it can be searched. One such library for .NET is Lucene.NET. It supports different stemmers, analyzers, and filters.
I am not sure, but I believe there are also libraries for extracting text from HTML.
Basically, your approach may work in simpler scenarios where a non-trivial error rate is acceptable. I recently gained an interest in information retrieval and found it really complex and interesting. You may benefit from researching this topic, depending on your goals. There is a lot of info here on Stack Overflow as well as on the rest of the Internet.
And if you decide to go this way, there is much more info on Lucene (the original Java version; Lucene.NET is its .NET port) than on Lucene.NET, so if you don't find an answer for Lucene.NET right away, search the Lucene discussions.
To answer your question #2, using the WinForms WebBrowser control's parsed document:
HtmlDocument doc = WebBrowser1.Document;
string text = doc.Body.InnerText; // HtmlDocument has no GetInnerText(); Body.InnerText returns the visible text
If you want to make it more efficient, use a suffix trie (you may have to write your own):
http://en.wikipedia.org/wiki/Suffix_trie
A suffix trie basically makes searching for a string depend on the length of the search string instead of the length of the text. It's the sort of thing they use in search engines.
I need to introduce some text macros, for example:
"Some text here, some text here #from_file[a.txt,2,N] and here and here"
The #from_file[a.txt,2,N] macro should get 2 random lines from a.txt and join them with a newline character; another one, #from_file[a.txt,5,S], should take 5 random lines and join them with a space.
Of course I also need some other macros: #random[0-9] - a random number; #random[A-B,5] - a random string of 5 characters.
Macros could also be in another format, e.g.: {from_file:a.txt,2,N}
My first idea was to use regular expressions, but maybe there is another solution for my problem?
It sounds like you want to create some sort of "general purpose" text-macro system, and while I'm sure this can be done with regexes, what you need basically boils down to what you want the system to be capable of and how extensive and flexible it needs to be.
You basically need to define your grammar and constraints. Can the file name contain the macro-block terminator character '}'? If so, does it need to be escaped? Should escaping be supported at all? Are spaces allowed within a macro block?
Basically, figure out how you want things to work, preferably as constrained as possible, since that means you can implement a simpler solution; there might not be any need for a full-blown parser and similar machinery.
Maybe a regex-based solution will be sufficient (although most certainly not very good). But before you can tell, you need to spec it out better ;)
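If a regex does turn out to be sufficient, a minimal sketch of the expansion loop could look like this (it assumes the #name[args] form from the question and supports no escaping):

using System;
using System.Text.RegularExpressions;

static readonly Regex Macro =
    new Regex(@"#(?<name>\w+)\[(?<args>[^\]]*)\]", RegexOptions.Compiled);

static string Expand(string input, Func<string, string[], string> handler)
{
    // handler receives the macro name ("from_file", "random", ...) and its
    // comma-separated arguments, and returns the replacement text.
    return Macro.Replace(input, m =>
        handler(m.Groups["name"].Value, m.Groups["args"].Value.Split(',')));
}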
Let me explain with an example.
We have the following text:
"Comme Il Faut was founded in 1927. The tobacco company is most well known for its reputation of producing customized private label brands for its partners worldwide".
This is normal text. But the following text:
"CommeIlFautwasfounded in 1927. The tobacco companyi most wellknown foritsreputation of producing customizedprivatelabelbrands foritspartners worldwide"
This text has anomalies: typos, words without spaces, maybe something else.
How can I search for such anomalies?
What (statistical) algorithms are there for this?
It would be desirable for the result to be a percentage, for example, 80% anomalous.
Thanks.
Construct a trie containing all the known words in the dictionary.
Take each word that appears in your text and try to find it in the trie. If you don't find it, try to match a prefix of length k. If you find a match, apply the same procedure to the remaining characters (see the sketch below). This is recursive, and it can catch more than two concatenated words.
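A sketch of that recursion, with a HashSet standing in for the trie (a real trie only makes the prefix lookups cheaper; the logic is the same):

using System.Collections.Generic;

static bool CanSplit(string s, HashSet<string> dictionary)
{
    if (s.Length == 0) return true;
    if (dictionary.Contains(s)) return true; // the word itself is known

    for (int k = 1; k < s.Length; k++)
        if (dictionary.Contains(s.Substring(0, k)) &&  // prefix of length k
            CanSplit(s.Substring(k), dictionary))      // recurse on the rest
            return true;

    return false;
}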
Another simple method is to use the edit-distance algorithm. This algorithm calculates the minimum number of edit operations (insert, delete, or replace) needed to transform one string into the other. With some additional logic you can easily get this algorithm to output the operations as well.
This, however, assumes you have both the correct string and the broken string. If you only have the broken string, this gets a lot harder. In that case I would suggest you either try the trie approach mentioned before, or use an external library like ispell to handle this logic. You could have a look at the code for ispell or its variants to see how complicated such a task can get.
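A minimal dynamic-programming sketch of that edit-distance calculation:

using System;

static int EditDistance(string a, string b)
{
    var d = new int[a.Length + 1, b.Length + 1];
    for (int i = 0; i <= a.Length; i++) d[i, 0] = i; // deletions
    for (int j = 0; j <= b.Length; j++) d[0, j] = j; // insertions

    for (int i = 1; i <= a.Length; i++)
        for (int j = 1; j <= b.Length; j++)
        {
            int replace = a[i - 1] == b[j - 1] ? 0 : 1;
            d[i, j] = Math.Min(Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
                               d[i - 1, j - 1] + replace);
        }
    return d[a.Length, b.Length];
}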
A couple of links that could be helpful:
http://www.codeproject.com/KB/cs/spellcheckdemo.aspx
http://www.codeproject.com/KB/recipes/spellcheckparser.aspx