I would like to parse OpenOffice-compatible, Hunspell-formatted .aff and .dic files.
English .aff and .dic files can be downloaded from here, for example: http://extensions.openoffice.org/en/project/english-dictionaries-apache-openoffice
I want to scan each line of the given .dic file and generate every possible word for each line using the provided .aff file.
How can I do that?
I have installed the NHunspell framework, but it does not have that feature: https://www.nuget.org/packages/NHunspell/
For example, for English, let's consider:
make/UAGS
"make" can become make, made, makes, making, etc.
Now I need a parser to give me all these combinations. How can I obtain them? Thank you very much.
So basically I want to scan each line of the dictionary and generate all possible words from the word on that line, and I don't know how to do that.
I can also write my own parser, but the rules seem pretty complex to me, and there is no detailed, easy-to-follow documentation about this.
Here is what I want, basically. The image explains it very clearly:
Given analyze/ADSG with the en.dic and en.aff files, obtain all of the following words:
analyze, analyzes, analyzing, analyzed, reanalyze, reanalyzes, reanalyzing, reanalyzed
If you want the entire database, you may execute unmunch:
unmunch dictionary.dic dictionary.aff
Note that the current implementation of unmunch in Hunspell limits the maximum number of words and affixes and the length of generated words, so unmunch may fail if the target language exceeds those limits.
If you want just the list of possible words that can be generated from an entry, you may use wordforms:
wordforms dictionary.aff dictionary.dic word
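If you need this from C#, and since NHunspell does not expose unmunch, one option is to shell out to the command-line tool. A minimal sketch, assuming the unmunch binary is installed and on the PATH:
using System.Diagnostics;

var psi = new ProcessStartInfo
{
    FileName = "unmunch",
    Arguments = "en.dic en.aff",
    RedirectStandardOutput = true,
    UseShellExecute = false
};
using (var process = Process.Start(psi))
{
    // unmunch writes the generated word list to standard output, one word per line
    string allWords = process.StandardOutput.ReadToEnd();
    process.WaitForExit();
}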
Related
I am playing around with a sentence string entry for a project I'm working on in C#, and I wanted to see if there is an alternative way to search for a verb using a built-in function.
Currently, I am using a database table with a list of regular verbs and cycling through those to check for a match, but I wanted to see if there is a better way to do this.
Consider the following input:
"Develop string matching software for verb"
The program will read the string and check each word:
if (IsVerb(word)) // pseudocode for "word is a verb"
{
    m_verbs.Add(word);
}
Short answer:
There is a better way.
Long answer:
It's not that simple. The problem is that there is no language functionality built into the string class in C#. This is an implementation detail that rests on the developer's shoulders.
You have some grammatical (or perhaps lexical is a better word) issues to consider, as Owen79 pointed out in his comment. Then there is the question of environment/resource restrictions.
You have a few options available to you:
Web-based dictionary services. You can query those with the words of your sentence and get back the 'status' of each word. Then you take only the statuses you want, like verbs for instance. Here is a link to DictService, which also includes a C# code sample.
A text / XML / other file-based solution. Similar approach: you simply look up the words in the file and act according to the presence or absence of the word in the file. You can cache (load into memory) the contents of the file to save on IO operations. Here are the links to lists of regular and irregular verbs. A minimal sketch of this option follows after this list.
A database solution is identical to the previous one, with the exception of loading contents into memory. That part may be unnecessary, but that depends on your implementation requirements.
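For the file-based option mentioned above, a minimal sketch, assuming a hypothetical verbs.txt with one verb per line (a HashSet keeps the lookups fast):
using System;
using System.Collections.Generic;
using System.IO;

// load once and cache; OrdinalIgnoreCase makes "Develop" match "develop"
var verbs = new HashSet<string>(File.ReadAllLines("verbs.txt"), StringComparer.OrdinalIgnoreCase);
bool isVerb = verbs.Contains("develop"); // true if the file lists it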
Bottom line: each solution will require some work, but whatever option you go for, the key aspects to consider are the platform and the resources available to you. If computational speed is a concern, you will most likely need to do some tricks to cut down on lookup times, etc.
Hope this helps
You could load the common verbs from a text file on disk. If you have lots of verbs and worry about memory, you could bucket them into common and uncommon, or alphabetically, and then load the dictionaries as needed.
If you don't want to use the database option (although it is highly recommended), then you need to put the verbs in a data structure (e.g. an array or list). You can then use the powerful System.Linq extension methods.
For example:
string[] allVerbs = new[] { "eat", "drink" }; // etc.
string s = "Develop string matching software for verb";
var words = s.Split(' ');
foreach (var word in words)
{
    if (allVerbs.Contains(word.ToLower()))
        m_verbs.Add(word);
}
I'm creating a program that reads a scanned handwritten document and converts it to text. The recognized words must come from a dictionary of about 300 words that I create. As an example, if the handwritten word is recognized as "heilo", but my dictionary only contains "hello" and "world", it should convert it to "hello". However, if it is recognized as "planet", it shouldn't match it to anything. I think a possible approach would be to create a score of how closely the recognized word matches each word in the dictionary. If it doesn't reach a minimum score, then no match is found.
I'm writing the application in C#. Are there any libraries/examples available that can do something like this, or would I have to code everything from scratch?
Thanks
There is nothing in the standard libraries to compute the distance between words, but there are plenty of examples you can find on the internet: look up "edit distance" or "Levenshtein distance". The idea is to measure similarity in terms of the number of changes needed to turn the first string into the second. The distance between "heil" and "hello" is 2, because you need to replace "i" with "l" (the first edit) and then append an "o" (the second edit).
When looking for an implementation, or implementing your own, avoid the trivial implementation with a full 2D array, because it is not memory-efficient. Use the modification with O(min(m,n)) memory requirements instead of the "naive" O(m*n); only the previous row is needed to compute the next one.
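For illustration, here is a minimal sketch of the two-row variant (the method name Levenshtein is just a placeholder):
using System;

static int Levenshtein(string a, string b)
{
    if (a.Length < b.Length) { var t = a; a = b; b = t; } // keep b the shorter string
    var prev = new int[b.Length + 1]; // row for the previous character of a
    var curr = new int[b.Length + 1]; // row being filled in
    for (int j = 0; j <= b.Length; j++) prev[j] = j;
    for (int i = 1; i <= a.Length; i++)
    {
        curr[0] = i;
        for (int j = 1; j <= b.Length; j++)
        {
            int cost = a[i - 1] == b[j - 1] ? 0 : 1; // 0 if the characters match
            curr[j] = Math.Min(Math.Min(curr[j - 1] + 1,   // insertion
                                        prev[j] + 1),      // deletion
                               prev[j - 1] + cost);        // substitution
        }
        var tmp = prev; prev = curr; curr = tmp; // swap rows
    }
    return prev[b.Length];
}
// Levenshtein("heil", "hello") == 2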
I have no library at hand to do what you need, but knowing that you want to calculate the Levenshtein distance might help you in your web search.
Perhaps you should start with a spell checker - there are a number of libraries available that do this.
There are a few C# snippets online that will get the ball rolling:
Levenshtein:
http://www.dotnetperls.com/levenshtein
Boyer-Moore:
http://www-igm.univ-mlv.fr/~lecroq/string/node15.html#SECTION00150
Based on those, you can easily implement your own Word Matcher module.
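As a sketch of such a word-matcher module, built on a Levenshtein function like the one above (the names and the threshold are just placeholders):
using System.Collections.Generic;

static string FindClosest(string word, IEnumerable<string> dictionary, int maxDistance)
{
    string best = null;
    int bestDistance = int.MaxValue;
    foreach (var candidate in dictionary)
    {
        int d = Levenshtein(word, candidate);
        if (d < bestDistance) { bestDistance = d; best = candidate; }
    }
    // below the minimum score: no match, as in the "planet" example
    return bestDistance <= maxDistance ? best : null;
}
// FindClosest("heilo", new[] { "hello", "world" }, 2) -> "hello"
// FindClosest("planet", new[] { "hello", "world" }, 2) -> null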
I have an XML file with two properties: word and link.
How can I replace the words in a text with links using the XML information?
Ex.:
XML
<word>dog</word>
<link>http://www.dog.com</link>
Text: The dog is nice.
Result: The <a href="http://www.dog.com">dog</a> is nice.
Results OK.
The problems:
1- If the text has the word "dogs", the result is incorrect because of the "s".
2- I've tested doing a split by space on the text to fix it, but if the word is compound, like "new year", the result is incorrect again.
Does anyone have any suggestions for doing this and fixing these problems (plurals and compound words)?
Thanks for the help.
You can use Lucene.Net's contrib package Snowball for stemming (words -> word, came -> come, having -> have, etc.). But you will still have trouble with compound words.
If you roll your own solution, I have had good success with the .NET pluralization capabilities:
http://msdn.microsoft.com/en-us/library/system.data.entity.design.pluralizationservices.pluralizationservice.aspx
Essentially, you can pass a word in its plural form and receive a singular version and vice versa.
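Usage is straightforward; note that, as far as I know, the service only supports the English culture:
using System.Data.Entity.Design.PluralizationServices;
using System.Globalization;

var service = PluralizationService.CreateService(new CultureInfo("en"));
string singular = service.Singularize("dogs"); // "dog"
string plural = service.Pluralize("dog");      // "dogs"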
This could be fairly intensive depending on how often the content changes, i.e. this wouldn't be a good choice for searching thousands of words in real time.
Assuming that you can pre-process/cache the results, or that the source file is small, you could:
Run Once
Identify all candidate words from the source file.
Parse/split phrases and pass them through the pluralization libraries to determine their plural counterparts.
Generate (and precompile) simple regular expressions to locate the words that you do want to match. For example, if you want to match "dog" but not "dogs", you could create a regex like dog[^s], which could then be executed against the text.
Run Whenever a Search/Replace is Needed
Run your list of source expressions against the text in question. I would suggest ordering the expressions from shortest to longest (otherwise a short expression may replace a word that was just parsed by a longer expression).
Again, this would be processor intensive to run in real-time (most solutions will be). As always, if you are parsing HTML, you should use an HTML parser, not a regular expression. In this case, you might use a proper parser to locate all text nodes and then perform the search/replace on them.
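A minimal sketch of the replace step, assuming the word/link pairs have already been read from the XML; note that \bdog\b sidesteps the end-of-string problem that dog[^s] has, while still rejecting "dogs":
using System.Text.RegularExpressions;

string text = "The dog is nice.";
var pattern = new Regex(@"\bdog\b", RegexOptions.Compiled | RegexOptions.IgnoreCase);
string result = pattern.Replace(text, m => "<a href=\"http://www.dog.com\">" + m.Value + "</a>");
// result: The <a href="http://www.dog.com">dog</a> is nice.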
An alternative solution would be to put the text and keyword list into a database and use SQL Server Full Text Indexing which tends to be pretty smart about these things and supports intelligent match predicates. You could even combine this with a CLR stored procedure to handle things that .NET excels at (like string parsing).
Regardless of the approach, this will not be an exact science.
You're likely going to need a dictionary. Create a text file/XML file that contains both the singular and plural forms of the words you want. At runtime, load them into a Dictionary<String, String>. Then look up the value of <word/> in the dictionary and extract its singular value.
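A minimal sketch of that lookup, assuming a hypothetical forms.txt with one "plural,singular" pair per line:
using System;
using System.Collections.Generic;
using System.IO;

var forms = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);
foreach (var line in File.ReadAllLines("forms.txt")) // e.g. "dogs,dog"
{
    var parts = line.Split(',');
    forms[parts[0]] = parts[1]; // plural -> singular
}
forms.TryGetValue("dogs", out string singular); // singular == "dog"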
I'm going to write a program that takes a URL and counts the occurrences of EVERY single 1-word, 2-word, and 3-word phrases in the webpage (and possibly x-word phrases).
Here's the best algorithm I could come up with:
1) Strip the HTML tags.
2) Make everything lowercase.
3) Split the text on spaces and put the words into an array.
4) Iterate over the words; for each position i, put the 1-, 2-, and 3-word phrases starting at word[i] (i.e. word[i], then word[i] word[i+1], then word[i] word[i+1] word[i+2]) into a hashtable.
Every time a phrase is already in the table, you increase the count for that word or 2-/3-word phrase.
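In C#, steps 2-4 might look like this minimal sketch (step 1 is left out; it assumes text already holds the stripped page text):
using System;
using System.Collections.Generic;

var counts = new Dictionary<string, int>();
var words = text.ToLower().Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
for (int n = 1; n <= 3; n++) // 1-, 2-, and 3-word phrases
{
    for (int i = 0; i + n <= words.Length; i++)
    {
        string phrase = string.Join(" ", words, i, n);
        counts[phrase] = counts.TryGetValue(phrase, out int c) ? c + 1 : 1;
    }
}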
My questions are:
1) Can anyone provide any more efficient solutions in terms of space and runtime?
2) Are there any easy ways to do #1 in C#?
I can probably use a DOM parser and parse out all the inner text, maybe.
Depending on your case, you might be oversimplifying the problem, and/or you may end up putting a lot of effort into implementing functionality that already exists in some libraries. So this will not be a direct answer but a suggestion on what path to take in tackling this problem.
The process you want to implement is called information retrieval. It is very broad and complex, but luckily there is a lot of research in this area. Part of it is extracting word n-grams (an n-gram is a sequence of consecutive letters or words).
Let me show you some additional problems you should think about ahead of time:
Is the capitalization of letters in a word important?
Is the dot the only sign that you want to use to mark the end of a sentence?
Do you want to exclude stop words? Stop words are words you don't want to include in a phrase, like 'a', 'the', 'I', 'my', and so on.
Do you want to stem words, i.e. convert words from their original form to their root form, such as plural to singular: basketballs -> basketball?
And for extracting pure text from HTML:
Do you want to extract only the text shown on the page?
Do you want to extract hints as well (like those shown when hovering the mouse over a picture)?
What about other non-visible text (meta tags and so on)?
There are libraries that perform searching and extracting information from raw material. "Raw material" means that you have to process a document (HTML, DOC, PDF, image, ...) and turn it into text in order for a search engine to index it (extract phrases, for instance). Once a document is indexed, it can be searched. One such library for .NET is Lucene.NET. It supports different stemmers, analyzers, and filters.
I am not sure, but I believe there are also libraries for extracting text from HTML.
Basically, your approach may work in some simpler scenarios where a not-so-small error level is acceptable. I recently gained an interest in information retrieval and found it really complex and interesting. You may benefit from researching this topic, depending on your goals. There is a lot of info here on Stack Overflow as well as on the rest of the Internet.
And if you decide to go this way, there is much more info on Lucene (the original Lucene Java version; Lucene.NET is a port to .NET) than on Lucene.NET. So if you don't find an answer for Lucene.NET immediately, do a search in the Lucene discussions.
To answer your question #2:
HtmlDocument doc = webBrowser1.Document;
string text = doc.Body.InnerText; // HtmlDocument has no GetInnerText(); Body.InnerText returns the page's visible text
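If you are not inside a WinForms app with a WebBrowser control, a library such as HtmlAgilityPack can do the same thing (a sketch, assuming the HTML has already been downloaded into a string html):
var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);
string text = doc.DocumentNode.InnerText; // text content with the tags stripped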
If you want to make it more efficient, use a suffix trie (you may have to write your own):
http://en.wikipedia.org/wiki/Suffix_trie
A suffix trie basically makes searching through a text depend on the length of the search string instead of the length of the text. It's the sort of thing they use in search engines.
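A minimal sketch of the idea; every suffix of the text is inserted, so a later substring lookup costs only as much as the query is long (the class and member names are placeholders). Note that a plain suffix trie needs O(n^2) space, which is why real systems use suffix trees or suffix arrays instead:
using System.Collections.Generic;

class SuffixTrie
{
    private class Node
    {
        public Dictionary<char, Node> Children = new Dictionary<char, Node>();
    }

    private readonly Node _root = new Node();

    public SuffixTrie(string text)
    {
        // insert every suffix text[i..] of the text
        for (int i = 0; i < text.Length; i++)
        {
            var node = _root;
            for (int j = i; j < text.Length; j++)
            {
                if (!node.Children.TryGetValue(text[j], out var next))
                    node.Children[text[j]] = next = new Node();
                node = next;
            }
        }
    }

    // cost depends only on the length of the query, not on the size of the text
    public bool ContainsSubstring(string query)
    {
        var node = _root;
        foreach (var c in query)
            if (!node.Children.TryGetValue(c, out node))
                return false;
        return true;
    }
}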
I need to introduce some text macros, for example:
"Some text here, some text here #from_file[a.txt,2,N] and here and here"
The #from_file[a.txt,2,N] macro should get 2 random lines from a.txt and join them with the newline character; another one, #from_file[a.txt,5,S], should take 5 random lines and join them with a space.
I of course need some other macros as well: #random[0-9] for a random number, #random[A-B,5] for a random string of 5 characters.
Macros could also be in another format, e.g.: {from_file:a.txt,2,N}
My first idea was to use regular expressions, but maybe another solution exists for my problem?
It sounds like you want to create some sort of general-purpose text-macro system, and while I'm sure this can be done with regexes, what you want basically boils down to what you want it to be capable of, and how extensive and flexible it needs to be.
You basically need to define your grammar and constraints. Can the file name contain the macro-block terminator character '}'? If so, does it need to be escaped? Should escaping be supported? Are spaces within a macro block allowed?
Basically, find out how you want things to work, preferably as constrained as possible, as this means you can implement a simpler solution, and there might not be any need for a full-blown parser and similar ilk.
Maybe a regex-based solution will be sufficient (although most certainly not very good). But before you can tell that, you need to spec it out better ;)
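For what it's worth, if the constrained case is enough, a regex-based sketch could look like this (the #from_file handling is illustrative; escaping, nesting, and the other macros are left out):
using System;
using System.IO;
using System.Linq;
using System.Text.RegularExpressions;

static readonly Regex Macro = new Regex(@"#(?<name>\w+)\[(?<args>[^\]]*)\]", RegexOptions.Compiled);
static readonly Random Rng = new Random();

static string Expand(string input)
{
    return Macro.Replace(input, m =>
    {
        var args = m.Groups["args"].Value.Split(',');
        switch (m.Groups["name"].Value)
        {
            case "from_file":
                var lines = File.ReadAllLines(args[0])   // e.g. a.txt
                    .OrderBy(_ => Rng.Next())            // shuffle
                    .Take(int.Parse(args[1]));           // take N random lines
                return string.Join(args[2] == "N" ? "\n" : " ", lines);
            default:
                return m.Value; // leave unknown macros untouched
        }
    });
}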