I require POS tagging for my files in the corpus.
I have successfully followed the SharpNLP installation instructions.
I am using the binary version.
I created a new C# project in: E:\sharp\sharpapp
location of Models Folder is: E:\sharp\sharpapp\bin\Models
location of my SharpNlp Binary is: E:\sharp\SharpNLP-1.0.2529-Bin
I have also followed the instructions to modify both .config files, "ParseTree.Exe" and "ToolsExamples.Exe".
Now, in my C# project I have a class called tagging.cs, which needs to read my corpus text files and POS-tag them. Can anybody explain how I can use SharpNLP to do this?
Please provide the steps.
In a nutshell, SharpNLP is:
a port to C# of OpenNLP Tools and OpenNLP MaxEnt
a connector to WordNet
a set of pre-computed models, mostly for the English language
utility modules such as integration with SQLite
It should be noted that the port of the OpenNLP libraries is relatively informal, with various class and property name changes, possibly loose preservation of features and semantics and no apparent connection with the original Java projects' lifecycle. This situation will likely ensure that in time the OpenNLP portion of SharpNLP will be more akin to distant cousins than twin sisters...
Nevertheless, it is possible to use examples and documentation from OpenNLP to complement the relatively thin support material available with SharpNLP. Between the source code of SharpNLP and resources like the OpenNLP API reference and the OpenNLP wiki, one can generally map things and adapt accordingly.
A useful guide could be the study of this particular source file, which makes use of OpenNLP in a way that seems close to what you may need. Note the name changes between OpenNLP and SharpNLP: for example, the POSTaggerME class becomes MaximumEntropyPosTagger, and the Parse() method and its overload turn into TagSentence() and such.
A more general hint is to understand the sequence of steps typically necessary to perform POS tagging.
This is a very high-level approximate description but, I think, useful.
get the text to be tagged = string(s) of text
initialize a text parser
parse the text = an "array" (or other container) of individual tokens, i.e. words and punctuation characters
initialize the POS Tagger; in particular, tell it which model it should use
feed the [ordered] sequence of tokens to the POS Tagger
Ta dah! Use the POS tags for the eventual purpose of your NLP application.
Note how the above sequence assumes that the model is readily available.
The model is a representation of the statistical "profile" of text in general, obtained by training the Tagger on a set of text that has already been tagged.
SharpNLP comes with a model for generic English, but in order to tag other languages, or if the corpus to be tagged belongs to a particular domain (say medical reports or Tweets or...), it may be preferable to re-train the tagger to improve its precision.
Open/SharpNLP, like most POS Taggers, whether stand-alone programs or APIs, typically includes features to train the tagger (= to produce a model from a sample set of already tagged text) and also to verify the quality of the model/tagger so produced (= to compare the tags produced on a test set with the tags expected for this set).
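The sequence of steps described above can be sketched in C# roughly as follows. This is only an illustration of the shape of the pipeline, not SharpNLP code: PosPipelineSketch, its regex tokenizer and its tiny lookup-table "model" are hypothetical stand-ins. In a real project the tokenizer and tagger would be SharpNLP objects (such as the MaximumEntropyPosTagger mentioned earlier) initialized with files from the Models folder.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical stand-in for a real tokenizer + tagger (NOT the SharpNLP API).
public static class PosPipelineSketch
{
    // Toy "model": a real tagger would load a trained model file instead.
    private static readonly Dictionary<string, string> ToyModel =
        new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase)
        {
            { "the", "DT" }, { "dog", "NN" }, { "barks", "VBZ" }, { ".", "." }
        };

    // Split the text into word tokens and single punctuation tokens.
    public static string[] Tokenize(string text)
    {
        var tokens = new List<string>();
        foreach (Match m in Regex.Matches(text, @"\w+|[^\w\s]"))
            tokens.Add(m.Value);
        return tokens.ToArray();
    }

    // Feed the ordered tokens to the "tagger"; unknown words get "UNK".
    public static string[] Tag(string[] tokens)
    {
        var tags = new string[tokens.Length];
        for (int i = 0; i < tokens.Length; i++)
        {
            string tag;
            tags[i] = ToyModel.TryGetValue(tokens[i], out tag) ? tag : "UNK";
        }
        return tags;
    }

    // Full pipeline: text -> tokens -> tags -> "token/TAG" string.
    public static string TagText(string text)
    {
        string[] tokens = Tokenize(text);
        string[] tags = Tag(tokens);
        var parts = new string[tokens.Length];
        for (int i = 0; i < tokens.Length; i++)
            parts[i] = tokens[i] + "/" + tags[i];
        return string.Join(" ", parts);
    }
}
```

For a corpus of files, the same TagText call would simply be applied to the contents of each file (for example, read with File.ReadAllText).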
Kindly read through the article that I have written for this. It will give you a detailed step by step method with sample code snippets.
Easy way of Integrating SharpNLP into your project in Visual Studio
I hope this was useful.
Is there any C# algorithm by which personal and place names can be extracted from text?
e.g., given the following text:
St. Mark died at Alexandria, in Egypt. He was martyred, I think.
However, that has nothing to do with my legend. About the founding of
the city of Venice--
(taken from "The Innocents Abroad" by Mark Twain)
...is there any way to extract:
St. Mark
Alexandria (or better yet, "Alexandria, Egypt")
Venice
?
I realize that there is no way to get 100% accuracy (where all place names and personal names are captured, and no "false positives" are added), but 80% accuracy could be very valuable.
I understand that each word could be compared with an encyclopedia or some such, but there must be a better way. Also, how could the algorithm know to combine "St." and "Mark" and to see "Alexandria, in Egypt" as "Alexandria, Egypt"?
I noticed that the links provided here are a bit dated. One project that is still active (and free [correction: GPL, so free for non-commercial]) is the Stanford Natural Language Processing (NLP) libraries (https://nlp.stanford.edu/software/). You can demo their Named Entity Recognition (NER) here. It even has a .NET wrapper (http://sergey-tihon.github.io/Stanford.NLP.NET/StanfordNER.html).
Microsoft also offers many similar algorithms through Azure Cognitive Services. You would be most interested in Entity Linking (https://azure.microsoft.com/en-us/services/cognitive-services/entity-linking-intelligence-service/)
I hope this helps future viewers.
You are best off using some kind of API that can perform this kind of entity matching, as what you are asking for is potentially very complex and requires some degree of semantic textual analysis backed up by a large database. I'd recommend looking at APIs such as:
OpenCalais - English Semantic Metadata: Entity/Fact/Event Definitions and Descriptions web-service
Calais supports a rich set of semantic metadata, including entities, events and facts.
Alchemy API - Entity Extraction API
AlchemyAPI is capable of identifying people, companies, organizations, cities, geographic features, and other typed entities within your HTML, text, or web-based content. We employ sophisticated statistical algorithms and natural language processing technology to analyze your information, extracting the semantic richness embedded within.
I am looking for an LDIF parser for C#. I am trying to parse an LDIF file so that I can check that objects don't exist before adding them. Adding them when they already exist (using ntdsSchemaAdd) causes an entry in the error logs.
A quick web search revealed: http://wiki.github.com/skradel/Zetetic.Ldap/. They have provided a .NET API.
From the page:
Zetetic.Ldap is a .NET library for .NET 2 and above, which makes it easier to work with directory servers (like Active Directory, ADAM, Red Hat Directory Server, and others). Some of the key features of Zetetic.Ldap are:
1. LDIF file parsing and generation – Read and write the file format used for moving data around between directory systems
2. LDAP Entry-oriented API with change tracking – Create and modify directory objects in a more natural way
3. LDAP Schema interrogation – Quick programmatic access to the kinds of objects and fields your directory server understands. Learn if an attribute is a string, a number, a date, etc., without lots of manual research and re-parsing
4. LDIF Pivoter – Turn an LDIF file into a (comma or tab-delimited) flat file for analysis or loading into systems that don’t speak LDIF
We built the Zetetic.Ldap library to make directory projects and programming faster and easier, and release it here in the hopes that others will find it useful too. As far as we know, this is the only .NET library that really understands the LDIF specification.
Download link: http://github.com/downloads/skradel/Zetetic.Ldap/Zetetic.Ldap_20090831.zip
I would parse it myself.
If you look at the LDIF RFC for the EBNF, you'll see that it's not a very complex grammar.
I've reliably parsed large amounts of LDIF with regexes before, though your mileage may vary.
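As a rough idea of what such a hand-rolled parser might look like, here is a minimal sketch. LdifSketch and its method names are made up for illustration. It only covers the common cases: folded lines, comments, the optional version header, and plain or base64 ("::") values. It does not handle URL values ("attr:< ...") and treats change records as ordinary attributes.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical minimal LDIF reader; not a complete implementation of the RFC grammar.
public static class LdifSketch
{
    // Attribute name, then ":" (plain value) or "::" (base64 value), then the value.
    private static readonly Regex AttrLine =
        new Regex(@"^([A-Za-z0-9;.\-]+)(::?)\s*(.*)$");

    // Returns one dictionary per record: attribute name -> list of values.
    public static List<Dictionary<string, List<string>>> Parse(string ldif)
    {
        var records = new List<Dictionary<string, List<string>>>();

        // Normalize line endings, then unfold: a line that starts with a single
        // space continues the previous line.
        string unfolded = ldif.Replace("\r\n", "\n").Replace("\n ", "");

        Dictionary<string, List<string>> current = null;
        foreach (string line in unfolded.Split('\n'))
        {
            if (line.Length == 0) { current = null; continue; } // blank line ends a record
            if (line[0] == '#') continue;                        // comment line

            Match m = AttrLine.Match(line);
            if (!m.Success) continue;                            // not handled by this sketch

            string name = m.Groups[1].Value;
            string value = m.Groups[3].Value;
            if (m.Groups[2].Value == "::")
                value = Encoding.UTF8.GetString(Convert.FromBase64String(value));

            // The optional "version: 1" header is not part of any record.
            if (current == null &&
                string.Equals(name, "version", StringComparison.OrdinalIgnoreCase))
                continue;

            if (current == null)
            {
                current = new Dictionary<string, List<string>>(StringComparer.OrdinalIgnoreCase);
                records.Add(current);
            }

            List<string> values;
            if (!current.TryGetValue(name, out values))
            {
                values = new List<string>();
                current[name] = values;
            }
            values.Add(value);
        }
        return records;
    }
}
```

To check whether an object already exists before adding it, the "dn" value of each parsed record could be collected into a set and compared against the directory.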
In my next project I will have to implement an automation solution to test a hardware device. Basically, the test involves an industrial robotic arm picking up a device to be tested, holding it at some specified position, and then using a series of other devices like motors and sensors to exercise several areas of the product under test.
So my test automation solution will need to communicate with several controllers, either issuing actuation commands or getting information from sensors.
The first idea that comes to mind is to define the sequence of steps for each controller in a custom XML language. In this language I'd need to define primitives such as "MOVE", "IF", "WAIT", "SIGNAL", etc. These primitives would be used to define the operation script for each controller. Each controller runs asynchronously but eventually gets synchronized, hence the need for things like "WAIT" and "SIGNAL".
I did a basic search on Google and the only thing I was able to find was really old stuff (I don't need to comply with industrial standards, it's a small venture) or XML dialects that were designed for something else.
The question is: do you know of any XML standard that I could use instead of creating my own?
EDIT: I'm currently investigating a plan execution language by NASA that looks promising. Its name is PLEXIL. If anybody knows anything about it, please feel free to contribute.
Have you reviewed PARSL? It's an XML based robotic scripting language which incorporates sensors, looping, and conditional behavior.
XML can be used to create your 'own standard'. You can define things using a DTD (Document Type Definition) file. In this manner you can specify exactly what your XML has to look like.
The DTD is a schema that contains the structure and constraints you want to put on your XML file. Have a look here on Wikipedia for more info.
Hope this is helpful!
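To make the DTD idea a bit more concrete, here is a small sketch that validates a controller script against an inline DTD using .NET's XmlReader. The element and attribute names (script, move, wait, signal, controller, axis, and so on) are invented for this example, loosely following the primitives mentioned in the question. They are not taken from any existing standard.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;
using System.Xml.Schema;

public static class DtdSketch
{
    // Hypothetical DTD for a tiny controller script language.
    public const string Dtd =
        "<!DOCTYPE script [\n" +
        "  <!ELEMENT script (move | wait | signal)*>\n" +
        "  <!ATTLIST script controller CDATA #REQUIRED>\n" +
        "  <!ELEMENT move EMPTY>\n" +
        "  <!ATTLIST move axis CDATA #REQUIRED position CDATA #REQUIRED>\n" +
        "  <!ELEMENT wait EMPTY>\n" +
        "  <!ATTLIST wait for CDATA #REQUIRED>\n" +
        "  <!ELEMENT signal EMPTY>\n" +
        "  <!ATTLIST signal name CDATA #REQUIRED>\n" +
        "]>\n";

    // Validates the given root element against the DTD above and
    // returns the validation error messages (empty list = valid).
    public static List<string> Validate(string xmlBody)
    {
        var errors = new List<string>();
        var settings = new XmlReaderSettings();
        settings.DtdProcessing = DtdProcessing.Parse;   // allow the inline DOCTYPE
        settings.ValidationType = ValidationType.DTD;   // validate against it
        settings.ValidationEventHandler += (sender, e) => errors.Add(e.Message);

        using (XmlReader reader = XmlReader.Create(new StringReader(Dtd + xmlBody), settings))
        {
            while (reader.Read()) { } // reading the whole document triggers validation
        }
        return errors;
    }
}
```

A script such as `<script controller='arm'><move axis='x' position='10'/><signal name='done'/></script>` should pass, while an undeclared element like `<jump/>` should be reported as an error. Text that is not well-formed XML at all throws an XmlException rather than going through the handler.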
What libraries are there to write C# internationalized applications?
Typical functionalities that should be contained in the library:
Validation of country specific data (e.g. VAT numbers, phone numbers, addresses,...)
Validation of bank and financial coordinates (e.g. Credit Card numbers, IBAN,...)
Language-specific functionalities (e.g. numbers to words to numbers, summarize,...)
Language specific content filtering (e.g. swearword filtering...)
An example of such libraries in Perl would be the Internationalization/Locale section of CPAN.
What C# solutions are available?
Note: I am not looking for an introduction to the System.Globalization namespace :)
Note 2: Should I assume that there are no options available? Is anyone interested in joining forces to create one?
Note 3: Edit to make the question appear on front page in hope of more answers. This isn't such a hard question, how is it possible that Stackers don't ever do i18n?
One project that is working towards a database of globalization, internationalization and localization knowledge is the Unicode Common Locale Data Repository, based on the old ICU project at IBM.
As it is a database of XML data it doesn't contain any .NET-specific code, but as a body of knowledge it is very good.
Only a smallish subset of it is in the .NET Framework. Microsoft hasn't gone near any of the supplemental material, such as postcode formats and number spelling (for check/cheque amounts). CLDR also includes standard time zone names (from the Olson/tz distribution), with mappings to the Windows-specific names. Some of the hierarchical locale-specific behaviours also have better support.
I wouldn't say that no one does i18n, but I don't know of any generic tools that can be used for every project. Maintaining a database with all of the information you are looking for would be an epic project. It sounds like what you're looking for isn't a specific C# library, but more a collection of information online that you can draw from. If you were able to find a repository of swear words in various languages (for example), it would be trivial for you to use this in C#. I think that a solution that wraps up all of your requirements into an easy-to-use assembly is going to be impossible to find.
Have a look at
http://www.microsoft.com/globaldev/getwr/dotneti18n.mspx
and
http://www.dotneti18n.com/
String to number and vice versa can be done as follows:
// requires: using System.Globalization;
CultureInfo culture = new CultureInfo(locale);
int number = Convert.ToInt32(myString, culture.NumberFormat);
string str = Convert.ToString(myNumber, culture.NumberFormat);
As to checking VAT numbers and addresses, I'm interested in that too, but haven't found anything useful so far.
Not exactly a "library", per se, but I've actually run into a great service (for pay), by a company called E4X (former client of mine).
What they provide is complete localization of your ecommerce site, including language translations, currency exchanges, local billing and handling of financial transactions including region-specific taxes etc, and more. They even deal with logistics of physical shipping...
Worth looking into, for an ecommerce business. Let 'em know I sent you... ;-)
That's a huge endeavor. Let's start with one simple problem: phone numbers. Google's libphonenumber library at http://code.google.com/p/libphonenumber/ has a C# port at https://bitbucket.org/pmezard/libphonenumber-csharp with notes at http://blog.thekieners.com/2011/06/06/using-googles-libphonenumber-in-microsoft-net-with-c/. It appears to be a good library for handling both US and international numbers.
I'm using Visual Studio (2005 and up). I am looking into making an application where the user can change the language for all menus, input formats and such. How would I go about doing this? I suppose there is some complete feature within .NET that can help me with this.
I need to take the following into account (and fill me in if I miss some obvious stuff)
Strings (menus, texts)
Input data (parsing floats, dates, etc..)
Should be easy to add support for another language
I'm not an expert with .NET by any means, but localization is never as simple as "swapping out String values" or "changing date formats". There is much more to take into consideration, such as layout and proper text placement.
Take Chinese, for example. Traditionally it can be written top to bottom rather than left to right. If properly localized, the app should take that into account.
http://msdn.microsoft.com/en-us/library/y99d1cd3(VS.80).aspx seems to be a good start though if you're dealing with Windows Forms.
The classic recipe is: design the app with no native language but a localization facility, and develop an initialization into one language (e.g., English). So you build the app and localize it into English every night; without the localization step it would not be usable. Do that well, and the resources for the initial sample localization can be replaced with those for any other language. Take into account non-roman scripts from the beginning. It's much cleaner to have a no-language app that always requires localization rather than a language-specific app that needs to have its native language subtracted and a replacement added.
For strings you should just separate your strings from your code (having an XML/DLL that will transform string IDs to real strings is one way to go). However you do need to make sure that you are supporting double byte characters for some languages (this is relevant if you use C/C++).
For input data, what you want is to have different locales. In Java this is relatively easy, and if you use C# it is probably quite easy too. In C/C++ I don't really know. The basic idea is that the input parsers should differ based on the locale selected at that time. So each field (textfield, textbox, etc.) must have an abstract parser that is then implemented by a different class depending on the locale (right to left, double byte, etc.).
Check the Java implementation for details on how they did it. It is quite functional.
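In C#, the locale-dependent parsing described here can lean on CultureInfo rather than hand-written parser classes. The sketch below is just one way to wrap that: LocaleInputSketch and its method names are made up, and the culture names passed in (for example "de-DE" or "en-US") are assumed to be available on the machine.

```csharp
using System;
using System.Globalization;

// Hypothetical helper that parses user input according to a chosen culture.
public static class LocaleInputSketch
{
    // Parses a floating-point number using the decimal separator of the culture.
    public static bool TryParseNumber(string input, string cultureName, out double value)
    {
        CultureInfo culture = CultureInfo.GetCultureInfo(cultureName);
        return double.TryParse(input, NumberStyles.Float, culture, out value);
    }

    // Parses a date using the date patterns of the culture.
    public static bool TryParseDate(string input, string cultureName, out DateTime value)
    {
        CultureInfo culture = CultureInfo.GetCultureInfo(cultureName);
        return DateTime.TryParse(input, culture, DateTimeStyles.None, out value);
    }
}
```

With this, "3,5" parsed as "de-DE" and "3.5" parsed as "en-US" should both yield the same number, and the application only has to keep track of which culture the user selected.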
You definitely need to be using the .NET ResourceManager and the resx XML file format; however, there are a number of approaches to using these.
It really depends on what you want to achieve. I wanted a single XML resource file (for each supported language) that could be modified by anyone. I created a helper class that loaded the global resource file into a ResourceManager (once only), with a helper function that returns the required resource for a given name. The only disadvantage of this approach was that I could not leverage dynamic binding of resources to properties.
I found this better and easier to manage than multiple or embedded resource files for every form. Additionally, exactly the same approach can be used in an ASP.NET application. I also found that this approach makes outsourcing the translation of resources and shipping language packs to customers much more manageable.
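A helper along these lines might look roughly like the sketch below. It is not the original helper class: the class name Strings and the resource base name "MyApp.Resources.Strings" are assumptions, and it uses resources compiled into the assembly rather than a loose XML file. When no matching resource is found it falls back to returning the key itself, which makes missing translations easy to spot.

```csharp
using System;
using System.Globalization;
using System.Resources;

// Hypothetical single entry point for all localized strings.
public static class Strings
{
    // Created once; "MyApp.Resources.Strings" is an assumed base name
    // (it would correspond to a Strings.resx file in a Resources folder).
    private static readonly ResourceManager Manager =
        new ResourceManager("MyApp.Resources.Strings", typeof(Strings).Assembly);

    // Returns the string for the current UI culture, or the key if it is missing.
    public static string Get(string name)
    {
        try
        {
            string value = Manager.GetString(name, CultureInfo.CurrentUICulture);
            return value ?? name;
        }
        catch (MissingManifestResourceException)
        {
            // No resource set could be found at all: fall back to the key.
            return name;
        }
    }
}
```

Calling code would then use something like Strings.Get("File_Open") wherever a literal menu or label text would otherwise appear.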
Microsoft's recommended approach is to use satellite assemblies, as described in Packaging and Deploying Resources. If you're using a ResourceManager to load resources, .NET will load the correct resources for the CurrentUICulture. This defaults to the user's current UI language setting in Windows.
It is possible to localize Windows Forms either through Visual Studio or an external tool, WinRes.exe. This article describes WinRes and how to use Visual Studio to localize the form.