Is there any C# algorithm by which personal and place names can be extracted from text?
e.g., given the following text:
St. Mark died at Alexandria, in Egypt. He was martyred, I think.
However, that has nothing to do with my legend. About the founding of
the city of Venice--
(taken from "The Innocents Abroad" by Mark Twain)
...is there any way to extract:
St. Mark
Alexandria (or better yet, "Alexandria, Egypt")
Venice
?
I realize that there is no way to get 100% accuracy (where all place names and personal names are captured, and no "false positives" are added), but 80% accuracy could be very valuable.
I understand that each word could be compared with an encyclopedia or some such, but there must be a better way. Also, how could the algorithm know to combine "St." and "Mark" and to see "Alexandria, in Egypt" as "Alexandria, Egypt"?
I noticed that the links provided here are a bit dated. One project that is still active (and free [correction: GPL-licensed, so free for non-commercial use]) is the Stanford Natural Language Processing (NLP) libraries (https://nlp.stanford.edu/software/). You can try their Named Entity Recognition (NER) in an online demo on that site. It even has a .NET wrapper (http://sergey-tihon.github.io/Stanford.NLP.NET/StanfordNER.html).
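To give a feel for the wrapper, here is a minimal sketch of tagging the question's text with it. It assumes the Stanford.NLP.NER package plus the separately downloaded English 3-class classifier model; the model path below is a placeholder, not something the wrapper ships with.

// Minimal sketch: NER via the Stanford.NLP.NET wrapper.
// Assumes the Stanford.NLP.NER package and a downloaded English model;
// the model path is a placeholder.
using System;
using edu.stanford.nlp.ie.crf;

class NerDemo
{
    static void Main()
    {
        // Pre-trained 3-class model: PERSON / LOCATION / ORGANIZATION.
        var classifier = CRFClassifier.getClassifierNoExceptions(
            @"classifiers\english.all.3class.distsim.crf.ser.gz");

        var text = "St. Mark died at Alexandria, in Egypt. " +
                   "About the founding of the city of Venice--";

        // Entities come back wrapped in inline XML tags, e.g.
        // <PERSON>St. Mark</PERSON> ... <LOCATION>Alexandria</LOCATION>
        Console.WriteLine(classifier.classifyWithInlineXML(text));
    }
}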
Microsoft also offers many similar algorithms through Azure Cognitive Services. You would be most interested in Entity Linking (https://azure.microsoft.com/en-us/services/cognitive-services/entity-linking-intelligence-service/)
I hope this helps future viewers.
You are best off using some kind of API that will be able to perform this kind of entity matching, as what you are asking for is potentially very complex and requires some degree of semantic textual analysis backed by a large database. I'd recommend looking at APIs such as:
OpenCalais - English Semantic Metadata: Entity/Fact/Event Definitions and Descriptions web-service
Calais supports a rich set of semantic metadata, including entities, events and facts.
Alchemy API - Entity Extraction API
AlchemyAPI is capable of identifying people, companies, organizations, cities, geographic features, and other typed entities within your HTML, text, or web-based content. We employ sophisticated statistical algorithms and natural language processing technology to analyze your information, extracting the semantic richness embedded within.
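Both of these are ordinary HTTP services. As a rough, hypothetical sketch of the calling pattern only (the endpoint URL, parameter names, and key below are placeholders, not the real Calais or Alchemy request format; check the provider's docs):

// Hypothetical sketch of posting text to a REST entity-extraction service.
// The URL, parameter names, and key are placeholders.
using System;
using System.Collections.Generic;
using System.Net.Http;
using System.Threading.Tasks;

class EntityApiDemo
{
    static async Task Main()
    {
        using (var client = new HttpClient())
        {
            var form = new FormUrlEncodedContent(new[]
            {
                new KeyValuePair<string, string>("apikey", "YOUR_KEY"),
                new KeyValuePair<string, string>("text",
                    "St. Mark died at Alexandria, in Egypt.")
            });

            // The service typically returns entities as JSON or XML.
            var response = await client.PostAsync(
                "https://api.example.com/entities", form);
            Console.WriteLine(await response.Content.ReadAsStringAsync());
        }
    }
}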
Related
This question may be a bit too generic and abstract, because I do not know what I'm looking for yet. I do not have much experience with patterns. I need to know what pattern/technique I can use to categorize patients in a medical app.
Let's say the hospital has a documentation app with 10 data fields. Dates, numbers, selects, multi-selects.
Every patient that visits the hospital will have its own specific information.
After input and analysis, each patient must be placed in a category.
Each category is determined by a set of rules. Those rules are created based on some or all fields defined above and their individual value.
In reality I'm speaking about hundreds of patients and hundreds of input fields. So I'm trying to find out whether there is some traditional way of doing this (something more generic) or if I'm stuck writing dozens of "IF and Switch" statements.
PS: this is not a machine learning task
This sounds like a task for a type of algorithm called a rules engine. At their simplest, rules engines do look like a collection of IF ... THEN ... (ELSE ...) statements, but they also usually have features such as the elimination of redundant branches, cycle and contradiction detection, and so on.
Examples of software packages that provide this are Drools and the BizTalk Business Rules Engine
In addition to Tom W's answer, on a lower level you may find the Specification pattern useful.
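To make that concrete, here is a minimal sketch of the Specification pattern applied to this problem. The Patient fields and the example rules are made up; the point is that each rule becomes a small, composable object instead of a branch in a giant if/switch block.

// Minimal sketch of the Specification pattern.
// The Patient fields and example rules are hypothetical.
using System.Collections.Generic;

public class Patient
{
    public int Age { get; set; }
    public double SystolicPressure { get; set; }
}

public interface ISpecification<T>
{
    bool IsSatisfiedBy(T candidate);
}

// Each rule is one small object...
public class AdultSpec : ISpecification<Patient>
{
    public bool IsSatisfiedBy(Patient p) => p.Age >= 18;
}

public class HighPressureSpec : ISpecification<Patient>
{
    public bool IsSatisfiedBy(Patient p) => p.SystolicPressure > 140;
}

// ...and rules compose instead of nesting.
public class AndSpec<T> : ISpecification<T>
{
    private readonly ISpecification<T> _a, _b;
    public AndSpec(ISpecification<T> a, ISpecification<T> b) { _a = a; _b = b; }
    public bool IsSatisfiedBy(T c) => _a.IsSatisfiedBy(c) && _b.IsSatisfiedBy(c);
}

public static class Categorizer
{
    // Categories are (name, rule) pairs evaluated in order; first match wins.
    public static string Categorize(
        Patient p, IEnumerable<(string Name, ISpecification<Patient> Rule)> categories)
    {
        foreach (var c in categories)
            if (c.Rule.IsSatisfiedBy(p))
                return c.Name;
        return "Uncategorized";
    }
}

The (name, rule) pairs can then be built up from configuration or a database table, which is roughly the idea that a full rules engine like Drools formalizes.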
I've been tasked with creating (or finding something that already works) a centralized server with an API that can return a PDF file given some data and the name of a template. It has to be a robust, enterprise-ready solution. The goal is as follows:
A series of templates for different company things. (Invoices, Orders, Order Plannings, etc)
A way of returning a PDF from external software (Websites, ERP, etc)
Can be an existing, off-the-shelf enterprise solution, but they are pressing for a custom one.
Can be any language, but we don't have any dedicated Java programmers in-house. We are PHP / .NET; some of us dabble in Java, but the learning curve could be a little steep.
So, I've been reading. One way we've thought it may be possible is installing a JasperReports Server, creating the templates in Jaspersoft Studio, then using the API to return the PDF files. A colleague favors this option because it's mostly done already, but first, it's Java, and second, I think it's like using a hammer to crack a nut.
Another option we've been toying with is to use C# with iTextSharp to create a server, and create our own API that returns exactly the PDF with the data we need. Doing this we could have some benefits, like using the database connector we have already made and extracting most of the data from the database, instead of having to pass around a big chunk of data. But as it is bare, it doesn't really have a templating system; we'd have to create something with the XMLWorker or with C# classes, and it's not as "easy" as drag and drop. For this case I've been reading about XFA too, but the documentation on the iText site is misleading and unclear.
I've been also reading about some other alternatives, like PrinceXML, PDFBox, FOP, etc, but the concept will be the same as iText, we'd have to do it ourselves.
My vote, even if it's more work, is to go the iText route and use HTML / CSS for the templates, but my colleagues claim that the templates should be easy to change, possibly every other week (I doubt it), and that HTML / CSS would be too much work.
So the real question is, how do other businesses approach this? Did I leave anything out of my search? Is there an easier way to achieve this?
PS: I didn't know if SO would be the correct place for this question, but I'm mostly lost and risking a "too broad question" or "off topic" tag doesn't seem that bad.
EDIT:
Input should be sent with the same request. If we go the C# route, we can get ~70% of the data from the ERP directly, but either way it should accept a POST request with some data (the template, and the data needed for that template, such as invoice data, or just the invoice ID if we have access to the ERP).
Output should be a PDF (not interested in other formats, just PDF).
Templates will be updated only by IT. (Mostly us, the development team).
Performance-wise, I don't know how much muscle we'll need, but right now, without any increase, we are looking at ~500-1000 PDFs daily, mostly printed from 10:00 to 10:30 and from 12:00 to 13:00. Then maybe 100 more over the rest of the day.
Peak load should not be more than ~10,000 daily when the planets align and it's sales season (twice a year). That should be our ceiling for the years to come.
The templates have some requirements:
Have repeating blocks (invoice lines, for example).
Have images as background, as watermark and as blocks.
Have to be multi language (translatable, with the same data).
Have some blocks that are shown only when a condition is met.
Blocks dependent on the page (PDF header / page header / page footer / PDF footer)
Templates may have to do calculations over some of the data. I don't think we'll ever need this, but it's something the company may ask for in the future.
The PDFs don't need to be stored, as we have a document management system, maybe in the future we could link them.
Extra data: Right now we are using "Fast-Reports v2 VCL"
Your question shows you've been considering the problem in detail before asking for help so I'm sure SO will be friendly.
Certainly one thing you haven't detailed much in your description is the broader functional requirements. You mentioned cracking a nut with a hammer, but I think you are focused mostly on the technology/interfacing. If you consider your broader requirements for the documents you need to create, and the variables involved, it might be a bigger nut than you think.
The approach I would suggest is to prototype solutions, assuming you have some room to do so. From your research, pick maybe the best three to try, which may well include the custom build you have in mind. Put them through some real use cases end to end: as rough as possible, but realistic. One or two key documents you need to output should be used across all solutions. Make sure you are covering the most important or most common requirements in terms of:
Input Format(s) - who can/should be updating templates. What is the ideal requirement and what is the minimum requirement?
Output Requirement(s) - who are you delivering to and what formats are essential/desirable
Data Requirement(s) - what are your sources of data and how hard/easy is it to get data from your sources to the reporting system in the format needed?
Template feature(s) - if you are using templates, what features do the templates need? This includes input format(s), but I was mostly thinking of features of the engine like repeating/conditional content, image insertion, table manipulation, etc.; i.e., are your invoices, orders, and planning documents plain or complex?
API requirements - do you have any broader API requirements? You mentioned you use PHP, so a PHP library or a web service is likely to be a good starting point.
Performance - you haven't mentioned any performance characteristics but certainly if you are working at scale (enterprise) it would be worth even rough-measuring the throughput.
iText and Jasper are certainly enterprise grade engines you can rely on. You may wish to look at Docmosis (please note I work for the company) and probably do some searches for PDF libraries that use templates.
A web service interface is possibly a key feature you might want to look at. A REST API is easy to call from PHP and virtually any technology stack. It means you will likely have options about how you can architect a solution, and it's typically easy to prototype against. If you decide to go down the prototyping path and try Docmosis, start with the cloud service since you can prototype/integrate very quickly.
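As an illustration of that interface, a minimal ASP.NET Web API endpoint might look like the sketch below; PdfRequest and PdfGenerator are hypothetical placeholders for your own request type and for whichever PDF engine you choose.

// Hypothetical sketch: a Web API endpoint that accepts a template name
// plus data and returns a generated PDF. PdfRequest and PdfGenerator
// are placeholders, not a real library's API.
using System.Net;
using System.Net.Http;
using System.Web.Http;

public class PdfRequest
{
    public string Template { get; set; }
    public string DataJson { get; set; }
}

// Stand-in for whichever PDF engine you pick (iText, Jasper, etc.).
public static class PdfGenerator
{
    public static byte[] Render(string template, string dataJson)
        => new byte[0]; // real engine call goes here
}

public class PdfController : ApiController
{
    [HttpPost]
    public HttpResponseMessage Generate([FromBody] PdfRequest request)
    {
        byte[] pdfBytes = PdfGenerator.Render(request.Template, request.DataJson);

        var response = new HttpResponseMessage(HttpStatusCode.OK)
        {
            Content = new ByteArrayContent(pdfBytes)
        };
        response.Content.Headers.ContentType =
            new System.Net.Http.Headers.MediaTypeHeaderValue("application/pdf");
        return response;
    }
}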
I hope that helps.
From my years of experience working with PDF, I think you should pay attention to the following points:
The performance: you will likely get the fastest performance with API-based PDF generation, compared to HTML-to-PDF or XML-to-PDF generation (because of the additional conversion layer involved); a minimal sketch follows this list. Considering peaks in the load, you may want to calculate the cost of scaling up generation by adding more servers (and estimate the cost of the additional servers or resources required per additional PDF file per day).
Ease of iterations and changes: how often will you need to adjust templates? If you are going to create templates just once (with some iterations) and then no changes are required, you should be OK just coding them against the API. Otherwise, you should strongly consider using HTML or XML templates to simplify changes and reduce their complexity;
Search and indexing: if you need to search across the created documents, you should consider storing indexes of the generated documents, or perhaps storing the source data in XML along with the generated PDF file;
Long-term preservation: you should conform to the PDF/A sub-format if you are looking for long-term digital preservation of your documents. See the veraPDF open source initiative, which you may use to validate generated and incoming PDF documents for conformance to PDF/A requirements;
Preserving source files: the PDF format itself was not designed to be edited (though some PDF editors do exist), so consider preserving the source data so that you can regenerate PDF documents later, and perhaps introduce additional output formats later.
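To make the API-based option from the first point concrete, here is a minimal iTextSharp (v5) sketch that generates a document directly in code; the file name and content are placeholders.

// Minimal sketch: generating a PDF directly with the iTextSharp (v5) API.
// The file name and content are placeholders.
using System.IO;
using iTextSharp.text;
using iTextSharp.text.pdf;

class PdfDemo
{
    static void Main()
    {
        var doc = new Document(PageSize.A4);
        using (var fs = new FileStream("invoice.pdf", FileMode.Create))
        {
            PdfWriter.GetInstance(doc, fs);
            doc.Open();
            doc.Add(new Paragraph("Invoice #12345"));

            // A repeating block (invoice lines) rendered as a table.
            var table = new PdfPTable(2);
            table.AddCell("Widget");
            table.AddCell("9.99");
            doc.Add(table);

            doc.Close();
        }
    }
}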
I require POS tagging for my files in the corpus.
I have successfully followed the installation instructions for SharpNLP.
I am using the binary version
I created a new c# project in: E:\sharp\sharpapp
location of Models Folder is: E:\sharp\sharpapp\bin\Models
location of my SharpNlp Binary is: E:\sharp\SharpNLP-1.0.2529-Bin
I have also followed the instructions to modify both .config files "ParseTree.Exe" and "ToolsExamples.Exe"
Now in my C# project I have a class called tagging.cs where I have to access my corpus text files and do POS tagging for those files. Can anybody help me with how I can make use of SharpNLP to do so?
Please provide steps to do so.
In a nutshell, SharpNLP is
a port to C# of OpenNLP Tools and OpenNLP MaxEnt
a connector to WordNet
a set of pre-computed models, mostly for the English language
utility modules such as integration with SQLite
It should be noted that the port of the OpenNLP libraries is relatively informal, with various class and property name changes, possibly loose preservation of features and semantics and no apparent connection with the original Java projects' lifecycle. This situation will likely ensure that in time the OpenNLP portion of SharpNLP will be more akin to distant cousins than twin sisters...
Nevertheless, it is possible to use examples and documentation from OpenNLP to complement the relatively thin support material available with SharpNLP. Between the source code of SharpNLP and resources like the OpenNLP API reference and the OpenNLP wiki, one can generally map things across and adapt accordingly.
A loose guide could be the study of this particular source file, which makes use of OpenNLP in a way that seems close to what you may need. Note the name changes between OpenNLP and SharpNLP; for example, the POSTaggerME class becomes MaximumEntropyPosTagger, and the Parse() method and its overloads turn into TagSentence() and such.
A more general hint is to understand the sequence of steps typically necessary to perform POS tagging. This is a very high-level, approximate description, but, I think, a useful one.
1. Get the text to be tagged: string(s) of text.
2. Initialize a text parser (tokenizer).
3. Parse the text into an "array" (or other container) of individual tokens, i.e. words and punctuation characters.
4. Initialize the POS tagger; in particular, tell it which model it should use.
5. Feed the [ordered] sequence of tokens to the POS tagger.
6. Ta-dah! Use the POS tags for the eventual purpose of your NLP application.
Note how the above sequence assumes that the model is readily available.
The model is a representation of the statistical "profile" of text in general, obtained by training the tagger with a set of text that has already been tagged.
SharpNLP comes with a model for generic English, but in order to tag other languages, or if the specific corpus to be tagged belongs to a particular domain (say, medical reports or tweets), it may be preferable to re-train the tagger to improve its precision.
Open/SharpNLP, like most POS taggers, whether used stand-alone or through their API, typically include features to train them (i.e. to produce a model given a sample set of already-tagged text) and also to verify the quality of the model/tagger so produced (i.e. to compare the tags produced on a test set with the tags expected for that set).
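In (sketch) code, the sequence above might look roughly like this with SharpNLP. As noted, SharpNLP renames much of OpenNLP and some builds expose TagSentence(string) instead of (or alongside) Tag(string[]); the model file names below are assumed from the standard Models folder, so treat this as a template rather than copy-paste code.

// Rough sketch of the tokenize-then-tag sequence with SharpNLP.
// Class names, method names, and model file names are assumptions that
// may vary between SharpNLP builds.
using System;

class TaggingDemo
{
    static void Main()
    {
        string models = @"E:\sharp\sharpapp\bin\Models\";

        // Steps 1-3: get the text and split it into tokens.
        var tokenizer = new OpenNLP.Tools.Tokenize.EnglishMaximumEntropyTokenizer(
            models + "EnglishTok.nbin");
        string[] tokens = tokenizer.Tokenize("St. Mark died at Alexandria.");

        // Step 4: initialize the tagger, telling it which model to use.
        var tagger = new OpenNLP.Tools.PosTagger.EnglishMaximumEntropyPosTagger(
            models + "EnglishPOS.nbin");

        // Steps 5-6: feed the ordered tokens in and use the tags.
        string[] tags = tagger.Tag(tokens);
        for (int i = 0; i < tokens.Length; i++)
            Console.WriteLine(tokens[i] + "/" + tags[i]);
    }
}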
Kindly read through the article that I have written for this. It will give you a detailed step-by-step method with sample code snippets.
Easy way of Integrating SharpNLP into your project in Visual Studio
I hope this was useful.
I'm creating a multi-language website with at least 5 languages. What should I consider?
On the technical front, not a lot: you can use a framework like Zend, Kohana, or Rails, which usually have the ability to replace the content with tags and then fill each tag with the language of choice at run time. The different languages reside in appropriately named directories and can be triggered by the browser language tag or another mechanism. If you are not using a framework with this facility, then study one to see how it is done.
After that and in no particular order.
Why multilingual? You really need a compelling reason to do it, as the workload you are taking on is large, complex, and onerous. I know all the reasons why people like sites in their native language, but given the investment a multilingual site requires, you need to be making a proper return on it, not just doing it for the sake of it.
Localising (L10N), or internationalisation (i18n), is not just about language. It is about cultural differences. Anglo-Saxon cultures like cool, restrained, sans-serif sites. Latin and Latin American cultures like more vibrant colours and cursive typefaces. And so on. So you need a mechanism that will change the CSS for each language as well (to be truly effective, you do).
Who is doing the translation? Remember, it is an idiomatic translation you need; the Google Translate API will not cut it, and you need a native speaker translating into the target language. So, for example, if you are producing FIGS (French, Italian, German, Spanish) from an English original, you need a translator for English to French, English to German, English to Italian, and English to Spanish (see the costs mounting?). A good bureau will provide all this for you and manage the process, though.
Proofreading. Can you speak five languages well enough to check that the above work is correct?
Maintenance. Again, assuming English is the base language, whenever there is a new page, a page rewrite, or even a typo, you need to go through the above process and update the site, so you need a good workflow and process-control system to ensure that the changes and updates work effectively. The ongoing maintenance can be crippling in time and cost; work it out before you start.
Beware of advice from people who localise programmes/applications. It is not remotely the same thing.
Many solutions actually use separate web sites for each localisation rather than an all-in-one approach. This can be counter-intuitive when we want to put everything into one "technical" package. However, by separating the sites you can easily cope with different styles, character sets, text direction (LTR/RTL), and so on. You can stagger updates and manage the workflow more effectively; adding a new site is far, far simpler; and it allows you to use the different URLs that may be required and to optimise each site for SEO.
Please see: Best Practices for Developing World-Ready Applications
Walkthrough: Using Resources for Localization with ASP.NET
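As a minimal illustration of the resource-based approach those articles describe (the base name "MyApp.Strings" and the key "WelcomeMessage" are made-up examples):

// Minimal sketch of resource-based localization in .NET.
// Assumes resource files Strings.resx and Strings.fr.resx exist in the
// project; the base name and key are hypothetical.
using System;
using System.Globalization;
using System.Resources;
using System.Threading;

class LocalizationDemo
{
    static void Main()
    {
        // Pick the UI culture, e.g. from the browser's Accept-Language header.
        Thread.CurrentThread.CurrentUICulture = new CultureInfo("fr-FR");

        var resources = new ResourceManager("MyApp.Strings",
                                            typeof(LocalizationDemo).Assembly);

        // Falls back to the neutral (default) resources when no
        // culture-specific entry exists.
        Console.WriteLine(resources.GetString("WelcomeMessage"));
    }
}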
When I developed bi-directional web applications, I found the following practices helped a lot:
To make your page as easy to globalize as possible, follow these best practices:
❑ Avoid using absolute positioning and sizes for controls.
❑ Use the entire width and height of forms.
❑ Size elements relative to the overall size of the form.
❑ Use a separate table cell for each control.
❑ Avoid enabling the NoWrap property in tables.
❑ Avoid specifying the Align property in tables.
source: MCTS Self-Paced Training kit: Microsoft .NET Framework 2.0 Web-based client development.
As a general tip, I found using tables instead of just DIVs is very helpful.
What libraries are there to write C# internationalized applications?
Typical functionalities that should be contained in the library:
Validation of country specific data (e.g. VAT numbers, phone numbers, addresses,...)
Validation of bank and financial coordinates (e.g. Credit Card numbers, IBAN,...)
Language-specific functionalities (e.g. numbers to words to numbers, summarize,...)
Language specific content filtering (e.g. swearword filtering...)
An example of such libraries in Perl would be the Internationalization/Locale section of CPAN.
What C# solutions are available?
Note: I am not looking for an introduction to the System.Globalization namespace :)
Note 2: Should I deduce that there are no options available? Is someone interested in joining forces to create one?
Note 3: Edited to make the question appear on the front page in the hope of more answers. This isn't such a hard question; how is it possible that Stackers never do i18n?
One project that is working towards a database of globalization, internationalization and localization knowledge is the Unicode Common Locale Data Repository, based on the old ICU project at IBM.
As it is a database of XML data it doesn't contain any .NET-specific code, but as a body of knowledge it is very good.
Only a smallish subset of it is in the .NET Framework; Microsoft hasn't gone near any of the supplemental stuff, like postcode formats or number spelling (for check/cheque amounts). Standard time zone names (from the Olson/tz distribution) are also included in the CLDR, with mappings to the Windows-specific names, and some of the hierarchical locale-specific behaviours have better support there.
I wouldn't say that no one does i18n, but I don't know of any generic tools that can be used for every project. Maintaining a database with all of the information you are looking for would be an epic project. It sounds like what you're looking for isn't a specific C# library, but rather a collection of information online that you can draw from. If you were able to find a repository of swear words in various languages (for example), it would be trivial to use it from C#. A solution that wraps up all of your requirements into one easy-to-use assembly is, I think, going to be impossible to find.
Have a look at
http://www.microsoft.com/globaldev/getwr/dotneti18n.mspx
and
http://www.dotneti18n.com/
String-to-number conversion and vice versa can be done as follows:
using System.Globalization;

var culture = new CultureInfo(locale);  // e.g. locale = "de-DE"
int number = Convert.ToInt32(myString, culture.NumberFormat);
string str = Convert.ToString(myNumber, culture.NumberFormat);
As for checking VATs and addresses, I'm interested in that too; I haven't found anything useful so far.
Not exactly a "library" per se, but I've actually run into a great paid service by a company called E4X (a former client of mine).
What they provide is complete localization of your e-commerce site, including language translations, currency exchange, local billing, and handling of financial transactions (including region-specific taxes etc.), and more. They even deal with the logistics of physical shipping...
Worth looking into, for an ecommerce business. Let 'em know I sent you... ;-)
That's a huge endeavor. Let's start with one simple problem: phone numbers. Google's libphonenumber library at http://code.google.com/p/libphonenumber/ has a C# port at https://bitbucket.org/pmezard/libphonenumber-csharp, with notes at http://blog.thekieners.com/2011/06/06/using-googles-libphonenumber-in-microsoft-net-with-c/. It appears to be a good library for handling both US and international numbers.
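A rough sketch of the port in use (assuming the PhoneNumbers namespace exposed by the C# port; the example number is a placeholder):

// Rough sketch using the libphonenumber C# port.
// Namespace and method names assumed from the port's public API.
using System;
using PhoneNumbers;

class PhoneDemo
{
    static void Main()
    {
        var util = PhoneNumberUtil.GetInstance();

        // Parse a number; the second argument is the default region
        // used when no country code is present.
        PhoneNumber number = util.Parse("(650) 253-0000", "US");

        Console.WriteLine(util.IsValidNumber(number));                  // True
        Console.WriteLine(util.Format(number, PhoneNumberFormat.E164)); // +16502530000
    }
}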