I have a review data set of about 250,000 hotel reviews, and I'm planning to extract aspects from it using the CRFSharp DLL. However, my data is in plain text paragraph form, and I need to convert it into the CRFSharp format so that I can train and test a model for aspect extraction. What would be the best way to do that? I was thinking of writing a small program for the data format conversion.
Also, can CRFSharp do aspect extraction with the CRF models it provides? I'm using C#.
Which features and tags will you use in your task?
Here is the simplest example. Take the sentence "! Tokyo and New York are major financial centers." If you want to extract location names from it and your only feature is the token string, you can generate the training corpus as follows:
! NOR
Tokyo LOCATION
and NOR
New LOCATION
York LOCATION
are NOR
major NOR
financial NOR
centers NOR
. NOR
The first column holds the tokens of the sentence and the second column holds the corresponding tags. NOR means a normal term and LOCATION means a location name. You can generate your training corpus in the format above and use CRFSharp to train a model.
For more complex examples, such as more features, templates, or word positions added to the tags, you can refer to the other examples on the CRFSharp home page (http://crfsharp.codeplex.com).
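The conversion program the question mentions could look roughly like the sketch below. It is written in Python for brevity, and the same logic ports directly to C#. The function names, the regex tokenizer, and the phrase list are all hypothetical: the phrase list stands in for whatever manual annotation supplies the real labels. The tab separator and the blank line between sentences are assumptions based on CRF++-style corpora and should be checked against the CRFSharp documentation.

```python
import re

# Hypothetical tokenizer: runs of word characters, or single punctuation marks.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")

def to_crf_rows(sentence, phrases, tag="LOCATION", other="NOR"):
    """Return (token, tag) pairs; tokens covered by a known phrase get `tag`."""
    tokens = TOKEN_RE.findall(sentence)
    tags = [other] * len(tokens)
    phrase_tokens = [TOKEN_RE.findall(p) for p in phrases]
    for i in range(len(tokens)):
        for pt in phrase_tokens:
            if pt and tokens[i:i + len(pt)] == pt:
                tags[i:i + len(pt)] = [tag] * len(pt)
    return list(zip(tokens, tags))

def format_corpus(sentences_rows, sep="\t"):
    """One token per line; sentences separated by a blank line (assumed convention)."""
    blocks = ["\n".join(f"{tok}{sep}{t}" for tok, t in rows)
              for rows in sentences_rows]
    return "\n\n".join(blocks) + "\n"

rows = to_crf_rows("! Tokyo and New York are major financial centers.",
                   ["Tokyo", "New York"])
corpus = format_corpus([rows])
```

For hotel reviews, the same loop would run once per sentence, with aspect tags in place of LOCATION.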
I'm working on an integration process that requires converting a list of values, each in its own currency, to a given target currency.
The process will use two files: one containing the exchange rates and another containing the prices with their origin currency.
The exchange rates file looks like this:
Text:USDtoEUR;Origin:USD;Destination:EUR;Value:0.7
Text:EURtoCAD;Origin:EUR;Destination:CAD;Value:0.5
The file containing the prices with the origin currency (and also the target currency) looks like this:
Index:0;TargetCurrency:CAD
Index:1;Description:Product1;Value:150;Currency:EUR
Index:2;Description:Product2;Value:3;Currency:USD
For this specific case there is no direct way to convert from USD to CAD, so I first need to convert to another currency present in the file that has a CAD exchange rate (EUR) and then convert that to CAD.
This is a very basic scenario, but I'm guessing those files can contain more complex ones, where two or three conversions may be required before reaching the target currency.
What I'm planning to do is insert the content of the exchange rates file into a SQL Server table and then start a very manual process of looking for records containing the target currency. But I've never faced this scenario and don't know whether this would be an acceptable approach in terms of speed and performance, which is why I'm wondering if there is a standard algorithm or data structure best suited to this process.
I will appreciate your help
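If you need to take the conversion rates into account to find an optimal conversion path, you could use the Bellman-Ford algorithm. Because rates multiply along a path, the usual trick is to use the negative logarithm of each rate as the edge weight, so that the shortest path corresponds to the highest combined rate.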
This link may help you.
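As a rough illustration of the Bellman-Ford idea, here is a minimal sketch in Python (it ports directly to C#). The function name `best_rate` and the dictionary layout are hypothetical, and the extra direct USD to CAD rate of 0.3 is invented purely to show a multi-step path beating a direct one. Each edge is weighted by the negative logarithm of its rate, so the shortest path is the one with the highest combined rate.

```python
import math

def best_rate(rates, source, target):
    """rates maps (origin, destination) -> rate. Returns (rate, path) or None."""
    nodes = {c for pair in rates for c in pair}
    if source not in nodes or target not in nodes:
        return None
    edges = [(o, d, -math.log(r)) for (o, d), r in rates.items()]
    dist = {c: math.inf for c in nodes}
    dist[source] = 0.0
    prev = {}
    # Relax every edge up to |V| - 1 times, stopping early once nothing changes.
    for _ in range(len(nodes) - 1):
        changed = False
        for o, d, w in edges:
            if dist[o] + w < dist[d] - 1e-12:
                dist[d] = dist[o] + w
                prev[d] = o
                changed = True
        if not changed:
            break
    # A further improvement means a cycle that multiplies money (arbitrage).
    for o, d, w in edges:
        if dist[o] + w < dist[d] - 1e-12:
            raise ValueError("rates contain an arbitrage cycle")
    if math.isinf(dist[target]):
        return None
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    path.reverse()
    return math.exp(-dist[target]), path

sample_rates = {("USD", "EUR"): 0.7, ("EUR", "CAD"): 0.5, ("USD", "CAD"): 0.3}
rate, path = best_rate(sample_rates, "USD", "CAD")
```

With these sample rates, the path through EUR gives a combined rate of about 0.35, which is better than the invented direct rate of 0.3.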
But if the performance of the conversion matters more, you can look for the path with the fewest conversions (visiting the fewest nodes, regardless of the conversion cost) using BFS. DFS will also find a path, but not necessarily the shortest one.
(This means traversing the graph of currencies to find a path between two nodes, i.e. two currencies.)
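A minimal sketch of the BFS approach, using the file format from the question, might look like this. It is written in Python (it ports directly to C#), and the function names are hypothetical. It assumes rates are directed exactly as listed in the file: no inverse rates are inferred, although adding a reverse edge with value 1/rate would be a small change.

```python
from collections import deque

def parse_rates(lines):
    """Build {origin: {destination: rate}} from 'Key:Value;Key:Value' lines."""
    graph = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        fields = dict(part.split(":", 1) for part in line.split(";"))
        graph.setdefault(fields["Origin"], {})[fields["Destination"]] = float(fields["Value"])
    return graph

def conversion_path(graph, source, target):
    """Breadth-first search: returns the path with the fewest conversions, or None."""
    prev = {source: None}
    queue = deque([source])
    while queue:
        cur = queue.popleft()
        if cur == target:
            path = []
            while cur is not None:
                path.append(cur)
                cur = prev[cur]
            return path[::-1]
        for nxt in graph.get(cur, {}):
            if nxt not in prev:
                prev[nxt] = cur
                queue.append(nxt)
    return None

def convert(graph, amount, source, target):
    """Multiply the amount by each rate along the BFS path."""
    path = conversion_path(graph, source, target)
    if path is None:
        raise ValueError(f"no conversion path from {source} to {target}")
    for a, b in zip(path, path[1:]):
        amount *= graph[a][b]
    return amount

graph = parse_rates([
    "Text:USDtoEUR;Origin:USD;Destination:EUR;Value:0.7",
    "Text:EURtoCAD;Origin:EUR;Destination:CAD;Value:0.5",
])
```

For the sample files, `convert(graph, 150, "EUR", "CAD")` takes one step and `convert(graph, 3, "USD", "CAD")` goes through EUR. For a handful of currencies the whole graph fits comfortably in memory, so a database table is not needed for the search itself.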
I'm working with the iText7 library, version 7.0.2.2, in a C# web application. A PDF document is produced with a variable number of dynamically created pages, depending on the amount of data.
Is there a way to set a field with a calculated formula at run time? So for example, something along the lines of having a subtotal field calculation like
the product of Page1.Lineitem1.qty and Page1.LineItem1.unitprice.
iText is merely creating the view; the data model feeding it is provided by you.
Furthermore, iText only draws text strings, not numerical types, so it would have to parse those strings back into numbers, which can be difficult considering all the ways numbers can be formatted with commas, periods, plus and minus signs, brackets, units,...
And the text pieces you draw with iText are not named.
And iText flushes content to the output as soon as possible to save memory.
...
So no, iText does not provide support for "the product of Page1.Lineitem1.qty and Page1.LineItem1.unitprice" or similar expressions.
I am a novice at programming, working on a C# solution for a geomorphology project. I need to extract coordinates from a variable number of Google Earth KML ground overlay files, converted to one long text string, and enter them into an array that can be accessed by other methods.
The KML tags and data of interest look like this:
<LatLonBox>
<north>37.91904192681665</north>
<south>37.46543388598137</south>
<east>15.35832653742206</east>
<west>14.60128369746704</west>
<rotation>-0.1556640799496235</rotation>
</LatLonBox>
The text files I will be processing with the program could contain anywhere from 1 to 100 or more of these data groups, each embedded within the standard KML file headers/footers and other tags that are extraneous to my work. I have already developed the method for extracting the coordinate values as strings and have tested it on one KML file.
At this point it seems that the most efficient approach would be to construct some kind of looping method to search through the string for a coordinate data group, extract the data to a row in the array, then continue to the next group. The method might also go through the string and extract all the "north" data to a column in the array first, then loop back for all the "south" data, etc. I am open to any suggestions.
Due to my limited programming background, straightforward solutions would be preferred over elegant or advanced ones, but give it your best shot.
Thanks for your help.
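One straightforward way to structure the loop described above is to find each LatLonBox group in the long string, then pull the five values out of that group into one row. The sketch below is in Python (the same regular expressions work with C#'s Regex class). The function name and the choice of regular expressions over an XML parser are assumptions made for simplicity, and a missing tag is stored as None.

```python
import re

# One match per <LatLonBox>...</LatLonBox> group, even across line breaks.
BOX_RE = re.compile(r"<LatLonBox>(.*?)</LatLonBox>", re.DOTALL)
FIELDS = ("north", "south", "east", "west", "rotation")

def extract_boxes(kml_text):
    """Return one row [north, south, east, west, rotation] per LatLonBox group."""
    rows = []
    for box in BOX_RE.finditer(kml_text):
        body = box.group(1)
        row = []
        for name in FIELDS:
            found = re.search(rf"<{name}>\s*([^<]+?)\s*</{name}>", body)
            row.append(float(found.group(1)) if found else None)
        rows.append(row)
    return rows

sample = """
<GroundOverlay><LatLonBox>
<north>37.91904192681665</north>
<south>37.46543388598137</south>
<east>15.35832653742206</east>
<west>14.60128369746704</west>
<rotation>-0.1556640799496235</rotation>
</LatLonBox></GroundOverlay>
<GroundOverlay><LatLonBox>
<north>38.5</north><south>38.0</south><east>16.0</east><west>15.5</west>
</LatLonBox></GroundOverlay>
"""
boxes = extract_boxes(sample)
```

Here `boxes` holds one row per overlay, so other methods can index it as `boxes[row][column]`.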
I need to convert a user's UTM input (WGS 1984) into decimal degrees, preferably using ESRI's ArcGIS. I've already got the code to retrieve the zone (formatted like 14N, 22S, etc.) and the easting and northing values. What do I do from here?
Edit: we expect the input as a string like: 14N 423113mE 4192417mN. I can easily extract the numbers (and a character) 14, N, 423113, and 4192417 from the string above. I just need to somehow translate that to Decimal Degrees.
There is no specific information about the input data.
Here is some general info to start from:
The easiest way is to use the Geoprocessing engine to reproject the whole feature class. Use the C# class for the Project tool from the Data Management toolbox.
Another way is to use the Project method of IGeometry if you want to project only a few features.
EDIT: for your input data, use solution 2.
An even easier way is to use Proj4Net, the .NET port of the open-source library Proj.4. For such a simple task it is much easier to use than the ArcObjects classes.
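For reference, the arithmetic these libraries perform for a single point can also be sketched directly. The Python below (it ports directly to C#) is not the ArcObjects or Proj4Net API. It is a hand-written version of the standard series formulas for the inverse transverse Mercator projection on the WGS 84 ellipsoid, with hypothetical function names. It assumes the letter after the zone number is the hemisphere (N or S), not an MGRS latitude band. For production use, a tested library remains the safer choice.

```python
import math
import re

A = 6378137.0              # WGS 84 semi-major axis, metres
F = 1 / 298.257223563      # WGS 84 flattening
K0 = 0.9996                # UTM scale factor on the central meridian

UTM_RE = re.compile(r"^\s*(\d{1,2})([NS])\s+(\d+(?:\.\d+)?)mE\s+(\d+(?:\.\d+)?)mN\s*$")

def parse_utm(text):
    """Parse e.g. '14N 423113mE 4192417mN' into (zone, northern, easting, northing)."""
    m = UTM_RE.match(text)
    if not m:
        raise ValueError(f"unrecognized UTM string: {text!r}")
    zone, hemi, easting, northing = m.groups()
    return int(zone), hemi == "N", float(easting), float(northing)

def utm_to_latlon(zone, northern, easting, northing):
    """Return (latitude, longitude) in decimal degrees."""
    e2 = F * (2 - F)                       # first eccentricity squared
    ep2 = e2 / (1 - e2)                    # second eccentricity squared
    x = easting - 500000.0                 # remove false easting
    y = northing if northern else northing - 10000000.0
    mu = (y / K0) / (A * (1 - e2 / 4 - 3 * e2**2 / 64 - 5 * e2**3 / 256))
    e1 = (1 - math.sqrt(1 - e2)) / (1 + math.sqrt(1 - e2))
    # Footpoint latitude
    phi1 = (mu
            + (3 * e1 / 2 - 27 * e1**3 / 32) * math.sin(2 * mu)
            + (21 * e1**2 / 16 - 55 * e1**4 / 32) * math.sin(4 * mu)
            + (151 * e1**3 / 96) * math.sin(6 * mu)
            + (1097 * e1**4 / 512) * math.sin(8 * mu))
    sin1, cos1, tan1 = math.sin(phi1), math.cos(phi1), math.tan(phi1)
    c1 = ep2 * cos1**2
    t1 = tan1**2
    n1 = A / math.sqrt(1 - e2 * sin1**2)
    r1 = A * (1 - e2) / (1 - e2 * sin1**2) ** 1.5
    d = x / (n1 * K0)
    lat = phi1 - (n1 * tan1 / r1) * (
        d**2 / 2
        - (5 + 3 * t1 + 10 * c1 - 4 * c1**2 - 9 * ep2) * d**4 / 24
        + (61 + 90 * t1 + 298 * c1 + 45 * t1**2 - 252 * ep2 - 3 * c1**2) * d**6 / 720)
    dlon = (d - (1 + 2 * t1 + c1) * d**3 / 6
            + (5 - 2 * c1 + 28 * t1 - 3 * c1**2 + 8 * ep2 + 24 * t1**2) * d**5 / 120) / cos1
    lon0 = math.radians((zone - 1) * 6 - 180 + 3)   # central meridian of the zone
    return math.degrees(lat), math.degrees(lon0 + dlon)

lat, lon = utm_to_latlon(*parse_utm("14N 423113mE 4192417mN"))
```

The sample input from the question falls a little west of the zone 14 central meridian (99°W), at a latitude just under 38°N.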
I have a list of segments (15,000+ segments), and I want to find the occurrences of those segments in a given string. A segment can be a single word or multiple words, and I cannot assume that a space is the delimiter in the string.
e.g.
String "How can I download codec from internet for facebook, Professional programmer support"
[the string above may not make any sense but I am using it for illustration purpose]
segment list
Microsoft word
Microsoft excel
Professional Programmer.
Google
Facebook
Download codec from internet.
Output:
Download codec from internet
facebook
Professional programmer
Basically I am trying to do a query reduction.
I want to achieve it in less than O(list length + string length) time.
As my list has more than 15,000 segments, it will be time consuming to search for the entire list in the string.
The segments are prepared manually and placed in a txt file.
Regards
~Paul
You basically want a multi-pattern string search algorithm such as Aho-Corasick string matching. It constructs a state machine for processing bodies of text to detect matches, so that it effectively searches for all patterns at the same time. Its runtime is on the order of the length of the text plus the total length of the patterns, plus the number of matches reported.
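A compact sketch of such an automaton is shown below, in Python (it ports directly to C#). The function names are hypothetical, matching is made case-insensitive by lower-casing both sides (which assumes lower-casing does not change string length, true for ASCII text), and the segment list is the one from the question with the trailing periods dropped. The automaton is built once for the whole list and then reused for every query string.

```python
from collections import deque

def build_automaton(patterns):
    """Build goto/fail/output tables for Aho-Corasick over lower-cased patterns."""
    goto, fail, out = [{}], [0], [[]]
    for p in patterns:
        node = 0
        for ch in p.lower():
            nxt = goto[node].get(ch)
            if nxt is None:
                nxt = len(goto)
                goto[node][ch] = nxt
                goto.append({})
                fail.append(0)
                out.append([])
            node = nxt
        out[node].append(p)
    # Breadth-first pass: failure link = longest proper suffix that is a trie prefix.
    queue = deque(goto[0].values())
    while queue:
        node = queue.popleft()
        for ch, child in goto[node].items():
            f = fail[node]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[child] = goto[f].get(ch, 0)
            out[child] += out[fail[child]]
            queue.append(child)
    return goto, fail, out

def find_segments(text, automaton):
    """Return (start_index, pattern) for every occurrence, in one pass over text."""
    goto, fail, out = automaton
    node, hits = 0, []
    for i, ch in enumerate(text.lower()):
        while node and ch not in goto[node]:
            node = fail[node]
        node = goto[node].get(ch, 0)
        for p in out[node]:
            hits.append((i - len(p) + 1, p))
    return hits

segments = ["Microsoft word", "Microsoft excel", "Professional Programmer",
            "Google", "Facebook", "Download codec from internet"]
automaton = build_automaton(segments)
hits = find_segments("How can I download codec from internet for facebook, "
                     "Professional programmer support", automaton)
```

Because the scan works character by character, it does not rely on spaces as delimiters. The flip side is that it also reports matches inside longer words, so a boundary check on each hit may be needed.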
In order to do efficient searches, you will need an auxiliary data structure in the form of some sort of index. Here, a great place to start would be to look at a KWIC index:
http://en.wikipedia.org/wiki/Key_Word_in_Context
http://www.cs.duke.edu/~ola/ipc/kwic.html
What you're basically asking is how to write a custom lexer/parser.
Some good background on the subject would be the Dragon Book or something on lex and yacc (flex and bison).
Take a look at this question:
Poor man's lexer for C#
Now of course, a lot of people are going to say "just use regular expressions". Perhaps. The problem with using regex in this situation is that your execution time will grow linearly with the number of tokens you are matching against. So if you end up needing to "segment" more phrases, your execution time will get longer and longer.
What you need to do is make a single pass, pushing words onto a stack and checking whether they form a valid token after adding each one. If they don't, you continue (disregarding the token, like a compiler disregards comments).
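One way to read the word-by-word approach is as a trie keyed on words, where each step checks whether the words collected so far still form the start of a known segment. The sketch below is in Python (it ports directly to C#), with hypothetical function names. It assumes words can be found with a simple regular expression, which may not hold if the input has no reliable word boundaries, and it keeps the longest match at each position. The cost per position is bounded by the longest segment, not by the number of segments.

```python
import re

END = "$end"   # marker key; cannot collide with a word because "$" is not a word character

def build_word_trie(segments):
    """Nested dicts keyed by lower-cased words; END stores the original segment."""
    root = {}
    for seg in segments:
        node = root
        for word in re.findall(r"\w+", seg.lower()):
            node = node.setdefault(word, {})
        node[END] = seg
    return root

def match_segments(text, trie):
    """Scan the words left to right, keeping the longest segment at each position."""
    words = re.findall(r"\w+", text.lower())
    found, i = [], 0
    while i < len(words):
        node, j, best = trie, i, None
        while j < len(words) and words[j] in node:
            node = node[words[j]]
            j += 1
            if END in node:
                best = (j, node[END])
        if best:
            found.append(best[1])
            i = best[0]          # resume after the matched segment
        else:
            i += 1               # not the start of any segment; skip this word
    return found

word_trie = build_word_trie(["Microsoft word", "Microsoft excel", "Professional Programmer",
                             "Google", "Facebook", "Download codec from internet"])
matches = match_segments("How can I download codec from internet for facebook, "
                         "Professional programmer support", word_trie)
```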
Hope this helps.