How should I start writing a parser for BibTeX files? As an initial design, I see the following steps:
Write down the grammar
Build a tokenizer
Parse the token stream against the grammar
We also need some error-reporting mechanism, so that users uploading BibTeX files can see the line numbers where the errors in their files are. I am looking for the community's opinion on how to approach this problem.
(Please point out any existing open source C# or VB.NET BibTeX parsers.)
There are many tools available to assist you with this, such as ANTLR or the GOLD Parsing System. I usually use the latter to create my parser grammars.
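If you would rather hand-write the tokenizer step than use a generator, the following is a minimal sketch of one that records a line number on every token, so later parse errors can point at the offending line. It is written in Python for brevity (an assumption; the same structure ports to C# with System.Text.RegularExpressions). The token set is hypothetical and deliberately simplified: real BibTeX also needs brace-delimited values with nesting, @string, @comment and # concatenation.

```python
import re
from typing import Iterator, NamedTuple


class Token(NamedTuple):
    kind: str
    text: str
    line: int  # 1-based line number, kept for error messages


class BibTexSyntaxError(Exception):
    """Raised with a message that starts with the offending line number."""


# Hypothetical, simplified token set (not the full BibTeX grammar).
TOKEN_RE = re.compile(r"""
    (?P<AT>@)
  | (?P<LBRACE>\{)
  | (?P<RBRACE>\})
  | (?P<COMMA>,)
  | (?P<EQUALS>=)
  | (?P<STRING>"[^"\n]*")
  | (?P<NAME>[A-Za-z0-9_:.+/-]+)
  | (?P<NEWLINE>\n)
  | (?P<SPACE>[ \t\r]+)
""", re.VERBOSE)


def tokenize(source: str) -> Iterator[Token]:
    """Yield tokens, skipping whitespace and counting newlines as we go."""
    line, pos = 1, 0
    while pos < len(source):
        match = TOKEN_RE.match(source, pos)
        if match is None:
            raise BibTexSyntaxError(
                f"line {line}: unexpected character {source[pos]!r}")
        kind = match.lastgroup
        if kind == "NEWLINE":
            line += 1
        elif kind != "SPACE":
            yield Token(kind, match.group(), line)
        pos = match.end()
```

A parser built on top of this can raise the same exception type using the line stored in the token it failed on, which gives users the "error at line N" feedback asked for in the question.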
I've published an open source library for the BibTeX format (load/save/export to Excel), allowing both untyped (key/value dictionary) and strongly typed access to the BibTeX entries.
It might not fit your purpose well, as it is weak on validation (it has none :) ), but it might help anyway:
Nuget Package
GitHub repository
About the package on my web site
I am looking for an LDIF parser for C#. I am trying to parse an LDIF file so that I can check that objects don't exist before adding them. Adding them when they already exist (using ntdsSchemaAdd) causes an entry in the error logs.
A quick web search turned up http://wiki.github.com/skradel/Zetetic.Ldap/. They provide a .NET API.
From the page:
Zetetic.Ldap is a .NET library for .NET 2 and above, which makes it easier to work with directory servers (like Active Directory, ADAM, Red Hat Directory Server, and others). Some of the key features of Zetetic.Ldap are:
1. LDIF file parsing and generation – Read and write the file format used for moving data around between directory systems
2. LDAP Entry-oriented API with change tracking – Create and modify directory objects in a more natural way
3. LDAP Schema interrogation – Quick programmatic access to the kinds of objects and fields your directory server understands. Learn if an attribute is a string, a number, a date, etc., without lots of manual research and re-parsing
4. LDIF Pivoter – Turn an LDIF file into a (comma or tab-delimited) flat file for analysis or loading into systems that don’t speak LDIF
We built the Zetetic.Ldap library to make directory projects and programming faster and easier, and release it here in the hopes that others will find it useful too. As far as we know, this is the only .NET library that really understands the LDIF specification.
Download link: http://github.com/downloads/skradel/Zetetic.Ldap/Zetetic.Ldap_20090831.zip
I would parse it myself.
If you look at the LDIF RFC for the EBNF, you'll see that it's not a very complex grammar.
I've reliably parsed large amounts of LDIF with regexes before, though your mileage may vary.
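As a rough illustration of how little is needed, here is a minimal regex-based sketch in Python (an assumption; the same approach works with .NET's Regex class). It covers the parts of the LDIF format I am sure of: folded lines that continue with a single leading space, # comments, blank lines between records, and "attr: value" versus "attr:: base64-value". It does not handle change records, and it assumes base64 values decode as UTF-8 text, which is not true for binary attributes.

```python
import base64
import re
from typing import Dict, List

# Attribute description, then ":" (plain), "::" (base64) or ":<" (URL).
LINE_RE = re.compile(
    r"^(?P<attr>[A-Za-z0-9;.-]+):(?P<marker>[:<]?)[ ]*(?P<value>.*)$")


def unfold(text: str) -> List[str]:
    """Join continuation lines (starting with one space) to the previous line."""
    lines: List[str] = []
    for raw in text.splitlines():
        if raw.startswith(" ") and lines:
            lines[-1] += raw[1:]
        else:
            lines.append(raw)
    return lines


def parse_ldif(text: str) -> List[Dict[str, List[str]]]:
    """Return one dict per record, mapping lower-cased attribute to its values."""
    records: List[Dict[str, List[str]]] = []
    current: Dict[str, List[str]] = {}
    for line in unfold(text):
        if not line.strip():          # blank line ends the current record
            if current:
                records.append(current)
                current = {}
            continue
        if line.startswith("#"):      # comment line
            continue
        match = LINE_RE.match(line)
        if match is None:
            raise ValueError(f"not a valid LDIF line: {line!r}")
        attr = match.group("attr").lower()
        if attr == "version" and not current and not records:
            continue                  # leading "version: 1" header
        value = match.group("value")
        if match.group("marker") == ":":
            # Assumption: the base64 payload is UTF-8 text.
            value = base64.b64decode(value).decode("utf-8")
        # For ":<" the value is left as the URL string, not fetched.
        current.setdefault(attr, []).append(value)
    if current:
        records.append(current)
    return records
```

For the "does this object already exist" check from the question, collecting `record["dn"][0]` for each parsed record gives the list of distinguished names to look up before adding anything.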
I have a server that other devs use, and I currently log the version of the DLL they use. I do that by having the client use reflection to retrieve its own version:
Assembly.GetEntryAssembly().GetName().Version.ToString();
That's nice, but since it comes from devs who use TFS and build it themselves, I cannot see whether they have the latest version of the sources. Is there a trick, like a compilation tag, that would easily produce a hash of the source code used for the build?
Note: I have tried sending the MD5 of the DLL (using assembly.Location), but it is useless since the hash value changes between two compilations (I suppose there is some compilation timestamp inside the generated DLL).
This is more of a collaboration issue than a coding one.
The moment you find out that the version is an old one, notify them about it.
If the actual sources are not old, that means the developers did not increment the version ID before making the build, which is a mistake.
In other words, organize it among people, and do not rely on this kind of tool (if there is any). You are trying to create a complicated tool that will help you avoid mistakes, but humans will find a way to make them again.
So it's better to establish a solid working agreement among yourselves, IMO.
Create a tool that runs as a pre-build event and hashes your code files (or collects their last-write times).
Write the result to a .cs file or an embedded resource file.
The generated file itself must be excluded from the hashing.
To keep the incremental (up-to-date) build check working, compare the new content with the existing file and write it only if it has changed.
Also, if you have the generated file open in the IDE, you will get a "changed from outside" prompt when you build.
It seems there is no easy way to do it.
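The pre-build steps above can be sketched roughly as follows. This is a Python script under stated assumptions, not a tested build integration: the generated file name SourceHash.g.cs, the SourceHash class and its Value constant are all hypothetical, and a real project would also want to skip the bin and obj folders.

```python
import hashlib
from pathlib import Path

GENERATED_NAME = "SourceHash.g.cs"  # hypothetical name of the generated file


def hash_sources(root: Path) -> str:
    """Hash every .cs file under root, excluding the generated file itself."""
    digest = hashlib.sha256()
    for path in sorted(root.rglob("*.cs")):
        if path.name == GENERATED_NAME:
            continue
        # Include the relative path so renames change the hash too.
        digest.update(path.relative_to(root).as_posix().encode("utf-8"))
        digest.update(path.read_bytes())
    return digest.hexdigest()


def write_if_changed(target: Path, content: str) -> bool:
    """Write only when content differs, so up-to-date checks keep working."""
    if target.exists() and target.read_text(encoding="utf-8") == content:
        return False
    target.write_text(content, encoding="utf-8")
    return True


def generate(root: Path) -> bool:
    """Regenerate the hash file; return True if it was (re)written."""
    source_hash = hash_sources(root)
    content = (
        "// <auto-generated/>\n"
        "internal static class SourceHash\n"
        "{\n"
        f'    public const string Value = "{source_hash}";\n'
        "}\n"
    )
    return write_if_changed(root / GENERATED_NAME, content)
```

Calling `generate(Path(project_dir))` from the pre-build event would let the client report the hypothetical `SourceHash.Value` next to the assembly version. Note that line-ending differences between checkouts would change the hash unless they are normalized first.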
I need POS tagging for the files in my corpus.
I have successfully followed the installation instructions for SharpNLP.
I am using the binary version.
I created a new C# project in: E:\sharp\sharpapp
The location of the Models folder is: E:\sharp\sharpapp\bin\Models
The location of my SharpNLP binaries is: E:\sharp\SharpNLP-1.0.2529-Bin
I have also followed the instructions to modify both .config files, "ParseTree.Exe" and "ToolsExamples.Exe".
Now, in my C# project, I have a class called tagging.cs in which I have to read my corpus text files and POS-tag them. Can anybody explain how I can use SharpNLP to do so?
Please provide the steps.
In a nutshell, SharpNLP is
a port to C# of OpenNLP Tools and OpenNLP MaxEnt
a connector to WordNet
a set of pre-computed models, mostly for the English language
utility modules such as integration with SQLite
It should be noted that the port of the OpenNLP libraries is relatively informal, with various class and property name changes, possibly loose preservation of features and semantics and no apparent connection with the original Java projects' lifecycle. This situation will likely ensure that in time the OpenNLP portion of SharpNLP will be more akin to distant cousins than twin sisters...
Nevertheless, it is possible to use examples and documentation from OpenNLP to complement the relatively thin support material available with SharpNLP. Between the source code of SharpNLP and resources like the OpenNLP API reference and the OpenNLP wiki, one can generally map things and adapt accordingly.
A loose guide could be to study this particular source file, which uses OpenNLP in a way that seems close to what you may need. Note the name changes between OpenNLP and SharpNLP: for example, the POSTaggerME class becomes MaximumEntropyPosTagger, and the Parse() method and its overload become TagSentence(), and so on.
A more general hint is to understand the sequence of steps typically necessary to perform POS tagging.
This is a very high-level approximate description but, I think, useful.
get the text to be tagged = string(s) of text
Initialize a text parser
parse it = an "array" (or other container) with individual tokens i.e. words and punctuation characters.
initialize the POS Tagger, in particular tell it which model it should use
feed the [ordered] sequence of tokens to the POS Tagger
Ta dah! Use the POS tags for the eventual purpose of your NLP application.
Note how the above sequence assumes that the model is readily available.
The model is a representation of the statistical "profile" of text in general, obtained by training the tagger on a set of text that has already been tagged.
SharpNLP comes with a model for generic English, but in order to tag other languages, or if the specific corpus to be tagged belongs to a particular domain (say medical reports or Tweets or...), it may be preferable to re-train the tagger to improve its precision.
OpenNLP/SharpNLP, like most POS taggers, whether stand-alone programs or APIs, typically includes features to train the tagger (= to produce a model from a sample set of already-tagged text) and to verify the quality of the resulting model/tagger (= to compare the tags produced on a test set with the tags expected for that set).
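To make the train/verify idea concrete, here is a toy sketch in Python. It is not SharpNLP's or OpenNLP's API, and it is not a maximum-entropy tagger: it swaps in the much simpler "most frequent tag per word" baseline, purely to show what "producing a model from tagged text" and "comparing produced tags with expected tags" mean. All function names and the default tag are assumptions made for the illustration.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

TaggedSentence = List[Tuple[str, str]]  # [(word, tag), ...]


def train(tagged: List[TaggedSentence]) -> Dict[str, str]:
    """Build the 'model': for each word, the tag it carried most often."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for sentence in tagged:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def tag(model: Dict[str, str], tokens: List[str], default: str = "NN") -> List[str]:
    """Tag an ordered token sequence; unknown words get the default tag."""
    return [model.get(token.lower(), default) for token in tokens]


def accuracy(model: Dict[str, str], test: List[TaggedSentence]) -> float:
    """Fraction of test tokens whose predicted tag equals the expected tag."""
    correct = total = 0
    for sentence in test:
        words = [word for word, _ in sentence]
        expected = [expected_tag for _, expected_tag in sentence]
        predicted = tag(model, words)
        correct += sum(p == e for p, e in zip(predicted, expected))
        total += len(expected)
    return correct / total if total else 0.0
```

A real tagger uses context (neighbouring words and tags) rather than a per-word lookup, but the workflow is the same: train on one tagged set, measure accuracy on a held-out tagged set, and re-train on domain text if the score is too low.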
Kindly read through the article that I have written for this. It will give you a detailed step by step method with sample code snippets.
Easy way of Integrating SharpNLP into your project in Visual Studio
I hope this was useful.
I am looking for the best way to implement a WinForms application with support for different languages, but I don't want to use Visual Studio resource files because you always have to recompile.
I have found the following solution to use XML files without compilation:
http://www.codeproject.com/KB/miscctrl/xml_localization.aspx
I think it is OK: users can edit the XML, and in the future they could provide translations for my application.
Do you know a better way for this?
Instead of defining your own XML-based internationalization system, you could go the standard way and use an existing format such as Translation Memory eXchange (TMX).
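As a rough idea of what loading such a file at runtime looks like, here is a minimal sketch in Python (an assumption; in .NET the same can be done with XDocument or XmlDocument). It reads the tu/tuv/seg elements of a TMX document into a lookup table for one language. Using the optional tuid attribute as the control key, and names like MainForm.SaveButton, are my own conventions for the example, not something TMX prescribes.

```python
import xml.etree.ElementTree as ET
from typing import Dict

# ElementTree exposes xml:lang under the predefined XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Simplified sample document; the key names are hypothetical.
SAMPLE_TMX = """<tmx version="1.4">
  <header creationtool="example" creationtoolversion="1.0" segtype="sentence"
          o-tmf="none" adminlang="en" srclang="en" datatype="plaintext"/>
  <body>
    <tu tuid="MainForm.SaveButton">
      <tuv xml:lang="en"><seg>Save</seg></tuv>
      <tuv xml:lang="fr"><seg>Enregistrer</seg></tuv>
    </tu>
    <tu tuid="MainForm.CancelButton">
      <tuv xml:lang="en"><seg>Cancel</seg></tuv>
      <tuv xml:lang="fr"><seg>Annuler</seg></tuv>
    </tu>
  </body>
</tmx>"""


def load_translations(tmx_text: str, language: str) -> Dict[str, str]:
    """Map each translation unit's tuid to its segment text for one language."""
    root = ET.fromstring(tmx_text)
    table: Dict[str, str] = {}
    for unit in root.iter("tu"):
        key = unit.get("tuid")
        if key is None:
            continue  # units without a tuid cannot be looked up by key
        for variant in unit.findall("tuv"):
            segment = variant.find("seg")
            if variant.get(XML_LANG) == language and segment is not None:
                table[key] = "".join(segment.itertext())
    return table
```

At startup the application would call something like `load_translations(text, "fr")` and assign `table["MainForm.SaveButton"]` to the matching control, so adding a language only means shipping an edited XML file, with no recompilation.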
I would like to write all metadata (including the advanced summary properties) for the files in a Windows folder to a CSV file. Is there a way to collect all the attributes? I see that MP3 files have a different set of attributes compared to JPG files. (C#)
This can also be a script (vb, perl)
Update: by looking at libextractor (thank you) I can see this can be achieved by writing different plugins for different type of files. I gather this meta data is not a simple collection...
In Perl, you can use MP3::Tag or MP3::Info
If you can cope w/ VB.Net: http://www.codeproject.com/KB/vb/mp3id3v1.aspx
If you can cope w/ C++/.Net: http://www.codeproject.com/KB/audio-video/mp3fileinfo.aspx
For either one (assuming the C++ is compiled to .NET), you can use Reflector to disassemble the binary and convert it to C#. Check w/ the respective authors about their licenses first (usually Code Project articles are under an open license like CPOL).
In a library? Try libextractor if your software is GPL.
Ok, after the clarification edits, I would suggest looking at the introspection available in .Net. I will warn you however that I think you will get more satisfying results if you forgo introspection and define the specific properties that you want for the file types that you expect to see.
Since scripting is acceptable, if this were my problem to solve I would use PowerShell, since the .NET introspection is baked in.
It may not be worth it to add all of the data from a jpeg file (exif data). I would hand pick what attributes I wanted from those files.
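To illustrate the "hand-pick the attributes" approach, here is a minimal Python sketch (an assumption; the same idea carries over to PowerShell or C#). It writes a fixed, chosen set of columns for every file in a folder to a CSV file. It only reads basic file-system attributes; the type-specific properties discussed above (ID3 tags, EXIF data, Windows summary properties) would each need a format-specific reader, and their columns would be added to the hypothetical COLUMNS list.

```python
import csv
from datetime import datetime, timezone
from pathlib import Path
from typing import Dict

# Hand-picked columns; extend this list per file type as needed.
COLUMNS = ["name", "extension", "size_bytes", "modified_utc"]


def describe(path: Path) -> Dict[str, str]:
    """Collect the chosen file-system attributes for one file."""
    info = path.stat()
    modified = datetime.fromtimestamp(info.st_mtime, tz=timezone.utc)
    return {
        "name": path.name,
        "extension": path.suffix.lower(),
        "size_bytes": str(info.st_size),
        "modified_utc": modified.isoformat(),
    }


def write_metadata_csv(folder: Path, target: Path) -> int:
    """Write one CSV row per file in folder; return the number of rows."""
    rows = [describe(p) for p in sorted(folder.iterdir()) if p.is_file()]
    with target.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=COLUMNS)
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

Keeping the column list fixed is what makes the output a usable table: MP3 and JPG files expose different property sets, so "all attributes" would otherwise produce rows with mostly empty, mismatched columns.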