Question:
I want to render MediaWiki syntax (and I mean MediaWiki syntax as used by Wikipedia, not some other wiki format from some other engine such as WikiPlex), in C#.
Input: MediaWiki Markup string
Output: HTML string
There are some alternative MediaWiki parsers, but none in C#, and P/Invoking the C/C++ ones looks bleak because of the structure of those libraries.
As syntax guidance, I use
http://en.wikipedia.org/wiki/Wikipedia:Cheatsheet
My first goal is to render that page's markup correctly.
Markup can be seen here:
http://en.wikipedia.org/w/index.php?title=Wikipedia:Cheatsheet&action=edit
Now, Regex alone is not of much use, because one can't exactly say which tag ends which starting one, especially when some elements, such as italic, become an attribute of the parent element.
On the other hand, parsing character by character is not a good approach either, because, for example, ''' means bold, '' means italic, and ''''' means bold and italic...
I looked into porting some of the other parsers' code, but the Java implementations are obscure, and the Python implementations have a very different regex syntax.
The best approach I see so far would be to port mwlib to IronPython
http://www.mediawiki.org/wiki/Alternative_parsers
But frankly, I'm not looking forward to having the IronPython runtime added as a dependency to my application, and even if I would want to, the documentation is bad at best.
Update per 2017:
You can use ParseoidSharp to get a fully compatible MediaWiki renderer.
It uses the official Wikipedia Parsoid library via NodeServices.
(NetStandard 2.0)
Since Parsoid is GPL 2.0, and the GPL code is invoked in Node.js in a separate process via the network, you can even use any license you like ;)
Pre-2017
Problem solved.
As originally assumed, the solution lies in using one of the existing alternative parsers in C#.
WikiModel (Java) works well for that purpose.
First attempt was pinvoke kiwi.
It worked at first, but failed because:
kiwi uses char* (fails on anything non-English/ASCII)
it is not thread-safe
it requires shipping a native DLL for every architecture
(I did add x86 and amd64, then it went kaboom on my ARM processor)
Second attempt was mwlib.
That failed because somehow IronPython doesn't work as it should.
Third attempt was Swebele, which essentially turned out to be academic vaporware.
The fourth attempt was using the original MediaWiki renderer via Phalanger. That failed because the MediaWiki renderer is not really modular.
The fifth attempt was using Wiky.php via Phalanger, which worked, but was slow, and Wiky.php implements MediaWiki only very incompletely.
The sixth attempt was using bliki via ikvmc, which failed because of the excessive use of 3rd-party libraries ==> it compiles, but yields only null-reference exceptions.
The seventh attempt was using JavaScript in C#, which worked, but was very slow, and the MediaWiki functionality implemented was very incomplete.
The 8th attempt was writing my own "parser" via Regex, but the time required to make it work was just excessive, so I stopped.
The 9th attempt was successful.
Using ikvmc on WikiModel yields a useful dll.
The problem there was that the example code was hopelessly out of date.
But using Google and the WikiModel source code, I was able to piece it together.
The end-result can be found here:
https://github.com/ststeiger/MultiWikiParser
Why shouldn't this be possible with regular expressions?
inputString = Regex.Replace(inputString, @"(?:''''')(.*?)(?:''''')", "<strong><em>$1</em></strong>");
inputString = Regex.Replace(inputString, @"(?:''')(.*?)(?:''')", "<strong>$1</strong>");
inputString = Regex.Replace(inputString, @"(?:'')(.*?)(?:'')", "<em>$1</em>");
This will, as far as I can see, render all 'Bold and italic', 'Bold' and 'Italic' text.
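For illustration, here is a minimal runnable version of that idea (sample input and output are my own); note that the replacements must run longest marker first, so ''''' isn't consumed by the ''' rule:

using System;
using System.Text.RegularExpressions;

class WikiQuotes
{
    static void Main()
    {
        string s = "'''''very important''''', '''bold''', ''italic''";

        // Longest marker first, so ''''' is not consumed by the ''' rule.
        s = Regex.Replace(s, @"(?:''''')(.*?)(?:''''')", "<strong><em>$1</em></strong>");
        s = Regex.Replace(s, @"(?:''')(.*?)(?:''')", "<strong>$1</strong>");
        s = Regex.Replace(s, @"(?:'')(.*?)(?:'')", "<em>$1</em>");

        Console.WriteLine(s);
        // <strong><em>very important</em></strong>, <strong>bold</strong>, <em>italic</em>
    }
}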
Here is how I once implemented a solution:
define your regular expressions for Markup->HTML conversion
regular expressions must be non-greedy
collect the regular expressions in a Dictionary<char, List<RegEx>>
The char is the first (markup) character in each regex, and the regexes must be sorted by markup keyword length descending, e.g. === before ==.
Iterate through the characters of the input string, and check if Dictionary.ContainsKey(char). If it does, search the List for matching RegEx. First matching RegEx wins.
As MediaWiki allows recursive markup (except for <pre> and others), the string inside the markup must also be processed in this fashion recursively.
If there is a match, skip ahead by the number of characters the regex matched in the input string. Otherwise proceed to the next character. (A minimal sketch of this follows below.)
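Here is a minimal sketch of that scheme, assuming illustrative rules only for the quote markers; a real rule table would also have buckets for =, *, [ and so on:

using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;

static class WikiRenderer
{
    // Rules bucketed by their first markup character; within a bucket,
    // longer markers come first (''''' before ''' before '').
    static readonly Dictionary<char, List<(Regex Pattern, string Open, string Close)>> Rules =
        new Dictionary<char, List<(Regex Pattern, string Open, string Close)>>
        {
            ['\''] = new List<(Regex Pattern, string Open, string Close)>
            {
                (new Regex(@"^'''''(.+?)'''''"), "<strong><em>", "</em></strong>"),
                (new Regex(@"^'''(.+?)'''"), "<strong>", "</strong>"),
                (new Regex(@"^''(.+?)''"), "<em>", "</em>"),
            }
        };

    public static string Render(string input)
    {
        var output = new StringBuilder();
        int i = 0;
        while (i < input.Length)
        {
            bool matched = false;
            if (Rules.TryGetValue(input[i], out var candidates))
            {
                foreach (var (pattern, open, close) in candidates)
                {
                    // The regexes are ^-anchored, so match against the remainder.
                    Match m = pattern.Match(input.Substring(i));
                    if (m.Success)
                    {
                        // Markup can nest, so process the inner text recursively.
                        output.Append(open).Append(Render(m.Groups[1].Value)).Append(close);
                        i += m.Length;   // skip ahead past the matched markup
                        matched = true;
                        break;           // first matching regex wins
                    }
                }
            }
            if (!matched)
                output.Append(input[i++]);
        }
        return output.ToString();
    }
}

For example, WikiRenderer.Render("'''bold ''and italic'' text'''") yields <strong>bold <em>and italic</em> text</strong>.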
Kiwi (https://github.com/aboutus/kiwi, mentioned on http://mediawiki.org/wiki/Alternative_parsers) may be a solution. Since it is C-based, and I/O is simply done via stdin/stdout, it should not be too hard to create a "PInvoke"-able DLL from it.
As with the accepted solution, I found Parsoid is the best way forward, as it's the official library and has the greatest support for the MediaWiki markup. That said, I found ParseoidSharp to be using obsolete packages such as Microsoft.AspNetCore.NodeServices, and really it's just a wrapper around a fairly old version of Parsoid's npm package.
Since there is a fairly current version of Parsoid on npm, you can use Jering.Javascript.NodeJS to do the same thing as ParseoidSharp; the steps are fairly similar, too.
Install Node.js.
Download Parsoid (https://www.npmjs.com/package/parsoid) and place the required files in your project.
In PowerShell, cd to your project.
npm install
Then it's as simple as:
Dim output = Await StaticNodeJSService.InvokeFromFileAsync(Of String)(HttpContext.Current.Request.PhysicalApplicationPath & "./NodeScripts/parsee.js", args:=New Object() {Markup})
Bonus: it's now much easier than with ParseoidSharp's method to add the options required; e.g., you'll probably want to set the domain to your own domain.
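In C#, the equivalent call would look roughly like this (parsee.js is the wrapper script from the steps above, and the base-directory lookup is an assumption; adjust the path to wherever you placed the script):

using System;
using System.IO;
using System.Threading.Tasks;
using Jering.Javascript.NodeJS;

static class ParsoidRenderer
{
    public static async Task<string> RenderAsync(string markup)
    {
        // parsee.js is the Node wrapper script referenced above; the path is an assumption.
        string script = Path.Combine(AppDomain.CurrentDomain.BaseDirectory, "NodeScripts", "parsee.js");
        return await StaticNodeJSService.InvokeFromFileAsync<string>(script, args: new object[] { markup });
    }
}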
Related
I'm looking for a command-line tool where I can specify regex patterns (or similar) for certain file extensions (e.g. cs files, js files, xaml files) that can provide errors/warnings when run, like during a build. These would scan plain-text source code of all types.
I know there are tools for specific languages... I plan on using those too. This tool is for quick patterns we want to flag where we don't want to invest in writing a Roslyn rule, for example. I'd like to flag certain patterns or API usages in an easy way, where anyone can add a new rule without thinking too hard. Oftentimes we don't add rules because it is hard.
Features like source tokenization are a bonus. Open-source/free is a mega bonus.
Is there such a tool?
If you want to go old-school, you can dust off Awk for this one.
It scans files line by line (for some configurable definition of "line", with a sane default), cuts each line into pieces (on whitespace, if memory serves), and applies a set of regexes, firing the code behind each matching regex. There are special patterns to match the beginning and end of a file, for printing headers/footers.
It seems to be what you want, but IMHO a Perl or Ruby script is easier, and replaced Awk for me a long time ago. Still, Awk IS simple and straightforward for your use case, AFAICT.
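If you'd rather stay in .NET, the same scan-and-flag idea is only a screenful of C#. A rough sketch (the two rules and the .cs extension are placeholders; add your own):

using System;
using System.IO;
using System.Text.RegularExpressions;

class LintScan
{
    static void Main(string[] args)
    {
        // Placeholder rules: pair each regex with the warning it should raise.
        var rules = new (Regex Pattern, string Message)[]
        {
            (new Regex(@"catch\s*\(\s*Exception\s*\)"), "overly broad catch"),
            (new Regex(@"\b(TODO|HACK)\b"), "leftover marker comment"),
        };

        // args[0] is the root folder; scan every .cs file under it.
        foreach (string file in Directory.EnumerateFiles(args[0], "*.cs", SearchOption.AllDirectories))
        {
            int lineNo = 0;
            foreach (string line in File.ReadLines(file))
            {
                lineNo++;
                foreach (var (pattern, message) in rules)
                    if (pattern.IsMatch(line))
                        Console.Error.WriteLine($"{file}({lineNo}): warning: {message}");
            }
        }
    }
}

The file(line): warning: output format is close to the canonical MSBuild one, so the hits can surface as warnings when the tool is run as part of a build.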
Our Source Code Search Engine (SCSE) can do this.
SCSE lexes a set of source files (using language-accurate tokenization, including skipping language-specific whitespace but retaining comments) and then builds a token index for each token type. One can provide SCSE with token-based search queries such as:
'if' '(' I '='
to search for patterns in the source code; this example "lints" C-like code for the common mistake of assigning a variable (I for "identifier") in an IF statement caused by accidental use of '=' instead of the intended '=='.
The search is accomplished using the token indexes to speed up the search. Typically SCSE can search millions of lines of code in a few seconds, far faster than grep or other schemes that insist on reading the file content for each query. It also produces fewer false positives because the token checks are accurate, and the queries are much easier to write because one does not have to worry about whitespace/line breaks/comments.
A list of hits on the pattern can be logged or merely counted.
Normally SCSE is used interactively; queries produce a list of hits, and clicking on a hit produces a view of a page of the source text with the hit superimposed. However, one can also script calls on the SCSE.
SCSE can be obtained with language-accurate lexers for some 40 languages.
Working in C#, I have an array of strings. Some of these strings are real words; others are complete nonsense. My goal is to come up with a way of deciding which of these words are real and which are false.
I had planned to find some kind of word list online that I could bring into my project, turn into a list, and compare against, but of course typing in "C# dictionary" comes up with an unrelated topic! I don't need a 100% accuracy rate.
To formalize the question:
In C#, what is the recommended way to establish whether or not a string is a real word?
Advice and guidance is very much appreciated!
Solution
Thanks for the great answers, they were all very useful. As it happens, the thing to do was to ask the same question in a different wording. Searching for "C# spellcheck" brought up some great links, and I ended up using NHunspell, which you can get through NuGet and which is very easy to use.
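For reference, the NHunspell check boils down to a few lines; the dictionary file names below depend on which Hunspell dictionaries you install:

using System;
using NHunspell;

class SpellDemo
{
    static void Main()
    {
        // en_us.aff / en_us.dic come from a Hunspell dictionary package;
        // adjust the paths to wherever your dictionary files live.
        using (var hunspell = new Hunspell("en_us.aff", "en_us.dic"))
        {
            Console.WriteLine(hunspell.Spell("receive"));  // True
            Console.WriteLine(hunspell.Spell("recieve"));  // False
        }
    }
}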
The problem is that "Dictionary" is a type within the framework, so searching with that word will turn up all sorts of unrelated results. What you basically want to do is spell checking: that will determine whether a word is valid or not.
Searching for C# spell check yielded some promising results. Searching for open source spell check also has some.
I have previously implemented one of the open-source ones within a VB6 project; I think it was ASpell. I haven't had to use a spell-check library within C#, but I'm sure there is one, or at least one with a .NET wrapper to make implementation easier.
If you have special case words that do not exist in the dictionary/word file for a spell check solution, you can add them.
To do this, I would use a freely available dictionary for Linux (googling "linux dictionaries" should get you on the right track), read and parse the file, and store it in a C# System.Collections.Generic.HashSet collection. I would probably store everything as .ToUpper() or as .ToLower(), but this depends on your requirements.
You can then check if any arbitrary string is in the HashSet efficiently.
I don't know of any word list file included by default on Windows, but most Unix-like operating systems include a words file for this purpose. Someone has also posted a words file on github suggested for use in Windows projects. These files are simple lists of words, one per line.
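A sketch of that lookup, assuming a one-word-per-line file such as the usual Unix /usr/share/dict/words:

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

class WordCheck
{
    static void Main()
    {
        // Load the word list once; lower-case everything for case-insensitive lookups.
        var words = new HashSet<string>(
            File.ReadLines("/usr/share/dict/words").Select(w => w.ToLowerInvariant()));

        foreach (string candidate in new[] { "apple", "flurbex" })
            Console.WriteLine($"{candidate}: {words.Contains(candidate.ToLowerInvariant())}");
        // With a standard words file: apple: True, flurbex: False
    }
}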
I'm trying to write a C# library to manipulate my C/C++ header files. I want to be able to read and parse the header files and manipulate function prototypes and data structures in C#. I'm trying to avoid writing a C parser, due to all the code branches caused by #ifdefs and stuff like that.
I've tried playing around with EnvDTE, but couldn't find any decent documentation.
Any ideas how can I do it?
Edit -
Thank you for the answers... Here are some more details about my project: I'm writing a ptrace-like tool for Windows using the debugging APIs, which enables me to trace my already-compiled binaries and see which Windows APIs are being called. I also want to see which parameters are given in each call and what return values are given, so I need to know the definitions of the APIs. I also want to know the definitions for my own libraries (hence the header-parsing approach). I thought of 3 solutions:
* Parsing the header files
* Parsing the PDB files (I wrote a prototype using the DIA SDK, but unfortunately the symbol PDBs contained only general info about the APIs and not the real prototypes with the parameters and return values)
* Crawling over the MSDN online library (automatically or manually)
Is there any better way to get the names and types of Windows APIs and my own libraries at runtime in C#?
Parsing C (even "just" headers) is hard; the language is more complex than people remember,
and then there's the preprocessor, and finally the problem of doing something with the parse. C++ includes essentially all of C, and with C++11 here the problem is even worse.
People can often hack a 98% solution for a limited set of inputs, often with regexes in Perl or some other string hackery. If that works for you, then fine. Usually what happens is that 2% causes the hacked parser to choke or to produce the wrong answer, and then you get to debug the result and hand hack the 98% solution output.
Hacked solutions tend to fail pretty badly on real header files, which seem to concentrate weirdness in macros and conditionals (sometimes even to the point of mixing different dialects of C and C++ in the conditional arms). See a typical Microsoft .h file as an example. This appears to be what OP wants to process. Preprocessing gets rid of part of the problem, and now you get to encounter the real complexity of C and/or C++. You won't get a 98% solution for real header files even with preprocessing; you need typedefs and therefore name and type resolution, too. You might "parse" FOO X; that tells you that X is of type FOO... oops, what's that? Only a symbol table knows for sure.
GCCXML does all this preprocessing, parsing, and symbol table construction ... for the GCC dialect of C. Microsoft's dialect is different, and I don't think GCCXML can handle it.
A more general tool is our DMS Software Reengineering Toolkit, with its C front end; there's also a C++ front end (yes, they're different; C and C++ aren't the same language by a long shot). These process a wide variety of C dialects (both MS and GCC when configured properly), do macro/conditional expansion, and build an AST and a symbol table (doing that name and type resolution stuff correctly).
You can add customization to extract the information you want by crawling over the symbol table structures produced. You'll have to export what you want to C# (e.g., generate your C# classes), since DMS isn't implemented in a .NET language.
In the most general case, header files are only usable, not convertible.
This is due to the possible preprocessor (#define) use of macros, fragments of structures, constants, etc., which only get meaning when used in context.
Examples
anything with ## in macros
or
// header
#define mystructconstant "bla", "bla"

// in the using .c
char test[2][10] = {mystructconstant};
But you can't simply discard all macros either, since then you won't process the very common calling-convention macros,
etc etc.
So header parsing and conversion is mostly only possible for semi-automated use (manually running cleaned-up headers through it) or for reasonably clean and consistent headers (like, e.g., the older MS SDK headers).
Since the general case is so hard, there isn't much readily available. Everybody crafts something quick and dirty for their own headers.
The only more general tool that I know is SWIG.
Is there a way to get MarkdownSharp (I'm using the NuGet package) to handle 'GitHub Flavored Markdown (GFM)', and especially syntax highlighting of C# code, which (in GFM) is written like this:
```c#
//my code.....
```
So, if I pass Markdown-formatted content containing a C# code block (as above) to MarkdownSharp, I want it to generate syntax-highlighted HTML for that C# code. Any ideas? I know I can use the supported 4-space indent to indicate a code block, but again, I'm seeking a solution for getting it to support GitHub Flavored Markdown.
I have made some light modifications to MarkdownSharp that will transform github flavored fenced code blocks
https://github.com/KyleGobel/MarkdownSharp-GithubCodeBlocks
```cs
Console.WriteLine("Fenced code blocks ftw!");
```
Would become
<pre><code class='language-cs'>
Console.WriteLine("Fenced code blocks ftw!");
</code></pre>
It handles the cases I needed to use; there are probably lots of edge cases though. Feel free to fork/change/modify/send a pull request. MarkdownSharp has plenty of comments and is all in one file, so it's not too bad to modify.
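For reference, the core of such a transformation can be a single regex pass along these lines; this is a simplified sketch, not the exact code from the fork above:

using System.Net;
using System.Text.RegularExpressions;

static class FencedCodeBlocks
{
    // Matches ```lang ... ``` fences. Simplified: it ignores edge cases such as
    // fences inside indented code or info strings containing spaces.
    static readonly Regex Fence = new Regex(
        @"^```[ ]*(?<lang>[^\r\n ]*)[ ]*\r?\n(?<code>.*?)\r?\n```[ ]*$",
        RegexOptions.Multiline | RegexOptions.Singleline);

    public static string Transform(string markdown) =>
        Fence.Replace(markdown, m =>
        {
            string lang = m.Groups["lang"].Value;
            string cls = lang.Length > 0 ? $" class='language-{lang}'" : "";
            // HTML-encode the code so < and & survive intact.
            return $"<pre><code{cls}>\n{WebUtility.HtmlEncode(m.Groups["code"].Value)}\n</code></pre>";
        });
}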
Here's the result: https://github.com/danielwertheim/kiwi/wiki/Use-with-Asp.Net-MVC
//D
As one can read in this post, GitHub relies on RedCarpet to render Markdown syntax.
However, Vicent Marti (maintainer of Sundown (ex-Upskirt) and RedCarpet) states that the syntax highlighting is specifically handled by Pygments, a Python library.
Back to your concern, I can think of several options to benefit from syntax highlighting from C#:
Try and build a compiled managed version of Pygments source code thanks to IronPython ("IronPython’s Hosting APIs can be used to compile Python scripts into DLLs, console executables, or Windows executables.")
Port Pygments to C#
Use a different syntax-highlighting product (for instance, ColorCode, which is used by CodePlex...)
Then either:
Fork MarkDownSharp to make it accept plug-ins
Similarly to what GitHub does, use the managed syntax highlighting product and post process the Html generated by MarkDownSharp
By the way, as a MarkDown alternative, you might want to consider Moonshine, a managed wrapper on top of Sundown which is said to be "at least 20x faster than MarkdownSharp when run against MarkdownSharp's own benchmark app."
I need to parse a simple statement (essentially a chain of function calls on some object) represented as a string variable into a CodeDom object (probably a subclass of CodeStatement). I would also like to provide some default imports of namespaces to be able to use less verbose statements.
I have looked around SO and the Internet to find some suggestions, but I'm quite confused about what is and isn't possible and what the simplest way to do it is. For example, this question seems to be almost what I want; unfortunately I can't use the solution, as CodeSnippetStatement seems not to be supported by the execution engine that I use (the WF rules engine).
Any suggestions that could help me / point me into the right direction ?
There is no library or function to parse C# code into CodeDOM objects as part of the standard .NET libraries. The CodeDOM libraries have some methods that seem to be designed for this, but none of them are actually implemented. As far as I know, there is some implementation available in Visual Studio (used e.g. by designers), but that is only internal.
CodeSnippetStatement is a CodeDOM node that allows you to place any string into the generated code. If you want to create a CodeDOM tree just to generate C# source code, then this is usually fine (the source code generator just prints the string to the output). If the WF engine needs to understand the code in your string (and not just generate source code and compile it), then CodeSnippetStatement won't work.
However, there are 3rd-party tools that can be used for parsing C# source code. In one project I worked on, we used the NRefactory library (which is used in SharpDevelop), and it worked quite well. It gives you a tree (AST) representing the parsed code, and I'm afraid you'll need to convert this to the corresponding CodeDOM tree yourself.
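For illustration, parsing a statement with NRefactory and walking the resulting AST looks roughly like this (NRefactory 5 API; the statement and names are made up, and the AST-to-CodeDOM conversion is still up to you):

using System;
using System.Linq;
using ICSharpCode.NRefactory.CSharp;

class ParseDemo
{
    static void Main()
    {
        // Wrap the statement in a dummy type/method so the parser sees a full unit.
        var parser = new CSharpParser();
        SyntaxTree tree = parser.Parse("class C { void M() { obj.Foo().Bar(42); } }", "demo.cs");

        // List every method invocation found in the chain.
        foreach (var call in tree.Descendants.OfType<InvocationExpression>())
            Console.WriteLine(call.GetText());
    }
}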
I have found a library implementation here that seems to cover pretty much everything I need for my purposes. I don't know if it's robust enough to be used in business scenarios, but for my unit tests it's pretty much all I can ask for.