How does wikimedia transform its model syntax?

I would like to know how Wikimedia transforms its model syntax ({{model|options}}) into HTML code.
I have a regex for a simple model ({{.*?}}), but it fails for a nested model (e.g. {{model|options containing a {{submodel|options}}...}}).

Remember,
Some people, when confronted with a problem, think "I know, I'll use
regular expressions." Now they have two problems. - Jamie Zawinski
That said, you can read "Forum tags. What is the best way to implement them?". There I made an example of nested tags, both with "pure" Regex and with a "more stable" C# parser that uses a few regexes but keeps the stack out of the Regex's hands.
You can do it with balancing groups. They aren't part of "base" Regex (and some people don't consider them to be true regexes), but the .NET engine supports them.
But I wouldn't program something as big as a wiki with something like a regex. The problem with regexes is that it's quite difficult to write them so that they don't backtrack (there is an option to do it, but it's difficult to build a regex that needs no backtracking or only a limited amount of it), and once they begin to backtrack it's the end: they can stall for minutes searching for the right combination of captures.
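To illustrate the "keep the stack out of the Regex's hands" idea, here is a minimal sketch of a scanner that tracks nesting itself. The class and method names (TemplateScanner, FindTopLevel) are made up for this example, and it only understands {{ and }}; it is not how MediaWiki itself parses templates.

```csharp
using System;
using System.Collections.Generic;

public static class TemplateScanner
{
    // Returns every outermost {{...}} span, nested templates included in the span.
    // Hypothetical helper: it ignores triple braces, <nowiki>, and other wiki syntax.
    public static List<string> FindTopLevel(string input)
    {
        var results = new List<string>();
        var starts = new Stack<int>(); // positions of "{{" that are still open
        int i = 0;
        while (i < input.Length - 1)
        {
            if (input[i] == '{' && input[i + 1] == '{')
            {
                starts.Push(i);
                i += 2;
            }
            else if (input[i] == '}' && input[i + 1] == '}' && starts.Count > 0)
            {
                int start = starts.Pop();
                if (starts.Count == 0) // we just closed an outermost template
                    results.Add(input.Substring(start, i + 2 - start));
                i += 2;
            }
            else
            {
                i++;
            }
        }
        return results;
    }
}
```

Each returned span can then be split on | and expanded recursively by calling FindTopLevel again on its inner text. The scan is a single pass over the input, so there is no backtracking to worry about.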

Related

Fastest way of removing unicode codes from a string

Hi I'm trying to figure out a way to remove the tags from the results returned from the Google Feed API. Specifically they are placing bold tags on titles and inside the description.
The codes that are being inserted are as follows:
\u003cb
\u003e
\u003c/b\u003e
Since it's a fixed set of codes, I tried doing a String.Replace() for each of them on every string, but not surprisingly that performed badly. I'm not sure if RegEx would be better (or worse). Does anyone have an idea on how to remove these? Google does not supply an option to remove tags from the results.

You could remove the unicode codes using a regex like this one:
\\u[\d\w]{4}
var subject = @"\u003cb\u003e\u003c/b\u003e";
var result = Regex.Replace(subject, @"\\u[\d\w]{4}", String.Empty);
As for performance, this article seems to suggest that regex is much slower, but I would run your own tests with your own data, as the results might be wildly different. The regular expression itself plays a big part in performance, and I don't think that article states which regex was used, so it's impossible to compare. The size and type of your data also play a big part, so it's difficult to say which is better without understanding your data.
Also, you should try compiling the regex with the RegexOptions.Compiled flag to see if that boosts performance.
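Note that a pattern matching only the \uXXXX escape would leave the tag name behind (the "b" and "/b" between the codes). A hedged sketch that removes the whole escaped tag with one precompiled regex could look like this; FeedCleaner and StripBoldTags are names invented for the example, and it assumes the feed text really contains the literal backslash sequences shown in the question.

```csharp
using System.Text.RegularExpressions;

public static class FeedCleaner
{
    // Matches the escaped forms of <b> and </b>, i.e. \u003cb\u003e and \u003c/b\u003e.
    // The regex is built once and reused, which is where RegexOptions.Compiled can pay off.
    private static readonly Regex EscapedBoldTag =
        new Regex(@"\\u003c/?b\\u003e", RegexOptions.Compiled | RegexOptions.IgnoreCase);

    public static string StripBoldTags(string text)
    {
        return EscapedBoldTag.Replace(text, string.Empty);
    }
}
```

Whether this beats chained String.Replace() calls depends on the data, so it is still worth measuring both.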

Regular expression in C# , is this possible?

I have never used regular expressions before and plan to use them to solve my problem, but I'm not sure whether they can help me.
I need to store a rule or formula for building string values, like the following examples, in a database field, and then retrieve the rule and build the string value from it.
FacilityCode + Left(ModelNO,2)
Right(PO,3) + Left(Serial,2)
Is this achievable using .NET regular expressions? Are there any good tutorials or simple examples for this kind of problem?

Regexp : http://msdn.microsoft.com/en-us/library/2k3te2cs(VS.80).aspx
http://msdn.microsoft.com/en-us/library/system.text.regularexpressions.regex.aspx
But it doesn't seem to fit :)

It might be better to code some random string generator. Regex is for searching data not creating data.
The thing to remember about regex is that it is like an aircraft carrier; it does one thing very very well, it does not do other jobs very well at all.
An aircraft carrier moves planes very well on the ocean; it does not make a cheese sandwich well AT ALL!!
That is to say, if you use regex when you shouldn't you will almost certainly use far more processing power than if you used another tool for that job. Html parsing comes to mind.

Regex is provided as part of System.Text.RegularExpressions, but you can't rely exclusively on it. It'll let you search existing strings, but you'll need to implement your own logic for building new strings based on what you find in the existing data.
Also, keep in mind that System.Text.RegularExpressions works differently from regexp in Perl and other implementations. For example, it doesn't recognize POSIX character class definitions.
Since you're new to regex, you might want to check out the "Regular Expressions User Guide" on zytrax.com. It's not as comprehensive as an O'Reilly manual, but it'll do as a start.
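As a rough sketch of "regex to recognize the rule, your own logic to build the string": the code below assumes the rule grammar is just terms joined by +, where a term is either a field name or Left(field,n) / Right(field,n). That grammar is inferred from the two sample rules, and RuleEvaluator is a hypothetical name.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class RuleEvaluator
{
    // Recognizes Left(Field,N) or Right(Field,N); anything else is treated as a plain field name.
    private static readonly Regex Call = new Regex(
        @"^(?<fn>Left|Right)\(\s*(?<field>\w+)\s*,\s*(?<n>\d+)\s*\)$",
        RegexOptions.IgnoreCase);

    public static string Evaluate(string rule, IDictionary<string, string> fields)
    {
        // Split the rule into terms, evaluate each one, and concatenate the results.
        var parts = rule.Split('+').Select(term => EvaluateTerm(term.Trim(), fields));
        return string.Concat(parts);
    }

    private static string EvaluateTerm(string term, IDictionary<string, string> fields)
    {
        Match m = Call.Match(term);
        if (!m.Success)
            return fields[term]; // plain field reference; throws if the field is unknown

        string value = fields[m.Groups["field"].Value];
        int n = Math.Min(int.Parse(m.Groups["n"].Value), value.Length);
        return m.Groups["fn"].Value.Equals("Left", StringComparison.OrdinalIgnoreCase)
            ? value.Substring(0, n)
            : value.Substring(value.Length - n);
    }
}
```

With fields PO = "PO12345" and Serial = "SN987", the rule Right(PO,3) + Left(Serial,2) would evaluate to "345SN".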

c# parse a string that contains conditions , key=value

I am given a string that contains several different combinations of data.
For example:
string data = "(age=20&gender=male) or (city=newyork)";
string data1 = "(job=engineer&gender=female)";
string data2 = "(foo =1 or foo = 2) & (bar =1)";
I need to parse this string, build a structure out of it, and evaluate it as a condition against another object, e.g. if the object has these properties, then do something, else skip.
What are the best practices for doing this?
Should I use a parser such as ANTLR and generate tokens out of the string?
Reminder: the string can be built from several combinations, but it's all and/or.

Something like ANTLR is probably overkill for this.
A simple implementation of the shunting-yard algorithm would probably do the trick quite nicely.
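A minimal sketch of that idea follows. It is a shunting-yard variant that evaluates as it goes instead of emitting postfix output. The names are hypothetical, the object is modelled as a string dictionary, and it assumes that & / and binds tighter than or, which the question does not specify. Malformed input (unbalanced parentheses, missing operands) will throw.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ConditionEvaluator
{
    // Tokens: parentheses, "&" / "and", "or", and key=value comparisons.
    private static readonly Regex Token = new Regex(
        @"\(|\)|&|\bor\b|\band\b|(?<key>\w+)\s*=\s*(?<value>\w+)",
        RegexOptions.IgnoreCase);

    // Assumption: AND has higher precedence than OR.
    private static int Precedence(string op)
    {
        return op == "&" ? 2 : 1;
    }

    public static bool Evaluate(string expression, IDictionary<string, string> obj)
    {
        var operands = new Stack<bool>();
        var operators = new Stack<string>(); // holds "&", "or" and "("

        foreach (Match m in Token.Matches(expression))
        {
            string t = m.Value.ToLowerInvariant();
            if (m.Groups["key"].Success)
            {
                // A comparison is true when the object has the key with exactly that value.
                string actual;
                operands.Push(obj.TryGetValue(m.Groups["key"].Value, out actual)
                              && actual == m.Groups["value"].Value);
            }
            else if (t == "(")
            {
                operators.Push(t);
            }
            else if (t == ")")
            {
                while (operators.Peek() != "(")
                    Apply(operators.Pop(), operands);
                operators.Pop(); // discard the "("
            }
            else
            {
                string op = t == "and" ? "&" : t;
                while (operators.Count > 0 && operators.Peek() != "("
                       && Precedence(operators.Peek()) >= Precedence(op))
                    Apply(operators.Pop(), operands);
                operators.Push(op);
            }
        }
        while (operators.Count > 0)
            Apply(operators.Pop(), operands);
        return operands.Pop();
    }

    private static void Apply(string op, Stack<bool> operands)
    {
        bool right = operands.Pop(), left = operands.Pop();
        operands.Push(op == "&" ? left && right : left || right);
    }
}
```

The regex is used only as a tokenizer here; the nesting is handled by the two stacks.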

Using regular expressions may work if the example is very simple, but it will more likely lead to code that is impossible to maintain. Using some other approach to parsing seems like a good idea.
I would take a look at NCalc - it is mainly focused on parsing mathematical expressions, but it seems to be quite customizable (you can specify your functions and constants), so it may work in your scenario as well.
If this is too complex for your purpose, you can use any "parser generator" for C#. ANTLR is one great option; the "Five minute introduction to ANTLR" is an example that shows how to start writing something like your example.
You could also try using F#, which is a great language for this kind of problem. See for example FsLex Sample by Chris Smith, which shows a simple mathematical evaluator - processing the parsed expression in F# would be a lot easier than in C#. In F#, you could also use FParsec, which is very lightweight, but may be a bit difficult to follow if you're not used to F#.

I suggest you have a look at regular expressions: http://www.codeproject.com/KB/dotnet/regextutorial.aspx

Antlr is a great tool, but you can probably do this with regular expressions. One of the nice things about the .NET regex engine is support for nested constructs. See
http://retkomma.wordpress.com/2007/10/30/nested-regular-expressions-explained/
and this SO post.

Seems like you might want to use Regular Expressions to do this.
Read up a little bit on Regular Expressions in .NET. Here are some good articles:
http://msdn.microsoft.com/en-us/library/hs600312.aspx
http://www.regular-expressions.info/dotnet.html
When it comes time to write/test your regular expression, I would highly recommend using RegExLib.com's regex tester.

ANTLR or Regex?

I'm writing a CMS in ASP.NET/C#, and I need to process things like this on every page request:
<html>
<head>
<title>[Title]</title>
</head>
<body>
<form action="[Action]" method="get">
[TextBox Name="Email", Background=Red]
[Button Type="Submit"]
</form>
</body>
</html>
and replace the [...] placeholders, of course.
My question is how I should implement it: with ANTLR or with Regex? Which will be faster? Note that if I implement it with ANTLR, I think I will need to implement XML in addition to the [..].
I will need to implement parameters, etc.
EDIT: Please note that my regex could look something like this:
public override string ToString()
{
    return Regex.Replace(Input, @"\[
        \s*(?<name>\w+)\s*
        (?<parameter>
            [\s,]*
            (?<paramName>\w+)
            \s*
            =
            \s*
            (
                (?<paramValue>\w+)
                |
                (""(?<paramValue>[^""]*)"")
            )
        )*
        \]", (match) =>
    {
        ...
    }, RegexOptions.IgnorePatternWhitespace);
}

Whether the correct tool is RegEx or ANTLR or even something else entirely should be heavily dependent on your requirements. The best answer to a "what tool to use" question shouldn't be primarily based on performance, but on the right tool for the job.
RegEx is a text search tool. If all you need to do is pull strings out of strings then it's often the hammer of choice. You'll likely want a tool to help you build your RegEx. I'd recommend Expresso, but there are lots of options out there.
ANTLR is a compiler generator. If you need error messages and parse actions or any of the complicated things that come with a compiler then it's a good option.
What it looks like you're doing is XML search/replace, have you considered XPath? That would be my suggestion.
Choosing the right tool for the job is definitely important, something that should be researched and thought out before development begins. In all cases, it's important to fully understand the program requirements before making any decisions. Do you have a specification for the project? If not, spending the time to come up with one will save you all the time that a poor tool choice can cost you.
Hope that helps!

The performance of ANTLR vs. RegEx depends on the RegEx implementation in C#. I know, from experience, that ANTLR is fast enough.
In ANTLR you can ignore certain content, like the XML. You can also search for the [ and ] and continue processing from there.
Both RegEx and ANTLR support your kind of parameters (I'm not sure about the "etc.").
In terms of development speed, RegEx is slightly faster for a case like this. You can use an online tool to develop the RegEx and see the capture groups while you edit it (Google "regex gskinner").
On the other hand, ANTLR has very good support for error messages: they show line/column numbers and what was wrong. RegEx doesn't have this.
A general approach for RegEx would be: create a "global scan" RegEx that finds the well-formed [...] groups in your content. Let the "..." be captured by a group, and then apply another RegEx to this smaller content (which splits it on the equals signs and commas). This way you get good runtime performance and it's easy to develop.
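A sketch of that two-pass approach, under some assumptions: TagExpander and the render callback are invented for this example, and the outer pattern treats any [Word ...] in the page as a tag, so real markup containing square brackets (scripts, for instance) would need extra care.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TagExpander
{
    // Pass 1: find each [Name ...] group and capture the raw argument text.
    private static readonly Regex Tag =
        new Regex(@"\[\s*(?<name>\w+)(?<args>[^\]]*)\]", RegexOptions.Compiled);

    // Pass 2: split the argument text into key=value or key="value" pairs.
    private static readonly Regex Param =
        new Regex(@"(?<key>\w+)\s*=\s*(?:""(?<value>[^""]*)""|(?<value>\w+))", RegexOptions.Compiled);

    // render is a caller-supplied function that turns a tag name and its parameters into HTML.
    public static string Expand(string input,
        Func<string, IDictionary<string, string>, string> render)
    {
        return Tag.Replace(input, match =>
        {
            var parameters = new Dictionary<string, string>();
            foreach (Match p in Param.Matches(match.Groups["args"].Value))
                parameters[p.Groups["key"].Value] = p.Groups["value"].Value;
            return render(match.Groups["name"].Value, parameters);
        });
    }
}
```

Both patterns are simple and contain no nested quantifiers, which keeps backtracking limited, and the regexes are created once rather than on every page request.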

If the language you are parsing is regular, then regular expressions are certainly an option. If it is not, then ANTLR may be your only choice. If I understand these matters correctly, XML is not regular.

RegEx matching HTML tags and extracting text

I have a string of text like this:
<customtag>hey</customtag>
I want to use a RegEx to modify the text between the "customtag" tags so that it might look like this:
<customtag>hey, this is changed!</customtag>
I know that I can use a MatchEvaluator to modify the text, but I'm unsure of the proper RegEx syntax to use. Any help would be much appreciated.

I wouldn't use regex for this either, but if you must, this expression should work:
<customtag>(.+?)</customtag>
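For example, the expression can be combined with a MatchEvaluator (written here as a lambda) roughly like this. CustomTagEditor and AppendText are hypothetical names, and the sketch only handles the simple, non-nested case from the question.

```csharp
using System.Text.RegularExpressions;

public static class CustomTagEditor
{
    // Appends a suffix to the text inside every <customtag>...</customtag> pair.
    public static string AppendText(string input, string suffix)
    {
        return Regex.Replace(
            input,
            @"<customtag>(.+?)</customtag>",
            // Group 1 holds the original text between the tags.
            m => "<customtag>" + m.Groups[1].Value + suffix + "</customtag>");
    }
}
```

Calling AppendText("<customtag>hey</customtag>", ", this is changed!") gives <customtag>hey, this is changed!</customtag>.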

I'd chew my own leg off before using a regular expression to parse and alter HTML.
Use XSL or DOM.
Two comments have asked me to clarify. The regular expression substitution works in the specific case in the OP's question, but in general regular expressions are not a good solution. Regular expressions can match regular languages, i.e. a sequence of input which can be accepted by a finite state machine. HTML can contain nested tags to any arbitrary depth, so it's not a regular language.
What does this have to do with the question? Using a regular expression for the OP's question as it is written works, but what if the content between the <customtag> tags contains other tags? What if a literal < character occurs in the text? It has been 11 months since Jon Tackabury asked the question, and I'd guess that in that time, the complexity of his problem may have increased.
Regular expressions are great tools and I do use them all the time. But using them in lieu of a real parser for input that needs one is going to work in only very simple cases. It's practically inevitable that these cases grow beyond what regular expressions can handle. When that happens, you'll be tempted to write a more complex regular expression, but these quickly become very laborious to develop and debug. Be ready to scrap the regular expression solution when the parsing requirements expand.
XSL and DOM are two standard technologies designed to work with XML or XHTML markup. Both technologies know how to parse structured markup files, keep track of nested tags, and allow you to transform tags attributes or content.
Here are a couple of articles on how to use XSL with C#:
http://www.csharpfriends.com/Articles/getArticle.aspx?articleID=63
http://www.csharphelp.com/archives/archive78.html
Here are a couple of articles on how to use DOM with C#:
http://msdn.microsoft.com/en-us/library/aa290341%28VS.71%29.aspx
http://blogs.msdn.com/tims/archive/2007/06/13/programming-html-with-c.aspx
Here's a .NET library that assists DOM and XSL operations on HTML:
http://www.codeplex.com/Wiki/View.aspx?ProjectName=htmlagilitypack
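As a small illustration of the parser-based route, here is a sketch using LINQ to XML (System.Xml.Linq) as a stand-in for the DOM APIs discussed above. It assumes the input is well-formed XML/XHTML; for real-world HTML a library such as the HTML Agility Pack would be needed instead. The class and method names are made up for the example.

```csharp
using System.Linq;
using System.Xml.Linq;

public static class CustomTagDomEditor
{
    // Appends text at the end of every <customtag> element, leaving nested tags intact.
    public static string AppendText(string xml, string suffix)
    {
        XElement root = XElement.Parse(xml);

        // ToList() takes a snapshot so the tree can be modified safely while looping.
        foreach (XElement tag in root.DescendantsAndSelf("customtag").ToList())
        {
            tag.Add(suffix); // adds a text node after any existing content
        }
        return root.ToString(SaveOptions.DisableFormatting);
    }
}
```

Because the parser tracks the element structure, nested tags and escaped < characters inside <customtag> do not need any special handling.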

If there won't be any other tags between the two tags, this regex is a little safer, and more efficient:
<customtag>[^<>]*</customtag>

Most people use HTML Agility Pack for HTML text parsing. However, I find it a little heavyweight and complicated for my own needs. I create a web browser control in memory, load the page, and copy the text from it (see the examples linked below).
You can find 3 simple examples here:
http://jakemdrew.wordpress.com/2012/02/03/getting-only-the-text-displayed-on-a-webpage-using-c/

//This removes all HTML tags
var re = new RegExp("<[^>]*>", "g");
var x2 = Content.replace(re,"");
//This removes all non-breaking spaces
var x3 = x2.replace(/\u00a0/g,'');
