Regex for parsing Wikicode in C#

I am trying to parse articles from Wikipedia. I use the *pages-articles.xml dump file, in which all articles are backed up in wikicode format. To strip the formatting and get the raw text, I am trying to use regular expressions, but I am not very experienced with them. My programming language is C#.
I experimented with Expresso, a designer for regular expressions, but I am at my wits' end. Here is what I want to achieve:
The text can contain the following structures:
[[TextN]] or
[[Text1|TextN]] or
[[Text1|Text2|...|TextN]]
The [[ .... ]] pattern can also appear within any Texti. I want to replace each of these structures with TextN.
To identify the structures within the text I tried the following regex:
\[\[ ( .* \|?)* \]\]
Expresso seems to run into an endless loop with this one. After 5 minutes on a relatively small text, I canceled the test run.
Then I tried something simpler, capturing anything between the brackets:
\[\[ .* \]\]
but when I have a line like:
[[Word1]] text inbetween [[Word2]]
the expression returns the whole line, not
[[Word1]]
[[Word2]]
Any tips from the regex experts here on how to solve the problem?
Thanks in advance,
Frank

I wouldn't use regular expressions (since they don't handle recursion/nesting well).
Instead I would parse the text by hand*, which isn't particularly difficult in this case.
You could represent the text as a stream of elements, where each element is either
a plain text chunk, or
a tag
A tag might contain multiple element streams, separated by |.
elementStream ::= element*
element ::= chunk | tag
chunk ::= TEXT
tag ::= "[[" elementStream otherStreams "]]"
otherStreams ::= "|" elementStream otherStreams | ε
Your parser could represent each of those definitions with a method. So you'd have an elementStream method that keeps calling element as long as there is text available and, if you are inside a tag, the next characters are not "]]" or "|".
Each call to element would return the element parsed, either a chunk or a tag, and so on.
This would essentially be a recursive descent parser.
Wikipedia: http://en.wikipedia.org/wiki/Recursive_descent_parser (the article is rather long/complicated, unfortunately)
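The grammar above maps fairly directly onto code. The following is a minimal sketch, not a complete wikicode parser. The names WikiLinkStripper, Strip, ParseStream and ParseTag are made up for illustration, and it assumes that a tag should be replaced by its last stream (TextN) and that an unclosed [[ simply runs to the end of the input.

```csharp
using System;
using System.Text;

// Hypothetical helper: a tiny recursive descent parser for [[...|...]] tags.
public static class WikiLinkStripper
{
    public static string Strip(string input)
    {
        int pos = 0;
        return ParseStream(input, ref pos, insideTag: false);
    }

    // elementStream ::= element*
    // Inside a tag the stream ends at "|" or "]]"; at top level it runs to the end.
    private static string ParseStream(string s, ref int pos, bool insideTag)
    {
        var sb = new StringBuilder();
        while (pos < s.Length)
        {
            if (insideTag && (s[pos] == '|' || StartsWith(s, pos, "]]")))
                break;

            if (StartsWith(s, pos, "[["))
                sb.Append(ParseTag(s, ref pos));   // element ::= tag
            else
                sb.Append(s[pos++]);               // element ::= chunk (one character at a time)
        }
        return sb.ToString();
    }

    // tag ::= "[[" elementStream ("|" elementStream)* "]]"
    // Returns the text of the last stream, i.e. TextN.
    private static string ParseTag(string s, ref int pos)
    {
        pos += 2; // skip "[["
        string last = ParseStream(s, ref pos, insideTag: true);
        while (pos < s.Length && s[pos] == '|')
        {
            pos++; // skip "|"
            last = ParseStream(s, ref pos, insideTag: true);
        }
        if (StartsWith(s, pos, "]]"))
            pos += 2; // skip "]]"; an unclosed tag is tolerated (assumption)
        return last;
    }

    private static bool StartsWith(string s, int pos, string token)
    {
        return pos + token.Length <= s.Length
            && string.CompareOrdinal(s, pos, token, 0, token.Length) == 0;
    }
}
```

For example, WikiLinkStripper.Strip("[[Text1|A [[link|label]] here]]") yields "A label here", because the inner tag is resolved first and the outer tag keeps only its last stream.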

\[\[(.*?)\]\] would do it.
The key is the .*?, which means match any characters, but as few as possible.
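To see the difference between the greedy and the lazy quantifier, a small sketch like the following can help. The class and method names (LazyMatchDemo, FindLinks) are hypothetical; only Regex.Matches is the real .NET API.

```csharp
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper that lists everything a pattern matches in a text.
public static class LazyMatchDemo
{
    public static string[] FindLinks(string text, string pattern)
    {
        return Regex.Matches(text, pattern)
                    .Cast<Match>()
                    .Select(m => m.Value)
                    .ToArray();
    }
}
```

Called with the sample line from the question, the greedy pattern @"\[\[.*\]\]" returns one match spanning the whole line, while the lazy pattern @"\[\[(.*?)\]\]" returns the two separate matches [[Word1]] and [[Word2]].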
EDIT
For nested tags one approach would be:
\[\[(?<text>(?>\[\[(?<Level>)|\]\](?<-Level>)|(?!\[\[|\]\]).)+(?(Level)(?!)))\]\]
This uses .NET balancing groups to ensure that the [[ and ]] inside the text are balanced as well.
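As a rough sketch of how the balancing-group pattern might be used from C#: the pattern below is written without spaces inside the lookahead, so it does not need RegexOptions.IgnorePatternWhitespace. The class and method names (NestedLinkFinder, InnerTexts) are invented for this example.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper: finds outermost [[...]] structures, allowing nested ones inside.
public static class NestedLinkFinder
{
    // (?<Level>) pushes on every inner "[[", (?<-Level>) pops on every inner "]]",
    // and (?(Level)(?!)) fails the match if any inner "[[" is still open.
    private static readonly Regex Outermost = new Regex(
        @"\[\[(?<text>(?>\[\[(?<Level>)|\]\](?<-Level>)|(?!\[\[|\]\]).)+(?(Level)(?!)))\]\]");

    // Returns the content between the outermost brackets of each match.
    public static List<string> InnerTexts(string input)
    {
        var result = new List<string>();
        foreach (Match m in Outermost.Matches(input))
            result.Add(m.Groups["text"].Value);
        return result;
    }
}
```

Note that . does not match a newline by default, so links that span several lines would need RegexOptions.Singleline.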

This is because regular expression quantifiers are greedy by default: they always try to find the longest possible match. You should replace the .* with something more restrictive.
Try using
\[\[([A-Za-z][A-Za-z\d]*)(\|[A-Za-z][A-Za-z\d]*)*\]\]
This matches only letters, digits and the | sign inside double brackets, and requires every value to start with a letter.

If Expresso isn't working out for you, you may want to try RegexBuddy.
While not free, it does provide an excellent real time testing environment where you can see how your regex is going to match a section of sample text.

If GPL2 is not an issue for you, maybe you could check out the source code of Screwturn Wiki and see how an expert does it. It's in C#, BTW

Related

Regex groups expression not capturing content

I'm trying to create a large regex where the plan is to capture 6 groups.
It is going to be used to parse Android logs that have the following format:
2020-03-10T14:09:13.3250000 VERB CallingClass 17503 20870 Whatever content: this log line had (etc)
The expression I've created so far is the following:
(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}.\d{7})\t([A-Za-z]{4})\t(\w{+})\t(\d{5})\t(\d{5})\t(.*$)
The lines in this case are tab separated, although the application that I'm developing will be dynamic to the point where this is not always the case, so I feel regex is still the best option, even if it is heavier than performing a split.
Breaking down the groups in more detail, following my thought process:
Matches the date (I'm considering changing this to a fixed number of characters instead)
(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}.\d{7})
Match a block of 4 characters
([A-Za-z]{4})
Match any number of characters until the next tab
(\w{+})
Match a block of 5 numbers 2 times
\t(\d{5})
At last, match everything else until the end of the line.
\t(.*$)
If I reduce the expression to the following, it works:
(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}.\d{7})\t([A-Za-z]{4})\t(.*$)
This doesn't include 3 of the groups: the word and the two number blocks.
Any idea why this is?
Thank you.
The problem is that \w{+} matches a word character followed by one or more { characters and then a final } character. If you want one or more word characters, just use the plus without the curly braces (which are meant for specifying a specific number or number range, but match literal curly braces if they do not adhere to that format). Also note that the unescaped . before \d{7} matches any character; escape it as \. to match only a literal dot.
(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{7})\t([A-Za-z]{4})\t(\w+)\t(\d{5})\t(\d{5})\t(.*$)
I highly recommend using https://regex101.com/ for the explanation to see if your expression matches up with what you want spelled out in words. However for testing for use in C# you should use something else like http://regexstorm.net/tester
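A minimal sketch of the corrected expression in C#, assuming tab-separated input as in the question. The names LogLineParser, TryParse and BracesAreLiteral are hypothetical, and the pattern additionally escapes the dot before the fractional seconds and anchors the match at the start of the line.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper that splits one log line into its six fields.
public static class LogLineParser
{
    private static readonly Regex Line = new Regex(
        @"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{7})\t([A-Za-z]{4})\t(\w+)\t(\d{5})\t(\d{5})\t(.*)$");

    // On success, fields holds: timestamp, level, class, first number, second number, message.
    public static bool TryParse(string line, out string[] fields)
    {
        Match m = Line.Match(line);
        if (!m.Success)
        {
            fields = new string[0];
            return false;
        }
        fields = new string[6];
        for (int i = 0; i < 6; i++)
            fields[i] = m.Groups[i + 1].Value; // Groups[0] is the whole match
        return true;
    }

    // Shows why the original \w{+} failed: the braces are taken literally.
    public static bool BracesAreLiteral()
    {
        return Regex.IsMatch("a{{}", @"^\w{+}$");
    }
}
```

If the separator is not always a tab, \t could be replaced by \s+ for the fixed-width fields, at the cost of a less strict match.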

Removing comments using regex

I am building a parser, and I would like to remove comments from various lines. For example,
variable = "some//thing" ////actual comment
Comment marker is //. In this case, variable would contain "some//thing" and everything else would be ignored. I plan to do it using regex replace. Currently I am using (".*"|[ \t])*(\/\/.*) as regex. However replacing it replaces "some//thing" ////actual comment entirely.
I can not figure out the regex which I should use instead. Thanks for any help.
Additional info - I am using C# with netcoreapp 1.1.0
Edit - some cases might be of a line with just comment like //line comment. Strings also might contain escaped quotes.
Here is the ugly regex pattern. I believe it will work well. I have tried it with every pathological example I can think of, including lines that contain syntax errors. For example, a quoted string that has too many quotes, or too few, or has a double escaped quote, which is, therefore, not escaped. And with quoted strings in the comments, which I have been known to do when I want to remind myself of alternatives.
The only time that it trips up is if there is a double slash inside a seemingly quoted string and somehow that string is malformed and the double slash ends up legally outside the properly quoted portion. Syntactically that makes it a valid comment, even though not the programmer's intention. So, from the programmer's perspective it's wrong, but by the rules, it's really a comment. Meaning, the pattern only appears to trip up.
When used, the pattern returns the non-comment portion of the line(s). The pattern has a newline \n in it to allow for applying it to an entire file. You may need to modify that if your system represents newlines in some other fashion, for example as \r or \r\n. To use it in single-line mode you can remove it if you choose. It is at characters 17 and 18 in the one-liner, and on the fifth line, 6th and 7th printing characters, in the multi-line version. You can safely leave it there, however: in single-line mode it makes no difference, and in multi-line mode it will return a newline for lines of code that are either blank or have a comment beginning in the first column. That keeps the line numbers the same in the original and the stripped version if you write the results to a new file, which makes comparison easy.
One major caveat for this pattern: it uses a grouping construct that has varying levels of support in regex engines. I believe that, as used here with a lookaround, only the .NET and PCRE engines will accept it (YMMV). It is a ternary construct: (?(_condition_)_then_|_else_). The _condition_ pattern is treated as a zero-width assertion. If it matches, the _then_ pattern is used in the attempted match; otherwise the _else_ pattern is used. Without that construct, the pattern was growing to unreasonable lengths and was still failing on some of my pathological test cases.
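To make the conditional construct more concrete, here is a deliberately tiny sketch in C#, unrelated to comment stripping. The names ConditionalDemo and IsValid are made up, and the condition is written as an explicit lookahead, which avoids any ambiguity with group names in .NET.

```csharp
using System.Text.RegularExpressions;

// Hypothetical demo of (?(condition)then|else) in .NET.
public static class ConditionalDemo
{
    // Condition: the next character is a digit (zero-width, consumes nothing).
    // Then: require exactly three digits. Else: require one or more lowercase letters.
    private static readonly Regex Pattern = new Regex(@"^(?(?=\d)\d{3}|[a-z]+)$");

    public static bool IsValid(string s)
    {
        return Pattern.IsMatch(s);
    }
}
```

With this pattern, "123" and "abc" are accepted, while "12a" is rejected because the condition selects the three-digit branch as soon as the input starts with a digit.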
The pattern presented here is as it needs to be seen by the regex engine. I am not a C# programmer, so I don't know all the nuances of escaping quoted strings. Getting this pattern into your code, such that all the backslashes and quotes are seen properly by the regex engine is still up to you. Maybe C# has the equivalent of Perl's heredoc syntax.
This is the one-liner pattern to use:
^((?:(?:(?:[^"'/\n]|/(?!/))*)(?("(?=(?:\\\\|\\"|[^"])*"))(?:"(?:\\\\|\\"|[^"])*")|(?('(?=(?:\\\\|\\'|[^'])*'))(?:'(?:\\\\|\\'|[^'])*')|(?(/)|.))))*)
If you want to use the ignore pattern whitespace option, you can use this version:
(?x) # Turn on the ignore white space option
^( # Start the only capturing group
(?: # A non-capturing group to allow for repeating the logic
(?: # Capture either of the two options below
[^"'/\n] # Capture everything not a single quote, double quote, a slash, or a newline
| # OR
/(?!/) # Capture a slash not followed by a slash [slash and negative look-ahead slash]
)* # As many times as possible, even if none
(?(" # Start a conditional match for double-quoted strings
(?=(?:\\\\|\\"|[^"])*") # Followed by a properly closed double-quoted string
) # Then
(?:"(?:\\\\|\\"|[^"])*") # Capture the whole double-quoted string
| # Otherwise
(?(' # Start a conditional match for single-quoted strings
(?=(?:\\\\|\\'|[^'])*') # Followed by a properly closed single-quoted string
) # Then
(?:'(?:\\\\|\\'|[^'])*') # Capture the whole single-quoted string
| # Otherwise
(?([^/]) # If next character is not a slash
.) # Capture that character, it is either a single quote, or a double quote not part of a properly closed string
) # end the conditional match for single-quoted strings
) # End the conditional match for double-quoted strings
)* # Close the repeating non-capturing group, capturing as many times as possible, even if none
) # Close the only capturing group
This allows for your code to explain this monstrosity so that when someone else looks at it, or in a few months you have to work on it yourself, there's no WTF moment. I think the comments explain it well, but feel free to change them any way you please.
As mentioned above, the conditional match grouping has limited support. One place it will fail is on the site you linked to in an earlier comment. Since you're using C#, I chose to do my testing in the .NET Regex Tester, which can handle those constructs. It includes a nice reference too. Given the proper selections on the side, you can test either version above, and experiment with it as well. Considering its complexity, I would recommend testing it, somewhere, against data from your files, as well as any edge cases and pathological tests you can dream up.
Just to redeem this small pattern, there is a much bigger pattern for testing email address that is 78 columns by 81 lines, with a couple dozen characters to spare. (Which I do not recommend using, or any other regex, for testing email addresses. Wrong tool for the job.) If you want to scare yourself, have a peek at it on the ex-parrot site. I had nothing to do with that!!
"[^"\\]*(?:\\[\W\w][^"\\]*)*"|(\/\/.*)
Flags: global
Matches full strings or a comment.
Group 1: comment.
So if there's no comment, replace with the same matching text. Otherwise, do your thing on the comment itself.
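The replace logic described above can be sketched in C# with a MatchEvaluator. This is one possible reading of the answer, not a full parser; the names CommentStripper and Strip are hypothetical, and it only handles double-quoted strings and // comments.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper that removes // comments but leaves string literals alone.
public static class CommentStripper
{
    // First alternative: a complete double-quoted string with backslash escapes.
    // Second alternative (group 1): a // comment running to the end of the line.
    private static readonly Regex StringOrComment = new Regex(
        @"""[^""\\]*(?:\\[\W\w][^""\\]*)*""|(//.*)");

    public static string Strip(string line)
    {
        // Strings are put back unchanged; comments are replaced by nothing.
        return StringOrComment.Replace(line, m => m.Groups[1].Success ? "" : m.Value);
    }
}
```

Because a string literal is always consumed as a whole before the engine can see a // inside it, "some//thing" survives, and only the trailing comment is dropped. Whitespace in front of the comment is kept, so a TrimEnd() may be wanted afterwards.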

How to parse a text file with c#?

How do I parse a Textfile like:
{:block1:}
%param1%= value1
%param2% = value2
%paramn% =valuen
{:block2:}
1st html - sourcecode Just copy 1:1
{:block3:}
2nd html - sourcecode Just copy 1:1
...{:block4:}
3rd html - sourcecode Just copy 1:1
I would like to convert data to a XmlDocument.
Blocks are identified by {::} and params are identified by %%=
Thanx a lot.
What I'm looking for is more an idea than complete code. I have found many examples that read ini files using RegEx and a TextReader to get some lines. The problem is that more than one {:block:} may appear within a single line, and there are many whitespaces and line breaks.
If the problem is that more than one {:block:} can appear within a line, could you replace every "{" with a "\r\n{" to guarantee that every block starts on its own line? (In other words, replace every "{" with "newline{".) Would the extra line breaks cause a problem? Otherwise, you could write a regex to identify only those blocks where you need to insert a line break.
Whitespace and line breaks are both matched by the regex character class \s. A common way to use \s in a regex is either as "\s+" or "\s*", depending on whether the whitespace is required or optional.
It would also help if you were more specific about particular problems.
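As one possible starting point, the block markers can also be located directly with a regex, which makes it irrelevant whether several of them share a line. The sketch below makes several assumptions that are not in the question: the element and attribute names (blocks, block, param, name), the class name BlockFileParser, and the rule that a block counts as a parameter block only when every non-blank line looks like %name% = value.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;
using System.Xml;

// Hypothetical parser: turns {:name:} blocks into <block name="..."> elements.
public static class BlockFileParser
{
    private static readonly Regex BlockMarker = new Regex(@"\{:(\w+):\}");
    private static readonly Regex ParamLine = new Regex(@"^%(\w+)%\s*=\s*(.*)$");

    public static XmlDocument Parse(string text)
    {
        var doc = new XmlDocument();
        XmlElement root = doc.CreateElement("blocks");
        doc.AppendChild(root);

        MatchCollection markers = BlockMarker.Matches(text);
        for (int i = 0; i < markers.Count; i++)
        {
            // A block body runs from the end of its marker to the next marker (or end of text).
            int start = markers[i].Index + markers[i].Length;
            int end = i + 1 < markers.Count ? markers[i + 1].Index : text.Length;
            string body = text.Substring(start, end - start).Trim();

            XmlElement block = doc.CreateElement("block");
            block.SetAttribute("name", markers[i].Groups[1].Value);
            root.AppendChild(block);

            // Anything that is not a pure parameter block is copied 1:1 as CDATA.
            if (!TryAddParams(doc, block, body))
                block.AppendChild(doc.CreateCDataSection(body));
        }
        return doc;
    }

    private static bool TryAddParams(XmlDocument doc, XmlElement block, string body)
    {
        var matches = new List<Match>();
        foreach (string raw in body.Split(new[] { '\n' }, StringSplitOptions.RemoveEmptyEntries))
        {
            string line = raw.Trim(); // also removes a trailing \r
            if (line.Length == 0) continue;
            Match m = ParamLine.Match(line);
            if (!m.Success) return false;
            matches.Add(m);
        }
        if (matches.Count == 0) return false;

        foreach (Match m in matches)
        {
            XmlElement param = doc.CreateElement("param");
            param.SetAttribute("name", m.Groups[1].Value);
            param.InnerText = m.Groups[2].Value;
            block.AppendChild(param);
        }
        return true;
    }
}
```

HTML that itself contains text shaped like {:word:} would be split incorrectly by this sketch, so the marker pattern may need tightening for real data.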

Regex to adjust HTML hrefs in c#

I need to use regex to search through an html file and replace href="pagename" with href="pages/pagename"
Also the href could be formatted like HREF = 'pagename'
I do not want to replace any hrefs that could be upper or lowercase that begin with http, ftp, mailto, javascript, #
I am using c# to develop this little app in.
HTML manipulation through Regex is not recommended since HTML is not a "regular language." I'd highly recommend using the HTML Agility Pack instead. That gives you a DOM interface for HTML.
I have not tested with many cases, but for this case it worked:
var str = "href='page' href = 'www.goo' href='http://' href='ftp://'";
Console.WriteLine(Regex.Replace(str, @"href ?= ?(('|"")([a-z0-9_#.-]+)('|""))", "x", RegexOptions.IgnoreCase));
Result:
"x x href='http://' href='ftp://'"
You had better keep backup files before running this :P
There are lots of caveats when using a find/replace with HTML and XML. The problem is, there are many variations of syntax which are permitted. (and many which are not permitted but still work!)
But, you seem to want something like this:
search for
([Hh][Rr][Ee][Ff]\s*=\s*['"])(\w+)(['"])
This means:
[Hh]: any of the items in square-brackets, followed by
\s*: any number of whitespaces (maybe zero),
=
\s* any more whitespaces,
['"] either quote type,
\w+: a word (without any slashes or dots - if you want to include .html then use [.\w]+ instead ),
and ['"]: another quote of any kind.
replace with
$1pages/$2$3
Which means the things in the first bracket, then pages/, then the stuff in the second and third sets of brackets.
You will need to put the first string in @"..." quotes (a verbatim string), and also escape the double quotes as "".
Note that it won't do anything even vaguely intelligent, like making sure the quotes match. Warning: try never to use the "any character" symbol (.) in this kind of regex, as it will grab large sections of text, up to and including the next quotation mark, possibly all the way to the end of the file!
see a regex tutorial for more info, e.g. http://www.regular-expressions.info/dotnet.html
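Putting the search and replace strings together in C#, and adding a negative lookahead for the prefixes the question wants to skip, might look roughly like this. It is a sketch under assumptions: the names HrefRewriter and AddPagesPrefix are invented, the skipped schemes are assumed to be followed by a colon, and [^'"]+ is used instead of \w+ so that names such as about.html are covered.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper that prefixes local hrefs with "pages/".
public static class HrefRewriter
{
    // Group 1: href, optional whitespace, "=", optional whitespace, opening quote.
    // Lookahead: skip values starting with http:, https:, ftp:, mailto:, javascript: or #.
    // Group 2: the page name. Group 3: the closing quote.
    private static readonly Regex LocalHref = new Regex(
        @"(\bhref\s*=\s*['""])(?!(?:https?|ftp|mailto|javascript):|#)([^'""]+)(['""])",
        RegexOptions.IgnoreCase);

    public static string AddPagesPrefix(string html)
    {
        // ${1} rather than $1 keeps the group number clearly separated from "pages".
        return LocalHref.Replace(html, "${1}pages/${2}${3}");
    }
}
```

As with the pattern above, nothing checks that the opening and closing quotes are of the same kind, and running it twice would add the prefix twice.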

I have two problems, one of them is a regex

I am updating some code that I didn't write and part of it is a regex as follows:
\[url(?:\s*)\]www\.(.*?)\[/url(?:\s*)\]
I understand that .*? does a non-greedy match of everything in the second register.
What does ?:\s* in the first and third registers do?
Update: As requested, language is C# on .NET 3.5
The syntax (?:) is a way of putting parentheses around a subexpression without separately extracting that part of the string.
The author wanted to match the (.*?) part in the middle, and didn't want the spaces at the beginning or the end getting in the way. Now you can use \1 or $1 (or whatever the appropriate method is in your particular language) to refer to the domain name, instead of the first chunk of spaces at the beginning of the string.
?: makes the parentheses non-capturing. In that regex, you'll only pull out one piece of information, $1, which contains the middle (.*?) expression.
What does ?:\s* in the first and third registers do?
It's matching zero or more whitespace characters, without capturing them.
The regex author intends to allow trailing whitespace in the square-bracket-tags, matching all DNS labels following the "www." like so:
[url]www.foo.com[/url] # foo.com
[url ]www.foo.com[/url ] # same
[url ]www.foo.com[/url] # same
[url]www.foo.com[/url ] # same
Note that the regex also matches:
[url]www.[/url] # empty string!
and fails to match
[url]stackoverflow.com[/url] # no match, bummer
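The cases listed above can be checked from C# with a few lines. In this sketch the names UrlTagDemo, Domain and GroupCount are hypothetical; the pattern itself is the one from the question.

```csharp
using System.Text.RegularExpressions;

// Hypothetical demo of capturing vs. non-capturing groups in the [url] regex.
public static class UrlTagDemo
{
    private static readonly Regex UrlTag = new Regex(
        @"\[url(?:\s*)\]www\.(.*?)\[/url(?:\s*)\]");

    // Returns the text captured after "www.", or null when the input does not match.
    public static string Domain(string input)
    {
        Match m = UrlTag.Match(input);
        return m.Success ? m.Groups[1].Value : null;
    }

    // Group 0 (the whole match) plus the single capturing group.
    public static int GroupCount
    {
        get { return UrlTag.GetGroupNumbers().Length; }
    }
}
```

GroupCount is 2, which confirms that the two (?:\s*) groups do not produce captures of their own.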
You may find this Regular Expressions Cheat Sheet very helpful (hopefully). I spent ages trying to learn Regex with no luck. And once I read this cheat-sheet - I immediately understood what I previously failed to learn.
http://krijnhoetmer.nl/stuff/regex/cheat-sheet/
