How to parse a text file with c#?

How do I parse a Textfile like:
{:block1:}
%param1%= value1
%param2% = value2
%paramn% =valuen
{:block2:}
1st html - sourcecode Just copy 1:1
{:block3:}
2nd html - sourcecode Just copy 1:1
...{:block4:}
3rd html - sourcecode Just copy 1:1
I would like to convert the data to an XmlDocument.
Blocks are identified by {::} and params are identified by %%=
Thanks a lot.
What I'm looking for is more an idea than complete code. I have found many examples that read ini files using a Regex and a TextReader to get some lines. The problem is that more than one {:block:} may appear within a line, and there is a lot of whitespace and many line breaks.

If the problem is that more than one {:block:} can appear within a line, could you replace every "{" with "\r\n{" to guarantee that every block starts on its own line? Would the extra line breaks cause a problem? Otherwise, you could write a regular expression that identifies only those blocks where you need to insert a line break.
Whitespace and line breaks are both matched by the regex character class \s. A common way to use \s is either as "\s+" or "\s*", depending on whether the whitespace is required or optional.
It would also help if you were more specific about particular problems.
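One way to avoid depending on line breaks at all is to split the whole text on the block markers instead of reading line by line. The following is only a sketch under assumptions: the element names (blocks, block, param), the class name BlockFileParser, and the rule "a block that contains %name% = value pairs is a parameter block, anything else is copied 1:1" are all made up for illustration and are not part of the question.

```csharp
using System;
using System.Text.RegularExpressions;
using System.Xml;

public static class BlockFileParser
{
    // Matches a block marker such as {:block1:} and captures its name.
    private static readonly Regex BlockMarker = new Regex(@"\{:(\w+):\}");

    // Matches one parameter such as "%param1% = value1".
    // Assumption: one parameter per line; the value runs to the end of the line.
    private static readonly Regex Param =
        new Regex(@"%(\w+)%[ \t]*=[ \t]*([^\r\n]*)");

    public static XmlDocument Parse(string text)
    {
        var doc = new XmlDocument();
        XmlElement root = doc.CreateElement("blocks");
        doc.AppendChild(root);

        // The marker pattern has a capture group, so Regex.Split returns
        // [text before first marker, name1, body1, name2, body2, ...].
        string[] parts = BlockMarker.Split(text);
        for (int i = 1; i + 1 < parts.Length; i += 2)
        {
            XmlElement block = doc.CreateElement("block");
            block.SetAttribute("name", parts[i]);
            string body = parts[i + 1];

            MatchCollection parameters = Param.Matches(body);
            if (parameters.Count > 0)
            {
                // Parameter block: one <param> element per %name% = value pair.
                foreach (Match p in parameters)
                {
                    XmlElement param = doc.CreateElement("param");
                    param.SetAttribute("name", p.Groups[1].Value);
                    param.InnerText = p.Groups[2].Value.Trim();
                    block.AppendChild(param);
                }
            }
            else
            {
                // Anything else is copied as-is into a CDATA section.
                // Caveat: a body containing "]]>" cannot be stored this way.
                block.AppendChild(doc.CreateCDataSection(body.Trim()));
            }
            root.AppendChild(block);
        }
        return doc;
    }
}
```

Calling BlockFileParser.Parse(File.ReadAllText(path)) would then give a document whose OuterXml can be inspected. Because the split is done on the markers themselves, several {:block:} markers on one line are handled the same way as markers on separate lines.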

Related

Matching multiple lines up until a separator line?

I'm teaching myself some Regex while trying to parse a datasheet, and I'm thinking there's not an easy way (in Regex, I mean.. in C#, sure!) to do this. Say I have a file with the lines:
0000AA One Token - Value
0000AA Another Token- Another Value
0000AA YA Token - Yet Another
0000AA Yes, Another - Even More
0000AA
0000AA ______________________________________________________________________
0000AA This line - while it will match the regex, shouldn't.
So I have an easy multi-line regex:
^\s*[A-Z]{2}[0-9]{4}\s\s*(?<token>.*?)\-(?<value>.*?)$
This loads All the 'Tokens' into 'token', and all the values into 'value' group. Pretty simple! However, the Regex ALSO matches the bottom line, putting 'This line' into the token, and 'while it will [...]' into the value.
Essentially, I'd like the regex to only match the lines above the ____ separator line. Would this be possible with Regex alone, or will I need to modify my incoming string first to .Split() on the ____ separator line?
Cheers all --Mike.

Parsing such a text file with regex only would not be using the right tool for the job. Although possible, it would be both inefficient and unnecessarily complex.
I would actually not load all the text into a string and split on this line either, as it's not the most efficient way of doing this. I would rather read through the file in a loop, one line at a time, processing each line as needed. Then stop processing when you reach this particular line.
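The line-by-line approach could look roughly like the sketch below. The class name DatasheetReader and both patterns are assumptions for illustration: they treat the first six characters as an opaque prefix (\w{6}, matching the sample lines) and treat a prefix followed only by four or more underscores as the separator.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Text.RegularExpressions;

public static class DatasheetReader
{
    // Six-character prefix, whitespace, then "token - value".
    private static readonly Regex TokenLine =
        new Regex(@"^\s*\w{6}\s+(?<token>.*?)-(?<value>.*)$");

    // Six-character prefix followed only by a run of underscores.
    private static readonly Regex Separator =
        new Regex(@"^\s*\w{6}\s+_{4,}\s*$");

    public static List<KeyValuePair<string, string>> ReadTokens(TextReader reader)
    {
        var result = new List<KeyValuePair<string, string>>();
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            // Stop as soon as the separator line is reached.
            if (Separator.IsMatch(line))
                break;

            Match m = TokenLine.Match(line);
            if (m.Success)
            {
                result.Add(new KeyValuePair<string, string>(
                    m.Groups["token"].Value.Trim(),
                    m.Groups["value"].Value.Trim()));
            }
        }
        return result;
    }
}
```

ReadTokens accepts any TextReader, so it works with File.OpenText(path) for a real file or with a StringReader for text that is already in memory. Nothing after the separator is ever read or matched.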

I'd like the regex to only match the lines above the ____ separator line. Would this be possible with Regex alone?
Sure it's possible. Add a lookahead to make sure such a line follows, something like:
(?=(?s).*^\w{6}[ \t]+_{4,})
Add this to the end of your expression to make sure that such a line follows. E.g.:
(?m)^\s*[A-Z]{2}[0-9]{4}\s\s*(?<token>.*?)\-(?<value>.*)$(?=(?s).*^\w{6}[ \t]+_{4,})
(Also added m and s flags in the expression.)
This is not very efficient though, as the regex engine will probably need to scan through most of the string for every match.

Regex to adjust HTML hrefs in c#

I need to use regex to search through an html file and replace href="pagename" with href="pages/pagename"
Also the href could be formatted like HREF = 'pagename'
I do not want to replace any hrefs that could be upper or lowercase that begin with http, ftp, mailto, javascript, #
I am using c# to develop this little app in.

HTML manipulation through Regex is not recommended since HTML is not a "regular language." I'd highly recommend using the HTML Agility Pack instead. That gives you a DOM interface for HTML.

I have not tested with many cases, but for this case it worked:
var str = "href='page' href = 'www.goo' href='http://' href='ftp://'";
Console.WriteLine(Regex.Replace(str, @"href ?= ?(('|"")([a-z0-9_#.-]+)('|""))", "x", RegexOptions.IgnoreCase));
Result:
"x x href='http://' href='ftp://'"
You'd better keep backup files before running this :P

There are lots of caveats when using a find/replace with HTML and XML. The problem is, there are many variations of syntax which are permitted. (and many which are not permitted but still work!)
But, you seem to want something like this:
search for
([Hh][Rr][Ee][Ff]\s*=\s*['"])(\w+)(['"])
This means:
[Hh]: any of the items in square-brackets, followed by
\s*: any number of whitespaces (maybe zero),
=
\s* any more whitespaces,
['"] either quote type,
\w+: a word (without any slashes or dots - if you want to include .html then use [.\w]+ instead ),
and ['"]: another quote of any kind.
replace with
$1pages/$2$3
Which means the things in the first bracket, then pages/, then the stuff in the second and third sets of brackets.
You will need to put the first string in @" quotes, and also escape the double-quotes as "".
Note that it won't do anything even vaguely intelligent, like making sure the quotes match. Warning: try never to use the "any character" symbol (.) in this kind of regex, as it will grab large sections of text, up to and beyond the next quotation mark, possibly up to the end of the file!
See a regex tutorial for more info, e.g. http://www.regular-expressions.info/dotnet.html
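Put together in C#, the search and replace described above might look like this. It is a sketch, not a tested solution for real-world HTML: the class name HrefRewriter is hypothetical, RegexOptions.IgnoreCase is swapped in for the [Hh][Rr][Ee][Ff] spelling, and the [.\w]+ variant is used so that names such as about.html are covered.

```csharp
using System.Text.RegularExpressions;

public static class HrefRewriter
{
    // Group 1: href, optional whitespace, =, optional whitespace, opening quote.
    // Group 2: the page name (letters, digits, underscores and dots only).
    // Group 3: the closing quote.
    private static readonly Regex RelativeHref = new Regex(
        @"(href\s*=\s*['""])([.\w]+)(['""])",
        RegexOptions.IgnoreCase);

    public static string AddPagesPrefix(string html)
    {
        // $1 and $3 put the original text back; "pages/" goes in front of $2.
        return RelativeHref.Replace(html, "$1pages/$2$3");
    }
}
```

Because the page-name group only allows word characters and dots, values such as http://..., ftp://..., mailto:..., javascript:... and #anchor contain a character the group cannot match before the closing quote, so they are left untouched without listing the schemes explicitly. The same restriction means hrefs that already contain a slash are skipped as well.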

Regex matching on to extract multi-line text regions (C#)

I'm looking to capture text regions in a large text block, created in the following format:
...
[region:region-name]
multi line
text block
[/region]
...
[region:another-region-name]
more
multi-line text
[/region]
I have this almost worked out with
\[region:(?'link'.*)\](?'text'(.|[\r\n])*)\[/region\]
This works if I only had one region in the entire text. But, when there are multiple, this gives me just one block with every other 'region' included in the 'text' of that one.
I have a feeling that this is to be solved using a negative look ahead, but being a non-pro with regex, I don't know how to modify the above to do it right.
Can someone help?

You can do this without lookahead:
\[region:(?'link'.*)\](?'text'(?s).*?)\[/region\]
The additional ? makes the * quantifier lazy, so it will match as few characters as possible. And the (?s) allows the dot to match newlines after this position, so you don't have to use the (.|[\r\n]) construction (an alternative would be [\s\S]).

You don't need a negative lookahead, just need to change (?'text'(.|[\r\n])*) to be "non-greedy", so that it will match the first instance of [/region] rather than the last. You can do this by adding ? after *, so the resulting pattern would be:
\[region:(?'link'.*)\](?'text'(.|[\r\n])*?)\[/region\]
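For reference, here is a small sketch of how the lazy pattern could be used from C#. The class name RegionExtractor is hypothetical, and two details differ from the patterns above: RegexOptions.Singleline replaces the inline (?s), and the region name is matched with [^\]\r\n]* instead of .* so that it cannot run past the closing bracket once the dot is allowed to match newlines.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class RegionExtractor
{
    // .*? is lazy, so each match stops at the first [/region] it finds.
    private static readonly Regex Region = new Regex(
        @"\[region:(?<link>[^\]\r\n]*)\](?<text>.*?)\[/region\]",
        RegexOptions.Singleline);

    public static Dictionary<string, string> Extract(string input)
    {
        var regions = new Dictionary<string, string>();
        foreach (Match m in Region.Matches(input))
        {
            // Trim removes the line breaks right after the opening tag
            // and right before the closing tag.
            regions[m.Groups["link"].Value] = m.Groups["text"].Value.Trim();
        }
        return regions;
    }
}
```

With the sample text from the question, Extract returns two entries keyed by region-name and another-region-name. If region names can repeat, a list of pairs would be a better return type than a dictionary, since later regions overwrite earlier ones here.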

Regular expression to split on commas that are outside the double quotes in a CSV file?

This is the sample
"abc","abcsds","adbc,ds","abc"
Output should be
abc
abcsds
adbc,ds
abc

Try this:
"(.*?)"
if you need to put this regex inside a literal, don't forget to escape it:
Regex re = new Regex("\"(.*?)\"");

This is a tougher job than you realize -- not only can there be commas inside the quotes, but there can also be quotes inside the quotes. Two consecutive quotes inside a quoted string do not signal the end of the string. Instead, they signal a quote embedded in the string, so for example:
"x", "y,""z"""
should be parsed as:
x
y,"z"
So, the basic sequence is something like this:
Find the first non-white-space character.
If it was a quote, read up to the next quote. Then read the next character.
Repeat until that next character is not also a quote.
If the next (non-whitespace) character is not a comma, input is malformed.
If it was not a quote, read up to the next comma.
Skip the comma, repeat the whole process for the next field.
Note that despite the tag, I'm not providing a regex -- I'm not at all sure I've seen a regex that can really handle this properly.
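The sequence above can be sketched as a small hand-written scanner for a single line. This is an illustration under assumptions rather than a complete CSV reader: the class name CsvLineParser is hypothetical, quoted fields that span several lines are not handled, and unquoted fields keep their trailing whitespace.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class CsvLineParser
{
    public static List<string> ParseLine(string line)
    {
        var fields = new List<string>();
        int i = 0;
        while (true)
        {
            // Find the first non-white-space character of the field.
            while (i < line.Length && char.IsWhiteSpace(line[i])) i++;

            var field = new StringBuilder();
            if (i < line.Length && line[i] == '"')
            {
                i++; // skip the opening quote
                while (true)
                {
                    if (i >= line.Length)
                        throw new FormatException("Unterminated quoted field.");
                    if (line[i] != '"')
                    {
                        field.Append(line[i++]);
                    }
                    else if (i + 1 < line.Length && line[i + 1] == '"')
                    {
                        field.Append('"'); // doubled quote = embedded quote
                        i += 2;
                    }
                    else
                    {
                        i++; // closing quote
                        break;
                    }
                }
                // The next non-whitespace character must be a comma (or end of line).
                while (i < line.Length && char.IsWhiteSpace(line[i])) i++;
                if (i < line.Length && line[i] != ',')
                    throw new FormatException("Expected a comma after a quoted field.");
            }
            else
            {
                // Unquoted field: read up to the next comma.
                while (i < line.Length && line[i] != ',') field.Append(line[i++]);
            }

            fields.Add(field.ToString());
            if (i >= line.Length) break;
            i++; // skip the comma and repeat for the next field
        }
        return fields;
    }
}
```

For the example "x", "y,""z""" this returns the two fields x and y,"z", and malformed input raises a FormatException instead of being silently misread.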

This answer has a C# solution for dealing with CSV.
In particular, the line
private static Regex rexCsvSplitter = new Regex( @",(?=(?:[^""]*""[^""]*"")*(?![^""]*""))" );
contains the Regex used to split properly, i.e., taking quoting and escaping into consideration.
Basically what it says is, match any comma that is followed by an even number of quote marks (including zero). This effectively prevents matching a comma that is part of a quoted string, since the quote character is escaped by doubling it.
Keep in mind that the quotes in the above line are doubled for the sake of the string literal. It might be easier to think of the expression as
,(?=(?:[^"]*"[^"]*")*(?![^"]*"))
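A possible way to use that expression is shown below. The splitter pattern is the one quoted above; the surrounding class CsvRegexSplitter and the Unquote helper are assumptions added for illustration, since splitting alone leaves the enclosing quotes and the doubled quotes in place.

```csharp
using System.Linq;
using System.Text.RegularExpressions;

public static class CsvRegexSplitter
{
    // Matches a comma that is followed by an even number of quote characters.
    private static readonly Regex Splitter =
        new Regex(@",(?=(?:[^""]*""[^""]*"")*(?![^""]*""))");

    public static string[] Split(string line)
    {
        return Splitter.Split(line).Select(Unquote).ToArray();
    }

    // Removes one pair of enclosing quotes and collapses doubled quotes.
    private static string Unquote(string field)
    {
        field = field.Trim();
        if (field.Length >= 2 && field[0] == '"' && field[field.Length - 1] == '"')
        {
            field = field.Substring(1, field.Length - 2).Replace("\"\"", "\"");
        }
        return field;
    }
}
```

The helper deliberately strips only one quote from each end with Substring. Using Trim('"') instead would also eat an embedded quote that happens to sit at the end of a field, as in "y,""z""".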

If you can be sure there are no inner, escaped quotes, then I guess it's ok to use a regular expression for this. However, most modern languages already have proper CSV parsers.
Using a proper parser is the correct answer to this. Text::CSV for Perl, for example.
However, if you're dead set on using regular expressions, I'd suggest you "borrow" from some sort of module, like this one:
http://metacpan.org/pod/Regexp::Common::balanced

Regex for parsing Wikicode in C#

I try to parse articles from wikipedia. I use the *page-articles.xml file, where they backup all their articles in a wikicode-format. To strip the format and get the raw text, I try to use Regular Expressions, but I am not very used to it. I use C# as programming language.
I tried a bit around with Expresso, a designer for Regular Expressions, but I am at the end of my wits. Here is what I want to achieve:
The text can contain the following structures:
[[TextN]] or
[[Text1|TextN]] or
[[Text1|Text2|...|TextN]]
The [[ .... ]] pattern can appear within the Text_i as well. I want to replace these structures with TextN.
To identify the structures within the text I tried the following RegEx:
\[\[ ( .* \|?)* \]\]
Expresso seems to run into an endless loop with this one. After 5 minutes on a relatively small text, I canceled the test run.
Then I tried something more simple, I want to capture anything between the brackets:
\[\[ .* \]\]
but when I have a line like:
[[Word1]] text inbetween [[Word2]]
the expression returns the whole line, not
[[Word1]]
[[Word2]]
Any tips from Regex-Experts here to solve the problem?
Thanks in advance,
Frank

I wouldn't use regular expressions (since they don't handle recursion/nesting well).
Instead I would parse the text by hand*, which isn't particularly difficult in this case.
You could represent the text as a stream of elements, where each element is either
a plain text chunk, or
a tag
A tag might contain multiple element streams, separated by |.
elementStream ::= element*
element ::= chunk | tag
chunk ::= TEXT
tag ::= "[[" elementStream otherStreams "]]"
otherStreams ::= "|" elementStream otherStreams | (empty)
Your parser could represent each of those definitions with a method. So you'd have an elementStream method that would call element as long as there is text available and the next two characters are not "]]" or "|" (if you are inside a tag).
Each call to element would return the element parsed, either a chunk or a tag. etc.
This would essentially be a recursive descent parser.
Wikipedia: http://en.wikipedia.org/wiki/Recursive_descent_parser (the article is rather long/complicated, unfortunately)
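A minimal sketch of such a parser follows, with one method per grammar rule. It makes several assumptions that are not in the answer: the class name WikiLinkStripper is hypothetical, the parser returns the replacement text directly instead of building an element tree, and an unclosed [[ is tolerated rather than reported as an error.

```csharp
using System.Text;

public static class WikiLinkStripper
{
    // Replaces every [[Text1|...|TextN]] with TextN, including nested tags.
    public static string Strip(string input)
    {
        int pos = 0;
        return ParseStream(input, ref pos, false);
    }

    // elementStream ::= element*
    // Inside a tag the stream ends at "|" or "]]"; at top level it ends at end of input.
    private static string ParseStream(string s, ref int pos, bool insideTag)
    {
        var sb = new StringBuilder();
        while (pos < s.Length)
        {
            if (StartsWith(s, pos, "[["))
            {
                sb.Append(ParseTag(s, ref pos)); // element is a tag
            }
            else if (insideTag && (s[pos] == '|' || StartsWith(s, pos, "]]")))
            {
                break; // the enclosing tag handles this delimiter
            }
            else
            {
                sb.Append(s[pos]); // element is plain text
                pos++;
            }
        }
        return sb.ToString();
    }

    // tag ::= "[[" elementStream otherStreams "]]"
    private static string ParseTag(string s, ref int pos)
    {
        pos += 2; // skip "[["
        string last = ParseStream(s, ref pos, true);
        while (pos < s.Length && s[pos] == '|')
        {
            pos++; // skip "|"
            last = ParseStream(s, ref pos, true); // keep only the latest alternative
        }
        if (StartsWith(s, pos, "]]"))
            pos += 2; // skip "]]"; a missing close is tolerated
        return last;
    }

    private static bool StartsWith(string s, int pos, string token)
    {
        return string.CompareOrdinal(s, pos, token, 0, token.Length) == 0;
    }
}
```

For example, Strip("[[Word1]] text inbetween [[Word2]]") yields "Word1 text inbetween Word2", and a nested input such as "[[File:x|thumb|A [[b|c]] d]]" yields "A c d". Each character is visited once, so the run time grows linearly with the length of the text.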

\[\[(.*?)\]\] would do it.
The key is the .*? which means get any characters but as few as possible.
EDIT
For nested tags one approach would be:
\[\[(?<text>(?>\[\[(?<Level>)|\]\](?<-Level>)|(?! \[\[ | \]\] ).)+(?(Level)(?!)))\]\]
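This ensures that the [[ and ]] are balanced within the text as well. Note that the spaces inside the lookahead are only ignored if the pattern is used with RegexOptions.IgnorePatternWhitespace.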

This is because regular expressions always try to find the longest possible match. You should replace the .* with something more specific.
Try using
\[\[([A-Za-z][A-Za-z\d+]*)(\|\1)*\]\]
This will match only letters, numbers and the | sign inside double brackets, and it checks that the value starts with a letter.

If Expresso isn't working out for you, you may want to try RegexBuddy.
While not free, it does provide an excellent real time testing environment where you can see how your regex is going to match a section of sample text.

If GPL2 is not an issue for you, maybe you could check out the source code of Screwturn Wiki and see how an expert does it. It's in C#, BTW
