Regex to match XML elements in a text file [duplicate] - c#

This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
RegEx match open tags except XHTML self-contained tags
I have a text file consisting of conversion instruction templates.
I need to parse this text file and match something like this:
(Source: <element>)
And get the "element".
Or this pattern:
(Source: <element attr="name" value=""/>)
And get "element attr="name"".
I am currently using this regex:
\(Source:\ ?\<(.*?)\>\)
Sorry for being a newbie. :)
Thanks for all your help.
-JRC

Try this regex, which accepts attribute values quoted with either ” or ":
\(Source:\s*<(\w+(?:\s+\w+=[\"”][^\"”]*[\"”])?)[^>]*>\)
and your code:
var result = Regex.Match(strInput,
"\\(Source:\\s*<(\\w+(?:\\s+\\w+=[\"”][^\"”]*[\"”])?)[^>]*>\\)")
.Groups[1].Value;
Explanation:
(subexpression)
Captures the matched subexpression and assigns it a one-based ordinal number (group 0 is the whole match).
?
Matches the previous element zero or one time.
\w
Matches any word character.
+
Matches the previous element one or more times.
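
To check the behaviour on both inputs from the question, here is a minimal sketch in Python rather than C#. It assumes a variant of the pattern in which the whitespace and the attribute are optional together, so that a bare <element> also matches; the names SOURCE_RE and extract_source are made up for illustration.

```python
import re

# Assumed variant of the answer's pattern: the element name is required,
# and one attribute (quoted with " or ”) is optional.
SOURCE_RE = re.compile(r'\(Source:\s*<(\w+(?:\s+\w+=["”][^"”]*["”])?)[^>]*>\)')

def extract_source(line):
    """Return the captured element text, or None when the line does not match."""
    m = SOURCE_RE.search(line)
    return m.group(1) if m else None

plain = extract_source('(Source: <element>)')
with_attr = extract_source('(Source: <element attr="name" value=""/>)')
```

Group 1 holds the element name plus the first attribute, if any; the remaining attributes are skipped by [^>]*.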

Related

Can regex match Interleaved matches? [duplicate]

This question already has answers here:
How to find overlapping matches with a regexp?
(4 answers)
Closed 5 years ago.
I have a pattern with opening tags and closing tags
e.g. /*tag1_START*/ some content /*tag1_END*/ other text /*tag2_START*/ some content /*tag2_END*/
and I use the regex \/\*([a-zA-Z0-9]+)_START\*\/(.*?)\/\*\1_END\*
(there is a demo on regex101).
However, there was a case where the tags were interleaved (by mistake):
e.g. /*tag3_START*/ some /*tag4_START*/ content /*tag3_END*/ other /*tag4_END*/ content
I can easily check for overlap between the matches, but the regex does not return both tags, because it continues from the last character it matched.
Can I use a regex to find overlapping matches, or do I need to write my own code?
Lookarounds assert rather than consume characters. However, capturing groups inside them still store the matched parts. Just put the overlapping part inside a positive lookahead:
\/\*([a-zA-Z0-9]+)_START\*\/(?=(.*?)\/\*\1_END\*)
(?=\*([a-zA-Z0-9]+)_START\*\/(.*?)\/\*(\1)_END\*)
You will have to use a lookahead so that the match itself consumes nothing. See the demo:
https://regex101.com/r/vsA3ZU/1
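
The lookahead approach can be sketched as follows, in Python rather than C#, using the interleaved sample from the question. The pattern is the first answer's, with the closing \/ of the END tag added; the name find_tag_bodies is hypothetical.

```python
import re

# Only the START tag is consumed. The body and the matching END tag sit
# inside a lookahead, so the next search can begin inside that body.
PATTERN = re.compile(r"/\*([a-zA-Z0-9]+)_START\*/(?=(.*?)/\*\1_END\*/)", re.DOTALL)

def find_tag_bodies(text):
    """Return (tag, body) pairs, including pairs whose spans interleave."""
    return [(m.group(1), m.group(2)) for m in PATTERN.finditer(text)]

sample = ("/*tag3_START*/ some /*tag4_START*/ content "
          "/*tag3_END*/ other /*tag4_END*/ content")
pairs = find_tag_bodies(sample)
```

Both tag3 and tag4 are reported even though their spans overlap; deciding whether an overlap is an error is still left to the calling code.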

What does regex expression match pattern "\\[.*\\]" mean? [duplicate]

This question already has an answer here:
Reference - What does this regex mean?
(1 answer)
Closed 6 years ago.
I am new to regex. What does the regex match pattern "\\[.*\\]" mean?
If I have a text like "Hello [Here]", then success is returned in the match. And match contain [Here].
I read that:
. indicates Any except \n (newline),
* indicates 0 or more times
I don't understand the "\\". I believe it is just the escape sequence for "\".
So, is the expression "\\[.*\\]" trying to match a pattern like \[Any text\]?
Yes, you are right about the escaping. The string literal "\\[.*\\]" produces the regex \[.*\], in which \[ and \] match literal square brackets, so it will match any characters enclosed in []. The .* means any characters, or none, between the brackets.
Also, you should try regexr, which is a very helpful regex tool. You can input the regex pattern and check for matches easily.
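
As a quick check of that reading, here is a small Python sketch (Python's raw string r"\[.*\]" plays the role of the escaped string literal). The second sample string is an assumption added to show that .* is greedy.

```python
import re

# \[ and \] match literal brackets; .* matches as much as possible in between.
BRACKETED = re.compile(r"\[.*\]")

single = BRACKETED.search("Hello [Here]").group(0)
# With two bracketed parts on one line, the greedy .* spans both of them.
greedy = BRACKETED.search("[a] and [b]").group(0)
empty = BRACKETED.search("nothing [] inside").group(0)
```

If each bracketed part should be matched separately, a lazy quantifier (\[.*?\]) or a negated class (\[[^\]]*\]) is the usual alternative.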

Regex match account number in PDF until new line [duplicate]

This question already has answers here:
Match exact string
(3 answers)
Closed 6 years ago.
I'm working on a pdf scraper in C# and I got stuck on a regex problem. I want to match just the account number and my regex statement is matching both the incorrect line and the correct line. I think I have to match everything until a new line but I can't find a way to do it.
This is my regex: ([A-Z0-9\-]{5,30})-[0-9]{1,10}-[0-9]{3}
XXX-XX-914026-1558513 // I don't want to match this line
130600298-110-528 // I want to match this line
Thanks in advance!
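You have to add anchors:
^([A-Z0-9\-]{5,30})-[0-9]{1,10}-[0-9]{3}$
^                                       ^
These mean start of line (^) and end of line ($). In .NET they anchor to the start and end of the whole input unless you pass RegexOptions.Multiline.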
If you don't, the match will be:
XXX-XX-914026-1558513
^^^^^^^^^^^^^^^^^
Also, you don't have to escape the hyphen at the end of a character class, and you can use \d instead of [0-9] (note: \d matches digits from any script, not just 0-9), which gives:
^([A-Z0-9-]{5,30})-\d{1,10}-\d{3}$
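
The effect of the anchors can be sketched in Python rather than C#, using the two lines from the question joined by a newline (an assumption about how the PDF text arrives). re.MULTILINE plays the role of RegexOptions.Multiline, and [0-9] is kept so that only ASCII digits match.

```python
import re

BODY = r"([A-Z0-9-]{5,30})-[0-9]{1,10}-[0-9]{3}"
UNANCHORED = re.compile(BODY)
# ^ and $ match at every line boundary because of re.MULTILINE.
ANCHORED = re.compile("^" + BODY + "$", re.MULTILINE)

text = "XXX-XX-914026-1558513\n130600298-110-528"

loose = [m.group(0) for m in UNANCHORED.finditer(text)]
strict = [m.group(0) for m in ANCHORED.finditer(text)]
```

Without anchors the first line yields a partial match that stops three digits into the last number; with anchors only the second line matches.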

Get words between "<" and ">" in .net [duplicate]

This question already has answers here:
RegEx match open tags except XHTML self-contained tags
(35 answers)
What is the best way to parse html in C#? [closed]
(15 answers)
Closed 7 years ago.
I have written a program to identify tags (between < and >) in a string. From the string below I am able to get <P>, <OL> and <LI>, but DIV is not being matched. Any idea what I am doing wrong?
string yy = @"<P> </P><OL><LI><DIV align=center>fjsdhfsdjf</DIV></LI><LI>";
MatchCollection allMatchResults = null;
var regexObj = new Regex(@"<\w*>");
allMatchResults = regexObj.Matches(yy);
DIV is not being matched because \w does not match spaces. Use new Regex(@"<[^>]+>");
You are not getting DIV because it has an attribute. Use .*? to include attributes or any other text.
var regexObj = new Regex(@"<\w.*?>");
You can use Html Agility Pack to easily parse and manipulate the HTML.
\w* will match only alphanumeric characters and the underscore.
Here the problem lies in the space and the =.
Quick solution:
<[^>]+> instead of <\w*>
But You may want to consider this:
RegEx match open tags except XHTML self-contained tags
Your regex is wrong; it should be something like
@"<[^>]+>"
Also, if you have to do a lot of regexes like this, maybe it's better to use something like HTMLAgilityPack. It allows you to parse out the html into node lists that you can iterate through.
Samples can be found here.
I have more confidence in this method; we use it daily where I work.
It's a translation company, so we translate XML, HTML and PHP files into different languages.
var myRegex = new Regex(@"(<[^>]+>)");
here is just the regex:
(<[^>]+>)
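
The difference between the two patterns can be seen in a short Python sketch (rather than C#) run on the string from the question. As the answers above note, this is only a quick fix; an HTML parser is the more robust choice.

```python
import re

html = "<P> </P><OL><LI><DIV align=center>fjsdhfsdjf</DIV></LI><LI>"

# \w* stops at the space in <DIV align=center> and at the / in closing tags.
word_only = re.findall(r"<\w*>", html)
# [^>]+ accepts anything up to the next >, so attributes and closing tags match.
any_tag = re.findall(r"<[^>]+>", html)
```

Here word_only contains only the attribute-free opening tags, while any_tag contains all eight tags in document order.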

How do I write a regex to match a string that doesn't contain a word? [duplicate]

This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
Regular expression to match string not containing a word?
To not match a set of characters I would use e.g. [^\"\\r\\n]*
Now I want to not match a fixed character sequence, e.g. "|="
In other words, I want to match: ( not ", not \r, not \n, and not |= ).
EDIT: I am trying to modify the regex for parsing data separated with delimiters. I got the single-delimiter solution from a CSV parser, but now I want to expand it to include multi-character delimiters. I do not think lookaheads will work, because I want to consume, not just assert and discard, the matching characters.
I figured it out, it should be: ((?![\"\\r\\n]|[|][=]).)*
The full regex, modified from the CSV parser link in the original post, will be: ((?<field>((?![\"\\r\\n]|[|][=]).)*)|\"(?<field>([^\"]|\"\")*)\")([|][=]|(?<rowbreak>\\r\\n|\\n|$))
This will match any amount of characters of ( not ", not \r, not \n, and not |= ), or a quoted string, followed by ( "|=" or end of line )
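
The core of that solution, a negative lookahead checked before every consumed character, can be sketched on its own in Python rather than C#. Only the unquoted-field part is shown; the full pattern's duplicate named groups are a .NET feature and are left out. The sample strings are assumptions.

```python
import re

# Consume one character at a time, but only while the next position does not
# start with a quote, \r, \n, or the two-character delimiter |=.
FIELD = re.compile(r'(?:(?!["\r\n]|\|=).)*')

first = FIELD.match("abc|=def").group(0)
# A lone | or = is allowed; only the sequence |= ends the field.
lone = FIELD.match("a|b=c|=d").group(0)
stops_at_newline = FIELD.match("row1\nrow2").group(0)
```

The lookahead itself consumes nothing; it is the dot after it that consumes each character, which is why this works where a single lookahead would not.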
