Regex against markup after XPath? - c#

I have been searching for a solution to my problem for a while and have been experimenting on regex101.com, but I cannot find one.
The problem is that I have to select a substring from several different inputs, so I wanted to use regular expressions to get the wanted data from these strings.
The regular expression will come from a configuration for each string separately (since they differ).
The string below is obtained with the XPath //body/div/table/tbody/tr/td/p[5], but I cannot dig any deeper with it to retrieve the right data, or can I?
The string I am currently using as an example is the following:
<strong>Kontaktdaten des Absenders:</strong>
<br>
<strong>Name:</strong> Wanted data
<br>
<strong>Telefon:</strong>
<a dir='ltr' href='tel:XXXXXXXXX' x-apple-data-detectors='true' x-apple-data-detectors-type='telephone' x-apple-data-detectors-result='3'>XXXXXXXXX</a>
<br>
From this string I am trying to get the "Wanted data".
My regular expression so far is the following:
(?<=<\/strong> )(.*)(?= <br>)
But this returns the whole:
<br> <strong>Name:</strong> Wanted data <br> <strong>Telefon:</strong> <a dir='ltr' href='tel:XXXXXXXXX' x-apple-data-detectors='true' x-apple-data-detectors-type='telephone' x-apple-data-detectors-result='3'>XXXXXXXXX</a>
I thought I could solve this with a repeated group:
((:?(?<=<\/strong> )(.*)(?= <br>))+)
But this returns the same output as without the repeated group.
I know I could build a for { } loop around this regex to get the same result, but since this is the only regular expression that needs it (and a loop would mean changing the handling for all the other data), I was wondering whether it is possible to do this within the regular expression itself.
Thank you for the support so far.

Regex is the wrong tool for parsing markup. You have a proper XML parsing tool, XPath, in hand. Finish the job with it:
This XPath,
strong[.='Name:']/following-sibling::text()[1]
when appended to your original XPath,
//body/div/table/tbody/tr/td/p[5]/strong[.='Name:']/following-sibling::text()[1]
will finish the job of selecting the text node immediately following the <strong>Name:</strong> label, as requested, with no regex hacks over markup required.
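In C#, the combined XPath could be applied roughly as follows. This is only a minimal sketch: it assumes the fragment is well-formed XML (note the self-closing <br/>), which real e-mail HTML often is not, and the names NameLookup and ExtractName are made up for illustration. The path is shortened to /p because the sample contains only that one paragraph.

```csharp
using System;
using System.Xml;

static class NameLookup
{
    // Hypothetical helper: returns the trimmed text node that follows
    // <strong>Name:</strong>, or null when no such label exists.
    public static string ExtractName(string xhtml)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xhtml); // throws XmlException on malformed input

        XmlNode node = doc.SelectSingleNode(
            "/p/strong[.='Name:']/following-sibling::text()[1]");
        return node == null ? null : node.Value.Trim();
    }

    static void Main()
    {
        // Assumed sample: the question's fragment, rewritten as well-formed XML.
        string sample =
            "<p><strong>Kontaktdaten des Absenders:</strong><br/>" +
            "<strong>Name:</strong> Wanted data<br/>" +
            "<strong>Telefon:</strong><a href='tel:XXXXXXXXX'>XXXXXXXXX</a><br/></p>";
        Console.WriteLine(ExtractName(sample)); // prints "Wanted data"
    }
}
```

For markup that is not well-formed, an HTML parser that exposes XPath would be the safer choice than XmlDocument.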

You can try to match everything but tag markers:
(?<=<\/strong> )([^<>]*)(?= <br>)
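A quick way to check this pattern from C# might look like the sketch below. It assumes the markup has been flattened to one line with single spaces between tags, as in the output shown in the question; LabelRegex and FirstValue are illustrative names only.

```csharp
using System;
using System.Text.RegularExpressions;

static class LabelRegex
{
    // Pattern from the answer: text after "</strong> " and before " <br>",
    // with no angle brackets allowed in between.
    static readonly Regex Value =
        new Regex(@"(?<=<\/strong> )([^<>]*)(?= <br>)");

    // Returns the first matching value, or null when nothing matches.
    public static string FirstValue(string markup)
    {
        Match m = Value.Match(markup);
        return m.Success ? m.Value : null;
    }

    static void Main()
    {
        string sample =
            "<strong>Kontaktdaten des Absenders:</strong> <br> " +
            "<strong>Name:</strong> Wanted data <br> " +
            "<strong>Telefon:</strong> <a href='tel:XXXXXXXXX'>XXXXXXXXX</a> <br>";
        Console.WriteLine(FirstValue(sample)); // prints "Wanted data"
    }
}
```

Because [^<>]* cannot cross a tag boundary, the match can no longer run on to the last <br> the way (.*) did.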

Related

Using regex to capture everything except a certain (possibly repeated) pattern

I am trying to capture all of a string minus any occurrences of <span class="notranslate">*any text*</span>. (I do NOT need to parse HTML or anything; I just need to ignore those whole sections. The tags must match exactly to be removed, because I want to keep other tags.) In a given string there would be at least one such tag, with no upper limit (though more than a couple would be uncommon).
My ultimate goal is to match two texts, one where there are variable names and one where the variable names have been replaced with their values (I can't replace the variables myself, as I don't have access to that db). These variables will always be surrounded by the span tags I mentioned. I know my tags say "notranslate", but this is pre-translation, so all of the other text will be exactly the same.
For example, if these are my two input texts:
Dear <span class="notranslate">$customer</span>, I am sorry that you
are having trouble logging in. Please follow the instructions at this
URL <span class="notranslate">$article431</span> and let me know if
that fixes your problem.
Dear <span class="notranslate">John Doe</span>, I am sorry that you
are having trouble logging in. Please follow the instructions at this
URL <span class="notranslate">http://url.for.help/article</span> and
let me know if that fixes your problem.
I want the regex to return:
Dear , I am sorry that you are having trouble logging in. Please follow the instructions at this URL and let me know if that fixes your problem.
OR
Dear <span class="notranslate"></span>, I am sorry that you are having trouble logging in. Please follow the instructions at this URL <span class="notranslate"></span> and let me know if that fixes your problem.
I want this for both of them, so that I can easily do String.Equals() and find out whether they are equal. (I will need to compare the input with variables against multiple texts where the variables have been replaced, to find the match.)
I was easily able to come up with a regex that tells me whether a string has any "notranslate" sections in it: (<span class="notranslate">(.+?)</span>), which is how I decide whether I need to strip out sections before comparison. However, I'm having a lot of trouble with the (I thought very similar) task above.
I am using Expresso and regexstorm.net to test, and have played with many variations of (?:(.+?)(?:<span class=\"notranslate\">(?:.+?)</span>)), using ideas from other SO questions, but with all of them I get problems that I don't understand. For example, that one seems to almost work in Expresso, but it can't grab the end text after the last set of span tags; when I make the span tags optional or try to add another (.+?) at the end, it won't grab anything at all. I have tried using lookaheads, but then I still end up grabbing the tags and their internal text later.
This captures everything and then discards the matched tag sections, keeping only the surrounding text.
using System.Linq;
using System.Text.RegularExpressions;

string data = "Dear <span class=\"notranslate\">$customer</span>, I am sorry that you\r\n are havin" +
"g trouble logging in. Please follow the instructions at this\r\n URL <span class=" +
"\"notranslate\">$article431</span> and let me know if\r\n that fixes your problem.";
// "Words" is the plain text; "Ignore" is an optional tag pair with its content.
string pattern = @"(?<Words>[^<]+)(?<Ignore><[^>]+>[^>]+>)?";
// Keep only the "Words" group of each match and join the pieces back together.
string result = Regex.Matches(data, pattern)
    .OfType<Match>()
    .Select(mt => mt.Groups["Words"].Value)
    .Aggregate((sentence, words) => sentence + words);
The result is a string that still contains the original carriage returns and line feeds from your example:
Dear , I am sorry that you
are having trouble logging in. Please follow the instructions at this
URL and let me know if
that fixes your problem.
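Since the end goal in the question is an equality check, a different route is to blank out the contents of every notranslate span with Regex.Replace and compare the results. The sketch below is one possible way to do that, not the approach from the answer above. It assumes span contents never contain a nested </span>, and TemplateComparer, Blank and SameTemplate are made-up names.

```csharp
using System;
using System.Text.RegularExpressions;

static class TemplateComparer
{
    const string EmptySpan = "<span class=\"notranslate\"></span>";

    // Lazy .*? stops at the first closing </span>; Singleline lets the
    // span content run across line breaks.
    static readonly Regex NoTranslate = new Regex(
        "<span class=\"notranslate\">.*?</span>", RegexOptions.Singleline);

    // Replaces every notranslate span with an empty one.
    public static string Blank(string text) =>
        NoTranslate.Replace(text, EmptySpan);

    // True when both texts are identical outside their notranslate spans.
    public static bool SameTemplate(string a, string b) =>
        string.Equals(Blank(a), Blank(b), StringComparison.Ordinal);

    static void Main()
    {
        string template =
            "Dear <span class=\"notranslate\">$customer</span>, see " +
            "<span class=\"notranslate\">$article431</span>.";
        string filled =
            "Dear <span class=\"notranslate\">John Doe</span>, see " +
            "<span class=\"notranslate\">http://url.for.help/article</span>.";
        Console.WriteLine(SameTemplate(template, filled)); // prints "True"
    }
}
```

This corresponds to the second output format listed in the question, where the empty span tags are kept in place.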

Get value between unknown string

I'm trying to pull out a string between two other strings. To make it more complicated, the preceding contents will often differ.
The string I'm trying to retrieve is Christchurch.
The regex I have so far is (?<=300px">).*(?=</td). It pulls out the string I'm looking for, but it also returns dozens of other strings throughout the LARGE text file I'm searching.
What I'd like to do is limit the prefix so that it starts searching from Office: and runs all the way to 300px">, but the contents between those two strings will sometimes differ depending on user preferences.
To put it in crude non-regex terms, I want to do the following: starting at Office:, go all the way to 300px">, then find the string that starts there and ends with </td. This should result in Christchurch.
Have you considered using the HtmlAgilityPack instead? It's a NuGet package for handling HTML which is able to handle malformed HTML pretty well. Most on Stack Overflow would recommend against using Regex for HTML - see here: RegEx match open tags except XHTML self-contained tags
Here's how you'd do it for your example:
using System;
using HtmlAgilityPack; // This is a NuGet package!

var html = @"<tr >
<td align=""right"" valign=""top""><strong>Office:</strong> </td>
<td align=""left"" class=""stippel"" style=""white-space: wrap;max-width:300px"">Christchurch </td>
</tr>";
var htmlDoc = new HtmlDocument();
htmlDoc.LoadHtml(html);
// XPath queries are run against the document's root node.
var node = htmlDoc.DocumentNode.SelectSingleNode("//td[@class='stippel']");
Console.WriteLine(node.InnerHtml);
I haven't tested this code but it should do what you need.
I guess you need something like this:
office.*\n.*|(?<=300px">).*(?=<\/td)
The issue you're encountering is that * is greedy. Use the lazy/reluctant version *?.
Office:[\s\S]*?300px">(.*?)</td
This solution uses a group match rather than look-arounds.
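In C#, the group-based pattern could be used along these lines. The sketch assumes the two table cells quoted in the other answer as input; OfficeLookup and Extract are illustrative names.

```csharp
using System;
using System.Text.RegularExpressions;

static class OfficeLookup
{
    // Lazy [\s\S]*? skips as little as possible (including newlines)
    // between "Office:" and the first 300px"> that follows it.
    static readonly Regex Office =
        new Regex(@"Office:[\s\S]*?300px"">(.*?)</td");

    // Returns the trimmed cell text, or null when the pattern does not match.
    public static string Extract(string html)
    {
        Match m = Office.Match(html);
        return m.Success ? m.Groups[1].Value.Trim() : null;
    }

    static void Main()
    {
        string html =
            "<td align=\"right\" valign=\"top\"><strong>Office:</strong> </td>\n" +
            "<td align=\"left\" class=\"stippel\" " +
            "style=\"white-space: wrap;max-width:300px\">Christchurch </td>";
        Console.WriteLine(Extract(html)); // prints "Christchurch"
    }
}
```

The wanted text is read from capture group 1 rather than from the whole match, which is why no look-arounds are needed.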
Thanks to the posts from adamdc78 and greg I have been able to come up with the regex below. This is exactly what I needed.
Thanks for your help.
(?<=office.*\n.*300px">).*(?=<\/td)

RegEx - HTML between two values

I am looking to get the html that is included between the following text:
<ul type="square">
</ul>
What's the most efficient way?
I always use XPath to do things like that.
Use an XPath that will extract the node and then you can fetch the InnerHTML from that node. Very clean, and the right tool for the job.
Additional details: The HAP Explorer is a nice tool for getting the XPath you need. Copy/paste the HTML into HAP Explorer, navigate to the node of interest, copy/paste the XPath for that node. Put that XPath string in a string resource, fetch it at runtime, apply it to the HTML document to extract the node, fetch the desired information from the node.
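As a rough illustration of the XPath route, here is a sketch that uses the framework's own XmlDocument instead of HtmlAgilityPack, so it only works when the markup is well-formed XML. The sample markup and the names ListExtractor and InnerOfSquareList are assumptions made for the example.

```csharp
using System;
using System.Xml;

static class ListExtractor
{
    // Returns the inner markup of the first <ul type="square">,
    // or null when the document contains no such list.
    public static string InnerOfSquareList(string xhtml)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xhtml); // throws XmlException on malformed input

        XmlNode ul = doc.SelectSingleNode("//ul[@type='square']");
        return ul == null ? null : ul.InnerXml;
    }

    static void Main()
    {
        string sample =
            "<div><ul type=\"square\"><li>One</li><li>Two</li></ul></div>";
        Console.WriteLine(InnerOfSquareList(sample));
        // prints "<li>One</li><li>Two</li>"
    }
}
```

With HtmlAgilityPack the same XPath expression would be applied to the parsed HTML document instead, which also copes with markup that is not valid XML.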
If you really want one:
#<ul type="square">(.*?)</ul>#im
I agree that an HTML parser is the correct way to solve this problem. But, to humor you and answer your original question purely for academic interest, I propose this:
/<[Uu][Ll] +type=("square"|square) *>((.*?(<ul[^>]*>.*</ul>)?)*)<\/[Uu][Ll]>/s
I'm sure there are cases where this will fail, but I can't think of any, so please suggest more.
Let me restate that I don't recommend you use this in your project. I am merely doing this out of academic interest, and as a demonstration of WHY a regex that parses html is bad and complicated.
Regular expressions should not be used to parse HTML!
This will definitely not work:
<ul type="square">(.*)</ul>

Regex Help (again)

I don't really know what to entitle this, but I need some help with regular expressions. Firstly, I want to clarify that I'm not trying to match HTML or XML, although it may look like it, it's not. The things below are part of a file format I use for a program I made to specify which details should be exported in that program. There is no hierarchy involved, just that each new line contains a 'tag':
<n>
This is matched with my program to find an enumeration, which tells my program to export the name value, anyway, I also have tags like this:
<adr:home>
This specifies the home address. I use the following regex:
<((?'TAG'.*):(?'SUBTAG'.*)?)?(\s+((\w+)=('|"")?(?'VALUE'.*[^'])('|"")?)?)?>
The problem is that the regex will split the adr:home tag fine, but fails to find the n tag because it lacks a colon. When I add a ? or a *, it then no longer splits adr:home and similar tags. Can anyone help? I'm sure it's something simple; this is just my first time creating a regular expression. I'm working in C#, by the way.
Will this help
<((?'TAG'.*?)(?::(?'SUBTAG'.*))?)?(\s+((\w+)=('|"")?(?'VALUE'.*[^'])('|"")?)?)?>
I've wrapped the colon and the SUBTAG capture in an optional non-capturing group and made the TAG capture non-greedy.
Not entirely sure what your aim is but try this:
(?><)(?'TAG'[^:\s>]*)(:(?'SUBTAG'[^\s>:]*))?(\s\w+=['"](?'VALUE'[^'"]*)['"])?(?>>)
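To see how this pattern treats both kinds of tag, a small C# check could look like the following sketch. The quotes inside the character classes are doubled because the pattern sits in a verbatim string; TagParser and Describe are made-up names.

```csharp
using System;
using System.Text.RegularExpressions;

static class TagParser
{
    // Pattern from the answer above, written as a C# verbatim string.
    static readonly Regex Tag = new Regex(
        @"(?><)(?'TAG'[^:\s>]*)(:(?'SUBTAG'[^\s>:]*))?(\s\w+=['""](?'VALUE'[^'""]*)['""])?(?>>)");

    // Returns "tag / subtag", with "(none)" when the tag has no subtag,
    // or null when the line is not a tag at all.
    public static string Describe(string line)
    {
        Match m = Tag.Match(line);
        if (!m.Success) return null;

        Group sub = m.Groups["SUBTAG"];
        return m.Groups["TAG"].Value + " / " + (sub.Success ? sub.Value : "(none)");
    }

    static void Main()
    {
        Console.WriteLine(Describe("<n>"));        // prints "n / (none)"
        Console.WriteLine(Describe("<adr:home>")); // prints "adr / home"
    }
}
```

Because TAG is limited to characters other than colon, whitespace and >, it can no longer swallow the subtag the way a greedy .* does.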
I find this site extremely useful for testing C# regex expressions.
What if you put the colon as part of the second tag?
<((?'TAG'.*)(?':SUBTAG'.*)?)?(\s+((\w+)=('|"")?(?'VALUE'.*[^'])('|"")?)?)?>

How to find all tags from string using RegEx using C#.net?

I want to find all HTML tags in the input string and remove them or replace them with some text.
Suppose that I have this string:
INPUT=>
<img align="right" src="http://www.groupon.com/images/site_images/0623/2541/Ten-Restaurant-Group_IL-Giardino-Ristorante2.jpg" /><p>Although Italians originally invented pasta as a fastener to keep Sicily from floating away, Il Giardino Ristorante in Newport Beach.</p>
OUTPUT=>
string strSrc="http://www.groupon.com/images/site_images/0623/2541/Ten-Restaurant-Group_IL-Giardino-Ristorante2.jpg";
<p>Although Italians originally invented pasta as a fastener to keep Sicily from floating away, http://www.tenrestaurantgroup.com in Newport Beach.</p>
From the above string:
if an <IMG> tag is found, I want to get the SRC of the tag;
if an <A> tag is found, I want to get the HREF from the tag;
all other tags should stay as they are.
How can I achieve this using Regex in C#.NET?
You really, really shouldn't use regex for this. In fact, parsing HTML cannot be done perfectly with regex. Have you considered using an XML parser or HTML DOM library?
You can use HtmlAgilityPack for parsing (valid/non valid) html and get what you want.
I agree with Justin: Regex really isn't the best way to do this, and the HTML Agility Pack is well worth a look if this is something you will need to be doing a lot of.
With that said, the expression below will store attributes in a group, from which you should be able to pull them into your text while ignoring the rest of the element:
</?([^ >]+)( [^=]+?="(.+?)")*>
Hope this helps.
