I want to extract the HTML contained between the following tags:
<ul type="square">
</ul>
What's the most efficient way?
I always use XPath to do things like that.
Use an XPath that will extract the node and then you can fetch the InnerHTML from that node. Very clean, and the right tool for the job.
Additional details: The HAP Explorer is a nice tool for getting the XPath you need. Copy/paste the HTML into HAP Explorer, navigate to the node of interest, copy/paste the XPath for that node. Put that XPath string in a string resource, fetch it at runtime, apply it to the HTML document to extract the node, fetch the desired information from the node.
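For illustration, here is a minimal sketch of the same idea using Python's standard library rather than HAP. It assumes the markup is well-formed enough for an XML parser (real-world HTML usually needs a tolerant HTML parser instead), and both the sample document and the inner_markup helper are made up for this example.

```python
import xml.etree.ElementTree as ET


def inner_markup(node):
    """Serialize everything inside `node` (hypothetical helper, akin to InnerHtml)."""
    parts = [node.text or ""]
    for child in node:
        # tostring() includes the child's own markup plus any text that follows it
        parts.append(ET.tostring(child, encoding="unicode"))
    return "".join(parts)


# Invented sample; assumed to be well-formed XML
html = (
    '<div><ul type="square"><li>one</li>'
    "<li>two <ul><li>nested</li></ul></li></ul></div>"
)

root = ET.fromstring(html)
# ElementTree supports a limited XPath subset, including attribute predicates
node = root.find(".//ul[@type='square']")
result = inner_markup(node)
print(result)
```

Note that the nested list comes back intact, which is exactly the case the regex attempts below struggle with.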
If you really want one:
#<ul type="square">(.*?)</ul>#im
I agree that an HTML parser is the correct way to solve this problem. But, to humor you and answer your original question purely for academic interest, I propose this:
/<[Uu][Ll] +type=("square"|square) *>((.*?(<ul[^>]*>.*</ul>)?)*)<\/[Uu][Ll]>/s
I'm sure there are cases where this will fail, but I can't think of any, so please suggest some.
Let me restate that I don't recommend you use this in your project. I am merely doing this out of academic interest, and as a demonstration of WHY a regex that parses html is bad and complicated.
Regular expressions should not be used to parse HTML!
This will definitely not work:
<ul type="square">(.*)</ul>
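To see why, here is a small Python sketch (the sample markup is invented) that runs the greedy pattern and the non-greedy variant against a page with two lists, and then the non-greedy variant against a nested list:

```python
import re

two_lists = (
    '<ul type="square"><li>a</li></ul><p>x</p>'
    '<ul type="square"><li>b</li></ul>'
)
nested = '<ul type="square"><li>a<ul><li>n</li></ul></li></ul>'

# Greedy: .* runs to the LAST </ul>, swallowing everything between the two lists
greedy = re.findall(r'<ul type="square">(.*)</ul>', two_lists, re.S)

# Non-greedy: stops at the FIRST </ul>, which is fine for flat lists...
lazy = re.findall(r'<ul type="square">(.*?)</ul>', two_lists, re.S)

# ...but cuts a nested list short at the inner closing tag
lazy_nested = re.findall(r'<ul type="square">(.*?)</ul>', nested, re.S)

print(greedy)
print(lazy)
print(lazy_nested)
```

Neither quantifier can count matching open and close tags, which is the underlying reason a parser is the better fit.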
I have been searching for a solution to my problem for a while and have been experimenting on regex101.com, but I cannot find one.
The problem I am facing is that I have to select a substring from several different inputs, so I wanted to use regular expressions to get the wanted data out of these strings.
The regular expression will come from a configuration entry for each string separately (since they differ).
The string below is obtained with an XPath, //body/div/table/tbody/tr/td/p[5], but I cannot dig any deeper with it to retrieve the right data. Or can I?
The string I am using at the moment as example is the following:
<strong>Kontaktdaten des Absenders:</strong>
<br>
<strong>Name:</strong> Wanted data
<br>
<strong>Telefon:</strong>
<a dir='ltr' href='tel:XXXXXXXXX' x-apple-data-detectors='true' x-apple-data-detectors-type='telephone' x-apple-data-detectors-result='3'>XXXXXXXXX</a>
<br>
From this string I am trying to get the "Wanted data"
My regular expression so far is the following:
(?<=<\/strong> )(.*)(?= <br>)
But this returns the whole:
<br> <strong>Name:</strong> Wanted data <br> <strong>Telefon:</strong> <a dir='ltr' href='tel:XXXXXXXXX' x-apple-data-detectors='true' x-apple-data-detectors-type='telephone' x-apple-data-detectors-result='3'>XXXXXXXXX</a>
I thought I could solve this with a repeat group
((:?(?<=<\/strong> )(.*)(?= <br>))+)
But this returns the same output as without the repeat group.
I know I could build a for loop around this regex to get the same output, but since this is the only regular expression that needs it (and a loop would mean changing the handling for all the other data), I was wondering whether it can be done within the regular expression itself.
Thank you for the support so far.
Regex is the wrong tool for parsing markup. You have a proper XML parsing tool, XPath, in hand. Finish the job with it:
This XPath,
strong[.='Name:']/following-sibling::text()[1]
when appended to your original XPath,
//body/div/table/tbody/tr/td/p[5]/strong[.='Name:']/following-sibling::text()[1]
will finish the job of selecting the text node immediately following the <strong>Name:</strong> label, as requested, with no regex hacks over markup required.
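As a rough cross-check in Python's standard library: ElementTree has no following-sibling axis, but it exposes the text immediately after an element as that element's .tail, which plays the same role here. This sketch assumes Python 3.7+ (for the [.='text'] predicate) and a fragment made well-formed by writing <br/> instead of <br>; the shortened fragment is adapted from the question.

```python
import xml.etree.ElementTree as ET

# Fragment adapted from the question; <br> is written as <br/> so that the
# XML parser accepts it (an assumption, real mail HTML may need an HTML parser)
fragment = (
    "<p><strong>Kontaktdaten des Absenders:</strong><br/>"
    "<strong>Name:</strong> Wanted data<br/>"
    "<strong>Telefon:</strong><a dir='ltr' href='tel:XXXXXXXXX'>XXXXXXXXX</a><br/></p>"
)

p = ET.fromstring(fragment)

# Same predicate as in the XPath above: the <strong> whose text is 'Name:'
label = p.find("strong[.='Name:']")

# .tail is the text node that directly follows the element
value = (label.tail or "").strip()
print(value)
```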
You can try to match everything but tag markers:
(?<=<\/strong> )([^<>]*)(?= <br>)
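A quick Python check of this pattern against a single-line version of the sample string (shortened here, and assuming the lines are joined by single spaces as in the output shown in the question):

```python
import re

# Single-line sample, shortened from the question
text = (
    "<strong>Kontaktdaten des Absenders:</strong> <br> "
    "<strong>Name:</strong> Wanted data <br> "
    "<strong>Telefon:</strong> <a dir='ltr' href='tel:XXXXXXXXX'>XXXXXXXXX</a> <br>"
)

# [^<>]* cannot cross a tag boundary, so the match stays within one text run
pattern = re.compile(r"(?<=</strong> )([^<>]*)(?= <br>)")
matches = pattern.findall(text)
print(matches)
```

It still depends on the exact spacing around the tags, so it is more fragile than the XPath approach.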
I am using HtmlAgilityPack. I create an HtmlDocument and LoadHtml with the following string:
<select id="foo_Bar" name="foo.Bar"><option selected="selected" value="1">One</option><option value="2">Two</option></select>
This does some unexpected things. First, it gives two parser errors, EndTagNotRequired. Second, the select node has 4 children - two for the option tags and two more for the inner text of the option tags. Last, the OuterHtml is like this:
<select id="foo_Bar" name="foo.Bar"><option selected="selected" value="1">One<option value="2">Two</select>
So basically it is deciding for me to drop the closing tags on the options. Let's leave aside for a moment whether it is proper and desirable to do that. I am using HtmlAgilityPack to test HTML generation code, so I don't want it to make any decision for me or give any errors unless the HTML is truly malformed. Is there some way to make it behave how I want? I tried setting some of the options for HtmlDocument, specifically:
doc.OptionAutoCloseOnEnd = false;
doc.OptionCheckSyntax = false;
doc.OptionFixNestedTags = false;
This is not working. If HtmlAgilityPack cannot do what I want, can you recommend something that can?
The exact same error is reported on the HAP home page's discussion, but it looks like no meaningful fixes have been made to the project in a few years. Not encouraging.
A quick browse of the source suggests the error might be fixable by commenting out line 92 of HtmlNode.cs:
// they sometimes contain, and sometimes they don't...
ElementsFlags.Add("option", HtmlElementFlag.Empty);
(Actually no, they always contain label text, although a blank string would also be valid text. A careless author might omit the end-tag, but then that's true of any element.)
Addendum: an equivalent solution is to call HtmlNode.ElementsFlags.Remove("option"); before any use of the library, which avoids modifying the library source code.
It seems that there is some reason not to parse the Option tag as a "generic" tag, for XHTML compliance, however this can be a real pain in the neck.
My suggestion is to do a whole-string-replace and change all "option" tags to "my_option" tags, that way you:
Don't have to modify the source of the library (and can upgrade it later).
Can parse as you usually would.
The original post on HtmlAgilityPack forum can be found at:
http://htmlagilitypack.codeplex.com/Thread/View.aspx?ThreadId=14982
I have an XML string, and the goal is to replace an XML element's value with a fixed string, i.e. replace blah blah blah with a fixed value. I am thinking of using Regex.Replace instead of loading the string into a DOM and replacing it there.
Could anyone please help me write this regular expression? Essentially, the goal is to match everything inside the element tag 'abc'.
Thanks a lot!
This article tells you what you need to know: XML is not Regular
Ignoring the most obvious solution to their problem (which would be to use a pre-existing XML parser), they think they should use regular expressions (regex for short). Now they have two problems.
Use regular expressions only on regular languages.
That said, there are many sites that purport to offer guidance on writing regular expressions for XML. They are all wrong. But they exist, and you can use them at your own risk.
For what it's worth, don't.
Process the XML normally, with XmlDocument, System.Xml.Linq, or XmlReader/XmlWriter. That is what they are for: they cover all kinds of edge cases we couldn't even imagine and, above all, are proven to work.
Don't use a regex for this, please . . . just don't.
My two cents.
let the downvoting begin
Regular expressions are meant to be used on regular languages. XML is a non-regular language. As such, regular expressions cannot be used to properly parse anything written in it. You will need to use a real XML parser, which can be found in the numerous libraries available in C#, to do it.
Regular expressions are not suitable for processing markup. Among other flaws, they won't work if elements can be nested:
<abc> ... <abc> ... </abc> ... </abc>
They are also unable to distinguish a comment from a non-comment.
You need a real XML parser.
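As a minimal sketch of the parser route, written in Python's standard library rather than the C# classes mentioned in this thread: the sample document, the root and other element names, and the replacement string are all invented, and only the 'abc' element name comes from the question.

```python
import xml.etree.ElementTree as ET

# Invented sample, including the nested case that defeats a regex
xml = "<root><abc>blah <abc>inner</abc> blah</abc><other>keep</other></root>"

root = ET.fromstring(xml)

# Snapshot the matches first, because the loop body modifies the tree
for element in list(root.iter("abc")):
    for child in list(element):
        element.remove(child)      # drop nested elements and their trailing text
    element.text = "fixed value"   # replace the whole content

result = ET.tostring(root, encoding="unicode")
print(result)
```

The nested inner element is simply removed along with the rest of the old content, and comments or CDATA sections inside it need no special handling.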
I want to find all HTML tags in the input string and remove them or replace them with some text.
Suppose I have this string:
INPUT=>
<img align="right" src="http://www.groupon.com/images/site_images/0623/2541/Ten-Restaurant-Group_IL-Giardino-Ristorante2.jpg" /><p>Although Italians originally invented pasta as a fastener to keep Sicily from floating away, Il Giardino Ristorante in Newport Beach.</p>
OUTPUT=>
string strSrc="http://www.groupon.com/images/site_images/0623/2541/Ten-Restaurant-Group_IL-Giardino-Ristorante2.jpg";
<p>Although Italians originally invented pasta as a fastener to keep Sicily from floating away, http://www.tenrestaurantgroup.com in Newport Beach.</p>
From the string above:
if an <IMG> tag is found, I want to get the SRC of the tag;
if an <A> tag is found, I want to get the HREF of the tag;
all other tags should be left as they are.
How can I achieve this using Regex in C#.NET?
You really, really shouldn't use regex for this. In fact, parsing HTML cannot be done perfectly with regex. Have you considered using an XML parser or HTML DOM library?
You can use HtmlAgilityPack to parse HTML (valid or not) and get what you want.
I agree with Justin: regex really isn't the best way to do this, and the HTML Agility Pack is well worth a look if this is something you will need to do a lot.
With that said, the expression below will store attributes in a group, from which you should be able to pull them into your text while ignoring the rest of the element:
</?([^ >]+)( [^=]+?="(.+?)")*>
Hope this helps.
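For comparison, here is a rough sketch of the parser-based route recommended earlier in this thread, using Python's built-in html.parser instead of HtmlAgilityPack. The LinkCollector class and the sample markup with example.com URLs are invented for illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects src values of <img> tags and href values of <a> tags."""

    def __init__(self):
        super().__init__()
        self.sources = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased; self-closing tags
        # such as <img ... /> are routed here as well by default
        attributes = dict(attrs)
        if tag == "img" and attributes.get("src"):
            self.sources.append(attributes["src"])
        elif tag == "a" and attributes.get("href"):
            self.links.append(attributes["href"])


# Invented sample markup
html = (
    '<img align="right" src="http://example.com/pic.jpg" />'
    '<p>Visit <a href="http://example.com/">Example</a> in Newport Beach.</p>'
)

collector = LinkCollector()
collector.feed(html)
print(collector.sources, collector.links)
```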
I am looking for a regular expression to find all input fields of type hidden in HTML output. Does anyone know an expression that does this?
I agree with the link Radomir suggested: HTML should not be parsed with regular expressions. However, I do not agree that nothing meaningful can be gleaned from using them together. And the ensuing rant is totally counter-productive.
To correct Robert's RegEx:
<([^<]*)type=('|")hidden('|")>[^<]*(/>|</.+?>)
I know you asked for regular expression, but download Html Agility Pack and do the following:
var inputs = htmlDoc.DocumentNode.Descendants("input");
foreach (var input in inputs)
{
    // GetAttributeValue returns the default ("") instead of throwing when "type" is missing
    if (input.GetAttributeValue("type", "") == "hidden")
    {
        // do something
    }
}
You can also use XPath with Html Agility Pack.
Regular expressions are generally the wrong tool for the job when trying to search or manipulate HTML or XML; a parsing library would likely be a much cleaner and easier solution.
That said, if you're just looking through a big file and accuracy isn't critical, you can probably do reasonably well with something like <input[^>]*type="?hidden"?.
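For a sense of what the parsing-library route looks like outside .NET, here is a minimal sketch with Python's built-in html.parser. The HiddenInputCollector class and the sample form are invented for this example.

```python
from html.parser import HTMLParser


class HiddenInputCollector(HTMLParser):
    """Collects the attributes of every <input type="hidden"> element."""

    def __init__(self):
        super().__init__()
        self.hidden = []

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attributes = dict(attrs)
        # Attribute names are lowercased by the parser, values are not,
        # so compare the type case-insensitively
        if (attributes.get("type") or "").lower() == "hidden":
            self.hidden.append(attributes)


# Invented sample: different quoting, attribute order and letter case
html = (
    '<form><input type="hidden" name="token" value="abc">'
    '<input type="text" name="q">'
    "<input name=\"id\" type='HIDDEN' value=\"7\"/></form>"
)

collector = HiddenInputCollector()
collector.feed(html)
names = [attributes.get("name") for attributes in collector.hidden]
print(names)
```

Quoting style, attribute order and letter case are handled by the parser, which are the details the regular expressions above have to special-case.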