Regular Expression to Extract HTML Body Content - c#

I am looking for a regex that will let me extract just the HTML content between the body tags of an XHTML document.
The XHTML files that I need to parse will be very simple; I do not have to worry about JavaScript content or <![CDATA[ tags, for example.
Below is the expected structure of the HTML file that I have to parse. Since I know exactly what the HTML files I am going to work with contain, this snippet pretty much covers my entire use case. If I can get a regex to extract the body of this example, I'll be happy.
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>
</title>
</head>
<body contenteditable="true">
<p>
Example paragraph content
</p>
<p>
</p>
<p>
<br />
</p>
<h1>Header 1</h1>
</body>
</html>
Conceptually, I've been trying to build a regex string that matches everything BUT the inner body content. With this, I would use the C# Regex.Split() method to obtain the body content. I thought this regex:
((.|\n)*<body (.)*>)|((</body>(*|\n)*)
...would do the trick, but it doesn't seem to work at all with my test content in RegexBuddy.

Would this work?
((?:.(?!<body[^>]*>))+.<body[^>]*>)|(</body\>.+)
Of course, you need to add the necessary \s in order to take into account < body ...> (element with spaces), as in:
((?:.(?!<\s*body[^>]*>))+.<\s*body[^>]*>)|(<\s*/\s*body\s*\>.+)
On second thought, I am not sure why I needed a negative look-ahead... This should also work (for a well-formed XHTML document, with Singleline mode on so that . matches line breaks):
(.*<\s*body[^>]*>)|(<\s*/\s*body\s*\>.+)
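One thing to watch with the split approach: .NET's Regex.Split includes the text of any capturing groups in the returned array, so the parenthesised alternatives above put the matched head and tail back into the result. The sketch below shows the same effect in Python (used here only for illustration; Python's re.split has the same documented behaviour, although it also emits None for groups that did not participate, which .NET does not). The sample document and variable names are made up for the example.

```python
import re

# Hypothetical sample document, shaped like the one in the question.
doc = "<html><head></head><body class='x'>\n<p>text</p>\n</body>\n</html>"

# Pattern from the answer above (capturing groups) and a non-capturing variant.
capturing = r"(.*<\s*body[^>]*>)|(<\s*/\s*body\s*>.+)"
non_capturing = r"(?:.*<\s*body[^>]*>)|(?:<\s*/\s*body\s*>.+)"

# DOTALL is Python's counterpart of RegexOptions.Singleline.
with_groups = re.split(capturing, doc, flags=re.DOTALL)
without_groups = re.split(non_capturing, doc, flags=re.DOTALL)

# With non-capturing groups the pieces are: '', body content, ''.
body = without_groups[1]
```

With the capturing pattern, the element at index 1 is the captured head of the document rather than the body, so the index you need depends on how many groups the pattern has. Switching to (?:...) keeps the result predictable.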

XHTML would be more easily parsed with an XML parser than with a regex. I know it's not what you're asking, but an XML parser would be able to quickly navigate to the body node and give you back its content without any of the tag-matching problems that the regex is giving you.
EDIT:
In response to a comment here claiming that an XML parser is too slow:
There are two kinds of XML parser. A DOM parser is big and heavy but easy and friendly: it builds a tree out of the whole document before you can do anything. A SAX parser is fast and light but more work: it reads the file sequentially. You will want SAX to find the body tag.
The DOM method is good when you need the document for multiple purposes, such as pulling out tags and working out which element is whose child. The SAX parser reads across the file in order and will quickly get the information you are after. The regex won't be any faster than a SAX parser, because both simply walk across the file and pattern match, with the exception that the regex won't stop looking after it has found a body tag, because a regex has no built-in knowledge of XML. In fact, your SAX parser probably uses small pieces of regex to find each tag.
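To make the SAX idea concrete, here is a minimal sketch in Python using the standard library's xml.sax module (the thread is about C#, so treat this as an illustration of the event-driven style, not as the .NET API). The handler class and the body_text helper are names invented for the example, and it collects only the character data inside body, not the markup.

```python
import xml.sax


class BodyTextHandler(xml.sax.ContentHandler):
    """Collects character data that appears between <body> and </body>."""

    def __init__(self):
        super().__init__()
        self.in_body = False
        self.chunks = []

    def startElement(self, name, attrs):
        if name == "body":
            self.in_body = True

    def endElement(self, name):
        if name == "body":
            self.in_body = False

    def characters(self, content):
        # Called for each run of text; keep it only while inside <body>.
        if self.in_body:
            self.chunks.append(content)


def body_text(xhtml):
    """Return the concatenated text found inside the body element."""
    handler = BodyTextHandler()
    xml.sax.parseString(xhtml.encode("utf-8"), handler)
    return "".join(handler.chunks)
```

The parser pushes events at the handler as it reads, so nothing resembling a tree is ever built. Getting the inner markup back (rather than just the text) would mean re-serialising the start and end tags inside the handler.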

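import java.util.regex.Matcher;
import java.util.regex.Pattern;

String toMatch = "aaaaaaaaaaabcxx sldjfkvnlkfd <body>i m avinash</body>";
// DOTALL lets '.' match line breaks, which a real multi-line document needs
Pattern pattern = Pattern.compile(".*?<body.*?>(.*?)</body>.*?", Pattern.DOTALL);
Matcher matcher = pattern.matcher(toMatch);
if (matcher.matches()) {
    System.out.println(matcher.group(1)); // prints: i m avinash
}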

Why can't you just split it by
</{0,1}body[^>]*>
and take the second string? I believe it will be much faster than looking for a huge regexp.
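A minimal sketch of that split, written in Python for illustration (the same pattern should carry over to Regex.Split in C#). The function name is made up, and the length check is an added assumption so that a document without both body tags returns None instead of raising an error.

```python
import re


def body_by_split(document):
    """Split on the opening and closing body tags and return the middle piece."""
    parts = re.split(r"</?body[^>]*>", document)
    # A document with one <body ...> and one </body> yields exactly three pieces.
    return parts[1] if len(parts) >= 3 else None
```

For example, body_by_split("<html><body a='1'><p>x</p></body></html>") returns "<p>x</p>". No Singleline/DOTALL flag is needed here, because the pattern never has to match across a line break.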

/.*<body[^>]*>(.*)<\/body>.*/s
replace with
\1
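Instead of substituting, the capture group can also be read directly, which avoids having to match the text before and after the body at all. A small Python sketch of that variant (BODY_RE and inner_body are illustrative names; re.DOTALL plays the role of the /s flag):

```python
import re

# Greedy (.*) runs to the last </body>, which is what a single-body document needs.
BODY_RE = re.compile(r"<body[^>]*>(.*)</body>", re.DOTALL)


def inner_body(document):
    """Return the text captured between the body tags, or None if absent."""
    match = BODY_RE.search(document)
    return match.group(1) if match else None
```

In C# the equivalent would be a Regex.Match call with RegexOptions.Singleline followed by reading Groups[1].Value.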

Match the first body tag: <\s*body.*?>
Match the last body tag: <\s*/\s*body.*?>
(note: the \s* allows for stray whitespace inside the tags. Strictly speaking, XML only permits whitespace before the closing >, not right after < or </, so this is tolerance for sloppy input rather than valid markup.)
Combine them like this to get everything in between, including the body tags themselves: <\s*body.*?>.*?<\s*/\s*body.*?>. Make sure you use Singleline mode, which lets . match line breaks.
This works in VB.NET, and hopefully others too!

Related

Regex to extract pure text within specific HTML tag [duplicate]

This question already has answers here:
Regex select all text between tags
(23 answers)
Closed 2 years ago.
In this case, I am supposed to only use a single regex match.
See the following HTML code:
<html>
<body>
<p>This is some <strong>strong</strong> text</p>
</body>
</html>
I want to make a regex that can return This is some strong text. In this case, the text inside the <p> tag.
Overall, it should:
Match only text between two HTML tags.
Exclude HTML tags within the two tags, but keep the text inside those tags.
So far I know:
<p>(.*)<\/p> Will match the region from <p> to </p>
<[^>]*> Will match any HTML tag
The hard part for me is how to combine the two (maybe there is an even better way of doing it).
How would you write such regex?
How real software engineers solve this problem: Use the right tool for the right job, i.e. don't use regexes to parse HTML
The most straightforward way is to use an HTML parsing library, since parsing even purely conforming XML with regex is extremely non-trivial, and handling all HTML edge cases is an inhumanly difficult task.
If your requirements are "you must use a regex library to pull innerHTML from a <p> element", I'd much prefer to split it into two tasks:
1) using regex to pull out the container element with its innerHTML. (I'm showing an example that only works for getting the outermost element of a known tag. To extract an arbitrary nested item you'd have to use some trick like https://blogs.msdn.microsoft.com/bclteam/2005/03/15/net-regular-expressions-regex-and-balanced-matching-ryan-byington/ to match the balanced expression)
2) using a simple Regex.Replace to strip out all tag content
let html = @"<p>This is some <strong>strong</strong> text</p>
<p>This is some <b><em>really<strong>strong</strong><em></b> text</p>"
for m in Regex.Matches(html, @"<p>(.*?)</p>") do
    printfn "(%O)" (Regex.Replace(m.Groups.[1].Value, "<.*?>", ""))
(This is some strong text)
(This is some reallystrong text)
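For readers who do not use F#, the same two steps can be sketched in Python (the function name is made up for the example; the two patterns are the ones the question already lists):

```python
import re


def paragraph_texts(html):
    """Step 1: pull out each <p>...</p> body. Step 2: strip the tags inside it."""
    inners = re.findall(r"<p>(.*?)</p>", html, flags=re.DOTALL)
    return [re.sub(r"<[^>]*>", "", inner) for inner in inners]
```

Like the F# version, this only handles a plain <p> with no attributes and no nesting of the container tag inside itself.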
If you are constrained to a single "Regex.Matches" call, and you're okay with ignoring the possibility of nested <p> tags (as luck would have it, in conformant HTML you can't nest ps but this solution wouldn't work for a containing element like <div>) you should be able to do it with a nongreedy matching of a text part and a tag part wrapped up inside a <p>...</p> pattern. (Note 1: this is F#, but it should be trivial to convert to C#) (Note 2: This relies on .NET-flavored regex-isms like stackable group names and multiple captures per group)
let rx = @"
<p>
(?<p_text>
    (?:
        (?<text>[^<>]+)
        (?:<.*?>)+
    )*?
    (?<text>[^<>]+)?
)</p>
"
let regex = new Regex(rx, RegexOptions.IgnorePatternWhitespace)
for m in regex.Matches(@"
<p>This is some <strong>strong</strong> text</p>
<p>This is some <b><em>really<strong>strong</strong><em></b> text</p>
") do
    printfn "p content: %O" m
    for capture in m.Groups.["text"].Captures do
        printfn "text: %O" capture
p content: <p>This is some <strong>strong</strong> text</p>
text: This is some
text: strong
text: text
p content: <p>This is some <b><em>really<strong>strong</strong><em></b> text</p>
text: This is some
text: really
text: strong
text: text
Remember that neither of the above examples works well on malformed HTML or in cases where the same tag is nested in itself.
Following @Jimmy's answer, and going with the title of the post on how to "extract" the text, I thought I would include the C# code for the Regex.Replace.
This bit of code should work to extract the text:
string HTML = "<html><body><p>This is some <strong>strong</strong> text</p></body></html>";
Regex Reg = new Regex("<[^>]*>");
String parsedText = Reg.Replace(HTML, "").Trim();
MessageBox.Show(parsedText);
Obviously this does not match between the two tags exclusively (it would grab anything outside the paragraph tags as well), but I would suggest that the replace function is the best option in making only ONE match.
If you need to get only the content between the two tags, I think you would need to do that in two expressions, as @Jimmy suggested.
I would be very curious to see if anyone could get it all in one expression, but I'm guessing this is what they are looking for at your school.

AngleSharp and XHTML round-trip

I'm trying to parse an XHTML file using AngleSharp, make a change, then output it. However, I'm having some issues getting the output to match the input.
If I use the XML parser and either the XmlMarkupFormatter or the HtmlMarkupFormatter, I get no self-closing tags (all are <img></img>) and no XML declaration.
If I use the HTML parser and the HtmlMarkupFormatter, I get self-closing tags that are invalid XML (all are simply <img>) and no XML declaration.
If I use the HTML parser and the XmlMarkupFormatter, I get nice self-closing tags (<img />) and the XML declaration. However, the XML declaration is picked up as a comment and output as <!-- <?xml version="1.0" encoding="UTF-8"?> -->
Is there a way around this or do I need to write my own MarkupFormatter?
Simple answer: It sounds like you need to provide your own MarkupFormatter.
There has been some effort to come up with an XhtmlMarkupFormatter, but this component has unfortunately not been realized so far. I imagine such a component could combine the serialization logic of the existing HTML formatter and the available XML formatter.
Maybe this issue on the AngleSharp repo helps you.

XPath to first occurrence of element with text length >= 200 characters

How do I get the first element that has an inner text (plain text, discarding other children) of 200 or more characters in length?
I'm trying to create an HTML parser like Embed.ly and I've set up a system of fallbacks where I first check for og:description, then I would search for this occurrence and only then for the description meta tag.
This is because most sites that even include meta description describe their site in that tag, instead of the contents of the current page.
Example:
<html>
<body>
<div>some characters
<p>200 characters <span>some more stuff</span></p>
</div>
</body>
</html>
What selector could I use to get the 200 characters portion of that HTML fragment? I don't want the some more stuff either, I don't care what element it is (except for <script> or <style>), as long as it's the first plain text to contain at least 200 characters.
What should the XPath query look like?
Use:
(//*[not(self::script or self::style)]/text()[string-length() >= 200])[1]
Note: In case the document is an XHTML document (and that means all elements are in the XHTML namespace), the above expression should be specified as:
(//*[not(self::x:script or self::x:style)]/text()[string-length() >= 200])[1]
where the prefix "x:" must be bound to the XHTML namespace, "http://www.w3.org/1999/xhtml" (or, as many XPath APIs put it, the namespace must be "registered" with this prefix).
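If an XPath engine with string-length() support is not at hand, the same selection can be approximated by walking the tree by hand. The sketch below uses Python's xml.etree.ElementTree, whose XPath subset cannot express this query; all function names are invented for the example, and it assumes well-formed XML input. It returns the first text node, in document order, that has at least min_len characters and whose parent is not script or style.

```python
import xml.etree.ElementTree as ET

SKIPPED = {"script", "style"}


def local_name(tag):
    """Drop a '{namespace}' prefix so XHTML and plain tags compare equal."""
    return tag.rsplit("}", 1)[-1]


def text_nodes(elem):
    """Yield (parent element, text) pairs in document order."""
    if elem.text:
        yield elem, elem.text
    for child in elem:
        yield from text_nodes(child)
        # A child's tail is text that belongs to the current element.
        if child.tail:
            yield elem, child.tail


def first_long_text(xhtml, min_len=200):
    """Return the first qualifying text node, or None if there is none."""
    root = ET.fromstring(xhtml)
    for parent, text in text_nodes(root):
        if local_name(parent.tag) not in SKIPPED and len(text) >= min_len:
            return text
    return None
```

Because the comparison is on the local name, no namespace prefix has to be registered for XHTML input.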
I meant something like this:
root.SelectNodes("html/body/.//*[(name() !='script') and (name()!='style')]/text()[string-length() >= 200]")
Seems to work pretty well.
HTML is not XML. You should not use XML parsers to parse HTML, period. They are two different things entirely, and your parser will choke the first time it sees HTML that's not well-formed XML.
You should find an open-source HTML parser instead of rolling your own.

HTML Agility Pack Parsing With Upper & Lower Case Tags?

I am using the HTML Agility Pack to great effect, and am really impressed with it - However, I am selecting content like so
doc.DocumentNode.SelectSingleNode("//body").InnerHtml
How do I deal with the following situation, with different documents?
<body>
<Body>
<BODY>
Will my code above only get the lower case versions?
The Html Agility Pack handles HTML in a case-insensitive way. That means it will parse BODY, Body and body the same way. This is by design, since HTML is not case sensitive (XHTML is).
That said, when you use its XPath feature, you must write tag names in lower case. The "//body" expression will match BODY, Body and body, while "//BODY" will match nothing.

How to find all tags from string using RegEx using C#.net?

I want to find all HTML tags in the input string and remove them or replace them with some text.
Suppose that I have this string:
INPUT=>
<img align="right" src="http://www.groupon.com/images/site_images/0623/2541/Ten-Restaurant-Group_IL-Giardino-Ristorante2.jpg" /><p>Although Italians originally invented pasta as a fastener to keep Sicily from floating away, Il Giardino Ristorante in Newport Beach.</p>
OUTPUT=>
string strSrc="http://www.groupon.com/images/site_images/0623/2541/Ten-Restaurant-Group_IL-Giardino-Ristorante2.jpg";
<p>Although Italians originally invented pasta as a fastener to keep Sicily from floating away, http://www.tenrestaurantgroup.com in Newport Beach.</p>
From the above string:
if an <IMG> tag is found, I want to get the SRC of the tag,
if an <A> tag is found, I want to get the HREF of the tag,
and all other tags should stay as they are.
How can I achieve this using Regex in C#.net?
You really, really shouldn't use regex for this. In fact, parsing HTML cannot be done perfectly with regex. Have you considered using an XML parser or HTML DOM library?
You can use HtmlAgilityPack to parse (valid or invalid) HTML and get what you want.
I agree with Justin: regex really isn't the best way to do this, and the HTML Agility Pack is well worth a look if this is something you will need to be doing a lot of.
With that said, the expression below will store attributes in a group, from where you should be able to pull them into your text while ignoring the rest of the element:
</?([^ >]+)( [^=]+?="(.+?)")*>
Hope this helps.
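As an illustration of the parser-based route recommended above, here is a minimal sketch using Python's standard html.parser module (not the HTML Agility Pack, which is the .NET option; the class name and the example URLs are made up). It collects the src of every img tag and the href of every a tag as the parser encounters them.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Records img src and a href attribute values while parsing."""

    def __init__(self):
        super().__init__()
        self.img_srcs = []
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lower-cased; attrs is a list of (name, value) pairs.
        attr_map = dict(attrs)
        if tag == "img" and attr_map.get("src"):
            self.img_srcs.append(attr_map["src"])
        elif tag == "a" and attr_map.get("href"):
            self.hrefs.append(attr_map["href"])


collector = LinkCollector()
collector.feed(
    '<img align="right" src="http://example.com/pic.jpg" />'
    '<p>Visit <A HREF="http://example.com/">the site</A> today.</p>'
)
```

After feed returns, collector.img_srcs and collector.hrefs hold the extracted values. Self-closing tags such as <img ... /> are routed through handle_starttag by default, so they need no special case.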
