Get colored texts within HTML code - c#

I have some HTML and I want to convert it to plain text, but keep the tags of colored text only.
For example, given the HTML below:
<body>
This is a <b>sample</b> html text.
<p align="center" style="color:#ff9999">this is only a sample<p>
....
and some other tags...
</body>
</html>
I want the output:
this is a sample html text.
<#ff9999>this is only a sample<>
....
and some other tags...

I'd use an HTML parser such as HtmlAgilityPack, and a regular expression to find the color value in the attributes.
First, find all the nodes whose style attribute has a color defined in it, using XPath:
var doc = new HtmlDocument();
doc.LoadHtml(html);
var nodes = doc.DocumentNode
.SelectNodes("//*[contains(@style, 'color')]")
.ToArray();
Then the simplest regex to match a color value: (?<=color:\s*)#?\w+.
var colorRegex = new Regex(@"(?<=color:\s*)#?\w+", RegexOptions.IgnoreCase);
Then iterate through these nodes and if there is a regex match, replace the inner html of the node with html encoded tags (you'll understand why a little bit later):
foreach (var node in nodes)
{
    var style = node.Attributes["style"].Value;
    if (colorRegex.IsMatch(style))
    {
        var color = colorRegex.Match(style).Value;
        node.InnerHtml =
            HttpUtility.HtmlEncode("<" + color + ">") +
            node.InnerHtml +
            HttpUtility.HtmlEncode("</" + color + ">");
    }
}
And finally get the inner text of the document and perform html decoding on it (this is because inner text strips all the tags):
var txt = HttpUtility.HtmlDecode(doc.DocumentNode.InnerText);
This should return something like this:
This is a sample html text.
<#ff9999>this is only a sample</#ff9999>
....
and some other tags...
Of course you could improve it for your needs.
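One possible improvement, as a minimal sketch: the lookbehind in the regex above also fires after "background-color:", so a stricter pattern may be wanted. The helper names LooseColor and StrictColor and the stricter pattern are assumptions made for illustration, not part of the answer above, and values such as rgb(...) are not handled.

```csharp
using System;
using System.Text.RegularExpressions;

// The regex from the answer: any "color:" counts, including the tail of "background-color:".
static string LooseColor(string style) =>
    Regex.Match(style, @"(?<=color:\s*)#?\w+", RegexOptions.IgnoreCase).Value;

// Hypothetical stricter variant: "color" must not be preceded by a word character or a hyphen.
static string StrictColor(string style) =>
    Regex.Match(style, @"(?<![\w-])color:\s*(?<value>#?\w+)", RegexOptions.IgnoreCase)
         .Groups["value"].Value;

Console.WriteLine(LooseColor("color:#ff9999"));                       // #ff9999
Console.WriteLine(LooseColor("background-color: red; color: blue"));  // red
Console.WriteLine(StrictColor("background-color: red; color: blue")); // blue
```

Both helpers return an empty string when the style contains no color declaration.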

It is possible to do it using regular expressions, but you should not parse (X)HTML with regex.
The first regexp I came up with to solve the problem is:
<p(\w|\s|[="])+color:(#([0-9a-f]{6}|[0-9a-f]{3}))">(\w|\s)+</p>
As written, group 2 holds the hex colour with its leading # (group 3 holds just the 3 or 6 hexadecimal digits), and group 4 holds only the last character of the text inside the tag, because a repeated group reports only its last iteration.
Obviously, it's not the best solution, as I'm not a regexp master, and it needs some testing and probably generalisation. Still, it's a good starting point.
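Counting numbered groups by hand is error-prone, and a repeated group such as (\w|\s)+ reports only its last iteration. A hedged sketch with named groups follows; the function name ColoredParagraph and the reworked pattern are assumptions for illustration, and it still only covers simple, well-formed <p> tags with no nested markup.

```csharp
using System;
using System.Text.RegularExpressions;

// Returns the colour and the text of the first <p> whose attributes contain "color:#hex",
// or null when nothing matches.
static (string Color, string Text)? ColoredParagraph(string html)
{
    var m = Regex.Match(
        html,
        @"<p\b[^>]*(?<![\w-])color:\s*(?<color>#(?:[0-9a-f]{6}|[0-9a-f]{3}))\b[^>]*>(?<text>[^<]*)</p>",
        RegexOptions.IgnoreCase);
    if (!m.Success) return null;
    return (m.Groups["color"].Value, m.Groups["text"].Value);
}

var hit = ColoredParagraph("<p align=\"center\" style=\"color:#ff9999\">this is only a sample</p>");
if (hit.HasValue)
    Console.WriteLine("<" + hit.Value.Color + ">" + hit.Value.Text + "</" + hit.Value.Color + ">");
```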

Related

Populating a span with a string including a linked email [duplicate]

What is the difference between innerHTML, innerText and value in JavaScript?
The examples below refer to the following HTML snippet:
<div id="test">
Warning: This element contains <code>code</code> and <strong>strong language</strong>.
</div>
The node will be referenced by the following JavaScript:
var x = document.getElementById('test');
element.innerHTML
Sets or gets the HTML syntax describing the element's descendants
x.innerHTML
// => "
// => Warning: This element contains <code>code</code> and <strong>strong language</strong>.
// => "
This is part of the W3C's DOM Parsing and Serialization Specification. Note it's a property of Element objects.
node.innerText
Sets or gets the text between the start and end tags of the object
x.innerText
// => "Warning: This element contains code and strong language."
innerText was introduced by Microsoft and was for a while unsupported by Firefox. In August of 2016, innerText was adopted by the WHATWG and was added to Firefox in v45.
innerText gives you a style-aware representation of the text that tries to match what's rendered by the browser. This means:
innerText applies text-transform and white-space rules
innerText trims white space between lines and adds line breaks between items
innerText will not return text for invisible items
innerText will return textContent for elements that are never rendered like <style /> and <script />
Property of HTMLElement objects
node.textContent
Gets or sets the text content of a node and its descendants.
x.textContent
// => "
// => Warning: This element contains code and strong language.
// => "
While this is a W3C standard, it is not supported by IE < 9.
Is not aware of styling and will therefore return content hidden by CSS
Does not trigger a reflow (therefore more performant)
Property of Node elements
node.value
This one depends on the element that you've targeted. For the above example, x returns an HTMLDivElement object, which does not have a value property defined.
x.value // => undefined
Input tags (<input />), for example, do define a value property, which refers to the "current value in the control".
<input id="example-input" type="text" value="default" />
<script>
document.getElementById('example-input').value //=> "default"
// User changes input to "something"
document.getElementById('example-input').value //=> "something"
</script>
From the docs:
Note: for certain input types the returned value might not match the
value the user has entered. For example, if the user enters a
non-numeric value into an <input type="number">, the returned value
might be an empty string instead.
Sample Script
Here's an example which shows the output for the HTML presented above:
var properties = ['innerHTML', 'innerText', 'textContent', 'value'];
// Writes to textarea#output and console
function log(obj) {
console.log(obj);
var currValue = document.getElementById('output').value;
document.getElementById('output').value = (currValue ? currValue + '\n' : '') + obj;
}
// Logs property as [propName]value[/propertyName]
function logProperty(obj, property) {
var value = obj[property];
log('[' + property + ']' + value + '[/' + property + ']');
}
// Main
log('=============== ' + properties.join(' ') + ' ===============');
for (var i = 0; i < properties.length; i++) {
logProperty(document.getElementById('test'), properties[i]);
}
<div id="test">
Warning: This element contains <code>code</code> and <strong>strong language</strong>.
</div>
<textarea id="output" rows="12" cols="80" style="font-family: monospace;"></textarea>
Unlike innerText, though, innerHTML lets you work with HTML rich text and doesn't automatically encode and decode text. In other words, innerText retrieves and sets the content of the tag as plain text, whereas innerHTML retrieves and sets the content in HTML format.
InnerText property html-encodes the content, turning <p> to &lt;p&gt;, etc. If you want to insert HTML tags you need to use InnerHTML.
In simple words:
innerText will show the value as is and ignores any HTML formatting which may
be included.
innerHTML will show the value and apply any HTML formatting.
Both innerText and innerHTML give access to the inner part of an HTML element.
The difference shows when you assign a string containing markup: innerText inserts the string as it is, so the tags are displayed on the screen as literal code, while innerHTML parses the string as HTML and renders it.
Look at the example below to understand better. Run the code below.
const ourstring = 'My name is <b class="name">Satish chandra Gupta</b>.';
document.getElementById('innertext').innerText = ourstring;
document.getElementById('innerhtml').innerHTML = ourstring;
.name {
color:red;
}
<p><b>Inner text below. It inject string as it is into the element.</b></p>
<p id="innertext"></p>
<br>
<p><b>Inner html below. It renders the string into the element and treat as part of html document.</b></p>
<p id="innerhtml"></p>
var element = document.getElementById("main");
var values = element.childNodes[1].innerText;
alert('the value is:' + values);
To further refine it and retrieve the value Alec for example, use another .childNodes[1]
var element = document.getElementById("main");
var values = element.childNodes[1].childNodes[1].innerText;
alert('the value is:' + values);
In terms of MutationObservers, setting innerHTML generates a childList mutation due to the browsers removing the node and then adding a new node with the value of innerHTML.
If you set innerText, a characterData mutation is generated.
innerText property sets or returns the text content as plain text of the specified node, and all its descendants, whereas the innerHTML property gets and sets the plain text or HTML contents in the elements. Unlike innerText, innerHTML lets you work with HTML rich text and doesn’t automatically encode and decode text.
InnerText will only return the text value of the page with each element on a newline in plain text, while innerHTML will return the HTML content of everything inside the body tag, and childNodes will return a list of nodes, as the name suggests.
The innerText property returns the actual text value of an html element while the innerHTML returns the HTML content. Example below:
var element = document.getElementById('hello');
element.innerText = '<strong> hello world </strong>';
console.log('The innerText property will not parse the html tags as html tags but as normal text:\n' + element.innerText);
console.log('The innerHTML element property will encode the html tags found inside the text of the element:\n' + element.innerHTML);
element.innerHTML = '<strong> hello world </strong>';
console.log('The <strong> tag we put above has been parsed using the innerHTML property so the .innerText will not show them \n ' + element.innerText);
console.log(element.innerHTML);
<p id="hello"> Hello world
</p>
To add to the list, innerText will keep your text-transform, innerHTML won't.
#rule:
innerHTML
write: whatever String you write to the ele.innerHTML, ele (the code of the element in the html file) will be exactly same as it is written in the String.
read : whatever you read from the ele.innerHTML to a String, the String will be exactly same as it is in ele (the html file).
=> .innerHTML will not make any modification for your read/write
innerText
write: when you write a String to the ele.innerText, any html reserved special character in the String will be encoded into html format first, then stored into the ele.
eg: <p> in your String will become &lt;p&gt; in the ele
read : when you read from the ele.innerText to a String,
1. any html reserved special character in the ele will be decoded back into a readable text format,
eg: &lt;p&gt; in the ele will become back into <p> in your String
2. any (valid) html tag in the ele will be removed -- so it becomes "plain text"
eg: if <em>you</em> can in the ele will become if you can in your String
about invalid html tag
if there is an invalid html tag originally in the ele (the html code), and you read from.innerText, how does the tag gets removed?
-- this ("if there is an invalid html tag originally") should not (is not possible to) happen
but its possible that you write an invalid html tag by .innerHTML (in raw) into ele -- then, this may be auto fixed by the browser.
dont take (-interpret) this as step [1.] [2.] with an order
-- no, take it as step [1.] [2.] are executed at the same time
-- I mean, if the decoded characters in [1.] will form a new tag after the conversion, [2.] does not remove it
(-- cuz [2.] considers what characters are in the ele during the conversion, not the characters they become into after the conversion)
then stored into the String.
jsfiddle: with explanation
(^ this contains much more explanations in comments of the js file, + output in console.log
below is a simplified view, with some output.
(try out the code yourself, also there is no guarantee that my explanations are 100% correct.))
<p id="mainContent">This is a <strong>sample</strong> sentennce for Reading.</p>
<p id="htmlWrite"></p>
<p id="textWrite"></p>
// > #basic (simple)
// read
var ele_mainContent = document.getElementById('mainContent');
alert(ele_mainContent.innerHTML); // This is a <strong>sample</strong> sentennce for Reading. // >" + => `.innerHTML` will **not make any modification** for your read/write
alert(ele_mainContent.innerText); // This is a sample sentennce for Reading. // >" 2. any (valid) `html tag` in the `ele` will be **removed** -- so it becomes "plain text"
// write
var str_WriteOutput = "Write <strong>this</strong> sentence to the output.";
var ele_htmlWrite = document.getElementById('htmlWrite');
var ele_textWrite = document.getElementById('textWrite');
ele_htmlWrite.innerHTML = str_WriteOutput;
ele_textWrite.innerText = str_WriteOutput;
alert(ele_htmlWrite.innerHTML); // Write <strong>this</strong> sentence to the output. // >" + => `.innerHTML` will **not make any modification** for your read/write
alert(ele_htmlWrite.innerText); // Write this sentence to the output. // >" 2. any (valid) `html tag` in the `ele` will be **removed** -- so it becomes "plain text"
alert(ele_textWrite.innerHTML); // Write &lt;strong&gt;this&lt;/strong&gt; sentence to the output. // >" any `html reserved special character` in the String will be **encoded** into html format first
alert(ele_textWrite.innerText); // Write <strong>this</strong> sentence to the output. // >" 1. any `html reserved special character` in the `ele` will be **decoded** back into a readable text format,
// > #basic (more)
// write - with html encoded char
var str_WriteOutput_encodedChar = "What if you have &lt;strong&gt;encoded&lt;/strong&gt; char in <strong>the</strong> sentence?";
var ele_htmlWrite_encodedChar = document.getElementById('htmlWrite_encodedChar');
var ele_textWrite_encodedChar = document.getElementById('textWrite_encodedChar');
ele_htmlWrite_encodedChar.innerHTML = str_WriteOutput_encodedChar;
ele_textWrite_encodedChar.innerText = str_WriteOutput_encodedChar;
alert(ele_htmlWrite_encodedChar.innerHTML); // What if you have &lt;strong&gt;encoded&lt;/strong&gt; char in <strong>the</strong> sentence?
alert(ele_htmlWrite_encodedChar.innerText); // What if you have <strong>encoded</strong> char in the sentence?
alert(ele_textWrite_encodedChar.innerHTML); // What if you have &amp;lt;strong&amp;gt;encoded&amp;lt;/strong&amp;gt; char in &lt;strong&gt;the&lt;/strong&gt; sentence?
alert(ele_textWrite_encodedChar.innerText); // What if you have &lt;strong&gt;encoded&lt;/strong&gt; char in <strong>the</strong> sentence?
// > #note-advance: read then write
var ele__htmlRead_Then_htmlWrite = document.getElementById('htmlRead_Then_htmlWrite');
var ele__htmlRead_Then_textWrite = document.getElementById('htmlRead_Then_textWrite');
var ele__textRead_Then_htmlWrite = document.getElementById('textRead_Then_htmlWrite');
var ele__textRead_Then_textWrite = document.getElementById('textRead_Then_textWrite');
ele__htmlRead_Then_htmlWrite.innerHTML = ele_mainContent.innerHTML;
ele__htmlRead_Then_textWrite.innerText = ele_mainContent.innerHTML;
ele__textRead_Then_htmlWrite.innerHTML = ele_mainContent.innerText;
ele__textRead_Then_textWrite.innerText = ele_mainContent.innerText;
alert(ele__htmlRead_Then_htmlWrite.innerHTML); // This is a <strong>sample</strong> sentennce for Reading.
alert(ele__htmlRead_Then_htmlWrite.innerText); // This is a sample sentennce for Reading.
alert(ele__htmlRead_Then_textWrite.innerHTML); // This is a &lt;strong&gt;sample&lt;/strong&gt; sentennce for Reading.
alert(ele__htmlRead_Then_textWrite.innerText); // This is a <strong>sample</strong> sentennce for Reading.
alert(ele__textRead_Then_htmlWrite.innerHTML); // This is a sample sentennce for Reading.
alert(ele__textRead_Then_htmlWrite.innerText); // This is a sample sentennce for Reading.
alert(ele__textRead_Then_textWrite.innerHTML); // This is a sample sentennce for Reading.
alert(ele__textRead_Then_textWrite.innerText); // This is a sample sentennce for Reading.
// the parsed html after js is executed
/*
<html><head>
<meta charset="utf-8">
<title>Test</title>
</head>
<body>
<p id="mainContent">This is a <strong>sample</strong> sentennce for Reading.</p>
<p id="htmlWrite">Write <strong>this</strong> sentence to the output.</p>
<p id="textWrite">Write &lt;strong&gt;this&lt;/strong&gt; sentence to the output.</p>
<!-- P2 -->
<p id="htmlWrite_encodedChar">What if you have &lt;strong&gt;encoded&lt;/strong&gt; char in <strong>the</strong> sentence?</p>
<p id="textWrite_encodedChar">What if you have &amp;lt;strong&amp;gt;encoded&amp;lt;/strong&amp;gt; char in &lt;strong&gt;the&lt;/strong&gt; sentence?</p>
<!-- P3 #note: -->
<p id="htmlRead_Then_htmlWrite">This is a <strong>sample</strong> sentennce for Reading.</p>
<p id="htmlRead_Then_textWrite">This is a &lt;strong&gt;sample&lt;/strong&gt; sentennce for Reading.</p>
<p id="textRead_Then_htmlWrite">This is a sample sentennce for Reading.</p>
<p id="textRead_Then_textWrite">This is a sample sentennce for Reading.</p>
</body></html>
*/
innerhtml will apply html codes
innertext will put content as text so if you have html tags it will show as text only
1)innerHtml
sets all the html content inside the tag
returns all the html content inside the tag
includes styling + whitespaces
2)innerText
sets all the content inside the tag (with tag wise line breaks)
returns all html content inside the tag (with tag wise line breaks)
ignores tags (shows only text)
ignores styling + whitespaces
if we have style:"visibility:hidden;" inside tag
|_ innerText includes the styling -> hides content
3)textContent
sets all the content inside the tag (no tag wise line breaks)
returns all content inside the tag (no tag wise line breaks)
includes whitespaces
if we have style:"visibility:hidden;" inside tag
|_ textContent ignores the styling -> shows content
textContent has better performance because its value is not parsed as HTML.

parse pages of online book and save content of pages and its footers without any changes

<article class="js_IntraTCBP IntraTCBP dr tr lh2 js_lblContent" id="js_lblContent"><p></p>text
<p></p><p></p><a name="p1"></a><a name="p1"></a><a name="p1"></a><a name="p1"></a><a name="p1"></a><h1>text</h1><p></p><p></p>text
<p></p>text<sup>1</sup>
<p></p>text<sup>2</sup>
<p></p>text<sup>3</sup>
<p></p>text<sup>4</sup>text<p></p><hr class="Footer"><p></p><font class="Footer"><p></p>1-ddd
<p></p>2-ccc
<p></p>3-bbb
<p></p>4-aaa
</font></article>
text
texttext
text1
text2
text3
text4text1-ddd
2-ccc
3-bbb
4-aaa
I want to parse the pages of an online book and save the content of the pages without any changes.
When I use this:
var pageContent = document.DocumentNode.SelectNodes("//article[@class='js_IntraTCBP IntraTCBP dr tr lh2 js_lblContent']/text()");
it gets me all the 'text's.
How can I get all the footers as well, for example text1 ----> 1-ddd, like what I see on the book's page?
You could try regular expressions (regex): sequences of characters and symbols that describe a string or pattern to search for. See the System.Text.RegularExpressions.Regex class on MSDN.
You can use Regex.Match to match some html elements, but you have to loop through each line. This will get you started:
// loop...
var match = Regex.Match(line, @"(\<[\w]*\>|[^\s]*([^<]*)\<\/[\w]*\>)");
To get the tag including content use:
string tag = match.Groups[1].Value;
To get the content not including the tag use:
string content = match.Groups[2].Value;
Demo. It can detect some elements but not all.
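As a rough sketch of how regex groups could pair each footnote marker with its footer line: the function name PairFootnotes, both patterns, and the sample lines are assumptions for illustration. It assumes one marker or footer per line, that footers come after the markers, and that a footer has the form "N-note".

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static List<string> PairFootnotes(IEnumerable<string> lines)
{
    // Text immediately before a <sup>N</sup> marker, plus the number N.
    var marker = new Regex(@"(?<text>[^<>]*)<sup>(?<n>\d+)</sup>");
    // A footer line: optional leading tags, then "N-note".
    var footer = new Regex(@"^(?:<[^>]*>)*(?<n>\d+)-(?<note>.*)$");

    var textByNumber = new Dictionary<string, string>();
    var pairs = new List<string>();

    foreach (var line in lines)
    {
        var f = footer.Match(line);
        if (f.Success)
        {
            string number = f.Groups["n"].Value;
            // Only report footers whose number was seen as a marker earlier.
            if (textByNumber.TryGetValue(number, out var text))
                pairs.Add(text + number + " -> " + f.Groups["note"].Value);
            continue;
        }
        foreach (Match m in marker.Matches(line))
            textByNumber[m.Groups["n"].Value] = m.Groups["text"].Value;
    }
    return pairs;
}

var sampleLines = new[]
{
    "<p></p>alpha<sup>1</sup>",
    "<p></p>beta<sup>2</sup>",
    "<p></p>1-ddd",
    "<p></p>2-ccc",
};
foreach (var pair in PairFootnotes(sampleLines))
    Console.WriteLine(pair);
```

For real pages an HTML parser is likely the more robust route; the regexes here only cope with markup as regular as the sample.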
Here are some links that may help with learning it:
https://msdn.microsoft.com/en-us/library/20bw873z.aspx
https://msdn.microsoft.com/en-us/library/az24scfc(v=vs.110).aspx
http://www.dotnetperls.com/regex

Extract content in paragraph Tags

I have the following HTML in a string and I have to extract only the content inside paragraph tags. Any ideas?
The link is http://www.public-domain-content.com/books/Coming_Race/C1P1.shtml
I have tried
const string HTML_TAG_PATTERN = "<[^>]+.*?>";
static string StripHTML(string inputString)
{
return Regex.Replace(inputString, HTML_TAG_PATTERN, string.Empty);
}
It removes all HTML tags, but I don't want to remove all of them, because the tags are how I can pick out content such as paragraphs.
Secondly, line breaks show up as \n in the text, and applying Replace("\n", "") does not help.
Another problem is that when I apply
int UrlStart = e.Result.IndexOf("<p>"), urlEnd = e.Result.IndexOf("<p> </p></td>\r" );
string paragraph = e.Result.Substring(UrlStart, urlEnd);
extractedContent.Text = paragraph.Replace(Environment.NewLine, "");
<p> </p></td>\r appears at the end of the paragraph, but urlEnd does not ensure that only the paragraph is shown.
The extracted string, as shown in Visual Studio, looks like this:
this page is downloaded by Webclient
End of HTMLpage
We will provide ourselves with ropes of\rsuitable length and strength- and- pardon me- you must not\rdrink more to-night. our hands and feet must be steady and\rfirm tomorrow.\"\r<p> </p> </td>\r </tr>\r\r <tr>\r <td height=\"25\" width=\"10%\">\r \r </td><td height=\"25\" width=\"80%\" align=\"center\">\r <font color=\"#FFFFFF\">\r <font size=\"4\">1</font> \r </font></td>\r <td height=\"25\" width=\"10%\" align=\"right\">Next</td>\r </tr>\r </table>\r </center>\r</div>\r<p align=\"center\"><b>The Coming Race -by- Edward Bulwer Lytton</b></p>\r<P><B><center>Encyclopedia - Books - Religion<a/> - <A HREF=\"http://www.public-domain-content.com/links2.shtml\">Links - Home - Message Boards</B><BR>This Wikipedia content is licensed under the <a href=\"http://www.gnu.org/copyleft/fdl.html\">GNU Fr
Don't use regular expressions to parse HTML. Use the HTML Agility Pack (or something similar) instead.
A quick example, but you could do something like this:
HtmlDocument document = new HtmlDocument();
document.Load("your_file_here.htm");
// SelectNodes returns null when nothing matches, so guard before iterating.
var paragraphs = document.DocumentNode.SelectNodes("//p");
if (paragraphs != null)
{
    foreach (HtmlNode paragraph in paragraphs)
    {
        // do something with the paragraph node here
        string content = paragraph.InnerText; // or something similar
    }
}

How to parse HTML to modify all words

This seems to be a recurring question, but here goes.
I have HTML which is well-formatted (it comes from a controlled source, so this can be taken to be a given). I need to iterate through the contents of the body of the HTML, look for all the words in the document, perform some editing on those words, and save the results.
For example, I have a file sample.html and I want to run it through my application and produce output.html, which is exactly the same as the original, plus my edits.
I found the following using HTMLAgilityPack, but all the examples I've found look at the attributes of the specified tags - is there an easy modification that will look at the contents and perform my edits?
HtmlDocument HD = new HtmlDocument();
HD.Load(@"e:\test.htm");
var NoAltElements = HD.DocumentNode.SelectNodes("//img[not(@alt)]");
if (NoAltElements != null)
{
    foreach (HtmlNode HN in NoAltElements)
    {
        HN.Attributes.Append("alt", "no alt image");
    }
}
HD.Save(@"e:\test.htm");
The above looks for image tags with no alt attribute. I want to look for all tags in the <body> of the file and do something with the contents (which may involve creating new tags in the process).
A very simple sample of what I might do is take the following input:
<html>
<head><title>Some Title</title></head>
<body>
<h1>This is my page</h1>
<p>This is a paragraph of text.</p>
</body>
</html>
and produce the output, which takes every word and alternates between making it uppercase and making it italics:
<html>
<head><title>Some Title</title></head>
<body>
<h1>THIS <em>is</em> MY <em>page</em></h1>
<p>THIS <em>is</em> A <em>paragraph</em> OF <em>text</em>.</p>
</body>
</html>
Ideas, suggestions?
Personally, given this setup, I'd work with the InnerText property of HtmlNode to find the words (probably with Regex so I can exclude for punctuation and not simply rely on spaces) and then use the InnerHtml property to make the changes using iterative calls to Regex.Replace (because the Regex.Replace has a method that allows you to specify both start position and number of times to replace).
Processing code:
IEnumerable<HtmlNode> nodes = doc.DocumentNode.DescendantNodes().Where(n => n.InnerText == "something");
foreach (HtmlNode node in nodes)
{
string[] words = getWords(node.InnerText);
node.InnerHtml = processHtml(node.InnerHtml, words);
}
identify words (there's probably some slicker way to do this but here's an initial stab):
private string[] getWords(string text)
{
Regex reg = new Regex(@"\w+");
MatchCollection matches = reg.Matches(text);
List<string> words = new List<string>();
foreach (Match match in matches)
{
words.Add(match.Value);
}
return words.ToArray();
}
process the html:
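private string processHtml(string html, string[] words)
{
    int startPosition = 0;
    foreach (string word in words)
    {
        // Find the next occurrence at or after the previous replacement.
        // Note: this can still hit a tag or attribute name that equals the word.
        startPosition = html.IndexOf(word, startPosition, StringComparison.Ordinal);
        if (startPosition < 0) break; // word not present in the markup as-is
        Regex reg = new Regex(Regex.Escape(word));
        string replacement = alterWord(word);
        html = reg.Replace(html, replacement, 1, startPosition);
        // Skip past the inserted text so it is not matched again.
        startPosition += replacement.Length;
    }
    return html;
}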
I'll leave the details of alterWord() to you. :)
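For illustration only, here is a different, self-contained route that avoids tracking positions: split the markup on tags and rewrite words only in the text pieces. The name AlternateWords is an assumption, the alternation rule is taken from the question's sample output, and the sketch treats entity names such as amp in &amp; as words, so it only suits simple fragments.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

static string AlternateWords(string htmlFragment)
{
    // Splitting on a capturing group keeps the tags in the result:
    // even indexes are text, odd indexes are tags.
    string[] parts = Regex.Split(htmlFragment, @"(<[^>]+>)");
    var output = new StringBuilder();
    int wordIndex = 0;

    for (int i = 0; i < parts.Length; i++)
    {
        if (i % 2 == 1)
        {
            output.Append(parts[i]); // a tag: copy unchanged
            continue;
        }
        // A text piece: uppercase every first word, italicise every second one.
        output.Append(Regex.Replace(parts[i], @"\w+", m =>
            wordIndex++ % 2 == 0
                ? m.Value.ToUpperInvariant()
                : "<em>" + m.Value + "</em>"));
    }
    return output.ToString();
}

Console.WriteLine(AlternateWords("<h1>This is my page</h1>"));
Console.WriteLine(AlternateWords("<p>This is a paragraph of text.</p>"));
```

The word counter restarts on every call, so calling it once per element reproduces the per-element alternation in the question's expected output.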
Try .SelectNodes("//body//*"). That'll get you all elements within any body element, at any depth.

is it possible to fix the problem in HtmlAgilityPack when there is a not closed html tag?

Well, I have the following problem.
The HTML I have is malformed, and I have problems selecting nodes with HtmlAgilityPack when this is the case.
The code is below:
string strHtml = @"
<html>
<div>
<p><strong>Elem_A</strong>String_A1_2 String_A1_2</p>
<p><strong>Elem_B</strong>String_B1_2 String_B1_2</p>
</div>
<div>
<p><strong>Elem_A</strong>String_A2_2 <String_A2_2> asdas</p>
<p><strong>Elem_B</strong>String_B2_2 String_B2_2</p>
</div>
</html>";
HtmlAgilityPack.HtmlDocument objHtmlDocument = new HtmlAgilityPack.HtmlDocument();
objHtmlDocument.LoadHtml(strHtml);
HtmlAgilityPack.HtmlNodeCollection colnodePs = objHtmlDocument.DocumentNode.SelectNodes("//p");
List<string> lststrText = new List<string>();
foreach (HtmlAgilityPack.HtmlNode nodeP in colnodePs)
{
lststrText.Add(nodeP.InnerHtml);
}
The problem is that String_A2_2 is enclosed in angle brackets,
so HtmlAgilityPack returns 5 strings instead of 4 in lststrText.
Is it possible to make HtmlAgilityPack return element 3 as
"<strong>Elem_A</strong>String_A2_2 <String_A2_2> asdas"?
Or maybe I can do some preprocessing to close the element?
The current content of lststrText is:
lststrText[0] = "<strong>Elem_A</strong>String_A1_2 String_A1_2"
lststrText[1] = "<strong>Elem_B</strong>String_B1_2 String_B1_2"
lststrText[2] = ""
lststrText[3] = ""
lststrText[4] = "<strong>Elem_B</strong>String_B2_2 String_B2_2"
Most html parsers try to build a working DOM, meaning dangling tags are not accepted. They will be converted, or closed in some way.
If only selecting the nodes is of importance to you, and speed and huge amounts of data is not an issue, you could grab all your <p> tags with a regular expression instead:
Regex reMatchP = new Regex(@"<(p)>.*?</\1>");
foreach (Match m in reMatchP.Matches(strHtml))
{
Console.WriteLine(m.Value);
}
This regular expression assumes the <p> tags are well formed and closed.
If you are to run this Regex a lot in your program you should declare it as:
static Regex reMatchP = new Regex(@"<(p)>.*?</\1>", RegexOptions.Compiled);
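To get only the inner content of each paragraph, as the question asks for, a named group can be added. A minimal sketch, assuming each <p>...</p> sits on a single line as in the question's sample; the name ParagraphContents and the shortened sample string are made up for illustration:

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static List<string> ParagraphContents(string html)
{
    var contents = new List<string>();
    // "." does not match a newline unless RegexOptions.Singleline is set,
    // so each match stays within one line.
    foreach (Match m in Regex.Matches(html, @"<p>(?<inner>.*?)</p>"))
        contents.Add(m.Groups["inner"].Value);
    return contents;
}

string sample =
    "<div>\n" +
    "<p><strong>Elem_A</strong>String_A2_2 <String_A2_2> asdas</p>\n" +
    "<p><strong>Elem_B</strong>String_B2_2 String_B2_2</p>\n" +
    "</div>";

foreach (string inner in ParagraphContents(sample))
    Console.WriteLine(inner);
```

The stray <String_A2_2> survives untouched because the regex never tries to interpret it as a tag.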
[Edit: Agility pack change]
If you want to use HtmlAgility pack you can modify the PushNodeEnd function in HtmlDocument.cs:
if (HtmlNode.IsCDataElement(CurrentNodeName()))
{
_state = ParseState.PcData;
return true;
}
// new code start
if ( !AllowedTags.Contains(_currentnode.Name) )
{
close = true;
}
// new code end
where AllowedTags would be a list of all known tags: b, p, br, span, div, etc.
the output is not 100% what you want, but maybe close enough?
<strong>Elem_A</strong>String_A1_2 String_A1_2
<strong>Elem_B</strong>String_B1_2 String_B1_2
<strong>Elem_A</strong>String_A2_2 <ignorestring_a2_2></ignorestring_a2_2> asdas
<strong>Elem_B</strong>String_B2_2 String_B2_2
You could use TidyNet to do the pre/postprocessing you allude to. Can you edit your answer to explain why that wouldnt be applicable in your case?
