Extract heading text from HTML text - c#

I have a textarea with the TinyMCE editor, which turns it into a rich text editor. I want to extract the text of all headings (H1, H2, etc.) without style or formatting.
Suppose that txtEditor.InnerText gives me value like below:
<p><span style="font-family: comic sans ms,sans-serif; color: #993366; font-size: large; background-color: #33cccc;">This is before heading one</span></p>
<h1><span style="font-family: comic sans ms,sans-serif; color: #993366;">Hello This is Headone</span></h1>
<p>this is before heading2</p>
<h2>This is heading2</h2>
I want to get a list of the heading tags' text only. Any suggestion or guidance will be appreciated.

Use HtmlAgilityPack, and then it's easy :
var doc = new HtmlDocument();
doc.LoadHtml(txtEditor.InnerText);
var h1Elements = doc.DocumentNode.Descendants("h1").Select(nd => nd.InnerText);
string h1Text = string.Join(" ", h1Elements);

referencing Regular Expression to Read Tags in HTML
I believe that this is close to what you are looking for:
String h1Regex = "<h[1-6][^>]*?>(?<TagText>.*?)</h[1-6]>";
MatchCollection mc = Regex.Matches(html, h1Regex);
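To turn the matches above into the plain heading text the question asks for, something like the following sketch might work. The class name HeadingExtractor and the method GetHeadingTexts are made up for illustration, and the pattern is a slightly stricter variant of the one above (it requires the closing tag to use the same heading level), not the pattern from the answer itself. Regex-based extraction remains fragile on malformed or nested markup.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public static class HeadingExtractor
{
    // Matches <h1>..<h6>; the backreference \k<level> forces the closing
    // tag to use the same level as the opening tag.
    private static readonly Regex HeadingRegex = new Regex(
        @"<h(?<level>[1-6])\b[^>]*>(?<TagText>.*?)</h\k<level>\s*>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Matches any remaining tag inside the heading, e.g. <span ...>.
    private static readonly Regex TagRegex = new Regex(@"<[^>]*>");

    public static List<string> GetHeadingTexts(string html)
    {
        var result = new List<string>();
        foreach (Match m in HeadingRegex.Matches(html))
        {
            // Take the named group, then drop inner tags such as <span>.
            string inner = m.Groups["TagText"].Value;
            result.Add(TagRegex.Replace(inner, string.Empty).Trim());
        }
        return result;
    }
}
```

Calling HeadingExtractor.GetHeadingTexts(txtEditor.InnerText) would then return one string per heading. HTML entities such as &amp; are left encoded in this sketch.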

Related

How would I Strip Html from a string and set a character limit?

I'm getting a string from a list of items. The string is currently displayed as "item.ItemDescription" (the 9th row below).
I want to strip out all HTML from this string and set a character limit of 250 after the HTML is stripped.
Is there a simple way of doing this?
I saw a post suggesting installing the HTML Agility Pack, but I was looking for something simpler.
EDIT:
It does not always contain HTML. If the client wanted to add a bold or italic tag to an item's name in the description, it would show up as <"strong">Item Name<"/strong">, for instance. I want to strip out all HTML no matter what is entered.
<tbody>
@foreach (var item in Model.itemList)
{
<tr id="@("__filterItem_" + item.EntityId + "_" + item.EntityTypeId)">
<td>
@Html.ActionLink(item.ItemName, "Details", "Item", new { id = item.EntityId }, null)
</td>
<td>
item.ItemDescription
</td>
<td>
@if (Model.IsOwner)
{
<a class="btnDelete" title="Delete" itemid="@(item.EntityId)" entitytype="@item.EntityTypeId" filterid="@Model.Id">Delete</a>
}
</td>
</tr>
}
</tbody>
Your best option, IMO, is not to get into a parsing nightmare with all the possible values. Why not simply inject a class=someCssClassName attribute into the <td>, then control the length, color, or whatever with CSS?
An even better idea is to assign a class to the containing <tr class=trClass> and then have the CSS apply lengths to the child <td> elements.
You could do something like this to remove all tags (opening, closing, and self-closing) from the string, but it may have the unintended consequence of removing things the user entered that weren't meant to be html tags:
text = Regex.Replace(text, @"</?[^>]*/?>", String.Empty);
Instead, I would recommend something like this and letting the user know html isn't supported:
text = text.Replace("<", "&lt;");
text = text.Replace(">", "&gt;");
Just remember to check for your 250 character limit before the conversion:
if (text.Length > 250) text = text.Substring(0, 250); // Substring throws if the string is shorter than 250
This Regex will select any html tags (including the ones with double quotes such as <"strong">):
<[^>]*>
Look here: http://regexr.com/3cge4
Using C# regular expressions to remove HTML tags
From there, you can simply check the string size and display appropriately.
var itemDescriptionStripped = Regex.Replace(item.ItemDescription, @"<[^>]*>", String.Empty);
// Keep at most the first 250 characters.
if (itemDescriptionStripped.Length > 250)
    itemDescriptionStripped = itemDescriptionStripped.Substring(0, 250);
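The strip-then-truncate steps from the answers above could be wrapped in a small helper so the Razor view only needs one call. This is a minimal sketch: the names HtmlText and StripAndTruncate are hypothetical, and it uses the same <[^>]*> pattern as above, so it will also remove any user text that happens to sit between angle brackets.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public static class HtmlText
{
    public static string StripAndTruncate(string input, int maxLength)
    {
        if (string.IsNullOrEmpty(input)) return string.Empty;

        // Remove anything that looks like a tag, including <"strong">.
        string stripped = Regex.Replace(input, @"<[^>]*>", string.Empty);

        // Only cut when the text is actually longer than the limit.
        return stripped.Length > maxLength
            ? stripped.Substring(0, maxLength)
            : stripped;
    }
}
```

In the view this might be used as @HtmlText.StripAndTruncate(item.ItemDescription, 250).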

how to split the string between two strings in c#?

I have a string variable that contains HTML data. I want to split that HTML string into multiple strings and then merge those strings into a single one.
This is html string:
<p><span style="text-decoration: underline; color: #ff0000;"><strong>para1</strong></span></p>
<p style="text-align: center;"><strong><span style="color: #008000;">para2</span> स्द्स्द्सद्स्द para2 again<br /></strong></p>
<p style="text-align: left;"><strong><span style="color: #0000ff;">para3</span><br /></strong></p>
And this is my expected output:
<p><span style="text-decoration: underline; color: #ff0000;"><strong>para1</strong></span><strong><span style="color: #008000;">para2</span>para2 again<br /></strong><strong><span style="color: #0000ff;">para3</span><br /></strong></p>
My split logic is given below:
1. Split the HTML string into tokens on the </p> tag.
2. Take the first token and store it in a separate string variable (firstPara).
3. Take each remaining token, remove any tag starting with <p and the closing </p>, and store each value in a separate variable.
4. Take the first token, firstPara, remove its </p> tag, and append every token from step 3.
5. Now the variable firstPara holds the whole value.
6. Finally, append </p> at the end of firstPara.
This is my problem.
Could you please help me solve this?
Here is a regex example of how to do it.
String pattern = @"(?<=<p.*>).*(?=</p>)";
var matches = Regex.Matches(text, pattern);
StringBuilder result = new StringBuilder();
result.Append("<p>");
foreach (Match match in matches)
{
result.Append(match.Value);
}
result.Append("</p>");
And this is how you should do it with Html Agility Pack
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(text);
var nodes = doc.DocumentNode.SelectNodes("//p");
StringBuilder result = new StringBuilder();
result.Append("<p>");
foreach (HtmlNode node in nodes)
{
result.Append(node.InnerHtml);
}
result.Append("</p>");
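If adding the Html Agility Pack is not an option, a variation on the regex approach above is to capture each paragraph body with a lazy named group instead of lookarounds, which also works when several paragraphs share one line. This is only a sketch under that assumption: ParagraphMerger and Merge are illustrative names, and it does not handle nested or unclosed <p> tags.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public static class ParagraphMerger
{
    // <p\b keeps tags such as <pre> from matching; "inner" is the paragraph body.
    private static readonly Regex ParagraphRegex = new Regex(
        @"<p\b[^>]*>(?<inner>.*?)</p\s*>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    public static string Merge(string html)
    {
        var result = new StringBuilder("<p>");
        foreach (Match match in ParagraphRegex.Matches(html))
        {
            // Append only the content between <p ...> and </p>.
            result.Append(match.Groups["inner"].Value);
        }
        return result.Append("</p>").ToString();
    }
}
```

Anything outside a <p> element, such as whitespace between paragraphs, is dropped by this sketch.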
If you would like to split a string by another string, you may use string.Split(string[] separator, StringSplitOptions options), where separator is a string array containing at least one string used to split the input.
Example
//Initialize a string of name HTML as our HTML code
string HTML = "<p><span style=\"text-decoration: underline; color: #ff0000;\"><strong>para1</strong></span></p> <p style=\"text-align: center;\"><strong><span style=\"color: #008000;\">para2</span> स्द्स्द्सद्स्द para2 again<br /></strong></p> <p style=\"text-align: left;\"><strong><span style=\"color: #0000ff;\">para3</span><br /></strong></p>";
//Initialize a string array of name strSplit to split HTML with </p>
string[] strSplit = HTML.Split(new string[] { "</p>" }, StringSplitOptions.None);
//Initialize a string of name expectedOutput
string expectedOutput = "";
string stringToAppend = "";
//Initialize i as an int. Continue if i is less than strSplit.Length. Increment i by 1 each time you continue
for (int i = 0; i < strSplit.Length; i++)
{
if (i >= 1) //Continue if the index is greater or equal to 1; from the second item to the last item
{
stringToAppend = strSplit[i].Replace("<p", "<"); //Replace <p by <
}
else //Otherwise
{
stringToAppend = strSplit[i]; //Don't change anything in the string
}
//Append strSplit[i] to expectedOutput
expectedOutput += stringToAppend;
}
//Append </p> at the end of the string
expectedOutput += "</p>";
//Write the output to the Console
Console.WriteLine(expectedOutput);
Console.Read();
Output
<p><span style="text-decoration: underline; color: #ff0000;"><strong>para1</stro
ng></span> < style="text-align: center;"><strong><span style="color: #008000;">p
ara2</span> ?????????????? para2 again<br /></strong> < style="text-align: left;
"><strong><span style="color: #0000ff;">para3</span><br /></strong></p>
NOTICE: Because my program does not support Unicode characters, it could not read स्द्स्द्सद्स्द. Thus, it was translated as ??????????????.
Thanks,
I hope you find this helpful :)

Get colored texts within HTML code

I have some HTML code and I want to convert it to plain text but keep only the colored-text tags.
For example, when I have the HTML below:
<body>
This is a <b>sample</b> html text.
<p align="center" style="color:#ff9999">this is only a sample<p>
....
and some other tags...
</body>
</html>
I want the output:
this is a sample html text.
<#ff9999>this is only a sample<>
....
and some other tags...
I'd use a parser such as HtmlAgilityPack to parse the HTML, and use regular expressions to find the color value in attributes.
First, find all the nodes that contain a style attribute with a color defined in it by using XPath:
var doc = new HtmlDocument();
doc.LoadHtml(html);
var nodes = doc.DocumentNode
.SelectNodes("//*[contains(@style, 'color')]")
.ToArray();
Then the simplest regex to match a color value: (?<=color:\s*)#?\w+.
var colorRegex = new Regex(@"(?<=color:\s*)#?\w+", RegexOptions.IgnoreCase);
Then iterate through these nodes and if there is a regex match, replace the inner html of the node with html encoded tags (you'll understand why a little bit later):
foreach (var node in nodes)
{
var style = node.Attributes["style"].Value;
if (colorRegex.IsMatch(style))
{
var color = colorRegex.Match(style).Value;
node.InnerHtml =
HttpUtility.HtmlEncode("<" + color + ">") +
node.InnerHtml +
HttpUtility.HtmlEncode("</" + color + ">");
}
}
And finally get the inner text of the document and perform html decoding on it (this is because inner text strips all the tags):
var txt = HttpUtility.HtmlDecode(doc.DocumentNode.InnerText);
This should return something like this:
This is a sample html text.
<#ff9999>this is only a sample</#ff9999>
....
and some other tags...
Of course you could improve it for your needs.
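One possible improvement: the lookbehind pattern above also matches the value of background-color, because that property name ends in "color:" too. A minimal sketch of a stricter match follows. The names ColorStyle and ExtractColor are hypothetical, and the pattern only recognises simple values such as #ff9999 or red, not rgb(...).

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public static class ColorStyle
{
    // The negative lookbehind rejects "color" when it is the tail of a
    // longer property name such as background-color.
    private static readonly Regex ColorRegex = new Regex(
        @"(?<![\w-])color\s*:\s*(?<value>#?\w+)",
        RegexOptions.IgnoreCase);

    // Returns the text color from a style attribute value, or null if none is set.
    public static string ExtractColor(string style)
    {
        Match match = ColorRegex.Match(style ?? string.Empty);
        return match.Success ? match.Groups["value"].Value : null;
    }
}
```

In the loop above, node.Attributes["style"].Value could be passed to ExtractColor, skipping the node when the result is null.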
It is possible to do it using regular expressions but... You should not parse (X)HTML with regex.
The first regexp I came with to solve the problem is:
<p(\w|\s|[="])+color:(#([0-9a-f]{6}|[0-9a-f]{3}))">(\w|\s)+</p>
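Group 2 will be the colour including the leading # (group 3 holds just the 3 or 6 hex digits), and group 4 matches the text inside the tag one character at a time, so its final value is only the last character; the full text is in that group's captures.
Obviously, it's not the best solution, as I'm not a regexp master, and it needs some testing and probably generalisation. But it's still a good point to start with.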

Html Agility Pack + Get specific node

Hello, I have a problem with my application.
I need to pick out a specific piece of text between two nodes.
The html page looks like this
<td align="right" width="186">Text1</td>
<td align="center" width="51">? - ?</td>
<td width="186">Text2</td>
I can pick out Text1 and Text2 with:
HtmlNodeCollection cols = doc.DocumentNode.SelectNodes("//td[@width='186']");
foreach (HtmlNode col in cols)
{
if (col.InnerText == "Text1")
{
Label1.Text = col.InnerText;
}
}
The reason I have the if-condition is that there are more td's in the page, and I need to pick out specifically the one that has "Text1" in it.
But the problem is how to parse out the text "? - ?". There is more text in the document that also reads "? - ?", but I need to pick out specifically the one between my two other nodes.
The result should be Text1 ? - ? Text2, etc.
I guess it has something to do with the next child or sibling?
You can check col.NextSibling.InnerText.

Extract content in paragraph Tags

I have the following HTML in a string and I have to extract only the content in paragraph tags. Any ideas?
The link is http://www.public-domain-content.com/books/Coming_Race/C1P1.shtml
I have tried:
const string HTML_TAG_PATTERN = "<[^>]+.*?>";
static string StripHTML(string inputString)
{
return Regex.Replace(inputString, HTML_TAG_PATTERN, string.Empty);
}
It removes all HTML tags, but I don't want to remove all the tags, because the tags are how I can pick out content such as paragraphs.
Secondly, it turns line breaks into \n in the text, and applying Replace("\n", "") does not help.
One problem is that when I apply
int UrlStart = e.Result.IndexOf("<p>"), urlEnd = e.Result.IndexOf("<p> </p></td>\r" );
string paragraph = e.Result.Substring(UrlStart, urlEnd);
extractedContent.Text = paragraph.Replace(Environment.NewLine, "");
<p> </p></td>\r appears at the end of the paragraph, but urlEnd does not ensure that only the paragraph is shown.
The extracted string, as shown in Visual Studio, looks like this:
this page is downloaded by Webclient
End of HTMLpage
We will provide ourselves with ropes of\rsuitable length and strength- and- pardon me- you must not\rdrink more to-night. our hands and feet must be steady and\rfirm tomorrow.\"\r<p> </p> </td>\r </tr>\r\r <tr>\r <td height=\"25\" width=\"10%\">\r \r </td><td height=\"25\" width=\"80%\" align=\"center\">\r <font color=\"#FFFFFF\">\r <font size=\"4\">1</font> \r </font></td>\r <td height=\"25\" width=\"10%\" align=\"right\">Next</td>\r </tr>\r </table>\r </center>\r</div>\r<p align=\"center\"><b>The Coming Race -by- Edward Bulwer Lytton</b></p>\r<P><B><center>Encyclopedia - Books - Religion<a/> - <A HREF=\"http://www.public-domain-content.com/links2.shtml\">Links - Home - Message Boards</B><BR>This Wikipedia content is licensed under the <a href=\"http://www.gnu.org/copyleft/fdl.html\">GNU Fr
Don't use regular expressions to parse HTML. Use the HTML Agility Pack (or something similar) instead.
A quick example, but you could do something like this:
HtmlDocument document = new HtmlDocument();
document.Load("your_file_here.htm");
foreach(HtmlNode paragraph in document.DocumentNode.SelectNodes("//p"))
{
// do something with the paragraph node here
string content = paragraph.InnerText; // or something similar
}
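Separately from the parser approach, one likely reason the Substring call in the question returns too much is that the second argument of string.Substring is a length, not an end index. A minimal sketch of slicing between two markers follows. The names Slice and Between are hypothetical, and it assumes the markers appear literally in the downloaded text.

```csharp
using System;

// Hypothetical helper, not part of any library.
public static class Slice
{
    // Returns the text from startMarker up to (not including) endMarker.
    public static string Between(string text, string startMarker, string endMarker)
    {
        int start = text.IndexOf(startMarker, StringComparison.Ordinal);
        if (start < 0) return string.Empty;

        // Search for the end marker only after the start position.
        int end = text.IndexOf(endMarker, start, StringComparison.Ordinal);
        if (end < 0) return text.Substring(start);

        // Substring takes (startIndex, length), so pass the difference.
        return text.Substring(start, end - start);
    }
}
```

With the question's variables this would be Slice.Between(e.Result, "<p>", "<p> </p></td>\r").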
