Html Agility Pack + Get specific node - c#

Hello, I have a problem with my application.
I need to pick out a specific piece of text that sits between two nodes.
The HTML page looks like this:
<td align="right" width="186">Text1</td>
<td align="center" width="51">? - ?</td>
<td width="186">Text2</td>
I can pick out Text1 and Text2 with:
HtmlNodeCollection cols = doc.DocumentNode.SelectNodes("//td[@width='186']");
foreach (HtmlNode col in cols)
{
    if (col.InnerText == "Text1")
    {
        Label1.Text = col.InnerText;
    }
}
The reason I have the if-condition is that there are more td's in the page, and I need to pick out specifically the one that has "Text1" in it.
The problem is how to parse out the text "? - ?". Other places in the document also contain the text "? - ?", but I need specifically the one between my two other nodes.
The result should be Text1 ? - ? Text2, and so on.
I guess it has something to do with the next child or sibling?

You can check col.NextSibling.InnerText.
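One thing to watch for with NextSibling: when the markup has line breaks between the cells, the next sibling may be a whitespace text node rather than the following <td>. A minimal sketch that asks for the next <td> element explicitly is shown here. It assumes the HtmlAgilityPack package is referenced; the class name SiblingCellDemo and the method TextAfter are made up for illustration.

```csharp
using System;
using HtmlAgilityPack;

public static class SiblingCellDemo
{
    // Returns the text of the first <td> that follows the width-186 <td>
    // whose text equals label, or null when no such cell exists.
    public static string TextAfter(string html, string label)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection cells = doc.DocumentNode.SelectNodes("//td[@width='186']");
        if (cells == null) return null;

        foreach (HtmlNode cell in cells)
        {
            if (cell.InnerText.Trim() != label) continue;

            // following-sibling::td[1] skips whitespace text nodes between cells.
            HtmlNode next = cell.SelectSingleNode("following-sibling::td[1]");
            return next?.InnerText.Trim();
        }
        return null;
    }
}
```

Calling SiblingCellDemo.TextAfter(html, "Text1") on the markup from the question should give the text of the middle cell, which can then be joined with Text1 and Text2.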

Related

Agility Helper Html retrieving p/paragraphs text until another anchor is reached

I am using Html Agility Pack and so far I have code like this:
var linkWeb = new HtmlWeb();
var linkDoc = linkWeb.Load(link);
int i = 1;
foreach (HtmlNode l in linkDoc.DocumentNode.SelectNodes("//p"))
{
    Console.WriteLine("text #" + i++ + " " + l.InnerText);
}
This reads the text of each paragraph on the page just fine, except that I want it to read the text of all paragraphs combined until another anchor (a tag) is reached, or any better method you can think of.
<p>
PART 1
CONTENT1;
CONTENT2;
</p>
<p>CONTENT3.</p>
<p>
PART 2
CONTENT1
CONTENT2
CONTENT3
CONTENT4
</p>
<p>CONTENT5.</p>
<p>CONTENT6.</p>
<p>CONTENT8.</p>
<p>
PART 3
CONTENT1
CONTENT2
CONTENT3
CONTENT4.
</p>
So right now with the code I have, it reads the P text of each paragraph separately.
TEXT #1 is
CONTENT1
CONTENT2
TEXT # 2 is
CONTENT3.
I want this to read
TEXT #1 is
CONTENT1
CONTENT2
CONTENT3.
This is dynamic, and the number of paragraphs changes.
I need some kind of check so that, before hitting the next anchor, it reads all the paragraphs' InnerText and knows they belong to the same Text #.
You could implement this like:
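int i = 1;
foreach (HtmlNode l in linkDoc.DocumentNode.SelectNodes("//p"))
{
    // A paragraph that contains an anchor starts a new "text #" group (requires using System.Linq).
    if (l.ChildNodes.Any(node => node.Name == "a"))
    {
        Console.WriteLine();
        Console.Write("text #" + i++);
    }
    Console.Write(l.InnerText + " ");
}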

How would I Strip Html from a string and set a character limit?

I'm getting a string from a list of items. The string is currently displayed as "item.ItemDescription" (the 9th row below).
I want to strip out all HTML from this string, and set a character limit of 250 after the HTML is stripped.
Is there a simple way of doing this?
I saw a post saying to install HTML Agility Pack, but I was looking for something simpler.
EDIT:
It does not always contain HTML. If the client wanted to add a bold or italic tag to an item's name in the description, it would show up as <"strong">Item Name<"/strong">, for instance. I want to strip out all HTML no matter what is entered.
<tbody>
    @foreach (var item in Model.itemList)
    {
        <tr id="@("__filterItem_" + item.EntityId + "_" + item.EntityTypeId)">
            <td>
                @Html.ActionLink(item.ItemName, "Details", "Item", new { id = item.EntityId }, null)
            </td>
            <td>
                item.ItemDescription
            </td>
            <td>
                @if (Model.IsOwner)
                {
                    <a class="btnDelete" title="Delete" itemid="@(item.EntityId)" entitytype="@item.EntityTypeId" filterid="@Model.Id">Delete</a>
                }
            </td>
        </tr>
    }
</tbody>
Your best option, IMO, is to not get into a parsing nightmare with all the possible values. Why not simply inject a class=someCssClassName attribute into the <td>, then control the length, color, or whatever with CSS?
An even better idea is to assign a class to the containing <tr class=trClass> and then have the CSS apply lengths to child <td> elements.
You could do something like this to remove all tags (opening, closing, and self-closing) from the string, but it may have the unintended consequence of removing things the user entered that weren't meant to be html tags:
text = Regex.Replace(text, @"</?[^>]*/?>", String.Empty);
Instead, I would recommend something like this and letting the user know html isn't supported:
text = text.Replace("<", "&lt;");
text = text.Replace(">", "&gt;");
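Just remember to apply your 250 character limit before the conversion, and only when the text is actually longer than that, since Substring throws if the string is too short:
if (text.Length > 250) text = text.Substring(0, 250);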
This Regex will select any HTML tags (including the ones with double quotes, such as <"strong">):
<[^>]*>
Look here: http://regexr.com/3cge4
Using C# regular expressions to remove HTML tags
From there, you can simply check the string size and display appropriately.
var itemDescriptionStripped = Regex.Replace(item.ItemDescription, @"<[^>]*>", String.Empty);
// Substring returns a new string, so the result has to be assigned back.
if (itemDescriptionStripped.Length > 250)
    itemDescriptionStripped = itemDescriptionStripped.Substring(0, 250);
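The two steps above (strip, then limit) can be wrapped in one helper so the view only makes a single call. This is a minimal sketch using only System.Text.RegularExpressions; the class name HtmlText and the method StripAndTruncate are made up for illustration, and the tag pattern is the simple <[^>]*> one from this answer, so it will also remove any user text that happens to sit between < and >.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HtmlText
{
    // Same simple pattern as above: anything between '<' and the next '>'.
    private static readonly Regex TagPattern = new Regex(@"<[^>]*>", RegexOptions.Compiled);

    // Removes tag-like spans first, then cuts the result to at most maxLength characters.
    public static string StripAndTruncate(string input, int maxLength)
    {
        if (string.IsNullOrEmpty(input)) return string.Empty;

        string stripped = TagPattern.Replace(input, string.Empty);

        // Substring would throw on shorter strings, so check the length first.
        return stripped.Length <= maxLength
            ? stripped
            : stripped.Substring(0, maxLength);
    }
}
```

In the Razor view this could be used as @HtmlText.StripAndTruncate(item.ItemDescription, 250). Truncating after stripping means the limit counts visible characters rather than markup.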

Get colored texts within HTML code

I have some HTML code and I want to convert it to plain text, but keep only the tags for colored text.
For example, given the HTML below:
<body>
This is a <b>sample</b> html text.
<p align="center" style="color:#ff9999">this is only a sample</p>
....
and some other tags...
</body>
</html>
I want the output:
this is a sample html text.
<#ff9999>this is only a sample<>
....
and some other tags...
I'd use a parser such as HtmlAgilityPack to parse the HTML, and use regular expressions to find the color value in the attributes.
First, find all the nodes that contain style attribute with color defined in it by using xpath:
var doc = new HtmlDocument();
doc.LoadHtml(html);
var nodes = doc.DocumentNode
    .SelectNodes("//*[contains(@style, 'color')]")
    .ToArray();
Then the simplest regex to match a color value: (?<=color:\s*)#?\w+.
var colorRegex = new Regex(@"(?<=color:\s*)#?\w+", RegexOptions.IgnoreCase);
Then iterate through these nodes and if there is a regex match, replace the inner html of the node with html encoded tags (you'll understand why a little bit later):
foreach (var node in nodes)
{
    var style = node.Attributes["style"].Value;
    if (colorRegex.IsMatch(style))
    {
        var color = colorRegex.Match(style).Value;
        node.InnerHtml =
            HttpUtility.HtmlEncode("<" + color + ">") +
            node.InnerHtml +
            HttpUtility.HtmlEncode("</" + color + ">");
    }
}
And finally get the inner text of the document and perform html decoding on it (this is because inner text strips all the tags):
var txt = HttpUtility.HtmlDecode(doc.DocumentNode.InnerText);
This should return something like this:
This is a sample html text.
<#ff9999>this is only a sample</#ff9999>
....
and some other tags...
Of course you could improve it for your needs.
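One possible improvement, sketched here as an assumption rather than part of the answer above: the lookbehind (?<=color:\s*) also fires inside background-color or border-color, so a style that sets only a background would be treated as colored text. The variant below requires that "color" is not preceded by a word character or a hyphen. The names ColorStyle and ExtractColor are made up for illustration, and functional values such as rgb(255, 0, 0) are still not handled (only the leading word would be returned).

```csharp
using System;
using System.Text.RegularExpressions;

public static class ColorStyle
{
    // "color" must not be preceded by a word character or '-',
    // which rules out background-color, border-color and similar properties.
    private static readonly Regex ColorPattern =
        new Regex(@"(?<![\w-])color\s*:\s*(#?\w+)", RegexOptions.IgnoreCase);

    // Returns the value of the color property in a style attribute string,
    // or null when the style does not declare a text color.
    public static string ExtractColor(string style)
    {
        if (string.IsNullOrEmpty(style)) return null;

        Match m = ColorPattern.Match(style);
        return m.Success ? m.Groups[1].Value : null;
    }
}
```

Inside the loop above, ColorStyle.ExtractColor(style) could replace the IsMatch/Match pair: a null result means the node is skipped.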
It is possible to do it using regular expressions but... You should not parse (X)HTML with regex.
The first regexp I came up with to solve the problem is:
<p(\w|\s|[="])+color:(#([0-9a-f]{6}|[0-9a-f]{3}))">(\w|\s)+</p>
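With the pattern as written, group 2 will be the colour (# followed by 3 or 6 hexadecimal digits) and group 3 the hex digits alone. Group 4 matches the text inside the tag one character at a time, so to capture the whole text you would wrap it in another group, e.g. ((\w|\s)+).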
Obviously, it's not the best solution as I'm not a regexp master and obviously it needs some testing and probably generalisation... But still it's a good point to start with.

C#: Get exact substring from HTML code using IndexOf and LastIndexOf

I have an HTML page retrieved using GetResponseStream() in C#. I need an exact value (int) that comes from that page, which is different every time I run the program. The structure of the HTML code stays the same, though, in particular:
(...)
<td colspan="2" class="txtnormal"><div align="right"> TAX:</div></td>
<td class="txtnormal"><div align="right"><strong>0.00</strong></div></td>
<td colspan="2"> </td>
(...)
and
(...)
<td colspan="2"><div align="right" class="txtnormal">Total:</div></td>
<td class="txtnormal"><div align="right"><strong>10.00</strong></div></td>
<td colspan="2"> </td>
(...)
Notice that the same markup is repeated within the page (i.e. <td class="txtnormal"><div align="right"><strong>VALUE</strong></div></td>); the labels of the values (TAX and Total) are the only thing that differs (the actual values could even be the same).
I would like to store the Total value in a variable, that is, 10.00 in this case.
I tried this:
int first = responseFromServer.IndexOf("<td class= \"txtnormal\"><div align=\"right\"><strong>") + "<td class=\"txtnormal\"><div align=\"right\"><strong>".Length;
int last = responseFromServer.LastIndexOf("</strong></div></td>");
string value = responseFromServer.Substring(first, last - first);
But I get bad results: value ends up holding ALL of the HTML page up to the value (because of the difference I am computing).
Do you know how I could get the exact value, that is, the substring between the pieces of text I pasted?
Thank you very much.
To scrape from a page, you have a couple of options. The "best" is to use the DOM to find the node(s) in question and pull its value. If you can't use the DOM for some reason, you can move to regex and pull the value that way.
Your method is "okay" in many instances, as long as you can be sure the site owner will never set up another instance of "</strong></div></td>" anywhere downstream. This is a risky assumption.
What value are you getting for the string? That will tell you whether or not your particular pattern is working correctly. I would still consider the HTML DOM, as it is a more accurate way to traverse the nodes.
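If you do stay with IndexOf, one way to avoid depending on the first and last occurrence in the whole page is to anchor the search on the label and then search forward from there. This is a minimal sketch, assuming the value always sits in the first <strong> element after its label; the names LabelValue and ValueAfterLabel are made up for illustration.

```csharp
using System;

public static class LabelValue
{
    // Finds label, then returns the text of the first <strong>...</strong>
    // that follows it, or null when any of the pieces is missing.
    public static string ValueAfterLabel(string html, string label)
    {
        const string open = "<strong>";
        const string close = "</strong>";

        int labelPos = html.IndexOf(label, StringComparison.Ordinal);
        if (labelPos < 0) return null;

        // Search forward from the label, not from the start of the page.
        int start = html.IndexOf(open, labelPos, StringComparison.Ordinal);
        if (start < 0) return null;
        start += open.Length;

        // The first closing tag after the value, rather than the last one in the page.
        int end = html.IndexOf(close, start, StringComparison.Ordinal);
        if (end < 0) return null;

        // Substring takes a start index and a length.
        return html.Substring(start, end - start).Trim();
    }
}
```

ValueAfterLabel(responseFromServer, "Total:") returns the value as a string. Since the sample value has decimals, decimal.Parse with CultureInfo.InvariantCulture is a better fit than int.Parse for converting it.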
I think Regex is your friend here:
using System;
using System.Text.RegularExpressions;
namespace SimpleApp
{
    class Program
    {
        static void Main(string[] args)
        {
            Regex theRegex = new Regex(@">Total:<.+?<strong>(.+?)</strong>");
            string str = @"<td colspan=""2""><div align=""right"" class=""txtnormal"">Total:</div></td>" +
                @"<td class=""txtnormal""><div align=""right""><strong>10.00</strong></div></td>" +
                @"<td colspan=""2""> </td>";
            if (theRegex.Match(str).Success)
            {
                Console.WriteLine("Found Total of " + theRegex.Match(str).Result("$1"));
            }
            else
            {
                Console.WriteLine("Not found");
            }
            Console.ReadLine();
        }
    }
}
Obviously your HTML page might have other things that could trip this simple regular expression up but you get the idea.

Extract content in paragraph Tags

I have the following HTML in a string and I have to extract only the content inside the paragraph tags. Any ideas?
The link is http://www.public-domain-content.com/books/Coming_Race/C1P1.shtml
I have tried:
const string HTML_TAG_PATTERN = "<[^>]+.*?>";
static string StripHTML(string inputString)
{
    return Regex.Replace(inputString, HTML_TAG_PATTERN, string.Empty);
}
It removes all HTML tags, but I don't want to remove all of them, because the paragraph tags are how I can pick out the content.
Secondly, it turns line breaks into \n in the text, and applying Replace("\n", "") does not help.
One problem is that when I apply
int UrlStart = e.Result.IndexOf("<p>"), urlEnd = e.Result.IndexOf("<p> </p></td>\r" );
string paragraph = e.Result.Substring(UrlStart, urlEnd);
extractedContent.Text = paragraph.Replace(Environment.NewLine, "");
<p> </p></td>\r appears at the end of the paragraph, but urlEnd does not ensure that only the paragraph is shown.
The extracted string, as shown in Visual Studio, looks like this
(this page is downloaded by WebClient):
End of HTML page
We will provide ourselves with ropes of\rsuitable length and strength- and- pardon me- you must not\rdrink more to-night. our hands and feet must be steady and\rfirm tomorrow.\"\r<p> </p> </td>\r </tr>\r\r <tr>\r <td height=\"25\" width=\"10%\">\r \r </td><td height=\"25\" width=\"80%\" align=\"center\">\r <font color=\"#FFFFFF\">\r <font size=\"4\">1</font> \r </font></td>\r <td height=\"25\" width=\"10%\" align=\"right\">Next</td>\r </tr>\r </table>\r </center>\r</div>\r<p align=\"center\"><b>The Coming Race -by- Edward Bulwer Lytton</b></p>\r<P><B><center>Encyclopedia - Books - Religion<a/> - <A HREF=\"http://www.public-domain-content.com/links2.shtml\">Links - Home - Message Boards</B><BR>This Wikipedia content is licensed under the <a href=\"http://www.gnu.org/copyleft/fdl.html\">GNU Fr
Don't use regular expressions to parse HTML. Use the HTML Agility Pack (or something similar) instead.
A quick example, but you could do something like this:
HtmlDocument document = new HtmlDocument();
document.Load("your_file_here.htm");
// HtmlDocument exposes the root as DocumentNode; SelectNodes returns null when nothing matches.
foreach (HtmlNode paragraph in document.DocumentNode.SelectNodes("//p"))
{
    // do something with the paragraph node here
    string content = paragraph.InnerText; // or something similar
}
