Getting value from string using specific conditions - c#

I have HTML data in a string, and I need to extract only the paragraph values from it. Below is a sample of the HTML.
<html>
<head>
<title>
<script>
<div>
Some contents
</div>
<div>
<p> This is what i want </p>
<p> Select all data from p </p>
<p> Upto this is required </p>
</div>
<div>
Other html elements
</div>
How can I get the data from the paragraphs using string manipulation?
Desired Output
<div>
<p> This is what i want </p>
<p> Select all data from p </p>
<p> Upto this is required </p>
</div>

Give the div an ID, e.g.
<div id="test">
<p> This is what i want </p>
<p> Select all data from p </p>
<p> Upto this is required </p>
</div>
then use //div[@id='test']/p.
The solution broken down:
//div - All div elements
[@id='test'] - With an id attribute whose value is test
/p - The p elements that are direct children of that div
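
To see the expression at work without any third-party library, here is a minimal sketch that uses the XPath support in System.Xml. It assumes the markup is well-formed XML, which the sample in the question is not (its head, title and script tags are never closed), so real HTML would still need an HTML parser. The class and method names are hypothetical.

```csharp
using System.Collections.Generic;
using System.Xml;

public static class ParagraphXPathSketch
{
    // Returns the markup of every <p> that is a direct child of the div with the given id.
    // Assumption: the input is well-formed XML; LoadXml throws XmlException otherwise.
    public static List<string> SelectParagraphs(string wellFormedMarkup, string divId)
    {
        var doc = new XmlDocument();
        doc.LoadXml(wellFormedMarkup);

        var result = new List<string>();
        // divId is spliced into the expression, so it must not contain a single quote.
        XmlNodeList nodes = doc.SelectNodes("//div[@id='" + divId + "']/p");
        foreach (XmlNode p in nodes)
        {
            result.Add(p.OuterXml);
        }
        return result;
    }
}
```

Calling SelectParagraphs(markup, "test") on a well-formed version of the sample should return the three paragraphs and nothing from the other divs.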

I have used Html Agility Pack for something like this. Once the document is parsed, you can use LINQ to get what you want.

XPath is the obvious answer (if the HTML is decent, has a root element, etc.). Failing that, use a third-party component such as Chilkat.

If you use Html Agility Pack, as mentioned in the other answers, you can get all paragraph elements inside the div with:
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html); // html is the string holding your markup; Load() expects a file path or stream instead
var pNodes = doc.DocumentNode.SelectNodes("//div[@id='id of the div']/p");
Since you are using .NET Framework 2.0, you would want an older version of Agility Pack, which can be found here: HTML Agility Pack
If you want just the text inside the paragraphs, you can use
var pNodes = doc.DocumentNode.SelectNodes("//div[@id='id of the div']/p/text()");

Related

How to find and replace strings containing invalid HTML tags to valid tags

I have a string containing a list of html tags with invalid tag formatting.
For example, I have a string such as that below:
<p>
<strong>Scale:</strong>
</p>
<p>
<ul style="list-style-type:disc" class="pl-2">
 <li>2 to 4 nodes</li>
</ul>
</p>
<p>
<strong>Single Node Data:</strong>
</p>
<p>
<ul style="list-style-type:disc" class="pl-2">
 <li>CPU: 6-26 cores (Intel)</li>
 <li>RAM: 128GB to 2TB</li>
 <li>Raw storage: 240GB to 16TB</li>
 <li>Storage type: SSD + HDD</li>
 <li>Network speed: Up to 25Gb</li>
</ul>
</p><img src="xxxxx"/>
I need to replace self-closing tags (those ending with />) with an explicit closing tag, so that <img src="xxxxx"/> becomes <img src="xxxxx"></img>.
How would I achieve this using C#?
For what you are asking, you can go with either of the following options.
Option 1
You can use a third-party library that parses your HTML into tags (it effectively treats it as XML) and gives you each tag, with its content, in a string array or list.
You then loop over the list and check whether each closing tag is correct; if it is not, replace it with the correct one.
Here is the library
Option 2
You can create your own HTML parser, which gives you more control over the parsing logic. I found this example of a C# HTML parser on CodeProject that you can check out.
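
For a fragment as simple as the one in the question, a regular expression may be enough. The sketch below is not either of the options above; it is a third, cruder approach that rewrites every tag ending in /> into an open tag followed by a matching close tag. It assumes that no attribute value contains "<", ">" or the sequence "/>", and the class and method names are hypothetical.

```csharp
using System.Text.RegularExpressions;

public static class SelfClosingTagSketch
{
    // Group 1 = tag name, group 2 = attribute text (may be empty).
    // [^<>] keeps a match inside a single tag; \s* swallows a space before "/>".
    private static readonly Regex SelfClosing =
        new Regex(@"<(\w+)([^<>]*?)\s*/>", RegexOptions.Compiled);

    // Rewrites <name attrs/> as <name attrs></name> and leaves all other tags untouched.
    public static string Expand(string html)
    {
        return SelfClosing.Replace(html, "<$1$2></$1>");
    }
}
```

For example, SelfClosingTagSketch.Expand("<img src=\"xxxxx\"/>") returns <img src="xxxxx"></img>. For markup you do not control, a real parser is the safer choice.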

How to get parent node with ancestors xpath in html agility pack

How can I get the XPath of an ancestor node (the path through its own ancestors) for an element in an HTML document with HTML Agility Pack (HAP)? For example, given the HTML document below:
<html>
<body>
<div>
<div>
<div>
<div>
<div>
<div>
<div>
<div>
<a>
<h3>
</h3>
</a>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</body>
</html>
I need to get an ancestor node and its full path in HAP. For example, the XPath of the innermost element in the above HTML document is
/html/body/div/div[1]/div[2]/div[1]/div/div/div/div/a/h3
The expected XPath is
/html/body/div/div[1]/div[2]/div[1]
Note that I need to compute the expected XPath dynamically, not hardcode it manually. That is, starting from the last element, I need to get the ancestor's path.
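
If the element's XPath is already available as a string, one way to get an ancestor's path is to drop the trailing location steps. The sketch below does only that. The question does not say how to decide how far up to go, so the levels parameter is a hypothetical input, as are the class and method names. It assumes steps are separated by a single "/" and that no predicate contains a "/". HAP's HtmlNode also has ParentNode and XPath properties, so walking up from the node itself may be more robust than editing the string.

```csharp
using System;

public static class AncestorPathSketch
{
    // Drops the last `levels` location steps from an absolute XPath such as /html/body/div[1]/a.
    // Assumptions: steps are separated by single '/', and no predicate contains a '/'.
    public static string AncestorPath(string xpath, int levels)
    {
        if (levels < 0) throw new ArgumentOutOfRangeException(nameof(levels));

        string[] steps = xpath.Split(new[] { '/' }, StringSplitOptions.RemoveEmptyEntries);
        // Never keep fewer than zero steps; going past the root yields "/".
        int keep = Math.Max(steps.Length - levels, 0);
        return "/" + string.Join("/", steps, 0, keep);
    }
}
```

With the path from the question, going up six levels from h3 gives /html/body/div/div[1]/div[2]/div[1].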

Retrieving specific URLs with HtmlAgilityPack C#

I'm currently attempting to use HtmlAgilityPack to extract specific links from an html page. I tried using plain C# to force my way in but that turned out to be a real pain. The links are all inside of <div> tags that all have the same class. Here's what I have:
HtmlWeb web = new HtmlWeb();
HtmlDocument html = web.Load(url);
// This should select only the <div> tags with the class acTrigger
foreach (HtmlNode node in html.DocumentNode.SelectNodes("//div[@class='acTrigger']"))
{
// Not sure how to dig further in to get the href values from each of the <a> tags
}
and the site's markup looks something like this:
<li>
<div class="acTrigger">
<a href="/16014988/d/" onclick="return queueRefinementAnalytics('Category','Battery')">
Battery <em> (1)</em>
</a>
</div>
</li>
<li>
<div class="acTrigger">
<a href="/15568540/d/" onclick="return queueRefinementAnalytics('Category','Brakes')">
Brakes <em> (2)</em>
</a>
</div>
</li>
<li>
<div class="acTrigger">
<a href="/11436914/d/1979-honda-ct90-cables-lines" onclick="return queueRefinementAnalytics('Category','Cables/Lines')">
Cables/Lines <em> (1)</em>
</a>
</div>
</li>
There are a lot of links on this page, but the hrefs I need are in the <a> tags nested inside the <div class="acTrigger"> tags. It would be simple if each <a> had its own class, but unfortunately only the <div> tags have classes. What I need to do is grab each of those hrefs and store them so I can retrieve them later, go to each page, and retrieve more information from each page. I just need a nudge in the right direction to get over this hump; then I should be able to handle the other pages as well. I have no previous experience with HtmlAgilityPack, and all the examples I find seem to extract every URL from a page, not specific ones. A link to an example or to documentation would be enough. Any help is greatly appreciated.
You should be able to change your select to include the <a> tag: //div[@class='acTrigger']/a. That way each HtmlNode is the <a> tag instead of the div.
To read the link from each node, you can use GetAttributeValue.
foreach (HtmlNode node in html.DocumentNode.SelectNodes("//div[@class='acTrigger']/a"))
{
// Get the value of the HREF attribute.
string hrefValue = node.GetAttributeValue( "href", string.Empty );
// Then store hrefValue for later.
}
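
The hrefs in the sample are site-relative (for example /16014988/d/), so they need to be resolved against the page address before you can visit them later. A minimal sketch of that storing step, using only System.Uri, is below. The class and method names are hypothetical, and pageUri stands for whatever URL was passed to web.Load.

```csharp
using System;
using System.Collections.Generic;

public static class LinkResolverSketch
{
    // Turns the href values collected from the <a> tags into absolute URIs.
    // Empty values (such as the string.Empty fallback of GetAttributeValue) are skipped.
    public static List<Uri> Resolve(Uri pageUri, IEnumerable<string> hrefs)
    {
        var result = new List<Uri>();
        foreach (string href in hrefs)
        {
            if (string.IsNullOrWhiteSpace(href)) continue;

            Uri absolute;
            // TryCreate resolves relative references against pageUri and passes absolute ones through.
            if (Uri.TryCreate(pageUri, href, out absolute))
            {
                result.Add(absolute);
            }
        }
        return result;
    }
}
```

Inside the loop above you would add each hrefValue to a List<string>, then pass that list to Resolve once the loop has finished.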

Parse a div with HTML Agility Pack

I have this HTML code:
<div class="singolo-contenuto link_azure">
<p><img src="" class="left pad2 field_foto" alt="" /><p> Message </p>
</div>
I need to capture "Message".
I'm trying with:
String message = htmlDoc.DocumentNode.SelectSingleNode("//div[@class='singolo-contenuto link_azure']").InnerText;
but it doesn't work: I get back a large part of the full page. What's wrong?
The XPath expression you have only gets you to the <div> tag. You need to go deeper, to the last <p> tag. This should work:
var message = htmlDoc.DocumentNode.SelectSingleNode("//div[@class='singolo-contenuto link_azure']//p[last()]").InnerText;

HTML agility pack get all divs with class

I am trying to scrape some complicated HTML. I need to get some text from divs with a certain class.
What I want is for HTML Agility Pack to go over the whole document, find all divs whose class contains "listevent", and return those to me.
When I searched online, I found that this is possible if I map out the full path to each div, but some of these divs are nested under so many other divs that I am looking for an easier way.
The HTML looks like this
<div>
<div>
<table>
<tr>
<td>
<div class="thisone listevent"></td>
<td>
<div class="thisone listevent"></td>
</tr>
</table>
</div>
</div>
You could use the SelectNodes method:
foreach(HtmlNode div in document.DocumentNode.SelectNodes("//div[contains(@class,'listevent')]"))
{
}
If you are more familiar with CSS-style selectors, try Fizzler and do this:
document.DocumentNode.QuerySelectorAll("div.listevent");
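
One caveat about the XPath version: contains(@class,'listevent') is a substring test, so it also matches a class such as listevents-archive, whereas the CSS selector div.listevent matches whole class names only. The usual XPath 1.0 idiom for a whole-word match pads the attribute with spaces. The sketch below compares the two expressions using System.Xml on a small well-formed fragment; the fragment, the class name and the method name are made up for illustration, and the same expression strings can be passed to SelectNodes in HAP.

```csharp
using System.Xml;

public static class ClassTokenSketch
{
    // Substring test: matches any class attribute that contains the text "listevent".
    public const string SubstringMatch = "//div[contains(@class,'listevent')]";

    // Token test: matches only when "listevent" is one of the space-separated class names.
    public const string TokenMatch =
        "//div[contains(concat(' ', normalize-space(@class), ' '), ' listevent ')]";

    // Counts the nodes an expression selects. Assumes well-formed XML input.
    public static int Count(string wellFormedMarkup, string xpath)
    {
        var doc = new XmlDocument();
        doc.LoadXml(wellFormedMarkup);
        return doc.SelectNodes(xpath).Count;
    }
}
```

On a fragment with the classes "thisone listevent", "listevent" and "listevents-archive", the substring expression selects all three divs and the token expression selects only the first two.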
