HTML AGILITY PACK Parsing Div Blocks

I need to parse items from an online shop: I need each item's name and price. Each item block sits in its own div inside a catalog div that holds all the items.
I tried the code below and it mostly works, but I would prefer to parse both name and price in one loop. How might I do so? Thanks!
var url = "http://bestaqua.com.ua/catalog/filtry-obratnogo-osmosa";
HtmlWeb web = new HtmlWeb();
HtmlDocument HtmlDoc = web.Load(url);
var RootNode = HtmlDoc.DocumentNode;
foreach (HtmlNode node in
    HtmlDoc.DocumentNode.SelectNodes("//div[@class='catalog_blocks']"))
{
    foreach (HtmlNode item_name in
        node.SelectNodes("//div[@class='catalog_blocks-item-name']"))
    {
        string name = item_name.InnerText;
        System.Diagnostics.Debug.Write("NAME :" + name + "\n");
    }
    foreach (HtmlNode item_price in
        node.SelectNodes("//span[@class='price-new']"))
    {
        string price = item_price.InnerText;
        System.Diagnostics.Debug.Write("PRICE: " + price + "\n");
    }
}

Since SelectNodes takes an XPath expression, you could combine both class filters with the union operator "|", which yields a single collection to loop over.
Note that you would still need to check, inside the foreach loop, which kind of element each node is.
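
A minimal sketch of the union approach, assuming the class names from the question and a small made-up HTML sample in place of the live page. The names CatalogDemo and ParseItems are hypothetical, and the pairing logic assumes each item has one name div followed by one price span, with the union result coming back in document order.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class CatalogDemo
{
    // Hypothetical helper: returns (name, price) pairs found in the catalog HTML.
    public static List<KeyValuePair<string, string>> ParseItems(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var items = new List<KeyValuePair<string, string>>();

        // One query for both kinds of element, joined with the XPath union operator.
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes(
            "//div[@class='catalog_blocks-item-name'] | //span[@class='price-new']");
        if (nodes == null)
            return items; // SelectNodes returns null when nothing matches

        string currentName = null;
        foreach (HtmlNode node in nodes)
        {
            if (node.Name == "div")
            {
                // A name div: remember it until its price shows up.
                currentName = node.InnerText.Trim();
            }
            else if (currentName != null)
            {
                // A price span: pair it with the most recent name.
                items.Add(new KeyValuePair<string, string>(currentName, node.InnerText.Trim()));
                currentName = null;
            }
        }
        return items;
    }

    public static void Main()
    {
        // Made-up sample markup standing in for the shop page.
        string html =
            "<div class='catalog_blocks'>" +
            "<div><div class='catalog_blocks-item-name'>Filter A</div><span class='price-new'>100</span></div>" +
            "<div><div class='catalog_blocks-item-name'>Filter B</div><span class='price-new'>250</span></div>" +
            "</div>";

        foreach (var item in ParseItems(html))
            Console.WriteLine("NAME: " + item.Key + " PRICE: " + item.Value);
    }
}
```

If an item can lack a price, iterating over the item blocks and querying each one with a relative XPath (starting with ".//") is likely the safer design, since it never pairs a name with another item's price.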

Related

scrape text from web with agility

I'm having real trouble locating the text on this website by node.
I've tried all sorts of XPaths in the SelectNodes call.
Does anyone have any ideas?
HtmlAgilityPack.HtmlWeb web = new HtmlAgilityPack.HtmlWeb();
HtmlAgilityPack.HtmlDocument doc = web.LoadFromBrowser("https://app.box.com/s/v2l2cd1mwhemijbigv88nyfk592rjei0");
HtmlNode[] nodes = doc.DocumentNode.SelectNodes("//@*[starts-with(local-name(),'bcpr9')]").ToArray();
foreach (HtmlNode item in nodes)
{
    textBox1.Text = item.InnerText;
}

Your code only puts the text from the last node into the text box, because every iteration of the foreach loop overwrites the previous value. Append instead:
textBox1.Text += item.InnerText;

I'm using HtmlAgilityPack to extract some data from a website but I can't figure out what the issue is

string Url = "https://www.rottentomatoes.com/browse/dvd-all/?services=netflix_iw";
HtmlAgilityPack.HtmlDocument htmlDoc = new HtmlWeb().Load(Url);
foreach (var node in htmlDoc.DocumentNode.SelectNodes("/html/body[@class='body ']/div[@class='body_main container']/div[@id='main_container']/div[@id='main-row']/div[@id='content-column']/div[@id='movies-collection']/div[@class='mb-movies list-view']/div[@class='mb-movie']"))
{
    string movieTitle = node.InnerText;
    richTextBox1.Text += movieTitle + System.Environment.NewLine;
}
I want to extract all the movie titles from this URL by navigating with XPath. Visual Studio reports a missing object reference. Why? Could you try it for this particular case?

The following piece of code worked for me:
string Url = "https://www.rottentomatoes.com/browse/dvd-all/?services=netflix_iw";
HtmlAgilityPack.HtmlDocument htmlDoc = new HtmlWeb().Load(Url);
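IEnumerable<string> movieTitles = from movieNode in htmlDoc.DocumentNode.Descendants()
                                  where movieNode.GetAttributeValue("class", "").Equals("movieTitle")
                                  select movieNode.InnerHtml;
It uses LINQ (System.Linq) to find the nodes whose class is movieTitle. As for the error: SelectNodes returns null when the XPath matches nothing, so the foreach in the question fails with a null reference as soon as the long absolute path does not match the loaded page.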

Get all divs under a div with known ID and iterate over it

I'm building an app that crawls OkCupid matches. Their match result contains Html that looks like this.
<div id="match_results">
<div>person1</div>
<div>person2</div>
<div>person3</div>
</div>
I want to loop over each person's div inside the match_results div. However, something's not quite right with my C# code: matchesList only contains one element (the outer div itself, rather than the divs inside it).
HtmlDocument matchesHtmlDoc = new HtmlDocument();
matchesHtmlDoc.LoadHtml(matches);
string matchResultDivId = "match_results";
// match results
HtmlNodeCollection matchesList = matchesHtmlDoc.DocumentNode.SelectNodes("//div[@id = '" + matchResultDivId + "']");
foreach (HtmlNode match in matchesList)
{
    //test
    Console.WriteLine(match.ToString());
}

You forgot to select child divs:
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(matches);
string matchResultDivId = "match_results";
string xpath = String.Format("//div[@id='{0}']/div", matchResultDivId);
var people = doc.DocumentNode.SelectNodes(xpath).Select(p => p.InnerText);
foreach (var person in people)
    Console.WriteLine(person);
Output:
person1
person2
person3

Concatenate XML Node data

I have XML look like this
<BoxResult>
  <DocumentType>BCN</DocumentType>
  <DocumentID>BCN_20131113_1197005001#854#11XEZPADAHANDELC</DocumentID>
  <DocumentVersion>1</DocumentVersion>
  <ebXMLMessageId>CENTRAL_MATCHING</ebXMLMessageId>
  <State>FAILED</State>
  <Timestamp>2013-11-13T13:02:57</Timestamp>
  <Reason>
    <ReasonCode>efet:IDNotFound</ReasonCode>
    <ReasonText>Unknown Sender</ReasonText>
  </Reason>
  <Reason>
    <ReasonCode>efet:IDNotFound</ReasonCode>
    <ReasonText>Unknown Receiver</ReasonText>
  </Reason>
</BoxResult>
In my C# code, I need to parse the XML and concatenate the ReasonText values.
Basically, I need the output to be: Unknown Sender ; Unknown Receiver
I tried the following code, but I am not getting the desired output:
XmlNodeList ReasonNodeList = xmlDoc.SelectNodes("/BoxResult/Reason");
foreach (XmlNode xmln in ReasonNodeList)
{
    ReasonText = ReasonText + ";" + xmlDoc.SelectSingleNode("/BoxResult/Reason/ReasonText").InnerXml.ToString();
}
if (ReasonText != " ")
{
    ReasonText = ReasonText.Substring(1);
}
The output I get from this code is: Unknown Sender ; Unknown Sender
It does not display Unknown Receiver.
Any advice would be appreciated.

You are always using the same node to retrieve the data. Inside the loop you query xmlDoc every time, so SelectSingleNode returns the ReasonText of the first <Reason> node on each iteration instead of the one belonging to the current node.
XmlNodeList ReasonNodeList = xmlDoc.SelectNodes("/BoxResult/Reason/ReasonText"); //change here
foreach (XmlNode xmln in ReasonNodeList)
{
    ReasonText = ReasonText + ";" + xmln.InnerXml.ToString(); //change here
}
if (ReasonText != " ")
{
    ReasonText = ReasonText.Substring(1);
}

You're iterating through the <Reason> nodes, but each time you select the first /BoxResult/Reason/ReasonText node in the document (note that you never use your xmln variable).
By the way, here's a shorter version (it replaces your whole code block):
ReasonText += String.Join(";",
    xmlDoc.SelectNodes("/BoxResult/Reason/ReasonText")
          .Cast<XmlNode>()
          .Select(n => n.InnerText));
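
For reference, a self-contained sketch of the same idea that can be run as-is. It trims the XML from the question down to the two Reason elements, and the ReasonDemo and JoinReasons names are made up for illustration. It joins with " ; " rather than ";" to match the output format requested in the question.

```csharp
using System;
using System.Linq;
using System.Xml;

public static class ReasonDemo
{
    // Hypothetical helper: joins every ReasonText value with " ; ".
    public static string JoinReasons(string xml)
    {
        var xmlDoc = new XmlDocument();
        xmlDoc.LoadXml(xml);

        // XmlNodeList is non-generic, so Cast<XmlNode>() is needed before LINQ can be used.
        return string.Join(" ; ",
            xmlDoc.SelectNodes("/BoxResult/Reason/ReasonText")
                  .Cast<XmlNode>()
                  .Select(n => n.InnerText));
    }

    public static void Main()
    {
        string xml =
            "<BoxResult>" +
            "<Reason><ReasonCode>efet:IDNotFound</ReasonCode><ReasonText>Unknown Sender</ReasonText></Reason>" +
            "<Reason><ReasonCode>efet:IDNotFound</ReasonCode><ReasonText>Unknown Receiver</ReasonText></Reason>" +
            "</BoxResult>";

        Console.WriteLine(JoinReasons(xml)); // prints "Unknown Sender ; Unknown Receiver"
    }
}
```

Because String.Join only places the separator between elements, there is no leading ";" to strip afterwards, and a document with no Reason elements simply yields an empty string.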

Parsing XML for Converting Timed Text Markup to WebVTT

I'm working on a web application that takes a subtitle file in either Timed Text Markup Language (TTML) or WebVTT format. If the file is TTML, I want to translate it to WebVTT. This mostly works; the one problem is that if the TTML has HTML as part of the text content, the HTML tags get dropped.
For example:
<p begin="00:00:08.18" dur="00:00:03.86">(Music<br />playing)</p>
results in:
(Musicplaying)
The code I use is:
private const string TIME_FORMAT = "hh\\:mm\\:ss\\.fff";

XmlDocument xmldoc = new XmlDocument();
xmldoc.Load(fileLocation);
XDocument xdoc = xmldoc.ToXDocument();
var ns = (from x in xdoc.Root.DescendantsAndSelf()
          select x.Name.Namespace).First();
List<TTMLElement> elements =
    (
        from item in xdoc.Descendants(ns + "body").Descendants(ns + "div").Descendants(ns + "p")
        select new TTMLElement
        {
            text = item.Value,
            startTime = TimeSpan.Parse(item.Attribute("begin").Value),
            duration = TimeSpan.Parse(item.Attribute("dur").Value),
        }
    ).ToList<TTMLElement>();
StringBuilder sb = new StringBuilder();
sb.AppendLine("WEBVTT");
sb.AppendLine();
for (int i = 0; i < elements.Count; i++)
{
    sb.AppendLine(i.ToString());
    sb.AppendLine(elements[i].startTime.ToString(TIME_FORMAT) + " --> " + elements[i].startTime.Add(elements[i].duration).ToString(TIME_FORMAT));
    sb.AppendLine(elements[i].text);
    sb.AppendLine();
}
Any thoughts on what I'm missing, on a better way of doing this, or on an existing solution for converting Timed Text to WebVTT would be appreciated. Thanks.

I finally came back to this project and I also found a solution to my problem.
First in this section:
from item in xdoc.Descendants(ns + "body").Descendants(ns + "div").Descendants(ns + "p")
select new TTMLElement
{
    text = item,
    startTime = TimeSpan.Parse(item.Attribute("begin").Value),
    endTime = item.Attribute("dur") != null ?
        TimeSpan.Parse(item.Attribute("begin").Value).Add(TimeSpan.Parse(item.Attribute("dur").Value)) :
        TimeSpan.Parse(item.Attribute("end").Value)
}
item is of type XElement, so an XmlReader can be created from it, which leads to this function:
private static string ReadInnerXML(XElement parent)
{
    var reader = parent.CreateReader();
    reader.MoveToContent();
    var innerText = reader.ReadInnerXml();
    return innerText;
}
Since my goal was to strip the HTML tags inside the node (replacing each tag with a space so the surrounding words don't run together), I modified the function to look like this:
private static string ReadInnerXML(XElement parent)
{
    var reader = parent.CreateReader();
    reader.MoveToContent();
    var innerText = reader.ReadInnerXml();
    innerText = Regex.Replace(innerText, "<.+?>", " ");
    return innerText;
}
Finally, the query above ends up looking like this:
from item in xdoc.Descendants(ns + "body").Descendants(ns + "div").Descendants(ns + "p")
select new TTMLElement
{
    text = ReadInnerXML(item),
    startTime = TimeSpan.Parse(item.Attribute("begin").Value),
    endTime = item.Attribute("dur") != null ?
        TimeSpan.Parse(item.Attribute("begin").Value).Add(TimeSpan.Parse(item.Attribute("dur").Value)) :
        TimeSpan.Parse(item.Attribute("end").Value)
}
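
To see the difference between the approaches side by side, here is a small runnable sketch using the <p> element from the question. The InnerXmlDemo class and the ReadInnerText name are made up for illustration, and the sample has no XML namespace, unlike a real TTML file, where the inner XML may also carry namespace declarations.

```csharp
using System;
using System.Text.RegularExpressions;
using System.Xml.Linq;

public static class InnerXmlDemo
{
    // Returns the markup inside the element, child tags included.
    public static string ReadInnerXml(XElement parent)
    {
        using (var reader = parent.CreateReader())
        {
            reader.MoveToContent();
            return reader.ReadInnerXml();
        }
    }

    // Same as above, but with every tag replaced by a single space.
    public static string ReadInnerText(XElement parent)
    {
        return Regex.Replace(ReadInnerXml(parent), "<.+?>", " ");
    }

    public static void Main()
    {
        XElement p = XElement.Parse(
            "<p begin=\"00:00:08.18\" dur=\"00:00:03.86\">(Music<br />playing)</p>");

        // Value concatenates the text nodes only, so the words run together.
        Console.WriteLine(p.Value); // prints "(Musicplaying)"

        // The inner XML keeps the <br /> element in the output.
        Console.WriteLine(ReadInnerXml(p));

        // Tags are replaced by spaces, so the words stay separated.
        Console.WriteLine(ReadInnerText(p));
    }
}
```

Which variant fits depends on the target: keeping the inner XML preserves the markup, while the regex version is a rough way to get plain cue text and would need more care if the content could contain a literal "<".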

Microsoft has a tool which generates both formats:
HTML5 Video Caption Maker
This demo allows you to create simple video caption files. Start by loading a video in a format your browser can play. Then alternately play and pause the video, entering a caption for each segment.
If you have a saved WebVTT or TTML caption file for your video, you may load it, edit the text of existing segments, or append new segments.
If you want to do it programmatically, answers to other questions may help.
