Parsing HTML to get the key and value - C#

I use the Html Agility Pack to parse an HTML document.
I downloaded the DLL from CodePlex and referenced it in my project.
Now all I need is to parse the HTML below:
<HTML>
<BODY>
//......................
<tbody ID='image'>
<tr><td>Video Codec</td><td colspan=2>JPEG (8192 KBytes)</td></tr>
</BODY>
I need to retrieve the key Video Codec and its value JPEG from the above HTML.
I know that I can use the Html Agility Pack, but how do I do that?
var document = new HtmlDocument();
string htmlString = "<tbody ID='image'>";
document.LoadHtml(htmlString);
// how to get the Video Codec and its value `JPEG` ?
Any pointers are much appreciated.
EDIT:
I was able to make some progress from @itedi's answer, but I am still stuck.
var cells = document.DocumentNode
// use the right XPath rather than looping manually
.SelectNodes(@"//table")
.ToList();
var tbodies = cells.First().SelectNodes(@"//tbody").ToList();
This gives me all the tbody elements, but how do I print the values from them?

A much lighter way would be using regex:
string s = @"<tbody ID='image'>
<tr><td>Video Codec</td><td colspan=2>JPEG (8192 KBytes)</td></tr>
</BODY>";
var results = Regex.Match(s, "<td>Video Codec</td><td.*?>(.+?)</td>").Groups[1];
Returns:
JPEG (8192 KBytes)
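For comparison, here is a minimal sketch of the same lookup done with Html Agility Pack instead of a regex. It assumes the HtmlAgilityPack NuGet package is referenced and that every row holds a label cell followed by a value cell. The names TableRowReader and ReadPairs are made up for illustration and are not part of the library.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class TableRowReader
{
    // Hypothetical helper: maps the first cell of each row to the second cell.
    public static Dictionary<string, string> ReadPairs(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var pairs = new Dictionary<string, string>();

        // SelectNodes returns null (not an empty list) when nothing matches.
        var rows = doc.DocumentNode.SelectNodes("//tbody[@id='image']/tr");
        if (rows == null) return pairs;

        foreach (var row in rows)
        {
            // "./td" keeps the search relative to the current row.
            var cells = row.SelectNodes("./td");
            if (cells == null || cells.Count < 2) continue;

            pairs[cells[0].InnerText.Trim()] = cells[1].InnerText.Trim();
        }
        return pairs;
    }
}
```

Calling TableRowReader.ReadPairs(s) with the string from the regex answer and looking up "Video Codec" in the result gives the value cell's text. The null checks matter because SelectNodes does not return an empty collection on a miss.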

Related

c# Html Agility pack getting div and span nodes

This is the HTML document from which I am trying to extract the highlighted data.
I have read a lot on this site but was unable to find a solution that was helpful.
I tried using
nodes = doc.DocumentNode.SelectNodes(table_title + "/tbody/tr/td");
headers = nodes.Elements("span").Select(d => d.InnerText.Trim());
foreach (var this_header in headers)
{
// each item is already the trimmed text of a span
string location = this_header;
Console.WriteLine(location);
}
This does not give me the correct information. How do I find the specific content I am looking for?
What is this /tbody/tr/td? There is no table in that document at all.
You have to pass a unique selector (XPath, CSS, id) to SelectNodes.
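As a rough illustration of selecting by a unique id, the sketch below reads the span elements directly inside one div. The original markup was not posted, so the structure (a div with an id that contains span children) and the names SpanReader and ReadSpans are assumptions. It also assumes the HtmlAgilityPack NuGet package is referenced and that the id contains no quote characters.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class SpanReader
{
    // Hypothetical helper: returns the text of every <span> that is a direct
    // child of the <div> with the given id.
    public static List<string> ReadSpans(string html, string divId)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Anchor the XPath on the id so only one div can match.
        var spans = doc.DocumentNode.SelectNodes("//div[@id='" + divId + "']/span");
        if (spans == null) return new List<string>();

        // DeEntitize turns entities such as &amp; back into plain characters.
        return spans
            .Select(s => HtmlEntity.DeEntitize(s.InnerText).Trim())
            .ToList();
    }
}
```

If the target elements only carry a class, the same shape should work with a predicate such as contains(@class, 'some-class') in place of the id test.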

Extracting string from Html page using C#

I have a source HTML page and I want to do the following:
Extract a specific string from the whole HTML page and save the chosen string in a new HTML page.
Create a MySQL database with 4 columns.
Import the data from the HTML page into the MySQL table.
I would be very grateful if someone could help me with this, because I don't have much experience with C#.
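You could use this code:
HttpClient http = new HttpClient();
// I have used ebay.com here; you could use any site. HttpClient needs an absolute URI.
var response = await http.GetByteArrayAsync("https://www.ebay.com");
// decode the whole buffer; stopping at Length - 1 would drop the last byte
String source = Encoding.GetEncoding("utf-8").GetString(response, 0, response.Length);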
source = WebUtility.HtmlDecode(source);
HtmlDocument Nodes = new HtmlDocument();
Nodes.LoadHtml(source);
In the Nodes object, you will have all the DOM elements in the HTML page.
You could use linq to filter out whatever you need.
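Example:
// GetAttributeValue returns the fallback ("") when the attribute is missing, so nodes without a class do not throw
List<HtmlNode> RequiredNodes = Nodes.DocumentNode.Descendants()
.Where(x => x.GetAttributeValue("class", "").Contains("List-Item")).ToList();
You will probably need to install the Html Agility Pack NuGet package or download it from the link.
Hope this helps.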

HTML parsing from C#

I'm trying to parse some HTML files which don't always have the exact same format. Nevertheless, I've been able to find some patterns which are common to all the files.
For example, this is one of the files:
https://www.sec.gov/Archives/edgar/data/63908/000006390816000103/mcd-12312015x10k.htm#sFBA07EFA89A85B6DB59920A55B5021BC
I've seen that all the files I need have a unique tag whose InnerText equals "Financial Statements and Supplementary Data". I cannot search directly for that string, as it appears repeatedly throughout the text. I used this code to find that tag:
HtmlWeb hw = new HtmlWeb();
HtmlAgilityPack.HtmlDocument doc = hw.Load(m_strFilePath);
foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//a[@href]"))
{
if (link.InnerText.Contains("Financial Statements"))
{
}
}
I was wondering if there is any way to get the position of this tag in the HTML string, so I can get the data I need by doing something like:
dataNeeded = html.Substring(indexOfAnchorTag);
Thanks a lot
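One possible approach, sketched below, is to read the StreamPosition property that Html Agility Pack records on each node. This assumes the HtmlAgilityPack NuGet package is referenced, that the document was loaded from a string with LoadHtml, and that StreamPosition corresponds to a character offset into that string, which is worth checking against the version in use. The names AnchorLocator and TailFromAnchor are hypothetical.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

public static class AnchorLocator
{
    // Hypothetical helper: returns the HTML from the first <a href> whose text
    // contains the phrase up to the end of the document, or null if none matches.
    public static string TailFromAnchor(string html, string phrase)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null when the document has no matching anchors.
        var anchors = doc.DocumentNode.SelectNodes("//a[@href]");
        if (anchors == null) return null;

        var match = anchors.FirstOrDefault(a => a.InnerText.Contains(phrase));
        if (match == null) return null;

        // Assumption: StreamPosition is the offset of the node's opening "<"
        // in the text that was passed to LoadHtml.
        return html.Substring(match.StreamPosition);
    }
}
```

Depending on what is needed from the section, walking the tree from the matched node (for example through its parent's following siblings) may be more robust than cutting the raw string.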

text returning as NULL using htmlagility pack + xpath

I'm currently playing around with the Html Agility Pack; however, I don't seem to be getting any data back from the following URL:
http://cloud.tfl.gov.uk/TrackerNet/LineStatus
This is the code I'm using:
var url = @"http://cloud.tfl.gov.uk/TrackerNet/LineStatus";
var webGet = new HtmlWeb();
var doc = webGet.Load(url);
However, when I check the contents of 'doc', the text value is null. I've tried other URLs and I receive the HTML used on those sites. Is it just this particular URL, or am I doing something wrong? Any help would be appreciated.
HtmlAgilityPack is an HTML parser, so you won't succeed in parsing a non-HTML resource such as the XML you want to parse.
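For an XML response, the framework's own LINQ to XML classes are usually enough. The sketch below parses a small sample with XDocument. The sample document, its element and attribute names, and the names LineStatusReader and ReadStatuses are all hypothetical; the real feed's structure (and any XML namespace it declares) would need to be checked first.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class LineStatusReader
{
    // Hypothetical sample; the real feed's names and nesting may differ.
    public const string SampleXml =
        "<Lines>" +
        "<Line Name=\"Central\" Status=\"Good Service\" />" +
        "<Line Name=\"Victoria\" Status=\"Minor Delays\" />" +
        "</Lines>";

    // Maps each line's Name attribute to its Status attribute.
    public static Dictionary<string, string> ReadStatuses(string xml)
    {
        XDocument doc = XDocument.Parse(xml);
        return doc.Descendants("Line")
            .ToDictionary(
                e => (string)e.Attribute("Name"),
                e => (string)e.Attribute("Status"));
    }
}
```

To read straight from a URL, XDocument.Load accepts a URI string in place of XDocument.Parse. If the feed declares a default namespace, the element name has to be qualified with an XNamespace before Descendants will match anything.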

Need help for parsing HTML in C#

For personal use, I am trying to parse a small HTML page that shows the results of the French soccer championship in a simple grid.
var Url = "http://www.lfp.fr/mobile/ligue1/resultat.asp?code_jr_tr=J01";
WebRequest req = WebRequest.Create(Url);
WebResponse result = req.GetResponse();
Stream ReceiveStream = result.GetResponseStream();
Encoding encode = System.Text.Encoding.GetEncoding(0);
StreamReader sr = new StreamReader(ReceiveStream, encode);
string Line;
// ReadLine returns null at the end of the stream; calling Read() in the loop condition would swallow one character per iteration
while ((Line = sr.ReadLine()) != null)
{
// strip the tags, then remove spaces and trim what is left
Line = Regex.Replace(Line, @"<(.|\n)*?>", " ");
Line = Line.Replace(" ", "");
Line = Line.Trim();
}
After that, I really don't have a clue whether to process it line by line or the whole stream at once, or how to retrieve only each team's name followed by the number that is its score.
In the end, I want to put both teams with their scores in a list or XML to use in a phone application.
If anyone has an idea, that would be great. Thanks!
Take a look at Html Agility Pack
You could put the stream into an XmlDocument, allowing you to query via something like XPath. Or you could use LINQ to XML with an XDocument.
It's not perfect though, because HTML files aren't always well-formed XML (don't we know it!), but it's a simple solution using stuff already available in the framework.
You'll need an SgmlReader, which provides an XML-like API over any SGML document (which an HTML document really is).
You could use the Regex.Match method to pull out the team name and score. Examine the html to see how each row is built up. This is a common technique in screen scraping.
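A minimal sketch of the regex approach follows. The page's real markup was not posted, so the row layout assumed here (home team cell, a score cell such as "2 - 1", away team cell) and the names ScoreScraper and ReadResults are hypothetical; the pattern would need adjusting after examining the actual HTML.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ScoreScraper
{
    // Assumed layout: <td>home</td><td>2 - 1</td><td>away</td>
    private static readonly Regex RowPattern = new Regex(
        @"<td[^>]*>\s*(?<home>[^<]+?)\s*</td>\s*" +
        @"<td[^>]*>\s*(?<homeGoals>\d+)\s*-\s*(?<awayGoals>\d+)\s*</td>\s*" +
        @"<td[^>]*>\s*(?<away>[^<]+?)\s*</td>",
        RegexOptions.IgnoreCase);

    // Returns one "Home 2 - 1 Away" string per matching row.
    public static List<string> ReadResults(string html)
    {
        var results = new List<string>();
        foreach (Match m in RowPattern.Matches(html))
        {
            results.Add(
                m.Groups["home"].Value + " " +
                m.Groups["homeGoals"].Value + " - " +
                m.Groups["awayGoals"].Value + " " +
                m.Groups["away"].Value);
        }
        return results;
    }
}
```

Running the pattern over the whole response text, rather than line by line, avoids missing rows that span several lines.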
