c# crawling rule didn't work to cnn web site - c#

I'm a beginner at web crawling in C#.
I have tried to crawl the CNN headline news from https://edition.cnn.com/
but I have failed to get the headline texts.
The target looks like the HTML below (sorry, I'm not good at asking questions containing source code, newbie T.T):
<div class="cd__wrapper" data-analytics="_list-hierarchical-xs_article_">
<div class="cd__content">
<h3 class="cd__headline" data-analytics="_list-hierarchical-xs_article_">
<a href="/travel/article/cruise-ship-passengers-stranded-coronavirus/index.html">
<span class="cd__headline-text">At least 30 cruise ships are at sea. Here's what it's like on board.</span><span class="cd__headline-icon cnn-icon"></span></a></h3></div></div>
First I tried to download the whole HTML,
then convert it to a string
(my goal is to get the headline text together with its href link, for crawling child pages)
with the C# code below:
public async void GetCnnAsync()
{
var url = "https://edition.cnn.com/";
var httpClient = new HttpClient();
var html = await httpClient.GetStringAsync(url);
var htmlDocument = new Hp.HtmlDocument();
htmlDocument.LoadHtml(html);
var headLineHtmlList = htmlDocument.DocumentNode.Descendants("div")
.Where(node => node.GetAttributeValue("class", "")
.Contains("cd__headline")).ToList();
Console.ReadLine();
}
But it didn't work; headLineHtmlList just comes back with no results.
I don't know why I fail to get a result, because the Chrome page inspector shows those elements.
On the other hand, when I tried it on the Stack Overflow site,
I was able to get the question list with the code below:
public async void GetHtmlAsync()
{
var url = "https://stackoverflow.com/questions";
var httpClient = new HttpClient();
var html = await httpClient.GetStringAsync(url);
var htmlDocument = new Hp.HtmlDocument();
htmlDocument.LoadHtml(html);
var questionsHtml = htmlDocument.DocumentNode.Descendants("div")
.Where(node => node.GetAttributeValue("id", "")
.Equals("questions")).ToList();
var questionList = questionsHtml[0].Descendants("div")
.Where(node => node.GetAttributeValue("id", "")
.Contains("question-summary")).ToList();
}
That returned the question list.
Now I really want to get results from the CNN website.
Please help me.
Thanks in advance.
Update: more test code.
I created a WebBrowser control,
navigated to the page, and handled the WebBrowser_DocumentCompleted callback,
but again I didn't get a result.
So I tried again, parsing the document inside DocumentCompleted,
but I still didn't get it:
WebBrowser webBrowser;
Control parent;
WebNewsCallback newsCallback;
public WebNewsCrawler(Control parent, WebNewsCallback newsCallback) {
this.parent = parent;
this.newsCallback = newsCallback;
if (webBrowser == null) {
webBrowser = new WebBrowser {
Visible = false,
ScriptErrorsSuppressed = true
};
}
parent.Controls.Add(webBrowser);
webBrowser.DocumentCompleted += WebBrowser_DocumentCompleted;
}
public void doWork(string address) {
webBrowser.Navigate(address);
}
int count = 0;
private void WebBrowser_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e) {
if (webBrowser.ReadyState != WebBrowserReadyState.Complete) return;
newsCallback(webBrowser.DocumentStream);
GetCnn(webBrowser.DocumentStream);
Console.WriteLine(count.ToString());
count++;
}
public void GetCnn(Stream stream) {
var doc = new Hp.HtmlDocument();
doc.Load(stream, Encoding.UTF8);
var nodes = doc.DocumentNode.SelectNodes("/html/body/div[7]/section[2]/div[2]/div/div[1]/ul/li[4]/article/div/div/h3/a/span[1]");
if(nodes != null) {
Console.WriteLine("xpath nodes not null");
}
var headLineHtmlList = doc.DocumentNode.Descendants("h3").ToList();
if (headLineHtmlList != null) {
Console.WriteLine("headLineCount " +headLineHtmlList.Count.ToString());
}
}
headLineCount is 0 and the XPath query returns nothing (the copied XPath and the full XPath give the same result).

Are you sure that the selector you're using is correct? You said:
var headLineHtmlList = htmlDocument.DocumentNode.Descendants("div")
.Where(node => node.GetAttributeValue("class", "")
.Contains("cd__headline")).ToList();
Isn't that saying "Give me all the descendants with tag <div> and a CSS class of cd__headline"?
But you're not looking for a div with a class of cd__headline. You're looking for <a> tags occurring inside <h3> tags that have a CSS class of cd__headline.
I could be wrong, but if I'm right it would be an easy fix! Good luck.
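A minimal sketch of that suggestion, run against the HTML fragment from the question rather than the live site. It assumes the Html Agility Pack NuGet package is installed; the class name HeadlineDemo and the baseUri variable are made up for illustration. If CNN builds its headlines with JavaScript, the HTML that HttpClient downloads may not contain these elements at all, and then no selector will find them.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

public static class HeadlineDemo
{
    public static void Main()
    {
        // Fragment copied from the question; a live download may look different.
        const string html = @"
<div class=""cd__wrapper"">
  <div class=""cd__content"">
    <h3 class=""cd__headline"">
      <a href=""/travel/article/cruise-ship-passengers-stranded-coronavirus/index.html"">
        <span class=""cd__headline-text"">At least 30 cruise ships are at sea. Here's what it's like on board.</span>
      </a>
    </h3>
  </div>
</div>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Used to turn the relative href into an absolute URL for follow-up requests.
        var baseUri = new Uri("https://edition.cnn.com/");

        var headlines = doc.DocumentNode
            .Descendants("h3")                                   // the class sits on <h3>, not <div>
            .Where(h3 => h3.GetAttributeValue("class", "")
                           .Split(' ')
                           .Contains("cd__headline"))
            .Select(h3 => h3.Descendants("a").FirstOrDefault())  // first link inside each headline
            .Where(a => a != null)
            .Select(a => new
            {
                Text = HtmlEntity.DeEntitize(a.InnerText).Trim(),
                Link = new Uri(baseUri, a.GetAttributeValue("href", "")).ToString()
            })
            .ToList();

        foreach (var h in headlines)
        {
            Console.WriteLine(h.Text + " -> " + h.Link);
        }
    }
}
```

Splitting the class attribute on spaces matches the exact class name, so an element whose class is only cd__headline-text is not picked up by accident, which a plain Contains on the whole attribute would do.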

Related

C# Get data from website and show it in textbox

Hello, I am pretty new to C#. I want to make a little program that will fetch data from a given page.
This is a fragment of the website:
<h3 class="filmInfo__header cloneToCast cloneToOtherInfo" data-type="directing-header">reżyseria</h3>
<div class="filmInfo__info cloneToCast cloneToOtherInfo" data-type="directing-info"> <span itemprop="url" content="/person/Rupert+Sanders-1121101"></span> <span itemprop="name">Rupert Sanders</span> </div>
I want to get the data from data-type="directing-info" and get the result from title="Rupert Sanders".
Can somebody help me?
My very simple code:
private void button1_Click(object sender, EventArgs e)
{
var url = "https://www.filmweb.pl/film/Kr%C3%B3lewna+%C5%9Anie%C5%BCka+i+%C5%81owca-2012-600541";
var httpClient = new HttpClient();
var html = httpClient.GetStringAsync(url);
textBox1.Text = (html.Result);
}
Neither C# nor .NET offers native HTML parsing functionality. However, there are a handful of libraries that provide it. For example, you can use Html Agility Pack.
First, you need to install it into your project. If you use Visual Studio, you can easily install it with the NuGet Package Manager.
After that, you can use it like this with your input HTML:
private void button1_Click(object sender, EventArgs e)
{
var url = "https://www.filmweb.pl/film/Kr%C3%B3lewna+%C5%9Anie%C5%BCka+i+%C5%81owca-2012-600541";
var httpClient = new HttpClient();
var html = httpClient.GetStringAsync(url);
// Create a HtmlDocument and load your HTML into it.
var htmlDocument = new HtmlDocument();
htmlDocument.LoadHtml(html.Result);
// Find your desired node inside it.
HtmlNode directingInfoNode = htmlDocument.DocumentNode.SelectSingleNode("//div[@data-type='directing-info']/a");
// Get the title attribute of that node.
HtmlAttribute titleAttribute = directingInfoNode.Attributes["title"];
textBox1.Text = (titleAttribute.Value);
}
Of course, you will need to put the necessary using directive at the top of your file:
using HtmlAgilityPack;
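A hedged variant, for the case where the page matches the fragment quoted in the question: there the director's name sits in a span with itemprop="name" rather than in the title attribute of an a element. This sketch parses that fragment directly (the live page may differ), and the class name DirectorDemo is only illustrative.

```csharp
using System;
using HtmlAgilityPack;

public static class DirectorDemo
{
    public static void Main()
    {
        // Fragment from the question, with the class lists shortened.
        const string html = @"<h3 class=""filmInfo__header"" data-type=""directing-header"">reżyseria</h3>
<div class=""filmInfo__info"" data-type=""directing-info""> <span itemprop=""url"" content=""/person/Rupert+Sanders-1121101""></span> <span itemprop=""name"">Rupert Sanders</span> </div>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Attribute tests in XPath use '@'. SelectSingleNode returns null when nothing matches.
        HtmlNode nameNode = doc.DocumentNode.SelectSingleNode(
            "//div[@data-type='directing-info']//span[@itemprop='name']");

        Console.WriteLine(nameNode != null ? nameNode.InnerText.Trim() : "director not found");
    }
}
```

Checking for null before reading InnerText avoids a NullReferenceException when the page layout changes.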

Unable to get html element by using X-Path in HtmlAgilityPack C#

I am trying to get an element using a full XPath, but it returns null. This type of XPath works for me on other sites; it fails on only about 2% of sites. I also tried the XPath copied from Chrome, but whenever my XPath fails, the Chrome XPath fails too.
public static void Main()
{
string url = "http://www.ndrf.gov.in/tender";
HtmlWeb web = new HtmlWeb();
var htmlDoc = web.Load(url);
var nodetest1 = htmlDoc.DocumentNode.SelectSingleNode("/html[1]/body[1]/section[2]/div[1]/div[1]/div[1]/div[1]/div[2]/table[1]"); // I want this type // not working
//var nodetest2 = htmlDoc.DocumentNode.SelectSingleNode("//*[@id=\"content\"]/div/div[1]/div[2]/table"); // from Google Chrome // not working
//var nodetest3 = htmlDoc.DocumentNode.SelectSingleNode("//*[@id=\"content\"]"); // by ID, but I don't want this type // working
Console.WriteLine(nodetest1.InnerText); //fail
//Console.WriteLine(nodetest2.InnerText); //fail
//Console.WriteLine(nodetest3.InnerText); //proper but I don't want this type
}
The answer that @QHarr suggested works perfectly. But the reason you get null with a correct XPath is that a JavaScript file in the header of the site adds a wrapper div around the table, and since HtmlAgilityPack does not load or execute JavaScript, the XPath returns null.
What you observe after that script runs is:
<div class="view-content">
<div class="guide-text">
...
</div>
<div class="scroll-table1">
<!-- Your table is here -->
</div>
</div>
but what you actually get without that script is:
<div class="view-content">
<!-- Your table is here -->
</div>
thus your XPath should be:
var nodetest1 = htmlDoc.DocumentNode.SelectSingleNode("/html[1]/body[1]/section[2]/div[1]/div[1]/div[1]/div[1]/table[1]");
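As a sketch of one way to make the query less sensitive to that wrapper div, a descendant XPath anchored on the view-content class matches both shapes. The two HTML strings below are simplified stand-ins for the static and the rendered markup described above, and FindsTable is a hypothetical helper name.

```csharp
using System;
using HtmlAgilityPack;

public static class WrapperDemo
{
    // Hypothetical helper: true when a table exists anywhere under div.view-content.
    static bool FindsTable(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        // '//' before table means "any depth", so an extra wrapper div does not matter.
        var table = doc.DocumentNode.SelectSingleNode("//div[@class='view-content']//table");
        return table != null;
    }

    public static void Main()
    {
        // Simplified version of what the server sends (no JavaScript executed).
        const string staticHtml =
            "<div class=\"view-content\"><table><tr><td>x</td></tr></table></div>";

        // Simplified version of the DOM after the script has added the wrapper.
        const string renderedHtml =
            "<div class=\"view-content\"><div class=\"guide-text\"></div>" +
            "<div class=\"scroll-table1\"><table><tr><td>x</td></tr></table></div></div>";

        Console.WriteLine(FindsTable(staticHtml));
        Console.WriteLine(FindsTable(renderedHtml));
    }
}
```

Both calls should report a match, whereas an absolute path with positional indexes matches only one of the two shapes.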
Your XPath, when used in the browser, selects the entire table. You can shorten it and use it as follows (fiddle):
using System;
using HtmlAgilityPack;
public class Program
{
public static void Main()
{
string url = "http://www.ndrf.gov.in/tender";
HtmlWeb web = new HtmlWeb();
var htmlDoc = web.Load(url);
var nodetest1 = htmlDoc.DocumentNode.SelectSingleNode("//table");
Console.WriteLine(nodetest1.InnerText);
}
}
Use Fizzler.Systems.HtmlAgilityPack
details here : https://www.nuget.org/packages/Fizzler.Systems.HtmlAgilityPack/
This library adds extension methods called QuerySelector and QuerySelectorAll, which take a CSS selector rather than an XPath.
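A short sketch of those two extension methods, assuming both the HtmlAgilityPack and Fizzler.Systems.HtmlAgilityPack NuGet packages are installed. The HTML string is a made-up stand-in for the tender page.

```csharp
using System;
using HtmlAgilityPack;
using Fizzler.Systems.HtmlAgilityPack; // brings QuerySelector / QuerySelectorAll into scope

public static class FizzlerDemo
{
    public static void Main()
    {
        // Illustrative markup only; the real page is larger.
        const string html =
            "<div class=\"view-content\"><table><tr><td>Tender 1</td><td>Tender 2</td></tr></table></div>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // CSS descendant selector: any table nested inside div.view-content.
        HtmlNode table = doc.DocumentNode.QuerySelector("div.view-content table");
        Console.WriteLine(table != null ? "table found" : "no table found");

        // QuerySelectorAll yields every matching node.
        foreach (HtmlNode cell in doc.DocumentNode.QuerySelectorAll("td"))
        {
            Console.WriteLine(cell.InnerText);
        }
    }
}
```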
Ali Bordbar caught it perfectly. The page adds a wrapper div when I navigate to the URL in a WebBrowser control, because there all the JavaScript files are loaded,
but when I load the URL using HtmlWeb, none of the JavaScript files are loaded.
HtmlWeb retrieves the static HTML response that the server sends and does not execute any JavaScript, whereas a WebBrowser would.
So an XPath built from the WebBrowser control's HTML DOM does not match the DOM that HtmlWeb produces.
The code below works perfectly for this situation:
HtmlWeb web = new HtmlWeb();
web.AutoDetectEncoding = true;
HtmlAgilityPack.HtmlDocument theDoc1 = web.Load("http://www.ndrf.gov.in/tender");
var HtmlDoc = new HtmlAgilityPack.HtmlDocument();
var bodytag = theDoc1.DocumentNode.SelectSingleNode("//html");
HtmlDoc.LoadHtml(bodytag.OuterHtml);
var xpathHtmldata = HtmlDoc.DocumentNode.SelectSingleNode(savexpath); // savexpath is my first XPath, built from the WebBrowser control's HTML DOM; it works for most URLs.
if (xpathHtmldata == null)
{
// take the last tag name from the first XPath
string mainele = savexpath.Substring(savexpath.LastIndexOf("/") + 1);
if (mainele.Contains("[")) { mainele = mainele.Remove(mainele.IndexOf("[")); }
// collect all elements whose tag name matches the one stored in mainele
var taglist = HtmlDoc.DocumentNode.SelectNodes("//" + mainele);
foreach (var ele in taglist) //check one by one element
{
string htmltext1 = ele.InnerText;
htmltext1 = Regex.Replace(htmltext1, @"\s", "");
htmltext1 = htmltext1.Replace("&amp;", "&").Trim();
htmltext1 = htmltext1.Replace("&nbsp;", "").Trim();
string htmltext2 = saveInnerText; // the text of my previous XPath's node, taken from the WebBrowser control's HTML DOM
htmltext2 = Regex.Replace(htmltext2, @"\s", "");
if (htmltext1 == htmltext2) // if the text equals my previous XPath's text, this element gives my new XPath
{
savexpath = ele.XPath;
break;
}
}
}

C# Downloading Instagram Profile As HTML

I have been trying to download a public Instagram profile to fetch stats such as followers and bio. I have been doing this in a C# console application, downloading the HTML using HTML Agility Pack.
Code:
string url = @"https://www.instagram.com/" + Console.ReadLine() + @"/?hl=en";
Console.WriteLine();
HtmlWeb web = new HtmlWeb();
HtmlDocument document = web.Load(url);
document.Save(path1);
When I save it though all I get is a bunch of scripts and a blank screen:
I was wondering how to save the HTML once all the scripts have run and formed the content.
When you retrieve content using a web request, it returns an HTML document which is then rendered by the browser to display the content.
Right now, you're saving the HTML document given to you by the server. Instead of doing this, you need to render it before getting the details. One way to do this is to use a web browser control. If you set the URL to the Instagram URL and let the rendering engine handle it, then once the control fires its load event you can get the rendered HTML output.
From there, you can deserialize as an XmlDocument and identify exactly what details you need to retrieve from the rendered output.
public MainWindow()
{
InitializeComponent();
// Subscribe before navigating so the completion event cannot be missed.
WB_1.LoadCompleted += wb_LoadCompleted;
WB_1.Navigate(@"https://www.instagram.com/" + Console.ReadLine() + @"/?hl=en");
}
void wb_LoadCompleted(object sender, NavigationEventArgs e)
{
dynamic doc = WB_1.Document;
string htmlText = doc.documentElement.InnerHtml;
}
ANSWER
Thanks for the suggestions on how to download the HTML! I managed to return some Instagram information in the end. Here is the code:
//(This was done using HTML Agility Pack)
string url = @"https://www.instagram.com/" + Console.ReadLine() + @"/?hl=en";
HtmlWeb web = new HtmlWeb();
HtmlDocument document = web.Load(url);
var metas = document.DocumentNode.Descendants("meta");
var followers = metas.FirstOrDefault(_ => _.HasProperty("name", "description"));
if (followers == null) { Console.WriteLine("Sorry, Can't Find Profile :("); return; }
var content = followers.Attributes["content"].Value.StopAt('-');
Console.WriteLine(content);
And the HasProperty() and StopAt() extension methods:
public static bool HasProperty(this HtmlNode node, string property, params string[] valueArray)
{
var propertyValue = node.GetAttributeValue(property, "");
var propertyValues = propertyValue.Split(' ');
return valueArray.All(c => propertyValues.Contains(c));
}
public static string StopAt(this string input, char stopAt)
{
int x = input.IndexOf(stopAt);
// Return the whole string when the stop character is absent (IndexOf returns -1).
return x < 0 ? input : input.Substring(0, x);
}
NOTE:
However, this is still not the answer I am looking for. I still have a wreck of HTML which is not structured the same as the HTML I receive when I look at it in Google Chrome. Doing some searching in the HTML, I managed to salvage from the content-less HTML a meta tag which contains the content. This is okay for this case, but if I continue with this method of finding HTML content, other pages may not work the same way :(

How can i loop over a string and get the links between href that end with jpg?

I'm navigating to a web site using a WebBrowser control, and in the completed event I did:
void webBrowser1_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
{
mshtml.HTMLDocument objHtmlDoc = (mshtml.HTMLDocument)webBrowser1.Document.DomDocument;
string pageSource = objHtmlDoc.documentElement.innerHTML;
}
Now pageSource holds the whole page source.
I tried:
string[] lines = File.ReadAllLines(pageSource);
But it gives me an exception:
Illegal characters in path
Then I tried this line:
var aContents = Regex.Matches(pageSource, @"<a [^>]*>(.*?)</a>").Cast<Match>().Select(m => m.Groups[1].Value);
But there are no href values in aContents.
Use Html Agility Pack (http://html-agility-pack.net).
You can use the library to load the page from a URL, then check each link node to see whether its href contains the extension and store it in a collection.
List<string> alljpgHref = new List<string>();
HtmlWeb hw = new HtmlWeb();
HtmlDocument doc = hw.Load(/* url */);
// Select every <a> element that has an href attribute.
foreach(HtmlNode link in doc.DocumentNode.SelectNodes("//a[@href]"))
{
string hrefValue = link.GetAttributeValue( "href", string.Empty );
if (hrefValue.Contains(".jpg")) alljpgHref.Add(hrefValue);
}
or just query the links:
string[] hrefs = this.webBrowser1.Document.Links.Cast<HtmlElement>()
.Select(a => a.GetAttribute("href")).Where(h => h.Contains(".jpg")).ToArray();
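Since the question asks for links that end with jpg, here is a sketch that tests the end of the URL path instead of using Contains, and also resolves relative links. It assumes Html Agility Pack; ExtractJpgLinks, the sample HTML, and the example.com base address are all made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class JpgLinkDemo
{
    // Hypothetical helper: returns absolute URLs of all links whose path ends in .jpg.
    public static List<string> ExtractJpgLinks(string html, Uri baseUri)
    {
        var result = new List<string>();
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var links = doc.DocumentNode.SelectNodes("//a[@href]");
        if (links == null) return result;

        foreach (HtmlNode link in links)
        {
            string href = HtmlEntity.DeEntitize(link.GetAttributeValue("href", string.Empty));

            // Resolve relative hrefs against the page address, then test only the path,
            // so a query string such as ?size=2 does not hide the extension.
            if (Uri.TryCreate(baseUri, href, out Uri absolute) &&
                absolute.AbsolutePath.EndsWith(".jpg", StringComparison.OrdinalIgnoreCase))
            {
                result.Add(absolute.ToString());
            }
        }
        return result;
    }

    public static void Main()
    {
        const string html =
            "<a href=\"/img/a.jpg\">a</a><a href=\"page.html\">b</a><a href=\"photo.JPG?size=2\">c</a>";

        foreach (string url in ExtractJpgLinks(html, new Uri("http://example.com/gallery/")))
        {
            Console.WriteLine(url);
        }
    }
}
```

In a WebBrowser scenario, the pageSource string from DocumentCompleted could be passed in as the html argument, with the browser's current URL as the base.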

How to use HTMLAgilityPack to extract HTML data

I am learning to write web crawler and found some great examples to get me started but since I am new to this, I have a few questions in regards to the coding method.
The search result for example can be found here: Search Result
When I look at the HTML source for the result I can see the following:
<HR><CENTER><H3>License Information *</H3></CENTER><HR>
<P>
<CENTER> 06/03/2014 </CENTER> <BR>
<B>Name : </B> WILLIAMS AJAYA L <BR>
<B>Address : </B> NEW YORK NY <BR>
<B>Profession : </B> ATHLETIC TRAINER <BR>
<B>License No: </B> 001475 <BR>
<B>Date of Licensure : </B> 01/12/07 <BR>
<B>Additional Qualification : </B> Not applicable in this profession <BR>
<B> Status :</B> REGISTERED <BR>
<B>Registered through last day of : </B> 08/15 <BR>
How can I use the HTMLAgilityPack to scrape that data from the site?
I was trying to implement an example as shown below, but not sure where to make the edit to get it working to crawl the page:
private void btnCrawl_Click(object sender, EventArgs e)
{
foreach (SHDocVw.InternetExplorer ie in shellWindows)
{
filename = Path.GetFileNameWithoutExtension( ie.FullName ).ToLower();
if ( filename.Equals( "iexplore" ) )
txtURL.Text = "Now Crawling: " + ie.LocationURL.ToString();
}
string url = ie.LocationURL.ToString();
string xmlns = "{http://www.w3.org/1999/xhtml}";
Crawler cl = new Crawler(url);
XDocument xdoc = cl.GetXDocument();
var res = from item in xdoc.Descendants(xmlns + "div")
where item.Attribute("class") != null && item.Attribute("class").Value == "folder-news"
&& item.Element(xmlns + "a") != null
//select item;
select new
{
Link = item.Element(xmlns + "a").Attribute("href").Value,
Image = item.Element(xmlns + "a").Element(xmlns + "img").Attribute("src").Value,
Title = item.Elements(xmlns + "p").ElementAt(0).Element(xmlns + "a").Value,
Desc = item.Elements(xmlns + "p").ElementAt(1).Value
};
foreach (var node in res)
{
MessageBox.Show(node.ToString());
tb.Text = node + "\n";
}
//Console.ReadKey();
}
The Crawler helper class:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using System.Xml.Linq;
namespace CrawlerWeb
{
public class Crawler
{
public string Url
{
get;
set;
}
public Crawler() { }
public Crawler(string Url)
{
this.Url = Url;
}
public XDocument GetXDocument()
{
HtmlAgilityPack.HtmlWeb doc1 = new HtmlAgilityPack.HtmlWeb();
doc1.UserAgent = "Mozilla/4.0 (conpatible; MSIE 7.0; Windows NT 5.1)";
HtmlAgilityPack.HtmlDocument doc2 = doc1.Load(Url);
doc2.OptionOutputAsXml = true;
doc2.OptionAutoCloseOnEnd = true;
doc2.OptionDefaultStreamEncoding = System.Text.Encoding.UTF8;
XDocument xdoc = XDocument.Parse(doc2.DocumentNode.SelectSingleNode("html").OuterHtml);
return xdoc;
}
}
}
tb is a multiline textbox... So I would like it to display the following:
Name WILLIAMS AJAYA L
Address NEW YORK NY
Profession ATHLETIC TRAINER
License No 001475
Date of Licensure 1/12/07
Additional Qualification Not applicable in this profession
Status REGISTERED
Registered through last day of 08/15
I would like the second argument to be added to an array because next step would be to write to a SQL database...
I am able to get the URL from the IE which has the search result but how can I code it in my script?
This little snippet should get you started:
HtmlDocument doc = new HtmlDocument();
WebClient client = new WebClient();
string html = client.DownloadString("http://www.nysed.gov/coms/op001/opsc2a?profcd=67&plicno=001475&namechk=WIL");
doc.LoadHtml(html);
HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//div");
You basically use the WebClient class to download the HTML file and then you load that HTML into the HtmlDocument object. Then you need to use XPath to query the DOM tree and search for nodes. In the above example "nodes" will include all the div elements in the document.
Here's a quick reference about the XPath syntax: http://msdn.microsoft.com/en-us/library/ms256086(v=vs.110).aspx
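The HTML shown in the question has no div elements around the fields; each label is a B element followed by a plain text node. A sketch of pairing them up, run on a shortened copy of that fragment, is given below. It assumes Html Agility Pack, and the class name LicenseDemo and the fields list are illustrative only.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class LicenseDemo
{
    public static void Main()
    {
        // Shortened copy of the fragment from the question.
        const string html = @"<HR><CENTER><H3>License Information *</H3></CENTER><HR>
<P>
<B>Name : </B> WILLIAMS AJAYA L <BR>
<B>Address : </B> NEW YORK NY <BR>
<B> Status :</B> REGISTERED <BR>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var fields = new List<KeyValuePair<string, string>>();

        // Html Agility Pack lower-cases element names, so //b matches <B>.
        var labels = doc.DocumentNode.SelectNodes("//b");
        if (labels != null)
        {
            foreach (HtmlNode label in labels)
            {
                // "Name : " becomes "Name".
                string name = HtmlEntity.DeEntitize(label.InnerText).Trim().TrimEnd(':').Trim();

                // The value is the text node that directly follows the label.
                HtmlNode next = label.NextSibling;
                string value = (next != null && next.NodeType == HtmlNodeType.Text)
                    ? HtmlEntity.DeEntitize(next.InnerText).Trim()
                    : string.Empty;

                fields.Add(new KeyValuePair<string, string>(name, value));
            }
        }

        foreach (var field in fields)
        {
            Console.WriteLine(field.Key + "\t" + field.Value);
        }
    }
}
```

Keeping the pairs in a list preserves the page order, and the values can then be copied into an array or passed as parameters to a SQL insert.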
