C# web scraper navigate to aspx link - c#

I am building a C# Windows Phone 8.1 app. Part of the app needs to go and look for information on a specific web page. One of the fields I need is a URL that appears on certain items on the page; however, I am finding that the URL is in a relative-style format:
FullArticle.aspx?a=323495
I am wondering if there is a way in C#, using HtmlAgilityPack, HttpWebRequest, etc., to find the link to the actual page. A code snippet is below.
private static TileUpdate processSingleNewsItem(HtmlNode newsItemNode)
{
Debug.WriteLine("");
var articleImage = getArticleImage(getNode(newsItemNode, "div", "nw-container-panel-articleimage"));
var articleDate = getArticleDate(getNode(newsItemNode, "div", "nw-container-panel-articledate"));
var articleSummary = getArticleSummary(getNode(newsItemNode, "div", "nw-container-panel-textarea"));
var articleUrl = getArticleUrl(getNode(newsItemNode, "div", "nw-container-panel-articleimage"));
return new TileUpdate{
Date = articleDate,
Headline = articleSummary,
ImagePath = articleImage,
Url = articleUrl
};
}
private static string getArticleUrl(HtmlNode parentNode)
{
var imageNode = parentNode.Descendants("a").FirstOrDefault();
Debug.WriteLine(imageNode.GetAttributeValue("href", null));
return imageNode.GetAttributeValue("href", null);
}
private static HtmlNode getNode(HtmlNode parentNode, string nodeType, string className)
{
var children = parentNode.Elements(nodeType).Where(o => o.Attributes["class"].Value == className);
return children.First();
}
Would appreciate any ideas or solutions. Cheers!

In my web crawler here's what I do:
foreach (HtmlNode link in doc.DocumentNode.SelectNodes(@"//a[@href]"))
{
HtmlAttribute att = link.Attributes["href"];
if (att == null) continue;
string href = att.Value;
if (href.StartsWith("javascript", StringComparison.InvariantCultureIgnoreCase)) continue; // ignore javascript on buttons using a tags
Uri urlNext = new Uri(href, UriKind.RelativeOrAbsolute);
// Make it absolute if it's relative
if (!urlNext.IsAbsoluteUri)
{
urlNext = new Uri(urlRoot, urlNext);
}
...
}
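Applied to the question's getArticleUrl, the same idea could look roughly like this (a sketch only: the pageUri parameter is my addition and would be the address the HTML was originally loaded from, e.g. the URL passed to HtmlWeb.Load):

private static string getArticleUrl(HtmlNode parentNode, Uri pageUri)
{
    // First anchor inside the article-image container, as in the original helper.
    var linkNode = parentNode.Descendants("a").FirstOrDefault();
    var href = linkNode?.GetAttributeValue("href", null);
    if (href == null) return null;

    // Resolve a relative href such as "FullArticle.aspx?a=323495" against the page it came from.
    var uri = new Uri(href, UriKind.RelativeOrAbsolute);
    return uri.IsAbsoluteUri ? uri.AbsoluteUri : new Uri(pageUri, uri).AbsoluteUri;
}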

Related

HtmlDocument gets an incomplete page

I am doing a small web scraping project, and I am having a problem with the function that fetches the HTML. The page I inspect in the browser is different from the page the method downloads (for the same URL).
I have tried to improve the loading code, but to no avail. The same thing happens for i = 2.
static void Main(string[] args)
{
string prefixurl = "https://www.aaabbbcccdddeee.de/en/do-business-with-finland/finnish-suppliers/finnish-suppliers-results?query=africa";
for (int i = 1; i < 18; i++)
{
string url = prefixurl;
if (i > 1)
{
url = prefixurl + "&page=" + i;
}
var doc = GetDocument(url);
var links = GetBusinessLinks(url);
List<Empresa> empresas = GetBusiness(links);
Export(empresas);
}
}
static List<string> GetBusinessLinks(string url)
{
var doc = GetDocument(url);
var linkNodes = doc.DocumentNode.SelectNodes("/html/body/section/div/div/div/div[2]/div[2]//a");
// //a[@class=\"btn bf-ghost-button\"]
var baseUri= new Uri(url);
var links = new List<string>();
// The problem is here: in the incomplete page the program hasn't found any nodes
foreach (var node in linkNodes)
{
var link = node.Attributes["href"].Value;
bool business = link.Contains("companies");
if (business)
{
link = new Uri(baseUri, link).AbsoluteUri;
links.Add(link);
}
}
return links;
}
static HtmlDocument GetDocument(string url)
{
var web = new HtmlWeb();
HtmlDocument doc = new HtmlDocument()
{
OptionDefaultStreamEncoding = Encoding.UTF8
};
doc = web.Load(url);
return doc;
}
Your suggestion has shown me where I should continue looking, thanks.
I have used PuppeteerSharp in non-headless mode.
https://betterprogramming.pub/web-scraping-using-c-and-net-d99a085dace2
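For context: HtmlWeb.Load only downloads the raw HTML the server sends, so if the result list is filled in by JavaScript, the nodes simply are not in the downloaded document. An automated browser such as PuppeteerSharp retrieves the rendered DOM instead. A minimal sketch of that approach (the method name is mine; exact PuppeteerSharp calls may differ slightly between versions):

using System.Threading.Tasks;
using HtmlAgilityPack;
using PuppeteerSharp;

static async Task<HtmlDocument> GetRenderedDocument(string url)
{
    // Download a Chromium build for PuppeteerSharp to drive (cached after the first run).
    await new BrowserFetcher().DownloadAsync();

    var browser = await Puppeteer.LaunchAsync(new LaunchOptions { Headless = false });
    var page = await browser.NewPageAsync();
    await page.GoToAsync(url);

    // Take the DOM *after* the page's JavaScript has run, then parse it as usual.
    var html = await page.GetContentAsync();
    await browser.CloseAsync();

    var doc = new HtmlDocument();
    doc.LoadHtml(html);
    return doc;
}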

HtmlAgilityPack modify html and return updated content

I have not used the HtmlAgilityPack often and I'm stuck on the following issue.
I'm checking whether the browser supports WebP; if it does, I then append a new parameter to the src of each image.
I have that working, but I cannot work out how to return the updated HTML. Any help will be appreciated.
public static HtmlString AppendWebPString(HtmlString htmlText)
{
bool browserSupportsWebP = BrowserSupportsWebPHelper.WebPSupported();
if (!browserSupportsWebP) return htmlText;
var h = new HtmlDocument();
h.LoadHtml(htmlText.ToString());
const string webP = "&quality=80&format=webp";
if (h.DocumentNode.SelectNodes("//img[@src]") == null) return htmlText;
string imgOuterHtml = string.Empty;
foreach (HtmlNode image in h.DocumentNode.SelectNodes("//img[@src]"))
{
var src = image.Attributes["src"].Value.Split('&');
image.SetAttributeValue("src", src[1] + string.Format(webP));
imgOuterHtml = image.OuterHtml;
}
//How do I return the updated html here
return new HtmlString(h.ParsedText);
}
OK, I could not find anything built into the Agility Pack to do what I wanted.
I have managed to achieve what I was after using the code below.
public static HtmlString AppendWebPString(HtmlString htmlText)
{
bool browserSupportsWebP = BrowserSupportsWebPHelper.WebPSupported();
if (!browserSupportsWebP) return htmlText;
var h = new HtmlDocument();
h.LoadHtml(htmlText.ToString());
const string webP = "&quality=80&format=webp";
if (h.DocumentNode.SelectNodes("//img[@src]") == null) return htmlText;
string modifiedHtml = htmlText.ToString();
List<ReplaceImageValues> images = new List<ReplaceImageValues>();
foreach (HtmlNode image in h.DocumentNode.SelectNodes("//img[@src]"))
{
var src = image.Attributes["src"].Value.Split('&');
string oldSrcValue = image.OuterHtml;
image.SetAttributeValue("src", src[0] + src[1] + string.Format(webP));
string newSrcValue = image.OuterHtml;
images.Add(new ReplaceImageValues(oldSrcValue,newSrcValue));
}
foreach (var newImages in images)
{
modifiedHtml = modifiedHtml.Replace(newImages.OldVal, newImages.NewVal);
}
return new HtmlString(modifiedHtml);
}
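For what it's worth, if the goal is only to get the updated markup back out, the document can also be serialized directly. As far as I know, ParsedText returns the text as originally parsed (which would explain why the first attempt came back unchanged), while DocumentNode.OuterHtml is generated from the modified node tree. A hedged alternative, not verified against the setup above:

// After the SetAttributeValue calls, write the modified tree back out.
return new HtmlString(h.DocumentNode.OuterHtml);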

C# web scraper to grab the number of Google results for a specific search term

I've been working on a web scraper as a Windows Forms application in C#. The user enters a search term, and the program then splits the search string into individual words and looks up the number of search results for each through Yahoo and Google.
My issue lies with navigating the huge HTML document. I've tried multiple approaches, such as
iterating recursively and comparing ids, as well as lambdas with Where statements. Both result in null. I also manually looked into the HTML document to make sure the id of the div I want exists in the document.
The id I'm looking for is "resultStats", but it is very deeply nested. My code looks like this:
using HtmlAgilityPack;
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
namespace WebScraper2._0
{
public class Webscraper
{
private string Google = "http://google.com/#q=";
private string Yahoo = "http://search.yahoo.com/search?p=";
private HtmlWeb web = new HtmlWeb();
private HtmlDocument GoogleDoc = new HtmlDocument();
private HtmlDocument YahooDoc = new HtmlDocument();
public Webscraper()
{
Console.WriteLine("Init");
}
public int WebScrape(string searchterms)
{
//Console.WriteLine(searchterms);
string[] ssize = searchterms.Split(new char[0]);
int YahooMatches = 0;
int GoogleMatches = 0;
foreach (var term in ssize)
{
//Console.WriteLine(term);
var y = web.Load(Yahoo + term);
var g = web.Load(Google + term + "&cad=h");
YahooMatches += YahooFilter(y);
GoogleMatches += GoogleFilter(g);
}
Console.WriteLine("Yahoo found " + YahooMatches.ToString() + " matches");
Console.WriteLine("Google found " + GoogleMatches.ToString() + " matches");
return YahooMatches + GoogleMatches;
}
//Parse to get correct info
public int YahooFilter(HtmlDocument doc)
{
//Look for node with correct ID
IEnumerable<HtmlNode> nodes = doc.DocumentNode.Descendants().Where(n => n.HasClass("mw-jump-link"));
foreach (var item in nodes)
{
// displaying final output
Console.WriteLine(item.InnerText);
}
//TODO: Return search resultamount.
return 0;
}
int testCounter = 0;
string toReturn = "";
bool foundMatch = false;
//Parse to get correct info
public int GoogleFilter(HtmlDocument doc)
{
if (doc == null)
{
Console.WriteLine("Null");
}
foreach (var node in doc.DocumentNode.ChildNodes)
{
toReturn += Looper(node, testCounter, toReturn, foundMatch);
}
Console.WriteLine(toReturn);
/*
var stuff = doc.DocumentNode.Descendants("div")
.Where(node => node.GetAttributeValue("id", "")
.Equals("extabar")).ToList();
IEnumerable<HtmlNode> nodes = doc.DocumentNode.Descendants().Where(n => n.HasClass("appbar"));
*/
return 0;
}
public string Looper(HtmlNode node, int counter, string returnstring, bool foundMatch)
{
Console.WriteLine("Loop started" + counter.ToString());
counter++;
Console.WriteLine(node.Id);
if (node.Id == "resultStats")
{
returnstring += node.InnerText;
}
foreach (HtmlNode n in node.Descendants())
{
Looper(n, counter, returnstring, foundMatch);
}
return returnstring;
}
}
}
I made a Google HTML scraper a few weeks ago; a few things to consider:
First: Google doesn't like it when you try to scrape their search HTML. While I was running a list of companies trying to get their addresses and phone numbers, Google blocked my IP from accessing their website for a little bit (which caused a hilarious panic in the office).
Second: Google will change the HTML (id names etc.) of the page, so using ids won't work. In my case I used a combination of HTML tags and specific pieces of text to parse the response and extract the information I wanted.
Third: It's better to just use their API to grab the information you need; just make sure you respect their free-tier query limit and you should be golden (a sketch of that approach follows after the code below).
Here is the code I used.
public static string getBetween(string strSource, string strStart, string strEnd)
{
int Start, End;
if (strSource.Contains(strStart) && strSource.Contains(strEnd))
{
Start = strSource.IndexOf(strStart, 0) + strStart.Length;
End = strSource.IndexOf(strEnd, Start);
return strSource.Substring(Start, End - Start);
}
else
{
return "";
}
}
public void SearchResult()
{
//Run a Google Search
string uriString = "http://www.google.com/search";
string keywordString = "Search String";
WebClient webClient = new WebClient();
NameValueCollection nameValueCollection = new NameValueCollection();
nameValueCollection.Add("q", keywordString);
webClient.QueryString.Add(nameValueCollection);
string result = webClient.DownloadString(uriString);
string search = getBetween(result, "Address", "Hours");
rtbHtml.Text = getBetween(search, "\">", "<");
}
In my case I used the strings Address and Hours to limit which information I wanted to extract.
Edit: fixed the logic and added the code I used.
Edit 2: forgot to add the getBetween method (sorry, it's my first answer).
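On the third point above, a minimal sketch of counting results through the Custom Search JSON API (the apiKey and cx values are placeholders you would create yourself; the field names assume the current v1 response format, and error handling is omitted):

using System;
using System.Net.Http;
using System.Text.Json;
using System.Threading.Tasks;

static async Task<long> GetGoogleResultCount(string term, string apiKey, string cx)
{
    using var http = new HttpClient();
    var url = $"https://www.googleapis.com/customsearch/v1?key={apiKey}&cx={cx}&q={Uri.EscapeDataString(term)}";

    // The response reports the estimated total number of hits as a string.
    using var json = JsonDocument.Parse(await http.GetStringAsync(url));
    var total = json.RootElement.GetProperty("searchInformation").GetProperty("totalResults").GetString();
    return long.Parse(total);
}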

Scraping HTML DOM elements using HtmlAgilityPack in ASP.NET

I am scraping HTML DOM elements using HtmlAgilityPack in ASP.NET. Currently my code loads all the href links, which means sublinks of sublinks as well. But I only need the URLs that belong to my domain. I don't know how to write the code for that. Can anyone help me do this?
Here is my code:
public void GetURL(string strGetURL)
{
var getHtmlSource = new HtmlWeb();
var document = new HtmlDocument();
try
{
document = getHtmlSource.Load(strGetURL);
var aTags = document.DocumentNode.SelectNodes("//a");
if (aTags != null)
{
outputurl.Text = string.Empty;
int _count = 0;
foreach (var aTag in aTags)
{
string strURLTmp;
strURLTmp = aTag.Attributes["href"].Value;
if (_count != 0)
{
if (!CheckDuplicate(strURLTmp))
{
lstResults.Add(strURLTmp);
outputurl.Text += strURLTmp + "\n";
counter++;
GetURL(strURLTmp);
}
}
_count++;
}
}
}
If you mean to get URLs that contain a specific domain, you can change the XPath to:
//a[contains(@href, 'your domain here')]
Or, if you prefer LINQ to XPath:
var aTags = document.DocumentNode.SelectNodes("//a");
if (aTags != null)
{
....
var relevantLinks = aTags.Where(o => o.GetAttributeValue("href", "")
.Contains("your domain here")
);
....
}
GetAttributeValue() is a better way to get the value of an attribute with HAP. Instead of returning null, which may cause an exception, this method returns the second parameter when the attribute is not found on the context node.
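If "same domain" needs to be stricter than a substring match, one option (my own sketch, not part of the answer above) is to resolve each href against the page URL and compare hosts:

// Requires: using System; using System.Linq;
var baseUri = new Uri(strGetURL);
var sameDomainLinks = aTags
    .Select(a => a.GetAttributeValue("href", ""))
    .Where(h => !string.IsNullOrEmpty(h))
    .Select(h => new Uri(baseUri, h))                                  // resolve relative links
    .Where(u => u.Scheme == Uri.UriSchemeHttp || u.Scheme == Uri.UriSchemeHttps)
    .Where(u => u.Host.Equals(baseUri.Host, StringComparison.OrdinalIgnoreCase))
    .Select(u => u.AbsoluteUri)
    .Distinct()
    .ToList();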

C# Web Crawler/Parser/Spider [closed]

I'm new to C# and WinForms. I want to create a web crawler (parser) which can parse web pages and show them hierarchically, and I don't know how to make the bot crawl to a specific hyperlink depth.
So I think I have 2 questions:
How do I make the bot crawl to a specified link depth?
How do I show all hyperlinks hierarchically?
P.S. It would be great if there were code samples.
P.P.S. I have 1 button = button1 and 1 richtextbox = richTextBox1.
Here is my code (I know it's very ugly, all code in one button):
public partial class Form1 : Form
{
public Form1()
{
InitializeComponent();
}
private void button1_Click(object sender, EventArgs e)
{
//Go to this URL:
string url = UrlTextBox.Text = "http://www.yahoo.com";
if (!(url.StartsWith("http://") || url.StartsWith("https://")))
url = "http://" + url;
//Declaration (note: the request must be created after url is set)
HttpWebRequest request = (HttpWebRequest) WebRequest.Create(url);
HttpWebResponse response = (HttpWebResponse) request.GetResponse();
StreamReader sr = new StreamReader(response.GetResponseStream());
Match m;
string anotherTest = @"(((ht){1}tp[s]?://)[-a-zA-Z0-9@:%_\+.~#?&\\]+)";
List<string> savedUrls = new List<string>();
List<string> titles = new List<string>();
//Scrape Whole Html code:
string s = sr.ReadToEnd();
try
{
// Get Urls:
m = Regex.Match(s, anotherTest,
RegexOptions.IgnoreCase | RegexOptions.Compiled,
TimeSpan.FromSeconds(1));
while (m.Success)
{
savedUrls.Add(m.Groups[1].ToString());
m = m.NextMatch();
}
// Get TITLES:
Match m2 = Regex.Match(s, @"<title>\s*(.+?)\s*</title>");
if (m2.Success)
{
titles.Add(m2.Groups[1].Value);
}
//Show Title:
richTextBox1.Text += titles[0] + "\n";
//Show Urls:
TrimUrls(ref savedUrls);
}
catch (RegexMatchTimeoutException)
{
Console.WriteLine("The matching operation timed out.");
}
sr.Close();
}
private void TrimUrls(ref List<string> urls)
{
List<string> d = urls.Distinct().ToList();
foreach (var v in d)
{
if (v.IndexOf('.') != -1 && v != "http://www.w3.org")
{
richTextBox1.Text += v + "\n";
}
}
}
}
}
And one more question:
Does anybody know how to save it as XML, like a tree?
I would also highly recommend the HTML Agility Pack.
With the Html Agility Pack you can do something like:
var doc = new HtmlDocument();
doc.LoadHtml(html);
var urls = new List<String>();
foreach (var a in doc.DocumentNode.SelectNodes("//a"))
{
urls.Add(a.Attributes["href"].Value);
}
Edit:
You can do something like this, but please add some exception handling to it.
public class ParsResult
{
public ParsResult Parent { get; set; }
public String Url { get; set; }
public Int32 Depth { get; set; }
}
__
private readonly List<ParsResult> _results = new List<ParsResult>();
private Int32 _maxDepth = 5;
public void Foo(String urlToCheck = null, Int32 depth = 0, ParsResult parent = null)
{
if (depth >= _maxDepth) return;
String html;
using (var wc = new WebClient())
html = wc.DownloadString(urlToCheck ?? parent.Url);
var doc = new HtmlDocument();
doc.LoadHtml(html);
var aNods = doc.DocumentNode.SelectNodes("//a");
if (aNods == null || !aNods.Any()) return;
foreach (var aNode in aNods)
{
var url = aNode.Attributes["href"];
if (url == null)
continue;
var result = new ParsResult
{
Depth = depth,
Parent = parent,
Url = url.Value
};
_results.Add(result);
Console.WriteLine("{0} - {1}", depth, result.Url);
Foo(depth: depth + 1, parent: result);
}
}
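For reference, a minimal way to kick this off (my own usage sketch; the start URL is just an example):

// Seed the crawl at depth 0; recursion stops once _maxDepth (5) is reached,
// and _results ends up holding every link together with its parent and depth.
Foo("http://www.yahoo.com");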
If you need to parse such structured data (XHTML), take a look at XPath: http://msdn.microsoft.com/en-us/library/ms256086.aspx
(You should also put your logic into dedicated objects, not just leave it in the GUI layer. You will appreciate it later.)
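As for the earlier question about saving the result as XML: a minimal sketch (my own addition, assuming the ParsResult list built by the snippet above) that nests each page under its parent using System.Xml.Linq:

using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// Recursively emit one <page> element per result, nesting the pages found on it.
static XElement ToXml(ParsResult node, IList<ParsResult> all)
{
    return new XElement("page",
        new XAttribute("url", node.Url),
        new XAttribute("depth", node.Depth),
        all.Where(r => r.Parent == node).Select(r => ToXml(r, all)));
}

// Usage after the crawl: wrap the root-level results and save the tree.
// new XElement("crawl", _results.Where(r => r.Parent == null)
//                               .Select(r => ToXml(r, _results))).Save("crawl.xml");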
