HtmlDocument gets an incomplete page - C#

I am working on a small web scraping project, and I am having a problem with the function that downloads the HTML. The page I inspect in the browser is different from the page the method downloads (for the same URL).
I have tried adjusting the encoding, but to no avail. The same thing happens for "i=2".
static void Main(string[] args)
{
    string prefixurl = "https://www.aaabbbcccdddeee.de/en/do-business-with-finland/finnish-suppliers/finnish-suppliers-results?query=africa";
    for (int i = 1; i < 18; i++)
    {
        string url = prefixurl;
        if (i > 1)
        {
            url = prefixurl + "&page=" + i;
        }
        var doc = GetDocument(url);
        var links = GetBusinessLinks(url);
        List<Empresa> empresas = GetBusiness(links);
        Export(empresas);
    }
}
static List<string> GetBusinessLinks(string url)
{
    var doc = GetDocument(url);
    var linkNodes = doc.DocumentNode.SelectNodes("/html/body/section/div/div/div/div[2]/div[2]//a");
    // //a[@class=\"btn bf-ghost-button\"]
    var baseUri = new Uri(url);
    var links = new List<string>();
    // The problem is here: on the incomplete page no nodes are found, so linkNodes is null
    foreach (var node in linkNodes)
    {
        var link = node.Attributes["href"].Value;
        bool business = link.Contains("companies");
        if (business)
        {
            link = new Uri(baseUri, link).AbsoluteUri;
            links.Add(link);
        }
    }
    return links;
}
static HtmlDocument GetDocument(string url)
{
    var web = new HtmlWeb();
    // Note: this configured document is discarded below, because web.Load(url)
    // returns a new HtmlDocument; consider web.OverrideEncoding if the encoding matters.
    HtmlDocument doc = new HtmlDocument()
    {
        OptionDefaultStreamEncoding = Encoding.UTF8
    };
    doc = web.Load(url);
    return doc;
}

Your suggestion has made me suspect where I should keep looking, thanks.
I have used PuppeteerSharp in non-headless mode.
https://betterprogramming.pub/web-scraping-using-c-and-net-d99a085dace2
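For reference, a minimal sketch of that approach (exact PuppeteerSharp method names can vary slightly between versions): since the page content is rendered by JavaScript, HtmlWeb only sees the initial HTML, while a browser driven by PuppeteerSharp returns the fully rendered DOM, which can then be fed to HtmlAgilityPack and used by GetBusinessLinks unchanged.
using System.Threading.Tasks;
using HtmlAgilityPack;
using PuppeteerSharp;

static async Task<HtmlDocument> GetRenderedDocumentAsync(string url)
{
    // One-time download of a compatible browser build.
    await new BrowserFetcher().DownloadAsync();
    var browser = await Puppeteer.LaunchAsync(new LaunchOptions { Headless = false });
    var page = await browser.NewPageAsync();
    await page.GoToAsync(url);
    // The DOM after the client-side scripts have run, i.e. what the browser inspector shows.
    string html = await page.GetContentAsync();
    await browser.CloseAsync();
    var doc = new HtmlDocument();
    doc.LoadHtml(html);
    return doc;
}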

Related

Extract email address from a website for each link inside DOM of page

I want to develop an app: I give it the URL of a specific website, and it extracts all the links from that web page. For each extracted link I want to get the HTML content. I am building on the concept of deep crawling.
My purpose is to get all the email addresses of the website. Below is my source code:
static string ExtractEmails(string data)
{
    //instantiate with this pattern
    Regex emailRegex = new Regex(@"\w+([-+.]\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*", RegexOptions.IgnoreCase);
    //find items that match our pattern
    MatchCollection emailMatches = emailRegex.Matches(data);
    //StringBuilder sb = new StringBuilder();
    string s = "";
    foreach (Match emailMatch in emailMatches)
    {
        //sb.AppendLine(emailMatch.Value);
        s += emailMatch.Value + ",";
    }
    return s;
}
static readonly List<ParsResult> _results = new List<ParsResult>();
static Int32 _maxDepth = 4;
static String Foo(String urlToCheck = null, Int32 depth = 0, ParsResult parent = null)
{
    string email = "";
    if (depth >= _maxDepth) return email;
    String html;
    using (var wc = new WebClient())
        html = wc.DownloadString(urlToCheck ?? parent.Url);
    var doc = new HtmlDocument();
    doc.LoadHtml(html);
    var aNods = doc.DocumentNode.SelectNodes("//a");
    if (aNods == null || !aNods.Any()) return email;
    foreach (var aNode in aNods)
    {
        var url = aNode.Attributes["href"];
        if (url == null)
            continue;
        var wc2 = new WebClient();
        String html2 = wc2.DownloadString(url.Value);
        email = ExtractEmails(html2);
        Console.WriteLine(email);
        var result = new ParsResult
        {
            Depth = depth,
            Parent = parent,
            Url = url.Value
        };
        _results.Add(result);
        Console.WriteLine("{0} - {1}", depth, result.Url);
        Foo(depth: depth + 1, parent: result);
        return email;
    }
    return email;
}
static void Main(string[] args)
{
    String res = Foo("http://www.mobileridoda.com", 0);
    Console.WriteLine("emails " + res);
}
I want to display in the console all the emails extracted from all the pages linked inside the DOM of the main page, but it displays no emails. How can I solve this issue?
Thank you
Found a few things wrong, but no worries; here are the details on why and how to fix them.
1. In your foreach loop, when you go through the first URL, you use a return statement at the end, essentially breaking the loop and terminating. Return only after you have processed ALL the URLs and accumulated the email addresses.
2. You are overwriting email (I treat it as a CSV string) on each pass through the loop. Use += to keep accumulating instead of email = ExtractEmails(html2);
3. You are discarding the return value when you call Foo within your foreach loop. Change Foo(depth: depth + 1, parent: result); to email += Foo(depth: depth + 1, parent: result);
4. You may go through a URL that you have already processed, possibly causing an infinite cycle. I added a list of strings that keeps track of URLs already visited to prevent that.
Here is a complete working solution.
static string ExtractEmails(string data)
{
    //instantiate with this pattern
    Regex emailRegex = new Regex(@"\w+([-+.]\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*", RegexOptions.IgnoreCase);
    //find items that match our pattern
    MatchCollection emailMatches = emailRegex.Matches(data);
    //StringBuilder sb = new StringBuilder();
    string s = "";
    foreach (Match emailMatch in emailMatches)
    {
        //sb.AppendLine(emailMatch.Value);
        s += emailMatch.Value + ",";
    }
    return s;
}
static readonly List<ParsResult> _results = new List<ParsResult>();
static Int32 _maxDepth = 4;
static List<string> urlsAlreadyVisited = new List<string>();
static String Foo(String urlToCheck = null, Int32 depth = 0, ParsResult parent = null)
{
    // Resolve the URL being processed up front so the visited check also works
    // for the recursive calls, which pass urlToCheck as null.
    string currentUrl = urlToCheck ?? parent?.Url;
    if (urlsAlreadyVisited.Contains(currentUrl))
        return string.Empty;
    else
        urlsAlreadyVisited.Add(currentUrl);
    string email = "";
    if (depth >= _maxDepth) return email;
    String html;
    using (var wc = new WebClient())
        html = wc.DownloadString(currentUrl);
    var doc = new HtmlDocument();
    doc.LoadHtml(html);
    var aNods = doc.DocumentNode.SelectNodes("//a");
    if (aNods == null || !aNods.Any()) return email;
    // Get distinct absolute URLs from all the URLs on this page.
    List<string> allUrls = aNods.Select(x => x.GetAttributeValue("href", string.Empty)).Where(url => url.StartsWith("http")).Distinct().ToList();
    foreach (string url in allUrls)
    {
        var wc2 = new WebClient();
        try
        {
            email += ExtractEmails(wc2.DownloadString(url));
        }
        catch { /* Swallow exception ... URL not found or other errors. */ continue; }
        Console.WriteLine(email);
        var result = new ParsResult
        {
            Depth = depth,
            Parent = parent,
            Url = url
        };
        _results.Add(result);
        Console.WriteLine("{0} - {1}", depth, result.Url);
        email += Foo(depth: depth + 1, parent: result);
    }
    return email;
}
public class ParsResult
{
    public int Depth { get; set; }
    public ParsResult Parent { get; set; }
    public string Url { get; set; }
}
// ========== MAIN CLASS ==========
static void Main(string[] args)
{
    String res = Foo("http://www.mobileridoda.com", 0);
    Console.WriteLine("emails " + res);
}

HtmlAgilityPack modify html and return updated content

I have not used the HtmlAgilityPack often and I'm stuck on the following issue.
I'm checking to see if the browser supports WebP; if it does, I then append a new parameter to the src of the image.
I have that working, but I cannot work out how to return the updated HTML. Any help will be appreciated.
public static HtmlString AppendWebPString(HtmlString htmlText)
{
    bool browserSupportsWebP = BrowserSupportsWebPHelper.WebPSupported();
    if (!browserSupportsWebP) return htmlText;
    var h = new HtmlDocument();
    h.LoadHtml(htmlText.ToString());
    const string webP = "&quality=80&format=webp";
    if (h.DocumentNode.SelectNodes("//img[@src]") == null) return htmlText;
    string imgOuterHtml = string.Empty;
    foreach (HtmlNode image in h.DocumentNode.SelectNodes("//img[@src]"))
    {
        var src = image.Attributes["src"].Value.Split('&');
        image.SetAttributeValue("src", src[1] + string.Format(webP));
        imgOuterHtml = image.OuterHtml;
    }
    //How do I return the updated html here?
    return new HtmlString(h.ParsedText);
}
OK, I could not find anything built into the Agility Pack to do what I wanted.
I have managed to achieve what I was after using the code below.
public static HtmlString AppendWebPString(HtmlString htmlText)
{
    bool browserSupportsWebP = BrowserSupportsWebPHelper.WebPSupported();
    if (!browserSupportsWebP) return htmlText;
    var h = new HtmlDocument();
    h.LoadHtml(htmlText.ToString());
    const string webP = "&quality=80&format=webp";
    if (h.DocumentNode.SelectNodes("//img[@src]") == null) return htmlText;
    string modifiedHtml = htmlText.ToString();
    List<ReplaceImageValues> images = new List<ReplaceImageValues>();
    foreach (HtmlNode image in h.DocumentNode.SelectNodes("//img[@src]"))
    {
        var src = image.Attributes["src"].Value.Split('&');
        string oldSrcValue = image.OuterHtml;
        image.SetAttributeValue("src", src[0] + src[1] + string.Format(webP));
        string newSrcValue = image.OuterHtml;
        images.Add(new ReplaceImageValues(oldSrcValue, newSrcValue));
    }
    foreach (var newImages in images)
    {
        modifiedHtml = modifiedHtml.Replace(newImages.OldVal, newImages.NewVal);
    }
    return new HtmlString(modifiedHtml);
}
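As a side note, HtmlAgilityPack can also serialize the modified document directly: once the nodes have been edited, h.DocumentNode.OuterHtml (or h.Save(...)) returns the updated markup, which avoids the string-replace bookkeeping above. A minimal sketch of that variant, assuming the same BrowserSupportsWebPHelper and src layout as in the code above:
public static HtmlString AppendWebPStringViaOuterHtml(HtmlString htmlText)
{
    if (!BrowserSupportsWebPHelper.WebPSupported()) return htmlText;
    var h = new HtmlDocument();
    h.LoadHtml(htmlText.ToString());
    const string webP = "&quality=80&format=webp";
    var images = h.DocumentNode.SelectNodes("//img[@src]");
    if (images == null) return htmlText;
    foreach (HtmlNode image in images)
    {
        // Mirrors the src manipulation from the answer above.
        var src = image.Attributes["src"].Value.Split('&');
        image.SetAttributeValue("src", src[0] + src[1] + webP);
    }
    // DocumentNode.OuterHtml reflects the modified DOM.
    return new HtmlString(h.DocumentNode.OuterHtml);
}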

html agility pack url scraping-- getting full html link

Hi, I am using the Html Agility Pack from NuGet in order to scrape a web page and get all of the URLs on the page. The code is shown below. However, the links it returns in the output are just relative extensions of the actual website, not full URLs like http://www.foo/bar/foobar.com. All I get is "/foobar". Is there a way to get the full URLs with the code below?
Thanks!
static void Main(string[] args)
{
    List<string> linksToVisit = ParseLinks("https://www.facebook.com");
}
public static List<string> ParseLinks(string email)
{
    WebClient webClient = new WebClient();
    byte[] data = webClient.DownloadData(email);
    string download = Encoding.ASCII.GetString(data);
    HashSet<string> list = new HashSet<string>();
    var doc = new HtmlDocument();
    doc.LoadHtml(download);
    HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//a[@href]");
    foreach (var n in nodes)
    {
        string href = n.Attributes["href"].Value;
        list.Add(href);
    }
    return list.ToList();
}
You can check whether the href value is a relative or an absolute URL.
Loading the link into a Uri, testing whether it is relative, and converting it to absolute if so is the way to go.
static void Main(string[] args)
{
    List<string> linksToVisit = ParseLinks("https://www.facebook.com");
}
public static List<string> ParseLinks(string urlToCrawl)
{
    WebClient webClient = new WebClient();
    byte[] data = webClient.DownloadData(urlToCrawl);
    string download = Encoding.ASCII.GetString(data);
    HashSet<string> list = new HashSet<string>();
    var doc = new HtmlDocument();
    doc.LoadHtml(download);
    HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//a[@href]");
    foreach (var n in nodes)
    {
        string href = n.Attributes["href"].Value;
        list.Add(GetAbsoluteUrlString(urlToCrawl, href));
    }
    return list.ToList();
}
Function to convert a relative URL to an absolute one:
static string GetAbsoluteUrlString(string baseUrl, string url)
{
    var uri = new Uri(url, UriKind.RelativeOrAbsolute);
    if (!uri.IsAbsoluteUri)
        uri = new Uri(new Uri(baseUrl), uri);
    return uri.ToString();
}
You can't get the complete URL because the href attribute doesn't contain the complete URL.
In your case the page contains relative URLs, so you need to do this:
string href = email + n.Attributes["href"].Value;
This way you will have the full URL. The better solution is to check whether the URL is relative or absolute and, only if it is relative, prepend email (the base URL) to it.
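For illustration, a small sketch of that check inside the foreach loop, reusing the email parameter from the question (it actually holds the page URL):
string href = n.Attributes["href"].Value;
var uri = new Uri(href, UriKind.RelativeOrAbsolute);
// Prepend the base URL only when the href is relative.
if (!uri.IsAbsoluteUri)
    uri = new Uri(new Uri(email), uri);
list.Add(uri.ToString());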

C# web scraper navigate to aspx link

I have a C# Windows Phone 8.1 app which I am building. Part of the app needs to go and look for information on a specific web page. One of the fields I need is a URL which can be found on certain items on the page; however, I am finding that the URL is in a relative-style format:
FullArticle.aspx?a=323495
I am wondering if there is a way in C#, using HtmlAgilityPack, HttpWebRequest, etc., to find the link to the actual page. A code snippet is below.
private static TileUpdate processSingleNewsItem(HtmlNode newsItemNode)
{
    Debug.WriteLine("");
    var articleImage = getArticleImage(getNode(newsItemNode, "div", "nw-container-panel-articleimage"));
    var articleDate = getArticleDate(getNode(newsItemNode, "div", "nw-container-panel-articledate"));
    var articleSummary = getArticleSummary(getNode(newsItemNode, "div", "nw-container-panel-textarea"));
    var articleUrl = getArticleUrl(getNode(newsItemNode, "div", "nw-container-panel-articleimage"));
    return new TileUpdate
    {
        Date = articleDate,
        Headline = articleSummary,
        ImagePath = articleImage,
        Url = articleUrl
    };
}
private static string getArticleUrl(HtmlNode parentNode)
{
    var imageNode = parentNode.Descendants("a").FirstOrDefault();
    Debug.WriteLine(imageNode.GetAttributeValue("href", null));
    return imageNode.GetAttributeValue("href", null);
}
private static HtmlNode getNode(HtmlNode parentNode, string nodeType, string className)
{
    var children = parentNode.Elements(nodeType).Where(o => o.Attributes["class"].Value == className);
    return children.First();
}
Would appreciate any ideas or solutions. Cheers!
In my web crawler here's what I do:
foreach (HtmlNode link in doc.DocumentNode.SelectNodes(@"//a[@href]"))
{
    HtmlAttribute att = link.Attributes["href"];
    if (att == null) continue;
    string href = att.Value;
    if (href.StartsWith("javascript", StringComparison.InvariantCultureIgnoreCase)) continue; // ignore javascript on buttons using a tags
    Uri urlNext = new Uri(href, UriKind.RelativeOrAbsolute);
    // Make it absolute if it's relative
    if (!urlNext.IsAbsoluteUri)
    {
        urlNext = new Uri(urlRoot, urlNext);
    }
    ...
}
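To tie this back to the question: urlRoot is the absolute Uri of the page the HTML came from. A hypothetical example with the relative link from the question (the base address is a placeholder):
Uri urlRoot = new Uri("http://www.example.com/news/");  // placeholder: page the HTML was scraped from
Uri articleUrl = new Uri("FullArticle.aspx?a=323495", UriKind.RelativeOrAbsolute);
if (!articleUrl.IsAbsoluteUri)
    articleUrl = new Uri(urlRoot, articleUrl);          // -> http://www.example.com/news/FullArticle.aspx?a=323495
Debug.WriteLine(articleUrl.AbsoluteUri);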

C# Web Crawler/Parser/Spider [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 9 years ago.
I'm new to C# and WinForms. I want to create a web crawler (parser) that can parse web pages and show them hierarchically, and I don't know how to make the bot crawl to a specific hyperlink depth.
So I think I have 2 questions:
How to make the bot crawl to a specified link depth?
How to show all hyperlinks hierarchically?
P.S. It would be great if there were code samples.
P.P.S. I have 1 button = button1 and 1 richtextbox = richTextBox1.
Here is my code (I know it's very ugly... all the code is in one button):
public partial class Form1 : Form
{
    public Form1()
    {
        InitializeComponent();
    }
    private void button1_Click(object sender, EventArgs e)
    {
        //Go to this URL (note: the URL must be set before the request is created):
        string url = UrlTextBox.Text = "http://www.yahoo.com";
        if (!(url.StartsWith("http://") || url.StartsWith("https://")))
            url = "http://" + url;
        //Declaration
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
        HttpWebResponse response = (HttpWebResponse)request.GetResponse();
        StreamReader sr = new StreamReader(response.GetResponseStream());
        Match m;
        string anotherTest = @"(((ht){1}tp[s]?://)[-a-zA-Z0-9@:%_\+.~#?&\\]+)";
        List<string> savedUrls = new List<string>();
        List<string> titles = new List<string>();
        //Scrape whole HTML code:
        string s = sr.ReadToEnd();
        try
        {
            // Get URLs:
            m = Regex.Match(s, anotherTest,
                RegexOptions.IgnoreCase | RegexOptions.Compiled,
                TimeSpan.FromSeconds(1));
            while (m.Success)
            {
                savedUrls.Add(m.Groups[1].ToString());
                m = m.NextMatch();
            }
            // Get TITLES:
            Match m2 = Regex.Match(s, @"<title>\s*(.+?)\s*</title>");
            if (m2.Success)
            {
                titles.Add(m2.Groups[1].Value);
            }
            //Show title:
            richTextBox1.Text += titles[0] + "\n";
            //Show URLs:
            TrimUrls(ref savedUrls);
        }
        catch (RegexMatchTimeoutException)
        {
            Console.WriteLine("The matching operation timed out.");
        }
        sr.Close();
    }
    private void TrimUrls(ref List<string> urls)
    {
        List<string> d = urls.Distinct().ToList();
        foreach (var v in d)
        {
            if (v.IndexOf('.') != -1 && v != "http://www.w3.org")
            {
                richTextBox1.Text += v + "\n";
            }
        }
    }
}
And one more question:
Does anybody know how to save it as XML, like a tree?
I would also highly recommend the Html Agility Pack.
With the Html Agility Pack you can do something like:
var doc = new HtmlDocument();
doc.LoadHtml(html);
var urls = new List<String>();
// ToList() is needed because SelectNodes returns an HtmlNodeCollection, which has no ForEach;
// restricting the XPath to a[@href] avoids a null "href" attribute lookup.
doc.DocumentNode.SelectNodes("//a[@href]").ToList().ForEach(x =>
{
    urls.Add(x.Attributes["href"].Value);
});
Edit:
You can do something like this, but please add some exception handling to it.
public class ParsResult
{
    public ParsResult Parent { get; set; }
    public String Url { get; set; }
    public Int32 Depth { get; set; }
}
__
private readonly List<ParsResult> _results = new List<ParsResult>();
private Int32 _maxDepth = 5;
public void Foo(String urlToCheck = null, Int32 depth = 0, ParsResult parent = null)
{
    if (depth >= _maxDepth) return;
    String html;
    using (var wc = new WebClient())
        html = wc.DownloadString(urlToCheck ?? parent.Url);
    var doc = new HtmlDocument();
    doc.LoadHtml(html);
    var aNods = doc.DocumentNode.SelectNodes("//a");
    if (aNods == null || !aNods.Any()) return;
    foreach (var aNode in aNods)
    {
        var url = aNode.Attributes["href"];
        if (url == null)
            continue;
        var result = new ParsResult
        {
            Depth = depth,
            Parent = parent,
            Url = url.Value
        };
        _results.Add(result);
        Console.WriteLine("{0} - {1}", depth, result.Url);
        Foo(depth: depth + 1, parent: result);
    }
}
If you need to parse such structured data (XHTML), have a look at XPath: http://msdn.microsoft.com/en-us/library/ms256086.aspx
(You should also put your logic into dedicated objects, not just leave it in the GUI layer. You will appreciate it later.)
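Regarding the follow-up question about saving the result as XML: a minimal sketch using System.Linq and System.Xml.Linq, assuming the _results list filled by Foo above (the file name is a placeholder):
XElement ToXml(ParsResult node)
{
    // Emit this link and, recursively, every link discovered beneath it.
    return new XElement("link",
        new XAttribute("url", node.Url ?? ""),
        new XAttribute("depth", node.Depth),
        _results.Where(r => r.Parent == node).Select(ToXml));
}

void SaveTree()
{
    // Top-level results have no parent; they form the roots of the tree.
    var roots = _results.Where(r => r.Parent == null).Select(ToXml);
    new XElement("crawl", roots).Save("crawl-tree.xml"); // placeholder file name
}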
