c# using html agility pack URI formats not supported

c# using html agility pack URI formats not supported - c#

I am trying to use HTML agility pack to get my program to read in a file and get all the image srcs from it. Heres what I got so far:
private ArrayList GetImageLinks(String html,String link)
{
//link = url of webpage
//html = a string of the html, just for testing will remove after
HtmlAgilityPack.HtmlDocument htmlDoc = new HtmlAgilityPack.HtmlDocument();
htmlDoc.OptionFixNestedTags = true;
htmlDoc.Load(link);
List<String> imgs = (from x in htmlDoc.DocumentNode.Descendants()
where x.Name.ToLower() == "img"
select x.Attributes["src"].Value).ToList<String>();
Console.Out.WriteLine("Hey");
ArrayList imageLinks = new ArrayList(imgs);
foreach (String element in imageLinks)
{
Console.WriteLine(element);
}
return imageLinks;
}
And this is the error im getting:
System.ArgumentException: URI formats are not supported.

HtmlDocument docHtml = new HtmlWeb().Load(url);

Related

Scrape data with HtmlAgilityPack for a tag which doesn't have class

Here is my C# code what i am trying to do is to scrape data from a website by using HtmlAgilityPack but it's showing nothing found every time don't know what i am doing wrong a bit confused
HtmlAgilityPack.HtmlWeb webb = new HtmlAgilityPack.HtmlWeb();
ServicePointManager.SecurityProtocol = SecurityProtocolType.Tls12;
HtmlAgilityPack.HtmlDocument doc = webb.Load("mywebsite");
HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//ul[#class='unstyled']//li//a");
if (nodes != null)
{
foreach (HtmlNode n in nodes)
{
q = n.InnerText;
q = System.Net.WebUtility.HtmlDecode(q);
q = q.Trim();
Console.WriteLine(q);
}
}
else
{
Console.WriteLine("nothing found");
}
Here is the picture of the tag from which i am trying to capture data i need data from <a> tag .

The XPath used to select the tag is incorrect.
HtmlNodeCollection nodes =
doc.DocumentNode.SelectNodes("//ul[#class='unstyled']/li/a");
This should select all the anchor nodes and then you can loop through the nodes to get the InnerHtml.
Working sample shown below
string s = "<ul class='unstyle no-overflow'><li><ul class='unstyled'><li><a href='http://www.smsconnexion.com'>SMS ConneXion</a></li></ul><ul class='unstyled'><li><a href='http://www.celusion.com'>Celusion</a></li></ul></li></ul>";
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(s);
HtmlNodeCollection nodes =
doc.DocumentNode.SelectNodes("//ul[#class='unstyled']/li/a");
foreach(var node in nodes)
{
Console.WriteLine(node.Attributes["href"].Value);
}
Console.ReadLine();

Extract HREF values from HTML string

I am attempting to create a crawler that returns only links from a website and i have it to a point that it returns the HTML script.
I am now wanting to use an if statement to check that the string is returned and if it is returned, it searches for all "< a >" tags and shows me the href link.
but I don't know what object to check or what value I should be checking for.
Here is what I have so far:
namespace crawler
{
class Program
{
static void Main(string[] args)
{
System.Net.WebClient wc = new System.Net.WebClient();
string WebData wc.DownloadString("https://www.abc.net.au/news/science/");
Console.WriteLine(WebData);
// if
}
}
}

You can have a look at HTML Agility Pack:
Then you can find all links from a web page like:
var hrefs = new List<string>();
var hw = new HtmlWeb();
HtmlDocument document = hw.Load(/* your url here */);
foreach(HtmlNode link in document.DocumentNode.SelectNodes("//a[#href]"))
{
HtmlAttribute attribute = link.Attributes["href"];
if (!string.IsNullOrWhiteSpace(attribute.Value))
hrefs.Add(attribute.Value);
}

Firstly you can make a function to return the whole website HTML code as you have done. Here is the one I have!
public string GetPageContents()
{
string link = "https://www.abc.net.au/news/science/"
string pageContent = "";
WebClient web = new WebClient();
Stream stream;
stream = web.OpenRead(link);
using (StreamReader reader = new StreamReader(stream))
{
pageContent = reader.ReadToEnd();
}
stream.Close();
return pageContents;
}
Then you could make a function that would return a substring or a List of substring (meaning that if you wanted all < a > tags you would probably get more than one).
List<string> divTags = GetBetweenTags(pageContents, "<div>", "</div>")
This would give you a list where you could, for example, make another search for < a > tags inside each of those < div > tags.
public List<string> GetBetweenTags(string pageContents, string startTag, string endTag)
{
Regex rx = new Regex(startTag + "(.*?)" + endTag);
MatchCollection col = rx.Matches(value);
List<string> tags = new List<string>();
foreach(Match s in col)
tags.Add(s.ToString());
return tags;
}
Edit: Wow didn't know of HTML Agility Pack, thanks #Gauravsa i'll update my project to use it!

how to get data from exact chtml class agility pack

I want to extract not the whole web-page but only text from one class, I want text from td class="result-neutral" and I don't know what is wrong with this code:
<td class="result-neutral" xseid="xz1nBfht">3 - 2 </td>
And this is C# code:
HtmlAgilityPack.HtmlDocument doc = new HtmlDocument();
HtmlWeb hw = new HtmlWeb();
doc = hw.Load("htt
var scoreNodes = doc.DocumentNode.Descendants("td").Where(d =>d.Attributes.Contains("class")&&d.Attributes["class"].Value.Contains("result-neutral"));
foreach (var item in scoreNodes)
{
result += item.OuterHtml + Environment.NewLine;
}
Info.Text = result;
}

The OuterHtml returns html with start & end of the element. Don't you want InnerHtml or InnerText?
EDIT:
This snippet works for me:
const string html = #"<html><body><table><tr><td class='result-neutral' xseid='xz1nBfht'><a href='/hockey/russia/khl/ska-st-petersburg-metallurg-magnitogorsk-xz1nBfht/'>3 - 2</a></td></tr></table></body></html>";
var doc = new HtmlDocument();
doc.LoadHtml(html);
var scoreNodes = doc.DocumentNode.Descendants("td").Where(d => d.Attributes.Contains("class") && d.Attributes["class"].Value.Contains("result-neutral"));
string result = "";
foreach (var item in scoreNodes) {
result += item.InnerText + Environment.NewLine;
}
result = result.TrimEnd(); // the result is "3-2"

How to select specific node using Html Agility Pack?

<div class="form-field wide-80 normal">1997-09-15</div>
I am trying to select the date inside it 1997-09-15. I tried this code but its giving an error of "Xpath Exception was Unhandled" what's wrong in the code please Help
string Url = "http://whois.domaintools.com/google.com";
HtmlWeb web = new HtmlWeb();
HtmlAgilityPack.HtmlDocument doc = web.Load(Url);
var SpanNodes = doc.DocumentNode.SelectNodes("//div[#class=form-field wide-80 normal]");
if (SpanNodes != null)
{
foreach (HtmlNode SN in SpanNodes)
{
string text = SN.FirstChild.InnerText.Trim();
MessageBox.Show(text);
}
}

You forget 's
var SpanNodes =
doc.DocumentNode.SelectNodes("//div[#class='form-field wide-80 normal']");

how to use html agility pack to extract all url from html text

Often I extract file names from html text data using regex but I heard the html agility pack is good for parsing html data. how can I use html agility pack to extract all url from html data. Can any one guide me with sample code. Thanks.
This is my code sample which works fine.
using System.Text.RegularExpressions;
private ArrayList GetFilesName(string Source)
{
ArrayList arrayList = new ArrayList();
Regex regex = new Regex("(?<=src=\")([^\"]+)(?=\")", 1);
MatchCollection matchCollection = regex.Matches(Source);
foreach (Match match in matchCollection)
{
if (!match.get_Value().StartsWith("http://"))
{
arrayList.Add(Path.GetFileName(match.get_Value()));
}
match.NextMatch();
}
ArrayList arrayList1 = arrayList;
return arrayList1;
}
private string ReplaceSrc(string Source)
{
Regex regex = new Regex("(?<=src=\")([^\"]+)(?=\")", 1);
MatchCollection matchCollection = regex.Matches(Source);
foreach (Match match in matchCollection)
{
string value = match.get_Value();
string str = string.Concat("images/", Path.GetFileName(value));
Source = Source.Replace(value, str);
match.NextMatch();
}
string source = Source;
return source;
}

Something like:
var doc = new HtmlDocument();
doc.LoadHtml(html);
var images = doc.DocumentNode.Descendants("img")
.Where(i => i.GetAttributeValue("src", null) != null)
.Select(i => i.Attributes["src"].Value);
This selects all the <img> elements from the document which have src property set, and return these URLs.

Select all img tags with non-empty src attribute (otherwise you will get NullReferenceException during getting attribute value):
HtmlDocument html = new HtmlDocument();
html.Load(path_to_file);
var urls = html.DocumentNode.SelectNodes("//img[#src!='']")
.Select(i => i.Attributes["src"].Value);

Develop Reference

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.

c# using html agility pack URI formats not supported - c#

HtmlDocument docHtml = new HtmlWeb().Load(url);

Related

Scrape data with HtmlAgilityPack for a tag which doesn't have class

Extract HREF values from HTML string

how to get data from exact chtml class agility pack

How to select specific node using Html Agility Pack?

how to use html agility pack to extract all url from html text

Categories

Resources