Extract HREF values from HTML string - C#

I am attempting to create a crawler that returns only the links from a website, and I have it to the point where it returns the HTML source.
I now want to use an if statement to check that the string is returned, and if it is, search it for all <a> tags and show me the href values.
However, I don't know what object to check or what value I should be checking for.
Here is what I have so far:
namespace crawler
{
    class Program
    {
        static void Main(string[] args)
        {
            System.Net.WebClient wc = new System.Net.WebClient();
            string WebData = wc.DownloadString("https://www.abc.net.au/news/science/");
            Console.WriteLine(WebData);
            // if
        }
    }
}

You can have a look at the HTML Agility Pack.
Then you can find all links from a web page like this:
var hrefs = new List<string>();
var hw = new HtmlWeb();
HtmlDocument document = hw.Load(/* your url here */);
foreach (HtmlNode link in document.DocumentNode.SelectNodes("//a[@href]"))
{
    HtmlAttribute attribute = link.Attributes["href"];
    if (!string.IsNullOrWhiteSpace(attribute.Value))
        hrefs.Add(attribute.Value);
}
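To connect this back to the question, here is a minimal sketch of the "check the string, then parse it" step; the null/empty check and the wiring are my own additions, not part of the answer above:
// Minimal sketch, assuming the WebClient download from the question
// and the HtmlAgilityPack parsing shown above.
string webData = new System.Net.WebClient().DownloadString("https://www.abc.net.au/news/science/");

if (!string.IsNullOrEmpty(webData))
{
    var document = new HtmlAgilityPack.HtmlDocument();
    document.LoadHtml(webData);

    // SelectNodes returns null when nothing matches, so guard against that too.
    var links = document.DocumentNode.SelectNodes("//a[@href]");
    if (links != null)
    {
        foreach (var link in links)
            Console.WriteLine(link.GetAttributeValue("href", ""));
    }
}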

Firstly, you can make a function to return the whole website's HTML code, as you have done. Here is the one I have:
public string GetPageContents()
{
    string link = "https://www.abc.net.au/news/science/";
    string pageContent = "";
    WebClient web = new WebClient();
    Stream stream;
    stream = web.OpenRead(link);
    using (StreamReader reader = new StreamReader(stream))
    {
        pageContent = reader.ReadToEnd();
    }
    stream.Close();
    return pageContent;
}
Then you could make a function that would return a substring or a List of substrings (meaning that if you wanted all <a> tags you would probably get more than one).
List<string> divTags = GetBetweenTags(pageContent, "<div>", "</div>");
This would give you a list where you could, for example, make another search for <a> tags inside each of those <div> tags, as in the sketch after the function below.
public List<string> GetBetweenTags(string pageContents, string startTag, string endTag)
{
    Regex rx = new Regex(startTag + "(.*?)" + endTag);
    MatchCollection col = rx.Matches(pageContents);
    List<string> tags = new List<string>();
    foreach (Match s in col)
        tags.Add(s.ToString());
    return tags;
}
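As a rough usage sketch of that approach, you could then pull href values out of each <div> substring; the GetHrefs helper and its regex below are my own illustration, not part of the original answer:
// Illustrative only: extracts href values from the <div> substrings
// returned by GetBetweenTags. Regex-based HTML parsing is fragile;
// HtmlAgilityPack (mentioned in the edit below) is more robust.
public List<string> GetHrefs(List<string> divTags)
{
    var hrefs = new List<string>();
    var hrefRegex = new Regex("href\\s*=\\s*\"(.*?)\"");
    foreach (string div in divTags)
    {
        foreach (Match m in hrefRegex.Matches(div))
            hrefs.Add(m.Groups[1].Value);
    }
    return hrefs;
}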
Edit: Wow, I didn't know of HTML Agility Pack. Thanks @Gauravsa, I'll update my project to use it!

Related

Reading Specific text from a website

I am trying to make a database, but I need to get info from a website: mainly the Title, Date, Length and Genre from the IMDB website. I have tried about 50 different things and it is just not working.
Here is my code.
public string GetName(string URL)
{
    HtmlWeb web = new HtmlWeb();
    HtmlDocument doc = web.Load(URL);
    var Attr = doc.DocumentNode.SelectNodes("//*[@id=\"overview - top\"]/h1/span[1]@itemprop")[0];
    return Name;
}
When I run this it just gives me an XPathException. I just want it to return the title of a movie. I am using this movie as an example and for testing, but I want it to work with all movies: http://www.imdb.com/title/tt0405422
I am using the HtmlAgilityPack.
The last bit of your XPath is not valid. Also, to get only a single element from the HtmlDocument you can use SelectSingleNode() instead of SelectNodes():
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load("http://www.imdb.com/title/tt0405422/");
var xpath = "//*[@id='overview-top']/h1/span[@class='itemprop']";
var span = doc.DocumentNode.SelectSingleNode(xpath);
var title = span.InnerText;
Console.WriteLine(title);
Output:
The 40-Year-Old Virgin
Demo link*: https://dotnetfiddle.net/P7U5A7
*) The demo shows that the correct title is printed, along with an error specific to .NET Fiddle (you can safely ignore the error).
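Wrapped back into your GetName method, a minimal sketch might look like this; the null check is my own addition, not part of the answer above:
// Sketch of the corrected method using SelectSingleNode and the fixed XPath.
public string GetName(string URL)
{
    HtmlWeb web = new HtmlWeb();
    HtmlDocument doc = web.Load(URL);
    var span = doc.DocumentNode.SelectSingleNode(
        "//*[@id='overview-top']/h1/span[@class='itemprop']");
    return span != null ? span.InnerText : null;
}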
I am making something similar, and this is my code which gets info from the imdb.com website:
string html = getUrlData(imdbUrl + "combined");
Id = match(@"<link rel=""canonical"" href=""http://www.imdb.com/title/(tt\d{7})/combined"" />", html);
if (!string.IsNullOrEmpty(Id))
{
    status = true;
    Title = match(@"<title>(IMDb \- )*(.*?) \(.*?</title>", html, 2);
    OriginalTitle = match(@"title-extra"">(.*?)<", html);
    Year = match(@"<title>.*?\(.*?(\d{4}).*?\).*?</title>", html);
    Rating = match(@"<b>(\d.\d)/10</b>", html);
    Genres = matchAll(@"<a.*?>(.*?)</a>", match(@"Genre.?:(.*?)(</div>|See more)", html));
    Directors = matchAll(@"<td valign=""top""><a.*?href=""/name/.*?/"">(.*?)</a>", match(@"Directed by</a></h5>(.*?)</table>", html));
    Cast = matchAll(@"<td class=""nm""><a.*?href=""/name/.*?/"".*?>(.*?)</a>", match(@"<h3>Cast</h3>(.*?)</table>", html));
    Plot = match(@"Plot:</h5>.*?<div class=""info-content"">(.*?)(<a|</div)", html);
    Runtime = match(@"Runtime:</h5><div class=""info-content"">(\d{1,4}) min[\s]*.*?</div>", html);
    Languages = matchAll(@"<a.*?>(.*?)</a>", match(@"Language.?:(.*?)(</div>|>.?and )", html));
    Countries = matchAll(@"<a.*?>(.*?)</a>", match(@"Country:(.*?)(</div>|>.?and )", html));
    Poster = match(@"<div class=""photo"">.*?<a name=""poster"".*?><img.*?src=""(.*?)"".*?</div>", html);
    if (!string.IsNullOrEmpty(Poster) && Poster.IndexOf("media-imdb.com") > 0)
    {
        Poster = Regex.Replace(Poster, @"_V1.*?.jpg", "_V1._SY200.jpg");
        PosterLarge = Regex.Replace(Poster, @"_V1.*?.jpg", "_V1._SY500.jpg");
        PosterFull = Regex.Replace(Poster, @"_V1.*?.jpg", "_V1._SY0.jpg");
    }
    else
    {
        Poster = string.Empty;
        PosterLarge = string.Empty;
        PosterFull = string.Empty;
    }
    ImdbURL = "http://www.imdb.com/title/" + Id + "/";
    if (GetExtraInfo)
    {
        string plotHtml = getUrlData(imdbUrl + "plotsummary");
    }
}
//Match single instance
private string match(string regex, string html, int i = 1)
{
    return new Regex(regex, RegexOptions.Multiline).Match(html).Groups[i].Value.Trim();
}

//Match all instances and return as ArrayList
private ArrayList matchAll(string regex, string html, int i = 1)
{
    ArrayList list = new ArrayList();
    foreach (Match m in new Regex(regex, RegexOptions.Multiline).Matches(html))
        list.Add(m.Groups[i].Value.Trim());
    return list;
}
Maybe you will find something useful in it.

Windows Form app find Link on Web

I need to create a method that finds the newest version of an application on a website (a Hudson server) and allows downloading it.
Until now I have used regex to scan all the HTML, find the href tags, and search for the string I'm looking for.
I want to know if there is a simpler way to do so.
I have attached the code I use today:
namespace SDKGui
{
    public struct LinkItem
    {
        public string Href;
        public string Text;

        public override string ToString()
        {
            return Href;
        }
    }

    static class LinkFinder
    {
        public static string Find(string file)
        {
            string t = null;
            List<LinkItem> list = new List<LinkItem>();

            // 1.
            // Find all matches in file.
            MatchCollection m1 = Regex.Matches(file, @"(<a.*?>.*?</a>)",
                RegexOptions.Singleline);

            // 2.
            // Loop over each match.
            foreach (Match m in m1)
            {
                string value = m.Groups[1].Value;
                LinkItem i = new LinkItem();

                // 3.
                // Get href attribute.
                Match m2 = Regex.Match(value, @"href=\""(.*?)\""",
                    RegexOptions.Singleline);
                if (m2.Success)
                {
                    i.Href = m2.Groups[1].Value;
                }

                // 4.
                // Remove inner tags from text.
                t = Regex.Replace(value, @"\s*<.*?>\s*", "",
                    RegexOptions.Singleline);
                if (t.Contains("hms_sdk_tool_"))
                {
                    i.Text = t;
                    list.Add(i);
                    break;
                }
            }
            return t;
        }
    }
}
It is easy to collect all href values and filter against any of your conditions using HtmlAgilityPack. The following method shows how to access a page, get all <a> tags, and return a list of all href values containing hms_sdk_tool_:
private List<string> HtmlAgilityCollectHrefs(string url)
{
    var webGet = new HtmlAgilityPack.HtmlWeb();
    var doc = webGet.Load(url);
    var a_nodes = doc.DocumentNode.SelectNodes("//a");
    return a_nodes.Select(p => p.GetAttributeValue("href", "")).Where(n => n.Contains("hms_sdk_tool_")).ToList();
}
Or, if you are interested in a single return string, use:
private string GetLink(string url)
{
    var webGet = new HtmlAgilityPack.HtmlWeb();
    var doc = webGet.Load(url);
    var a_nodes = doc.DocumentNode.SelectNodes("//a");
    return a_nodes.Select(p => p.GetAttributeValue("href", "")).Where(n => n.Contains("hms_sdk_tool_")).FirstOrDefault();
}
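A quick usage sketch of the method above; the URL and the download wiring are hypothetical placeholders of mine, not part of the answer:
// Hypothetical usage: find the newest SDK installer link on the build server page.
string pageUrl = "http://hudson.example.com/job/sdk/lastSuccessfulBuild/"; // placeholder URL
string downloadLink = GetLink(pageUrl);
if (!string.IsNullOrEmpty(downloadLink))
{
    Console.WriteLine("Latest build found: " + downloadLink);
    // e.g. new System.Net.WebClient().DownloadFile(downloadLink, "sdk_installer.exe");
}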

c# using html agility pack URI formats not supported

I am trying to use HTML Agility Pack to get my program to read in a file and get all the image srcs from it. Here's what I have so far:
private ArrayList GetImageLinks(String html, String link)
{
    //link = url of webpage
    //html = a string of the html, just for testing will remove after
    HtmlAgilityPack.HtmlDocument htmlDoc = new HtmlAgilityPack.HtmlDocument();
    htmlDoc.OptionFixNestedTags = true;
    htmlDoc.Load(link);
    List<String> imgs = (from x in htmlDoc.DocumentNode.Descendants()
                         where x.Name.ToLower() == "img"
                         select x.Attributes["src"].Value).ToList<String>();
    Console.Out.WriteLine("Hey");
    ArrayList imageLinks = new ArrayList(imgs);
    foreach (String element in imageLinks)
    {
        Console.WriteLine(element);
    }
    return imageLinks;
}
And this is the error I'm getting:
System.ArgumentException: URI formats are not supported.
HtmlDocument.Load() expects a file path or a stream, not a URL, which is why you get "URI formats are not supported". To download and parse a page from a URL, use HtmlWeb instead:
HtmlDocument docHtml = new HtmlWeb().Load(url);
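A minimal sketch of how the asker's method could be adapted; the simplified signature and the null guard on the src attribute are my own additions:
// Sketch: load the page with HtmlWeb (which accepts a URL) and collect img src values.
private List<string> GetImageLinks(string url)
{
    var doc = new HtmlAgilityPack.HtmlWeb().Load(url);
    return doc.DocumentNode.Descendants("img")
        .Where(x => x.Attributes["src"] != null)
        .Select(x => x.Attributes["src"].Value)
        .ToList();
}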

how to use html agility pack to extract all url from html text

Often I extract file names from HTML text data using regex, but I heard the HTML Agility Pack is good for parsing HTML data. How can I use HTML Agility Pack to extract all URLs from HTML data? Can anyone guide me with sample code? Thanks.
This is my current code sample, which works fine:
using System.Text.RegularExpressions;

private ArrayList GetFilesName(string Source)
{
    ArrayList arrayList = new ArrayList();
    Regex regex = new Regex("(?<=src=\")([^\"]+)(?=\")", RegexOptions.IgnoreCase);
    MatchCollection matchCollection = regex.Matches(Source);
    foreach (Match match in matchCollection)
    {
        if (!match.Value.StartsWith("http://"))
        {
            arrayList.Add(Path.GetFileName(match.Value));
        }
    }
    return arrayList;
}
private string ReplaceSrc(string Source)
{
    Regex regex = new Regex("(?<=src=\")([^\"]+)(?=\")", RegexOptions.IgnoreCase);
    MatchCollection matchCollection = regex.Matches(Source);
    foreach (Match match in matchCollection)
    {
        string value = match.Value;
        string str = string.Concat("images/", Path.GetFileName(value));
        Source = Source.Replace(value, str);
    }
    return Source;
}
Something like:
var doc = new HtmlDocument();
doc.LoadHtml(html);
var images = doc.DocumentNode.Descendants("img")
    .Where(i => i.GetAttributeValue("src", null) != null)
    .Select(i => i.Attributes["src"].Value);
This selects all the <img> elements from the document which have the src attribute set, and returns their URLs.
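A short usage example for the snippet above; the sample HTML string is a placeholder of my own:
// Illustrative input; any HTML string works here.
string html = "<div><img src=\"a.png\"><img alt=\"no src\"><img src=\"b.jpg\"></div>";
var doc = new HtmlDocument();
doc.LoadHtml(html);
var images = doc.DocumentNode.Descendants("img")
    .Where(i => i.GetAttributeValue("src", null) != null)
    .Select(i => i.Attributes["src"].Value);
foreach (var src in images)
    Console.WriteLine(src); // prints a.png and b.jpg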
Select all img tags with a non-empty src attribute (otherwise you will get a NullReferenceException when getting the attribute value):
HtmlDocument html = new HtmlDocument();
html.Load(path_to_file);
var urls = html.DocumentNode.SelectNodes("//img[@src!='']")
    .Select(i => i.Attributes["src"].Value);
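One caveat: SelectNodes returns null when no node matches, so a guard like the following sketch (my addition, not part of the answer above) avoids a NullReferenceException on pages without images:
var nodes = html.DocumentNode.SelectNodes("//img[@src!='']");
var urls = nodes != null
    ? nodes.Select(i => i.Attributes["src"].Value).ToList()
    : new List<string>();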

Substring without breaking html c#

Hi guys, I'm trying to take a description which has been entered in a WYSIWYG editor and take a substring of it, i.e.
This is some <span style="font-weight:bold;">text</span>
I'd like to limit the length of some descriptions without breaking the HTML; if I just take a substring and append "...", it breaks the HTML tags.
I've tried:
string HtmlSubstring(string html, int maxlength)
{
    string htmltag = "</?\\w+((\\s+\\w+(\\s*=\\s*(?:\".*?\"|'.*?'|[^'\">\\s]+))?)+\\s*|\\s*)/?>";
    string emptytags = "<(\\w+)((\\s+\\w+(\\s*=\\s*(?:\".*?\"|'.*?'|[^'\">\\s]+))?)+\\s*|\\s*)/?></\\1>";

    var expression = new Regex(string.Format("({0})|(.?)", htmltag));
    MatchCollection matches = expression.Matches(html);

    int i = 0;
    StringBuilder content = new StringBuilder();
    foreach (Match match in matches)
    {
        if (match.Value.Length == 1 && i < maxlength)
        {
            content.Append(match.Value);
            i++;
        }
        else if (match.Value.Length > 1)
        {
            content.Append(match.Value);
        }
    }
    return Regex.Replace(content.ToString(), emptytags, string.Empty);
}
but it doesn't quite get me there!
Use the HTML Agility Pack to load the HTML and then get InnerText.
var document = new HtmlDocument();
document.LoadHtml("...");
var text = document.DocumentNode.InnerText;
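For the original goal of a length-limited preview, a minimal sketch building on the above; the TruncateDescription helper and its ellipsis handling are my own illustration, not from the answer:
// Sketch: strip the markup via InnerText, then truncate the plain text safely.
// Note that InnerText may still contain HTML entities such as &amp;.
string TruncateDescription(string html, int maxLength)
{
    var document = new HtmlDocument();
    document.LoadHtml(html);
    string text = document.DocumentNode.InnerText;
    return text.Length <= maxLength ? text : text.Substring(0, maxLength) + "...";
}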
Also see C#: HtmlAgilityPack extract inner text
