Get all RSS links on a website - C#

I'm currently writing a very basic program that will first go through the HTML code of a website to find all RSS links, then put the RSS links into an array and parse the content of each link into an existing XML file.
However, I'm still learning C# and I'm not that familiar with all the classes yet. I have done all of this in PHP by writing my own class around file_get_contents(), and by using cURL to do the work. I managed to get it working in Java as well. Anyhow, I'm trying to accomplish the same results using C#, but I think I'm doing something wrong here.
TL;DR: What's the best way to write the regex to find all RSS links on a website?
So far, my code looks like this:
private List<string> getRSSLinks(string websiteUrl)
{
    List<string> links = new List<string>();
    MatchCollection collection = Regex.Matches(websiteUrl, @"(<link.*?>.*?</link>)", RegexOptions.Singleline);
    foreach (Match singleMatch in collection)
    {
        string text = singleMatch.Groups[1].Value;
        Match matchRSSLink = Regex.Match(text, @"type=\""(application/rss\+xml)\""", RegexOptions.Singleline);
        if (matchRSSLink.Success)
        {
            links.Add(text);
        }
    }
    return links;
}

Don't use Regex to parse HTML. Use an HTML parser instead. See this link for the explanation.
I prefer HtmlAgilityPack to parse HTML:
using (var client = new WebClient())
{
    HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
    doc.LoadHtml(client.DownloadString("http://www.xul.fr/en-xml-rss.html"));
    var rssLinks = doc.DocumentNode.Descendants("link")
        .Where(n => n.Attributes["type"] != null && n.Attributes["type"].Value == "application/rss+xml")
        .Select(n => n.Attributes["href"].Value)
        .ToArray();
}
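The same approach ports to HttpClient on newer frameworks; a minimal sketch, assuming .NET Core / .NET 5+ with async support and reusing the sample URL from above:
using System;
using System.Linq;
using System.Net.Http;
using System.Threading.Tasks;
using HtmlAgilityPack;

class RssLinkFinder
{
    static async Task Main()
    {
        // Download the page (same sample URL as the WebClient example above).
        using var client = new HttpClient();
        var html = await client.GetStringAsync("http://www.xul.fr/en-xml-rss.html");

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Same filter as above: <link type="application/rss+xml" href="...">
        var rssLinks = doc.DocumentNode.Descendants("link")
            .Where(n => n.GetAttributeValue("type", "") == "application/rss+xml")
            .Select(n => n.GetAttributeValue("href", ""))
            .ToArray();

        foreach (var link in rssLinks)
            Console.WriteLine(link);
    }
}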

Related

Crawling price gives null, HtmlAgilityPack (C#)

I'm trying to get stock data from a website with a web crawler as a hobby project. I got the link to work and I got the name of the stock, but I can't get the price... I don't know how to handle the HTML code. Here is my code:
var htmlDocument = new HtmlDocument();
htmlDocument.LoadHtml(html);
var divs = htmlDocument.DocumentNode.Descendants("div")
    .Where(n => n.GetAttributeValue("class", "").Equals("Flexbox__StyledFlexbox-sc-1ob4g1e-0 eYavUv Row__StyledRow-sc-1iamenj-0 foFHXj Rows__AlignedRow-sc-1udgki9-0 dnLFDN"))
    .ToList();
var stocks = new List<Stock>();
foreach (var div in divs)
{
    var stock = new Stock()
    {
        Name = div.Descendants("a")
            .Where(a => a.GetAttributeValue("class", "").Equals("Link__StyledLink-sc-apj04t-0 foCaAq NameCell__StyledLink-sc-qgec4s-0 hZYbiE"))
            .FirstOrDefault().InnerText,
        changeInPercent = div.Descendants("span")
            .Where(a => a.GetAttributeValue("class", "").Equals("Development__StyledDevelopment-sc-hnn1ri-0 kJLDzW"))
            .FirstOrDefault()?.InnerText
    };
    stocks.Add(stock);
}
foreach (var stock in stocks)
{
    Console.WriteLine(stock.Name + " ");
}
I got the name correctly, but I don't really know how to get the changeInPercent... I will paste in the HTML code below.
The top highlight shows where I got the name from, and the second one is the span I want. I want the -4.70.
I'm a little bit confused when it comes to getting the data with my code. I tried everything. My changeInPercent property is a string.
It has to be in the code somehow...
There's probably an easier way to select a single attribute/node than the way you're doing it right now:
If you know the exact XPath expression to select the node you're looking for, then you can do the following:
var htmlDocument = new HtmlDocument();
htmlDocument.LoadHtml(html);
var changeInPercent = htmlDocument.DocumentNode
    .SelectSingleNode("//foo/bar")
    .InnerText;
Getting the right XPath expression (the //foo/bar example above) is the tricky part, but it can be found quite easily using your browser's dev tools: navigate to the desired element and just copy its XPath expression - simple as that! See here for a sample of how to copy the expression.
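One caveat: styled-components class names like the ones in your markup have generated suffixes (the kJLDzW part) that can change between site builds, so a contains() match on the stable prefix may be more robust than an exact match. A sketch, assuming the class prefix from the question's HTML:
// Sketch: match on the stable prefix of the generated class name rather than
// the full string (the 'kJLDzW' suffix may change when the site is rebuilt).
var changeInPercent = htmlDocument.DocumentNode
    .SelectSingleNode("//span[contains(@class, 'Development__StyledDevelopment')]")
    ?.InnerText;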

C# grab URLs using HtmlAgilityPack

Okay, so I have this list of URLs on this webpage, and I am wondering how do I grab the URLs and add them to an ArrayList?
http://www.animenewsnetwork.com/encyclopedia/anime.php?list=A
I only want the URLs that are in the list; look at it to see what I mean. I tried doing it myself, and for whatever reason it grabs all of the other URLs except the ones I need.
http://pastebin.com/a7hJnXPP
Using Html Agility Pack
using (var wc = new WebClient())
{
    HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
    doc.LoadHtml(wc.DownloadString("http://www.animenewsnetwork.com/encyclopedia/anime.php?list=A"));
    var links = doc.DocumentNode.SelectSingleNode("//div[@class='lst']")
        .Descendants("a")
        .Select(x => x.Attributes["href"].Value)
        .ToArray();
}
If you want only the ones in the list, then the following code should work (this is assuming you have the page loaded into an HtmlDocument already)
List<string> hrefList = new List<string>(); // Make a list cause lists are cool.
foreach (HtmlNode node in animePage.DocumentNode.SelectNodes("//a[contains(@href, 'id=')]"))
{
    // Append animenewsnetwork.com to the beginning of the href value and add it
    // to the list.
    hrefList.Add("http://www.animenewsnetwork.com" + node.GetAttributeValue("href", "null"));
}
//a[contains(@href, 'id=')] Breaking this XPath down as follows:
//a Select all <a> nodes...
[contains(@href, 'id=')] ... that contain an href attribute that contains the text id=.
That should be enough to get you going.
As an aside, I would suggest not listing each link in its own message box, considering there are around 500 links on that page. 500 links = 500 message boxes :(

Agility Pack XPath Issue

I am attempting to use the HTML Agility Pack to look up specific keywords on Google, then check through the linked nodes until it finds my website's URL string, then parse the innerHTML of the node I am on for my Google ranking.
I am relatively new to the Agility Pack (as in, I started really looking through it yesterday), so I was hoping I could get some help on it. When I do the search below, I get failures on my XPath queries every time, even if I insert something as simple as SelectNodes("//*[@id='rso']"). Is this something I am doing incorrectly?
private void GoogleScrape(string url)
{
    string[] keys = keywordBox.Text.Split(',');
    for (int i = 0; i < keys.Count(); i++)
    {
        var raw = "http://www.google.com/search?num=100&q=";
        string search = raw + HttpUtility.UrlEncode(keys[i]);
        var webGet = new HtmlWeb();
        var document = webGet.Load(search);
        loadtimeBox.Text = webGet.RequestDuration.ToString();
        var ranking = document.DocumentNode.SelectNodes("//*[@id='rso']");
        if (ranking != null)
        {
            googleBox.Text = "Something";
        }
        else
        {
            googleBox.Text = "Fail";
        }
    }
}
It's not the Agility Pack's fault -- it is tricky Google's markup. If you inspect the _text property of the HtmlDocument with the debugger, you'll find that the <ol> that has id='rso' when you inspect it in a browser does not have any attributes for some reason.
I think in this case you can just search by "//ol", because there is only one <ol> tag in Google's result page at the moment...
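A minimal sketch of that fallback, assuming the single-<ol> observation above still holds:
// Sketch: fall back to the bare element selector, since the id attribute
// is missing from the HTML Google serves to non-browser clients.
var ranking = document.DocumentNode.SelectNodes("//ol");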
UPDATE: I've done further checks. For example when I do this:
using (StreamReader sr =
    new StreamReader(HttpWebRequest
        .Create("http://www.google.com/search?num=100&q=test")
        .GetResponse()
        .GetResponseStream()))
{
    string s = sr.ReadToEnd();
    var m2 = Regex.Matches(s, "\\sid=('[^']+'|\"[^\"]+\")");
    foreach (var x in m2)
        Console.WriteLine(x);
}
The only ids that are returned are: "sflas", "hidden_modes" and "tbpr_12".
To conclude: I've used the Html Agility Pack and it coped pretty well even with malformed HTML (unclosed <p> and even <li> tags, etc.).

Parse particular text from an XML string

I'm writing an app which reads an RSS feed and places items on a map.
I need to read the lat and long numbers only from this string:
http://www.digitalvision.se/feed.aspx?isAlert=true&lat=53.647351&lon=-1.933506
This is contained in the link tags.
I'm a bit of a programming noob, but I'm writing this in C#/Silverlight using LINQ to XML.
Should this text be extracted while parsing, or after parsing and sent to a class to do this?
Many thanks for your assistance.
EDIT: I'm going to try and do a regex on this.
This is where I need to integrate the regex somewhere in this code. I need to take the lat and long from the link element and separate them into two variables I can use (the results are part of a foreach loop that creates a list).
var events = from ev in document.Descendants("item")
             select new
             {
                 Title = ev.Element("title").Value,
                 Description = ev.Element("description").Value,
                 Link = ev.Element("link").Value,
             };
The question is, I'm not quite sure where to put the regex (once I work out how to use the regex properly! :-) )
Try this:
var url = "http://www.xxxxxxxxxxxxxx.co.uk/map.aspx?isTrafficAlert=true&lat=53.647351&lon=-1.93350";
var items = url.Split('?')[1]            // take the query string
    .Split('&')                          // split into key=value pairs
    .Select(i => i.Split('='))           // split each pair on '='
    .ToDictionary(o => o[0], o => o[1]); // key -> value lookup
var lon = items["lon"];
var lat = items["lat"];
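As for where to put it: the same split can run inside the query itself, so each event carries its parsed coordinates. A sketch, assuming the link format above and the document variable from your question (the ParseQuery helper name is made up here):
// Hypothetical helper wrapping the query-string split shown above.
static Dictionary<string, string> ParseQuery(string url) =>
    url.Split('?')[1]
       .Split('&')
       .Select(i => i.Split('='))
       .ToDictionary(o => o[0], o => o[1]);

var events = from ev in document.Descendants("item")
             let query = ParseQuery(ev.Element("link").Value)
             select new
             {
                 Title = ev.Element("title").Value,
                 Description = ev.Element("description").Value,
                 Lat = query["lat"],
                 Lon = query["lon"],
             };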
If you only need the Lat and Lon values and the feed is just one big XML string, you can do the whole thing with a regular expression.
var rssFeed = #"http://www.xxxxxxxxxxxxxx.co.uk/map.aspx?isTrafficAlert=true&lat=53.647351&lon=-1.933506
http://www.xxxxxxxxxxxxxx.co.uk/map.aspx?isTrafficAlert=true&lat=53.647352&lon=-1.933507
http://www.xxxxxxxxxxxxxx.co.uk/map.aspx?isTrafficAlert=true&lat=53.647353&lon=-1.933508
http://www.xxxxxxxxxxxxxx.co.uk/map.aspx?isTrafficAlert=true&lat=53.647354&lon=-1.933509";
var regex = new Regex(@"lat=(?<Lat>[+-]?\d*\.\d*)&lon=(?<Lon>[+-]?\d*\.\d*)");
var latLongPairs = new List<Tuple<decimal, decimal>>();
foreach (Match match in regex.Matches(rssFeed))
{
    var lat = Convert.ToDecimal(match.Groups["Lat"].Value);
    var lon = Convert.ToDecimal(match.Groups["Lon"].Value);
    latLongPairs.Add(new Tuple<decimal, decimal>(lat, lon));
}

Split HTML row into string array

I have data in an html file, in a table:
<table>
<tr><td>001</td><td>MC Hammer</td><td>Can't Touch This</td></tr>
<tr><td>002</td><td>Tone Loc</td><td>Funky Cold Medina</td></tr>
<tr><td>003</td><td>Funkdoobiest</td><td>Bow Wow Wow</td></tr>
</table>
How do I split a single row into an array or list?
string row = streamReader.ReadLine();
List<string> data = row.Split //... how do I do this bit?
string artist = data[1];
Short answer: never try to parse HTML from the wild with regular expressions. It will most likely come back to haunt you.
Longer answer: As long as you can absolutely, positively guarantee that the HTML that you are parsing fits the given structure, you can use string.Split() as Jenni suggested.
string html = "<tr><td>001</td><td>MC Hammer</td><td>Can't Touch This</td></tr>";
string[] values = html.Split(new string[] { "<tr>","</tr>","<td>","</td>" }, StringSplitOptions.RemoveEmptyEntries);
List<string> list = new List<string>(values);
Listing the tags independently keeps this slightly more readable, and the .RemoveEmptyEntries will keep you from getting an empty string in your list between adjacent closing and opening tags.
If this HTML is coming from the wild, or from a tool that may change - in other words, if this is more than a one-off transaction - I strongly encourage you to use something like the HTML Agility Pack instead. It's pretty easy to integrate, and there are lots of examples on the Intarwebs.
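For reference, a minimal sketch of what that could look like with the Html Agility Pack, assuming the table markup from the question is in an html string:
// Sketch with the Html Agility Pack, assuming the <table> markup
// from the question is in the 'html' string.
var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);

foreach (var row in doc.DocumentNode.SelectNodes("//table/tr"))
{
    // InnerText of each <td> cell, in document order.
    var cells = row.Elements("td").Select(td => td.InnerText).ToList();
    string artist = cells[1]; // e.g. "MC Hammer"
}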
If your HTML is well-formed you could use LINQ to XML:
string input = #"<table>
<tr><td>001</td><td>MC Hammer</td><td>Can't Touch This</td></tr>
<tr><td>002</td><td>Tone Loc</td><td>Funky Cold Medina</td></tr>
<tr><td>003</td><td>Funkdoobiest</td><td>Bow Wow Wow</td></tr>
</table>";
var xml = XElement.Parse(input);
// query each row
foreach (var row in xml.Elements("tr"))
{
    foreach (var item in row.Elements("td"))
    {
        Console.WriteLine(item.Value);
    }
    Console.WriteLine();
}
// if you really need a string array...
var query = xml.Elements("tr")
    .Select(row => row.Elements("td")
        .Select(item => item.Value)
        .ToArray());
foreach (var item in query)
{
    // foreach over item content
    // or access via item[0...n]
}
You could try:
Regex.Split(row, @"<tr><td>|</td><td>|</td></tr>")
But it depends on how regular the HTML is. Is it programmatically generated, or does a human write it? You should only use a regular expression if you're sure it will always be generated the same way; otherwise you should use a proper HTML parser.
When parsing HTML, I usually turn to the HTML Agility Pack.
