I'm using Argotic Syndication Framework for processing feeds.
But the problem is, if I pass Argotic a URL that is not a valid feed (for example, http://stackoverflow.com, which is an HTML page, not a feed), the program hangs (Argotic gets stuck in an infinite loop).
So, how do I check whether a URL points to a valid feed?
From .NET 3.5 on you can do the following. It will throw an exception if the document is not a valid feed.
using System.Diagnostics;
using System.ServiceModel.Syndication;
using System.Xml;

public bool TryParseFeed(string url)
{
    try
    {
        // SyndicationFeed.Load throws if the document is not a valid RSS/Atom feed.
        using (XmlReader reader = XmlReader.Create(url))
        {
            SyndicationFeed feed = SyndicationFeed.Load(reader);
            foreach (SyndicationItem item in feed.Items)
            {
                Debug.Print(item.Title.Text);
            }
        }
        return true;
    }
    catch (Exception)
    {
        return false;
    }
}
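For example (the URLs here are only illustrative):

// A real feed should return true; an ordinary HTML page should return false.
bool isFeed = TryParseFeed("http://stackoverflow.com/feeds");
bool notAFeed = TryParseFeed("http://stackoverflow.com");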
Or you can try parsing the document on your own:
string xml = "<?xml version=\"1.0\" encoding=\"utf-8\" ?>\n<event>This is a Test</event>";
XmlDocument xmlDoc = new XmlDocument();
xmlDoc.LoadXml(xml);
Then check the root element. For an Atom feed it should be a feed element in the "http://www.w3.org/2005/Atom" namespace:
<feed xmlns="http://www.w3.org/2005/Atom" xmlns:creativeCommons="http://backend.userland.com/creativeCommonsRssModule" xmlns:re="http://purl.org/atompub/rank/1.0">
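A minimal sketch of that check (LooksLikeFeed is just an illustrative name; it also accepts a plain rss root, since RSS 2.0 feeds don't use the Atom namespace):

using System.Xml;

public bool LooksLikeFeed(string xml)
{
    XmlDocument xmlDoc = new XmlDocument();
    try
    {
        xmlDoc.LoadXml(xml);
    }
    catch (XmlException)
    {
        return false; // not even well-formed XML
    }

    XmlElement root = xmlDoc.DocumentElement;
    // Atom feeds use <feed> in the Atom namespace; RSS 2.0 uses a plain <rss> root.
    return (root.LocalName == "feed" && root.NamespaceURI == "http://www.w3.org/2005/Atom")
        || root.LocalName == "rss";
}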
References:
http://msdn.microsoft.com/en-us/library/system.servicemodel.syndication.syndicationfeed.aspx
http://dotnet.dzone.com/articles/systemservicemodelsyndication
You can use the Feed Validation Service. It has a SOAP API.
You can check the content type. For a feed it is typically an XML type such as text/xml, application/rss+xml, or application/atom+xml. See this question to find the content type.
You can use this code:
var request = WebRequest.Create("http://www.google.com") as HttpWebRequest;
if (request != null)
{
    // Dispose of the response once we have read the header we need.
    using (var response = request.GetResponse() as HttpWebResponse)
    {
        string contentType = "";
        if (response != null)
            contentType = response.ContentType;
    }
}
(Thanks to the answer to this question.)
Update
To check if it is a feed address you can use the W3C Feed Validation service.
Update 2
As BurundukXP said, it has a SOAP API. To work with it, you can read the answer to this question.
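A rough sketch of calling the validator over HTTP (the output=soap12 parameter and the validity element are taken from the validator's documentation; verify against the current API description before relying on this):

using System;
using System.Linq;
using System.Net;
using System.Xml.Linq;

public static bool IsValidFeed(string feedUrl)
{
    // Ask the W3C feed validator for a SOAP 1.2 response.
    string checkUrl = "http://validator.w3.org/feed/check.cgi?url="
        + Uri.EscapeDataString(feedUrl) + "&output=soap12";
    using (var client = new WebClient())
    {
        string soap = client.DownloadString(checkUrl);
        XDocument doc = XDocument.Parse(soap);
        // Look for the validity element by local name to avoid hard-coding the namespace.
        XElement validity = doc.Descendants()
            .FirstOrDefault(e => e.Name.LocalName == "validity");
        return validity != null && validity.Value.Trim() == "true";
    }
}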
If you want to just have it transformed into valid RSS/ATOM, you can use http://feedcleaner.nick.pro/ to have it sanitized. Alternatively, you can fork the project.
Related
I'm getting geographic info from a webservice.
I've been trying to parse the returned data for hours, but have gotten nowhere.
HttpWebRequest request = WebRequest.Create(url) as HttpWebRequest;
using (HttpWebResponse response = request.GetResponse() as HttpWebResponse)
{
    StreamReader reader = new StreamReader(response.GetResponseStream());
    string result = reader.ReadToEnd();
    XDocument document = XDocument.Parse(result, LoadOptions.None);
I got this (the value of document, a System.Xml.Linq.XDocument):

<html>
  <body>
    <state>Apure</state>
    <municipality>RÓMULO GALLEGOS</municipality>
    <parish>URBANA ELORZA</parish>
    <street>La Trinidad De Arauca</street>
  </body>
</html>
I tried:

document.Elements("state")
document.Descendants("body")
document.GetElementsByTagName("state");

But nothing worked. I'm sure there is a simple way to do something so basic. I'm seriously considering converting it to a string and doing the parsing myself.
Additional consideration: the set of fields included in the result varies, because some entries don't have all the fields.
OK, I made a change: I read an XElement instead of an XDocument.
XElement sitemap = XElement.Parse(result, LoadOptions.None);
foreach (var bodyElement in sitemap.Elements("body"))
{
    foreach (var fieldElement in bodyElement.Elements())
    {
        Console.WriteLine(fieldElement.Name);
        Console.WriteLine(fieldElement.Value);
    }
}
There is probably a way to skip the first foreach, but I'm still looking for it.
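For what it's worth, a minimal sketch that skips the outer loop, assuming the document has a single body element:

// Element("body") returns the first matching child, or null if there is none.
var body = sitemap.Element("body");
if (body != null)
{
    foreach (var fieldElement in body.Elements())
    {
        Console.WriteLine(fieldElement.Name);
        Console.WriteLine(fieldElement.Value);
    }
}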
@Jonesy's line works, but that means I have to know the field names. This way I just create the info for whatever values I got.
I am trying to create a web request which sends XML via a POST call, and I would like to get the response back as XML.
I am having a little difficulty with the response XML, as I am a little unsure how to set that up in the code below. Here is my attempt:
    // Attempt to receive the WebResponse to the WebRequest.
    using (HttpWebResponse hwresponse = (HttpWebResponse)hwrequest.GetResponse())
    {
        statusCode = (int)hwresponse.StatusCode;
        if (hwresponse != null)
        {
            // If we have a valid WebResponse then read it.
            using (StreamReader reader = new StreamReader(hwresponse.GetResponseStream()))
            {
                // XPathDocument doc = new XPathDocument(reader);
                string responseString = reader.ReadToEnd();
                if (statusCode == 201)
                {
                    // var response = new XElement("Status",
                    //     new XElement("status_code", statusCode),
                    //     new XElement("resources_created",
                    //         new XElement("Link"),
                    //         new XElement("href"),
                    //         new XElement("title")
                    //     ),
                    //     new XElement("warnings")
                    // );
                    XmlDocument xmlDoc = new XmlDocument();
                    xmlDoc.Load(responseString);
                    XmlNodeList address = xmlDoc.GetElementsByTagName("Status");
                    responseData = xmlDoc.ToString();
                    reader.Close();
                }
            }
        }
        hwresponse.Close();
    }
}
catch (WebException e)
{
    if (e.Status == WebExceptionStatus.ProtocolError)
    {
        // XmlDocument xmlDoc = new XmlDocument();
        // XmlNodeList address = xmlDoc.GetElementsByTagName("Status", statusCode);
        // xmlDoc.Load(xmlDoc);
    }
    // if (e.Status == WebExceptionStatus.ProtocolError)
    // {
    //     responseData = "Status Code : {0}" + ((HttpWebResponse)e.Response).StatusCode
    //         + "Status Description : {0}" + ((HttpWebResponse)e.Response).StatusDescription;
    // }
}
I would like to get the response back in the following XML format:
<status>
  <status_code>201</status_code>
  <etag>12345678</etag>
  <resources_created>
    <link rel="http://api-info.com"
          href="http://api-info.com/tag/Some%20Tag"
          title="Subscriber Tag (Some Tag)" />
  </resources_created>
  <warnings>
    <warning>Some Warning Message</warning>
  </warnings>
</status>
I would also like to ask whether my 'StatusCode' handling should be set up with if conditions or with try/catch.
Any guide would be most helpful. Many thanks.
You may not have any control over what is sent to you, but you can ask for XML with an Accept header.
hwrequest.Accept = "application/xml";
However, you will have no control over the structure.
Yes, you should handle the response status (200, 201, 404, etc.) using if/else statements and NOT rely on try/catch to handle your logic. Try/catch is for error handling, not a place to handle regular application flow.
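A minimal sketch of that pattern, using the hwresponse variable from your code:

// Branch on the status code instead of letting exceptions drive the flow.
int statusCode = (int)hwresponse.StatusCode;
if (statusCode == 201)
{
    // Created: parse the response body.
}
else if (statusCode == 200)
{
    // OK: handle the normal case.
}
else
{
    // Anything else: log it or surface an error.
}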
For the web requests you are using an obsolete API. Unless there is a specific limitation that forces you to use HttpWebRequest and HttpWebResponse, you should use a newer (and simpler) API like WebClient or HttpClient (.NET 4.5 and later).
http://msdn.microsoft.com/en-us/library/system.net.webclient%28v=vs.110%29.aspx
http://msdn.microsoft.com/en-us/library/system.net.http.httpclient%28v=vs.118%29.aspx
For response handling I would advise using LINQ to XML instead of the old XmlDocument API.
If your response XML has the "status" element at the root of the XML document, then you can do:
var xmlDoc = XDocument.Load(reader);
var statusXml = xmlDoc.ToString();
If the "status" element is a children of another root XML element, then you can do:
var xmlDoc = XDocument.Load(reader);
var statusElement = xmlDoc.Root.Element("status");
var statusXml = statusElement.ToString();
If you still want to use the old HTTP API, you can get rid of
string responseString = reader.ReadToEnd();
and pass the StreamReader directly to the XDocument.Load method as in my example.
In case you upgrade your solution to use e.g. WebClient, you can use the DownloadString() method and then load the string result with XDocument.Parse() (XDocument.Load() expects a URI, stream, or reader rather than raw XML).
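A minimal sketch of that approach (the URL is a placeholder):

using System.Net;
using System.Xml.Linq;

using (var client = new WebClient())
{
    // DownloadString returns the response body as a string.
    string xml = client.DownloadString("http://example.com/api/resource");
    // Parse, not Load: Load expects a URI, stream, or reader.
    XDocument doc = XDocument.Parse(xml);
    XElement status = doc.Root.Element("status");
}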
I need access to the HTML of a Facebook page, to extract some data from it. So I need to create a WebRequest.
Example:
My code worked well for other sites, but for Facebook I must be logged in to be able to access the HTML.
How can I use Firefox data to create a WebRequest for a Facebook page?
I tried this:
List<string> HTML_code = new List<string>();
WebRequest request = WebRequest.Create(URL);
using (WebResponse response = request.GetResponse())
using (StreamReader stream = new StreamReader(response.GetResponseStream()))
{
    string line;
    while ((line = stream.ReadLine()) != null)
    {
        HTML_code.Add(line);
    }
}
...but the resulting HTML is that of the Facebook home page when I am not logged in.
If what you are trying to do is retrieve the number of likes from a Facebook page, you can use Facebook's Graph API service. Just to keep it simple, this is what I basically did in the code:
Retrieve the Facebook page's data. In this case I used the Coke page's data, since it was an example FB had listed.
Parse the returned JSON using Json.NET. There are other ways to do this, but this keeps it simple, and you can get Json.NET over at CodePlex. The documentation I looked at for my code was this page in the docs. Their documentation will also help you with parsing and serializing even more JSON if you need to.
That basically translates into the code below. Just note that I left out all the fancy exception handling to keep it simple, since networking is not always reliable! Also don't forget to include the Json.NET library in your project!
Usings:
using System.IO;
using System.Net;
using Newtonsoft.Json.Linq;
Code:
string url = "https://graph.facebook.com/cocacola";
WebClient client = new WebClient();
string jsonData = string.Empty;
// Load the Facebook page info
Console.WriteLine("Connecting to Facebook...");
using (Stream data = client.OpenRead(url))
{
using (StreamReader reader = new StreamReader(data))
{
jsonData = reader.ReadToEnd();
}
}
// Get number of likes from Json data
JObject jsonParsed = JObject.Parse(jsonData);
int likes = (int)jsonParsed.SelectToken("likes");
// Write out the result
Console.WriteLine("Number of Likes: " + likes);
I need help pulling RSS feeds from a Facebook page. I'm using the following code, but it keeps giving me an error:
string url = "https://www.facebook.com/feeds/page.php?id=40796308305&format=rss20";

XmlReaderSettings settings = new XmlReaderSettings
{
    XmlResolver = null,
    DtdProcessing = DtdProcessing.Parse,
};

XmlReader reader = XmlReader.Create(url, settings);
SyndicationFeed feed = SyndicationFeed.Load(reader);
foreach (var item in feed.Items)
{
    Console.WriteLine(item.Id);
    Console.WriteLine(item.Title.Text);
    Console.WriteLine(item.Summary.Text);
}
if (reader != null) reader.Close();
This code works perfectly with any blog or page RSS, but with the Facebook RSS it gives an exception with the following message:
The element with name 'html' and namespace 'http://www.w3.org/1999/xhtml' is not an allowed feed format.
Thanks
Facebook will return HTML in this instance because it doesn't like the User Agent supplied by XmlReader. Since you can't customize it, you will need a different solution to grab the feed. This should solve your problem:
var req = (HttpWebRequest)WebRequest.Create(url);
req.Method = "GET";
req.UserAgent = "Fiddler";
var rep = req.GetResponse();
var reader = XmlReader.Create(rep.GetResponseStream());
SyndicationFeed feed = SyndicationFeed.Load(reader);
This is strictly a behavior of Facebook, but the proposed change should work equally well for other sites that are okay with your current implementation.
It works when using Gregory's code above if you change the feed format to atom10 instead of rss20.
Change the URL:
string url =
"https://www.facebook.com/feeds/page.php?id=40796308305&format=atom10";
In my case the Facebook feed was also difficult to consume, so I tried FeedBurner to burn the feed for my Facebook page. FeedBurner generated the feed for me in Atom 1.0 format, and I then successfully :) consumed it with the System.ServiceModel.Syndication classes. My code was:
string Main()
{
    var url = "http://feeds.feedburner.com/Per.........all";
    Atom10FeedFormatter formatter = new Atom10FeedFormatter();
    using (XmlReader reader = XmlReader.Create(url))
    {
        formatter.ReadFrom(reader);
    }

    var s = "";
    foreach (SyndicationItem item in formatter.Feed.Items)
    {
        s += String.Format("[{0}][{1}] {2}", item.PublishDate, item.Title.Text,
            ((TextSyndicationContent)item.Content).Text);
    }
    return s;
}
Given a URL, I'd like to be able to capture the title of the page that URL points to, as well as other info, e.g. a snippet of text from the first paragraph on the page, and maybe even an image from the page.
Digg.com does this nicely when you submit a url.
How could something like this be done in .NET with C#?
You're looking for the HTML Agility Pack, which can parse malformed HTML documents.
You can use its HtmlWeb class to download a webpage over HTTP.
You can also download text over HTTP using .Net's WebClient class.
However, it won't help you parse the HTML.
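For instance, a minimal sketch with the Agility Pack (the URL is a placeholder):

using HtmlAgilityPack;

// Load the page and pull out the <title> text and the first paragraph.
// SelectSingleNode returns null if the node is missing, so guard before use.
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load("http://example.com/");

HtmlNode titleNode = doc.DocumentNode.SelectSingleNode("//title");
string title = titleNode != null ? titleNode.InnerText.Trim() : null;

HtmlNode firstParagraph = doc.DocumentNode.SelectSingleNode("//p");
string snippet = firstParagraph != null ? firstParagraph.InnerText : null;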
You could try something like this:
using System;
using System.Collections.Generic;
using System.IO;
using System.Net;
using System.Text;

namespace WebGet
{
    class progMain
    {
        static void Main(string[] args)
        {
            ASCIIEncoding asc = new ASCIIEncoding();
            WebRequest wrq = WebRequest.Create("http://localhost");
            WebResponse wrp = wrq.GetResponse();
            byte[] responseBuf = new byte[wrp.ContentLength];
            int status = wrp.GetResponseStream().Read(responseBuf, 0, responseBuf.Length);
            Console.WriteLine(asc.GetString(responseBuf));
        }
    }
}
Once you have the buffer, you can process it looking for paragraph or image HTML tags to extract portions of the returned data.
You can extract the title of a page with a function like the following. You would need to modify the regular expression to look for, say, the first paragraph of text, but since each page is different, that may prove difficult. You could, however, look for a meta description tag and take the value from that.
public static string GetWebPageTitle(string url)
{
    // Create a request to the url
    HttpWebRequest request = HttpWebRequest.Create(url) as HttpWebRequest;

    // If the request wasn't an HTTP request (like a file), ignore it
    if (request == null) return null;

    // Use the user's credentials
    request.UseDefaultCredentials = true;

    // Obtain a response from the server; if there was an error, return nothing
    HttpWebResponse response = null;
    try { response = request.GetResponse() as HttpWebResponse; }
    catch (WebException) { return null; }

    // Regular expression for an HTML title
    string regex = @"(?<=<title.*>)([\s\S]*)(?=</title>)";

    // If the correct HTML header exists for HTML text, continue
    if (new List<string>(response.Headers.AllKeys).Contains("Content-Type"))
        if (response.Headers["Content-Type"].StartsWith("text/html"))
        {
            // Download the page
            WebClient web = new WebClient();
            web.UseDefaultCredentials = true;
            string page = web.DownloadString(url);

            // Extract the title
            Regex ex = new Regex(regex, RegexOptions.IgnoreCase);
            return ex.Match(page).Value.Trim();
        }

    // Not a valid HTML page
    return null;
}
You could use Selenium RC (open source, www.seleniumhq.org) to parse data etc. from the pages. It is a web test automation tool with a C# .NET lib.
Selenium has a full API to read out specific items on an HTML page.
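A minimal sketch with the old Selenium RC client (it assumes a Selenium server running on localhost:4444; the URL is a placeholder and the locator syntax is Selenium's own):

using Selenium;

// Drive a real browser through a locally running Selenium RC server,
// then read the title and first paragraph out of the rendered page.
ISelenium selenium = new DefaultSelenium("localhost", 4444, "*firefox", "http://example.com/");
selenium.Start();
selenium.Open("/");
string title = selenium.GetTitle();
string firstParagraph = selenium.GetText("xpath=//p[1]");
selenium.Stop();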