I'm currently working on an ASP.NET website where I want to retrieve data from an RSS feed. I can easily retrieve the data I want and show it in, e.g., a Repeater control.
My problem is that the blog (WordPress) I'm getting the RSS from uses \n for line breaks, which I obviously can't use in HTML. I need to replace each \n with a <br /> tag.
What I've done so far is:
SyndicationFeed myFeed = SyndicationFeed.Load(XmlReader.Create("urltofeed/"));
IEnumerable<SyndicationItem> items = myFeed.Items;
foreach (SyndicationItem item in items)
{
    Feed f = new Feed();
    f.Content = f.ConvertLineBreaks(item.Summary.Text);
    f.Title = item.Title.Text;
    feedList.Add(f);
}
rptEvents.DataSource = feedList;
rptEvents.DataBind();
I also have a Feed class with two properties, Title and Content, and a helper method that replaces \n with <br />.
However, I'm not sure whether this is a good or clean approach to getting data from an RSS feed.
Thanks in advance,
Bo
If you are averse to all the XML parsing in your code, you can also run the RSS XML schema through xsd and generate topic and feed classes in your code.
These classes should serialize to and deserialize from XML. This may be overkill, but it has worked great for me when integrating with a standard XML API from a third party.
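As a rough illustration of that serialization approach, here is a minimal sketch using XmlSerializer. The classes RssDocument, RssChannel and RssItem are hand-written stand-ins for what a code generator might produce; their names and the small subset of RSS elements they cover are assumptions for this example, not actual xsd output.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml.Serialization;

// Hypothetical, hand-written stand-ins for generated classes (assumption).
[XmlRoot("rss")]
public class RssDocument
{
    [XmlElement("channel")]
    public RssChannel Channel { get; set; }
}

public class RssChannel
{
    [XmlElement("title")]
    public string Title { get; set; }

    // XmlElement on a list maps each repeated <item> element to one entry.
    [XmlElement("item")]
    public List<RssItem> Items { get; set; }
}

public class RssItem
{
    [XmlElement("title")]
    public string Title { get; set; }

    [XmlElement("description")]
    public string Description { get; set; }
}

public static class RssSerializerDemo
{
    // Deserializes an RSS 2.0 document held in a string.
    public static RssDocument Parse(string xml)
    {
        var serializer = new XmlSerializer(typeof(RssDocument));
        using (var reader = new StringReader(xml))
        {
            return (RssDocument)serializer.Deserialize(reader);
        }
    }
}
```

Elements that the classes do not declare are skipped by the serializer, so a sketch like this only picks up the fields it names.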
Does it have anything to do with the type of rss feed you're consuming?
http://codex.wordpress.org/WordPress_Feeds
If the RSS feed does not return the content formatted the way you want, you have little other choice. That's definitely not a particularly bad way.
This is a sane approach in my opinion.
Remember that RSS is a data format and not HTML. Replacing \n with <br /> in order to get the HTML you want is just something that has to be done every now and then (unless you want to use <pre>).
I find people stuffing way too much html in xml structures, not considering they will be used for other things than web pages. UI and data should be separated.
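The line-break replacement discussed in this thread could look roughly like the sketch below. The class name FeedText is hypothetical, and the HTML-encoding step assumes the summary is plain text; if the feed already delivers HTML markup, that step would have to be dropped.

```csharp
using System;
using System.Net;

public static class FeedText
{
    // Hypothetical helper: encodes plain text for HTML, then turns
    // every kind of line break into a <br /> tag.
    public static string ConvertLineBreaks(string text)
    {
        if (string.IsNullOrEmpty(text))
        {
            return string.Empty;
        }

        // Assumption: the input is plain text, so <, > and & are escaped first.
        string encoded = WebUtility.HtmlEncode(text);

        // Normalize Windows and old Mac line endings to \n before replacing.
        return encoded
            .Replace("\r\n", "\n")
            .Replace("\r", "\n")
            .Replace("\n", "<br />");
    }
}
```

Keeping the conversion in a helper like this leaves the feed data itself untouched, so the same items can still be used outside a web page.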
I am learning screen-scraping using C#, and I was wondering
how I can separate out certain pieces of the gathered HTML.
I am using the HtmlAgilityPack and ScrapySharp libraries for scraping, and with this code I can retrieve an HTML page:
WebPage PageResult = Browser.NavigateToPage(new Uri("localhost"));
Console.WriteLine(PageResult);
Of course I get back the whole source code with all the markup and mishmash, but what if I only wanted to capture the data between <h2></h2> tags and omit everything else?
My very simple-minded pseudo code would be:
If result reads h2
Trim all behind
start writing out after
If result reads /h2
stop writing
Trim anything that comes after
The main question I have is how to express the rule that when I read h2 I trim everything before it, write out the data after it, and when /h2 appears, stop and trim the rest of the result.
There are a few ways you can achieve this. One would be to read the page as XML and parse out the data you are looking for.
This can be done with, for example:
XElement
XmlElement
XDocument
etc.
The second way would be to use a third-party library like HtmlAgilityPack, which also supports XPath:
var nodes = doc.DocumentNode.SelectNodes("//form//input");
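Applied to the <h2> question, the XPath approach might look like the following sketch. It assumes the HtmlAgilityPack package is referenced; the class and method names (HeadingExtractor, ExtractH2Text) are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class HeadingExtractor
{
    // Returns the text of every <h2> element in the given HTML string.
    public static List<string> ExtractH2Text(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var result = new List<string>();

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//h2");
        if (nodes == null)
        {
            return result;
        }

        foreach (HtmlNode node in nodes)
        {
            // InnerText drops nested tags and keeps only the text content.
            result.Add(node.InnerText.Trim());
        }
        return result;
    }
}
```

With this, there is no need to trim the string before and after each tag by hand; the parser locates the elements and everything else is simply never selected.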
Can you tell me how to parse data from this link?
http://www.e1.ru/business/job/resume.detail.php?id=956004
I tried something like this
var nodes = doc.DocumentNode.SelectNodes("/html[1]/body[1]/table[5]/tbody[1]/tr[1]/td[2]/table[4]/tbody[1]/tr[1]/td[1]/table[1]");
but it is not a good solution.
Abbath, I recommend using a third-party tool that can extract data from HTML, such as eGrabber, RChilli and many more, and then picking out the data you need.
If you are looking for your own solution, index the complete text, load it as XML, study the DOM structure, and pick out the values you need.
I am currently creating an RSS feed linked to a custom-built news column. The news column uses a series of query strings to direct the user to a specific post or posts. However, the problem I am facing is that the RSS feed is replacing some of these query strings with random numbers. For instance:
http://www.correlatesearch.com/news.aspx?cat=BusinessManagementControls&nw=
&nw= is being replaced with an escaped form of the ampersand (a numeric character reference such as &#38;).
Can anyone direct me to a way around this?
Many thanks!
My guess is that you're looking at the raw RSS, which is XML. Within XML, & has to be escaped as &amp;. This is far from "random numbers".
I suspect you'll find that &nw= is actually being escaped to &amp;nw=, in which case it's not actually changing your content at all. It's representing the text of your URL in an XML-appropriate way. When the XML is read by a client, it will (or should) understand it appropriately.
Is the feed an XML document? If so, the replacement is supposed to take place. It is called escaping with character entities.
And I don't see any "random numbers" that you referred to...
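To see the escaping round-trip in isolation, here is a small sketch using LINQ to XML. The element name link and the class XmlEscapeDemo are chosen only for illustration; the URL is the one from the question.

```csharp
using System;
using System.Xml.Linq;

public static class XmlEscapeDemo
{
    // Serializes a URL as the text of a <link> element; the XML writer
    // escapes the ampersand automatically.
    public static string ToXml(string url)
    {
        return new XElement("link", url).ToString();
    }

    // Parses the element back; the parser undoes the escaping.
    public static string FromXml(string xml)
    {
        return XElement.Parse(xml).Value;
    }
}
```

For the URL http://www.correlatesearch.com/news.aspx?cat=BusinessManagementControls&nw= the serialized text contains &amp;nw=, and reading it back yields the original &nw= unchanged.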
I'm trying to use the SyndicationFeed class to get content from a feed. For some feeds it works fine, but for others it doesn't.
For example, when I load the feed from http://www.alistapart.com/site/rss , LastUpdatedTime has a value of 01-01-0001 for all feed items and for the feed itself.
Is there something that I need to do? Or is it that SyndicationFeed can't read all the information from every website?
Some sample code that I'm using:
var feed = SyndicationFeed.Load(XmlReader.Create("http://www.alistapart.com/site/rss"));
var feedPosts = feed.Items; // here all feedPosts have the invalid LastUpdatedTime, but if i go to the website i can see that there is actually one
You are looking at LastUpdatedTime, while the date in the RSS you mentioned is neither LastUpdatedTime nor the more common pubDate. It is carried in an element from a different namespace; note the namespace, which is "http://purl.org/dc/elements/1.1/" (Dublin Core).
Most of these elements are optional and you must be able to live without them or use alternative ones.
I have created podcast software, and I have found the SyndicationFeed implementation very poor and brittle when dealing with the various dates that exist in the real world.
UPDATE
"is there a way to use the framework's classes to parse this non standard attributes?"
Yes, have a look at Element Extensions.
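A minimal sketch of reading such a date through the element extensions follows. It assumes the feed carries the date in a Dublin Core date element (dc:date); the class name DublinCoreDates and the method ReadDcDate are hypothetical.

```csharp
using System;
using System.Globalization;
using System.Linq;
using System.ServiceModel.Syndication;

public static class DublinCoreDates
{
    private const string DcNamespace = "http://purl.org/dc/elements/1.1/";

    // Returns the item's Dublin Core date, or null when the element
    // is missing or cannot be parsed. Assumes the element is named "date".
    public static DateTimeOffset? ReadDcDate(SyndicationItem item)
    {
        // Elements the formatter does not recognize end up in ElementExtensions.
        string raw = item.ElementExtensions
            .ReadElementExtensions<string>("date", DcNamespace)
            .FirstOrDefault();

        DateTimeOffset parsed;
        if (raw != null &&
            DateTimeOffset.TryParse(raw, CultureInfo.InvariantCulture,
                                    DateTimeStyles.AssumeUniversal, out parsed))
        {
            return parsed;
        }
        return null;
    }
}
```

A caller could fall back to this value when LastUpdatedTime and PublishDate are both left at their default, which keeps the optional-element handling in one place.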
I'm currently playing around with a CMS idea I've got. It's based on a MonoRail, NHibernate stack. I know there are already a million CMS solutions out there. This is more for my benefit for trying some new stuff out.
Anyway, the admin side of things is going well with a plugin architecture in full flow, however I've hit a bit of a road block with the front end template management side of things.
What I'm wanting to do is allow developers to write their own custom tags e.g.
<cms:news>
<h1><cms:news:title /></h1>
<p><cms:news:date /></p>
<cms:news:story />
</cms:news>
I believe this will give developers a great deal of flexibility.
The part I'm struggling with is the parsing of these tags. I could use reflection, however I'm worried that this may be quite intensive for every page. Has anyone else done something like this, that has a better solution?
Sorry for the lack of info guys. Here is a bit more info for you.
The above code would sit in a "page" in the CMS. The complete page markup would simply be a DB record.
Once the parser hits these tags, it would need to process them to convert them to content. In the example above, the parser would hit the cms:news tag and make a call to a function like this
public void news()
{
    // Get all of the news articles from the database
}
The cms:news:title (or cms:news.title) tag would call a function like this
public string newstitle()
{
    // Return the news title for the current news element we are rendering
}
Hope this makes more sense now
Thanks
John
I think I've been looking at this all wrong.
I could basically do this by using something like the Spark View Engine's InMemoryViewFolder and using ViewComponents for the custom tags.
The tags you're considering are not valid XML: you can't have multiple colons in an element name (only one, to separate the namespace prefix from the local name).
Consider this instead :
<cms:news>
<h1><cms:news.title /></h1>
<p><cms:news.date /></p>
<cms:news.story />
</cms:news>
To parse this XML, there are a number of options available to you :
XmlReader
XmlDocument
XDocument (Linq to XML)
I don't think XML serialization is an option if the tags are customizable...
Anyway, I'm not sure what you're trying to achieve exactly... What would you do with those tags ? Could you be more specific in your question ?
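One way the XDocument (LINQ to XML) option could be used for these tags is sketched below. Instead of reflection, it looks tag names up in a dictionary of delegates; that lookup is a technique substituted here for illustration, not something proposed in the thread. The namespace URI urn:example-cms, the class CmsTagRenderer and the handler names are all assumptions, and only leaf tags such as cms:news.title are replaced.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class CmsTagRenderer
{
    // Hypothetical namespace URI bound to the cms prefix.
    private const string CmsUri = "urn:example-cms";

    // Replaces every childless cms:* element whose local name has a
    // registered handler with the text that handler returns.
    public static string Render(string template, IDictionary<string, Func<string>> handlers)
    {
        XNamespace cms = CmsUri;

        // Wrap the fragment in a root element that declares the cms prefix.
        XElement root = XElement.Parse(
            "<root xmlns:cms=\"" + CmsUri + "\">" + template + "</root>");

        // ToList() takes a snapshot so the tree can be modified while looping.
        foreach (XElement tag in root.Descendants()
                                     .Where(e => e.Name.Namespace == cms)
                                     .ToList())
        {
            Func<string> handler;
            if (!tag.HasElements && handlers.TryGetValue(tag.Name.LocalName, out handler))
            {
                tag.ReplaceWith(new XText(handler()));
            }
        }

        // Emit the children of the wrapper, not the wrapper itself.
        return string.Concat(
            root.Nodes().Select(n => n.ToString(SaveOptions.DisableFormatting)));
    }
}
```

Called with a template such as <h1><cms:news.title /></h1> and a handler registered under "news.title", the method returns the heading with the tag replaced by the handler's text. Container tags like cms:news, which would need to loop over records, are left out of this sketch.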