Consuming data from WikiNews

I have been scouring the net but can't find any examples of consuming data from WikiNews. They have an RSS feed with links to individual stories as HTML, but I would like to get the data in a structured format such as XML.
By structured format I mean an XML file for each story that conforms to a defined XML schema (XSD). See: http://www.w3schools.com/schema/schema_intro.asp
Has anyone written a program that consumes stories from WikiNews? Do they have a documented API?
I would like to use C# to collect selected stories and store them in SQL Server 2008.

The software they use has an API but I'm not sure if WikiNews supports it.
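As a rough sketch of what calling that API could look like: Wikinews runs on MediaWiki, whose Action API is normally served from /w/api.php. The endpoint URL below is an assumption based on that convention, and the class and method names are made up for illustration. Note that a query like this returns the page's wikitext wrapped in XML, not a schema-defined XML document per story.

```csharp
using System;

public static class WikinewsApi
{
    // Assumed endpoint: the conventional MediaWiki location, not confirmed by the answer above.
    private const string Endpoint = "https://en.wikinews.org/w/api.php";

    // Builds a query URL asking for the current wikitext of one page, returned as XML.
    public static string BuildPageQuery(string pageTitle)
    {
        return Endpoint
            + "?action=query&prop=revisions&rvprop=content&format=xml&titles="
            + Uri.EscapeDataString(pageTitle);
    }
}
```

The resulting URL could then be fetched with WebClient.DownloadString and the response parsed with XDocument.Parse.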

Their feed: http://feeds.feedburner.com/WikinewsLatestNews
If you put that in your browser and read the source, you'll see that it is XML. The XML contains the title, description, a link, etc. Only the description is in HTML.
Here is the beginning of the response:
<rss xmlns:creativeCommons="http://backend.userland.com/creativeCommonsRssModule" xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0" version="2.0">
<channel>
<title>Wikinews</title>
<description>Wikinews RSS feed</description>
<language>en</language>
<link>http://en.wikinews.org</link>
<copyright>Creative Commons Attribution 2.5 (unless otherwise noted)</copyright>
<generator>Wikinews Fetch</generator>
<ttl>180</ttl>
<docs>http://none</docs>
<atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="self" type="application/rss+xml" href="http://feeds.feedburner.com/WikinewsLatestNews" /><feedburner:info uri="wikinewslatestnews" /><atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="hub" href="http://pubsubhubbub.appspot.com/" /><creativeCommons:license>http://creativecommons.org/licenses/by/2.5/</creativeCommons:license><feedburner:browserFriendly>This is an XML content feed. It is intended to be viewed in a newsreader or syndicated to another site.</feedburner:browserFriendly><item>
<title>Lufthansa pilots begin strike</title>
<link>http://feedproxy.google.com/~r/WikinewsLatestNews/~3/1K2xloPGlmI/Lufthansa_pilots_begin_strike</link>
<description><p><a href="http://en.wikinews.org/w/index.php?title=File:LocationGermany.png&filetimestamp=20060604120306" class="image" title="A map showing the location of Germany"><img alt="A map showing the location of Germany" src="http://upload.wikimedia.org/wikipedia/commons/thumb/d/de/LocationGermany.png/196px-LocationGermany.png" width="196" height="90" /></a></p>
<p><b class="published"><span id="publishDate" class="value-title" title="2010-02-22"></span>Monday, February 22, 2010</b></p>
<p>The pilot's union of <a href="http://en.wikinews.org/wiki/Germany" title="Germany" class="mw-redirect">German</a> airline <a href="http://en.wikipedia.org/wiki/Lufthansa" class="extiw" title="w:Lufthansa">Lufthansa</a> have begun a four-day strike over pay and job security. Operations at subsidiary airlines <a href="http://en.wikipedia.org/wiki/Lufthansa_Cargo" class="extiw" title="w:Lufthansa Cargo">Lufthansa Cargo</a> and <a href="http://en.wikipedia.org/wiki/Germanwings" class="extiw" title="w:Germanwings">Germanwings</a> are also affected by the strike.</p>
<em><a href='http://en.wikinews.org/wiki/Lufthansa_pilots_begin_strike'>More...</a></em><div class="feedflare">
<a href="http://feeds.feedburner.com/~ff/WikinewsLatestNews?a=1K2xloPGlmI:9SJI0YV04-M:yIl2AUoC8zA"><img src="http://feeds.feedburner.com/~ff/WikinewsLatestNews?d=yIl2AUoC8zA" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/WikinewsLatestNews?a=1K2xloPGlmI:9SJI0YV04-M:7Q72WNTAKBA"><img src="http://feeds.feedburner.com/~ff/WikinewsLatestNews?d=7Q72WNTAKBA" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/WikinewsLatestNews?a=1K2xloPGlmI:9SJI0YV04-M:YwkR-u9nhCs"><img src="http://feeds.feedburner.com/~ff/WikinewsLatestNews?d=YwkR-u9nhCs" border="0"></img></a>
</div></description>
<guid isPermaLink="false">http://en.wikinews.org/wiki/Lufthansa_pilots_begin_strike</guid>
<feedburner:origLink>http://en.wikinews.org/wiki/Lufthansa_pilots_begin_strike</feedburner:origLink></item>
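Since the feed is plain RSS 2.0, it can be read with LINQ to XML. The following is a minimal sketch, not a definitive implementation: the FeedReader and Story names are hypothetical, and only the element names and the FeedBurner namespace are taken from the response shown above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class FeedReader
{
    // Namespace FeedBurner uses for its extension elements (copied from the feed above).
    private static readonly XNamespace FeedBurner = "http://rssnamespace.org/feedburner/ext/1.0";

    public sealed class Story
    {
        public string Title { get; set; }
        public string Link { get; set; }
        public string DescriptionHtml { get; set; }
    }

    // Parses RSS 2.0 text and returns one Story per <item>.
    public static List<Story> ParseStories(string rssXml)
    {
        XDocument doc = XDocument.Parse(rssXml);
        return doc.Descendants("item")
            .Select(item => new Story
            {
                Title = (string)item.Element("title"),
                // Prefer the original Wikinews URL over the FeedBurner redirect when present.
                Link = (string)item.Element(FeedBurner + "origLink") ?? (string)item.Element("link"),
                DescriptionHtml = (string)item.Element("description")
            })
            .ToList();
    }
}
```

To read the live feed instead of a string, XDocument.Load accepts a URL, e.g. XDocument.Load("http://feeds.feedburner.com/WikinewsLatestNews"). The resulting Story objects could then be inserted into SQL Server with ordinary ADO.NET commands.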

Your question is unclear, but I guess you want to format the WikiNews feed so that it reads in a friendlier way (as if you were reading it on WikiNews itself). Am I correct?
If so, note that RSS is XML in a standard format that is not specific to WikiNews, so you can transform any RSS feed for display as, say, HTML using XSLT.
If you need the story itself, you can follow the link given in the feed and display it in a WebBrowser control (if you are developing a Windows application).
Do you need anything beyond that?
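The XSLT approach can be sketched with XslCompiledTransform from System.Xml.Xsl. The stylesheet here is a deliberately tiny illustration (one list entry per item), and the RssToHtml class name is hypothetical; a real stylesheet would render the description as well.

```csharp
using System.IO;
using System.Xml;
using System.Xml.Xsl;

public static class RssToHtml
{
    // Illustrative stylesheet: one <li> with a link per RSS item.
    private const string Stylesheet = @"<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>
  <xsl:output method='html' indent='no'/>
  <xsl:template match='/'>
    <ul>
      <xsl:for-each select='rss/channel/item'>
        <li><a href='{link}'><xsl:value-of select='title'/></a></li>
      </xsl:for-each>
    </ul>
  </xsl:template>
</xsl:stylesheet>";

    // Applies the stylesheet to RSS text and returns the generated HTML fragment.
    public static string Transform(string rssXml)
    {
        var xslt = new XslCompiledTransform();
        using (XmlReader styleReader = XmlReader.Create(new StringReader(Stylesheet)))
        {
            xslt.Load(styleReader);
        }

        using (XmlReader input = XmlReader.Create(new StringReader(rssXml)))
        using (var output = new StringWriter())
        {
            // Writing to a TextWriter lets the stylesheet's xsl:output settings apply.
            xslt.Transform(input, null, output);
            return output.ToString();
        }
    }
}
```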

Related

C# Webservice response xml displaying as pdf

I am new to C# web development. I am developing software that receives a response from a web service in XML format (including barcodes generated by the web service).
The web service provider offers an option in which I have to add a line
(Example: <xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">)
as the second line of the XML and display it in a web browser using style sheets supplied by the provider. If I choose this option, how can I add that line as the second line of the received XML file, and how can I map the provider's style sheets in the project for this XML?
If I don't take that option, is it possible to display the XML data as a PDF (including the barcodes generated by the web service)?

If I understand your question correctly, you want to:
Add a stylesheet specification to an existing XML
Convert an XML to PDF
1. ADDING A STYLESHEET
There is an option given by webservice provider, that i have to add a line [...] as a second line in the xml and display in web browser by using style sheets
This can be done with, e.g., LINQ to XML.
First of all, I think the example you used, i.e.
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
may be inaccurate, as it is the first line of an XSL file (a stylesheet); files of that kind are used to transform an XML document into another file (a different XML document or, as in your case, HTML). However, you say
using style sheets provided by webservice provider
so my guess is that you already have those stylesheets and can use them, rather than creating them yourself.
If so, the line you want to add is like
<?xml-stylesheet type="text/xsl" href="helloWorld.xsl"?>
Let's suppose your XML is loaded into an XDocument variable named "Document" and its root element is named "Root":
var FilePath = "Example.xml";
var Document = XDocument.Load(FilePath);
var Root = Document.Descendants("Root").Single(); // call Descendants on the instance, not on the XDocument type
Then you can add your stylesheet this way, getting a new XML:
var NewDocument = new XDocument(
new XProcessingInstruction("xml-stylesheet", "type='text/xsl' href='helloWorld.xsl'"),
Root);
2. XML to PDF
There are several ways to do this.
You might parse your XML, retrieve the elements you want to show in the PDF, and use a library like iTextSharp to lay them out in the file.
Or, since you already have the XML and can transform it to HTML using an XSL stylesheet, you can use wkhtmltopdf.
Let me know if you need more details.

Retrieve the Url from an Html Img Tag

Background Info
I am currently working on a C# Web API that returns selected img URLs as base64. I already have the functionality that performs the base64 conversion; however, I receive a large amount of text that also includes img URLs, which I need to cut out of the string and pass to my function to convert the image to base64. I read up on a library ("HtmlAgilityPack") that should make this task easy, but when I use it I get "HtmlDocument.cs" not found. However, I am not submitting a document; I am sending it a string that is HTML. I read the documentation and it is supposed to work with a string as well, but it is not working for me. This is the code using "HtmlAgilityPack".
NON WORKING CODE
foreach(var item in returnList)
{
if (item.Content.Contains("~~/picture~~"))
{
HtmlDocument doc = new HtmlDocument();
doc.Load(item.Content);
Error Message From HtmlAgilityPack
Question
I am receiving a string of HTML from SharePoint. This HTML string may be tokenized with heading tokens and/or picture tokens. I am trying to isolate and retrieve the URL from the img src attribute. I understand that regex may be impractical, but I would consider a regular expression if one is available to retrieve the URL from img src.
Sample String
Bullet~~Increased Cash Flow</li><li>~~/Document Text Bullet~~Tax Efficient Organizational Structures</li><li>~~/Document Text Bullet~~Tax Strategies that Closely Align with Business Strategies</li><li>~~/Document Text Bullet~~Complete Knowledge of State and Local Tax Obligations</li></ul><p>~~/Document Heading 2~~is the firm of choice</p><p>~~/Document Text~~When it comes to accounting and advisory services is the unique firm of choice. As a trusted advisor to our clients, we bring an integrated client service approach with dedicated industry experience. Dixon Hughes Goodman respects the value of every client relationship and provides clients throughout the U.S. with an unwavering commitment to hands-on, personal attention from our partners and senior-level professionals.</p><p>~~/Document Text~~of choice for clients in search of a trusted advisor to deal with their state and local tax needs. Through our leading best practices and experience, our SALT professionals offer quality and ease to the client engagement. We are proud to provide highly comprehensive services.</p>
<p>~~/picture~~<br></p><p>
<img src="/sites/ContentCenter/Graphics/map-al.jpg" alt="map al" style="width:611px;height:262px;" /> 
<br></p><p><br></p><p>
~~/picture~~<br></p><p>
<img src="/sites/ContentCenter/Graphics/Firm_Telescope_Illustration.jpg" alt="Firm_Telescope_Illustration.jpg" style="margin:5px;width:155px;height:155px;" /> </p><p></div><div class="ExternalClassAF0833CB235F437993D7BEE362A1A88A"><br></div><div class="ExternalClassAF0833CB235F437993D7BEE362A1A88A"><br></div><div class="ExternalClassAF0833CB235F437993D7BEE362A1A88A"><br></div>
Important
I am working with an HTML string, not a file.

The issue you are having is that C# is looking for a file and, since it does not find it, tells you so. This is not an error that will break your app; it is just telling you that the file was not found, and the library will then read the string given. The documentation can be found here: https://htmlagilitypack.codeplex.com/SourceControl/latest#Trunk/HtmlAgilityPackDocumentation.shfbproj. The code below is a cookie-cutter model that anyone can use.
Important
C# is looking for a file that cannot be displayed, because what is supplied is a string. That is the message you are getting; however, your code will still work in accordance with the documentation provided, and the message will not affect it.
Example Code
HtmlAgilityPack.HtmlDocument htmlDocument = new HtmlAgilityPack.HtmlDocument();
htmlDocument.LoadHtml("YourContent"); // LoadHtml parses an HTML string; Load expects a file path or stream
HtmlAgilityPack.HtmlNode url = htmlDocument.DocumentNode.SelectSingleNode("//img"); // first <img> element, or null if there is none
HtmlAgilityPack.HtmlAttribute att = url.Attributes["src"];
Uri imgUrl = new System.Uri("Url" + att.Value); // build your url

string matchString = Regex.Match(original_text, "<img.+?src=[\"'](.+?)[\"'].+?>", RegexOptions.IgnoreCase).Groups[1].Value;
It has been asked multiple times here.
also here
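Since the sample string contains more than one img tag, a sketch that collects every src value rather than only the first match may be closer to what is needed. It uses only System.Text.RegularExpressions; the ImgSrcExtractor name is hypothetical and the pattern is a slightly stricter variant of the one above, not a general HTML parser.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ImgSrcExtractor
{
    // Matches <img ... src="..."> or src='...' and captures the URL in the group "url".
    private static readonly Regex ImgSrc = new Regex(
        "<img\\b[^>]*?\\bsrc\\s*=\\s*[\"'](?<url>[^\"']+)[\"']",
        RegexOptions.IgnoreCase);

    // Returns the src value of every <img> tag found, in document order.
    public static List<string> Extract(string html)
    {
        var urls = new List<string>();
        foreach (Match m in ImgSrc.Matches(html))
        {
            urls.Add(m.Groups["url"].Value);
        }
        return urls;
    }
}
```

Because the SharePoint paths in the sample are site-relative (they start with /sites/...), each result would still need to be combined with the site's base address before downloading the image.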

Does Linq to XML queries call the entire document or just what you query?

For political reasons (arg) my company's public-facing website does not have SQL persistence available to it. For this reason I am building a SQL-powered site that also spits out XML for my public site. This site sits on a different server and does not have access to the server the public site is on. I know, it's just as ugly as it sounds, but I think this is the best option for us at the moment.
Anyhow, I have an XML file for our locations. Marketing (again, arg) of course wants an image associated with each location. This gives me a couple of options: I can manually enter the name of the image as a string and manually put the image in a folder for the public site to access, or I can store it as byte[] data in the DB and the XML doc. I am exploring the latter.
<Location>
<Id>1</Id>
<Name>Name of Loc</Name>
<Address1>Address Line 1</Address1>
<City>Loc City</City>
<State>Loc State</State>
<Zip>Loc Zip</Zip>
<Latitude>43.244952</Latitude>
<Longitude>-82.74054</Longitude>
<ImageData>/9j/4AAQSkZJRgABAgEASABIAAD... (truncated for sanity sake)</ImageData>
<ImageMimeType>image/jpeg</ImageMimeType>
</Location>
<Location>
<Id>2</Id>
<Name>Name of Loc</Name>
<Address1>Address Line 1</Address1>
<City>Loc City</City>
<State>Loc State</State>
<Zip>Loc Zip</Zip>
<Latitude>43.244952</Latitude>
<Longitude>-82.74054</Longitude>
</Location>
<Location>
<Id>3</Id>
<Name>Name of Loc</Name>
<Address1>Address Line 1</Address1>
<City>Loc City</City>
<State>Loc State</State>
<Zip>Loc Zip</Zip>
<Latitude>43.244952</Latitude>
<Longitude>-82.74054</Longitude>
</Location>
So, once I have an XML doc with 90 locations in it and an image for each, the XML doc itself will be insanely large.
My question, before I continue down this path, is this: when I make a call to the XML via LINQ to XML, am I only loading the info that is being queried, or am I pulling in the entire XML doc and extracting the info I need?
If I am indeed pulling in the full XML doc, do you have any suggestions for a better approach? Or would the server be able to handle this fairly quickly?

The entire document is loaded and then processed. If you are base64 encoding your images (it appears you are), and sticking them in the XML document, then it will likely get very memory intensive very quickly.
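If memory does become a problem, one alternative is to stream the file with XmlReader and materialise only one Location element at a time. This is a sketch of that swapped-in technique, not what XDocument.Load does. It assumes the Location elements sit under a single root element, which the fragment in the question does not show, and the LocationStream name is hypothetical.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Xml;
using System.Xml.Linq;

public static class LocationStream
{
    // Yields one <Location> element at a time instead of loading the whole file.
    public static IEnumerable<XElement> ReadLocations(TextReader source)
    {
        using (XmlReader reader = XmlReader.Create(source))
        {
            reader.MoveToContent();
            while (!reader.EOF)
            {
                if (reader.NodeType == XmlNodeType.Element && reader.Name == "Location")
                {
                    // ReadFrom consumes the element and leaves the reader on the next node.
                    yield return (XElement)XNode.ReadFrom(reader);
                }
                else
                {
                    reader.Read();
                }
            }
        }
    }
}
```

A caller can still use LINQ over the result, for example ReadLocations(File.OpenText("Locations.xml")).Where(l => (int)l.Element("Id") == 2), and only the element currently being examined is held in memory. Storing the image file name in the XML instead of the base64 data avoids the size problem altogether.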

Reading and writing XML

I am working on a C# and Android library project. What I am basically trying to do is write a library in Android that will send me crash details onto my server.
I then have a C# console application that runs on my server and processes the data received by Android and from this data I want to generate an XML file, so that another program can read in the XML file and provide a monthly report.
I've got stuck on the best way of writing and reading the XML, though.
I've read a lot about it and found various options such as XmlWriter or XmlSerializer, but I don't know which works best, nor do I entirely understand how these are implemented.
Below is a basic design of how the XML file should be written. I've written this manually to give an idea of what I want to achieve.
<?xml version="1.0" encoding="utf-8" ?>
<apps>
<app>
<MyApp>
<appID>1</appID>
<applicationID>0027598641</applicationID>
<platform>Android</platform>
<CrashDetails>
<Exceptions>
<Exception>
<CrashID>55</CrashID>
<ExceptionType>NullPointerException</ExceptionType>
<FullException>NullPointerException at line 2</FullException>
<StartDate>01-11-2013 09:52:00</StartDate>
<EndDate>02-11-2013 14:43:13</EndDate>
<AppVersionName>6.1.1.6</AppVersionName>
<stacktrace>NullPointerException at line 2 com.MyCompany.MyApp.MyClass.MyMethod</stacktrace>
<Severity>Critical</Severity>
<OccurrenceCount>9</OccurrenceCount>
</Exception>
<Exception>
<CrashID>56</CrashID>
<ExceptionType>NullPointerException</ExceptionType>
<FullException>NullPointerException at line 2</FullException>
<StartDate>01-11-2013 09:52:00</StartDate>
<EndDate>02-11-2013 14:43:13</EndDate>
<AppVersionName>6.1.1.6</AppVersionName>
<stacktrace>NullPointerException at line 2 com.MyCompany.MyApp.MyClass.MyMethod</stacktrace>
<Severity>Critical</Severity>
<OccurrenceCount>9</OccurrenceCount>
</Exception>
</Exceptions>
</CrashDetails>
</MyApp>
<MyApp1>
<appID>2</appID>
<applicationID>4844354</applicationID>
<platform>Android</platform>
<CrashDetails>
<Exceptions>
<Exception>
<CrashID>55</CrashID>
<ExceptionType>NullPointerException</ExceptionType>
<FullException>NullPointerException at line 2</FullException>
<StartDate>01-11-2013 09:52:00</StartDate>
<EndDate>02-11-2013 14:43:13</EndDate>
<AppVersionName>6.1.1.6</AppVersionName>
<stacktrace>NullPointerException at line 2 com.MyCompany.MyApp.MyClass.MyMethod</stacktrace>
<Severity>Critical</Severity>
<OccurrenceCount>9</OccurrenceCount>
</Exception>
</Exceptions>
</CrashDetails>
</MyApp1>
</app>
</apps>
Thanks for any help you can provide.

Our team typically uses LINQ to XML, which provides a really powerful way to work with XML data (including loading XML from files, parsing XML streams, creating XML documents and writing/saving XML to files).
The following link provides a good overview of LINQ to XML
http://www.dreamincode.net/forums/topic/218979-linq-to-xml/
In addition, you may find “the XML part” section of the following page helpful
http://www.codeproject.com/Articles/24376/LINQ-to-XML
Regards

Personally, I prefer to use LINQ to XML for any XML related stuff in C#. You can create XML Trees, and then persist/serialize those trees to file, XmlWriter, or other types of streams.
But in cases when I need huge performance, I prefer to create XML using StringBuilder class and string concatenation operations.
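For the crash report in the question, a LINQ to XML sketch might look like the following. It is a simplified illustration: it writes a single app with a single exception, places the fields directly under app rather than under the per-app MyApp/MyApp1 wrappers of the hand-written sample, and the CrashReportXml name is hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class CrashReportXml
{
    // Builds a document shaped like the hand-written sample, for one app and one crash.
    public static XDocument Build(int appId, string platform, int crashId, string exceptionType, int occurrences)
    {
        return new XDocument(
            new XDeclaration("1.0", "utf-8", null),
            new XElement("apps",
                new XElement("app",
                    new XElement("appID", appId),
                    new XElement("platform", platform),
                    new XElement("CrashDetails",
                        new XElement("Exceptions",
                            new XElement("Exception",
                                new XElement("CrashID", crashId),
                                new XElement("ExceptionType", exceptionType),
                                new XElement("OccurrenceCount", occurrences)))))));
    }

    // Reads the document back and sums OccurrenceCount over every <Exception>.
    public static int TotalOccurrences(XDocument doc)
    {
        return doc.Descendants("Exception").Sum(e => (int)e.Element("OccurrenceCount"));
    }
}
```

The document can be written with doc.Save("crashes.xml") and read back by the reporting program with XDocument.Load("crashes.xml"); the file name is only an example.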

Get img src from XML CDATA

I am new to C# and Windows Phone development so forgive me if I am missing the obvious:
I would like to display a thumbnail image from an RSS XML feed located at http://blog.dota2.com/feed/. The image is inside a CDATA tag written in HTML. Here is the XML code:
<content:encoded>
<![CDATA[
<p>We celebrate Happy Bear Pun Week a day earlier as Lone Druid joins Dota 2′s cast of heroes.</p> <p><img class="alignnone" title="The irony is that he's allergic to fur." src="http://media.steampowered.com/apps/dota2/posts/LoneDruid_small.jpg" alt="The irony is that he's allergic to fur." width="551" height="223" /></p> <p>Community things:</p> <ul> <li>It’s Gosu’s Monthly Madness tournament finals are tomorrow, March 29th. You don’t want to miss this, we hear it could be more than we can bear.</li> <li>Bear witness to Team Dignitas’ Ultimate Guide to Warding. This should be required teaching in clawsrooms across the globe.</li> <li>Great Explorer Nullf has compiled the eating habits of the legendary Tidehunter in one handy chart. This might give you paws before deciding to head to the beach.</li> </ul> <p>Bear in mind that there will not be an update next week as we will be hibernating during that time.</p> <p>Today’s bearlog is available here.</p> <p> </p> <p>Bear.</p>
]]>
</content:encoded>
I just need the
<img src="http://media.steampowered.com/apps/dota2/posts/LoneDruid_small.jpg" />
so I can use the URL to display the image in my reader app.
I have heard people saying not to use Regex as it is bad practise for parsing HTML. I am creating this as a proof of concept, and don't need to worry about this. I am looking for the quickest way to get this URL for the image, and then call this in my app.
Does anyone have any help?
Thanks in advance,
Tom

Assuming your xml looks like this (which I'm sure it doesn't), and these extensions: http://searisen.com/xmllib/extensions.wiki
<?xml version="1.0" encoding="utf-8"?>
<root xmlns:content='uuid:BDC6E3F0-6DA3-11d1-A2A3-00AA00C14882'>
<content:encoded>
<![CDATA[
<p>We celebrate ...</p>
<p>
<a href="http://media.steampowered.com/apps/dota2/posts/LoneDruid_full.jpg ">
<img class="alignnone" title="The irony is that he's allergic to fur."
src="http://media.steampowered.com/apps/dota2/posts/LoneDruid_small.jpg" />
</a>
</p>
<p>the rest removed</p>
]]>
</content:encoded>
</root>
This will get the image source from the second paragraph. It is hard-coded and ugly, but you said that was all you needed. You will have to supply the actual path to content:encoded (path/to/content:encoded) for it to work, and if it is in a group (aka array) it will be even more complicated. From my code you can see how to separate out the arrays (see paras):
XElement root = XElement.Load(file); // or .Parse(string)
string html = root.Get("content:encoded", string.Empty).Replace("&nbsp", " ");
XElement xdata = XElement.Parse(string.Format("<root>{0}</root>", html));
XElement[] paras = xdata.GetElements("p").ToArray();
string src = paras[1].Get("a/img/src", string.Empty);
PS: this works because the HTML is properly formed. If it isn't, you'll have to use the HtmlAgilityPack as others have answered. You can use the HTML returned from the Get("content:encoded" ...)

You can try this when you are ready to use HtmlAgilityPack
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(yourstring);
var imgLinks = doc.DocumentNode
.Descendants("img")
.Select(n => n.Attributes["src"].Value)
.ToArray();

const string pattern = @"<img.+?src.*?\=.*?""(?<URL>.*?)"""; // verbatim string; (?<URL>...) is a named capture group
Regex regex = new Regex(pattern, RegexOptions.IgnoreCase);
var match = regex.Match(myCDataText);
var imageUrl = match.Groups["URL"].Value;
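To get from the raw feed to the CDATA text in the first place, LINQ to XML can locate content:encoded by its namespace. This sketch assumes the feed declares the usual RSS content module namespace (http://purl.org/rss/1.0/modules/content/), which should be checked against the feed's root element. FeedThumbnail is a hypothetical name, and the pattern is a slightly stricter variant of the one above.

```csharp
using System.Linq;
using System.Text.RegularExpressions;
using System.Xml.Linq;

public static class FeedThumbnail
{
    // Namespace of the RSS "content" module that defines <content:encoded> (assumed for this feed).
    private static readonly XNamespace Content = "http://purl.org/rss/1.0/modules/content/";

    // Returns the src of the first <img> in the first <content:encoded>, or null if there is none.
    public static string FirstImageUrl(string rssXml)
    {
        XElement encoded = XDocument.Parse(rssXml).Descendants(Content + "encoded").FirstOrDefault();
        if (encoded == null) return null;

        // XElement.Value already unwraps the CDATA section into plain HTML text.
        Match m = Regex.Match(
            encoded.Value,
            @"<img\b[^>]*?\bsrc\s*=\s*[""'](?<url>[^""']+)[""']",
            RegexOptions.IgnoreCase);
        return m.Success ? m.Groups["url"].Value : null;
    }
}
```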
