Get img src from XML CDATA - c#

I am new to C# and Windows Phone development so forgive me if I am missing the obvious:
I would like to display a thumbnail image from an RSS XML feed located at http://blog.dota2.com/feed/. The image is inside a CDATA tag written in HTML. Here is the XML code:
<content:encoded>
<![CDATA[
<p>We celebrate Happy Bear Pun Week a day earlier as Lone Druid joins Dota 2′s cast of heroes.</p> <p><img class="alignnone" title="The irony is that he's allergic to fur." src="http://media.steampowered.com/apps/dota2/posts/LoneDruid_small.jpg" alt="The irony is that he's allergic to fur." width="551" height="223" /></p> <p>Community things:</p> <ul> <li>It’s Gosu’s Monthly Madness tournament finals are tomorrow, March 29th. You don’t want to miss this, we hear it could be more than we can bear.</li> <li>Bear witness to Team Dignitas’ Ultimate Guide to Warding. This should be required teaching in clawsrooms across the globe.</li> <li>Great Explorer Nullf has compiled the eating habits of the legendary Tidehunter in one handy chart. This might give you paws before deciding to head to the beach.</li> </ul> <p>Bear in mind that there will not be an update next week as we will be hibernating during that time.</p> <p>Today’s bearlog is available here.</p> <p> </p> <p>Bear.</p>
]]>
</content:encoded>
I just need the
<img src="http://media.steampowered.com/apps/dota2/posts/LoneDruid_small.jpg" />
so I can use the URL to display the image in my reader app.
I have heard people say that using Regex to parse HTML is bad practice. I am creating this as a proof of concept, so I don't need to worry about that. I am looking for the quickest way to get the URL for the image and then use it in my app.
Does anyone have any help?
Thanks in advance,
Tom

Assuming your XML looks like this (which I'm sure it doesn't) and that you use these extensions: http://searisen.com/xmllib/extensions.wiki
<?xml version="1.0" encoding="utf-8"?>
<root xmlns:content='uuid:BDC6E3F0-6DA3-11d1-A2A3-00AA00C14882'>
<content:encoded>
<![CDATA[
<p>We celebrate ...</p>
<p>
<a href="http://media.steampowered.com/apps/dota2/posts/LoneDruid_full.jpg ">
<img class="alignnone" title="The irony is that he's allergic to fur."
src="http://media.steampowered.com/apps/dota2/posts/LoneDruid_small.jpg" />
</a>
</p>
<p>the rest removed</p>
]]>
</content:encoded>
</root>
This gets the image source from the second paragraph. It is hard-coded and ugly, but you said that was all you needed. For it to work you will have to give the full path to the element (path/to/content:encoded), and if the element is in a group (i.e. an array) it gets more complicated. My code shows how to separate out the arrays (see paras):
XElement root = XElement.Load(file); // or XElement.Parse(string)
// &nbsp; is not a predefined XML entity, so replace it before parsing the HTML as XML
string html = root.Get("content:encoded", string.Empty).Replace("&nbsp;", " ");
XElement xdata = XElement.Parse(string.Format("<root>{0}</root>", html));
XElement[] paras = xdata.GetElements("p").ToArray();
string src = paras[1].Get("a/img/src", string.Empty);
PS: This works because the HTML is well formed. If it isn't, you'll have to use HtmlAgilityPack as others have answered; you can pass it the HTML returned from Get("content:encoded", ...).

You can try this when you are ready to use HtmlAgilityPack
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(yourstring);
var imgLinks = doc.DocumentNode
.Descendants("img")
.Select(n => n.Attributes["src"].Value)
.ToArray();

const string pattern = @"<img.+?src.*?\=.*?""(?<URL>.*?)""";
Regex regex = new Regex(pattern, RegexOptions.IgnoreCase);
var match = regex.Match(myCDataText);
var domain = match.Groups["URL"].Value;
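
For a proof of concept, the CDATA lookup and the regex can be combined in one helper. The following is only a sketch: the class and method names are made up for illustration, and it assumes the feed declares the content prefix with the standard RSS content module namespace (http://purl.org/rss/1.0/modules/content/), which you should check against the actual feed.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;
using System.Xml.Linq;

// Hypothetical helper, not part of any library.
public static class FeedThumbnail
{
    // Assumed namespace URI for the "content" prefix (RSS content module).
    private static readonly XNamespace Content = "http://purl.org/rss/1.0/modules/content/";

    // Quick-and-dirty pattern: captures the quoted src value of an <img> tag.
    private static readonly Regex ImgSrc = new Regex(
        @"<img\b[^>]*?\bsrc\s*=\s*[""'](?<url>[^""']+)[""']",
        RegexOptions.IgnoreCase);

    // Returns the first image URL inside the first <content:encoded> element,
    // or null when the element or the image is missing.
    public static string FirstImageUrl(string feedXml)
    {
        XDocument feed = XDocument.Parse(feedXml);
        XElement encoded = feed.Descendants(Content + "encoded").FirstOrDefault();
        if (encoded == null) return null;

        // XElement.Value returns the CDATA text without the <![CDATA[ ]]> wrapper.
        Match match = ImgSrc.Match(encoded.Value);
        return match.Success ? match.Groups["url"].Value : null;
    }
}
```

Calling FeedThumbnail.FirstImageUrl(feedXml) on the downloaded feed text should then give a string you can hand to the image control. For anything beyond a prototype, an HTML parser is the safer choice.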

Related

Parsing HTML to get the key and value

I use HTMLAgility to parse HTML Document.
I downloaded the dll from codeplex and referenced it to my project.
Now, all I need is to parse this HTML (below):
<HTML>
<BODY>
//......................
<tbody ID='image'>
<tr><td>Video Codec</td><td colspan=2>JPEG (8192 KBytes)</td></tr>
</BODY>
I need to retrieve Video Codec and its value JPEG from the above HTML.
I know that I can use HTMLAgility, but how do I do that?
var document = new HtmlDocument();
string htmlString = "<tbody ID='image'>";
document.LoadHtml(htmlString);
// how to get the Video Codec and its value `JPEG` ?
Any pointers are much appreciated.
EDIT:
I was able to get a bit further using @itedi's answer, but I am still stuck.
var cells = document.DocumentNode
    // use the right XPath rather than looping manually
    .SelectNodes(@"//table")
    .ToList();
var tbodies = cells.First().SelectNodes(@"//tbody").ToList();
This gives me all the tbody elements, but how do I print the values from them?
A much lighter way would be using regex:
string s = @"<tbody ID='image'>
<tr><td>Video Codec</td><td colspan=2>JPEG (8192 KBytes)</td></tr>
</BODY>";
var results = Regex.Match(s, "<td>Video Codec</td><td.*?>(.+?)</td>").Groups[1];
Returns:
JPEG (8192 KBytes)

Get value between html tags Xpath and HtmlAgility

I am trying to retrieve the text between HTML tags on a certain website.
For instance, I need to extract the text between the span tags below. How would I go about that? I am receiving the error "Object reference not set to an instance of an object". Here is the HTML.
There is also HTML code before this portion; I don't know if that makes a difference.
<div class="thumbnail-details">
<ul>
<li> … </li>
<li class="product-title">
<span class="thumbnail-details-grey">The Blaster Portable Wireless Speaker in Black</span>
</li>
<li> … </li>
</ul>
</div>
So far my C# code is
HtmlWeb hw = new HtmlWeb();
HtmlAgilityPack.HtmlDocument htmlDoc = hw.Load(@"http://www.karmaloop.com/Browse.htm#Pgroup=1");
if (htmlDoc.DocumentNode != null)
{
    foreach (HtmlNode text in htmlDoc.DocumentNode.SelectNodes("//span[@class='thumbnail-details-grey']/text()"))
    {
        Console.WriteLine(text.InnerText);
    }
}
Can I get some help here? I want to extract "The Blaster Portable Wireless Speaker in Black".
I'd recommend using CsQuery (https://www.nuget.org/packages/CsQuery/1.3.4) and then it's as simple as:
var doc = CQ.CreateFromUrl(@"http://www.karmaloop.com/Browse.htm");
var nodes = doc.Find("span.thumbnail-details-grey");
foreach(var node in nodes)
Console.WriteLine(node.InnerText);
Your code works just fine, but you'll have to load the right page to get it to work. The page you are loading uses an ajax request to load the results you see in your browser.
So instead of the url you are currently using you have to use:
HtmlDocument htmlDoc = hw.Load(@"http://www.karmaloop.com/Browse?Pgroup=1&ajax=true&version=2");
Then your code works. I'm still looking for the place this request gets put together...
But the query looks rather easy to guess. For example, the page http://www.karmaloop.com/Browse.htm#Pdept=11&PageSize=30&Pgroup=1 requests the URL http://www.karmaloop.com/Browse?Pdept=11&PageSize=30&Pgroup=1&ajax=true&version=2. So all you have to do is take your URL and build a new one from the part after the #.
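
The URL rewrite described above could be sketched like this. The helper name is hypothetical, and the rule it applies (drop ".htm", turn the fragment into the query string, append ajax=true&version=2) is only inferred from the two example URLs, so the site may behave differently for other pages.

```csharp
using System;

// Hypothetical helper, not part of HtmlAgilityPack.
public static class AjaxUrlBuilder
{
    // Turns ".../Browse.htm#Pgroup=1" into ".../Browse?Pgroup=1&ajax=true&version=2".
    public static string FromFragmentUrl(string pageUrl)
    {
        int hash = pageUrl.IndexOf('#');
        if (hash < 0) throw new ArgumentException("URL has no # fragment.", "pageUrl");

        string query = pageUrl.Substring(hash + 1);   // everything after the #
        string basePart = pageUrl.Substring(0, hash); // everything before the #

        // The example request URL has no ".htm" suffix, so strip it here.
        if (basePart.EndsWith(".htm", StringComparison.OrdinalIgnoreCase))
            basePart = basePart.Substring(0, basePart.Length - 4);

        return basePart + "?" + query + "&ajax=true&version=2";
    }
}
```

The result can be passed straight to hw.Load in place of the original address.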

How do I store html page in a xml file?

I have a small C# console application that I want to use to send an email. I was planning on storing the email inside an XML file along with other information the message will need, such as a subject. However, there seems to be a problem because the XML file doesn't like tags such as </br>.
I'm wondering what I should do in order to store an HTML email. Do I just have to keep the body HTML in a separate HTML file and then read it with a StreamReader object?
The easiest way would be to store the HTML content in a CDATA section:
<mail>
<subject>Test</subject>
<body>
<![CDATA[
<html>
...
</html>
]]>
</body>
</mail>
Use a CDATA section that will contain your email HTML code:
<?xml version="1.0"?>
<myDocument>
<email>
<![CDATA[
<html>
<head><title>My title</title></head>
<body><p>Hello world</p></body>
</html>
]]>
</email>
</myDocument>
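
As a rough illustration of the CDATA approach with LINQ to XML, the sketch below writes and reads a mail document shaped like the first example (mail, subject, body). The class and method names are invented for this example; only XElement, XCData and XDocument come from System.Xml.Linq.

```csharp
using System;
using System.Xml.Linq;

// Hypothetical helper for the <mail><subject/><body/></mail> layout.
public static class MailTemplate
{
    // Builds the XML, wrapping the HTML body in a CDATA section so it needs no escaping.
    public static string Build(string subject, string bodyHtml)
    {
        var mail = new XElement("mail",
            new XElement("subject", subject),
            new XElement("body", new XCData(bodyHtml)));
        return mail.ToString(SaveOptions.DisableFormatting);
    }

    // Reads the subject and HTML body back; the CDATA wrapper is removed by the parser.
    public static Tuple<string, string> Read(string xml)
    {
        XElement mail = XDocument.Parse(xml).Root;
        string subject = (string)mail.Element("subject");
        string bodyHtml = ((string)mail.Element("body") ?? string.Empty).Trim();
        return Tuple.Create(subject, bodyHtml);
    }
}
```

One caveat: an HTML body that itself contains the sequence ]]> cannot be stored in a single CDATA section, so this sketch assumes the body does not.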
You can use a CDATA section in your XML.
You could store the HTML as CDATA within the XML.
But looking at what you are trying to do, you may wish instead to look at the System.Web.UI.WebControls.MailDefinition class, as it already contains a reasonable way of using mail templates.
The MSDN documentation is geared towards using it in ASP.NET Web Forms apps, but you can simply use a ListDictionary to fill in the replacements.
Here is a simplistic example to give an idea of how MailDefinition can be used. I won't go into too much detail, as it's a little outside the scope of the original question.
protected MailMessage GetNewUserMailMessage(string email, string username, string password, string loginUrl)
{
MailDefinition mailDefinition = new MailDefinition();
mailDefinition.BodyFileName = "~/mailtemplates/newuser.txt";
ListDictionary replacements = new ListDictionary();
replacements.Add("<%username%>", username);
replacements.Add("<%password%>", password);
replacements.Add("<%loginUrl%>", loginUrl);
return mailDefinition.CreateMailMessage(email, replacements, this);
}

What is the best way to wrap some text in an xml tag?

I am trying to use Regex in C# to match a section in an xml document and wrap that section inside of a tag.
For example, I have this section:
<intro>
<p>this is the first section of content</p>
<p> this is another</p>
</intro>
and I want it to look like this:
<intro>
<bodyText>
<p> this is asdf</p>
<p> yada yada </p>
</bodyText>
</intro>
Any thoughts?
I was considering doing it using the XPath class in C# or just by reading in the document and using Regex. I just can't seem to figure it out either way.
Here is one attempt:
StreamReader reader = new StreamReader(filePath);
string content = reader.ReadToEnd();
reader.Close();
/* The regex stuff would go here */
StreamWriter writer = new StreamWriter(filePath);
writer.Write(content);
writer.Close();
Thanks!
I wouldn't recommend regular expressions for this task. Instead you can do it using LINQ to XML. For example, here is how you could wrap some tags inside a new tag:
XDocument doc = XDocument.Load("input.xml");
var section = doc.Root.Elements("p");
doc.Root.ReplaceAll(new XElement("bodyText", section));
Console.WriteLine(doc.ToString());
Result:
<intro>
<bodyText>
<p>this is the first section of content</p>
<p> this is another</p>
</bodyText>
</intro>
I assume that your actual document differs considerably from the example you posted so the code will need some adjustment to fit your requirements, but if you read the documentation for XDocument you should be able to do what you want.
I would suggest using System.Xml and XPath. XML, like HTML, is not a regular language, which is what causes problems when you try to parse it with regular expressions.
Use something like
XmlDocument doc = new XmlDocument();
doc.Load("Path to your xml document");
Enjoy!
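
To make the XmlDocument suggestion concrete, here is a minimal sketch that wraps the p elements of the root in a new bodyText element. The class and method names are made up, and it assumes the paragraphs are direct children of the root element, as in the question's sample.

```csharp
using System.Collections.Generic;
using System.Xml;

// Hypothetical helper built on System.Xml.
public static class IntroWrapper
{
    // Moves every <p> child of the root element into a new <bodyText> element.
    public static string WrapParagraphs(string xml)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xml);

        XmlElement root = doc.DocumentElement;
        XmlElement bodyText = doc.CreateElement("bodyText");

        // Copy the matches into a list first so the tree is not modified
        // while the XPath result is still being enumerated.
        var paragraphs = new List<XmlNode>();
        foreach (XmlNode p in root.SelectNodes("p")) paragraphs.Add(p);

        // AppendChild detaches each node from its old parent before re-attaching it.
        foreach (XmlNode p in paragraphs) bodyText.AppendChild(p);

        root.AppendChild(bodyText);
        return doc.OuterXml;
    }
}
```

For a file on disk, doc.Load(path) and doc.Save(path) would replace LoadXml and OuterXml.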

Consuming data from WikiNews

I have been scouring the net but I can't seem to find any examples of consuming data from WikiNews. They have an RSS feed with links to individual stories as HTML, but I would like to get the data in a structured format such as XML etc.
By "structured format" I mean an XML file for each story that has a defined XML schema (XSD) file. See: http://www.w3schools.com/schema/schema_intro.asp
Has anyone written a program that consumes stories from WikiNews? Do they have a documented API?
I would like to use C# to collect selected stories and store them in SQL Server 2008.
The software they use has an API but I'm not sure if WikiNews supports it.
Their feed: http://feeds.feedburner.com/WikinewsLatestNews
If you put that in your browser and read the source, you'll see that it is XML. The XML contains the title, description, a link, etc. Only the description is in HTML.
Here is the beginning of the response:
<rss xmlns:creativeCommons="http://backend.userland.com/creativeCommonsRssModule" xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0" version="2.0">
<channel>
<title>Wikinews</title>
<description>Wikinews RSS feed</description>
<language>en</language>
<link>http://en.wikinews.org</link>
<copyright>Creative Commons Attribution 2.5 (unless otherwise noted)</copyright>
<generator>Wikinews Fetch</generator>
<ttl>180</ttl>
<docs>http://none</docs>
<atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="self" type="application/rss+xml" href="http://feeds.feedburner.com/WikinewsLatestNews" /><feedburner:info uri="wikinewslatestnews" /><atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="hub" href="http://pubsubhubbub.appspot.com/" /><creativeCommons:license>http://creativecommons.org/licenses/by/2.5/</creativeCommons:license><feedburner:browserFriendly>This is an XML content feed. It is intended to be viewed in a newsreader or syndicated to another site.</feedburner:browserFriendly><item>
<title>Lufthansa pilots begin strike</title>
<link>http://feedproxy.google.com/~r/WikinewsLatestNews/~3/1K2xloPGlmI/Lufthansa_pilots_begin_strike</link>
<description><p><a href="http://en.wikinews.org/w/index.php?title=File:LocationGermany.png&filetimestamp=20060604120306" class="image" title="A map showing the location of Germany"><img alt="A map showing the location of Germany" src="http://upload.wikimedia.org/wikipedia/commons/thumb/d/de/LocationGermany.png/196px-LocationGermany.png" width="196" height="90" /></a></p>
<p><b class="published"><span id="publishDate" class="value-title" title="2010-02-22"></span>Monday, February 22, 2010</b></p>
<p>The pilot's union of <a href="http://en.wikinews.org/wiki/Germany" title="Germany" class="mw-redirect">German</a> airline <a href="http://en.wikipedia.org/wiki/Lufthansa" class="extiw" title="w:Lufthansa">Lufthansa</a> have begun a four-day strike over pay and job security. Operations at subsidiary airlines <a href="http://en.wikipedia.org/wiki/Lufthansa_Cargo" class="extiw" title="w:Lufthansa Cargo">Lufthansa Cargo</a> and <a href="http://en.wikipedia.org/wiki/Germanwings" class="extiw" title="w:Germanwings">Germanwings</a> are also affected by the strike.</p>
<em><a href='http://en.wikinews.org/wiki/Lufthansa_pilots_begin_strike'>More...</a></em><div class="feedflare">
<a href="http://feeds.feedburner.com/~ff/WikinewsLatestNews?a=1K2xloPGlmI:9SJI0YV04-M:yIl2AUoC8zA"><img src="http://feeds.feedburner.com/~ff/WikinewsLatestNews?d=yIl2AUoC8zA" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/WikinewsLatestNews?a=1K2xloPGlmI:9SJI0YV04-M:7Q72WNTAKBA"><img src="http://feeds.feedburner.com/~ff/WikinewsLatestNews?d=7Q72WNTAKBA" border="0"></img></a> <a href="http://feeds.feedburner.com/~ff/WikinewsLatestNews?a=1K2xloPGlmI:9SJI0YV04-M:YwkR-u9nhCs"><img src="http://feeds.feedburner.com/~ff/WikinewsLatestNews?d=YwkR-u9nhCs" border="0"></img></a>
</div></description>
<guid isPermaLink="false">http://en.wikinews.org/wiki/Lufthansa_pilots_begin_strike</guid>
<feedburner:origLink>http://en.wikinews.org/wiki/Lufthansa_pilots_begin_strike</feedburner:origLink></item>
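
Since the feed is plain RSS 2.0, the title, link and description of each item can be pulled out with LINQ to XML before being stored. This is only a sketch: the Story and RssReader types are invented for illustration, and it reads just the three standard elements visible in the response above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// Hypothetical container for one feed entry.
public class Story
{
    public string Title { get; set; }
    public string Link { get; set; }
    public string DescriptionHtml { get; set; }
}

// Hypothetical reader for RSS 2.0 feeds.
public static class RssReader
{
    // Returns one Story per <item>; RSS 2.0 core elements are in no namespace.
    public static List<Story> ReadStories(string rssXml)
    {
        XDocument feed = XDocument.Parse(rssXml);
        return feed.Descendants("item")
            .Select(item => new Story
            {
                Title = (string)item.Element("title"),
                Link = (string)item.Element("link"),
                // The description holds the HTML; escaped markup is unescaped by the parser.
                DescriptionHtml = (string)item.Element("description")
            })
            .ToList();
    }
}
```

The resulting list could then be filtered and written to SQL Server with whatever data access code you already use.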
Your question is really unclear, but I guess you want to format the WikiNews feed so it reads in a friendlier way (as if you were reading it on WikiNews itself). Am I correct?
If so, you should know that RSS is XML with a standard format that is not specific to WikiNews, and you can transform any RSS feed for display in, say, HTML with XSLT.
If you need the story itself, you can use the link given in the feed and display it in a WebBrowser control (if you are developing a Windows application).
Do you need anything other than what I have described?
