XML parsing : Reading CDATA - c#

<item><title>this is title</title><guid isPermaLink="true">http://www.i.com/video/nokia-lumia-920-deki-pureview_2879.html</guid><link>http://www.i.com/video/nokia-lumia-920-deki-pureview_2879.html</link>
<description><![CDATA[this is the info.]]></description>
<pubDate>Wed, 5 Sep 2012 22:10:00 UT</pubDate>
<media:content type="image/jpg" expression="sample" fileSize="2956" medium="image" url="http://media.chip.com.tr/images/content/video/88/201209060102428081-0.jpg"/>
<enclosure type="image/jpg" url="http://media.chip.com.tr/images/content/video/88/201209060102428081-0.jpg" length="2956"/></item>
I want to read the CDATA in <description>.
I wrote this:
var x = e.Result; // e.Result is the downloaded XML
var videos = XElement.Parse(e.Result);
var fList = (from haber in videos.Descendants("channel").Elements("item")
select new Video
{
title = haber.Element("title").Value,
link = haber.Element("link").Value,
//description = ???????
}).ToList();
What should I write for description? (Edit: the answer is "the same way".)
But what if the description looks like this?
<![CDATA[<p>Zombiler adına ne umduk ne bulduk!</p> <p> </p><p><img style="margin: 5px 0px 5px 5px; border: 1px solid #333333; float: right;" alt="Black_ops" src="http://or.com/images/stories/haber/haberler6/20120918_Castlevania/Black_ops.jpg" height="0" width="0" /><strong>Black Ops 2</strong>'de Zombi modu olabilir haberi çıktığından beri bir ses, bir görüntü beklerken <strong>Call of Duty</strong>'nin resmi <strong>Youtube</strong> sayfasında aşağıdaki video yayınlandı. Açıkçası ne demek istiyorlar anlamak güç. <p>Devamını oku...</p>]]>

You should be able to use exactly the same code:
description = haber.Element("description").Value
Or
description = (string) haber.Element("description")
LINQ to XML will take care of reading the text for you.
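As a minimal sketch of the point above, the snippet below parses a trimmed, hypothetical feed (not the asker's real data) and reads a CDATA description through the same Element(...) call. It is written with top-level statements, so it assumes C# 9 or later.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

// Hypothetical, shortened feed; the CDATA section deliberately contains HTML.
string xml = @"<rss><channel><item>
  <title>this is title</title>
  <description><![CDATA[<p>this is the <b>info</b>.</p>]]></description>
</item></channel></rss>";

XElement feed = XElement.Parse(xml);

// The string cast (or .Value) returns the CDATA content as text, markup included.
var descriptions = feed.Descendants("channel").Elements("item")
    .Select(item => (string)item.Element("description"))
    .ToList();

Console.WriteLine(descriptions[0]);
```

The HTML inside the CDATA comes back as an ordinary string, so any tag stripping has to happen as a separate step.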

To read the CDATA block you just use the same methods; if what you want is to strip the HTML from it, then check this answer.

Related

find link with multiple keywords in c# with HTML Agility Pack

I am writing a program that parses a website.
I managed to find a link on the page, but I had to pass the exact InnerText to find it.
I'm looking for a way to do the same thing but match on partial inner text.
Example:
the inner text is: "hi my name is"
I want to be able to find it by passing only
"hi my"
foreach (var title in htmlNodes)
{
if (keywords == title.SelectSingleNode("div/h1").InnerText)
{
if (color == title.SelectSingleNode("div/p").InnerText)
{
Console.WriteLine(title.SelectSingleNode("div/p/a").GetAttributeValue("href", "pas d'addresse"));
}
}
}
Here keywords needs to match the InnerText of div/h1 exactly. I want a partial match.
Here is the HTML:
<article>
<div class="inner-article">
<a style = "height:150px;" href="/shop/shirts/c712g63kx/p1us9bkh7">
<img width = "150" height="150" src="//assets.supremenewyork.com/146319/vi/qW2Nur88W30.jpg" alt="Qw2nur88w30">
</a>
<h1>
<a class="name-link" href="/shop/shirts/c712g63kx/p1us9bkh7">Tiger Stripe Rayon Shirt</a>
</h1>
<p>
<a class="name-link" href="/shop/shirts/c712g63kx/p1us9bkh7">Teal</a>
</p>
</div>
</article>
Thank you all for your answers!
I found out how to solve my problem. It was actually quite simple. Here is the code:
if ((title.SelectSingleNode("div/h1").InnerText).Contains(keywords))
Now the problem is making the match case-insensitive.
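One possible way to handle the case-insensitive part is string.IndexOf with a StringComparison argument. The helper name ContainsIgnoreCase below is hypothetical, and the sample strings are taken from the HTML in the question; this is a sketch, assuming C# 9 or later for the top-level statements.

```csharp
using System;

// Hypothetical helper: true when text contains keywords, ignoring case.
static bool ContainsIgnoreCase(string text, string keywords) =>
    text != null && keywords != null &&
    text.IndexOf(keywords, StringComparison.OrdinalIgnoreCase) >= 0;

Console.WriteLine(ContainsIgnoreCase("Tiger Stripe Rayon Shirt", "tiger stripe")); // True
Console.WriteLine(ContainsIgnoreCase("Tiger Stripe Rayon Shirt", "teal"));         // False
```

In the loop from the question, the condition would then read ContainsIgnoreCase(title.SelectSingleNode("div/h1").InnerText, keywords).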

Add space within foreach elements

I have two nested foreach loops like below.
@foreach (var post in Model.Posts)
{
<article class="post-@post.PostId @post.PostType sticky post-item isotope-item
@foreach (var category in post.Categories)
{
@category.FormattedCategoryName
}">
}
Here's a sample of the output data:
<article class="post-1024 format-standard sticky post-item isotope-item cat1cat2cat3cat4cat5" style="width: 429px; position: absolute; left: 0px; top: 0px; transform: translate3d(2px, 1px, 0px);">
The only problem is that I could not separate the @category.FormattedCategoryName values with spaces. It might be an easy string operation, but how? Any ideas?
Thanks a lot.
Try the following instead:
@category.FormattedCategoryName<text> </text>
Or alternatively:
@Html.Raw(string.Concat(category.FormattedCategoryName, " "))
Edit:
As per @freedomn-m's comment, a better solution is to replace the foreach loop with the following:
@string.Join(" ", post.Categories.Select(c => c.FormattedCategoryName).ToArray())
So the overall structure would be:
<article class="post-@post.PostId @post.PostType sticky post-item isotope-item
@string.Join(" ", post.Categories.Select(c => c.FormattedCategoryName).ToArray())">
Hope this helps!
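Outside of Razor, the effect of string.Join can be sketched as below. The list formattedNames is a hypothetical stand-in for the FormattedCategoryName values of post.Categories, and the fixed class prefix is copied from the sample output in the question.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for the FormattedCategoryName values of one post.
var formattedNames = new List<string> { "cat1", "cat2", "cat3" };

// Join puts the separator between items only, so there is no trailing space.
string classAttribute = "post-1024 format-standard sticky post-item isotope-item "
    + string.Join(" ", formattedNames);

Console.WriteLine(classAttribute);
```

Since .NET 4, string.Join also accepts an IEnumerable<string>, so the ToArray() call is optional there.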

How to check if an XML attribute contains a string?

Here is the XML (I saved an HTML page as XML so I can parse it generically):
<td width="76" class="DataB">2.276</td>
<td width="76" class="DataB">2.289</td>
<td width="76" class="DataB">2.091</td>
<td width="76" class="DataB">1.952</td>
<td width="76" class="DataB">1.936</td>
<td width="76" class="Current2">1.899</td>
Now I am trying to find all of the elements whose class contains the string Current, because the web page changes the number at the end:
var xElements = xml.Descendants("td").Where(element => ((string) element.Attribute("class")).Contains("Current"));
This throws an "object does not exist" error here:
((string) element.Attribute("class"))
How can I check an attribute if it contains something?
If you ask me, it would be easier to write this as an XPath query. That way you don't have to deal with elements that lack a class attribute and other such cases.
var query = xml.XPathSelectElements("//td[contains(@class,'Current')]");
Otherwise, you would have to check for the existence of the attribute before trying to read it.
// query syntax makes this a little nicer
var query =
from td in xml.Descendants("td")
let classStr = (string)td.Attribute("class")
where classStr != null && classStr.Contains("Current")
select td;
// or alternatively, provide a default value
var query =
from td in xml.Descendants("td")
where ((string)td.Attribute("class") ?? "").Contains("Current")
select td;
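To see why the null check matters, here is a small self-contained sketch. The input row is hypothetical and includes one td without a class attribute, which is the case where casting the attribute to string yields null and a direct Contains call would fail.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

// Hypothetical input: the second <td> has no class attribute at all.
XElement row = XElement.Parse(
    @"<tr><td class=""DataB"">2.276</td><td>n/a</td><td class=""Current2"">1.899</td></tr>");

// (string)attribute is null when the attribute is missing; ?? "" keeps Contains safe.
var current = row.Descendants("td")
    .Where(td => ((string)td.Attribute("class") ?? "").Contains("Current"))
    .Select(td => td.Value)
    .ToList();

Console.WriteLine(string.Join(", ", current));
```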
There's probably something wrong with the XML input you're using - trying this code works for me in LINQPad:
XDocument xml = XDocument.Parse(@"<tr><td width=""76"" class=""DataB"">2.276</td>
<td width=""76"" class=""DataB"">2.289</td>
<td width=""76"" class=""DataB"">2.091</td>
<td width=""76"" class=""DataB"">1.952</td>
<td width=""76"" class=""DataB"">1.936</td>
<td width=""76"" class=""Current2"">1.899</td></tr>");
var xElements = xml.Descendants("td").Where(element => ((string) element.Attribute("class")).Contains("Current"));
xElements.Dump();
Are you sure your XML is valid?

Extract Content from <div class=" "> </div> Tag C# RegEx

I have this code:
string tag = "div";
string pattern = string.Format(@"\<{0}.*?\>(?<tegData>.+?)\<\/{0}\>", tag.Trim());
Regex regex = new Regex(pattern, RegexOptions.ExplicitCapture);
MatchCollection matches = regex.Matches(data);
and I need to get the content between the <div class="in"> ... </div> tags:
<div class="in">
ВАЗ 2121 <span class="for">за</span> <span class="price">2 700 $</span></span><br/><span class="year">1990 г.</span><br/><div style="margin: 3px 0 3px 0">1.6 л, бензин, КПП механика, с пробегом, белый, литые диски, тонировка, спойлер, ветровики, противотуманки, Движок после капитального ремонта!</div><div>
<span style="display:block; padding: 4px 0 0 0;"><span class="region">Костанай</span><span class="adv-phones">, +7 (777) 4464451</span></span>
<small class="gray air">24 просмотра</small>
<small class="gray air">13 июня</small>
</div>
<div class="selectItem" title="Выбрать" id="fv_sic_7184569">
</div>
</div>
How can I do it?
My code doesn't work.
Here's a regex that might extract simple div tags:
// <div[^>]*>(.+?)</div>
string tag = "div";
string pattern = string.Format(@"<{0}[^>]*>(?<tegData>.+?)</{0}>", tag.Trim());
However, using regular expressions to parse HTML is almost always inappropriate and is likely to break on real-world markup, because markup languages such as HTML are not regular languages.
That being said, you would be much better off using an XML parser to parse the document or fragment and then extracting what you need. In fact, a forward-only parser would probably even be faster than a regex.
You should look at the XmlReader class in .NET.
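A minimal sketch of the XmlReader approach follows. It assumes the markup is well-formed XML, which real-world HTML often is not, and the input string is a hypothetical, shortened version of the listing in the question.

```csharp
using System;
using System.IO;
using System.Xml;

// Hypothetical, well-formed input; the wrapping <root> element is an assumption.
string markup = @"<root><div class=""other"">skip</div><div class=""in"">VAZ 2121 <span class=""price"">2 700 $</span><div>nested</div></div></root>";

string inner = null;
using (XmlReader reader = XmlReader.Create(new StringReader(markup)))
{
    while (reader.Read())
    {
        if (reader.NodeType == XmlNodeType.Element
            && reader.Name == "div"
            && reader.GetAttribute("class") == "in")
        {
            // ReadInnerXml returns everything inside the element, nested tags included.
            inner = reader.ReadInnerXml();
            break;
        }
    }
}

Console.WriteLine(inner);
```

Because the reader tracks element nesting itself, the inner div does not end the match early the way the first closing tag does with a lazy regex.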
If it doesn't have to be server-side, you could use some JavaScript to do this, such as:
<script language="javascript">
function getData(){
var divs = document.getElementsByTagName('div');
var data;
var x;
for(x = 0; x < divs.length; x++)
{
if(divs[x].className == 'in')
{
data = divs[x].innerHTML;
}
}
}
</script>
To get nested tags, try this function:
public static MatchCollection ParseTag(string str, string tagpat, string argpat, string valpat) {
if (null == tagpat) tagpat = @"\w+";
if (null == argpat) argpat = @"[^>]*";
if (null == valpat) valpat = @"(?><\k'tag'\b[^>]*>(?'nst')|</\k'tag'>(?'-nst')|.?)*?(?(nst)(?!))";
return Regex.Matches(str, @"(?><(?'tag'" + tagpat + @"\b)\s*(?'arg'" + argpat + @")>)(?'val'" + valpat + @")</\k'tag'>",
RegexOptions.IgnoreCase | RegexOptions.Singleline);
}
The parameters are simple regexes that filter the target tag. Here are some examples:
ParseTag(page, "div", @"id=""content""\s+class=""mw-body""", null);
ParseTag(wikipage, "span", @"class=""bday""", @"\d{4}-\d{2}-\d{2}");
This variant handles opening and closing tags and nested tags of the same type (other nested tags can be broken and are ignored).
The other variant checks nested tags more strictly and does not match if some of them are incorrectly opened or closed:
if (null == valpat) valpat = @"(?><(?'itag'\w+)\b[^>]*>(?'nst')|</\k'itag'>(?'-nst')|.?)*?(?(nst)(?!))";
I find it much easier to use XPath. Maybe you will find it useful.
textBox2.Text = "<div style=\"padding: 5px; width: 212px\"><div>more text</div></div>";
string x = "//div[contains(@style,'padding: 5px; width: 212px')]";
XmlDocument doc = new XmlDocument();
doc.LoadXml(textBox2.Text);
XmlNodeList nodes = doc.SelectNodes(x);
foreach(XmlNode node in nodes)
{
textBox3.Text = node.InnerXml;
}
The regex code that worked for me finds the first inner div.
string r = @"<div style=""padding: 5px; width: 212px;";
Regex rg = new Regex(r);
var matches = rg.Matches(s);
if (matches.Count > 0)
{
foreach (Match m in matches)
{
textBox3.Text += m.Groups[1];
}
}

Regex Grouping in C#

I have multiple p tags in a HTML code.
<p class=MsoNormal><b style='mso-bidi-font-weight:normal'><span
style='font-size:7.0pt'>PA<span style='mso-spacerun:yes'> </span>ARALIĞI</span></b><span
style='font-size:7.0pt'> [İng. <b style='mso-bidi-font-weight:normal'>PA
interval</b>]. (<i style='mso-bidi-font-style:normal'>Kardiyoloji</i>).
Atriyum’un P dalgasının başlangıcını ayıran mesafe. İntraatriyal ya da
sino-nodal iletim süresinin (35-45 milisaniye) ölçümünü verir. Uzaması ileti
bozukluğunun göstergesidir. <o:p></o:p></span></p>
<p class=MsoNormal><b style='mso-bidi-font-weight:normal'><span
style='font-size:7.0pt'>PA<span style='mso-spacerun:yes'> </span>ARALIĞI</span></b> <span
style='font-size:7.0pt'> [İng. <b style='mso-bidi-font-weight:normal'>PA
interval</b>]. (<i style='mso-bidi-font-style:normal'>Kardiyoloji</i>).
Atriyum’un P dalgasının başlangıcını ayıran mesafe. İntraatriyal ya da
sino-nodal iletim süresinin (35-45 milisaniye) ölçümünü verir. Uzaması ileti
bozukluğunun göstergesidir. <o:p></o:p></span></p>
How can I get them into a List, each at its own index? I need each p to be one item in the list. My code is:
Regex Rx = new Regex(@"<p(.*)>(.*)<\/p>", RegexOptions.Multiline);
MatchCollection mc = Rx.Matches(yazi);
Thanks
It is a really bad idea to parse HTML with regular expressions. The syntax of HTML is too complex.
Use an HTML parser instead: Looking for C# HTML parser
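For what it is worth, the pattern in the question fails for two concrete reasons: RegexOptions.Multiline only changes how ^ and $ behave, so '.' still stops at line breaks, and the greedy (.*) runs to the last closing tag on a line. The sketch below shows a regex-only workaround for simple, non-nested paragraphs; the input is a hypothetical, shortened sample, and a real HTML parser remains the more robust choice.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical, shortened input: two paragraphs that each span several lines.
string yazi = "<p class=MsoNormal><b>PA</b>\ninterval one</p>\n<p class=MsoNormal><b>PA</b>\ninterval two</p>";

// Singleline lets '.' match newlines; the lazy '.*?' stops at the first closing </p>.
var rx = new Regex(@"<p\b[^>]*>(.*?)</p>", RegexOptions.Singleline | RegexOptions.IgnoreCase);

// Each paragraph body becomes one list entry.
var paragraphs = rx.Matches(yazi)
    .Cast<Match>()
    .Select(m => m.Groups[1].Value)
    .ToList();

Console.WriteLine(paragraphs.Count);
```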
