Why can't I parse this element with htmlagilitypack? - c#

I can't figure out how to parse the following:
-Example webpage I'm trying to parse: http://www.aliexpress.com/item/-/255859073.html
-Information I'm trying to get: "7-days". This is the processing time located in the left column of the shipping table.
-The shipping table becomes visible after clicking on the "Shipping and Payment" tab (which is down the page a bit).
So far I have tried selecting the node with different x-path values:
HtmlAgilityPack.HtmlDocument currentHTML = new HtmlAgilityPack.HtmlDocument();
HtmlWeb webget = new HtmlWeb();
currentHTML = webget.Load("http://www.aliexpress.com/item/-/255859073.html");
string processingTime = currentHTML.DocumentNode.SelectSingleNode("/html/body/div[2]/div[4]/div/div/div[2]/div/div/div[3]/div/div/div/div[2]/table/tbody/tr/td[5]").InnerText;
and also:
string processingTime = currentHTML.DocumentNode.SelectSingleNode("//*[contains(concat( \" \", @class, \" \" ), concat( \" \", \"processing\", \" \" ))]").InnerText;
But I get this error:
System.NullReferenceException was unhandled
Message=Object reference not set to an instance of an object.
I also tried their mobile phone website but they didn't display this information there.
Any idea why this is happening and what I need to do?

It looks like your XPath expression was incorrect. Regardless, the element you are trying to parse is easier to reach through its id attribute. I've modified the XPath expression and, as a bonus, added a regular expression that cleanly parses the days portion from the text.
System.Text.RegularExpressions.Regex dayParseRegex = new System.Text.RegularExpressions.Regex(@"(?<days>\d+)( days\))$");
HtmlAgilityPack.HtmlDocument currentHTML = new HtmlAgilityPack.HtmlDocument();
HtmlWeb webget = new HtmlWeb();
currentHTML = webget.Load("http://www.aliexpress.com/item/-/255859073.html");
//Extract node
var handlingTimeNode = currentHTML.DocumentNode.SelectSingleNode("//*[@id=\"product-info-shipping-sub\"]");
//Run RegEx against text
var match = dayParseRegex.Match(handlingTimeNode.InnerText);
//Convert the days to an integer from the resultant group
int shippingDays = Convert.ToInt32(match.Groups["days"].Value);
Talk about coding and gettin' paid! Now go rip the hell outta that site!
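One caveat: if the node is missing or the text does not match the pattern, InnerText or Convert.ToInt32 will throw. Here is a minimal sketch of a more defensive version of the regex step. The helper name TryParseDays and the sample text "Processing Time (7 days)" are assumptions for illustration, not the page's actual markup.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ProcessingTimeParser
{
    // Matches one or more digits followed by " days)" at the end of the text.
    private static readonly Regex DayParseRegex =
        new Regex(@"(?<days>\d+)( days\))$");

    // Hypothetical helper: returns false instead of throwing when the text
    // is null (e.g. the node was not found) or does not end in "<n> days)".
    public static bool TryParseDays(string text, out int days)
    {
        days = 0;
        if (text == null) return false;

        Match match = DayParseRegex.Match(text.Trim());
        return match.Success && int.TryParse(match.Groups["days"].Value, out days);
    }
}
```

With HtmlAgilityPack you would pass handlingTimeNode?.InnerText, so a null node simply yields false rather than a NullReferenceException.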

Related

XMLException. List of all the invalid characters

I am trying to execute the following code sample.
var xmlDocument = new XmlDocument();
var documentTagName = "testName)";
XmlNode headerElement = xmlDocument.CreateElement(documentTagName);
Of course I get an XmlException:
The ')' character, hexadecimal value 0x... (doesn't matter), cannot be included in a name
That is because I have a ) symbol in documentTagName. And of course I'll get the same exception if documentTagName looks like this:
documentTagName = "testName(";
or like this:
documentTagName = "testName:";
All of these characters ('(', ')', ':') are invalid in an XML tag name. I have checked many links (including this one) but cannot find a list of all the invalid characters for an XML tag name. Can anybody help me?
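Valid names are defined by the Name production of the XML specification rather than by a short list of forbidden characters, so it can be easier to let the framework check a candidate name. A minimal sketch, assuming XmlConvert.VerifyNCName and XmlConvert.EncodeLocalName suit the use case; the class and method names are hypothetical.

```csharp
using System;
using System.Xml;

public static class XmlNameHelper
{
    // Returns true when the candidate can be used as an unprefixed element name.
    // VerifyNCName throws for invalid, null, or empty names.
    public static bool IsValidElementName(string candidate)
    {
        try
        {
            XmlConvert.VerifyNCName(candidate);
            return true;
        }
        catch (XmlException) { return false; }
        catch (ArgumentException) { return false; }
    }

    // Replaces each invalid character with an _xHHHH_ escape so the result
    // is accepted by CreateElement; XmlConvert.DecodeName reverses it.
    public static string ToSafeElementName(string candidate)
    {
        return XmlConvert.EncodeLocalName(candidate);
    }
}
```

Whether encoding is acceptable depends on who reads the document afterwards, since consumers will see the escaped name unless they decode it.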

Get colored texts within HTML code

I have some HTML and I want to convert it to plain text, keeping only the tags for colored text.
For example, given the HTML below:
<body>
This is a <b>sample</b> html text.
<p align="center" style="color:#ff9999">this is only a sample<p>
....
and some other tags...
</body>
</html>
I want the output:
this is a sample html text.
<#ff9999>this is only a sample<>
....
and some other tags...
I'd use an HTML parser such as HtmlAgilityPack to parse the HTML, and a regular expression to find the color value in the attributes.
First, find all the nodes whose style attribute contains a color definition, using XPath:
var doc = new HtmlDocument();
doc.LoadHtml(html);
var nodes = doc.DocumentNode
    .SelectNodes("//*[contains(@style, 'color')]")
    .ToArray();
Then the simplest regex to match a color value: (?<=color:\s*)#?\w+.
var colorRegex = new Regex(@"(?<=color:\s*)#?\w+", RegexOptions.IgnoreCase);
Then iterate through these nodes and if there is a regex match, replace the inner html of the node with html encoded tags (you'll understand why a little bit later):
foreach (var node in nodes)
{
    var style = node.Attributes["style"].Value;
    if (colorRegex.IsMatch(style))
    {
        var color = colorRegex.Match(style).Value;
        node.InnerHtml =
            HttpUtility.HtmlEncode("<" + color + ">") +
            node.InnerHtml +
            HttpUtility.HtmlEncode("</" + color + ">");
    }
}
And finally get the inner text of the document and perform html decoding on it (this is because inner text strips all the tags):
var txt = HttpUtility.HtmlDecode(doc.DocumentNode.InnerText);
This should return something like this:
This is a sample html text.
<#ff9999>this is only a sample</#ff9999>
....
and some other tags...
Of course you could improve it for your needs.
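One possible improvement concerns the regex itself: the lookbehind for "color:" also matches inside "background-color:". A minimal sketch of the color-extraction step on its own, using only the BCL; the class name, the textColorOnly flag, and the stricter pattern are assumptions for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class StyleColorExtractor
{
    // The pattern from the answer: any "color:" followed by an optional '#'
    // and word characters.
    private static readonly Regex ColorRegex =
        new Regex(@"(?<=color:\s*)#?\w+", RegexOptions.IgnoreCase);

    // Hypothetical stricter variant: "color" must not be preceded by a word
    // character or '-', which excludes "background-color" and similar.
    private static readonly Regex TextColorOnlyRegex =
        new Regex(@"(?<=(?<![\w-])color:\s*)#?\w+", RegexOptions.IgnoreCase);

    // Returns the first color value found in the style string, or null.
    public static string ExtractColor(string style, bool textColorOnly = false)
    {
        Regex regex = textColorOnly ? TextColorOnlyRegex : ColorRegex;
        Match match = regex.Match(style ?? string.Empty);
        return match.Success ? match.Value : null;
    }
}
```

Both variants still stop at the first non-word character, so values such as rgb(255, 0, 0) come back as just "rgb".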
It is possible to do it using regular expressions but... You should not parse (X)HTML with regex.
The first regexp I came with to solve the problem is:
<p(\w|\s|[="])+color:(#([0-9a-f]{6}|[0-9a-f]{3}))">(\w|\s)+</p>
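With the pattern as written, group 2 holds the colour including the leading # and group 3 holds the 3 or 6 hex digits on their own. The text inside the tag is matched by the repeated group 4, which in .NET keeps only its last character as the group value; wrap it as ((\w|\s)+) to capture the whole text.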
Obviously, it's not the best solution as I'm not a regexp master and obviously it needs some testing and probably generalisation... But still it's a good point to start with.

Escaping ONLY contents of Node in XML

I have some code like the following.
//Reading from a file and assign to the variable named "s"
string s = "<item><name> Foo </name></item>";
XmlDocument doc = new XmlDocument();
doc.LoadXml(s);
However, it stops working if the content contains characters such as "<" or ">".
string s = "<item><name> Foo > Bar </name></item>";
I know I have to escape those characters before loading, but if I do this:
doc.LoadXml(System.Security.SecurityElement.Escape(s));
the tags (<, >) are also escaped and, as a result, an error occurs.
How can I solve this problem?
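A tricky solution:
string s = "<item><name> Foo > Bar </name></item>";
// Encode the real tags, swap any remaining angle brackets for placeholders,
// then decode the tags again and turn the placeholders into entities.
s = Regex.Replace(s, @"<[^>]+?>", m => HttpUtility.HtmlEncode(m.Value)).Replace("<", "ojlovecd").Replace(">", "cdloveoj");
s = HttpUtility.HtmlDecode(s).Replace("ojlovecd", "&lt;").Replace("cdloveoj", "&gt;");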
XmlDocument doc = new XmlDocument();
doc.LoadXml(s);
Assuming your content will never contain the characters "]]>", you can use CDATA.
string s = "<item><name><![CDATA[ Foo > Bar ]]></name></item>";
Otherwise, you'll need to HTML-encode your special characters and decode them before you use or display them (unless it's in a browser).
string s = "<item><name> Foo &gt; Bar </name></item>";
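If the raw content is available before it is wrapped in tags, escaping only that content keeps the tags intact. A minimal sketch under that assumption, using SecurityElement.Escape from the question; the class and method names are hypothetical.

```csharp
using System;
using System.Security;
using System.Xml;

public static class XmlContentEscaper
{
    // Escapes only the text content, then wraps it in the (unescaped) tags.
    public static XmlDocument LoadItem(string rawName)
    {
        // Escape replaces <, >, &, " and ' with their XML entities.
        string escaped = SecurityElement.Escape(rawName);

        var doc = new XmlDocument();
        doc.LoadXml("<item><name>" + escaped + "</name></item>");
        return doc;
    }
}
```

Reading the value back through InnerText returns the original, unescaped text, so no manual decoding step is needed.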
Assign the content of string to the InnerXml property of node.
var node = doc.CreateElement("root");
node.InnerXml = s;
Take a look at - Different ways how to escape an XML string in C#
It looks like the strings that you have generated are plain strings, not valid XML. You can either have them generated as valid XML or, if you know that the strings will always be the name, leave the XML <item> and <name> tags out of the data.
Then, when you create the XmlDocument, call CreateElement and assign your string before saving the results.
XmlDocument doc = new XmlDocument();
XmlElement root = doc.CreateElement("item");
doc.AppendChild(root);
XmlElement name = doc.CreateElement("name");
name.InnerText = "the contents from your file";
root.AppendChild(name);

Validating HTML Tags in a String in C#

Assume that we have the following HTML strings.
string A = " <table width=325><tr><td width=325>test</td></tr></table>";
string B = " <<table width=325><tr><td width=325>test</td></table>";
How can we validate A or B in C# according to HTML specifications?
A should return true whereas B should return false.
For this specific case you can use HTML Agility Pack to assert if the HTML is well formed or if you have tags not opened.
var htmlDoc = new HtmlDocument();
htmlDoc.LoadHtml(
"WAVEFORM</u> YES, <u>NEGATIVE AUSCULTATION OF EPIGASTRUM</u> YES,");
foreach (var error in htmlDoc.ParseErrors)
{
    // Prints: TagNotOpened
    Console.WriteLine(error.Code);
    // Prints: Start tag <u> was not found
    Console.WriteLine(error.Reason);
}
Checking a HTML string for unopened tags
One point to start with is checking if it's valid XML.
By the way, I think both your examples are incorrect as you've left out the </tr> from both.
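The "valid XML" check could be sketched as below, using only System.Xml; the class and method names are hypothetical. Note that XML is stricter than HTML: unquoted attribute values such as width=325 are rejected, so this only works as a first filter for XHTML-style markup.

```csharp
using System;
using System.Xml;

public static class MarkupChecker
{
    // Returns true when the markup parses as a well-formed XML document.
    public static bool IsWellFormedXml(string markup)
    {
        if (string.IsNullOrWhiteSpace(markup)) return false;

        try
        {
            new XmlDocument().LoadXml(markup);
            return true;
        }
        catch (XmlException)
        {
            // Unbalanced tags, stray '<', unquoted attributes, and so on.
            return false;
        }
    }
}
```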
http://web.archive.org/web/20110820163031/http://markbeaton.com/SoftwareInfo.aspx?ID=81a0ecd0-c41c-48da-8a39-f10c8aa3f931
Github link: https://github.com/markbeaton/TidyManaged
This guy has written a .NET wrapper for HTMLTidy. I haven't used it but it may be what you are looking for.

c# parsing xml with an apostrophe throws exception

I am parsing an XML file and am running into an issue when trying to find a node that has an apostrophe in its name. When the item name does not contain one, everything works fine. I have tried replacing the apostrophe with different escape characters but am not having much luck.
string s = "/itemDB/item[@name='" + itemName + "']";
// Things I have tried that did not work
// s.Replace("'", "''");
// .Replace("'", "\'");
XmlNode parent = root.SelectSingleNode(s);
I always receive an XPathException. What is the proper way to do this? Thanks
For an apostrophe, replace it with &apos;
You can do it like this:
XmlDocument root = new XmlDocument();
root.LoadXml(@"<itemDB><item name=""abc'def""/></itemDB>");
XmlNode node = root.SelectSingleNode(@"itemDB/item[@name=""abc'def""]");
Note the verbatim string literal '@' and the double quotes.
Your code would then look like this and there is no need to replace anything:
var itemName = @"abc'def";
string s = @"/itemDB/item[@name=""" + itemName + @"""]";
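Switching to double quotes breaks again as soon as an item name contains a double quote. XPath 1.0 has no escape character inside string literals, so a common workaround is to build the literal with concat(). A minimal sketch; the class and method names are hypothetical.

```csharp
using System;
using System.Xml;

public static class XPathLiteral
{
    // Wraps a value as an XPath 1.0 string literal, whatever quotes it contains.
    public static string Quote(string value)
    {
        if (!value.Contains("'")) return "'" + value + "'";
        if (!value.Contains("\"")) return "\"" + value + "\"";

        // Both quote types present: split on apostrophes and rejoin the
        // pieces with a double-quoted apostrophe, e.g. concat('a', "'", 'b').
        string[] parts = value.Split('\'');
        return "concat('" + string.Join("', \"'\", '", parts) + "')";
    }

    // Looks up an item by name using the quoting helper above.
    public static XmlNode FindItem(XmlNode root, string itemName)
    {
        return root.SelectSingleNode("/itemDB/item[@name=" + Quote(itemName) + "]");
    }
}
```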
