XPath to href attribute & WriteLine the URL [duplicate]

This question already has answers here:
Get a value of an attribute by XPath and HtmlAgilityPack
(3 answers)
Closed 9 years ago.
I understand there are many XPath href questions, but none suit my case, or perhaps I am too much of a beginner to see what's wrong with my code. Kindly bear with me if this is a silly question.
I have this HTML structure:
<td valign="top">08-Jan-14 16:02</td>
<td valign="top"><span style="cursor:help;" title="Regulatory News Service">RNS</span></td>
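<td valign="top"><a href="share-regulatory-news.asp?shareprice=BARC&ArticleCode=d6rr2uxo&ArticleHeadline=Blocklisting_Interim_Review">Blocklisting Interim Review</a></td>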
<td valign="top">Company Announcement - General</td>
My code is:
HtmlNodeCollection cols5 = rows[i].SelectNodes(".//td[3]/a[@href]");
Stream writer to write the URL:
sw.WriteLine(cols5[j].InnerText);
The result is the link text, Blocklisting Interim Review, instead of the URL. Can anyone kindly look into it? I've gone through an XPath guide and searched all over but still can't find the exact answer for my case. Any help would be much appreciated!

With HtmlAgilityPack, SelectNodes and SelectSingleNode return element nodes, so you cannot select an attribute directly with them. Select the a element and then read its href attribute. The following XPath selects, from the third table cell, an a element that has an href attribute (the predicate only requires that the attribute exists; it does not select the attribute itself):
var a = doc.DocumentNode.SelectSingleNode(".//td[3]/a[@href]");
var href = a.Attributes["href"].Value;
Returns
share-regulatory-news.asp?shareprice=BARC&ArticleCode=d6rr2uxo&ArticleHeadline=Blocklisting_Interim_Review
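To make the difference between InnerText and the attribute value concrete, here is a minimal sketch that assumes the HtmlAgilityPack NuGet package is referenced. The class name HrefExtractor and the method ThirdCellLinks are hypothetical names chosen for illustration, and the XPath assumes each row keeps its link in the third cell, as in the question.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is installed

public static class HrefExtractor
{
    // Hypothetical helper: returns the href of every <a> in the third cell of each row.
    public static List<string> ThirdCellLinks(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var links = new List<string>();

        // SelectNodes returns null, not an empty collection, when nothing matches.
        HtmlNodeCollection anchors = doc.DocumentNode.SelectNodes("//tr/td[3]/a[@href]");
        if (anchors == null)
        {
            return links;
        }

        foreach (HtmlNode a in anchors)
        {
            // InnerText would give the link text; the attribute holds the URL.
            links.Add(a.GetAttributeValue("href", string.Empty));
        }
        return links;
    }
}
```

In the question's loop, the equivalent change would be to write cols5[j].GetAttributeValue("href", "") instead of cols5[j].InnerText, after checking that cols5 is not null.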

Related

How to remove Only HTML tags in the program [duplicate]

This question already has an answer here:
Retrieving Inner Text of Html Tag C#
(1 answer)
Closed 3 years ago.
I want to remove HTML tags from some source text with C#.
Unfortunately, some of the content looks like <This is content>.
First, I tried the Regex class like this:
Regex.Replace(htmltext,"[\\x00-\\x1f<>:\"/\\\\|?*]" +
"|^(CON|PRN|AUX|NUL|COM[0-9]|LPT[0-9]|CLOCK\\$)(\\.|$)" +
"|[\\. ]$", String.Empty);
But in this case,
"<This is content>" was removed as well.
So can anyone please tell me how to remove only the HTML tags in the program?
Thanks and regards.

Don't try to parse HTML with Regex. It tends not to go well.
Use a parser; HTML Agility Pack is very popular.
With HTML Agility Pack you can simply read InnerText to extract the contents without the HTML tags.
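As a rough sketch of that suggestion, assuming the HtmlAgilityPack NuGet package is referenced (TagStripper and ToPlainText are hypothetical names, not part of the library): the document is loaded, InnerText drops the markup, and HtmlEntity.DeEntitize converts entities back to characters. Note the assumption that literal text such as <This is content> is stored entity-escaped (&lt;This is content&gt;) in the source; how a parser treats it when it is left unescaped is not something this sketch guarantees.

```csharp
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is installed

public static class TagStripper
{
    // Hypothetical helper: returns the text content of an HTML fragment.
    public static string ToPlainText(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // InnerText concatenates the text nodes and leaves the tags out.
        string text = doc.DocumentNode.InnerText;

        // Entities such as &lt; and &gt; are turned back into characters.
        return HtmlEntity.DeEntitize(text);
    }
}
```

A call such as TagStripper.ToPlainText("<p>Hello <b>world</b></p>") is expected to yield the text without the p and b tags.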

C# Xml SelectSingleNode returns null [duplicate]

This question already has answers here:
SelectSingleNode returns null when tag contains xmlNamespace
(4 answers)
Closed 5 years ago.
I have this XML and want to extract the first Country from it:
<string xmlns="http://www.webserviceX.NET">
<NewDataSet>
<Table>
<Country>Hong Kong</Country>
<City>Cheung Chau</City>
</Table>
<Table>
<Country>Hong Kong</Country>
<City>Hong Kong Inter-National Airport</City>
</Table>
</NewDataSet>
</string>
Here's what I did:
value = xml.DocumentElement.SelectSingleNode("string/NewDataSet/Table[1]/Country").InnerText;
This always throws an "Object reference not set to an instance of an object" exception, because SelectSingleNode always returns null. The strange thing is that I have already tested this XPath separately, and it does return the node I want.
I googled for a solution and found a suggestion that I have to add a namespace. Here's what I did:
var nsmgr = new XmlNamespaceManager(xml.NameTable);
nsmgr.AddNamespace("string", "http://www.webserviceX.NET");
var node = xml.DocumentElement.SelectSingleNode("string/NewDataSet/Table[1]/Country", nsmgr);
Still I have the same exception. Can someone please let me know what I'm doing wrong here? Thanks :)

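Use an XmlNamespaceManager, and prefix every element in the XPath with the prefix you registered. The elements are in the default namespace http://www.webserviceX.NET, so unprefixed names such as NewDataSet do not match them: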
XmlNamespaceManager namespaces = new XmlNamespaceManager(xdoc.NameTable);
namespaces.AddNamespace("sp", "http://www.webserviceX.NET");
var nodes = xdoc.DocumentElement.SelectSingleNode("//sp:NewDataSet/sp:Table[1]/sp:Country", namespaces);
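For reference, here is a self-contained sketch of the same idea using only System.Xml. CountryLookup, SampleXml and FirstCountry are hypothetical names, and the prefix "sp" is arbitrary: it only has to map to the namespace URI used in the document. Because DocumentElement is already the string element, the relative path starts at NewDataSet.

```csharp
using System.Xml;

public static class CountryLookup
{
    // The XML from the question, with its default namespace declaration.
    public const string SampleXml =
        @"<string xmlns=""http://www.webserviceX.NET"">
            <NewDataSet>
              <Table><Country>Hong Kong</Country><City>Cheung Chau</City></Table>
              <Table><Country>Hong Kong</Country><City>Hong Kong Inter-National Airport</City></Table>
            </NewDataSet>
          </string>";

    // Hypothetical helper: returns the first Country, or null if none is found.
    public static string FirstCountry(string xmlText)
    {
        var xml = new XmlDocument();
        xml.LoadXml(xmlText);

        // Map a prefix of our choosing to the document's default namespace.
        var ns = new XmlNamespaceManager(xml.NameTable);
        ns.AddNamespace("sp", "http://www.webserviceX.NET");

        // Every step needs the prefix; the context node is the <string> root.
        XmlNode node = xml.DocumentElement.SelectSingleNode(
            "sp:NewDataSet/sp:Table[1]/sp:Country", ns);

        return node?.InnerText;
    }
}
```

Calling CountryLookup.FirstCountry(CountryLookup.SampleXml) returns "Hong Kong". Returning null instead of dereferencing the node directly avoids the exception from the question when nothing matches.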

How do I make this regex stop at the first match? [duplicate]

This question already has answers here:
My regex is matching too much. How do I make it stop? [duplicate]
(5 answers)
Closed 6 years ago.
I'm converting a lot of code from legacy to maintainable, and I'm creating a list of regexes we can use to convert all the pages quickly and consistently. My regex skills are those of a child running with a knife... they're not great. I've looked up a lot of different ways to find only the first set, but I can't seem to get it to work. Can anyone solve this specific problem for me?
Here is the regex search and replace I'm using.
regex: (rs.*)\.Fields\[\"(\w+)\"\].Value
replace: $1.GetValue<object>("$2")
Works
code to search: ...rsProducts.Fields["Price"].Value...
result: rsProducts.GetValue<object>("Price")
This, as I want it to, finds the rs (recordset) of something and changes the way that we extract the value to use an extension method.
Does Not Work
code to search: ...rsProducts.Fields["Price"].Value + rsProducts.Fields["Price2"].Value...
result: rsProducts.Fields["Price"].Value + rsProducts.Fields["Price2"].Value
should be: rsProducts.GetValue<object>("Price") + rsProducts.GetValue<object>("Price2")
In this case the search does not match two distinct instances; instead, it matches the entire line.

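You're not handling the + between the two. The .* in your first group is greedy, so it runs through to the last .Fields on the line. Make it lazy with .*? so that each match stops at the first .Fields it can: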
(rs.*?)\.Fields\[\"(\w+)\"\].Value
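A small sketch of the lazy version in C#, using only System.Text.RegularExpressions. RecordsetRewriter and Rewrite are hypothetical names, and as an assumption beyond the original pattern the dot before Value is escaped here so that it matches only a literal dot.

```csharp
using System.Text.RegularExpressions;

public static class RecordsetRewriter
{
    // Lazy (rs.*?) stops at the first .Fields["..."] it can reach.
    private const string Pattern = @"(rs.*?)\.Fields\[""(\w+)""\]\.Value";

    // $1 is the recordset expression, $2 the field name.
    private const string Replacement = "$1.GetValue<object>(\"$2\")";

    // Hypothetical helper: rewrites every recordset field access in the input.
    public static string Rewrite(string code)
    {
        return Regex.Replace(code, Pattern, Replacement);
    }
}
```

For the failing line from the question, Rewrite produces rsProducts.GetValue<object>("Price") + rsProducts.GetValue<object>("Price2"). A stricter first group such as (rs\w*) would be another option if recordset names are always plain identifiers, but that is an assumption about the codebase.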

Parsing HTML page to extract links [duplicate]

This question already has answers here:
How to extract full url with HtmlAgilityPack - C#
(2 answers)
Closed 8 years ago.
I want to parse an HTML file and extract the links from its <a> tags. For example, I am trying to extract the link from the following <a> tag:
<a class="thumb vtop inlblk rel tdnone linkWithHash scale5 detailsLink" href="http://olx.com.pk/item/honda-civic-exi-2005-IDSkzkt.html#6256e9ac30" title=""> <img class="fleft" src="http://img03.olx.com.pk/images_olxpk/89491775_1_144x108_honda-civic-exi-2005-lahore_rev001.jpg" alt="Honda Civic Exi 2005"> </a>
I use the following regular expression
private const string _LINK_REGEX = "href=\"[a-zA-Z./:&\\d_-]+\"";
But I am unable to extract this url.

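Your character class does not include #, which appears in this URL (...html#6256e9ac30), so the pattern never reaches the closing quote. You can match everything up to the closing quote instead:
href=\"[^\"]+\"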
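Here is a minimal sketch of that pattern in use, with a capture group added so that only the URL is returned rather than the whole href="..." text. LinkFinder and Hrefs are hypothetical names. It assumes double-quoted attribute values; for arbitrary real-world HTML a parser is usually the more robust choice.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LinkFinder
{
    // Hypothetical helper: returns the value of every double-quoted href attribute.
    public static List<string> Hrefs(string html)
    {
        var result = new List<string>();

        // [^"]+ accepts any character except the closing quote, including '#'.
        foreach (Match m in Regex.Matches(html, "href=\"([^\"]+)\""))
        {
            result.Add(m.Groups[1].Value);
        }
        return result;
    }
}
```

Applied to the <a> tag from the question, this returns a single entry, http://olx.com.pk/item/honda-civic-exi-2005-IDSkzkt.html#6256e9ac30; the src of the nested img is not picked up because the pattern requires href=.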

Get text from HTML [duplicate]

This question already has answers here:
How do you convert Html to plain text?
(20 answers)
Closed 1 year ago.
I need a way to get all text from my aspx files.
They may also contain JavaScript, but I only need this for the HTML code.
Basically I need to extract everything on Text or Value attributes, text within code, whatever...
Is there any parser API available?
Cheers!
Alex

As an alternative, you might consider playing with Linq to XML to strip the interesting stuff out.
