Get the value of an HTML element - c#

I have the HTML source of a web page in a text file. I'd like my program to return the text inside a tag. For example, I want to get "Julius" out of
<span class="hidden first">Julius</span>
Do I need a regular expression for this? If not, which string function can do it?

You should use an HTML parser such as HtmlAgilityPack. Regex is not a good choice for parsing HTML files, because HTML is neither strict nor regular in its format.
You can use the code below to retrieve the value with HtmlAgilityPack:
using System.Linq;
using HtmlAgilityPack;

HtmlDocument doc = new HtmlDocument();
doc.Load(yourStream);
// This XPath selects every span whose class attribute is exactly "hidden first".
// Note: SelectNodes returns null when nothing matches.
var itemList = doc.DocumentNode.SelectNodes("//span[@class='hidden first']")
    .Select(p => p.InnerText)
    .ToList();
// itemList now contains the text of every matching span.

I would use the Html Agility Pack to parse the HTML in C#.

I'd strongly recommend you look into something like the HTML Agility Pack

I asked the same question a few days ago and ended up using the HTML Agility Pack, but here are the regular expressions you asked for.
This one ignores the attributes:
<span[^>]*>(.*?)</span>
This one takes the attributes into account:
<span class="hidden first"[^>]*>(.*?)</span>
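If a regular expression is used anyway, applying the second pattern from C# might look like the sketch below. It assumes the HTML has already been read into a string; the `html` variable and its sample content are illustrative, not from the question. It uses only System.Text.RegularExpressions and will not cope with nested spans or a different attribute order.

```csharp
using System;
using System.Text.RegularExpressions;

class SpanRegexDemo
{
    static void Main()
    {
        // Illustrative input; in practice this could come from File.ReadAllText(path).
        string html = "<p><span class=\"hidden first\">Julius</span></p>";

        // The lazy (.*?) stops at the first closing </span>.
        var pattern = new Regex("<span class=\"hidden first\"[^>]*>(.*?)</span>",
                                RegexOptions.IgnoreCase | RegexOptions.Singleline);

        foreach (Match m in pattern.Matches(html))
        {
            // Group 1 holds the text between the tags.
            Console.WriteLine(m.Groups[1].Value);
        }
    }
}
```

For the sample input this prints Julius. RegexOptions.Singleline is included so the capture can span line breaks inside the tag.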

Related

Getting multiple values from a string with delimiters [duplicate]

I have this from html page source
<h5 class="icn-venue">Tavernita</h5>
There are, say, 10 values like this on the page.
I want to extract the value between the "h5" tags. The class="icn-venue" attribute is the same for all of them.
I tried splitting on the tag and storing the pieces, but the code doesn't seem to work.
You can do it like this using HtmlAgilityPack:
HtmlDocument doc = new HtmlDocument();
doc.Load(yourStream);
// Select every h5 whose class attribute is "icn-venue" (note the @ before the attribute name)
List<string> lst = doc.DocumentNode.SelectNodes("//h5[@class='icn-venue']")
    .Select(x => x.InnerHtml)
    .ToList();
HTML Agility Pack is a great tool for manipulating and working with HTML: http://htmlagilitypack.codeplex.com/
It could at least make grabbing the values you need and doing the replaces a little easier.
This question contains links on using the HTML Agility Pack: How to use HTML Agility pack

c# parse html using XPathDocument

I'm trying to parse an HTML page with XPathDocument, but it throws an error because the HTML is not valid XML.
Is there a way to do this or not?
You should use HtmlAgilityPack. It is still the best!
Use something like the Html Agility Pack, which can load your HTML into a DOM object that can be traversed with, for example, XPath queries.
Unless your html is in fact xhtml, it is usually not a valid xml structure with correct opening and ending node tags.
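The difference can be seen with a minimal sketch using only the .NET base class library. The `TrySelect` helper and both sample strings are illustrative assumptions, not taken from the question.

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.XPath;

class XPathDocumentDemo
{
    // Returns the value of the first node matching xpath,
    // or a note when the markup is not well-formed XML.
    static string TrySelect(string markup, string xpath)
    {
        try
        {
            var doc = new XPathDocument(new StringReader(markup));
            XPathNavigator node = doc.CreateNavigator().SelectSingleNode(xpath);
            return node == null ? "(no match)" : node.Value;
        }
        catch (XmlException ex)
        {
            return "not well-formed XML: " + ex.Message;
        }
    }

    static void Main()
    {
        // XHTML-style markup with a self-closed <br/> loads fine.
        Console.WriteLine(TrySelect("<p>one<br/><b>two</b></p>", "/p/b"));

        // Typical HTML with an unclosed <br> is rejected by the XML parser.
        Console.WriteLine(TrySelect("<p>one<br><b>two</b></p>", "/p/b"));
    }
}
```

The first call prints the text of the b element; the second reports an XmlException, which is the kind of error described in the question.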

Extract action attribute in a Form tag with Regex in C#?

I want to extract https://www.sth.com/yment/Paymentform.aspx from the string below:
<form id='paymentUTLfrm' action='https://www.sth.com/yment/Paymentform.aspx' method='post'>
How can I do it with Regex or something similar?
While I don't encourage using regex to parse HTML, this is simple enough that a regex will suffice. For more complex operations, do use a proper (X)HTML parser like HtmlAgilityPack.
This regex should work:
<\s*form[^>]*\s+action=(["'])(.*?)\1
EDIT:
Updated regex so it will work with apostrophes in URLs. Note that the URL is now in the 2nd capture group.
See it on rubular
Use Html Agility Pack. It will save you a lot of trouble in the long run.
using HtmlAgilityPack;
var doc = new HtmlDocument();
doc.LoadHtml("<form id='paymentUTLfrm' action='https://www.sth.com/yment/Paymentform.aspx' method='post'>");
var form = doc.DocumentNode.SelectSingleNode("id('paymentUTLfrm')");
string action = form.Attributes["action"].Value;
It supports loading pages directly from the web, as well as XPath (used above). The HTML does not have to be valid.
EDIT: If you want to select by the name attribute instead:
doc.DocumentNode.SelectSingleNode("//*[@name='paymentUTLfrm']");
While I agree that general HTML parsing is best done with the Html Agility Pack (or similar) rather than with regex, this is a pretty simple requirement and a regex would be appropriate. I am no regex expert, but this one works:
action=["'](.*?)["']
The lazy (.*?) captures the URL and stops at the first closing quote.
Maybe some expert can add a comment to refine this...
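Whether the quantifier is greedy or lazy matters for this input, because the tag has another quoted attribute after action. The sketch below compares the two on the form tag from the question, using only System.Text.RegularExpressions. The class and variable names are illustrative.

```csharp
using System;
using System.Text.RegularExpressions;

class ActionRegexDemo
{
    static void Main()
    {
        string tag = "<form id='paymentUTLfrm' action='https://www.sth.com/yment/Paymentform.aspx' method='post'>";

        // Greedy: (.*) runs to the last quote in the tag, so it also swallows method='post.
        string greedy = Regex.Match(tag, @"action=[""'](.*)[""']").Groups[1].Value;

        // Lazy with a backreference: stops at the first quote that matches the opening one.
        string lazy = Regex.Match(tag, @"action=([""'])(.*?)\1").Groups[2].Value;

        Console.WriteLine(greedy);
        Console.WriteLine(lazy);
    }
}
```

The backreference \1 also keeps an apostrophe inside a double-quoted URL from ending the match early.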

Can I use Html Agility Pack To Parse HTML Fragment?

Can Html Agility Pack be used to parse an html string fragment?
Such As:
var fragment = "<b>Some code </b>";
And then to extract all <b> tags? All the examples I have seen so far load full HTML documents.
If it's HTML, then yes.
string str = "<b>Some code</b>";
// Wrapping the fragment in a full document may not be necessary
string html = string.Format("<html><head></head><body>{0}</body></html>", str);
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);
// See an XPath tutorial for how to select elements
// "//b" searches the whole document; SelectSingleNode returns the first match
HtmlNode bNode = doc.DocumentNode.SelectSingleNode("//b");
string boldText = bNode.InnerText;
I don't think this is really the best use of HtmlAgilityPack.
Normally I see people trying to parse large amounts of HTML with regular expressions and I point them towards HtmlAgilityPack, but in this case I think a regex would be the better choice.
Roy Osherove has a blog post describing how you can strip out all the html from a snippet:
http://weblogs.asp.net/rosherove/archive/2003/05/13/6963.aspx
Even if you did get the correct xpath with Mika Kolari's sample this would only work for a snippet with a <b> tag in it and would break if the code changed.
This answer came up when I searched for the same thing. I don't know whether the features have changed since it was answered, but the approach below (PowerShell) should work better.
$string = '<b>Some code </b>'
[HtmlAgilityPack.HtmlNode]::CreateNode($string)

Inbuilt Regex class or parser: how to extract text between the tags from an HTML file?

My C#/.NET application has an HTML file that contains a table and other information.
I want to parse the table contents for only some columns. Should I use an HTML parser or the Replace method of Regex in .NET?
If I use a parser, how do I use it? Will the parser extract the information between the tags? If so, how? If possible, please show an example, because I am new to parsers.
If I use the Replace method of the Regex class, how do I pass in the name of the file I want to extract the information from?
Edit: I want to extract information from the table in the HTML file. How can I use the HTML Agility Pack for that? What kind of code should I write to use that parser?
You just asked an almost identical question and deleted it. Here was the answer I gave before:
Try the HTML Agility Pack.
Here's an example:
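HtmlDocument doc = new HtmlDocument();
doc.Load("file.htm");
// Select every <a> element that has an href attribute
foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//a[@href]"))
{
    HtmlAttribute att = link.Attributes["href"];
    // FixLink is your own helper that returns the corrected URL
    att.Value = FixLink(att);
}
doc.Save("file.htm");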
As for your extra question about regex: do not use Regex to parse HTML. It is not a robust solution. The library above does a much better job.
HtmlAgilityPack....
Next time, search for an answer before asking. This is surely a duplicate.
Little tutorial.
