Extract string from HTML

Extract string from HTML - c#

I want to extract the string KLE3KAN918D429 from the following html code:
<td class="Labels"> CODE (Sp Number): </td><td width="40.0%"> KLE3KAN918D429</td>
Is there a method in C# where I can specify the source text, a start string, and an end string, and get the string between start and end?

You are, as per the comments, probably better off using a parsing library to iterate the DOM structure but if you can make some assumptions about the html you'll be parsing, you could do something like below:
var html = "<td class=\"Labels\"> CODE (Sp Number): </td><td width=\"40.0%\"> KLE3KAN918D429</td>";
var labelIndex = html.IndexOf("<td class=\"Labels\">");
var pctIndex = html.IndexOf("%", labelIndex);
var closeIndex = html.IndexOf("<", pctIndex);
var key = html.Substring(pctIndex + 3, closeIndex - pctIndex - 3).Trim();
System.Diagnostics.Debug.WriteLine(key);
Likely quite brittle but sometimes quick and dirty is all that is required.
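The question asks for a method that takes a source text, a start string, and an end string. A minimal sketch of such a helper, using only IndexOf and Substring, might look like the following. The names TextBetween and Extract are hypothetical, and the helper assumes the delimiters appear literally in the source, so it is just as brittle as the snippet above when the markup changes.

```csharp
using System;

public static class TextBetween
{
    // Returns the text between the first occurrence of `start` and the next
    // occurrence of `end` after it, or null when either delimiter is missing.
    public static string Extract(string source, string start, string end)
    {
        int startIndex = source.IndexOf(start, StringComparison.Ordinal);
        if (startIndex < 0) return null;

        startIndex += start.Length; // skip past the start delimiter itself
        int endIndex = source.IndexOf(end, startIndex, StringComparison.Ordinal);
        if (endIndex < 0) return null;

        return source.Substring(startIndex, endIndex - startIndex);
    }
}
```

For the HTML in the question, TextBetween.Extract(html, "<td width=\"40.0%\">", "</td>") followed by Trim() should yield the code value.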

As others already suggested, you should use something like HtmlAgilityPack for parsing html. Don't use regular expressions or other hacks for parsing html.
You have several td nodes in your html string. Getting the last one is easy with the td[last()] XPath:
string html = "<td class=\"Labels\"> CODE (Sp Number): </td><td width=\"40.0%\"> KLE3KAN918D429</td>";
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);
var td = doc.DocumentNode.SelectSingleNode("td[last()]");
var result = td.InnerText.Trim(); // "KLE3KAN918D429"

I really suggest using HTMLAgilityPack for this.
It's as easy as:
var doc = new HtmlDocument();
doc.LoadHtml(@"<td class=""Labels""> CODE (Sp Number): </td><td width=""40.0%""> KLE3KAN918D429</td>");
var tdNode = doc.DocumentNode.SelectSingleNode("//td[@class='Labels' and text()=' CODE (Sp Number): ']/following-sibling::td[1]");
Console.WriteLine(tdNode.InnerText.Trim());
Before you start, add HtmlAgilityPack from NuGet:
Install-Package HtmlAgilityPack

Related

Extracting data from HTML file using c# script

What I need to do: extract the From, To, Cc, and Subject information and remove it from the HTML file, without using any third-party library (HtmlAgilityPack, etc.).
What I am having trouble with: what approach should I take to get these fields (From, To, Subject, Cc) from the HTML tags?
Steps I tried: I tried to get the index of <p class=MsoNormal> and the last index of the email domain @sampleemail.com, but I think that is a bad approach, since some HTML files contain many <p class=MsoNormal> tags. To remove the From, To, Cc, and Subject fields, I used string.Remove(indexOf, count), where count is the number of characters from indexOf to lastIndexOf, and it worked.
Sample tag containing information of from:
<p class=MsoNormal style='margin-left:120.0pt;text-indent:-120.0pt;tab-stops:120.0pt;mso-layout-grid align:none;text-autospace:none'><b><span style='color:black'>From:<span style='mso-tab-count:1'></span></span></b><span style='color:black'>1234@sampleemail.com<o:p></o:p></span></p>
HTML FILE output:

HtmlAgilityPack is your friend. Simply use an XPath such as //p[@class='MsoNormal'] to get the content of the tags:
public static void Main()
{
var html =
@"<p class=MsoNormal style='margin-left:120.0pt;text-indent:-120.0pt;tab-stops:120.0pt;mso-layout-grid align:none;text-autospace:none'><b><span style='color:black'>From:<span style='mso-tab-count:1'></span></span></b><span style='color:black'>1234@sampleemail.com<o:p></o:p></span></p> ";
var htmlDoc = new HtmlDocument();
htmlDoc.LoadHtml(html);
var nodes = htmlDoc.DocumentNode.SelectNodes("//p[@class ='MsoNormal']");
foreach(var node in nodes)
Console.WriteLine(node.InnerText);
}
Result:
From:1234@sampleemail.com
Update
We can use Regex to write this simple parser. But remember that it cannot handle every case in a complicated HTML document.
public static void MainFunc()
{
string str = @"<p class='MsoNormal' style='margin-left:120.0pt;text-indent:-120.0pt;tab-stops:120.0pt;mso-layout-grid align:none;text-autospace:none'><b><span style='color:black'>From:<span style='mso-tab-count:1'></span></span></b><span style='color:black'>1234@sampleemail.com<o:p></o:p></span></p> ";
var result = Regex.Replace(str, "<(?:\"[^\"]*\"['\"]*|'[^']*'['\"]*|[^'\">])+>", "");
Console.WriteLine(result);
}
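Once the tags are stripped, the remaining text has the shape "From:1234@sampleemail.com", so the label and the value can be separated at the first colon. The following is a rough sketch of that step without any third-party library. MailHeaderLine and Parse are hypothetical names, the simpler pattern <[^>]+> is swapped in for the regex above, and it assumes that attribute values never contain a '>' character and that each paragraph holds exactly one field.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class MailHeaderLine
{
    // Strips the tags from one paragraph and splits the remaining text at the
    // first colon into (label, value). Returns null when there is no colon.
    // Assumption: attribute values never contain '>'.
    public static Tuple<string, string> Parse(string paragraphHtml)
    {
        string text = Regex.Replace(paragraphHtml, "<[^>]+>", "");
        text = WebUtility.HtmlDecode(text).Trim(); // turn entities such as &amp; into characters

        int colon = text.IndexOf(':');
        if (colon < 0) return null;

        string label = text.Substring(0, colon).Trim();
        string value = text.Substring(colon + 1).Trim();
        return Tuple.Create(label, value);
    }
}
```

Calling MailHeaderLine.Parse on each MsoNormal paragraph and checking Item1 against "From", "To", "Cc", and "Subject" is one way to pick out the fields; removing them from the file remains a separate step.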

C# Regular Expressions - Get Second Number, not First

I have the following HTML code:
<td class="actual">106.2% </td>
I get the number in two phases:
Regex.Matches(html, "<td class=\"actual\">\\s*(.*?)\\s*</td>", RegexOptions.Singleline);
Regex.Match(m.Groups[1].Value, @"-?\d+.\d+").Value
The lines above give me what I want: 106.2.
The problem is that sometimes the HTML can be a little different, like this:
<td class="actual"><span class="revised worse" title="Revised From 107.2%">106.4%</span></td>
In this last case, I can only get the 107.2, and I would like to get the 106.4
Is there some regular expression trick to say, I want the second number in the sentence and not the first?

Whenever you have HTML code that comes from different providers, or your current provider has several CMSs that use different HTML formatting styles, it is not safe to rely on regex.
I suggest an HtmlAgilityPack based solution:
public string getCleanHtml(string html)
{
var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);
return HtmlAgilityPack.HtmlEntity.DeEntitize(doc.DocumentNode.InnerText);
}
And then:
var txt = "<td class=\"actual\">106.2% </td>";
var clean = getCleanHtml(txt);
txt = "<td class=\"actual\"><span class=\"revised worse\" title=\"Revised From 107.2%\">106.4%</span></td>";
clean = getCleanHtml(txt);
Result: 106.2% and 106.4%
You do not have to worry about formatting tags inside and any XML/HTML entity references.
If your text is a substring of the clean HTML string, then you can use Regex or any other string manipulation methods.
UPDATE:
You seem to need the node values from <td> tags. Here is a handy method for you:
private List<string> GetTextFromHtmlTag(string html, string tag)
{
var result = new List<string>();
HtmlAgilityPack.HtmlDocument hap;
Uri uriResult;
if (Uri.TryCreate(html, UriKind.Absolute, out uriResult) && uriResult.Scheme == Uri.UriSchemeHttp)
{ // html is a URL
var doc = new HtmlAgilityPack.HtmlWeb();
hap = doc.Load(uriResult.AbsoluteUri);
}
else
{ // html is a string
hap = new HtmlAgilityPack.HtmlDocument();
hap.LoadHtml(html);
}
var nodes = hap.DocumentNode.ChildNodes.Where(p => p.Name.ToLower() == tag.ToLower() && p.GetAttributeValue("class", string.Empty) == "previous"); // SelectNodes("//"+tag);
if (nodes != null)
foreach (var node in nodes)
result.Add(HtmlAgilityPack.HtmlEntity.DeEntitize(node.InnerText));
return result;
}
You can call it like this:
var html = "<td class=\"previous\"><span class=\"revised worse\" title=\"Revised From 1.3\">0.9</span></td>\n<td class=\"previous\"><span class=\"revised worse\" title=\"Revised From 107.2%\">106.4%</span></td>";
var res = GetTextFromHtmlTag(html, "td");
If you need to get only specific tags,
If you have texts with a number inside, and you need just the number, you can use a regex for that:
var rx = new Regex(@"[+-]?\d*\.?\d+"); // Matches "-1.23", "+5", ".677"
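The original question asked for the second number rather than the first. With the pattern above, one hedged way to get it is to collect all matches and keep the last one, since the revised value comes after the "Revised From" title attribute in the markup. NumberText and Last are hypothetical names; the sketch assumes the value you want is always the final number in the text.

```csharp
using System.Text.RegularExpressions;

public static class NumberText
{
    // Same pattern as above: optional sign, optional integer part, optional dot, digits.
    private static readonly Regex Number = new Regex(@"[+-]?\d*\.?\d+");

    // Returns the last number found in the text, or null when there is none.
    public static string Last(string text)
    {
        MatchCollection matches = Number.Matches(text);
        return matches.Count == 0 ? null : matches[matches.Count - 1].Value;
    }
}
```

Applied to the cleaned text this is equivalent to taking the only match; applied to the raw span markup it skips the number in the title attribute.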

Try XML method
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Xml;
using System.Xml.Linq;
namespace ConsoleApplication34
{
class Program
{
static void Main(string[] args)
{
string input = "<td class=\"actual\"><span class=\"revised worse\" title=\"Revised From 107.2%\">106.4%</span></td>";
XElement element = XElement.Parse(input);
string value = element.Descendants("span").Select(x => (string)x).FirstOrDefault();
}
}
}
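The snippet above only covers the case where a span is present. As a small variation, XElement.Value concatenates all descendant text and ignores attributes, so it should handle both the plain cell and the span cell. CellText and Read are hypothetical names, and the sketch assumes each cell is well-formed XML; real-world HTML with unclosed tags or entities such as &nbsp; will make XElement.Parse throw.

```csharp
using System.Xml.Linq;

public static class CellText
{
    // Parses one <td> fragment as XML and returns its text content,
    // with or without a nested <span>. Attribute values are not included.
    public static string Read(string cellMarkup)
    {
        XElement cell = XElement.Parse(cellMarkup);
        return cell.Value.Trim();
    }
}
```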

I want to share the solution I have found for my problem.
So, I can have HTML tags like the following:
<td class="previous"><span class="revised worse" title="Revised From 1.3">0.9</span></td>
<td class="previous"><span class="revised worse" title="Revised From 107.2%">106.4%</span></td>
Or simpler:
<td class="previous">51.4</td>
First, I take the entire line, using the following code:
MatchCollection mPrevious = Regex.Matches(html, "<td class=\"previous\">\\s*(.*?)\\s*</td>", RegexOptions.Singleline);
And second, I use the following code to extract the numbers only:
foreach (Match m in mPrevious)
{
if (m.Groups[1].Value.Contains("span"))
{
string stringtemp = Regex.Match(m.Groups[1].Value, "-?\\d+.\\d+.\">-?\\d+.\\d+|-?\\d+.\\d+\">-?\\d+.\\d+|-?\\d+.\">-?\\d+|-?\\d+\">-?\\d+").Value;
int indextemp = stringtemp.IndexOf(">");
if (indextemp <= 0) break;
lPrevious.Add(stringtemp.Remove(0, indextemp + 1));
}
else lPrevious.Add(Regex.Match(m.Groups[1].Value, @"-?\d+.\d+|-?\d+").Value);
}
First I check whether there is a SPAN tag. If there is, I take the two numbers together, covering the different possibilities with the regular expression. Then I find the character after which the relevant value starts and remove everything before it.
It's working perfectly.
Thank you all for the support and quick answers.

string html = @"<td class=""actual""><span class=""revised worse"" title=""Revised From 107.2%"">106.4%</span></td>
<td class=""actual"">106.2% </td>";
string patten = @"<td\s+class=""actual"">.*(?<=>)(.+?)(?=</).*?</td>";
foreach (Match match in Regex.Matches(html, patten))
{
Console.WriteLine(match.Groups[1].Value);
}
I have changed the regex as you wished. The output is:
106.4%
106.2%

how can I remove with specific tags from html [duplicate]

I have html code below:
<div><span class="help">This is text.</span>Hello, this is text.</div>
<div>I have a question.<span class="help">Hi</span></div>
Now, I want to remove each <span class="help"> element together with its text, using C#. So, I want to leave only
<div>Hello, this is text.</div>
<div>I have a question.</div>
Does anyone have any idea?

You should use Html Agility Pack to work with html.
string text = @"<div><span class=""help"">This is text.</span>Hello, this is text. </div>
<div>I have a question.<span class=""help"">Hi</span></div>";
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(text);
var nodes = doc.DocumentNode.SelectNodes("//span[@class='help']");
foreach( HtmlNode node in nodes)
{
node.Remove();
}
String result = doc.DocumentNode.InnerHtml;

You could use Html Agility Pack to parse the HTML:
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html); // this is your string
var divs = doc.DocumentNode.Elements("div")
.Select(div => string.Format("<div>{0}</div>", div.LastChild.InnerText));

You can use a regex:
string val = @"<div><span class=""help"">This is text.</span>Hello, this is text.</div><div>I have a question.<span class=""help"">Hi</span></div>";
Regex reg = new Regex("<span .+?</span>", RegexOptions.IgnoreCase | RegexOptions.Singleline);
string ret = reg.Replace(val, "");
Debug.WriteLine(ret);
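The regex above removes every span, not only those with class "help". If adding a dependency is not an option and the fragment happens to be well-formed XML, a sketch with System.Xml.Linq can filter on the class attribute instead. HelpSpans and Strip are hypothetical names, and the temporary root wrapper element is an assumption needed because the fragment has two top-level div elements.

```csharp
using System.Linq;
using System.Xml.Linq;

public static class HelpSpans
{
    // Removes every <span class="help"> element, including its text,
    // and returns the remaining markup without the temporary wrapper.
    public static string Strip(string fragment)
    {
        // Wrap the fragment so that several sibling elements parse as one XML tree.
        XElement root = XElement.Parse("<root>" + fragment + "</root>");

        root.Descendants("span")
            .Where(span => (string)span.Attribute("class") == "help")
            .Remove();

        return string.Concat(
            root.Nodes().Select(node => node.ToString(SaveOptions.DisableFormatting)));
    }
}
```

This only works for markup that an XML parser accepts; for arbitrary HTML the Html Agility Pack answers remain the safer route.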

Give the elements runat="server" so they can be accessed from the code-behind; then, when appropriate, get the element by its id and set either
element.InnerHtml = ""; or element.InnerText = "";

Extract the contents of a string between two string delimiters using match in C#

So, say I'm parsing the following HTML string:
<html>
<head>
RANDOM JAVASCRIPT AND CSS AHHHHHH!!!!!!!!
</head>
<body>
<table class="table">
<tr>Name</tr>
<tr>Name</tr>
<tr>Name</tr>
<tr>Name</tr>
<tr>Name</tr>
<tr>Name</tr>
<tr>Name</tr>
<tr>Name</tr>
<tr>Name</tr>
<tr>Name</tr>
</table>
<body>
</html>
and I want to isolate the contents of the table (everything inside <table class="table">).
Now, I used regex to accomplish this:
string pagesource = (method that extracts the html source and stores it into a string);
string[] splitSource = Regex.Split(pagesource, "<table class=\"member\">");
string[] memberList = Regex.Split(splitSource[1], "</table>");
//the list of table members will be in memberList[0];
//method to extract links from the table
ExtractLinks(memberList[0]);
I've been looking at other ways to do this extraction, and I came across the Match object in C#.
I'm attempting to do something like this:
Match match = Regex.Match(pageSource, "<table class=\"members\">(.|\n)*?</table>");
The purpose of the above was to hopefully extract a match value between the two delimiters, but, when I try to run it the match value is:
match.value = </table>
My question, as such, is: is there a way to extract data from my string that is slightly easier, more readable, or shorter than my method using regex? For this simple example, regex is fine, but for more complex examples, I find myself with the coding equivalent of scribbles all over my screen.
I would really like to use match, because it seems like a very neat and tidy class, but I can't seem to get it working for my needs. Can anyone help me with this?
Thank you very much!

Use an HTML parser, like HTML Agility Pack.
var doc = new HtmlDocument();
using (var wc = new WebClient())
using (var stream = wc.OpenRead(url))
{
doc.Load(stream);
}
var table = doc.DocumentNode.Element("html").Element("body").Element("table");
string tableHtml = table.OuterHtml;

You can use XPath with the HtmlAgilityPack:
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(s);
var elements = doc.DocumentNode.SelectNodes("//table[@class='table']");
foreach (var ele in elements)
{
MessageBox.Show(ele.OuterHtml);
}

You have to add parentheses around the whole repeated part of the regular expression in order to capture the match:
Match match = Regex.Match(pageSource, "<table class=\"members\">((.|\n)*?)</table>");
Anyways it seems that only Chuck Norris can parse HTML with regex correctly.
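For completeness, here is a sketch of the Match-based approach the question was aiming for. It uses RegexOptions.Singleline so that the dot also matches newlines, and reads the captured text from Groups[1] rather than from match.Value. TableContents and Inner are hypothetical names, and the sketch assumes the opening tag is written exactly as <table class="..."> with no other attributes and no nested tables.

```csharp
using System.Text.RegularExpressions;

public static class TableContents
{
    // Returns everything between <table class="cssClass"> and the next </table>,
    // or null when no such table is found.
    public static string Inner(string pageSource, string cssClass)
    {
        string pattern = "<table class=\"" + Regex.Escape(cssClass) + "\">(.*?)</table>";

        // Singleline makes '.' match '\n', so the table may span several lines.
        Match match = Regex.Match(pageSource, pattern, RegexOptions.Singleline);
        return match.Success ? match.Groups[1].Value : null;
    }
}
```

match.Groups[0] (the same as match.Value) is the whole match including both tags; the parenthesized part is Groups[1].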

How do I use HTML Agility Pack to edit an HTML snippet

So I have an HTML snippet that I want to modify using C#.
<div>
This is a specialSearchWord that I want to link to
<img src="anImage.jpg" />
A hyperlink
Some more text and that specialSearchWord again.
</div>
and I want to transform it to this:
<div>
This is a <a class="special" href="http://mysite.com/search/specialSearchWord">specialSearchWord</a> that I want to link to
<img src="anImage.jpg" />
A hyperlink
Some more text and that <a class="special" href="http://mysite.com/search/specialSearchWord">specialSearchWord</a> again.
</div>
I'm going to use HTML Agility Pack based on the many recommendations here, but I don't know where I'm going. In particular,
How do I load a partial snippet as a string, instead of a full HTML document?
How do I edit it?
How do I then return the text string of the edited object?

The same as a full HTML document. It doesn't matter.
There are two options: you can edit the InnerHtml property directly (or Text on text nodes), or modify the DOM tree by using e.g. AppendChild, PrependChild, etc.
You may use HtmlDocument.DocumentNode.OuterHtml property or use HtmlDocument.Save method (personally I prefer the second option).
As to parsing, I select the text nodes which contain the search term inside your div, and then just use string.Replace method to replace it:
var doc = new HtmlDocument();
doc.LoadHtml(html);
var textNodes = doc.DocumentNode.SelectNodes("/div/text()[contains(.,'specialSearchWord')]");
if (textNodes != null)
foreach (HtmlTextNode node in textNodes)
node.Text = node.Text.Replace("specialSearchWord", "<a class='special' href='http://mysite.com/search/specialSearchWord'>specialSearchWord</a>");
And saving the result to a string:
string result = null;
using (StringWriter writer = new StringWriter())
{
doc.Save(writer);
result = writer.ToString();
}

Answers:
There may be a way to do this but I don't know how. I suggest loading the entire document.
Use a combination of XPath and regular expressions.
See the code below for a contrived example. You may have other constraints not mentioned, but this code sample should get you started.
Note that your XPath expression may need to be more complex to find the div that you want.
HtmlDocument doc = new HtmlDocument();
doc.Load(yourHtmlFile);
HtmlNode divNode = doc.DocumentNode.SelectSingleNode("//div[2]");
string newDiv = Regex.Replace(divNode.InnerHtml, @"specialSearchWord",
"<a class='special' href='http://etc'>specialSearchWord</a>");
divNode.InnerHtml = newDiv;
Console.WriteLine(doc.DocumentNode.OuterHtml);
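Running Regex.Replace over InnerHtml also rewrites the word if it occurs inside an attribute value. One way to limit the replacement to text nodes, sketched here with System.Xml.Linq rather than Html Agility Pack, is to split each text node around the word and insert anchor elements between the pieces. WordLinker and Link are hypothetical names, and the sketch assumes the snippet is well-formed XML with a single root element, as in the question.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class WordLinker
{
    // Wraps every occurrence of `word` found in text nodes in
    // <a class="special" href="baseUrl + word">word</a>. Attribute values are left alone.
    public static string Link(string fragment, string word, string baseUrl)
    {
        XElement root = XElement.Parse(fragment, LoadOptions.PreserveWhitespace);

        // Snapshot the text nodes first, because the loop rewrites the tree.
        foreach (XText text in root.DescendantNodes().OfType<XText>().ToList())
        {
            if (!text.Value.Contains(word)) continue;

            string[] parts = text.Value.Split(new[] { word }, StringSplitOptions.None);
            var replacement = new List<object>();
            for (int i = 0; i < parts.Length; i++)
            {
                if (i > 0)
                {
                    // One anchor for each occurrence that separated two parts.
                    replacement.Add(new XElement("a",
                        new XAttribute("class", "special"),
                        new XAttribute("href", baseUrl + word),
                        word));
                }
                if (parts[i].Length > 0) replacement.Add(new XText(parts[i]));
            }
            text.ReplaceWith(replacement.ToArray());
        }

        return root.ToString(SaveOptions.DisableFormatting);
    }
}
```

A call such as WordLinker.Link(html, "specialSearchWord", "http://mysite.com/search/") returns the edited snippet as a string. Text that already sits inside an anchor would be wrapped again, so a real implementation may want to skip those nodes.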
