C# Regular Expressions - Get Second Number, not First - c#

I have the following HTML code:
<td class="actual">106.2% </td>
I extract the number in two steps:
Regex.Matches(html, "<td class=\"actual\">\\s*(.*?)\\s*</td>", RegexOptions.Singleline);
Regex.Match(m.Groups[1].Value, @"-?\d+.\d+").Value
These two lines give me what I want: 106.2.
The problem is that sometimes the HTML can be a little different, like this:
<td class="actual"><span class="revised worse" title="Revised From 107.2%">106.4%</span></td>
In this last case I only get 107.2, but I would like to get 106.4.
Is there some regular expression trick to say that I want the second number in the string, not the first?

Whenever you have HTML that comes from different providers, or your current provider has several CMSs that use different HTML formatting styles, it is not safe to rely on regex.
I suggest an HtmlAgilityPack based solution:
public string getCleanHtml(string html)
{
var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);
return HtmlAgilityPack.HtmlEntity.DeEntitize(doc.DocumentNode.InnerText);
}
And then:
var txt = "<td class=\"actual\">106.2% </td>";
var clean = getCleanHtml(txt);
txt = "<td class=\"actual\"><span class=\"revised worse\" title=\"Revised From 107.2%\">106.4%</span></td>";
clean = getCleanHtml(txt);
Result: 106.2% and 106.4%
You do not have to worry about formatting tags inside and any XML/HTML entity references.
If your text is a substring of the clean HTML string, then you can use Regex or any other string manipulation methods.
UPDATE:
You seem to need the node values from <td> tags. Here is a handy method for you:
private List<string> GetTextFromHtmlTag(string html, string tag)
{
var result = new List<string>();
HtmlAgilityPack.HtmlDocument hap;
Uri uriResult;
if (Uri.TryCreate(html, UriKind.Absolute, out uriResult) && uriResult.Scheme == Uri.UriSchemeHttp)
{ // html is a URL
var doc = new HtmlAgilityPack.HtmlWeb();
hap = doc.Load(uriResult.AbsoluteUri);
}
else
{ // html is a string
hap = new HtmlAgilityPack.HtmlDocument();
hap.LoadHtml(html);
}
var nodes = hap.DocumentNode.ChildNodes.Where(p => p.Name.ToLower() == tag.ToLower() && p.GetAttributeValue("class", string.Empty) == "previous"); // SelectNodes("//"+tag);
if (nodes != null)
foreach (var node in nodes)
result.Add(HtmlAgilityPack.HtmlEntity.DeEntitize(node.InnerText));
return result;
}
You can call it like this:
var html = "<td class=\"previous\"><span class=\"revised worse\" title=\"Revised From 1.3\">0.9</span></td>\n<td class=\"previous\"><span class=\"revised worse\" title=\"Revised From 107.2%\">106.4%</span></td>";
var res = GetTextFromHtmlTag(html, "td");
If you need to get only specific tags,
If you have texts with a number inside, and you need just the number, you can use a regex for that:
var rx = new Regex(@"[+-]?\d*\.?\d+"); // Matches "-1.23", "+5", ".677"
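As a minimal sketch of that last step without any third-party library: strip the tags first, so that numbers inside attributes such as title="Revised From 107.2%" disappear, and then apply the number regex to what is left. The class and method names are made up for illustration, and the naive tag pattern assumes that no attribute value contains a > character.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

public static class CellNumber
{
    // Hypothetical helper: returns the first number found in the visible
    // text of an HTML fragment, or null when there is none.
    public static double? FromFragment(string fragment)
    {
        // Replace every <...> run with a space; attribute values go with it.
        string text = Regex.Replace(fragment, "<[^>]+>", " ");

        // Same number pattern as above: optional sign, optional integer part.
        Match m = Regex.Match(text, @"[+-]?\d*\.?\d+");
        if (!m.Success) return null;

        // Invariant culture so that "." is always the decimal separator.
        return double.Parse(m.Value, CultureInfo.InvariantCulture);
    }
}
```

Called on the span example from the question, this returns 106.4 rather than 107.2, because the title attribute is removed together with its tag.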

Try XML method
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Xml;
using System.Xml.Linq;
namespace ConsoleApplication34
{
class Program
{
static void Main(string[] args)
{
string input = "<td class=\"actual\"><span class=\"revised worse\" title=\"Revised From 107.2%\">106.4%</span></td>";
XElement element = XElement.Parse(input);
string value = element.Descendants("span").Select(x => (string)x).FirstOrDefault();
}
}
}

I want to share the solution I have found for my problem.
So, I can have HTML tags like the following:
<td class="previous"><span class="revised worse" title="Revised From 1.3">0.9</span></td>
<td class="previous"><span class="revised worse" title="Revised From 107.2%">106.4%</span></td>
Or simpler:
<td class="previous">51.4</td>
First, I take the entire cell with the following code:
MatchCollection mPrevious = Regex.Matches(html, "<td class=\"previous\">\\s*(.*?)\\s*</td>", RegexOptions.Singleline);
And second, I use the following code to extract the numbers only:
foreach (Match m in mPrevious)
{
if (m.Groups[1].Value.Contains("span"))
{
string stringtemp = Regex.Match(m.Groups[1].Value, "-?\\d+.\\d+.\">-?\\d+.\\d+|-?\\d+.\\d+\">-?\\d+.\\d+|-?\\d+.\">-?\\d+|-?\\d+\">-?\\d+").Value;
int indextemp = stringtemp.IndexOf(">");
if (indextemp <= 0) break;
lPrevious.Add(stringtemp.Remove(0, indextemp + 1));
}
else lPrevious.Add(Regex.Match(m.Groups[1].Value, @"-?\d+.\d+|-?\d+").Value);
}
First I check whether there is a SPAN tag. If there is, I match the two numbers together (the regular expression covers the different possible formats), then find the > character that separates them and remove everything up to and including it, which leaves only the number I want.
It works perfectly.
Thank you all for the support and quick answers.

string html = @"<td class=""actual""><span class=""revised worse"" title=""Revised From 107.2%"">106.4%</span></td>
<td class=""actual"">106.2% </td>";
string patten = @"<td\s+class=""actual"">.*(?<=>)(.+?)(?=</).*?</td>";
foreach (Match match in Regex.Matches(html, patten))
{
Console.WriteLine(match.Groups[1].Value);
}
I have changed the regex as you wished. The output is:
106.4%
106.2%

Related

Extracting data from HTML file using c# script

What I need to do: extract the From, To, Cc and Subject information from an HTML file and then remove it from the file, without using any third-party library (HtmlAgilityPack, etc.).
What I am having trouble with: what approach should I take to get these fields (From, To, Cc, Subject) out of the HTML tags?
Steps I tried: I took the index of <p class=MsoNormal> and the last index of the email domain @sampleemail.com, but I think that is a bad approach, since some HTML files contain many "<p class=MsoNormal>" tags. To remove the From, To, Cc and Subject fields I just used string.Remove(indexOf, count), where count is the number of characters from indexOf to lastIndexOf, and that worked.
Sample tag containing information of from:
<p class=MsoNormal style='margin-left:120.0pt;text-indent:-120.0pt;tab-stops:120.0pt;mso-layout-grid align:none;text-autospace:none'><b><span style='color:black'>From:<span style='mso-tab-count:1'></span></span></b><span style='color:black'>1234@sampleemail.com<o:p></o:p></span></p>
HtmlAgilityPack is your friend. Simply use an XPath such as //p[@class='MsoNormal'] to get the content of those tags:
public static void Main()
{
var html =
@"<p class=MsoNormal style='margin-left:120.0pt;text-indent:-120.0pt;tab-stops:120.0pt;mso-layout-grid align:none;text-autospace:none'><b><span style='color:black'>From:<span style='mso-tab-count:1'></span></span></b><span style='color:black'>1234@sampleemail.com<o:p></o:p></span></p> ";
var htmlDoc = new HtmlDocument();
htmlDoc.LoadHtml(html);
var nodes = htmlDoc.DocumentNode.SelectNodes("//p[@class='MsoNormal']");
foreach(var node in nodes)
Console.WriteLine(node.InnerText);
}
Result:
From:1234@sampleemail.com
Update
We can also use Regex to write this simple parser, but remember that it cannot handle every case in a complicated HTML document.
public static void MainFunc()
{
string str = @"<p class='MsoNormal' style='margin-left:120.0pt;text-indent:-120.0pt;tab-stops:120.0pt;mso-layout-grid align:none;text-autospace:none'><b><span style='color:black'>From:<span style='mso-tab-count:1'></span></span></b><span style='color:black'>1234@sampleemail.com<o:p></o:p></span></p> ";
var result = Regex.Replace(str, "<(?:\"[^\"]*\"['\"]*|'[^']*'['\"]*|[^'\">])+>", "");
Console.WriteLine(result);
}
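Building on the tag-stripping idea, here is a rough sketch of how one header paragraph could be split into its label and value using only the base class library. The class name, the method name and the fixed list of labels are assumptions made for illustration, and the simple <[^>]+> pattern is a deliberately cruder stand-in for the pattern above, so it shares the same limits on complicated markup.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class MailHeaderSketch
{
    // Hypothetical helper: returns { label, value } for a paragraph such as
    // "<p ...>From:...</p>", or null when the paragraph is not a header line.
    public static string[] Parse(string paragraphHtml)
    {
        // Remove all tags, then decode entities and trim surrounding spaces.
        string noTags = Regex.Replace(paragraphHtml, "<[^>]+>", "");
        string text = WebUtility.HtmlDecode(noTags).Trim();

        // The label must come first, followed by a colon and the value.
        Match m = Regex.Match(text, @"^(From|To|Cc|Subject):\s*(.*)$", RegexOptions.IgnoreCase);
        return m.Success ? new[] { m.Groups[1].Value, m.Groups[2].Value } : null;
    }
}
```

Once a paragraph is recognised as a header this way, it can be removed from the file with the string.Remove approach described in the question.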

Regular expression to match everything, except HTML tags

<tr><td>Di, 12.04.16</td><td>1</td><td>D</td><td>D</td><td>255</td><td>ABC</td><tr>
I want to match only ABC, or whatever else stands between <td> and </td> (before and after ABC).
This pattern doesn't work for me:
((?!<tr><td>[D-M][i-r],[' ][0-3][0-9]\\.[0-1][0-9]\\.[0-9][0-9]</td><td>[1-9][0-2]?</td><td>[A-Z]?[A-Z]?[A-Z]?[A-Z]?[1-5]?</td><td>(---|[A-Z]?[A-Z]?[A-Z]?[A-Z]?[1-5]?)</td><td>).*(?!</td></tr>))
Do you have any idea?
Thx for help
As Amy said, don't use regex to parse HTML. You can install Html Agility Pack from NuGet and use System.Linq Namespace to parse it.
For example here:
string html = "<html><head></head><body><p class='testclass'>This is a paragraph.</p><table><tr><td>Di, 12.04.16</td><td>1</td><td>D</td><td>D</td><td>255</td><td>ABC</td><tr></table></body></html>";
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);
var programmes = doc.DocumentNode.Descendants().Where(d => d.GetAttributeValue("class", "") == "testclass");
var trs = doc.DocumentNode.Descendants("tr"); // Give you all the trs
foreach (var tr in trs)
{
var tds = tr.Descendants("td").ToArray(); // Get all the tds
//Sample, show the result in a TextBlock
foreach (var td in tds)
{
txt.Text = txt.Text + " " + td.InnerText;
}
}
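If adding a dependency is not an option and the row really is as regular as the sample (plain <td> tags without attributes and no nested tables), a single capture between <td> and </td> may be enough. This is only a sketch under those assumptions, not a general HTML parser, and the names are invented for illustration.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class RowCells
{
    // Hypothetical helper: returns the text of every <td>...</td> pair
    // in the order the cells appear.
    public static List<string> Extract(string rowHtml)
    {
        var cells = new List<string>();

        // Lazy quantifier so each match stops at the nearest </td>.
        foreach (Match m in Regex.Matches(rowHtml, "<td>(.*?)</td>", RegexOptions.Singleline))
            cells.Add(m.Groups[1].Value);

        return cells;
    }
}
```

For the row in the question, the last element of the returned list is ABC.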

giving error while counting alt tag using regex ..Only assignment, call

I get the errors "Only assignment, call, increment, decrement, and new object expressions can be used as a statement" and "; expected" while counting alt attributes with a regex.
I want to count the img tags that have an alt attribute, including an empty one, using C#:
MatchCollection ImgAltTag = Regex.Matches(strIn, "<img[^>]*alt=['"].+['"]", RegexOptions.IgnoreCase | RegexOptions.Multiline);
sample img tags
<img src="alt.png" class="absmiddle" alt="" />
<img src="alt.png" class="absmiddle" />
it should give count 2
Don't use Regex for this. It is much easier to use LINQ to XML:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Xml;
using System.Xml.Linq;
namespace ConsoleApplication1
{
class Program
{
static void Main(string[] args)
{
string xml =
"<Root>" +
"<img src=\"alt.png\" class=\"absmiddle\" alt=\"\" />" +
"<img src=\"alt.png\" class=\"absmiddle\" />" +
"</Root>";
XElement root = XElement.Parse(xml);
int count = root.Descendants("img").Where(x => x.Attribute("alt") == null || x.Attribute("alt").Value.Length == 0).Count();
}
}
}
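The query above counts images whose alt attribute is missing or empty as one group. If the three cases need to be told apart, a sketch along the following lines could work. It assumes the markup is well-formed XML, as in the example above, and the class and method names are invented for illustration.

```csharp
using System.Linq;
using System.Xml.Linq;

public static class AltCounter
{
    // Hypothetical helper: returns { nonEmptyAlt, emptyAlt, missingAlt }.
    public static int[] Count(string wellFormedXml)
    {
        XElement root = XElement.Parse(wellFormedXml);
        var images = root.DescendantsAndSelf("img").ToList();

        // No alt attribute at all.
        int missing = images.Count(i => i.Attribute("alt") == null);

        // alt attribute present but with an empty value.
        int empty = images.Count(i => i.Attribute("alt") != null
                                      && i.Attribute("alt").Value.Length == 0);

        // Whatever remains has an alt attribute with some text in it.
        int nonEmpty = images.Count - missing - empty;
        return new[] { nonEmpty, empty, missing };
    }
}
```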
If you need to work with HTML, use an HTML parser.
Here is an HtmlAgilityPack based answer.
Suppose you have:
<img src="alt.png" class="absmiddle" alt="" />
<img src="alt.png" class="absmiddle" />
<img src="ff" />
Only one img tag needs to be counted here, since it is the only one with an alt attribute. The XPath //img[@alt] selects every img that has an alt attribute, regardless of whether the attribute has a value. No need to worry about the quotes, either.
public int HtmlAgilityPackGetImgTagsWithAlt(string html)
{
HtmlAgilityPack.HtmlDocument hap;
Uri uriResult;
if (Uri.TryCreate(html, UriKind.Absolute, out uriResult) && uriResult.Scheme == Uri.UriSchemeHttp)
{ // html is a URL
var doc = new HtmlAgilityPack.HtmlWeb();
hap = doc.Load(uriResult.AbsoluteUri);
}
else
{ // html is a string
hap = new HtmlAgilityPack.HtmlDocument();
hap.LoadHtml(html);
}
var nodes = hap.DocumentNode.SelectNodes("//img[@alt]");
return nodes != null ? nodes.Count : -1;
}
And the result is 1.

Grab all text from html with Html Agility Pack

Input
<html><body><p>foo <a href='http://www.example.com'>bar</a> baz</p></body></html>
Output
foo
bar
baz
I know of htmldoc.DocumentNode.InnerText, but it will give foobarbaz - I want to get each text, not all at a time.
XPATH is your friend :)
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(@"<html><body><p>foo <a href='http://www.example.com'>bar</a> baz</p></body></html>");
foreach(HtmlNode node in doc.DocumentNode.SelectNodes("//text()"))
{
Console.WriteLine("text=" + node.InnerText);
}
var root = doc.DocumentNode;
var sb = new StringBuilder();
foreach (var node in root.DescendantNodesAndSelf())
{
if (!node.HasChildNodes)
{
string text = node.InnerText;
if (!string.IsNullOrEmpty(text))
sb.AppendLine(text.Trim());
}
}
This does what you need, but I am not sure if this is the best way. Maybe you should iterate through something other than DescendantNodesAndSelf for optimal performance.
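For comparison, when the input happens to be well-formed XML (the sample in the question is), the same "one entry per text node" result can be sketched with LINQ to XML instead of Html Agility Pack. This swaps in a different parser and will throw on ordinary tag-soup HTML, so treat it as an illustration of the idea rather than a replacement. The class and method names are invented.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class TextNodes
{
    // Hypothetical helper: returns each non-blank text node, trimmed,
    // in document order.
    public static List<string> Extract(string wellFormedMarkup)
    {
        return XElement.Parse(wellFormedMarkup)
            .DescendantNodes()          // every node below the root
            .OfType<XText>()            // keep only the text nodes
            .Select(t => t.Value.Trim())
            .Where(s => s.Length > 0)   // skip nodes that were only whitespace
            .ToList();
    }
}
```

For the input in the question this yields the three separate entries foo, bar and baz.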
I needed a solution that extracts all text but discards the content of script and style tags. I could not find one anywhere, so I came up with the following, which suits my needs:
StringBuilder sb = new StringBuilder();
IEnumerable<HtmlNode> nodes = doc.DocumentNode.Descendants().Where( n =>
n.NodeType == HtmlNodeType.Text &&
n.ParentNode.Name != "script" &&
n.ParentNode.Name != "style");
foreach (HtmlNode node in nodes) {
Console.WriteLine(node.InnerText);
}
var pageContent = "{html content goes here}";
var pageDoc = new HtmlDocument();
pageDoc.LoadHtml(pageContent);
var pageText = pageDoc.DocumentNode.InnerText;
The specified example for html content:
<html><body><p>foo <a href='http://www.example.com'>bar</a> baz</p></body></html>
will produce the following output:
foo bar baz
public string html2text(string html) {
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(@"<html><body>" + html + "</body></html>");
return doc.DocumentNode.SelectSingleNode("//body").InnerText;
}
This workaround is based on Html Agility Pack. You can also install it via NuGet (package name: HtmlAgilityPack).
https://github.com/jamietre/CsQuery
Have you tried CsQuery? Although it is not actively maintained, it is still my favorite for converting HTML to text. Here is a one-liner that shows how simple it is to get the text from HTML:
var text = CQ.CreateDocument(htmlText).Text();
Here's a complete console application:
using System;
using CsQuery;
public class Program
{
public static void Main()
{
var html = "<div><h1>Hello World <p> some text inside h1 tag under p tag </p> </h1></div>";
var text = CQ.CreateDocument(html).Text();
Console.WriteLine(text); // Output: Hello World some text inside h1 tag under p tag
}
}
I understand that the OP asked for HtmlAgilityPack only, but CsQuery is a less popular library and one of the best solutions I've found, so I wanted to share it in case someone finds it helpful. Cheers!
I just changed and fixed some people's answers to work better:
var document = new HtmlDocument();
document.LoadHtml(result);
var sb = new StringBuilder();
foreach (var node in document.DocumentNode.DescendantsAndSelf())
{
if (!node.HasChildNodes && node.Name == "#text" && node.ParentNode.Name != "script" && node.ParentNode.Name != "style")
{
string text = node.InnerText?.Trim();
if (!string.IsNullOrEmpty(text) && !text.StartsWith('<') && !text.EndsWith('>'))
sb.AppendLine(System.Web.HttpUtility.HtmlDecode(text.Trim()));
}
}

encode html in Asp.net C# but leave tags intact

I need to encode a whole text while leaving the < and > intact.
example
<p>Give me 100.000 €!</p>
must become:
<p>Give me 100.000 &euro;!</p>
the html tags must remain intact
Use a regular expression that matches either a tag or what's between tags, and encode what's between:
html = Regex.Replace(
html,
"(<[^>]+>|[^<]+)",
m => m.Value.StartsWith("<") ? m.Value : HttpUtility.HtmlEncode(m.Value)
);
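Here is a self-contained version of the same idea that can be run as is. It uses WebUtility.HtmlEncode from System.Net instead of HttpUtility so that it needs no reference to System.Web; that swap, and the class and method names, are choices made for this sketch. Which characters end up as entities depends on the framework's encoder, so it is worth checking the output for symbols such as € before relying on it.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class TagPreservingEncoder
{
    // Hypothetical helper: encodes the text between tags and leaves
    // anything that looks like <...> untouched.
    public static string Encode(string html)
    {
        return Regex.Replace(
            html,
            "<[^>]+>|[^<]+", // either a whole tag or a run of non-tag text
            m => m.Value.StartsWith("<", StringComparison.Ordinal)
                ? m.Value                          // tag: keep as is
                : WebUtility.HtmlEncode(m.Value)); // text: encode
    }
}
```

For example, Encode("<p>Fish & chips</p>") returns "<p>Fish &amp; chips</p>", with the p tags left intact.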
You might go for Html Agility Pack and then encode the values of the tags.
Maybe use string.replace for just those characters you want to encode?
You could use HtmlTextWriter in addition to HtmlEncode. You would use HtmlTextWriter to set up your <p></p> and then set the body of the <p></p> using HtmlEncode. HtmlTextWriter allows ToString() and a bunch of other methods, so it shouldn't be much more code.
As others have suggested, this can be achieved with HtmlAgilityPack.
public static class HtmlTextEncoder
{
public static string HtmlEncode(string html)
{
if (html == null) return null;
var doc = new HtmlDocument();
doc.LoadHtml(html);
EncodeNode(doc.DocumentNode);
doc.OptionWriteEmptyNodes = true;
using (var s = new MemoryStream())
{
doc.Save(s);
var encoded = doc.Encoding.GetString(s.ToArray());
return encoded;
}
}
private static void EncodeNode(HtmlNode node)
{
if (node.HasChildNodes)
{
foreach (var childNode in node.ChildNodes)
{
if (childNode.NodeType == HtmlNodeType.Text)
{
childNode.InnerHtml = HttpUtility.HtmlEncode(childNode.InnerHtml);
}
else
{
EncodeNode(childNode);
}
}
}
else if (node.NodeType == HtmlNodeType.Text)
{
node.InnerHtml = HttpUtility.HtmlEncode(node.InnerHtml);
}
}
}
This iterates through all the nodes in the HTML, and replaces any text nodes with HTML encoded text.
I've created a .NET fiddle to demonstrate this technique.
