Parsing big string (HTML code)

Parsing big string (HTML code) - c#

I'm looking to parse some information on my application.
Let's say we have somewhere in that string:
<tr class="tablelist_bg1">
<td>Beja</td>
<td class="text_center">---</td>
<td class="text_center">19.1</td>
<td class="text_center">10.8</td>
<td class="text_center">NW</td>
<td class="text_center">50.9</td>
<td class="text_center">0</td>
<td class="text_center">1016.6</td>
<td class="text_center">---</td>
<td class="text_center">---</td>
</tr>
Everything else above or below this doesn't matter. Remember this is all inside a string.
I want to get the values inside the td tags: ---, 19.1, 10.8, etc.
Worth knowing that there are many entries like this on the page.
Probably also a good idea to link the page here.
As you probably guessed I have absolutely no idea how to do this... none of the functions I know I can perform over the string (split etc.) help.
Thanks in advance

Just use String.IndexOf(string, int) to find a "<td", again to find the next ">", and again to find "</td>". Then use String.Substring to pull out a value. Put this in a loop.
public static List<string> ParseTds(string input)
{
    List<string> results = new List<string>();
    int index = 0;
    while (true)
    {
        string next = ParseTd(input, ref index);
        if (next == null)
            return results;
        results.Add(next);
    }
}

private static string ParseTd(string input, ref int index)
{
    // Find the start of the next <td ...> tag.
    int tdIndex = input.IndexOf("<td", index);
    if (tdIndex == -1)
        return null;
    // Find the end of the opening tag (skips any attributes).
    int gtIndex = input.IndexOf(">", tdIndex);
    if (gtIndex == -1)
        return null;
    int endIndex = input.IndexOf("</td>", gtIndex);
    if (endIndex == -1)
        return null;
    // Continue the next search from the closing tag.
    index = endIndex;
    return input.Substring(gtIndex + 1, endIndex - gtIndex - 1);
}

Assuming your string is valid XHTML, you can use an XML parser to get the content you want. There's a simple example here that shows how to use XmlTextReader to parse XML content. The example reads from a file, but you can change it to read from a string:
new XmlTextReader(new StringReader(someString));
You specifically want to keep track of td element nodes, and the text node that follows them will contain the values you want.
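As a rough sketch of the XML-parser idea, the snippet below uses LINQ to XML (XElement) rather than XmlTextReader, which is a swapped-in technique, not what the linked example shows. The class and method names (TdReader, ReadTdValues) are made up for illustration, and it assumes the fragment is well-formed XML with a single root element and no HTML-only entities such as &nbsp;.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class TdReader
{
    // Returns the trimmed text of every <td> element in a well-formed XHTML fragment.
    // Hypothetical helper: assumes a single root element (for example one <tr> or <table>).
    public static List<string> ReadTdValues(string xhtml)
    {
        // XElement.Parse throws XmlException if the markup is not well-formed XML.
        XElement root = XElement.Parse(xhtml);

        // Compare on LocalName so a default XHTML namespace does not hide the elements.
        return root.Descendants()
                   .Where(e => e.Name.LocalName == "td")
                   .Select(e => e.Value.Trim())
                   .ToList();
    }
}
```

Calling TdReader.ReadTdValues on the <tr> row from the question would give one list entry per cell. Real-world pages are often not valid XML, in which case this throws and an HTML parser is the safer choice.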

Use a loop to load each non-empty line from the file into a string.
Process the string character by character.
Check for characters indicating the beginning of a td tag.
Use a substring function, or just build a new string character by character, to get all the content until the </td> tag begins.

Related

How would I Strip Html from a string and set a character limit?

I'm getting a string from a list of items. The string is currently displayed as "item.ItemDescription" (the 9th row below).
I want to strip out all HTML from this string and set a character limit of 250 after the HTML is stripped.
Is there a simple way of doing this?
I saw there was a post saying to install HTML Agility Pack, but I was looking for something simpler.
EDIT:
It does not always contain HTML. If the client wanted to add a bold or italic tag to an item's name in the description, it would show up as <"strong">Item Name<"/strong">, for instance. I want to strip out all HTML no matter what is entered.
<tbody>
    @foreach (var item in Model.itemList)
    {
        <tr id="@("__filterItem_" + item.EntityId + "_" + item.EntityTypeId)">
            <td>
                @Html.ActionLink(item.ItemName, "Details", "Item", new { id = item.EntityId }, null)
            </td>
            <td>
                @item.ItemDescription
            </td>
            <td>
                @if (Model.IsOwner)
                {
                    <a class="btnDelete" title="Delete" itemid="@(item.EntityId)" entitytype="@item.EntityTypeId" filterid="@Model.Id">Delete</a>
                }
            </td>
        </tr>
    }
</tbody>

Your best option IMO is to not get into a parsing nightmare with all the possible values. Why not simply inject class=someCssClassName into the <td> as an attribute? Then control the length, color, whatever with CSS.
An even better idea is to assign a class to the containing <tr class=trClass> and then have the CSS apply lengths to child <td> elements.

You could do something like this to remove all tags (opening, closing, and self-closing) from the string, but it may have the unintended consequence of removing things the user entered that weren't meant to be html tags:
text = Regex.Replace(text, @"</?[^>]*/?>", String.Empty);
Instead, I would recommend something like this and letting the user know html isn't supported:
text = text.Replace("<", "&lt;");
text = text.Replace(">", "&gt;");
Just remember to check for your 250 character limit before the conversion (Substring throws if the string is shorter than the requested length):
if (text.Length > 250)
    text = text.Substring(0, 250);

This Regex will select any html tags (including the ones with double quotes such as <"strong">):
<[^>]*>
Look here: http://regexr.com/3cge4
Using C# regular expressions to remove HTML tags
From there, you can simply check the string size and display appropriately.
var itemDescriptionStripped = Regex.Replace(item.ItemDescription, @"<[^>]*>", String.Empty);
// Strings are immutable, so the result of Substring has to be assigned back.
if (itemDescriptionStripped.Length > 250)
    itemDescriptionStripped = itemDescriptionStripped.Substring(0, 250);
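The strip-then-truncate steps above can be wrapped in one small helper. This is only a sketch: the names HtmlText and StripAndTruncate are invented here, and the regex is the same simple <[^>]*> pattern from the answer, so it will also remove any user text that happens to sit between a < and a >.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HtmlText
{
    // Removes anything that looks like a tag, then caps the result at maxLength characters.
    // Hypothetical helper for illustration; not a full HTML sanitizer.
    public static string StripAndTruncate(string input, int maxLength)
    {
        if (string.IsNullOrEmpty(input))
            return string.Empty;

        string stripped = Regex.Replace(input, @"<[^>]*>", string.Empty);

        // Only call Substring when the text is actually longer than the limit.
        return stripped.Length > maxLength
            ? stripped.Substring(0, maxLength)
            : stripped;
    }
}
```

In the Razor view from the question this might be used as @HtmlText.StripAndTruncate(item.ItemDescription, 250).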

Extract a text from a file c#

I got a file .mail that contains:
`
FromFild=xxx@gmail.com
ToFild=yyy@gmai.com
SubjectFild=Test
Message=
<b><font size="3" color="blue">testing</font> </b>
<table>
<tr>
<th>Question</th>
<th>Answer</th>
<th>Correct?</th>
</tr>
<tr>
<td>What is the capital of Burundi?</td>
<td>Bujumburra</td>
<td>Yes</td>
</tr>
<tr>
<td>What is the capital of France?</td>
<td>F</td>
<td>Erm... sort of</td>
</tr>
</table>
Message=END
#at least one empty line needed at the end!
`
I need to extract and save only the text that is between Message= and Message=END.
I tried split('=').Last()/First(), which doesn't work. I can't use Substring, as it only accepts an int index. I'm new to this and can't think of a solution. Can you give me a hint, please?

You can use this regular expression, with RegexOptions.Singleline so that . also matches newlines (.NET does not use the /.../s delimiter syntax):
Message=(?<messagebody>(.*))Message=END
Then the code to get the message:
// fileContent holds the content of your .mail file
Match match = Regex.Match(fileContent, @"Message=(?<messagebody>(.*))Message=END", RegexOptions.Singleline);
string message = match.Groups["messagebody"].Value;

I will assume that there is no constant number of lines in the text file or in the message you're looking for that I can rely on.
string prefix = "Message=";
string postfix = "Message=END";
var text = File.ReadAllText("a.txt");
var messageStart = text.IndexOf(prefix) + prefix.Length;
var messageStop = text.IndexOf(postfix);
var result = text.Substring(messageStart, messageStop - messageStart);
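A slightly more defensive variant of the same IndexOf approach is sketched below. The class and method names (MailFile, ExtractMessage) are made up for illustration; the sketch returns null when either marker is missing and searches for the end marker only after the start marker.

```csharp
using System;

public static class MailFile
{
    // Returns the text between "Message=" and "Message=END", or null if a marker is missing.
    // Hypothetical helper based on the markers shown in the question.
    public static string ExtractMessage(string text)
    {
        const string prefix = "Message=";
        const string postfix = "Message=END";

        int start = text.IndexOf(prefix, StringComparison.Ordinal);
        if (start < 0)
            return null;
        start += prefix.Length;

        // Look for the end marker only after the start marker.
        int stop = text.IndexOf(postfix, start, StringComparison.Ordinal);
        if (stop < 0)
            return null;

        // Trim removes the line breaks that surround the message body.
        return text.Substring(start, stop - start).Trim();
    }
}
```

It could be called as MailFile.ExtractMessage(File.ReadAllText("a.txt")), with the file name taken from the answer above.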

How can I grab a value between two tags from a string

I am trying to grab data from a webpage. I have downloaded the webpage into a string variable.
I am wondering how I can grab the value between two tags. I have included a snippet of the downloaded string and the value I want is 895
<div class="split2r right">
<strong>Avg. asking rent in M4:</strong>
<strong class="price big">£897 pcm</strong><br>
<strong>No. of properties to rent in M4:</strong> <strong><a data-ga-category="Area stats" data-ga-action="properties_to_rent" data-ga-label="/tracking/home-values/results/" href="/to-rent/property/manchester/isaac-way/m4-7ed/">225</a></strong>
</div>
A code example would be great.

This is actually quite easy using the HtmlAgilityPack library to parse the HTML.
The first step is to add a reference to the HtmlAgilityPack library. Then you can start parsing the HTML:
const string Html = "<strong>Avg. price:</strong> <strong class=\"price big\">£895 pcm</strong><br><strong>this is the price of zed headphones</strong>";
var doc = new HtmlDocument();
doc.LoadHtml(Html);
The next step is to find the element you are looking for, in this case that is the <strong> element with its class set to price big:
var priceNode = doc.DocumentNode.SelectSingleNode("//strong[@class='price big']");
Now our final step is to retrieve the actual number from the node's InnerText property. Probably the best way to do this is through a regular expression, which can be quite simple if we assume that the required number is the only number in the inner text of the node:
var priceMatch = Regex.Match(priceNode.InnerText, @"(\d+)");
Console.WriteLine(priceMatch); // Will output 895

private void button1_Click(object sender, EventArgs e)
{
    string input = @"<strong class=""price big"">£895 pcm</strong><br>";
    var prices = new List<int>();
    // Capture up to five digits between the pound sign and " pcm".
    MatchCollection mc = Regex.Matches(input, @">£(\d{1,5}) pcm");
    foreach (Match m in mc)
    {
        prices.Add(Convert.ToInt32(m.Groups[1].Value));
    }
}

Assuming your string value is called "source" and all extracts are formatted as the example
var value = Regex.Replace(source, @"\D", string.Empty);

C#: Get exact substring from HTML code using IndexOf and LastIndexOf

I have a HTML page retrieved using the GetResponseStream() in C#. I need an exact value (int) that comes from that page, which is different every time I run the program. Nevertheless, the structure of the HTML code is the same, in particular:
(...)
<td colspan="2" class="txtnormal"><div align="right"> TAX:</div></td>
<td class="txtnormal"><div align="right"><strong>0.00</strong></div></td>
<td colspan="2"> </td>
(...)
and
(...)
<td colspan="2"><div align="right" class="txtnormal">Total:</div></td>
<td class="txtnormal"><div align="right"><strong>10.00</strong></div></td>
<td colspan="2"> </td>
(...)
Notice that the code is repeated in the same page (i.e: <td class="txtnormal"><div align="right"><strong>VALUE</strong></div></td>), but the title of the values (TAX and Total) are the only different thing (the actual value could be the same).
I would like to store in a variable the Total value, this is: 10.0 in this case.
I tried this:
int first = responseFromServer.IndexOf("<td class= \"txtnormal\"><div align=\"right\"><strong>") + "<td class=\"txtnormal\"><div align=\"right\"><strong>".Length;
int last = responseFromServer.LastIndexOf("</strong></div></td>");
string value = responseFromServer.Substring(first, last - first);
But I get bad results: value ends up holding the whole HTML page up to the last value (because of the difference I'm computing).
Do you know how I could get the exact value, that is, the substring between the pieces of text I pasted?
Thank you very much.

To scrape from a page, you have a couple of options. The "best" is to use the DOM to find the node(s) in question and pull its value. If you can't use the DOM for some reason, you can move to regex and pull the value that way.
Your method is "okay" in many instances, as long as you can be sure the site owner will never set up another instance of "</strong></div></td>" anywhere downstream. This is a risky assumption.
What value are you getting for the int string? That will tell you whether or not your particular pattern is working correctly. I would still consider the HTML DOM, as it is a more accurate way to traverse the nodes.
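If you stay with plain string searching, one way to make it less fragile is to anchor on the label first and then take the next <strong> element after it. The sketch below assumes the structure shown in the question (a label such as "Total:" followed later by <strong>VALUE</strong>); the names TotalFinder and FindValueAfterLabel are invented for illustration.

```csharp
using System;
using System.Globalization;

public static class TotalFinder
{
    // Finds the label, then parses the content of the first <strong> element after it.
    // Returns null if the label or tags are missing or the content is not a number.
    public static decimal? FindValueAfterLabel(string html, string label)
    {
        const string open = "<strong>";
        const string close = "</strong>";

        int labelIndex = html.IndexOf(label, StringComparison.Ordinal);
        if (labelIndex < 0)
            return null;

        int start = html.IndexOf(open, labelIndex, StringComparison.Ordinal);
        if (start < 0)
            return null;
        start += open.Length;

        // Search forward from the value, not backward from the end of the page.
        int end = html.IndexOf(close, start, StringComparison.Ordinal);
        if (end < 0)
            return null;

        string raw = html.Substring(start, end - start).Trim();
        return decimal.TryParse(raw, NumberStyles.Number, CultureInfo.InvariantCulture, out decimal value)
            ? value
            : (decimal?)null;
    }
}
```

With the page from the question, FindValueAfterLabel(responseFromServer, "Total:") would pick the value that follows the Total label instead of the TAX one. The value is parsed as a decimal because 10.00 is not an int.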

I think Regex is your friend here:
using System;
using System.Text.RegularExpressions;

namespace SimpleApp
{
    class Program
    {
        static void Main(string[] args)
        {
            Regex theRegex = new Regex(@">Total:<.+?<strong>(.+?)</strong>");
            string str = @"<td colspan=""2""><div align=""right"" class=""txtnormal"">Total:</div></td>" +
                         @"<td class=""txtnormal""><div align=""right""><strong>10.00</strong></div></td>" +
                         @"<td colspan=""2""> </td>";
            if (theRegex.Match(str).Success)
            {
                Console.WriteLine("Found Total of " + theRegex.Match(str).Result("$1"));
            }
            else
            {
                Console.WriteLine("Not found");
            }
            Console.ReadLine();
        }
    }
}
Obviously your HTML page might have other things that could trip this simple regular expression up but you get the idea.

Simple text to HTML conversion

I have a very simple asp:textbox with the multiline attribute enabled. I then accept just text, with no markup, from the textbox. Is there a common method by which line breaks and returns can be converted to <p> and <br/> tags?
I'm not looking for anything earth shattering, but at the same time I don't just want to do something like:
html.Insert(0, "<p>");
html.Replace(Environment.NewLine + Environment.NewLine, "</p><p>");
html.Replace(Environment.NewLine, "<br/>");
html.Append("</p>");
The above code doesn't work right, as in generating correct html, if there are more than 2 line breaks in a row. Having html like <br/></p><p> is not good; the <br/> can be removed.

I know this is old, but I couldn't find anything better after some searching, so here is what I'm using:
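public static string TextToHtml(string text)
{
    text = HttpUtility.HtmlEncode(text);
    // Normalize every kind of line break to "\r", then turn each one into a BR tag.
    text = text.Replace("\r\n", "\r");
    text = text.Replace("\n", "\r");
    text = text.Replace("\r", "<br>\r\n");
    // Double space becomes a single space plus a non-breaking space.
    text = text.Replace("  ", " &nbsp;");
    return text;
}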
If you can't use HttpUtility for some reason, then you'll have to do the HTML encoding some other way, and there are lots of minor details to worry about (not just <>&).
HtmlEncode only handles the special characters for you, so after that I convert any combo of carriage-return and/or line-feed to a BR tag, and any double-spaces to a single-space plus a NBSP.
Optionally you could use a PRE tag for the last part, like so:
public static string TextToHtml(string text)
{
    text = "<pre>" + HttpUtility.HtmlEncode(text) + "</pre>";
    return text;
}

Your other option is to take the text box contents and, instead of trying for line and paragraph breaks, just put the text between PRE tags. Like this:
<PRE>
Your text from the text box...
and a line after a break...
</PRE>

Depending on exactly what you are doing with the content, my typical recommendation is to ONLY use the <br /> syntax, and not to try and handle paragraphs.
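A minimal sketch of that <br />-only approach follows. It uses System.Net.WebUtility.HtmlEncode so it does not depend on System.Web; the names BrOnly and ToHtmlWithBreaks are made up for illustration.

```csharp
using System;
using System.Net;

public static class BrOnly
{
    // Encodes the text, then turns every line break (CRLF, CR or LF) into <br />.
    // Hypothetical helper: deliberately makes no attempt to build <p> paragraphs.
    public static string ToHtmlWithBreaks(string text)
    {
        string encoded = WebUtility.HtmlEncode(text ?? string.Empty);
        return encoded.Replace("\r\n", "\n")
                      .Replace("\r", "\n")
                      .Replace("\n", "<br />");
    }
}
```

Encoding before inserting the tags matters: done the other way round, the <br /> tags themselves would be escaped.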

How about throwing it in a <pre> tag. Isn't that what it's there for anyway?

I know this is an old post, but I've recently been in a similar problem using C# with MVC4, so thought I'd share my solution.
We had a description saved in a database. The text was a direct copy/paste from a website, and we wanted to convert it into semantic HTML, using <p> tags. Here is a simplified version of our solution:
string description = getSomeTextFromDatabase();
foreach (var line in description.Split('\n'))
{
    Console.Write("<p>" + line + "</p>");
}
In our case, to write out a variable, we needed to prefix @ before any variable or identifiers, because of the Razor syntax in the ASP.NET MVC framework. However, I've shown this with a Console.Write, but you should be able to figure out how to implement this in your specific project based on this :)

Combining all previous plus considering titles and subtitles within the text comes up with this:
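public static string ToHtml(this string text)
{
    var sb = new StringBuilder();
    var sr = new StringReader(text);
    var str = sr.ReadLine();
    while (str != null)
    {
        str = str.TrimEnd();
        // Strings are immutable, so the result of Replace must be assigned back.
        str = str.Replace("  ", " &nbsp;");
        if (str.Length > 80)
        {
            // Long lines are treated as paragraphs.
            sb.AppendLine($"<p>{str}</p>");
        }
        else if (str.Length > 0)
        {
            // Short lines (titles, subtitles) just get a line break.
            sb.AppendLine($"{str}<br/>");
        }
        str = sr.ReadLine();
    }
    return sb.ToString();
}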
The snippet could be enhanced by defining rules for short strings.

I realize this answer is 13 years late, but maybe someone else needs it.
sample line 1 \r\n
sample line 2 (last at paragraph) \r\n\r\n [\r\n]+
sample line 3 \r\n
Example code
private static Regex _breakRegex = new("(\r?\n)+");
private static Regex _paragraphBreakRegex = new("(?:\r?\n){2,}");

public static string ConvertTextToHtml(string description)
{
    // Two or more consecutive line breaks separate paragraphs.
    string[] descriptionParagraphs = _paragraphBreakRegex.Split(description.Trim());
    if (descriptionParagraphs.Length > 0)
    {
        description = string.Empty;
        foreach (string line in descriptionParagraphs)
        {
            description += $"<p>{line}</p>";
        }
    }
    // Any remaining single line breaks become <br/>.
    return _breakRegex.Replace(description, "<br/>");
}
