C#: Get exact substring from HTML code using IndexOf and LastIndexOf - c#

I have an HTML page retrieved using GetResponseStream() in C#. I need an exact value (int) from that page, which is different every time I run the program. Nevertheless, the structure of the HTML code is always the same, in particular:
(...)
<td colspan="2" class="txtnormal"><div align="right"> TAX:</div></td>
<td class="txtnormal"><div align="right"><strong>0.00</strong></div></td>
<td colspan="2"> </td>
(...)
and
(...)
<td colspan="2"><div align="right" class="txtnormal">Total:</div></td>
<td class="txtnormal"><div align="right"><strong>10.00</strong></div></td>
<td colspan="2"> </td>
(...)
Notice that the same markup is repeated within the page (i.e. <td class="txtnormal"><div align="right"><strong>VALUE</strong></div></td>); the labels of the values (TAX and Total) are the only difference (the actual values could be the same).
I would like to store the Total value in a variable, that is, 10.00 in this case.
I tried this:
int first = responseFromServer.IndexOf("<td class= \"txtnormal\"><div align=\"right\"><strong>") + "<td class=\"txtnormal\"><div align=\"right\"><strong>".Length;
int last = responseFromServer.LastIndexOf("</strong></div></td>");
string value = responseFromServer.Substring(first, last - first);
But I get bad results: value ends up holding ALL of the HTML page up to the value (because of the difference I'm computing).
Do you know how I could get the exact value, that is, the substring between the pieces of text I pasted?
Thank you very much.

To scrape from a page, you have a couple of options. The "best" is to use the DOM to find the node(s) in question and pull its value. If you can't use the DOM for some reason, you can move to regex and pull the value that way.
Your method is "okay" in many instances, as long as you can be sure the site owner will never set up another instance of "</strong></div></td>" anywhere downstream. This is a risky assumption.
What values are you getting for the two ints and the string? That will tell you whether or not your particular pattern is working correctly. I would still consider the HTML DOM, as it is a more accurate way to traverse the nodes.
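If you want to stay with IndexOf, one way to avoid the problem described above is to search for the label first and then look for the <strong> element only from that position onwards. The following is a minimal sketch under that assumption; the class and method names (LabelledValueExtractor, ExtractAfterLabel) are made up for illustration, and it assumes the value always sits in the first <strong> element after the label.

```csharp
using System;

public static class LabelledValueExtractor
{
    // Returns the text of the first <strong>...</strong> element that follows
    // the given label, or null if the label or the element cannot be found.
    public static string ExtractAfterLabel(string html, string label)
    {
        const string open = "<strong>";
        const string close = "</strong>";

        // 1. Locate the label (e.g. "Total:").
        int labelPos = html.IndexOf(label, StringComparison.Ordinal);
        if (labelPos < 0) return null;

        // 2. Locate the first <strong> after the label.
        int start = html.IndexOf(open, labelPos + label.Length, StringComparison.Ordinal);
        if (start < 0) return null;
        start += open.Length;

        // 3. Locate the first </strong> after that, not the last one on the page.
        int end = html.IndexOf(close, start, StringComparison.Ordinal);
        if (end < 0) return null;

        return html.Substring(start, end - start).Trim();
    }
}
```

Calling ExtractAfterLabel(responseFromServer, "Total:") should then return only the text of the value, which can be passed to decimal.Parse with CultureInfo.InvariantCulture if a number is needed.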

I think Regex is your friend here:
using System;
using System.Text.RegularExpressions;

namespace SimpleApp
{
    class Program
    {
        static void Main(string[] args)
        {
            // Lazily match from the "Total:" label to the first <strong> value after it.
            Regex theRegex = new Regex(@">Total:<.+?<strong>(.+?)</strong>");
            string str = @"<td colspan=""2""><div align=""right"" class=""txtnormal"">Total:</div></td>" +
                         @"<td class=""txtnormal""><div align=""right""><strong>10.00</strong></div></td>" +
                         @"<td colspan=""2""> </td>";

            // Run the match once and reuse the result.
            Match match = theRegex.Match(str);
            if (match.Success)
            {
                Console.WriteLine("Found Total of " + match.Groups[1].Value);
            }
            else
            {
                Console.WriteLine("Not found");
            }
            Console.ReadLine();
        }
    }
}
Obviously your HTML page might have other things that could trip this simple regular expression up but you get the idea.

Related

How would I Strip Html from a string and set a character limit?

I'm getting a string from a list of items. The string is currently displayed as "item.ItemDescription" (the 9th row below).
I want to strip out all HTML from this string and set a character limit of 250 after the HTML is stripped.
Is there a simple way of doing this?
I saw a post saying to install HTML Agility Pack, but I was looking for something simpler.
EDIT:
It does not always contain HTML. If the client wanted to add a bold or italic tag to an item's name in the description, it would show up as <"strong">Item Name<"/strong"> for instance. I want to strip out all HTML no matter what is entered.
<tbody>
    @foreach (var item in Model.itemList)
    {
        <tr id="@("__filterItem_" + item.EntityId + "_" + item.EntityTypeId)">
            <td>
                @Html.ActionLink(item.ItemName, "Details", "Item", new { id = item.EntityId }, null)
            </td>
            <td>
                item.ItemDescription
            </td>
            <td>
                @if (Model.IsOwner)
                {
                    <a class="btnDelete" title="Delete" itemid="@(item.EntityId)" entitytype="@item.EntityTypeId" filterid="@Model.Id">Delete</a>
                }
            </td>
        </tr>
    }
</tbody>
Your best option, IMO, is not to get into a parsing nightmare with all the possible values. Why not simply inject a class=someCssClassName attribute into the <td>, then control the length, color, or whatever with CSS?
An even better idea is to assign a class to the containing <tr class=trClass> and then have the CSS apply lengths to child <td> elements.
You could do something like this to remove all tags (opening, closing, and self-closing) from the string, but it may have the unintended consequence of removing things the user entered that weren't meant to be html tags:
text = Regex.Replace(text, @"</?[^>]*/?>", String.Empty);
Instead, I would recommend something like this and letting the user know html isn't supported:
text = text.Replace("<", "&lt;");
text = text.Replace(">", "&gt;");
Just remember to check for your 250 character limit before the conversion:
if (text.Length > 250) text = text.Substring(0, 250); // Substring throws if the string is shorter than 250
This Regex will select any HTML tags (including the ones with double quotes such as <"strong">):
<[^>]*>
Look here: http://regexr.com/3cge4
Using C# regular expressions to remove HTML tags
From there, you can simply check the string size and display appropriately.
var itemDescriptionStripped = Regex.Replace(item.ItemDescription, @"<[^>]*>", String.Empty);
// Keep at most 250 characters; the result of Substring must be assigned back.
if (itemDescriptionStripped.Length > 250)
    itemDescriptionStripped = itemDescriptionStripped.Substring(0, 250);
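The strip-then-truncate steps above could be wrapped in a small helper so the view only has to call one method. This is a rough sketch, not a full HTML sanitizer: the names DescriptionFormatter and StripAndTruncate are hypothetical, and the pattern is the same simple <[^>]*> expression, so it will also remove user text that merely looks like a tag.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DescriptionFormatter
{
    // Matches anything that looks like a tag, including <"strong">.
    private static readonly Regex TagPattern = new Regex(@"<[^>]*>");

    // Removes tag-like substrings, then cuts the result to at most maxLength characters.
    public static string StripAndTruncate(string input, int maxLength)
    {
        if (string.IsNullOrEmpty(input)) return string.Empty;

        string stripped = TagPattern.Replace(input, string.Empty);
        return stripped.Length > maxLength
            ? stripped.Substring(0, maxLength)
            : stripped;
    }
}
```

In the Razor view this would be used as @DescriptionFormatter.StripAndTruncate(item.ItemDescription, 250), assuming the class is visible to the view.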

Extract a text from a file c#

I got a file .mail that contains:
`
FromFild=xxx@gmail.com
ToFild=yyy@gmai.com
SubjectFild=Test
Message=
<b><font size="3" color="blue">testing</font> </b>
<table>
<tr>
<th>Question</th>
<th>Answer</th>
<th>Correct?</th>
</tr>
<tr>
<td>What is the capital of Burundi?</td>
<td>Bujumburra</td>
<td>Yes</td>
</tr>
<tr>
<td>What is the capital of France?</td>
<td>F</td>
<td>Erm... sort of</td>
</tr>
</table>
Message=END
#at least one empty line needed at the end!
`
I need to extract and save only the text that is between Message= and Message=END.
I tried split('=').Last/First(), but that is no good. I cannot use Substring, as it accepts only an int index. I am a noob and I cannot think of a solution. Can you give me a hint, please?
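You can use this regular expression, with the Singleline option so that . also matches newlines:
Message=(?<messagebody>(.*))Message=END
Then the code to get the message (.NET patterns are not wrapped in /.../s delimiters; the option is passed separately):
string fileContent = File.ReadAllText(path); // path: the location of your .mail file
Match match = Regex.Match(fileContent, @"Message=(?<messagebody>(.*))Message=END", RegexOptions.Singleline);
string message = match.Groups["messagebody"].Value;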
I will assume that neither the text file nor the message you're looking for has a constant number of lines that I can rely on.
string prefix = "Message=";
string postfix = "Message=END";
var text = File.ReadAllText("a.txt");
var messageStart = text.IndexOf(prefix) + prefix.Length;
var messageStop = text.IndexOf(postfix);
var result = text.Substring(messageStart, messageStop - messageStart);
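The IndexOf version above throws if either marker is missing, because IndexOf returns -1 and Substring then gets a negative length. A slightly more defensive sketch is shown here; the class and method names are invented for illustration, and returning null for a malformed file is an assumption about what the caller wants.

```csharp
using System;

public static class MailBodyExtractor
{
    // Returns the text between "Message=" and "Message=END",
    // or null if either marker is missing.
    public static string ExtractMessage(string text)
    {
        const string prefix = "Message=";
        const string postfix = "Message=END";

        int start = text.IndexOf(prefix, StringComparison.Ordinal);
        if (start < 0) return null;
        start += prefix.Length;

        // Search for the end marker only after the start marker.
        int stop = text.IndexOf(postfix, start, StringComparison.Ordinal);
        if (stop < 0) return null;

        // Trim removes the line breaks that surround the body.
        return text.Substring(start, stop - start).Trim();
    }
}
```

It can be fed the file content directly, for example ExtractMessage(File.ReadAllText("a.txt")).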

how to generate text boxes dynamicaly in asp.net

I am working on an HRMS project in which I have to build the time sheet feature.
I need to take a starting date and an ending date from the user and, depending on the difference between the two dates, dynamically generate textboxes for entering time, and do so in a specified place on the page.
Can anybody help me out?
I have done something very similar to that. Basically, I create an HTML table with as many rows as needed and create a TextBox in each of them, like this:
myColumn.InnerHtml = "<table>";
int length = /* the number of textboxes you want to add, based on the dates */;
int i = 1;
while (i <= length)
{
    // One row with one text input per iteration; all markup goes into the same cell.
    myColumn.InnerHtml += "<tr>";
    myColumn.InnerHtml += "<td><input id='whatever" + i.ToString() + "' type='text' runat='server'></td>";
    myColumn.InnerHtml += "</tr>";
    i++;
}
myColumn.InnerHtml += "</table>";
myColumn is declared in the aspx like this:
<table>
<tr>
<td id="myColumn" runat="server" visible="false" style="width: 35%; vertical-align: top">
</td>
</tr>
</table>
This code should be added when you have entered the two dates, in order to create the textboxes dynamically. Hope this helps!
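To make the "number of textboxes based on the dates" part concrete, here is a rough sketch that builds the same kind of markup from two dates. It is only an illustration: the names TimeSheetMarkup and BuildRows and the hours_yyyy-MM-dd input names are invented, and it assumes one textbox per day with both the start and end date included.

```csharp
using System;
using System.Globalization;
using System.Text;

public static class TimeSheetMarkup
{
    // Builds a table with one text input per day from start to end (both inclusive).
    public static string BuildRows(DateTime start, DateTime end)
    {
        var html = new StringBuilder("<table>");

        // Assumption: both dates count, so a one-day range yields one textbox.
        int days = (end.Date - start.Date).Days + 1;

        for (int i = 0; i < days; i++)
        {
            string day = start.Date.AddDays(i)
                .ToString("yyyy-MM-dd", CultureInfo.InvariantCulture);
            html.Append("<tr><td><input type='text' name='hours_")
                .Append(day)
                .Append("'></td></tr>");
        }

        return html.Append("</table>").ToString();
    }
}
```

The result could be assigned to myColumn.InnerHtml once both dates are known. Since the inputs are plain HTML rather than server controls, their values would have to be read from the posted form by name.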
You need to either:
a) Take a postback with entered values, and generate new textboxes in CreateChildControls, depending on the values
or
b) Add them with JavaScript and take their values from Request manually.

Html Agility Pack + Get specific node

Hello, I have a problem with my application.
I need to pick out a specific piece of text that sits between two nodes.
The html page looks like this
<td align="right" width="186">Text1</td>
<td align="center" width="51">? - ?</td>
<td width="186">Text2</td>
I can pick out Text1 and Text2 with:
HtmlNodeCollection cols = doc.DocumentNode.SelectNodes("//td[@width='186']");
foreach (HtmlNode col in cols)
{
    if (col.InnerText == "Text1")
    {
        Label1.Text = col.InnerText;
    }
}
The reason I have the if-condition is that there are more td's in the page, and I need to pick out specifically the one that has "Text1" in it.
The problem is how to parse out the text "? - ?". Other parts of the document also contain the text "? - ?", but I need specifically the one between my two other nodes.
The result should be Text1 ? - ? Text2 etc.
I guess it has something to do with the next child or sibling, etc.?
You can check col.NextSibling.InnerText.

Parsing big string (HTML code)

I'm looking to parse some information on my application.
Let's say we have somewhere in that string:
<tr class="tablelist_bg1">
<td>Beja</td>
<td class="text_center">---</td>
<td class="text_center">19.1</td>
<td class="text_center">10.8</td>
<td class="text_center">NW</td>
<td class="text_center">50.9</td>
<td class="text_center">0</td>
<td class="text_center">1016.6</td>
<td class="text_center">---</td>
<td class="text_center">---</td>
</tr>
Everything else above or below this doesn't matter. Remember this is all inside a string.
I want to get the values inside the td tags: ---, 19.1, 10.8, etc.
Worth knowing that there are many entries like this on the page.
Probably also a good idea to link the page here.
As you probably guessed I have absolutely no idea how to do this... none of the functions I know I can perform over the string (split etc.) help.
Thanks in advance
Just use String.IndexOf(string, int) to find a "<td", again to find the next ">", and again to find "</td>". Then use String.Substring to pull out a value. Put this in a loop.
public static List<string> ParseTds(string input)
{
    List<string> results = new List<string>();
    int index = 0;
    while (true)
    {
        string next = ParseTd(input, ref index);
        if (next == null)
            return results;
        results.Add(next);
    }
}

private static string ParseTd(string input, ref int index)
{
    int tdIndex = input.IndexOf("<td", index);
    if (tdIndex == -1)
        return null;
    int gtIndex = input.IndexOf(">", tdIndex);
    if (gtIndex == -1)
        return null;
    int endIndex = input.IndexOf("</td>", gtIndex);
    if (endIndex == -1)
        return null;
    index = endIndex;
    return input.Substring(gtIndex + 1, endIndex - gtIndex - 1);
}
Assuming your string is valid XHTML, you can use an XML parser to get the content you want. There's a simple example here that shows how to use XmlTextReader to parse XML content. The example reads from a file, but you can change it to read from a string:
new XmlTextReader(new StringReader(someString));
You specifically want to keep track of td element nodes, and the text node that follows them will contain the values you want.
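As a rough sketch of that idea, the following reads every td value from a string. Note that it uses XmlReader.Create rather than XmlTextReader directly; the reading loop works the same way. The names TdReader and ReadTdValues are made up, and it assumes the fragment is well-formed XML with a single root element (such as one <tr>), that each td contains only text, and that there are no HTML-only entities like &nbsp;.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

public static class TdReader
{
    // Returns the text content of every <td> element in a well-formed fragment.
    public static List<string> ReadTdValues(string xhtmlFragment)
    {
        var values = new List<string>();
        using (var reader = XmlReader.Create(new StringReader(xhtmlFragment)))
        {
            while (!reader.EOF)
            {
                if (reader.NodeType == XmlNodeType.Element && reader.Name == "td")
                {
                    // Reads the text and moves past </td>, so no extra Read() here;
                    // otherwise an immediately following <td> would be skipped.
                    values.Add(reader.ReadElementContentAsString());
                }
                else
                {
                    reader.Read();
                }
            }
        }
        return values;
    }
}
```

For the row in the question this should give the values in document order, starting with Beja and ---.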
Use a loop to load each non-empty line from the file into a String.
Process the string character by character.
Check for characters indicating the beginning of a td tag.
Use a substring function, or just build a new string character by character, to get all the content until the </td> tag begins.
