I'm using Html Agility Pack on a website to extract some data. Parsing some of the HTML I need is easy but I am having trouble with this (slightly complex?) piece of HTML.
<tr>
<td>
<div onmouseover="toggle('clue_J_1_1', 'clue_J_1_1_stuck', '&lt;em class=&quot;correct_response&quot;&gt;Obama&lt;/em&gt;&lt;br /&gt;&lt;br /&gt;&lt;table width=&quot;100%&quot;&gt;&lt;tr&gt;&lt;td class=&quot;right&quot;&gt;Kailyn&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;')" onmouseout="toggle('clue_J_1_1', 'clue_J_1_1_stuck', 'Michelle LaVaughn Robinson')" onclick="togglestick('clue_J_1_1_stuck')">
...
I need to get the value of the em element with class "correct_response" inside the div's onmouseover attribute, based on the clue_J_X_Y value. I really don't know how to get beyond this:
HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//tr//td/div[@onmouseover]");
Some help would be appreciated.
I don't know what you're supposed to get out of the em, but I will give you everything you need to figure it out.
First we load the HTML.
string html = "<tr>" +
    "<td>" +
    "<div onmouseover = \"toggle('clue_J_1_1', 'clue_J_1_1_stuck', '&lt;em class=&quot;correct_response&quot;&gt;Obama&lt;/em&gt;&lt;br/&gt;&lt;br/&gt;&lt;table width=&quot;100%&quot;&gt;&lt;tr&gt;&lt;td class=&quot;right&quot;&gt;Kailyn&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;')\" onmouseout = \"toggle('clue_J_1_1', 'clue_J_1_1_stuck', 'Michelle LaVaughn Robinson')\" onclick = \"togglestick('clue_J_1_1_stuck')\"></div></td></tr>";
HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);
//Console.WriteLine(doc.DocumentNode.OuterHtml);
Then we get the value of the onmouseover attribute.
string toggle = doc.DocumentNode.SelectSingleNode("//tr//td/div[@onmouseover]").GetAttributeValue("onmouseover", "FAILED");
GetAttributeValue returns the default, "FAILED", if there is no attribute named "onmouseover". Note that SelectSingleNode itself returns null when no node matches the XPath. Now we collect the parameters of the toggle call, each of which is enclosed in a pair of ' (apostrophe) characters.
//Get Variables from toggle()
List<string> toggleVariables = new List<string>();
bool flag = false;
string temp = "";
for (int i = 0; i < toggle.Length; i++)
{
    if (toggle[i] == '\'' && flag)
    {
        //Closing apostrophe: store the finished parameter
        toggleVariables.Add(temp);
        temp = "";
        flag = false;
    }
    else if (flag)
    {
        //Inside a parameter: keep collecting characters
        temp += toggle[i];
    }
    else if (toggle[i] == '\'')
    {
        //Opening apostrophe: start a new parameter
        flag = true;
    }
}
After that we have a list with three entries. In this case it contains the following:
clue_J_1_1
clue_J_1_1_stuck
&lt;em class=&quot;correct_response&quot;&gt;Obama&lt;/em&gt;&lt;br/&gt;&lt;br/&gt;&lt;table width=&quot;100%&quot;&gt;&lt;tr&gt;&lt;td class=&quot;right&quot;&gt;Kailyn&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
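As an alternative to the character loop, the quoted parameters can also be pulled out with a regular expression. This is only a minimal sketch and not part of the original answer: it assumes that no parameter contains an apostrophe (the same limitation the loop has), and ToggleArgs and Extract are hypothetical names chosen for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class ToggleArgs
{
    // Returns every substring enclosed in single quotes, in order of appearance.
    // Assumption: the parameters never contain an apostrophe themselves.
    public static List<string> Extract(string call)
    {
        var args = new List<string>();
        foreach (Match m in Regex.Matches(call, "'([^']*)'"))
        {
            args.Add(m.Groups[1].Value);
        }
        return args;
    }

    static void Main()
    {
        // Simplified stand-in for the onmouseover attribute value.
        List<string> args = Extract("toggle('clue_J_1_1', 'clue_J_1_1_stuck', 'Obama')");
        Console.WriteLine(string.Join("|", args)); // prints clue_J_1_1|clue_J_1_1_stuck|Obama
    }
}
```

Each match consumes both its opening and closing apostrophe, so the text between two parameters is never mistaken for a parameter.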
Now we can create a new HtmlDocument from the HTML in the third parameter. First, though, we have to decode it into workable HTML, because the third parameter contains HTML entities (escaped characters).
//Make it into workable HTML
toggleVariables[2] = HttpUtility.HtmlDecode(toggleVariables[2]);
//New HtmlDocument
HtmlDocument htmlInsideToggle = new HtmlDocument();
htmlInsideToggle.LoadHtml(toggleVariables[2]);
Console.WriteLine(htmlInsideToggle.DocumentNode.OuterHtml);
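To get from the decoded fragment to the value the question asks for, one option is an XPath query for the em element. The following is a minimal sketch rather than part of the original answer. It assumes the HtmlAgilityPack package is referenced and that the decoded third parameter has the shape shown above; the literal fragment string stands in for toggleVariables[2], and CorrectResponseDemo is a hypothetical name.

```csharp
using System;
using HtmlAgilityPack;

class CorrectResponseDemo
{
    static void Main()
    {
        // Stand-in for the decoded third parameter of toggle().
        string fragment = "<em class=\"correct_response\">Obama</em><br/><br/>" +
                          "<table width=\"100%\"><tr><td class=\"right\">Kailyn</td></tr></table>";

        var inner = new HtmlDocument();
        inner.LoadHtml(fragment);

        // SelectSingleNode returns null when nothing matches, so guard against that.
        HtmlNode em = inner.DocumentNode.SelectSingleNode("//em[@class='correct_response']");
        Console.WriteLine(em != null ? em.InnerText : "(no correct_response element found)");
    }
}
```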
And done. The code in its entirety is below.
using System;
using System.Collections.Generic;
using System.IO;
using System.Net;
using HtmlAgilityPack;
using System.Web; // HttpUtility lives here; requires a reference to System.Web
namespace test
{
    class Program
    {
        public static void Main(string[] args)
        {
            string html = "<tr>" +
                "<td>" +
                "<div onmouseover = \"toggle('clue_J_1_1', 'clue_J_1_1_stuck', '&lt;em class=&quot;correct_response&quot;&gt;Obama&lt;/em&gt;&lt;br/&gt;&lt;br/&gt;&lt;table width=&quot;100%&quot;&gt;&lt;tr&gt;&lt;td class=&quot;right&quot;&gt;Kailyn&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;')\" onmouseout = \"toggle('clue_J_1_1', 'clue_J_1_1_stuck', 'Michelle LaVaughn Robinson')\" onclick = \"togglestick('clue_J_1_1_stuck')\"></div></td></tr>";
            HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
            doc.LoadHtml(html);
            //Console.WriteLine(doc.DocumentNode.OuterHtml);
            string toggle = doc.DocumentNode.SelectSingleNode("//tr//td/div[@onmouseover]").GetAttributeValue("onmouseover", "FAILED");
            //Console.WriteLine(toggle);
            //Get Variables from toggle()
            List<string> toggleVariables = new List<string>();
            bool flag = false;
            string temp = "";
            for (int i = 0; i < toggle.Length; i++)
            {
                if (toggle[i] == '\'' && flag)
                {
                    //Closing apostrophe: store the finished parameter
                    toggleVariables.Add(temp);
                    temp = "";
                    flag = false;
                }
                else if (flag)
                {
                    temp += toggle[i];
                }
                else if (toggle[i] == '\'')
                {
                    //Opening apostrophe: start a new parameter
                    flag = true;
                }
            }
            //Make it into workable HTML
            toggleVariables[2] = HttpUtility.HtmlDecode(toggleVariables[2]);
            //New HtmlDocument
            HtmlDocument htmlInsideToggle = new HtmlDocument();
            htmlInsideToggle.LoadHtml(toggleVariables[2]);
            Console.WriteLine(htmlInsideToggle.DocumentNode.OuterHtml);
            //You're on your own from here
            Console.ReadKey();
        }
    }
}
Related
I'm generating a report with iTextSharp that includes HTML text entered by the user through TinyMCE on my web page. I want to insert a phrase into the report that preserves their markup.
Basic markup such as bold and underline works, but lists, indents, and alignment do not. Any suggestions short of writing my own little HTML-to-PDF parser?
My code:
internal static Phrase GetPhraseFromHtml(string html, string fontName, int fontSize)
{
var returnPhrase = new Phrase();
html = html.Replace(Environment.NewLine, String.Empty); //Replace returns a new string, so assign it back
//the string has to be well-formed HTML in order to work and has to specify the font, since
//specifying the font in the phrase overrides the formatting of the HTML tags.
string pTag = string.Format("<p style='font-size: {0}; font-family:{1}'>", fontSize, fontName);
if (html.StartsWith("<p>"))
{
html = html.Replace("<p>", pTag);
}
else
{
html = pTag + html + "</p>";
}
html
= "<html><body>"
+ html
+ "</body></html>";
using (StringWriter sw = new StringWriter())
{
using (System.Web.UI.HtmlTextWriter hw = new System.Web.UI.HtmlTextWriter(sw))
{
var xmlWorkerHandler = new XmlWorkerHandler();
//Bind a reader to our text
using (TextReader textReader = new StringReader(html))
{
//Parse
XMLWorkerHelper.GetInstance().ParseXHtml(xmlWorkerHandler, textReader);
}
var addPhrase = new Phrase();
var elementText = new StringBuilder();
bool firstElement = true;
//Loop through each element
foreach (var element in xmlWorkerHandler.elements)
{
if (firstElement)
{
firstElement = false;
}
else
{
addPhrase.Add(new Chunk("\n"));
}
//Loop through each chunk in each element
foreach (var chunk in element.Chunks)
{
addPhrase.Add(chunk);
}
returnPhrase.Add(addPhrase);
addPhrase = new Phrase();
}
return returnPhrase;
}
}
}
private void ParseFilesNames()
{
using (WebClient client = new WebClient())
{
try
{
for (int i = 0; i < 15; i++)
{
string urltoparse = "mysite.com/gallery/albums/from_old_gallery/" + i;
string s = client.DownloadString(urltoparse);
int index = -1;
while (true)
{
string firstTag = "HREF=";
string secondtag = ">";
index = s.IndexOf(firstTag, 0);
int endIndex = s.IndexOf(secondtag, index);
if (index < 0)
{
break;
}
else
{
string filename = s.Substring(index + firstTag.Length, endIndex - index - firstTag.Length);
}
}
}
}
catch (Exception err)
{
}
}
}
The problem is with the Substring call: the arguments index + firstTag.Length and endIndex - index - firstTag.Length are wrong.
What I need is the text between HREF=" and ">
The whole string looks like: HREF="myimage.jpg">
I need to get only myimage.jpg
Sometimes it is myimage465454.jpg, so in every case I need only the file name, without the quotes.
What should I change in the Substring call?
If you are sure that your string will always have the form HREF="yourpath">, just apply the following:
string yourInitialString = @"HREF=""myimage.jpg"">";
string parsedString = yourInitialString.Replace(@"HREF=""", "").Replace(@""">", "");
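If you would rather keep the IndexOf/Substring approach from the question, a sketch along the following lines may help. It is an illustration under assumptions, not a drop-in fix: it assumes every link has the form HREF="name"> and it searches for the quote characters as part of the delimiters so they are excluded from the result. It also moves the search position forward after each match, which the loop in the question never does. HrefScanner and ExtractFileNames are hypothetical names.

```csharp
using System;
using System.Collections.Generic;

static class HrefScanner
{
    // Collects the text between each HREF=" and the following "> in the input.
    public static List<string> ExtractFileNames(string s)
    {
        const string startTag = "HREF=\"";
        const string endTag = "\">";
        var names = new List<string>();
        int pos = 0;
        while (true)
        {
            int start = s.IndexOf(startTag, pos, StringComparison.OrdinalIgnoreCase);
            if (start < 0) break;          // no more links
            start += startTag.Length;      // skip past HREF="
            int end = s.IndexOf(endTag, start, StringComparison.Ordinal);
            if (end < 0) break;            // unterminated attribute
            names.Add(s.Substring(start, end - start));
            pos = end + endTag.Length;     // continue searching after this match
        }
        return names;
    }

    static void Main()
    {
        string page = "<A HREF=\"myimage.jpg\">one</A> <A HREF=\"myimage465454.jpg\">two</A>";
        foreach (string name in ExtractFileNames(page))
        {
            Console.WriteLine(name); // prints myimage.jpg, then myimage465454.jpg
        }
    }
}
```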
If you need to parse href values out of HTML links, the better option is the HtmlAgilityPack library.
Solution with Html Agility Pack:
HtmlWeb htmlWeb = new HtmlWeb();
HtmlAgilityPack.HtmlDocument doc = htmlWeb.Load(Url);
// Note: SelectNodes returns null when no node matches the XPath.
foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//a[@href]"))
{
    // Get the value of the HREF attribute
    string hrefValue = link.GetAttributeValue("href", string.Empty);
}
To install HtmlAgilityPack, run the following command in the Package Manager Console:
PM> Install-Package HtmlAgilityPack
Hope it helps.
Try this:
string filename = input.Split('=')[1].Replace("\"", "").Replace(">", "");
Problem:
I am passing HTML to ABCpdf to create a PDF.
But the CSS is not applied to the content, and the resulting PDF is not as expected.
Here is my code. Can you please suggest what the problem is, or how we can apply the CSS?
public static String CreateHtmlFile(String strHtmlCode)
{
    String Modifiedhtml = @"<!DOCTYPE html PUBLIC ""-//W3C//DTD XHTML 1.0 Transitional//EN"" ""http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd""><html class="" _Telerik_IE9"" xmlns=""http://www.w3.org/1999/xhtml"">" + strHtmlCode;
    Modifiedhtml = Modifiedhtml.Remove(Modifiedhtml.IndexOf(@"//<![CDATA["), (Modifiedhtml.IndexOf("//]]>") - Modifiedhtml.IndexOf(@"//<![CDATA[")));
    string[] stringSeparators = new string[] { "PdfCreator" };
    var baseUrl = HttpContext.Current.Request.Url.AbsoluteUri.Split(stringSeparators, StringSplitOptions.RemoveEmptyEntries).First();
    Modifiedhtml = Modifiedhtml.Replace(@"href=""../", (@"href=""" + baseUrl));
    Modifiedhtml = Modifiedhtml.Replace(@"href=""/", (@"href=""" + baseUrl));
    Doc theDoc = new Doc();
    theDoc.HtmlOptions.UseScript = false;
    //theDoc.Width = 1125;
    String s = string.Empty;
    //s = File.ReadAllText(@"D:\test.html");
    theDoc.Page = theDoc.AddPage();
    int theID;
    theID = theDoc.AddHtml(strHtmlCode);
    //theID = theDoc.AddHtml(s);
    while (true)
    {
        theDoc.FrameRect(); // add a black border
        if (!theDoc.Chainable(theID))
            break;
        theDoc.Page = theDoc.AddPage();
        theID = theDoc.AddImageToChain(theID);
    }
    for (int i = 1; i <= theDoc.PageCount; i++)
    {
        theDoc.PageNumber = i;
        theDoc.Flatten();
    }
    theDoc.Save(@"D:\two\pagedhtml4.pdf");
    theDoc.Clear();
    return String.Empty;
}
strHtmlCode is the HTML of the page that we have to convert to PDF.
Thanks in advance
From the WebSupergoo doc page on the AddHtml Function:
Adds a block of HTML styled text to the current page.
HTML styled text does not support CSS. For full featured, standard CSS, you want AddImageHtml.
You are passing strHtmlCode into the AddHtml function. It looks like you really want to pass in Modifiedhtml instead.
I have this HTML code
<div class="anc-style" onclick="window.open('./view.php?a=foo')"></div>
I'd like to extract the contents of the "onclick" attribute. I've attempted to do something like:
div.GetAttribute("onclick").ToString();
Which would ideally yield the string
"window.open('./view.php?a=foo')"
but it returns a System.__ComObject.
I'm able to get the class by changing ("onclick") to ("class"). What's going on with onclick?
HtmlElementCollection div = webBrowser1.Document.GetElementsByTagName("div");
for (int j = 0; j < div.Count; j++) {
    if (div[j].GetAttribute("class") == "anc-style") {
        richTextBox1.AppendText(div[j].GetAttribute("onclick").ToString());
    }
}
You can walk the document's tags and extract the data from each element's OuterHtml, as shown below using the HtmlDocument class. This is only an example.
string htmlText = "<html><head></head><body><div class=\"anc-style\" onclick=\"window.open('./view.php?a=foo')\"></div></body></html>";
WebBrowser wb = new WebBrowser();
wb.DocumentText = "";
wb.Document.Write(htmlText);
foreach (HtmlElement hElement in wb.Document.GetElementsByTagName("DIV"))
{
    //get start and end positions
    int iStartPos = hElement.OuterHtml.IndexOf("onclick=\"") + ("onclick=\"").Length;
    int iEndPos = hElement.OuterHtml.IndexOf("\">", iStartPos);
    //get our substring
    String s = hElement.OuterHtml.Substring(iStartPos, iEndPos - iStartPos);
    MessageBox.Show(s);
}
Also try using div[j]["onclick"]. What browser are you using?
I've created a jsfiddle that works; try it out and see if it's working for you:
http://jsfiddle.net/4ZwNs/102/
I want to visit multiple pages using ASP.NET 4.0, copy all of the HTML, and finally paste it into a text box. From there I would like to run my parsing function. What is the best way to handle this?
protected void goButton_Click(object sender, EventArgs e)
{
if (datacenterCombo.Text == "BL2")
{
fwURL = "http://website1.com/index.html";
l2URL = "http://website2.com/index.html";
lbURL = "http://website3.com/index.html";
l3URL = "http://website4.com/index.html";
coreURL = "http://website5.com/index.html";
WebRequest objRequest = HttpWebRequest.Create(fwURL);
WebRequest layer2 = HttpWebRequest.Create(l2URL);
objRequest.Credentials = CredentialCache.DefaultCredentials;
using (StreamReader layer2 = new StreamReader(layer2.GetResponse().GetResponseStream()))
using (StreamReader objReader = new StreamReader(objRequest.GetResponse().GetResponseStream()))
{
originalBox.Text = objReader.ReadToEnd();
}
objRequest = HttpWebRequest.Create(l2URL);
//Read all lines of file
String[] crString = { "<BR> " };
String[] aLines = originalBox.Text.Split(crString, StringSplitOptions.RemoveEmptyEntries);
String noHtml = String.Empty;
for (int x = 0; x < aLines.Length; x++)
{
if (aLines[x].Contains(ipaddressBox.Text))
{
noHtml += (RemoveHTML(aLines[x]) + "\r\n");
}
}
//Print results to textbox
resultsBox.Text = String.Join(Environment.NewLine, noHtml);
}
}
public static string RemoveHTML(string text)
{
text = text.Replace("&nbsp;", " ").Replace("<br>", "\n");
var oRegEx = new System.Text.RegularExpressions.Regex("<[^>]+>");
return oRegEx.Replace(text, string.Empty);
}
Instead of doing all this manually, you should probably use HtmlAgilityPack. Then you could do something like this:
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load("http://google.com");
var targetNodes = doc.DocumentNode
    .Descendants()
    .Where(x => x.ChildNodes.Count == 0
             && x.InnerText.Contains(someIpAddress));
foreach (var node in targetNodes)
{
    //do something
}
If HtmlAgilityPack is not an option for you, simplify at least the download portion of your code and use a WebClient:
using (WebClient wc = new WebClient())
{
    string html = wc.DownloadString("http://google.com");
}
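For the multiple-page case, the download and the filtering can be kept apart so that the filtering is easy to test on its own. The following is a rough sketch under assumptions, not a definitive implementation: PageFilter and MatchingLines are hypothetical names, the address 10.0.0.1 is a placeholder for whatever is typed into ipaddressBox, and the URLs are the placeholder ones from the question. WebClient is used to match the snippet above.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Text.RegularExpressions;

static class PageFilter
{
    // Splits the page on <br> tags, keeps the lines that mention the address,
    // and strips any remaining tags from those lines.
    public static List<string> MatchingLines(string html, string ipAddress)
    {
        var result = new List<string>();
        foreach (string line in Regex.Split(html, "<br\\s*/?>", RegexOptions.IgnoreCase))
        {
            if (line.Contains(ipAddress))
            {
                string text = Regex.Replace(line, "<[^>]+>", string.Empty);
                result.Add(WebUtility.HtmlDecode(text).Trim());
            }
        }
        return result;
    }

    static void Main()
    {
        // Placeholder URLs taken from the question; replace with the real pages.
        string[] urls = { "http://website1.com/index.html", "http://website2.com/index.html" };
        using (var wc = new WebClient())
        {
            foreach (string url in urls)
            {
                string html = wc.DownloadString(url);
                foreach (string line in MatchingLines(html, "10.0.0.1")) // hypothetical address
                {
                    Console.WriteLine(line);
                }
            }
        }
    }
}
```

Like the original code, this is a plain substring match, so searching for 10.0.0.1 also matches lines containing 10.0.0.10.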