I have this HTML code:
<div class="anc-style" onclick="window.open('./view.php?a=foo')"></div>
I'd like to extract the contents of the "onclick" attribute. I've attempted to do something like:
div.GetAttribute("onclick").ToString();
Which would ideally yield the string
"window.open('./view.php?a=foo')"
but it returns a System.__ComObject.
I'm able to get the class by changing ("onclick") to ("class"). What's going on with the onclick?
HtmlElementCollection div = webBrowser1.Document.GetElementsByTagName("div");
for (int j = 0; j < div.Count; j++) {
if (div[j].GetAttribute("class") == "anc-style") {
richTextBox1.AppendText(div[j].GetAttribute("onclick").ToString());
}
}
You can pull the document tags and extract the data using the HtmlDocument class, as below. This is only an example:
string htmlText = "<html><head></head><body><div class=\"anc-style\" onclick=\"window.open('./view.php?a=foo')\"></div></body></html>";
WebBrowser wb = new WebBrowser();
wb.DocumentText = "";
wb.Document.Write(htmlText);
foreach (HtmlElement hElement in wb.Document.GetElementsByTagName("DIV"))
{
//get start and end positions
int iStartPos = hElement.OuterHtml.IndexOf("onclick=\"") + ("onclick=\"").Length;
int iEndPos = hElement.OuterHtml.IndexOf("\">",iStartPos);
//get our substring
String s = hElement.OuterHtml.Substring(iStartPos, iEndPos - iStartPos);
MessageBox.Show(s);
}
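The IndexOf approach above assumes that onclick is the last attribute in the tag, because it searches for "\">" as the end marker. A slightly more tolerant sketch is a regular expression that reads only the quoted value. The class name OnclickExtractor is made up for illustration, and it assumes the attribute is quoted in OuterHtml, which the browser control does not guarantee.

```csharp
using System.Text.RegularExpressions;

public static class OnclickExtractor
{
    // Matches onclick="..." or onclick='...' and captures the quoted value.
    // Assumption: the attribute value is quoted; unquoted values are not handled.
    private static readonly Regex OnclickPattern = new Regex(
        "onclick\\s*=\\s*(?:\"(?<value>[^\"]*)\"|'(?<value>[^']*)')",
        RegexOptions.IgnoreCase);

    // Returns the onclick value, or null when no quoted onclick attribute is found.
    public static string Extract(string outerHtml)
    {
        Match match = OnclickPattern.Match(outerHtml ?? string.Empty);
        return match.Success ? match.Groups["value"].Value : null;
    }
}
```

Inside the loop you would call OnclickExtractor.Extract(hElement.OuterHtml) and skip the element when the result is null.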
Also try using div[j]["onclick"]. What browser are you using?
I've created a jsfiddle that works; try it out and see if it works for you:
http://jsfiddle.net/4ZwNs/102/
Related
I'm using Html Agility Pack on a website to extract some data. Parsing some of the HTML I need is easy but I am having trouble with this (slightly complex?) piece of HTML.
<tr>
<td>
<div onmouseover="toggle('clue_J_1_1', 'clue_J_1_1_stuck', '&lt;em class=&quot;correct_response&quot;&gt;Obama&lt;/em&gt;&lt;br /&gt;&lt;br /&gt;&lt;table width=&quot;100%&quot;&gt;&lt;tr&gt;&lt;td class=&quot;right&quot;&gt;Kailyn&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;')" onmouseout="toggle('clue_J_1_1', 'clue_J_1_1_stuck', 'Michelle LaVaughn Robinson')" onclick="togglestick('clue_J_1_1_stuck')">
...
I need to get the value from the em with class "correct_response" inside the div's onmouseover attribute, based on the clue_J_X_Y value. I really don't know how to go beyond this:
HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//tr//td/div[@onmouseover]");
Some help would be appreciated.
I don't know what you're supposed to get out from the em. But I will give you all the data you say you need to figure it out.
First we load the HTML.
string html = "<tr>" +
"<td>" +
"<div onmouseover = \"toggle('clue_J_1_1', 'clue_J_1_1_stuck', '&lt;em class=&quot;correct_response&quot;&gt;Obama&lt;/em&gt;&lt;br/&gt;&lt;br/&gt;&lt;table width=&quot;100%&quot;&gt;&lt;tr&gt;&lt;td class=&quot;right&quot;&gt;Kailyn&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;')\" onmouseout = \"toggle('clue_J_1_1', 'clue_J_1_1_stuck', 'Michelle LaVaughn Robinson')\" onclick = \"togglestick('clue_J_1_1_stuck')\"></div></td></tr>";
HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);
//Console.WriteLine(doc.DocumentNode.OuterHtml);
Then we get the value of the attribute, onmouseover.
string toggle = doc.DocumentNode.SelectSingleNode("//tr//td/div[@onmouseover]").GetAttributeValue("onmouseover", "FAILED");
It returns "FAILED" if no attribute named "onmouseover" is found. Next we collect the parameters of the toggle call, each of which is enclosed in apostrophes (').
//Get Variables from toggle()
List<string> toggleVariables = new List<string>();
bool flag = false; string temp = "";
for(int i=0; i<toggle.Length; i++)
{
if (toggle[i] == '\'' && flag== true)
{
toggleVariables.Add(temp);
temp = "";
flag = false;
}
else if (flag)
{
temp += toggle[i];
}
else if (toggle[i] == '\'')
{
flag = true;
}
}
After that we have a list with three entries. In this case it contains the following:
clue_J_1_1
clue_J_1_1_stuck
&lt;em class=&quot;correct_response&quot;&gt;Obama&lt;/em&gt;&lt;br/&gt;&lt;br/&gt;&lt;table width=&quot;100%&quot;&gt;&lt;tr&gt;&lt;td class=&quot;right&quot;&gt;Kailyn&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
Now we can create a new HtmlDocument with the HTML code from the third parameter. But first we have to convert it into workable HTML since the third parameter contains escape characters from HTML.
//Make it into workable HTML
toggleVariables[2] = HttpUtility.HtmlDecode(toggleVariables[2]);
//New HtmlDocument
HtmlDocument htmlInsideToggle = new HtmlDocument();
htmlInsideToggle.LoadHtml(toggleVariables[2]);
Console.WriteLine(htmlInsideToggle.DocumentNode.OuterHtml);
And done. The code in its entirety is below.
using System;
using System.Collections.Generic;
using System.IO;
using System.Net;
using HtmlAgilityPack;
using System.Web;
namespace test
{
class Program
{
public static void Main(string[] args)
{
string html = "<tr>" +
"<td>" +
"<div onmouseover = \"toggle('clue_J_1_1', 'clue_J_1_1_stuck', '&lt;em class=&quot;correct_response&quot;&gt;Obama&lt;/em&gt;&lt;br/&gt;&lt;br/&gt;&lt;table width=&quot;100%&quot;&gt;&lt;tr&gt;&lt;td class=&quot;right&quot;&gt;Kailyn&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;')\" onmouseout = \"toggle('clue_J_1_1', 'clue_J_1_1_stuck', 'Michelle LaVaughn Robinson')\" onclick = \"togglestick('clue_J_1_1_stuck')\"></div></td></tr>";
HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);
//Console.WriteLine(doc.DocumentNode.OuterHtml);
string toggle = doc.DocumentNode.SelectSingleNode("//tr//td/div[@onmouseover]").GetAttributeValue("onmouseover", "FAILED");
//Clean up string
//Console.WriteLine(toggle);
//Get Variables from toggle()
List<string> toggleVariables = new List<string>();
bool flag = false; string temp = "";
for(int i=0; i<toggle.Length; i++)
{
if (toggle[i] == '\'' && flag== true)
{
toggleVariables.Add(temp);
temp = "";
flag = false;
}
else if (flag)
{
temp += toggle[i];
}
else if (toggle[i] == '\'')
{
flag = true;
}
}
//Make it into workable HTML
toggleVariables[2] = HttpUtility.HtmlDecode(toggleVariables[2]);
//New HtmlDocument
HtmlDocument htmlInsideToggle = new HtmlDocument();
htmlInsideToggle.LoadHtml(toggleVariables[2]);
Console.WriteLine(htmlInsideToggle.DocumentNode.OuterHtml);
//You're on your own from here
Console.ReadKey();
}
}
}
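As an alternative to the character loop above, the quoted arguments can be collected with a regular expression and decoded in the same pass. This is only a sketch: ToggleArguments is a name made up here, it uses WebUtility.HtmlDecode from System.Net instead of HttpUtility so that no System.Web reference is needed, and it assumes no argument contains a literal apostrophe.

```csharp
using System.Collections.Generic;
using System.Net;
using System.Text.RegularExpressions;

public static class ToggleArguments
{
    // Returns every single-quoted argument found in the handler text,
    // with HTML entities such as &lt; and &quot; decoded.
    // Assumption: arguments never contain a raw apostrophe.
    public static List<string> Parse(string handler)
    {
        var arguments = new List<string>();
        foreach (Match match in Regex.Matches(handler, "'([^']*)'"))
        {
            arguments.Add(WebUtility.HtmlDecode(match.Groups[1].Value));
        }
        return arguments;
    }
}
```

With the onmouseover value from the question, ToggleArguments.Parse(toggle)[2] would be the decoded markup, ready to pass to LoadHtml.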
I get this error:
The given path's format is not supported.
It works properly when I use driver.Title instead of links[i], but many pages share the same title, so I would rather use the href. I guess you can't use ":" or "/" in a file name. How can I simplify the href so that I no longer get the "not supported path" error?
int linkCount = driver.FindElements(By.CssSelector("a[href]")).Count;
string[] links = new string[linkCount];
List<IWebElement> linksToClick = driver.FindElements(By.CssSelector("a[href]")).ToList();
for (int i = 0; i < linkCount; i++)
{
links[i] = linksToClick[i].GetAttribute("href");
}
for (int i = 0; i < linkCount; i++)
{
driver.Navigate().GoToUrl(links[i]);
ITakesScreenshot screenshotDriver = driver as ITakesScreenshot;
Screenshot screenCapture = screenshotDriver.GetScreenshot();
screenCapture.SaveAsFile(Path.Combine(testPath, links[i] +"_"+ testScreenshotTitle),
System.Drawing.Imaging.ImageFormat.Png);
}
If the goal is to get the list of links on the page except for specific links, maybe this would work better:
using System.Linq;
var blackList = new[] { "LogOff", ... };
var links = driver
.FindElements(By.CssSelector("a[href]"))
.Select(a => a.GetAttribute("href"))
.Where(u => !blackList.Any(s => u.Contains(s)));
foreach (string link in links)
{
...
}
Update
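So, to sanitize a file name:
foreach (string link in links)
{
// sanitize only the file name, not the directory, so path separators in testPath survive
var fileName = link + "_" + testScreenshotTitle;
foreach (char c in Path.GetInvalidFileNameChars())
{
fileName = fileName.Replace(c, '_');
}
var filePath = Path.Combine(testPath, fileName);
...
}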
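Note that Path.GetInvalidFileNameChars() returns a different set on each platform. If you want the same file names everywhere, one option is to list the characters yourself. A minimal sketch, where ScreenshotNames and its method are hypothetical names and the character list is the set Windows rejects in file names:

```csharp
using System.Linq;

public static class ScreenshotNames
{
    // Characters Windows does not allow in file names, listed explicitly so
    // the result does not depend on the platform the code runs on.
    private static readonly char[] Reserved = { '<', '>', ':', '"', '/', '\\', '|', '?', '*' };

    // Replaces reserved and control characters in the URL with '_' and appends the suffix.
    public static string FromUrl(string url, string suffix)
    {
        char[] cleaned = url
            .Select(c => Reserved.Contains(c) || char.IsControl(c) ? '_' : c)
            .ToArray();
        return new string(cleaned) + "_" + suffix;
    }
}
```

The result would then go through Path.Combine(testPath, ...) as before. Two different URLs can still map to the same name after replacement, and very long URLs may exceed the path length limit, so this is not a complete solution.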
private void ParseFilesNames()
{
using (WebClient client = new WebClient())
{
try
{
for (int i = 0; i < 15; i++)
{
string urltoparse = "mysite.com/gallery/albums/from_old_gallery/" + i;
string s = client.DownloadString(urltoparse);
int index = -1;
while (true)
{
string firstTag = "HREF=";
string secondtag = ">";
index = s.IndexOf(firstTag, 0);
int endIndex = s.IndexOf(secondtag, index);
if (index < 0)
{
break;
}
else
{
string filename = s.Substring(index + firstTag.Length, endIndex - index - firstTag.Length);
}
}
}
}
catch (Exception err)
{
}
}
}
The problem is with the Substring. index + firstTag.Length, endIndex - index - firstTag.Length
This is wrong.
What I need to get is the string between: HREF=" and ">
The whole string looks like: HREF="myimage.jpg">
I need to get only "myimage.jpg"
And sometimes it can be "myimage465454.jpg" so in any case I need to get only the file name. Only "myimage465454.jpg".
What should I change in the substring?
If you are sure that your string will always have the form HREF="yourpath">, just apply the following:
string yourInitialString = @"HREF=""myimage.jpg"">";
string parsedString = yourInitialString.Replace(@"HREF=""", "").Replace(@""">", "");
If you need to parse the href values of HTML links, the best option is the HtmlAgilityPack library.
Solution with Html Agility Pack:
HtmlWeb htmlWeb = new HtmlWeb();
HtmlAgilityPack.HtmlDocument doc = htmlWeb.Load(Url);
foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//a[@href]"))
{
// Get the value of the HREF attribute
string hrefValue = link.GetAttributeValue( "href", string.Empty );
}
To install HtmlAgilityPack, run the following command in the Package Manager Console:
PM> Install-Package HtmlAgilityPack
Hope it helps.
Try this:
string filename = input.Split('=')[1].Replace("\"", "").Replace(">", "");
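If you want to stay with IndexOf and Substring, the start marker has to include the opening quote, the length has to be measured from the start of the value, and each search has to continue from where the previous match ended (otherwise the loop finds the first link forever). A rough sketch, assuming every link really has the form HREF="..."> and using the made-up name HrefScanner:

```csharp
using System;
using System.Collections.Generic;

public static class HrefScanner
{
    // Collects every value found between HREF=" and "> in the given text.
    // Assumption: the href is the last attribute in its tag and is double-quoted.
    public static List<string> FindAll(string html)
    {
        const string startTag = "HREF=\"";
        const string endTag = "\">";
        var results = new List<string>();
        int searchFrom = 0;
        while (true)
        {
            int start = html.IndexOf(startTag, searchFrom, StringComparison.OrdinalIgnoreCase);
            if (start < 0) break;                  // no more links
            int valueStart = start + startTag.Length;
            int end = html.IndexOf(endTag, valueStart, StringComparison.Ordinal);
            if (end < 0) break;                    // unterminated tag
            results.Add(html.Substring(valueStart, end - valueStart));
            searchFrom = end + endTag.Length;      // continue after this match
        }
        return results;
    }
}
```

In ParseFilesNames you would call HrefScanner.FindAll(s) once per downloaded page instead of running the inner while loop.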
Problem:
I am passing HTML to ABCpdf to create a PDF.
However, the CSS is not applied to the content, and the resulting PDF is not as expected.
Here is my code. Can you please suggest what the problem is, or how we can apply the CSS?
public static String CreateHtmlFile(String strHtmlCode)
{
String Modifiedhtml = @"<!DOCTYPE html PUBLIC ""-//W3C//DTD XHTML 1.0 Transitional//EN"" ""http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd""><html class="" _Telerik_IE9"" xmlns=""http://www.w3.org/1999/xhtml"">" + strHtmlCode;
Modifiedhtml = Modifiedhtml.Remove(Modifiedhtml.IndexOf(@"//<![CDATA["), (Modifiedhtml.IndexOf("//]]>") - Modifiedhtml.IndexOf(@"//<![CDATA[")));
string[] stringSeparators = new string[] { "PdfCreator" };
var baseUrl = HttpContext.Current.Request.Url.AbsoluteUri.Split(stringSeparators, StringSplitOptions.RemoveEmptyEntries).First();
Modifiedhtml = Modifiedhtml.Replace(@"href=""../", (@"href=""" + baseUrl));
Modifiedhtml = Modifiedhtml.Replace(@"href=""/", (@"href=""" + baseUrl));
Doc theDoc = new Doc();
theDoc.HtmlOptions.UseScript = false;
//theDoc.Width = 1125;
String s = string.Empty;
//s = File.ReadAllText(@"D:\test.html");
theDoc.Page = theDoc.AddPage();
int theID;
theID = theDoc.AddHtml(strHtmlCode);
//theID = theDoc.AddHtml(s);
while (true)
{
theDoc.FrameRect(); // add a black border
if (!theDoc.Chainable(theID))
break;
theDoc.Page = theDoc.AddPage();
theID = theDoc.AddImageToChain(theID);
}
for (int i = 1; i <= theDoc.PageCount; i++)
{
theDoc.PageNumber = i;
theDoc.Flatten();
}
theDoc.Save(@"D:\two\pagedhtml4.pdf");
theDoc.Clear();
return String.Empty;
}
strHtmlCode is the HTML of the page that we have to convert to PDF.
Thanks in advance
From the WebSupergoo doc page on the AddHtml Function:
Adds a block of HTML styled text to the current page.
HTML styled text does not support CSS. For full featured, standard CSS, you want AddImageHtml.
You are passing strHtmlCode into the AddHtml function. It looks like you really want to pass in Modifiedhtml instead.
I'm using html agility pack to parse several text files that I load. I then save the data that I parse out into a string list for further processing. However, when I use this method, it never hits the line:
MessageBox.Show("test");
Additionally, if I include any other code following this method, none of it is triggered.
Does anyone have any suggestions as to my error?
The entire method is included below:
private void ParseOutput()
{
nodeDupList = new List<string>();
StreamWriter OurStream;
OurStream = File.CreateText(dir + @"\CombinedPages.txt");
OurStream.Close();
for (int crawl = 1; crawl <= crawlPages.Length; crawl++)
{
var web = new HtmlWeb();
var doc = web.Load(dir + @"\Pages" + crawl.ToString() + ".txt");
var nodeCount = doc.DocumentNode.SelectNodes(@"/html[1]/body[1]/div[1]/table[3]/tbody[1]/tr[td/@class=""style_23""]");
int nCount = nodeCount.Count;
for (int a = 3; a <= nCount; a++)
{
var specContent = doc.DocumentNode.SelectNodes(@"/html[1]/body[1]/div[1]/table[3]/tbody[1]/tr[" + a + @"]/td[3]/div[contains(@class,'style_24')]");
foreach (HtmlNode node in specContent)
{
nodeDupList.Add(node.InnerText + ".d");
}
}
}
MessageBox.Show("test");
}
I've created a crawler to save multiple html pages to text and parse them separately using this method.
I'm just using MessageBox to show that it won't continue following the "for loop". I've called multiple methods in my solution and it won't iterate through them.
The application is a Win Forms Application targeted at .Net Framework 4.
Edit:
Thanks for the help.
I realized after rerunning it through the debugger that it was crashing at times on the loop
for (int a = 3; a <= nCount; a++)
{
var specContent = doc.DocumentNode.SelectNodes(@"/html[1]/body[1]/div[1]/table[3]/tbody[1]/tr[" + a + @"]/td[3]/div[contains(@class,'style_24')]");
foreach (HtmlNode node in specContent)
{
nodeDupList.Add(node.InnerText + ".d");
}
}
when the var specContent was null.
There was no exception generated; the method just ended.
Because the website I was crawling is dynamic, the query rarely returned null, but it did on several occasions, and that is when this happened.
The solution, for anyone who might need it, is to check whether specContent is null before iterating over it:
for (int a = 3; a <= nCount; a++)
{
var specContent = doc.DocumentNode.SelectNodes(@"/html[1]/body[1]/div[1]/table[3]/tbody[1]/tr[" + a + @"]/td[3]/div[contains(@class,'style_24')]");
if(specContent !=null) //added this check for null
{
foreach (HtmlNode node in specContent)
{
nodeDupList.Add(node.InnerText + ".d");
}
}
}
I could also have used a try{} catch{} block to output the error if needed.
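Since SelectNodes hands back null rather than an empty collection when nothing matches, another way to write the guard is a small extension method that turns null into an empty sequence. This is just a sketch; OrEmpty is a helper name made up here and is not part of Html Agility Pack.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class EnumerableExtensions
{
    // Returns the sequence itself, or an empty sequence when it is null,
    // so a foreach over the result never throws NullReferenceException.
    public static IEnumerable<T> OrEmpty<T>(this IEnumerable<T> source)
    {
        return source ?? Enumerable.Empty<T>();
    }
}
```

The inner loop would then read foreach (HtmlNode node in specContent.OrEmpty()) without the separate if. The same guard would be needed for nodeCount, which comes from SelectNodes as well.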