Is it possible to get elements from a webpage? - C#

In my program, I'm using the WebBrowser control (C#) and I want to get all the elements from the current page as text. Can anyone help me?
Code:
HtmlElement htmlelement = webBrowser1.Document.GetElementById("html");
if (htmlelement == null)
{
}
else
{
    richTextBox1.Text = htmlelement.OuterText;
}
P.S. Can OuterHtml be used for this?

You can use the HTML Agility Pack
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load("http://www.example.com/");
HtmlNodeCollection tags = doc.DocumentNode.SelectNodes("//tag1//tag2");
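If the goal is just to get the whole page's content into the question's RichTextBox, a rough sketch along the same lines could look like this (the example.com URL is only a placeholder for whatever page the WebBrowser is showing):
// using HtmlAgilityPack;
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load("http://www.example.com/");
// InnerText drops the tags; use InnerHtml/OuterHtml if the markup itself is wanted.
richTextBox1.Text = doc.DocumentNode.InnerText;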

In Java I use the code below; some reformatting should get you the C# equivalent.
List<WebElement> webPageElements = driver.findElementsByTagName(webHTMLTagName);
// Loop over all webpage elements with the same tag type
for (WebElement webElement : webPageElements) {
    System.out.println(webElement.getAttribute("type"));
    System.out.println(webElement.getAttribute("name"));
    System.out.println(webElement.getAttribute("id"));
}
webHTMLTagName can be your HTML tag (e.g., "input").
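A roughly equivalent C# sketch with the Selenium WebDriver bindings (assuming a driver instance is already set up) might look like this:
// using OpenQA.Selenium;
var webPageElements = driver.FindElements(By.TagName(webHTMLTagName));
// Loop over all page elements with the same tag type
foreach (IWebElement webElement in webPageElements)
{
    Console.WriteLine(webElement.GetAttribute("type"));
    Console.WriteLine(webElement.GetAttribute("name"));
    Console.WriteLine(webElement.GetAttribute("id"));
}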

Related

C# Strip HTML Markup in XML

I really hope someone can help me with this issue. The solution should be in C#.
I have an XML file that is 36 MB in size and about 900k lines long. Some nodes contain a lot of HTML markup, as well as invalid markup like:
<Obs><p>
<jantes -="" .="" 22.000="" apenas="" exclusive="" kms.="" leve="" liga="" o=""> </jantes></p>
I've tried different ways to clean this file, but only one of them is able to perform the task. However, since this runs in a web application, it blocks the application, takes around 6 minutes to finish, and consumes around 450 MB of memory.
As this file is invalid XML, I cannot use XmlTextReader.
Using XSLT, based on "Strip HTML-like characters (not markup) from XML with XSLT?", I strangely also run into problems with HTML entities.
The process that worked (with some tweaks) is the one described at http://www.codeproject.com/Articles/19652/HTML-Tag-Stripper.
Thanks
Edit:
Following Kevin's suggestions, I'm trying to build a solution using HTML Agility Pack, at least to do some benchmarks.
However, I'm stuck. Imagine the following XML node:
<Obs><p> I WANT THIS TEXT<jantes -="" .="" 22.000="" apenas="" exclusive="" kms.="" leve="" liga="" o=""> </jantes></p></Obs>
How can I strip the tags inside the "Obs" tag, keep the "Obs" tag itself, and also keep the text "I WANT THIS TEXT"? Basically this:
<Obs>I WANT THIS TEXT</Obs>
For now, this is the code I have:
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(text);
Queue<HtmlNode> nodes = new Queue<HtmlNode>(doc.DocumentNode.SelectNodes("./*|./text()"));
while (nodes.Count > 0)
{
    HtmlNode node = nodes.Dequeue();
    HtmlNode parentNode = node.ParentNode;
    HtmlNodeCollection childNodes = node.SelectNodes("./*|./text()");
    if (childNodes != null)
    {
        foreach (HtmlNode child in childNodes)
        {
            if (child.Name != "obs")
            {
                nodes.Enqueue(child);
            }
            else
            {
                childNodes = child.SelectNodes("//p|//jantes");
                foreach (HtmlNode nodeToStrip in childNodes)
                    nodeToStrip.ParentNode.RemoveChild(nodeToStrip);
            }
        }
    }
}
string s = doc.DocumentNode.InnerHtml;
Thanks :)
EDIT 2
OK, I was able to complete the task. However, it is taking too much time: about 3 hours, and consuming 800 MB of memory.
Still need help!
Here is the code; it might help someone.
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(text);
Queue<HtmlNode> nodes = new Queue<HtmlNode>(doc.DocumentNode.SelectNodes("./*|./text()"));
while (nodes.Count > 0)
{
    HtmlNode node = nodes.Dequeue();
    HtmlNode parentNode = node.ParentNode;
    HtmlNodeCollection childNodes = node.SelectNodes("./*|./text()");
    if (childNodes != null)
    {
        foreach (HtmlNode child in childNodes)
        {
            if (child.Name != "obs")
            {
                nodes.Enqueue(child);
            }
            else
            {
                childNodes = child.SelectNodes("//p|//jantes");
                if (childNodes != null)
                {
                    foreach (HtmlNode nodeToStrip in childNodes)
                    {
                        var replacement = doc.CreateTextNode(nodeToStrip.InnerText);
                        nodeToStrip.ParentNode.ReplaceChild(replacement, nodeToStrip);
                    }
                }
            }
        }
    }
}
string s = doc.DocumentNode.InnerHtml;
Have you tried Html Agility Pack? Among its claims:
the parser is very tolerant with "real world" malformed HTML
you can fix a page the way you want, modify the DOM, add nodes, copy nodes, well... you name it
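For what it's worth, if the goal is only to keep the text inside each <Obs> element and drop everything else, it may be possible to avoid the queue and the document-wide //p|//jantes lookups entirely. A minimal sketch, assuming InnerText preserves the wanted text (the lower-case "//obs" matches the posted code, since the parser stores element names in lower case):
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(text);
HtmlNodeCollection obsNodes = doc.DocumentNode.SelectNodes("//obs");
if (obsNodes != null)
{
    foreach (HtmlNode obs in obsNodes)
    {
        // Replace all children of <Obs> with a single text node holding its plain text.
        string plainText = obs.InnerText;
        obs.RemoveAllChildren();
        obs.AppendChild(doc.CreateTextNode(plainText));
    }
}
string s = doc.DocumentNode.InnerHtml;
Since each "//p|//jantes" in the posted code starts from the document root, it rescans the whole 36 MB document for every <Obs> hit, which is likely where most of the 3 hours goes.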

How do you find and extract information from a table on a website?

I am very new to C# and specifically to HtmlAgilityPack, and I am having trouble getting information from websites. For example, I want to get the image URLs from the table on this website:
Serebii
This is what I am using to try to find and extract them:
string s = "http://www.serebii.net/pokedex-rs/005.shtml";
HtmlWeb hw = new HtmlWeb();
HtmlAgilityPack.HtmlDocument doc = hw.Load(s);
//HtmlNodeCollection items = doc.DocumentNode.SelectNodes("//a[@class='question-hyperlink']");
HtmlNodeCollection items = doc.DocumentNode.SelectNodes("//table//tr//td//div//table//tbody//tr//td//img");
foreach (HtmlNode item in items)
{
    Console.WriteLine(item.OuterHtml);
    MessageBox.Show(item.OuterHtml);
}
Console.ReadLine();
I am fairly certain I am way off the mark; any help would be appreciated.
You can only hope that the developer doesn't like to update the source code often.
var item = doc.DocumentNode.SelectSingleNode("//table//tr//tr//td//div//tr//img");
string imageSrc = item.GetAttributeValue("src", "");
Console.WriteLine(imageSrc);
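If more than one image is needed, the same idea extends to all matches; a small sketch that also guards against the XPath not matching anything:
var imgNodes = doc.DocumentNode.SelectNodes("//table//img");
if (imgNodes != null)
{
    foreach (HtmlNode img in imgNodes)
    {
        Console.WriteLine(img.GetAttributeValue("src", ""));
    }
}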

WPF WebBrowser HTMLDocument

I'm trying to inject some JavaScript code to prevent the JavaScript error popup, but I cannot find HTMLDocument and IHTMLScriptElement in WPF:
var doc = browser.Document as HTMLDocument;
if (doc != null)
{
    // Create the script element
    var scriptErrorSuppressed = (IHTMLScriptElement)doc.createElement("SCRIPT");
    scriptErrorSuppressed.type = "text/javascript";
    scriptErrorSuppressed.text = m_disableScriptError;
    // Inject it into the head of the page
    IHTMLElementCollection nodes = doc.getElementsByTagName("head");
    foreach (IHTMLElement elem in nodes)
    {
        var head = (HTMLHeadElement)elem;
        head.appendChild((IHTMLDOMNode)scriptErrorSuppressed);
    }
}
To clarify, Microsoft.mshtml isn't the 'using', it's a reference.
Full solution:
Add a project reference to Microsoft.mshtml
Add using mshtml;
I have solved the problem by referencing Microsoft.mshtml.
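For completeness, the injection usually has to run after the page has finished loading; a rough sketch of wiring it to the WPF WebBrowser's LoadCompleted event (the script body below is only an assumed placeholder for m_disableScriptError):
// using mshtml;  -- requires the Microsoft.mshtml project reference
browser.LoadCompleted += (sender, args) =>
{
    var doc = browser.Document as HTMLDocument;
    if (doc == null)
        return;

    var script = (IHTMLScriptElement)doc.createElement("SCRIPT");
    script.type = "text/javascript";
    script.text = "window.onerror = function () { return true; };"; // assumed error-suppressing script
    foreach (IHTMLElement elem in doc.getElementsByTagName("head"))
    {
        ((HTMLHeadElement)elem).appendChild((IHTMLDOMNode)script);
    }
};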

HTML Agility Pack - How can I append an element at the top of the Head element?

I'm trying to use HTML Agility Pack to append a script element into the top of the HEAD section of my HTML. The examples I have seen so far just use the AppendChild(element) method to accomplish this. I need the script that I am appending to the head section to come before some other scripts. How can I specify this?
Here's what I'm trying:
HtmlDocument htmlDocument = new HtmlDocument();
htmlDocument.Load(filePath);
HtmlNode head = htmlDocument.DocumentNode.SelectSingleNode("/html/head");
HtmlNode stateScript = htmlDocument.CreateElement("script");
head.AppendChild(stateScript);
stateScript.SetAttributeValue("id", "applicationState");
stateScript.InnerHtml = "'{\"uid\":\"testUser\"}'";
I would like a script tag to be added toward the top of HEAD rather than appended at the end.
Realizing that this is an old question, there is also the possibility of prepending child elements, which might not have existed back then.
// Load content as new Html document
HtmlDocument html = new HtmlDocument();
html.LoadHtml(oldContent);
// Wrapper acts as a root element
string newContent = "<div>" + someHtml + "</div>";
// Create new node from newcontent
HtmlNode newNode = HtmlNode.CreateNode(newContent);
// Get body node
HtmlNode body = html.DocumentNode.SelectSingleNode("//body");
// Add new node as first child of body
body.PrependChild(newNode);
// Get contents with new node
string contents = html.DocumentNode.InnerHtml;
Got it..
HtmlNode has the following methods:
HtmlNode.InsertBefore(node, refNode)
HtmlNode.InsertAfter(nodeToAdd, refNode)
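So, for the question above, either of these should put the state script ahead of any existing scripts in HEAD (a short sketch reusing the head and stateScript nodes from the question):
// Simplest: make it the very first child of <head>.
head.PrependChild(stateScript);

// Or insert it right before the first existing <script> element, if any.
HtmlNode firstScript = head.SelectSingleNode("./script");
if (firstScript != null)
    head.InsertBefore(stateScript, firstScript);
else
    head.AppendChild(stateScript);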

Will I use HtmlDocument even if I want to parse the HTML string using HtmlAgilityPack?

I'm working in C#. I'm trying to extract the first instance of the img tag from an HTML string (which is actually post data).
This is my code:
private string GrabImage(string htmlContent)
{
    String firstImage;
    HtmlAgilityPack.HtmlDocument htmlDoc = new HtmlAgilityPack.HtmlDocument();
    htmlDoc.LoadHtml(htmlContent);
    HtmlAgilityPack.HtmlNode imageNode = htmlDoc.DocumentNode.SelectSingleNode("//img");
    if (imageNode != null)
    {
        return firstImage = imageNode.ToString();
    }
    else
        return firstImage = " ";
}
But htmlDoc ends up null. Should I use the HtmlDocument type even if I'm trying to parse the HTML from a string?
P.S. By the way, is this the correct way of grabbing the first instance of an image tag from my HTML string?
Using the HTML you provided, I made this console application.
static void Main(string[] args)
{
    var image = GrabImage("<h2>How to learn Photoshop</h2><p> Its link</p><br /> <img src=\"image.jpg\" alt=\"image\"/>");
    Console.WriteLine(image);
    Console.ReadLine();
}

private static string GrabImage(string htmlContent)
{
    String firstImage;
    HtmlAgilityPack.HtmlDocument htmlDoc = new HtmlAgilityPack.HtmlDocument();
    htmlDoc.LoadHtml(htmlContent);
    HtmlAgilityPack.HtmlNode imageNode = htmlDoc.DocumentNode.SelectSingleNode("//img");
    if (imageNode != null)
    {
        firstImage = imageNode.OuterHtml.ToString();
    }
    else
        firstImage = " ";
    return firstImage;
}
I'm unable to find the problem you were describing. Could you show where you called the GrabImage method?
For the P.S. part, you'll want to make sure the imageNode's HTML text is returned, not the name of the object.
I'll try to add an additional part for the document when I'm at a computer with the Agility Pack available.
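As a side note on the P.S.: if only the image URL is wanted rather than the whole <img> tag, the src attribute can be read off the node directly, e.g.:
string imageSrc = imageNode.GetAttributeValue("src", string.Empty);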
