Cannot get HTML Attributes from HTML file (C#/WinForms) - c#

I have the following code that I managed to come up with:
private void button1_Click(object sender, EventArgs e)
{
    HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
    using (var o = new OpenFileDialog())
    {
        if (o.ShowDialog() == DialogResult.OK)
            doc.Load(o.FileName);
    }
    foreach (HtmlAgilityPack.HtmlAttribute att in doc.DocumentNode.Attributes)
    {
        label1.Text += Environment.NewLine +
            att.Name + " " + att.Value;
    }
}
But it's not doing anything. There are no errors, no exceptions, and it compiles and runs. But, as you can see, from inside the foreach loop, it is supposed to keep adding found attributes and their values to the label1.Text control, but it isn't. Nothing happens!
Am I doing something wrong? Can someone please help?
Thank you

By iterating over doc.DocumentNode.Attributes, you are reading the attributes of the document root (DocumentNode), which is only a placeholder that contains your <html> tag (and possibly some adjacent nodes such as comments and white space). That placeholder has no attributes of its own, so the loop body never runs.
What are you trying to extract exactly?
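If the goal is to list the attributes of every element in the file, one option is to walk the descendants of the root instead of the root itself. A minimal sketch, assuming the HtmlAgilityPack package is referenced; the AttributeLister class and ListAttributes method are names made up for this illustration:

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

// Hypothetical helper, not part of HtmlAgilityPack.
public static class AttributeLister
{
    // Returns one "tag: name = value" line per attribute found in the document.
    public static List<string> ListAttributes(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var lines = new List<string>();
        // DocumentNode itself carries no attributes; its descendant elements do.
        foreach (HtmlNode node in doc.DocumentNode.Descendants())
        {
            if (node.NodeType != HtmlNodeType.Element)
                continue; // skip text and comment nodes

            foreach (HtmlAttribute att in node.Attributes)
            {
                lines.Add(node.Name + ": " + att.Name + " = " + att.Value);
            }
        }
        return lines;
    }
}
```

In the click handler this could be used as label1.Text = string.Join(Environment.NewLine, AttributeLister.ListAttributes(System.IO.File.ReadAllText(o.FileName)));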

Related

Windows application to pull through xml data based on xpath with multiple conditions

I'm trying to create a program that reads XML data and pulls out values based on an XPath input. I have a couple of questions I need help with:
1) It only pulls through the first matching node, whereas I want it to pull through all of them. My code is below; can you advise what I need to change so that it returns all the data matching that path rather than just the first result?
2) I have added a third textbox (not coded in yet) where you can enter a condition that works together with the first XPath value. An example is below:
/Clients/Client1/Name (this is the value)
Clients/Client1[Name = 'Smith'] (this is the condition)
I don't know how to code this part so that it works with the first text box, which holds the value path.
private void button1_Click(object sender, EventArgs e)
{
    if (!string.IsNullOrEmpty(lPrompt.Text) && !string.IsNullOrEmpty(textBox1.Text))
    {
        try
        {
            XmlDocument xdoc = new XmlDocument();
            xdoc.Load(tbFile.Text);
            XPathNavigator xnav = xdoc.CreateNavigator();
            bool innerXml = checkBox2.Checked;
            if (innerXml)
                textBox2.Text = xnav.SelectSingleNode(textBox1.Text).InnerXml.ToString();
            else
                textBox2.Text = xnav.SelectSingleNode(textBox1.Text).Value.ToString();
        }
        catch (System.Exception ex)
        {
            textBox2.Text = ex.Message;
            textBox2.ForeColor = System.Drawing.Color.Maroon;
        }
    }
}
I know there are multiple programs I could use, such as XPath Visualizer, but I need this to be very specific to what I'm working on, so I need a separate box where I can add conditions.
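On the first point, SelectSingleNode only ever returns the first match, while SelectNodes returns every match. On the second, an XPath predicate can sit in the same expression as the value path, for example /Clients/Client1[Name='Smith']/Name. A rough sketch of both ideas, using only System.Xml; the XPathQuery class, the SelectAll method, and the City element in the usage note are invented for illustration and are not from the question:

```csharp
using System.Collections.Generic;
using System.Xml;

// Hypothetical helper for illustration only.
public static class XPathQuery
{
    // Returns the text of every node matching the expression, not only the first.
    public static List<string> SelectAll(XmlDocument doc, string xpath)
    {
        var results = new List<string>();
        XmlNodeList nodes = doc.SelectNodes(xpath);
        foreach (XmlNode node in nodes)
        {
            results.Add(node.InnerText);
        }
        return results;
    }
}
```

With a document containing two Client1 elements, SelectAll(xdoc, "/Clients/Client1/Name") would return both names, and SelectAll(xdoc, "/Clients/Client1[Name='Smith']/City") would return only the City of the client named Smith. The results could then be joined with Environment.NewLine and assigned to textBox2.Text.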

C# Save all items in a ListBox to text file

Recently I've been quite enjoying C# and I'm just testing with it but there seems to be one part I don't get.
Basically, I want the SAVE button to save all the items in the listbox to a text file when I click it. At the moment, all that ends up in the file is System.Windows.Forms.ListBox+ObjectCollection.
Here's what I've got so far. I've tried many different things in place of SaveFile.WriteLine(listBox1.Items); and I can't seem to figure it out. Also keep in mind that in the finished program I would like to read that text file back and put its contents into the listbox. If this isn't possible then my bad, I am new to C# after all ;)
private void btn_Save_Click(object sender, EventArgs e)
{
    const string sPath = "save.txt";
    System.IO.StreamWriter SaveFile = new System.IO.StreamWriter(sPath);
    SaveFile.WriteLine(listBox1.Items);
    SaveFile.ToString();
    SaveFile.Close();
    MessageBox.Show("Programs saved!");
}
From your code
SaveFile.WriteLine(listBox1.Items);
your program actually does this:
SaveFile.WriteLine(listBox1.Items.ToString());
The .ToString() method of the Items collection returns the type name of the collection (System.Windows.Forms.ListBox+ObjectCollection) as this is the default .ToString() behavior if the method is not overridden.
To save the data in a meaningful way, you need to loop through each item and write it out the way you need. Here is some example code; I am assuming your items have an appropriate .ToString() implementation:
// The using block closes (and flushes) the writer when the loop is done.
using (var SaveFile = new System.IO.StreamWriter(sPath))
{
    foreach (var item in listBox1.Items)
    {
        SaveFile.WriteLine(item.ToString());
    }
}
Items is a collection, so you should iterate through all of its items to save them:
private void btn_Save_Click(object sender, EventArgs e)
{
    const string sPath = "save.txt";
    System.IO.StreamWriter SaveFile = new System.IO.StreamWriter(sPath);
    foreach (var item in listBox1.Items)
    {
        SaveFile.WriteLine(item);
    }
    SaveFile.Close();
    MessageBox.Show("Programs saved!");
}
There is a one-line solution to the problem (it assumes every item in the list box is a string and needs using System.Linq):
System.IO.File.WriteAllLines(path, Listbox.Items.Cast<string>().ToArray());
Put your file path and name and your ListBox name in the code above.
Example:
In the example below, the path and name of the file is D:\sku3.txt and the list box name is lb.
System.IO.File.WriteAllLines(@"D:\sku3.txt", lb.Items.Cast<string>().ToArray());
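The question also asks about reading the file back into the listbox later. A small sketch of a save/load pair, assuming .NET 4 or later; ListBoxFile, Save and Load are hypothetical names, and the methods take a plain IEnumerable so they work with ListBox.Items without depending on WinForms:

```csharp
using System.Collections;
using System.IO;
using System.Linq;

// Hypothetical helper for illustration only.
public static class ListBoxFile
{
    // Writes one line per item, using each item's ToString().
    public static void Save(IEnumerable items, string path)
    {
        File.WriteAllLines(path, items.Cast<object>().Select(item => item.ToString()));
    }

    // Reads the file back as one string per line.
    public static string[] Load(string path)
    {
        return File.ReadAllLines(path);
    }
}
```

Usage in the form would look something like ListBoxFile.Save(listBox1.Items, "save.txt"); and, to restore, listBox1.Items.Clear(); listBox1.Items.AddRange(ListBoxFile.Load("save.txt")); Using Cast<object>() with ToString() rather than Cast<string>() avoids an InvalidCastException when the items are not strings.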

Get links of a html document in order

I want to get all the links of a HTML document. This isn't a problem, but apparently it puts all the links in an alphabetic order before storing them in an array one by one. I want to have the links in original order (not in alphabetic).
So is there any possibility to get the first found link, store it, then the second one,...? I already tried using HtmlAgilityPack and the Webbrowser-Control methods, but both order them alphabetically. The original order is important for later purposes.
I heard that it might be possible with Regex, but I've found enough answers, where they say that you shouldn't use it for HTML parsing. So how can I do it?
Here's the Webbrowser-Control code, I tried to use to get the links and store them into an array:
private void btnGet_Click(object sender, EventArgs e)
{
    HtmlWindow mainFrame = webFl.Document.Window.Frames["mainFrame"];
    HtmlElementCollection links = mainFrame.Document.Links;
    foreach (HtmlElement link in links)
    {
        string linkText = link.OuterHtml;
        if (linkText.Contains("puzzle"))
        {
            arr[i] = linkText;
            i++;
        }
    }
}
Thank you in advance,
Opak
You can get the correct order by walking the DOM tree using HTML DOM API. The following code does this. Note, I use dynamic to access DOM API. That's because WebBrowser's HtmlElement.FirstChild/HtmlElement.NextSibling don't work for this purpose, as they return null for DOM text nodes.
private void btnGet_Click(object sender, EventArgs e)
{
    Action<object> walkTheDom = null;
    var links = new List<object>();
    // element.FirstChild / NextSibling don't work as they stop at DOM text nodes
    walkTheDom = (element) =>
    {
        dynamic domElement = element;
        if (domElement.tagName == "A")
            links.Add(domElement);
        for (dynamic child = domElement.firstChild; child != null; child = child.nextSibling)
        {
            if (child.nodeType == 1) // Element node?
                walkTheDom(child);
        }
    };
    walkTheDom(this.webBrowser.Document.Body.DomElement);
    string html = links.Aggregate(String.Empty, (a, b) => a + ((dynamic)b).outerHtml + Environment.NewLine);
    MessageBox.Show(html);
}
[UPDATE] If you really need to get a list of HtmlElement objects for <A> tags, instead of dynamic native elements, that's still possible with a little trick using GetElementById:
private void btnGet_Click(object sender, EventArgs e)
{
    // element.FirstChild / NextSibling don't work because they stop on DOM text nodes
    var links = new List<HtmlElement>();
    var document = this.webBrowser.Document;
    dynamic domDocument = document.DomDocument;
    Action<dynamic> walkTheDom = null;
    walkTheDom = (domElement) =>
    {
        if (domElement.tagName == "A")
        {
            // get HtmlElement for the found <A> tag
            string savedId = domElement.id;
            string uniqueId = domDocument.uniqueID;
            domElement.id = uniqueId;
            links.Add(document.GetElementById(uniqueId));
            if (savedId != null)
                domElement.id = savedId;
            else
                domElement.removeAttribute("id");
        }
        for (var child = domElement.firstChild; child != null; child = child.nextSibling)
        {
            if (child.nodeType == 1) // is an Element node?
                walkTheDom(child);
        }
    };
    // walk the DOM for <A> tags
    walkTheDom(domDocument.body);
    // show the found tags
    string combinedHtml = links.Aggregate(String.Empty, (html, element) => html + element.OuterHtml + Environment.NewLine);
    MessageBox.Show(combinedHtml);
}

C# how to show available data one by one in textbox

Hi, I am using HtmlAgilityPack to scrape some data from the web using C#. Here is the code:
private void button1_Click(object sender, EventArgs e)
{
    var url = this.textBox1.Text;
    var webGet = new HtmlWeb();
    var document = webGet.Load(url);
    var metaTags = document.DocumentNode.SelectNodes("//meta");
    if (metaTags != null)
    {
        foreach (var tag in metaTags)
        {
            var name = tag.Attributes["name"].Value;
            var content = tag.Attributes["content"].Value;
            this.textBox2.Text = name + " : " + content;
        }
    }
}
It gets a link from textBox1 and shows the output in textBox2, but it only shows the last available item. I can concatenate the available data, but then it shows all the data at once. What I actually want is to show each item as soon as it is available while the others are still being processed, so that the user can see the scraping progress. Would anyone please help?
Assuming the first half of your code is correct, this will show the data one-by-one. Most likely, the foreach loop is executing too quickly for you to notice the "one-by-one" effect.
You could apply an artificial delay if you need to display each result long enough for someone to read. Maybe with a Timer or something similar.
An alternative is to use a multi-line textbox (for example a RichTextBox) and place each result on a new-line. Or a ComboBox, which won't take up as much room. These options allow you to keep the data, without having to overwrite it.
You can use a Timer control to delay the display of the next value in the loop.
You can also use BackgroundWorker to get this done. However this would add complexity to your code. Please look at Beginners Guide to Threading in .NET: Part 5
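Or you can use Thread.Sleep(xx), which delays execution for the number of milliseconds you specify. Note that it blocks the calling thread for that time, so on the UI thread the form is unresponsive while it sleeps.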
Thread.Sleep(2000);
this.textBox2.Text += name + " : " + content + Environment.NewLine;
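Because Thread.Sleep on the UI thread also holds up repainting, the intermediate values may never become visible. One alternative, assuming .NET 4.5 or later, is to await Task.Delay between items so the form can redraw. A minimal sketch; OneByOne and ShowEachAsync are hypothetical names, and the display step is passed in as a delegate so the helper does not depend on any particular control:

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical helper for illustration only.
public static class OneByOne
{
    // Calls show(item) for each item in order, pausing delayMs milliseconds
    // after each one without blocking the calling thread.
    public static async Task ShowEachAsync(IEnumerable<string> items, Action<string> show, int delayMs)
    {
        foreach (string item in items)
        {
            show(item);
            await Task.Delay(delayMs);
        }
    }
}
```

In the form, the handler would be declared as private async void button1_Click(...), collect the name + " : " + content strings into a list, and then call await OneByOne.ShowEachAsync(results, text => textBox2.Text = text, 2000);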

How to display the string html contents into webbrowser control?

I have a C# WinForms application. I save text in HTML format in my database, and I want to show it to my user in a WebBrowser control. How can I display HTML held in a string in the WebBrowser control?
Try this:
webBrowser1.DocumentText =
"<html><body>Please enter your name:<br/>" +
"<input type='text' name='userName'/><br/>" +
"<a href='http://www.microsoft.com'>continue</a>" +
"</body></html>";
Instead of navigating to blank, you can do
webBrowser1.DocumentText="0";
webBrowser1.Document.OpenNew(true);
webBrowser1.Document.Write(theHTML);
webBrowser1.Refresh();
No need to wait for events or anything else. You can check the MSDN for OpenNew, while I have tested the initial DocumentText assignment in one of my projects and it works.
As commented by Thomas W. - I almost missed this comment but I had the same issues so it's worth rewriting as an answer I think.
The main issue being that after the first assignment of webBrowser1.DocumentText to some html, subsequent assignments had no effect.
The solution as linked by Thomas can be found in detail at http://weblogs.asp.net/gunnarpeipman/archive/2009/08/15/displaying-custom-html-in-webbrowser-control.aspx however I will summarize below in case this page becomes unavailable in the future.
In short, due to the way the webBrowser control works, you must navigate to a new page each time you wish to change the content. Therefore the author proposes a method to update the control as:
private void DisplayHtml(string html)
{
    webBrowser1.Navigate("about:blank");
    if (webBrowser1.Document != null)
    {
        webBrowser1.Document.Write(string.Empty);
    }
    webBrowser1.DocumentText = html;
}
However, in my current application I get an InvalidCastException from the line if (webBrowser1.Document != null). I'm not sure why, but I've found that if I wrap the whole if block in a try/catch, the desired effect still works:
private void DisplayHtml(string html)
{
    webBrowser1.Navigate("about:blank");
    try
    {
        if (webBrowser1.Document != null)
        {
            webBrowser1.Document.Write(string.Empty);
        }
    }
    catch (InvalidCastException)
    { } // do nothing with this
    webBrowser1.DocumentText = html;
}
So every time DisplayHtml is executed, I receive an InvalidCastException from the if statement, and the body of the if statement is never reached. However, if I comment out the if statement so as not to receive the exception, the browser control doesn't get updated. I suspect that the code behind the Document property has some other side effect that produces this result even though it also throws an exception.
Anyway I hope this helps people.
For some reason the code supplied by m3z (with the DisplayHtml(string) method) is not working in my case (except first time). I'm always displaying html from string. Here is my version after the battle with the WebBrowser control:
webBrowser1.Navigate("about:blank");
while (webBrowser1.Document == null || webBrowser1.Document.Body == null)
    Application.DoEvents();
webBrowser1.Document.OpenNew(true).Write(html);
Working every time for me. I hope it helps someone.
A simple solution that I've tested:
webBrowser1.Refresh();
var str = "<html><head></head><body>" + sender.ToString() + "</body></html>";
webBrowser1.DocumentText = str;
With the WPF WebBrowser control (System.Windows.Controls.WebBrowser), you can use:
webBrowser.NavigateToString(yourString);
Here is a little code. It works (for me) at any subsequent html code change of the WebBrowser control. You may adapt it to your specific needs.
static public void SetWebBrowserHtml(WebBrowser Browser, string HtmlText)
{
if (Browser != null)
{
if (string.IsNullOrWhiteSpace(HtmlText))
{
// Putting a div inside body forces control to use div instead of P (paragraph)
// when the user presses the enter button
HtmlText =
@"<html>
<head>
<meta http-equiv=""Content-Type"" content=""text/html; charset=UTF-8"" />
</head>
<body>
<div></div>
</body>
</html>";
}
if (Browser.Document == null)
{
Browser.Navigate("about:blank");
//Wait for document to finish loading
while (Browser.ReadyState != WebBrowserReadyState.Complete)
{
Application.DoEvents();
System.Threading.Thread.Sleep(5);
}
}
// Write html code
dynamic Doc = Browser.Document.DomDocument;
Doc.open();
Doc.write(HtmlText);
Doc.close();
// Add scripts here
/*
dynamic Doc = Document.DomDocument;
dynamic Script = Doc.getElementById("MyScriptFunctions");
if (Script == null)
{
Script = Doc.createElement("script");
Script.id = "MyScriptFunctions";
Script.text = JavascriptFunctionsSourcecode;
Doc.appendChild(Script);
}
*/
// Enable contentEditable
/*
if (Browser.Document.Body != null)
{
if (Browser.Version.Major >= 9)
Browser.Document.Body.SetAttribute("contentEditable", "true");
}
*/
// Attach event handlers
// Browser.Document.AttachEventHandler("onkeyup", BrowserKeyUp);
// Browser.Document.AttachEventHandler("onkeypress", BrowserKeyPress);
// etc...
}
}
Old question, but here's my go-to for this operation.
If browser.Document IsNot Nothing Then
browser.Document.OpenNew(True)
browser.Document.Write(My.Resources.htmlTemplate)
Else
browser.DocumentText = My.Resources.htmlTemplate
End If
And be sure that any browser.Navigating event DOES NOT cancel "about:blank" URLs. Example event below for full control of WebBrowser navigating.
Private Sub browser_Navigating(sender As Object, e As WebBrowserNavigatingEventArgs) Handles browser.Navigating
Try
Me.Cursor = Cursors.WaitCursor
Select Case e.Url.Scheme
Case Constants.App_Url_Scheme
Dim query As Specialized.NameValueCollection = System.Web.HttpUtility.ParseQueryString(e.Url.Query)
Select Case e.Url.Host
Case Constants.Navigation.URLs.ToggleExpander.Host
Dim nodeID As String = query.Item(Constants.Navigation.URLs.ToggleExpander.Parameters.NodeID)
:
:
<other operations here>
:
:
End Select
Case Else
e.Cancel = (e.Url.ToString() <> "about:blank")
End Select
Catch ex As Exception
ExceptionBox.Show(ex, "Operation failed.")
Finally
Me.Cursor = Cursors.Default
End Try
End Sub
The DisplayHtml(string html) recommended by m3z worked for me.
In case it helps somebody, I would also like to mention that initially there were some spaces in my HTML that made it invalid, so the markup was displayed as plain text. The spaces were introduced (around the angle brackets) when I pasted the HTML into Visual Studio. So if your content still shows up as raw text after you try the solutions mentioned in this post, it may be worth checking that the HTML syntax is correct.
