Replacing A Node Using HtmlAgilityPack Throwing Strange Error - c#

I have a webpage that displays a table the user can edit. After the edits are made I want to save the table as a .html file that I can convert to an image later. I am doing this by overriding the render method. However, I want to remove two buttons and a DropDownList from the final version so that I just get the table by itself. Here is the code I am currently trying:
protected override void Render(HtmlTextWriter writer)
{
using (HtmlTextWriter htmlwriter = new HtmlTextWriter(new StringWriter()))
{
base.Render(htmlwriter);
string renderedContent = htmlwriter.InnerWriter.ToString();
string output = renderedContent.Replace(@"<input type=""submit"" name=""viewReport"" value=""View Report"" id=""viewReport"" />", "");
output = output.Replace(@"<input type=""submit"" name=""redoEdits"" value=""Redo Edits"" id=""redoEdits"" />", "");
var doc = new HtmlDocument();
doc.LoadHtml(output);
var query = doc.DocumentNode.Descendants("select");
foreach (var item in query.ToList())
{
var newNodeStr = "<div></div>";
var newNode = HtmlNode.CreateNode(newNodeStr);
item.ParentNode.ReplaceChild(newNode, item);
}
File.WriteAllText(currDir + "\\outputFile.html", output);
writer.Write(renderedContent);
}
}
Where I have adapted this solution found in another SO post about replacing nodes with HtmlAgilityPack:
var htmlStr = "<b>bold_one</b><strong>strong</strong><b>bold_two</b>";
var doc = new HtmlDocument();
doc.LoadHtml(htmlStr);
var query = doc.DocumentNode.Descendants("b");
foreach (var item in query.ToList())
{
var newNodeStr = "<foo>bar</foo>";
var newNode = HtmlNode.CreateNode(newNodeStr);
item.ParentNode.ReplaceChild(newNode, item);
}
and here is the rendered HTML I am trying to alter:
<select name="Archives" onchange="javascript:setTimeout('__doPostBack(\'Archives\',\'\')', 0)" id="Archives" style="width:200px;">
<option selected="selected" value="Dashboard_Jul-2012">Dashboard_Jul-2012</option>
<option value="Dashboard_Jun-2012">Dashboard_Jun-2012</option>
</select>
The two calls to Replace are working as expected and removing the buttons. However this line:
var query = doc.DocumentNode.Descendants("select");
is throwing this error:
Method not found: 'Int32 System.Environment.get_CurrentManagedThreadId()'.
Any advice is appreciated.
Regards.

It looks like you are referencing the .NET 4.5 build of the Agility Pack in a project that targets .NET 4.0 or lower. Environment.CurrentManagedThreadId was only added in .NET 4.5, so the call cannot be resolved at run time. Either change the reference to the DLL compiled for your framework version, or retarget your project to .NET 4.5 (if you are using VS 2012, that is).
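Separately from the version mismatch, the code in the question writes the output string to disk, and edits made through HtmlDocument never change that string. Below is a minimal sketch of serializing the modified document instead, assuming the HtmlAgilityPack package is referenced. The names SelectStripper and RemoveSelects are made up for illustration.

```csharp
using System.Linq;
using HtmlAgilityPack;

public static class SelectStripper
{
    // Hypothetical helper: returns the markup with every <select> element removed.
    public static string RemoveSelects(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // ToList() snapshots the matches so nodes can be removed while iterating.
        foreach (var select in doc.DocumentNode.Descendants("select").ToList())
        {
            select.Remove();
        }

        // The DOM edits live in the document, not in the input string,
        // so serialize the document to get the modified markup.
        return doc.DocumentNode.OuterHtml;
    }
}
```

In the Render override, the result of SelectStripper.RemoveSelects(output) would then be what gets passed to File.WriteAllText, rather than output itself.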

Related

Iterate through web pages and download PDFs

I have code that crawls a web page for PDF files and downloads them to a folder. However, it has now started throwing an error:
System.NullReferenceException HResult=0x80004003 Message=Object
reference not set to an instance of an object. Source=NW Crawler
StackTrace: at NW_Crawler.Program.Main(String[] args) in
C:\Users\PC\source\repos\NW Crawler\NW Crawler\Program.cs:line 16
Pointing to ProductListPage in foreach (HtmlNode src in ProductListPage)
Is there any hint on how to fix this issue? I have tried to implement async/await with no success. Maybe I was doing something wrong tho...
Here is the process to be done:
Go to https://www.nordicwater.com/products/waste-water/
List all links in section (related products). They are: <a class="ap-area-link" href="https://www.nordicwater.com/product/mrs-meva-multi-rake-screen/">MRS MEVA multi rake screen</a>
Proceed to each link and search for PDF files. PDF files are in:
<div class="dl-items">
<a href="https://www.nordicwater.com/wp-content/uploads/2016/04/S1126-MRS-brochure-EN.pdf" download="">
Here is my full code for testing:
using HtmlAgilityPack;
using System;
using System.Net;
namespace NW_Crawler
{
class Program
{
static void Main(string[] args)
{
{
HtmlDocument htmlDoc = new HtmlWeb().Load("https://www.nordicwater.com/products/waste-water/");
HtmlNodeCollection ProductListPage = htmlDoc.DocumentNode.SelectNodes("//a[@class='ap-area-link']//a");
Console.WriteLine("Here are the links:" + ProductListPage);
foreach (HtmlNode src in ProductListPage)
{
htmlDoc = new HtmlWeb().Load(src.Attributes["href"].Value);
// Thread.Sleep(5000); // wait some time
HtmlNodeCollection LinkTester = htmlDoc.DocumentNode.SelectNodes("//div[@class='dl-items']//a");
if (LinkTester != null)
{
foreach (var dllink in LinkTester)
{
string LinkURL = dllink.Attributes["href"].Value;
Console.WriteLine(LinkURL);
string ExtractFilename = LinkURL.Substring(LinkURL.LastIndexOf("/"));
var DLClient = new WebClient();
// Thread.Sleep(5000); // wait some time
DLClient.DownloadFileAsync(new Uri(LinkURL), @"C:\temp\" + ExtractFilename);
}
}
}
}
}
}
}
Made a couple of changes to cover the errors you might be seeing.
Changes
Use src.GetAttributeValue("href", string.Empty) instead of src.Attributes["href"].Value. If the href attribute is missing, src.Attributes["href"] is null and reading .Value throws "Object reference not set to an instance of an object".
Check that ProductListPage is not null before iterating; SelectNodes returns null when nothing matches.
ExtractFilename includes the leading / in the name. Add + 1 to the index passed to Substring to skip the last / found by LastIndexOf.
Move on to the next iteration if the href is empty in either of the loops.
Changed the product list query from //a[@class='ap-area-link']//a to //a[@class='ap-area-link']. You were searching for an <a> nested inside the <a> tag, which matches nothing, so SelectNodes returned null. If you still want to query it that way, the first if statement checking ProductListPage != null will at least prevent the exception.
HtmlDocument htmlDoc = new HtmlWeb().Load("https://www.nordicwater.com/products/waste-water/");
HtmlNodeCollection ProductListPage = htmlDoc.DocumentNode.SelectNodes("//a[@class='ap-area-link']");
if (ProductListPage != null)
foreach (HtmlNode src in ProductListPage)
{
string href = src.GetAttributeValue("href", string.Empty);
if (string.IsNullOrEmpty(href))
continue;
htmlDoc = new HtmlWeb().Load(href);
HtmlNodeCollection LinkTester = htmlDoc.DocumentNode.SelectNodes("//div[@class='dl-items']//a");
if (LinkTester != null)
foreach (var dllink in LinkTester)
{
string LinkURL = dllink.GetAttributeValue("href", string.Empty);
if (string.IsNullOrEmpty(LinkURL))
continue;
string ExtractFilename = LinkURL.Substring(LinkURL.LastIndexOf("/") + 1);
new WebClient().DownloadFileAsync(new Uri(LinkURL), @"C:\temp\" + ExtractFilename);
}
}
The XPath that you used seems to be incorrect. I loaded the web page in a browser, searched for that XPath, and got no results. After replacing it with //a[@class='ap-area-link'] I was able to find matching elements.
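As an alternative to the Substring/LastIndexOf arithmetic, the file name can be taken from the parsed URL. This is only a sketch using the standard Uri and Path classes. The names DownloadNames and FileNameFromUrl are hypothetical, and it assumes the link is an absolute URL.

```csharp
using System;
using System.IO;

public static class DownloadNames
{
    // Hypothetical helper: the last path segment of an absolute URL,
    // without the leading '/'.
    public static string FileNameFromUrl(string url)
    {
        var uri = new Uri(url);
        // LocalPath holds only the path part, so a query string or
        // fragment does not end up in the file name.
        return Path.GetFileName(uri.LocalPath);
    }
}
```

For the brochure link quoted in the question this yields S1126-MRS-brochure-EN.pdf, which can then be appended to the target folder.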

System.ArgumentNullException when trying to access span with Xpath (C#)

I've been trying to get a program working that pulls stock statistics from Google Finance. So far I have not been able to get information out of spans. For now I have hardcoded direct access to the Apple stock.
Link to Apple stock: https://www.google.com/finance?q=NASDAQ%3AAAPL&ei=NgItWIG1GIftsAHCn4zIAg
What I can't understand is that I get the correct output when I try it in the Chrome console with the following command:
$x("//*[@id=\"appbar\"]//div//div//div//span");
This is my current code in Visual Studio 2015 with Html Agility Pack installed (I suspect the fault is in currDocNodeCompanyName):
class StockDataAccess
{
HtmlWeb web= new HtmlWeb();
private List<string> testList;
public void FindStock()
{
var histDoc = web.Load("https://www.google.com/finance/historical?q=NASDAQ%3AAAPL&ei=q9IsWNm4KZXjsAG-4I7oCA.html");
var histDocNode = histDoc.DocumentNode.SelectNodes("//*[@id=\"prices\"]//table//tr//td");
var currDoc = web.Load("https://www.google.com/finance?q=NASDAQ%3AAAPL&ei=CdcsWMjNCIe0swGd3oaYBA.html");
var currDocNodeCurrency = currDoc.DocumentNode.SelectNodes("//*[@id=\"ref_22144_elt\"]//div//div");
var currDocNodeCompanyName = currDoc.DocumentNode.SelectNodes("//*[@id=\"appbar\"]//div//div//div//span");
var histDocText = histDocNode.Select(node => node.InnerText);
var currDocCurrencyText = currDocNodeCurrency.Select(node => node.InnerText);
var currDocCompanyName = currDocNodeCompanyName.Select(node => node.InnerText);
List<String> result = new List<string>(histDocText.Take(6));
result.Add(currDocCurrencyText.First());
result.Add(currDocCompanyName.Take(2).ToString());
testList = result;
}
public List<String> ReturnStock()
{
return testList;
}
}
I have also tried the XPath expression [text], which gives output I can work with in the Chrome console but not in VS. I have also experimented with a foreach loop, which a few people had suggested to others.
class StockDataAccess
{
HtmlWeb web= new HtmlWeb();
private List<string> testList;
public void FindStock()
{
///same as before
var currDoc = web.Load("https://www.google.com/finance?q=NASDAQ%3AAAPL&ei=CdcsWMjNCIe0swGd3oaYBA.html");
HtmlNodeCollection currDocNodeCompanyName = currDoc.DocumentNode.SelectNodes("//*[@id=\"appbar\"]//div//div//div//span");
///Same as before
List <string> blaList = new List<string>();
foreach (HtmlNode x in currDocNodeCompanyName)
{
blaList.Add(x.InnerText);
}
List<String> result = new List<string>(histDocText.Take(6));
result.Add(currDocCurrencyText.First());
result.Add(blaList[1]);
result.Add(blaList[2]);
testList = result;
}
public List<String> ReturnStock()
{
return testList;
}
}
I would really appreciate if anyone could point me in the right direction.
If you check the contents of currDoc.DocumentNode.InnerHtml, you will notice that there is no element with the id "appbar". The result is therefore correct: the XPath matches nothing, so SelectNodes returns null, and calling Select on that null is what throws the ArgumentNullException.
I suspect that the HTML element you're trying to find is generated by a script (JavaScript, for example). That explains why you can see it in the browser but not in the HtmlDocument object: HtmlAgilityPack does not run scripts, it only downloads and parses the raw source.
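Because SelectNodes returns null rather than an empty collection when nothing matches, a guard in front of the LINQ calls avoids the exception. Here is a minimal sketch, assuming the HtmlAgilityPack package is referenced. XPathText and InnerTexts are illustrative names, not part of the library.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class XPathText
{
    // Hypothetical helper: inner text of every node matching the XPath,
    // or an empty list when nothing matches.
    public static List<string> InnerTexts(string html, string xpath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null, not an empty collection, when there is no match.
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes(xpath);
        if (nodes == null)
        {
            return new List<string>();
        }
        return nodes.Select(node => node.InnerText).ToList();
    }
}
```

This only stops the crash. It does not make script-generated elements appear in the downloaded source.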

How To Write To A OneNote 2013 Page Using C# and The OneNote Interop

I have seen many articles about this but all of them are either incomplete or do not answer my question. Using C# and the OneNote Interop, I would like to simply write text to an existing OneNote 2013 Page. Currently I have a OneNote Notebook, with a Section titled "Sample_Section" and a Page called "MyPage".
I need to be able to use C# code to write text to this Page, but I cannot figure out how or find any resources to do so. I have looked at all of the code examples on the web and none answer this simple question or are able to do this. Also many of the code examples are outdated and break when attempting to run them.
I used the Microsoft code sample that shows how to change the name of a Section but I cannot find any code to write text to a Page. There is no simple way to do this that I can see. I have taken a lot of time to research this and view the different examples online but none are able to help.
I have already viewed the MSDN articles on the OneNote Interop as well. I vaguely understand how the OneNote Interop works through XML but any extra help understanding that would also be appreciated. Most importantly I would really appreciate a code example that demonstrates how to write text to a OneNote 2013 Notebook Page.
I have tried using this Stack Overflow answer:
Creating new One Note 2010 page from C#
However, there are 2 things about this solution that do not answer my question:
1) The marked solution shows how to create a new page, not how to write text to it or how to populate the page with any information.
2) When I try to run the code that is marked as the solution, I get an error at the following line:
var node = doc.Descendants(ns + nodeName).Where(n => n.Attribute("name").Value == objectName).FirstOrDefault();
return node.Attribute("ID").Value;
The reason is that the value of "node" is null. Any help would be greatly appreciated.
I asked the same question on MSDN forums and was given this great answer. Below is a nice, clean example of how to write to OneNote using C# and the OneNote interop. I hope that this can help people in the future.
static Application onenoteApp = new Application();
static XNamespace ns = null;
static void Main(string[] args)
{
GetNamespace();
string notebookId = GetObjectId(null, OneNote.HierarchyScope.hsNotebooks, "MyNotebook");
string sectionId = GetObjectId(notebookId, OneNote.HierarchyScope.hsSections, "Sample_Section");
string firstPageId = GetObjectId(sectionId, OneNote.HierarchyScope.hsPages, "MyPage");
GetPageContent(firstPageId);
Console.Read();
}
static void GetNamespace()
{
string xml;
onenoteApp.GetHierarchy(null, OneNote.HierarchyScope.hsNotebooks, out xml);
var doc = XDocument.Parse(xml);
ns = doc.Root.Name.Namespace;
}
static string GetObjectId(string parentId, OneNote.HierarchyScope scope, string objectName)
{
string xml;
onenoteApp.GetHierarchy(parentId, scope, out xml);
var doc = XDocument.Parse(xml);
var nodeName = "";
switch (scope)
{
case (OneNote.HierarchyScope.hsNotebooks): nodeName = "Notebook"; break;
case (OneNote.HierarchyScope.hsPages): nodeName = "Page"; break;
case (OneNote.HierarchyScope.hsSections): nodeName = "Section"; break;
default:
return null;
}
var node = doc.Descendants(ns + nodeName).Where(n => n.Attribute("name").Value == objectName).FirstOrDefault();
return node.Attribute("ID").Value;
}
static string GetPageContent(string pageId)
{
string xml;
onenoteApp.GetPageContent(pageId, out xml, OneNote.PageInfo.piAll);
var doc = XDocument.Parse(xml);
var outLine = doc.Descendants(ns + "Outline").First();
var content = outLine.Descendants(ns + "T").First();
string contentVal = content.Value;
content.Value = "modified";
onenoteApp.UpdatePageContent(doc.ToString());
return null;
}
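GetObjectId above still throws a NullReferenceException when no notebook, section, or page has the requested name, which is the same failure described in the question. The sketch below shows a null-safe lookup using only LINQ to XML, so it runs without OneNote installed. HierarchyLookup and FindId are hypothetical names, and the hierarchy XML is assumed to have the shape that GetHierarchy returns.

```csharp
using System.Linq;
using System.Xml.Linq;

public static class HierarchyLookup
{
    // Hypothetical helper: ID of the first element with the given local name
    // and "name" attribute, or null when there is no such element.
    public static string FindId(string hierarchyXml, string nodeName, string objectName)
    {
        var doc = XDocument.Parse(hierarchyXml);
        XNamespace ns = doc.Root.Name.Namespace;

        // Casting an XAttribute to string yields null for a missing attribute,
        // whereas .Value would throw a NullReferenceException.
        var node = doc.Descendants(ns + nodeName)
                      .FirstOrDefault(n => (string)n.Attribute("name") == objectName);

        return node == null ? null : (string)node.Attribute("ID");
    }
}
```

A caller can then check for null and report that "MyNotebook", "Sample_Section", or "MyPage" was not found, instead of crashing.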
This is just what I've gleaned from reading examples on the web (of course, you've already read all of those) and from peeking into the way OneNote stores its data in XML using OMSpy (http://blogs.msdn.com/b/johnguin/archive/2011/07/28/onenote-spy-omspy-for-onenote-2010.aspx).
If you want to work with OneNote content, you'll need a basic understanding of XML. Writing text to a OneNote page involves creating an outline element, whose content will be contained in OEChildren elements. Within an OEChildren element, you can have many other child elements representing outline content. These can be of type OE or HTMLBlock, if I'm reading the schema correctly. Personally, I've only ever used OE, and in this case, you'll have an OE element containing a T (text) element. The following code will create an outline XElement and add text to it:
// Get info from OneNote
string xml;
onApp.GetHierarchy(null, OneNote.HierarchyScope.hsSections, out xml);
XDocument doc = XDocument.Parse(xml);
XNamespace ns = doc.Root.Name.Namespace;
// Assuming you have a notebook called "Test"
XElement notebook = doc.Root.Elements(ns + "Notebook").Where(x => x.Attribute("name").Value == "Test").FirstOrDefault();
if (notebook == null)
{
Console.WriteLine("Did not find notebook titled 'Test'. Aborting.");
return;
}
// If there is a section, just use the first one we encounter
XElement section;
if (notebook.Elements(ns + "Section").Any())
{
section = notebook.Elements(ns + "Section").FirstOrDefault();
}
else
{
Console.WriteLine("No sections found. Aborting");
return;
}
// Create a page
string newPageID;
onApp.CreateNewPage(section.Attribute("ID").Value, out newPageID);
// Create the page element using the ID of the new page OneNote just created
XElement newPage = new XElement(ns + "Page");
newPage.SetAttributeValue("ID", newPageID);
// Add a title just for grins
newPage.Add(new XElement(ns + "Title",
new XElement(ns + "OE",
new XElement(ns + "T",
new XCData("Test Page")))));
// Add an outline and text content
newPage.Add(new XElement(ns + "Outline",
new XElement(ns + "OEChildren",
new XElement(ns + "OE",
new XElement(ns + "T",
new XCData("Here is some new sample text."))))));
// Now update the page content
onApp.UpdatePageContent(newPage.ToString());
Here's what the actual XML you're sending to OneNote looks like:
<Page ID="{20A13151-AD1C-4944-A3D3-772025BB8084}{1}{A1954187212743991351891701718491104445838501}" xmlns="http://schemas.microsoft.com/office/onenote/2013/onenote">
<Title>
<OE>
<T><![CDATA[Test Page]]></T>
</OE>
</Title>
<Outline>
<OEChildren>
<OE>
<T><![CDATA[Here is some new sample text.]]></T>
</OE>
</OEChildren>
</Outline>
</Page>
Hope that helps get you started!
If you're using C#, check out the newer OneNote REST API at http://dev.onenote.com. It already supports creating a new page and has a beta API to patch and add content to an existing page.

Select specific text from website page

I get a webpage content using this code:
static void Main(string[] args)
{
using (var client = new WebClient())
{
var pageContent = client.DownloadString("http://www.modern-railways.com");
Console.WriteLine(pageContent);
Console.ReadLine();
}
}
This is what I get:
…….News: <span class='articleTitle'>Victoria Metrolink improvement begins</span></a></h1><p><a href='/view_article.asp?ID=7541&pubID=37&t=0&s=0&sO=both&p=1&i=10' class='summaryText' data-ajax='false'>Published 13 February 2014, 11:28</a></p><div class='articleContent ui-widget ui-widget-content ui-helper-clearfix ui-corner-all '….
I need to capture every "articleTitle" and its published date from pageContent; there are several of them. How can I do that? I need some direction.
You can use regular expressions to do this:
var regex = new Regex(@"<span class='articleTitle'>(.+?)</span>");
var match = regex.Match(pageContent);
var result = match.Groups[1].Value;
The above code returns only the first title, and it works only if the tag is built in exactly the same way every time. To capture every title, iterate over regex.Matches:
foreach (Match itemMatch in regex.Matches(pageContent))
{
var articleTitle= itemMatch.Groups[1].Value;
//TODO do what you need with the articleTitle (e.g. add to a list)
}
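The question also asks for the published date. Here is a sketch of one pattern that captures both values, assuming every article follows the markup shown in the question, with a "Published ..." link after each title. ArticleScraper and Extract are made-up names.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ArticleScraper
{
    // Title in group 1, the text after "Published " in group 2.
    // Singleline lets '.' cross line breaks between the two tags.
    private static readonly Regex ArticleRegex = new Regex(
        @"<span class='articleTitle'>(.+?)</span>.*?>Published ([^<]+)</a>",
        RegexOptions.Singleline);

    // Hypothetical helper: one (title, published) pair per article found.
    public static List<KeyValuePair<string, string>> Extract(string pageContent)
    {
        var results = new List<KeyValuePair<string, string>>();
        foreach (Match m in ArticleRegex.Matches(pageContent))
        {
            results.Add(new KeyValuePair<string, string>(
                m.Groups[1].Value, m.Groups[2].Value.Trim()));
        }
        return results;
    }
}
```

If a title has no "Published" link of its own, the lazy .*? will run on to the next article's date, so an HTML parser is the more robust choice when the markup varies.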

Replacing InnerText of a TextBox

Using the MS OpenXml SDK I have been able to copy a template stream to a result stream and append dynamic text (w:p > w:r > w:t) at the end of the body using the following code:
var templateStream = File.OpenRead(templatePath);
templateStream.CopyTo(resultStream);
using (var resultPackage = WordprocessingDocument.Open(resultStream, true))
{
var document = resultPackage.MainDocumentPart.Document;
var body = document.Body;
// Add new text.
var para = body.AppendChild(new Paragraph());
var run = para.AppendChild(new Run());
run.AppendChild(new Text(firstName));
document.Save();
}
My next logical step was to then replace the innertext of a textbox in the resultStream with the firstName as in the code below.
// replacing code in using statement from above
var document = resultPackage.MainDocumentPart.Document;
var textbox = document.Descendants<TextBox>().First();
const string firstNametag = "<<IH.FirstName>>";
if (textbox.InnerText.Contains(firstNametag))
{
var textboxContent = textbox.Elements<TextBoxContent>().First();
textboxContent.RemoveAllChildren();
var paragraph = textboxContent.AppendChild(new Paragraph());
var run = paragraph.AppendChild(new Run());
run.AppendChild(new Text(firstName));
}
document.Save();
In the first example and with some additional code the result stream is appropriately serialized to a docx and the firstName is appended to the end of the body when viewed in Word. In the second example though the textbox and its contents remain the same even though further examination in the debugger showed the textboxContent's children reflecting the changes made above.
I am new to OpenXML development so if there is anything obvious please point it out.
Oh wow, I don't even want to think I understand the whole picture around this but here's a quick stab at it. Someone with more openXml experience please chime in...
It turns out that when you create a textbox in Word, the document.xml file in the docx gets the following markup:
<w:r>
<mc:AlternateContent>
<mc:Choice Requires="wps">
<wps:txbx>
<w:txbxContent>
<w:r>
<w:t>
Text Goes Here
</w:t>
</w:r>
</w:txbxContent>
</wps:txbx>
</mc:Choice>
<mc:Fallback>
<v:textbox>
<w:txbxContent>
<w:r>
<w:t>
Text Goes Here
</w:t>
</w:r>
</w:txbxContent>
</v:textbox>
</mc:Fallback>
</mc:AlternateContent>
</w:r>
Notice the mc:AlternateContent, mc:Choice, and mc:Fallback tags. What the heck are these? Someone put it this way on a blog article I came across:
"As I understand it - but don't take my word as gospel - AlternateContent can
appear anywhere and provides a mechanism for including enhanced
functionality if the consuming application can handle it, along with a
fallback if it can't."
-Tony Jollans - http://social.msdn.microsoft.com/Forums/en-US/worddev/thread/f8a5c277-7049-48c2-a295-199d2914f4ba/
In my case I was only modifying the fallback textbox (v:textbox, not wps:txbx), because I assumed ReSharper was right when it asked me to import the DocumentFormat.OpenXml.Vml namespace for my dependency on the TextBox object. I'm not sure why there isn't a TextBox definition in one of the namespaces I had already included, DocumentFormat.OpenXml.Packaging or DocumentFormat.OpenXml.Wordprocessing, but that's beyond the scope of this question. Once I realized this and updated my code to look for the w:txbxContent element that both textboxes share, I achieved what I wanted to do.
Here's the updated code with some refactoring. Call the ReplaceTags method in the using statement from the original question and supply a model object instead of a string. The tagsTomValues dictionary maps each tag to a function that selects its value from the model.
private void ReplaceTags(Document document, SomeModel model)
{
var textBoxContents = document.Descendants<TextBoxContent>().ToList();
foreach (var textBoxContent in textBoxContents)
{
ReplaceTag(textBoxContent, model);
}
}
private void ReplaceTag(TextBoxContent textBoxContent, SomeModel model)
{
var tag = textBoxContent.InnerText.Trim();
if (!tagsTomValues.ContainsKey(tag)) return;
var valueSelector = tagsTomValues[tag];
textBoxContent.RemoveAllChildren();
var paragraph = textBoxContent.AppendChild(new Paragraph());
var run = paragraph.AppendChild(new Run());
run.AppendChild(new Text(valueSelector(model)));
}
// called in the ctor
private static void InitializeTags(IDictionary<string, Func<SomeModel, string>> dictionary)
{
dictionary.Add("<<IH.Name>>", m => string.Format("{0} {1}", m.FirstName, m.LastName));
}
Happy openXmling :)
