Parse Html Document Get All input fields with ID and Value - c#

I have several thousand (ASP.net - messy html) html generated invoices that I'm trying to parse and save into a database.
Basically like:
foreach(var htmlDoc in HtmlFolder)
{
foreach(var inputBox in htmlDoc)
{
//Make Collection of ID and Values Insert to DB
}
}
From all the other questions I've read the best tool for this type of problem is the HtmlAgilityPack, however for the life of me I can't get the documentation .chm file to work. Any ideas on how I could accomplish this with or without the Agility Pack ?
Thanks in advance

A newer alternative to HtmlAgilityPack is CsQuery. See this later question on its relative performance merits, but its use of CSS selectors can't be beat:
var doc = CQ.CreateDocumentFromFile(htmldoc); // load and parse the file
var fields = doc["input"]; // select the input fields with a CSS selector
// project each field to an (id, value) pair
var pairs = fields.Select(node => new Tuple<string, string>(node.Id, node.Value()));

To get the CHM to work, you probably need to open the file's properties in Windows Explorer and unblock it (click the "Unblock" button, or tick the "Unblock" checkbox on newer versions of Windows).
The HTML Agility Pack is quite easy when you know your way around Linq-to-XML or XPath.
Basics you'll need to know:
//import the HtmlAgilityPack
using HtmlAgilityPack;
HtmlDocument doc = new HtmlDocument();
// Load your data
// -----------------------------
// Load doc from file:
doc.Load(pathToFile);
// OR
// Load doc from string:
doc.LoadHtml(contentsOfFile);
// -----------------------------
// Find what you're after
// -----------------------------
// Finding things using Linq
var nodes = doc.DocumentNode.DescendantsAndSelf("input")
.Where(node => !string.IsNullOrWhiteSpace(node.Id)
&& node.Attributes["value"] != null
&& !string.IsNullOrWhiteSpace(node.Attributes["value"].Value));
// OR
// Finding things using XPath
// (@id != '' is false when the attribute is missing, so inputs without id or value are skipped)
var nodes = doc.DocumentNode
.SelectNodes("//input[@id != '' and @value != '']");
// -----------------------------
// looping through the nodes:
// the XPath interfaces can return null when no nodes are found
if (nodes != null)
{
foreach (var node in nodes)
{
var id = node.Id;
var value = node.Attributes["value"].Value;
}
}
The easiest way to add the HTML Agility Pack is using NuGet:
PM> Install-Package HtmlAgilityPack
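
Putting those pieces together for the original question, a minimal sketch might look like the following. The class and method names (InvoiceInputReader, ReadInputs, ReadFolder) and the *.html file pattern are assumptions made for illustration; only the HtmlAgilityPack calls are from the library itself.

```csharp
using System.Collections.Generic;
using System.IO;
using HtmlAgilityPack;

// Hypothetical helper, not part of HtmlAgilityPack.
public static class InvoiceInputReader
{
    // Returns id -> value for every <input> with a non-empty id.
    // A missing value attribute is stored as an empty string.
    public static Dictionary<string, string> ReadInputs(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var result = new Dictionary<string, string>();
        var nodes = doc.DocumentNode.SelectNodes("//input[@id != '']");
        if (nodes == null)
        {
            return result; // SelectNodes returns null when nothing matches
        }

        foreach (var node in nodes)
        {
            // If an id is repeated, the last occurrence wins.
            result[node.Id] = node.GetAttributeValue("value", "");
        }
        return result;
    }

    // Assumes the invoices are *.html files directly inside 'folder'.
    public static IEnumerable<KeyValuePair<string, Dictionary<string, string>>> ReadFolder(string folder)
    {
        foreach (var path in Directory.EnumerateFiles(folder, "*.html"))
        {
            yield return new KeyValuePair<string, Dictionary<string, string>>(
                path, ReadInputs(File.ReadAllText(path)));
        }
    }
}
```

Each dictionary returned by ReadFolder can then be written to the database in whatever shape your schema needs; that part is left out because it depends entirely on your tables.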

Hah, looks like the ideal time to make a shameless plug of a library I wrote!
This should be rather easy to accomplish with this library (that's built on top of HtmlAgility pack by the way!) : https://github.com/amoerie/htmlbuilders
(You can find the Nuget package here: https://www.nuget.org/packages/HtmlBuilders/ )
Code samples:
const string html = "<div class='invoice'><input type='text' name='abc' value='123'/><input id='ohgood' type='text' name='def' value='456'/></div>";
var htmlDocument = new HtmlDocument {OptionCheckSyntax = false}; // avoid exceptions when html is invalid
htmlDocument.Load(new StringReader(html));
var tag = HtmlTag.Parse(htmlDocument); // if there is a root tag
var tags = HtmlTag.ParseAll(htmlDocument); // if there is no root tag
// find looks recursively through the entire DOM tree
var inputFields = tag.Find(t => string.Equals(t.TagName, "input"));
foreach (var inputField in inputFields)
{
Console.WriteLine(inputField["type"]);
Console.WriteLine(inputField["value"]);
if(inputField.HasAttribute("id"))
Console.WriteLine(inputField["id"]);
}
Note that inputField[attribute] will throw a 'KeyNotFoundException' if that field does not have the specified attribute name. That's because HtmlTag implements and reuses IDictionary logic for its attributes.
Edit: If you're not running this code in a web environment, you'll need to add a reference to System.Web. That's because this library makes use of the HtmlString class which can be found in System.Web. Just choose 'Add reference' and then you can find it under 'Assemblies > Framework'

You can download HtmlAgilityPack Documents CHM file from here.
If the CHM file's contents are not visible, uncheck the "Always ask before opening this file" checkbox in the security warning dialog that appears when you open the file.
Note: that dialog appears for unsigned files.
Source: HtmlAgilityPack Documentation

Related

OpenXML SDK: How to get "Tag" (p:tag) val of a PowerPoint shape?

I need to check all tags on all shapes on all slides. I can select each shape, however I can't see how to get the shape's tags.
For the given DocumentFormat.OpenXml.Presentation.Shape, how can I get the "val" of the tag with name="MOUNTAIN"
In my shape, the tag rId is in this structure: p:sp > p:nvSpPr > p:cNvPr > p:nvPr > p:custDataLst > p:tags
I'm guessing my code needs to do these steps:
• Get the rId of the p:custDataLst p:tags
• Look up the "Target" file name in the slideX.xml.rels file, based on the rId
• Look in the root/tags folder for the "Target" file
• Get the p:tagLst p:tags and look for the p:tag with name="MOUNTAIN"
<p:tagLst>
<p:tag name="MOUNTAIN" val="Denali"/>
</p:tagLst>
Here is how my code iterates through shapes on each slide:
for (int x = 0; x < doc.PresentationPart.SlideParts.Count(); x++)
{
SlidePart slide = doc.PresentationPart.SlideParts.ElementAt(x);
ShapeTree tree = slide.Slide.CommonSlideData.ShapeTree;
IEnumerable<DocumentFormat.OpenXml.Presentation.Shape> slShapes = slide.Slide.Descendants<DocumentFormat.OpenXml.Presentation.Shape>();
foreach (DocumentFormat.OpenXml.Presentation.Shape shape in slShapes)
{
//get the specified tag, if it exists
}
}
I see an example of how to add tags: How to add custom tags to powerpoint slides using OpenXml in c#
But I can't figure out how to read the existing tags.
So, how do I get the shape's tags with c#?
I was hoping to do something like this:
IEnumerable<UserDefinedTagsPart> userDefinedTagsParts = shape.NonVisualShapeProperties.ApplicationNonVisualDrawingProperties.CustomerDataList.CustomerDataTags<UserDefinedTagsPart>();
foreach (UserDefinedTagsPart userDefinedTagsPart in userDefinedTagsParts)
{}
but Visual Studio says "ApplicationNonVisualDrawingProperties does not contain a definition for CustomerDataList".
From the OpenXML Productivity Tool, here is the element tree:
You and I seem to be working on similar problems. I'm struggling with learning the file format. The following code works for me, though I'm sure it can be optimized.
public void ReadTags(Shape shape, SlidePart slidePart)
{
NonVisualShapeProperties nvsp = shape.NonVisualShapeProperties;
ApplicationNonVisualDrawingProperties nvdp = nvsp.ApplicationNonVisualDrawingProperties;
IEnumerable<CustomerDataTags> data_tags = nvdp.Descendants<CustomerDataTags>();
foreach (var data_tag in data_tags)
{
UserDefinedTagsPart shape_tags = slidePart.GetPartById(data_tag.Id) as UserDefinedTagsPart;
if (shape_tags != null)
{
foreach (Tag tag in shape_tags.TagList)
{
Debug.Print($"\t{nvsp.NonVisualDrawingProperties.Name} tag {tag.Name} = '{tag.Val}'");
}
}
}
}
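
To answer the question as asked (the val of the tag named "MOUNTAIN"), the same lookup can be narrowed to a single tag. This is only a sketch built on the code above: the class and method names (ShapeTagReader, FindTagValue) are made up for illustration, and it assumes every relationship id on the shape resolves to a part in the given slide part.

```csharp
using System.Linq;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Presentation;

// Hypothetical helper, not part of the Open XML SDK.
public static class ShapeTagReader
{
    // Returns the val of the first tag called tagName on the shape,
    // or null when the shape has no such tag.
    public static string FindTagValue(Shape shape, SlidePart slidePart, string tagName)
    {
        var appProps = shape.NonVisualShapeProperties?.ApplicationNonVisualDrawingProperties;
        if (appProps == null)
        {
            return null;
        }

        foreach (var dataTag in appProps.Descendants<CustomerDataTags>())
        {
            if (dataTag.Id == null)
            {
                continue;
            }

            // The r:id on p:tags points at a tags part related to the slide.
            var tagsPart = slidePart.GetPartById(dataTag.Id.Value) as UserDefinedTagsPart;
            var tag = tagsPart?.TagList?.Elements<Tag>()
                .FirstOrDefault(t => t.Name?.Value == tagName);
            if (tag != null)
            {
                return tag.Val?.Value;
            }
        }
        return null;
    }
}
```

Inside the question's loop this would be called as FindTagValue(shape, slide, "MOUNTAIN"), where slide is the SlidePart being iterated.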
I've spent a lot of time with OpenXML .docx and .xlsx files ... but not so much with .pptx.
Nevertheless, here are a couple of suggestions that might help:
If you haven't already done so, please download the OpenXML SDK Productivity Tool to analyze your file's contents. It's currently available on GitHub:
https://github.com/dotnet/Open-XML-SDK/releases/tag/v2.5
You might simply be able to "grep" for items you're looking for.
EXAMPLE (Word, not PowerPoint... but the same principle should apply):
using (doc = WordprocessingDocument.Open(stream, true))
{
// Init OpenXML members
mainPart = doc.MainDocumentPart;
body = mainPart.Document.Body;
...
foreach (var text in body.Descendants<Text>())
{
if (text.Text.Contains(target))
...

Replace text in docx file with content of another docx file

I'm trying to use OpenXml to replace the text "Veteran" in file A.docx with the content of B.docx. If B.docx contains text or a paragraph, it works fine and I get a modified A.docx file.
However, if B.docx contains a table, then the code doesn't work.
static void Main(string[] args)
{
SearchAndReplace(@"C:\A.docx", @"C:\B.docx");
}
public static void SearchAndReplace(string docTo, string docFrom)
{
List<WordprocessingDocument> docList = new List<WordprocessingDocument>();
using (WordprocessingDocument wordDoc = WordprocessingDocument.Open(docTo, true))
using (WordprocessingDocument wordDoc1 = WordprocessingDocument.Open(docFrom, true))
{
var parts = wordDoc1.MainDocumentPart.Document.Descendants().FirstOrDefault();
docList.Add(wordDoc);
docList.Add(wordDoc1);
if (parts != null)
{
foreach (var node in parts.ChildElements)
{
if (node is Table)
{
ParseTable(docList, (Table)node, textBuilder);
}
}
}
}
}
public static void ParseText(List<WordprocessingDocument> wpd, Paragraph node, StringBuilder textBuilder)
{
Body body = wpd[0].MainDocumentPart.Document.Body;
Body body1 = wpd[1].MainDocumentPart.Document.Body;
string content = body1.InnerXml;
var paras = body.Elements<Paragraph>();
foreach (var para in paras)
{
foreach (var run in para.Elements<Run>())
{
foreach (var text in run.Elements<Text>())
{
if (text.Text.Contains("Veteran"))
{
run.InnerXml.Replace(run.InnerXml, content);
break;
}
}
}
}
}
public static void ParseTable(List<WordprocessingDocument> wpd, Table node, StringBuilder textBuilder)
{
foreach (var row in node.Descendants<TableRow>())
{
textBuilder.Append("| ");
foreach (var cell in row.Descendants<TableCell>())
{
foreach (var para in cell.Descendants<Paragraph>())
{
ParseText(wpd, para, textBuilder);
}
textBuilder.Append(" | ");
}
textBuilder.AppendLine("");
}
}
}
}
How to make this work ? Is there a better way to replace content with another docx file?
Not having enough detail for a specific answer, here's how you solve such problems in general:
Ensure you understand the Open XML specification and valid Open XML markup on an appropriate level of detail.
If you don't understand what w:document, w:body, w:p, w:r, w:t, w:tbl, etc. are and how they relate to each other, you have no chance.
You must look at actual Open XML markup, e.g., using the Open XML Productivity Tool or the Open XML Package Editor for Modern Visual Studios to get to an appropriate level of understanding and develop Open XML-based solutions.
Understand that most Open XML-related code transforms some source markup into some target markup. Therefore, you must:
understand the source and target markup first and then
define the transformation required to create the target from the source.
Depending on what you need to do, the Open XML Productivity Tool can help create the transforming code. If you have a source and target document, you can use the Productivity Tool to compare those documents. This shows the difference in the markup, so you see what markup is created, deleted, or changed. It even shows you the Open XML SDK-based code required to effect the change.
In my own use cases, I typically prefer to write recursive, pure functional transformations. While you need to wrap your head around the concept, this is an extremely powerful approach.
In your case, you should:
take a few representative, manually-created samples of source (A.docx with "Veteran" still to be replaced) and target (A.docx with "Veteran" replaced as desired) documents;
look at the Open XML markup of the source and target documents; and
write code that creates the target markup.
Once you have created code that at least tries to create valid target Open XML markup, you could come back with further questions in case you identify further issues.
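
As one possible starting point, and not a complete solution, here is a sketch of such a transformation. It relies on the fact that w:tbl is a sibling of w:p inside w:body and cannot live inside a w:r, which is why writing B's markup into a run cannot work for tables. The names BodyReplacer and ReplaceParagraph are made up for illustration, and the sketch assumes the placeholder sits in a top-level paragraph that may be replaced as a whole.

```csharp
using System.Linq;
using DocumentFormat.OpenXml.Wordprocessing;

// Hypothetical helper, not part of the Open XML SDK.
public static class BodyReplacer
{
    // Replaces the first top-level paragraph whose text contains 'marker'
    // with deep clones of the block-level children of 'source'.
    // Returns false when no such paragraph exists.
    public static bool ReplaceParagraph(Body target, Body source, string marker)
    {
        var placeholder = target.Elements<Paragraph>()
            .FirstOrDefault(p => p.InnerText.Contains(marker));
        if (placeholder == null)
        {
            return false;
        }

        // Skip w:sectPr so the target keeps its own section settings.
        var blocks = source.ChildElements
            .Where(e => !(e is SectionProperties))
            .ToList();

        foreach (var block in blocks)
        {
            // Clone, because an element can only have one parent.
            placeholder.InsertBeforeSelf(block.CloneNode(true));
        }

        placeholder.Remove();
        return true;
    }
}
```

This copies markup only. Styles, numbering definitions, images and other relationships that B.docx refers to are not carried over, so content that depends on them needs additional handling.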

How to use Html Agility Pack to pick out the specific text?

Below is an example portion of a block of the Html that I am trying to extract information from:
<a href="https://secure.tibia.com/community/?subtopic=characters&name=Alemao+Golpista" >Alemao Golpista</a></td><td style="width:10%;" >51</td><td style="width:20%;" >Knight</td></tr><tr class="Even" style="text-align:right;" ><td style="width:70%;text-align:left;" >
I am basically grabbing the entire Html which is a list of players online and trying to append them to a list with the: Name (Alemao Golpista), Level (51), and Vocation (Knight).
Using regex for it is a pain in the ass and pretty slow. How would I go about it using the Agility Pack?
Don't ever use regex to parse html files. As has already been stated you should use whatever HtmlagilityPack examples you can find even though they are scarce on their site. And the documentation isn't easy to find.
To get you started here is how you can load an HtmlDocument and get the anchor tags' href attributes.
HtmlDocument htmlDoc = new HtmlAgilityPack.HtmlDocument();
try{
var temp = new Uri(url.Url);
var request = (HttpWebRequest)WebRequest.Create(temp);
request.Method = "GET";
using (var response = (HttpWebResponse)request.GetResponse())
{
using (var stream = response.GetResponseStream())
{
htmlDoc.Load(stream, Encoding.GetEncoding("iso-8859-9"));
}
}
}catch(WebException ex){
Console.WriteLine(ex.Message);
}
HtmlNodeCollection c = htmlDoc.DocumentNode.SelectNodes("//a");
List<string> urls = new List<string>();
// SelectNodes returns null when no nodes match
if (c != null){
foreach(HtmlNode n in c){
urls.Add(n.GetAttributeValue("href", ""));
}
}
The above code gets you all the links of a webpage as a list of strings.
You should look into xpath. And you should also get the documentation of HAP and read it. I couldn't find the documentation anywhere, so I uploaded the one I already had on my computer.
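
For the name, level and vocation specifically, a rough sketch along the same lines could look like this. It assumes, based only on the snippet in the question, that each player sits in its own tr with the character link in the first td, the level in the second and the vocation in the third. OnlineListParser and Parse are made-up names.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

// Hypothetical helper, not part of HtmlAgilityPack.
public static class OnlineListParser
{
    // Returns one (name, level, vocation) tuple per recognised player row.
    public static List<Tuple<string, int, string>> Parse(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        var players = new List<Tuple<string, int, string>>();

        // Character links are recognised by their query string.
        var links = doc.DocumentNode.SelectNodes("//a[contains(@href, 'subtopic=characters')]");
        if (links == null)
        {
            return players;
        }

        foreach (var link in links)
        {
            var row = link.Ancestors("tr").FirstOrDefault();
            if (row == null) continue;

            var cells = row.Elements("td").ToList();
            if (cells.Count < 3) continue;

            int level;
            if (!int.TryParse(cells[1].InnerText.Trim(), out level)) continue;

            players.Add(Tuple.Create(
                HtmlEntity.DeEntitize(link.InnerText).Trim(),
                level,
                HtmlEntity.DeEntitize(cells[2].InnerText).Trim()));
        }
        return players;
    }
}
```

Rows that do not match the assumed layout are skipped rather than reported, so check the result count against the page if the markup changes.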

make script to download all Mp3 files from a page

I have a page that contains some links to .mp3/.wav files in that format
File Name
What I need is a script that will download all these files, instead of downloading them myself.
I know that I can use a regular expression to do something like that, but I don't know how. And what is the best choice for doing it (Java, C#, JavaScript)?
Any help will be appreciated
Thanks in Advance
You could use SgmlReader to parse the DOM and extract all the anchor links and then download the corresponding resources:
class Program
{
static void Main()
{
using (var reader = new SgmlReader())
{
reader.DocType = "HTML";
reader.Href = "http://www.example.com";
var doc = new XmlDocument();
doc.Load(reader);
var anchors = doc.SelectNodes("//a/@href[contains(., 'mp3') or contains(., 'wav')]");
foreach (XmlAttribute href in anchors)
{
using (var client = new WebClient())
{
var data = client.DownloadData(href.Value);
// TODO: do something with the downloaded data
}
}
}
}
}
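
One thing the code above leaves open is that href values are often relative, and that contains(., 'mp3') also matches URLs that merely mention "mp3" somewhere. A small sketch of resolving and filtering the links before downloading, using only the base class library, might look like this. AudioLinkFilter and Resolve are made-up names.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper for illustration.
public static class AudioLinkFilter
{
    // Resolves each href against the page URL and keeps only those
    // whose path ends in .mp3 or .wav (case-insensitive).
    public static List<Uri> Resolve(Uri pageUri, IEnumerable<string> hrefs)
    {
        var result = new List<Uri>();
        foreach (var href in hrefs)
        {
            Uri absolute;
            if (!Uri.TryCreate(pageUri, href, out absolute))
            {
                continue; // not a usable URL
            }

            // AbsolutePath excludes the query string and fragment.
            var extension = Path.GetExtension(absolute.AbsolutePath);
            if (extension.Equals(".mp3", StringComparison.OrdinalIgnoreCase) ||
                extension.Equals(".wav", StringComparison.OrdinalIgnoreCase))
            {
                result.Add(absolute);
            }
        }
        return result;
    }
}
```

The resulting absolute URIs can be passed to WebClient.DownloadFile or DownloadData as in the loop above.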
Well, if you want to go hard-core, I think parsing the page with DOMDocument ( http://php.net/manual/en/class.domdocument.php ) and retrieving the files with cURL would do it if you're ok with PHP.
How many files are we talking about here?
Python's Beautiful Soup library is well-suited to this task:
http://www.crummy.com/software/BeautifulSoup/
Could be used in this way:
import urllib2, re
from BeautifulSoup import BeautifulSoup
#open the URL
page = urllib2.urlopen("http://www.foo.com")
#parse the page
soup = BeautifulSoup(page)
#get all anchor elements that have an href attribute
anchors = soup.findAll("a", href=True)
#filter anchors based on their href attribute
filteredAnchors = filter(lambda a : re.search("\.wav",a["href"]) or re.search("\.mp3",a["href"]), anchors)
urlsToDownload = map(lambda a : a["href"],filteredAnchors)
#download each anchor url...
See here for instructions on downloading the mp3's from their URLs: How do I download a file over HTTP using Python?

Encoding error when using HTML Agility Pack

I'm trying to parse a html doc
using some code I found from this actual site
but I keep getting a parsing error
HtmlAgilityPack.HtmlDocument htmlDoc = new HtmlAgilityPack.HtmlDocument();
// There are various options, set as needed
htmlDoc.OptionFixNestedTags = true;
// filePath is a path to a file containing the html
htmlDoc.Load(@"C:\Documents and Settings\Mine\My Documents\Random.html");
// Use: htmlDoc.LoadHtml(htmlString); to load from a string
// ParseErrors is an ArrayList containing any errors from the Load statement
if (htmlDoc.ParseErrors != null && htmlDoc.ParseErrors.Count > 0)
{
// Handle any parse errors as required
MessageBox.Show("Oh no");
}
else
{
if (htmlDoc.DocumentNode != null)
{
HtmlAgilityPack.HtmlNode bodyNode = htmlDoc.DocumentNode.SelectSingleNode("//head");
if (bodyNode != null)
{
MessageBox.Show("Hello");
}
}
}
Any help would be appreciated :)
In the wild, HTML is likely to be non-conformant, non-compliant, and non-validating. Only XHTML or very simple HTML will load without populating ParseErrors. I've noticed that the HTML Agility Pack is fairly robust and will still build a decent DOM tree from most HTML sources, even if ParseErrors are generated. Drop the else, so that the code in that block runs whether or not any parse errors were reported.
If it did not build the DOM tree, then you should investigate the ParseError(s) that were generated. If it only built a partial tree, try recursing over the nodes, printing or messagebox'ing to see which parts of the DOM tree got built or not. You might not need the whole tree.
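
A minimal sketch of that kind of recursion, printing one indented line per element so you can see how much of the tree was built. DomDumper and Dump are made-up names; the node properties used are HtmlAgilityPack's own.

```csharp
using System.IO;
using HtmlAgilityPack;

// Hypothetical helper, not part of HtmlAgilityPack.
public static class DomDumper
{
    // Writes the name of every element under 'node', indented two
    // spaces per nesting level. Text and comment nodes are skipped.
    public static void Dump(HtmlNode node, TextWriter output, int depth = 0)
    {
        var isElement = node.NodeType == HtmlNodeType.Element;
        if (isElement)
        {
            output.WriteLine(new string(' ', depth * 2) + node.Name);
        }

        foreach (var child in node.ChildNodes)
        {
            // Only elements add a level, so the document node itself does not indent.
            Dump(child, output, isElement ? depth + 1 : depth);
        }
    }
}
```

Calling DomDumper.Dump(htmlDoc.DocumentNode, Console.Out) after Load shows the element outline regardless of what ParseErrors contains.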
