Encoding error when using HTML Agility Pack - c#

I'm trying to parse an HTML document using some code I found on this very site, but I keep getting a parsing error:
HtmlAgilityPack.HtmlDocument htmlDoc = new HtmlAgilityPack.HtmlDocument();

// There are various options, set as needed
htmlDoc.OptionFixNestedTags = true;

// filePath is a path to a file containing the html
htmlDoc.Load(@"C:\Documents and Settings\Mine\My Documents\Random.html");
// Use: htmlDoc.LoadHtml(htmlString); to load from a string

// ParseErrors is an IEnumerable<HtmlParseError> of any errors from the Load
// (Count() needs System.Linq)
if (htmlDoc.ParseErrors != null && htmlDoc.ParseErrors.Count() > 0)
{
    // Handle any parse errors as required
    MessageBox.Show("Oh no");
}
else
{
    if (htmlDoc.DocumentNode != null)
    {
        HtmlAgilityPack.HtmlNode headNode = htmlDoc.DocumentNode.SelectSingleNode("//head");
        if (headNode != null)
        {
            MessageBox.Show("Hello");
        }
    }
}
Any help would be appreciated :)

In the wild, HTML is likely to be non-conformant, non-compliant, and non-validating. Only XHTML or very simple HTML will load without populating ParseErrors. In my experience the HTML Agility Pack is fairly robust and will still build a decent DOM tree from most HTML sources, even when ParseErrors are generated. Drop the else and run that block unconditionally.
If it did not build a DOM tree at all, investigate the ParseErrors that were generated. If it built only a partial tree, recurse over the nodes, printing or showing a message box for each, to see which parts of the tree were built; you might not need the whole tree anyway.
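For instance, a minimal sketch of both checks, assuming the htmlDoc from the question and console output for brevity (HtmlParseError exposes Reason, Line, and LinePosition):

// Inspect whatever parse errors were recorded during Load
foreach (HtmlAgilityPack.HtmlParseError error in htmlDoc.ParseErrors)
{
    Console.WriteLine($"{error.Reason} at line {error.Line}, position {error.LinePosition}");
}

// Recurse over the DOM tree that was built anyway, to see what survived
void DumpNode(HtmlAgilityPack.HtmlNode node, int depth)
{
    Console.WriteLine(new string(' ', depth * 2) + node.Name);
    foreach (HtmlAgilityPack.HtmlNode child in node.ChildNodes)
        DumpNode(child, depth + 1);
}

DumpNode(htmlDoc.DocumentNode, 0);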


How to access OpenXML content by page number?

Using OpenXML, can I read the document content by page number?
wordDocument.MainDocumentPart.Document.Body gives the content of the full document.
public void OpenWordprocessingDocumentReadonly()
{
    string filepath = @"C:\...\test.docx";
    // Open a WordprocessingDocument based on a filepath.
    using (WordprocessingDocument wordDocument =
        WordprocessingDocument.Open(filepath, false))
    {
        // Assign a reference to the existing document body.
        Body body = wordDocument.MainDocumentPart.Document.Body;
        int pageCount = 0;
        if (wordDocument.ExtendedFilePropertiesPart.Properties.Pages.Text != null)
        {
            pageCount = Convert.ToInt32(wordDocument.ExtendedFilePropertiesPart.Properties.Pages.Text);
        }
        for (int i = 1; i <= pageCount; i++)
        {
            // Read the content by page number
        }
    }
}
MSDN Reference
Update 1:
It looks like page breaks are represented as below:
<w:p w:rsidR="003328B0" w:rsidRDefault="003328B0">
  <w:r>
    <w:br w:type="page" />
  </w:r>
</w:p>
So now I need to split the XML on that marker and take the InnerText of each piece, which will give me page-wise text.
Now the question becomes: how can I split the XML on that marker?
Update 2:
Page-break elements are written only for explicit page breaks; when text simply flows from one page to the next, no page-break XML element is set, so I'm back to the same challenge of identifying where the pages separate.
You cannot reference OOXML content via page numbering at the OOXML data level alone.
Hard page breaks are not the problem; hard page breaks can be counted.
Soft page breaks are the problem. These are calculated by line-break and pagination algorithms that are implementation-dependent; they are not intrinsic to the OOXML data. There is nothing to count.
What about w:lastRenderedPageBreak, which is a record of the position of a soft page break at the time the document was last rendered? No, w:lastRenderedPageBreak does not help in general either because:
By definition, the w:lastRenderedPageBreak position is stale when content has been changed since the document was last opened by a program that paginates its content.
In MS Word's implementation, w:lastRenderedPageBreak is known to be unreliable in various circumstances, including:
- when a table spans two pages
- when the next page starts with an empty paragraph
- for multi-column layouts with text boxes starting a new column
- for large images or long sequences of blank lines
If you're willing to accept a dependence on Word Automation, with all of its inherent licensing and server operation limitations, then you have a chance of determining page boundaries, page numberings, page counts, etc.
Otherwise, the only real answer is to move beyond page-based referencing frameworks that are dependent upon proprietary, implementation-specific pagination algorithms.
This is how I ended up doing it.
public void OpenWordprocessingDocumentReadonly()
{
    string filepath = @"C:\...\test.docx";
    Dictionary<int, string> pagewiseContent = new Dictionary<int, string>();
    int pageCount = 0;

    // Open a WordprocessingDocument based on a filepath.
    using (WordprocessingDocument wordDocument =
        WordprocessingDocument.Open(filepath, false))
    {
        // Assign a reference to the existing document body.
        Body body = wordDocument.MainDocumentPart.Document.Body;
        if (wordDocument.ExtendedFilePropertiesPart.Properties.Pages.Text != null)
        {
            pageCount = Convert.ToInt32(wordDocument.ExtendedFilePropertiesPart.Properties.Pages.Text);
        }

        int i = 1;
        StringBuilder pageContentBuilder = new StringBuilder();
        foreach (var element in body.ChildElements)
        {
            // Accumulate text until we hit an explicit page break
            if (element.InnerXml.IndexOf("<w:br w:type=\"page\" />", StringComparison.OrdinalIgnoreCase) < 0)
            {
                pageContentBuilder.Append(element.InnerText);
            }
            else
            {
                pagewiseContent.Add(i, pageContentBuilder.ToString());
                i++;
                pageContentBuilder = new StringBuilder();
            }
            // Flush whatever is left after the last element
            if (body.LastChild == element && pageContentBuilder.Length > 0)
            {
                pagewiseContent.Add(i, pageContentBuilder.ToString());
            }
        }
    }
}
Downside: this won't work in all scenarios. It works only when there is an explicit page break; if text flows from page 1 to page 2, there is no identifier to tell you that you are on page two.
Unfortunately, as the answers to "Why only some page numbers stored in XML of docx file?" explain, docx does not contain a reliable page-number service. The XML files carry no page numbers until Microsoft Word opens the document and renders it dynamically, even if you read OpenXML documentation such as https://learn.microsoft.com/en-us/dotnet/api/documentformat.openxml.wordprocessing.pagenumber?view=openxml-2.8.1 .
You can unzip a few docx files and search for "page" or "pg"; then you will see it. I did this on several kinds of docx files in my situation, and they all tell the same truth. Glad if this helps.
// Collect the paragraphs that contain a rendered (soft) page-break marker
List<Paragraph> allParagraphs = wp.MainDocumentPart.Document.Body.OfType<Paragraph>().ToList();
List<Paragraph> pageParagraphs = allParagraphs
    .Where(x => x.Descendants<LastRenderedPageBreak>().Count() == 1)
    .Distinct()
    .ToList();
Rename the .docx to .zip and open the docProps\app.xml file:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Properties xmlns="http://schemas.openxmlformats.org/officeDocument/2006/extended-properties" xmlns:vt="http://schemas.openxmlformats.org/officeDocument/2006/docPropsVTypes">
  <Template>Normal</Template>
  <TotalTime>0</TotalTime>
  <Pages>1</Pages>
  <Words>141</Words>
  <Characters>809</Characters>
  <Application>Microsoft Office Word</Application>
  <DocSecurity>0</DocSecurity>
  <Lines>6</Lines>
  <Paragraphs>1</Paragraphs>
  <ScaleCrop>false</ScaleCrop>
  <HeadingPairs>
    <vt:vector size="2" baseType="variant">
      <vt:variant>
        <vt:lpstr>Название</vt:lpstr>
      </vt:variant>
      <vt:variant>
        <vt:i4>1</vt:i4>
      </vt:variant>
    </vt:vector>
  </HeadingPairs>
  <TitlesOfParts>
    <vt:vector size="1" baseType="lpstr">
      <vt:lpstr/>
    </vt:vector>
  </TitlesOfParts>
  <Company/>
  <LinksUpToDate>false</LinksUpToDate>
  <CharactersWithSpaces>949</CharactersWithSpaces>
  <SharedDoc>false</SharedDoc>
  <HyperlinksChanged>false</HyperlinksChanged>
  <AppVersion>14.0000</AppVersion>
</Properties>
The OpenXML library reads wordDocument.ExtendedFilePropertiesPart.Properties.Pages.Text from the <Pages>1</Pages> property. These properties are created only by the WinWord application: if the document has changed since Word last saved it, wordDocument.ExtendedFilePropertiesPart.Properties.Pages.Text is no longer accurate, and if the document was created programmatically, wordDocument.ExtendedFilePropertiesPart is often null.
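Given that, any read of the cached page count should be defensive. A minimal sketch (TryGetCachedPageCount is a hypothetical helper name; assumes the usual DocumentFormat.OpenXml.Packaging using):

using DocumentFormat.OpenXml.Packaging;

static int? TryGetCachedPageCount(string filepath)
{
    using (WordprocessingDocument wordDocument =
        WordprocessingDocument.Open(filepath, false))
    {
        // ExtendedFilePropertiesPart is often null for programmatically
        // created documents, and Pages may be missing or stale.
        var pages = wordDocument.ExtendedFilePropertiesPart?.Properties?.Pages;
        if (pages == null || string.IsNullOrEmpty(pages.Text))
            return null;

        int pageCount;
        return int.TryParse(pages.Text, out pageCount) ? (int?)pageCount : null;
    }
}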

How to use Html Agility Pack to pick out the specific text?

Below is an example portion of a block of the Html that I am trying to extract information from:
<a href="https://secure.tibia.com/community/?subtopic=characters&name=Alemao+Golpista" >Alemao Golpista</a></td><td style="width:10%;" >51</td><td style="width:20%;" >Knight</td></tr><tr class="Even" style="text-align:right;" ><td style="width:70%;text-align:left;" >
I am basically grabbing the entire HTML, which is a list of players online, and trying to append each player to a list with their Name (Alemao Golpista), Level (51), and Vocation (Knight).
Using regex for it is a pain in the ass and pretty slow. How would I go about it using the Agility Pack?
Don't ever use regex to parse HTML files. As has already been stated, you should use whatever HtmlAgilityPack examples you can find, even though they are scarce on their site, and the documentation isn't easy to find.
To get you started here is how you can load an HtmlDocument and get the anchor tags' href attributes.
HtmlDocument htmlDoc = new HtmlAgilityPack.HtmlDocument();
try
{
    var temp = new Uri(url.Url);
    var request = (HttpWebRequest)WebRequest.Create(temp);
    request.Method = "GET";
    using (var response = (HttpWebResponse)request.GetResponse())
    {
        using (var stream = response.GetResponseStream())
        {
            htmlDoc.Load(stream, Encoding.GetEncoding("iso-8859-9"));
        }
    }
}
catch (WebException ex)
{
    Console.WriteLine(ex.Message);
}

HtmlNodeCollection c = htmlDoc.DocumentNode.SelectNodes("//a");
List<string> urls = new List<string>();
foreach (HtmlNode n in c)
{
    urls.Add(n.GetAttributeValue("href", ""));
}
The above code collects all the links of a webpage into a list of strings.
You should look into XPath, and you should also get the documentation of HAP and read it. I couldn't find the documentation anywhere online, so I uploaded the copy I already had on my computer.
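For the actual question, the rows of the online-players table can be walked with XPath. A minimal sketch, assuming each player row has exactly three td cells (name, level, vocation) as in the fragment above; the //tr[td[3]] selector may need adjusting to the page's real markup:

// Each qualifying row holds: name (inside an <a>), level, vocation
var rows = htmlDoc.DocumentNode.SelectNodes("//tr[td[3]]");
if (rows != null)
{
    foreach (HtmlNode row in rows)
    {
        HtmlNodeCollection cells = row.SelectNodes("./td");
        if (cells == null || cells.Count < 3) continue;

        string name = HtmlEntity.DeEntitize(cells[0].InnerText).Trim();
        string level = cells[1].InnerText.Trim();
        string vocation = cells[2].InnerText.Trim();
        Console.WriteLine($"{name} | {level} | {vocation}");
    }
}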

Parse Html Document Get All input fields with ID and Value

I have several thousand HTML invoices generated by ASP.NET (messy HTML) that I'm trying to parse and save into a database.
Basically like:
foreach (var htmlDoc in HtmlFolder)
{
    foreach (var inputBox in htmlDoc)
    {
        // Make a collection of IDs and values, insert into DB
    }
}
From all the other questions I've read, the best tool for this type of problem is the HtmlAgilityPack; however, for the life of me I can't get the documentation .chm file to work. Any ideas on how I could accomplish this with or without the Agility Pack?
Thanks in advance
A newer alternative to HtmlAgilityPack is CsQuery. See this later question on its relative performance merits, but its use of CSS selectors can't be beat:
var doc = CQ.CreateDocumentFromFile(htmldoc); // load and parse the file
var fields = doc["input"];                    // get input fields with a CSS selector
var pairs = fields.Select(node => new Tuple<string, string>(node.Id, node.Value())); // get (id, value) pairs
To get the CHM to work, you probably need to view the file's properties in Windows Explorer and unblock it.
The HTML Agility Pack is quite easy when you know your way around Linq-to-XML or XPath.
Basics you'll need to know:
// import the HtmlAgilityPack
using HtmlAgilityPack;

HtmlDocument doc = new HtmlDocument();

// Load your data
// -----------------------------
// Load doc from file:
doc.Load(pathToFile);
// OR
// Load doc from string:
doc.LoadHtml(contentsOfFile);
// -----------------------------

// Find what you're after
// -----------------------------
// Finding things using Linq
var nodes = doc.DocumentNode.DescendantsAndSelf("input")
    .Where(node => !string.IsNullOrWhiteSpace(node.Id)
        && node.Attributes["value"] != null
        && !string.IsNullOrWhiteSpace(node.Attributes["value"].Value));
// OR
// Finding things using XPath
var nodes = doc.DocumentNode
    .SelectNodes("//input[not(@id='') and not(@value='')]");
// -----------------------------
// looping through the nodes:
// the XPath interfaces can return null when no nodes are found
if (nodes != null)
{
    foreach (var node in nodes)
    {
        var id = node.Id;
        var value = node.Attributes["value"].Value;
    }
}
The easiest way to add the HtmlAgility Pack is using NuGet:
PM> Install-Package HtmlAgilityPack
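Putting the pieces together for the invoice scenario, a minimal sketch of collecting every input's id and value from one file (ReadInvoiceFields and pathToFile are hypothetical names; the database insert is left out):

using System.Collections.Generic;
using HtmlAgilityPack;

// Collect the id/value pair of every <input> in one invoice file
static Dictionary<string, string> ReadInvoiceFields(string pathToFile)
{
    var doc = new HtmlDocument();
    doc.Load(pathToFile);

    var fields = new Dictionary<string, string>();
    var nodes = doc.DocumentNode.SelectNodes("//input[@id and @value]");
    if (nodes == null) return fields; // no matching inputs found

    foreach (HtmlNode node in nodes)
    {
        // Last value wins if an id repeats in the messy markup
        fields[node.Id] = node.Attributes["value"].Value;
    }
    return fields;
}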
Hah, looks like the ideal time to make a shameless plug of a library I wrote!
This should be rather easy to accomplish with this library (that's built on top of HtmlAgility pack by the way!) : https://github.com/amoerie/htmlbuilders
(You can find the Nuget package here: https://www.nuget.org/packages/HtmlBuilders/ )
Code samples:
const string html = "<div class='invoice'><input type='text' name='abc' value='123'/><input id='ohgood' type='text' name='def' value='456'/></div>";
var htmlDocument = new HtmlDocument {OptionCheckSyntax = false}; // avoid exceptions when html is invalid
htmlDocument.Load(new StringReader(html));
var tag = HtmlTag.Parse(htmlDocument); // if there is a root tag
var tags = HtmlTag.ParseAll(htmlDocument); // if there is no root tag
// Find looks recursively through the entire DOM tree
var inputFields = tag.Find(t => string.Equals(t.TagName, "input"));
foreach (var inputField in inputFields)
{
    Console.WriteLine(inputField["type"]);
    Console.WriteLine(inputField["value"]);
    if (inputField.HasAttribute("id"))
        Console.WriteLine(inputField["id"]);
}
Note that inputField[attribute] will throw a 'KeyNotFoundException' if that field does not have the specified attribute name. That's because HtmlTag implements and reuses IDictionary logic for its attributes.
Edit: If you're not running this code in a web environment, you'll need to add a reference to System.Web. That's because this library makes use of the HtmlString class which can be found in System.Web. Just choose 'Add reference' and then you can find it under 'Assemblies > Framework'
You can download the HtmlAgilityPack documentation CHM file from here.
If the CHM file's contents are not visible, un-check the "Always ask before opening this file" check-box in the dialog that appears when you open it.
Note: The above dialog appears for unsigned files
Source: HtmlAgilityPack Documentation

C# Strip HTML Markup in XML

I really hope someone can help me with this issue; the solution should be in C#.
I have an XML file that is 36 MB in size, with 900k lines. Some nodes contain a lot of HTML markup, including invalid markup like:
<Obs><p>
<jantes -="" .="" 22.000="" apenas="" exclusive="" kms.="" leve="" liga="" o=""> </jantes></p>
I've tried different ways to clean this file, but only one of them is able to perform the task; however, since this runs inside a web application, it blocks the application, takes around 6 minutes to finish, and consumes around 450 MB of memory.
As this file is invalid XML, I cannot use XmlTextReader.
Using XSLT, based on Strip HTML-like characters (not markup) from XML with XSLT?, I strangely also have problems with HTML entities.
The process that worked (with some tweaks) is the one from http://www.codeproject.com/Articles/19652/HTML-Tag-Stripper
Thanks
Edit:
Following Kevin's suggestion, I'm trying to build a solution using the HTML Agility Pack, at least to do some benchmarks.
I'm stuck, however. Imagine the following XML node:
<Obs><p> I WANT THIS TEXT<jantes -="" .="" 22.000="" apenas="" exclusive="" kms.="" leve="" liga="" o=""> </jantes></p></Obs>
How can I strip the tags inside the "Obs" tag, keep the "Obs" tag itself, and also keep the text "I WANT THIS TEXT"? Basically this:
<Obs>I WANT THIS TEXT</Obs>
For now, this is the code I have:
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(text);
Queue<HtmlNode> nodes = new Queue<HtmlNode>(doc.DocumentNode.SelectNodes("./*|./text()"));
while (nodes.Count > 0)
{
    HtmlNode node = nodes.Dequeue();
    HtmlNode parentNode = node.ParentNode;
    HtmlNodeCollection childNodes = node.SelectNodes("./*|./text()");
    if (childNodes != null)
    {
        foreach (HtmlNode child in childNodes)
        {
            if (child.Name != "obs")
            {
                nodes.Enqueue(child);
            }
            else
            {
                childNodes = child.SelectNodes("//p|//jantes");
                foreach (HtmlNode nodeToStrip in childNodes)
                    nodeToStrip.ParentNode.RemoveChild(nodeToStrip);
            }
        }
    }
}
string s = doc.DocumentNode.InnerHtml;
Thanks :)
EDIT 2
OK, I was able to complete the task. However, it takes too much time (about 3 hours) and consumes 800 MB of memory.
Still need help!
Here is the code; it might help someone.
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(text);
Queue<HtmlNode> nodes = new Queue<HtmlNode>(doc.DocumentNode.SelectNodes("./*|./text()"));
while (nodes.Count > 0)
{
    HtmlNode node = nodes.Dequeue();
    HtmlNode parentNode = node.ParentNode;
    HtmlNodeCollection childNodes = node.SelectNodes("./*|./text()");
    if (childNodes != null)
    {
        foreach (HtmlNode child in childNodes)
        {
            if (child.Name != "obs")
            {
                nodes.Enqueue(child);
            }
            else
            {
                childNodes = child.SelectNodes("//p|//jantes");
                if (childNodes != null)
                {
                    foreach (HtmlNode nodeToStrip in childNodes)
                    {
                        var replacement = doc.CreateTextNode(nodeToStrip.InnerText);
                        nodeToStrip.ParentNode.ReplaceChild(replacement, nodeToStrip);
                    }
                }
            }
        }
    }
}
string s = doc.DocumentNode.InnerHtml;
Have you tried Html Agility Pack? Among its claims:
the parser is very tolerant with "real world" malformed HTML
you can fix a page the way you want, modify the DOM, add nodes, copy nodes, well... you name it
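For the specific Obs case above, a much cheaper approach may be to skip the queue entirely: select the obs nodes once and collapse each one to its own InnerText. A minimal sketch, assuming text holds the document as in the question's code (note that HAP stores element names in lower case, hence "//obs"):

using HtmlAgilityPack;

HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(text);

// One pass: find every <Obs> and replace its markup with plain text
HtmlNodeCollection obsNodes = doc.DocumentNode.SelectNodes("//obs");
if (obsNodes != null)
{
    foreach (HtmlNode obs in obsNodes)
    {
        string textOnly = obs.InnerText; // markup stripped, text kept
        obs.RemoveAllChildren();
        obs.AppendChild(doc.CreateTextNode(textOnly));
    }
}
string s = doc.DocumentNode.InnerHtml;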

determine if xml file contains data - c#

How do I know if my XML file has data besides the XML declaration? Some of the files contain only this:
<?xml version="1.0" encoding="UTF-8"?>
If I encounter such a file, I want to place it in an error directory.
You could use the XmlReader to avoid the overhead of XmlDocument. In your case, you will receive an exception because the root element is missing.
string xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>";
using (StringReader strReader = new StringReader(xml))
{
    // You can replace the StringReader object with the path of your xml file.
    // In that case, do not forget to remove the "using" lines above.
    using (XmlReader reader = XmlReader.Create(strReader))
    {
        try
        {
            while (reader.Read())
            {
            }
        }
        catch (XmlException ex)
        {
            // Catch xml exception
            // in your case: root element is missing
        }
    }
}
You can add a condition to the while (reader.Read()) loop once you have checked the first nodes, so that you avoid reading the entire XML file, since you only want to check whether the root element is missing; see the sketch below.
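For instance, a minimal sketch of that early exit (HasRootElement is a hypothetical helper name), which stops reading as soon as a real element is seen:

using System.Xml;

// Returns true if the file contains a root element, false if it only
// holds the declaration (Read() throws "Root element is missing").
static bool HasRootElement(string path)
{
    try
    {
        using (XmlReader reader = XmlReader.Create(path))
        {
            while (reader.Read())
            {
                // Stop at the first real element; no need to read the rest
                if (reader.NodeType == XmlNodeType.Element)
                    return true;
            }
        }
        return false;
    }
    catch (XmlException)
    {
        return false; // root element is missing or the file is malformed
    }
}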
I think the only way is to catch an exception when you try and load it, like this:
try
{
    System.Xml.XmlDocument doc = new System.Xml.XmlDocument();
    doc.Load(Server.MapPath("XMLFile.xml"));
}
catch (System.Xml.XmlException xmlEx)
{
    if (xmlEx.Message.Contains("Root element is missing"))
    {
        // Xml file is empty
    }
}
Yes, there is some overhead, but you should be performing sanity checks like this anyway. You should never trust input and the only way to reliably verify it is XML is to treat it like XML and see what .NET says about it!
XmlDocument xDoc = new XmlDocument();
xDoc.Load(filePath); // load your file first (note: Load itself throws if the root element is missing)
if (xDoc.ChildNodes.Count == 0)
{
    // xml document is empty
}
if (xDoc.ChildNodes.Count == 1)
{
    // the xml document contains only the declaration node
    // (if you are sure that the declaration is always at the beginning)
}
if (xDoc.ChildNodes.Count > 1)
{
    // declaration + n nodes (usually this count is 2: declaration + root node)
}
Haven't tried this...but should work.
try
{
    XmlDocument doc = new XmlDocument();
    doc.Load("test.xml");
}
catch (XmlException exc)
{
    // invalid file
}
EDIT: Based on feedback in the comments:
For large XML documents, see Thomas's answer; this approach can have performance issues.
But if the XML is valid and the program wants to process it anyway, this approach seems better.
If you aren't worried about validity, just check whether there is anything after the first ?>. I'm not entirely sure of the C# syntax (it's been too long since I used it), but the idea is: read the file, look for the first instance of ?>, and see if there is anything after that index; a sketch follows below.
However, if you want to use the XML later or you want to process the XML later, you should consider PK's answer and load the XML into an XmlDocument object. But if you have large XML documents that you don't need to process, then a solution more like mine, reading the file as text, might have less overhead.
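In C#, that text-only check could look like the following minimal sketch (HasContentAfterDeclaration is a hypothetical helper; it deliberately ignores whether the content is valid XML):

using System;
using System.IO;

// True if anything other than whitespace follows the first "?>"
static bool HasContentAfterDeclaration(string path)
{
    string text = File.ReadAllText(path);
    int index = text.IndexOf("?>", StringComparison.Ordinal);
    if (index < 0)
        return text.Trim().Length > 0; // no declaration at all
    return text.Substring(index + 2).Trim().Length > 0;
}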
You could check whether the XML document has a node (the root node) and whether that node has inner text or other children.
As long as you aren't concerned with the validity of the XML document, and only want to ensure that it has a tag other than the declaration, you could use simple text processing:
using System.IO;
using System.Text.RegularExpressions;

var regEx = new Regex("<[A-Za-z]");
bool foundTags = false;
string curLine = "";
using (var reader = new StreamReader(fileName))
{
    while (!reader.EndOfStream)
    {
        curLine = reader.ReadLine();
        if (regEx.IsMatch(curLine))
        {
            foundTags = true;
            break;
        }
    }
}
if (!foundTags)
{
    // file is bad, copy.
}
Keep in mind that there's a million other reasons that the file may be invalid, and the code above would validate a file consisting only of "<a". If your intent is to validate that the XML document is capable of being read, you should use the XmlDocument approach.
