XML data file not opening and working properly - C#

I developed a WPF application that uses XML as its database file. Yesterday, the program stopped working. After some checking, I saw that there was a problem with the Transaction.xml file. I tried opening it in IE, but got this error:
The XML page cannot be displayed
Cannot view XML input using style sheet. Please correct the error and then click the Refresh button, or try again later.
An invalid character was found in text content. Error processing resource 'file:///C:/RegisterMaintenance/Transaction.xml
Then, I tried opening the file in Notepad, and it showed weird characters (screenshot below).
In the end, it displays the right structure of XML. Please tell me what has gone wrong and why the XML is not displaying correctly. How can I get it back to a normal state? I am really worried, as this is my only data file. Any help or suggestion will be great.
Below is one of the methods that edits this file; there are other, similar pieces of code that use Transaction.xml.
public string Add()
{
XDocument doc1 = XDocument.Load(@"Ledgers.xml");
XElement elem = (from r in doc1.Descendants("Ledger")
where r.Element("Name").Value == this.Buyer
select r).First();
this.TinNo = (string)elem.Element("TinNo");
this.PhoneNo = (string)elem.Element("PhoneNo");
this.CommissionAmount = (this.CommissionRate * this.Amount) / 100;
this.CommissionAmount = Math.Round((decimal)this.CommissionAmount);
this.VatAmount = (this.CommissionAmount + this.Amount) * this.VatRate / 100;
this.VatAmount = Math.Round((decimal)this.VatAmount);
this.InvoiceAmount = this.Amount + this.CommissionAmount + this.VatAmount;
XDocument doc2 = XDocument.Load(@"Transactions.xml");
var record = from r in doc2.Descendants("Transaction")
where (int)r.Element("Serial") == Serial
select r;
foreach (XElement r in record)
{
r.Element("Invoice").Add(new XElement("InvoiceNo", this.InvoiceNo), new XElement("InvoiceDate", this.InvoiceDate),
new XElement("TinNo", this.TinNo), new XElement("PhoneNo", this.PhoneNo), new XElement("TruckNo", this.TruckNo), new XElement("Source", this.Source),
new XElement("Destination", this.Destination), new XElement("InvoiceAmount", this.InvoiceAmount),
new XElement("CommissionRate", this.CommissionRate), new XElement("CommissionAmount", this.CommissionAmount),
new XElement("VatRate", this.VatRate), new XElement("VatAmount", this.VatAmount));
}
doc2.Save(@"Transactions.xml");
return "Invoice Created Successfully";
}

C# is an Object-Oriented Programming (OOP) language, perhaps you should use some objects! How can you possibly test your code for accuracy?
You should separate out responsibilities, an example:
public class Vat
{
XElement self;
public Vat(XElement parent)
{
self = parent.Element("Vat");
if (null == self)
{
parent.Add(self = new XElement("Vat"));
// Initialize values
Amount = 0;
Rate = 0;
}
}
public XElement Element { get { return self; } }
public decimal Amount
{
get { return (decimal)self.Attribute("Amount"); }
set
{
XAttribute a = self.Attribute("Amount");
if (null == a)
self.Add(new XAttribute("Amount", value));
else
a.Value = value.ToString();
}
}
public decimal Rate
{
get { return (decimal)self.Attribute("Rate"); }
set
{
XAttribute a = self.Attribute("Rate");
if (null == a)
self.Add(new XAttribute("Rate", value));
else
a.Value = value.ToString();
}
}
}
All the Vat data will be in one node, and all the accessing of it will be in one testable class.
Your above foreach would look more like:
foreach(XElement r in record)
{
XElement invoice = new XElement("Invoice");
r.Add(invoice);
...
Vat vat = new Vat(invoice);
vat.Amount = this.VatAmount;
vat.Rate = this.VatRate;
}
That is readable! At a glance, from your code, I cannot even tell if invoice is the parent of Vat, but I can now!
Note: This isn't to say your code is at fault, it could be a hard-drive error, as that is what it looks like to me. But if you want people to peruse your code, make it readable and testable! Years from now if you or someone else has to change your code, if it isn't readable, it is useless.
Perhaps from this incident you learned two things:
1. Readability and testability.
2. Backups! (All my valuable XML files are in an SVN repository (TortoiseSVN) so I can compare what has changed, as well as keeping good backups. The SVN is backed up to online storage.)
An ideal next step is to take the code in the property setters and refactor it out into a static extension method that is both testable and reproducible:
public static class XAttributeExtensions
{
public static XAttribute SetAttribute(this XElement self, string name, object value)
{
// test for correct arguments
if (null == self)
throw new ArgumentNullException("XElement to SetAttribute method cannot be null!");
if (string.IsNullOrEmpty(name))
throw new ArgumentNullException("Attribute name cannot be null or empty to SetAttribute method!");
if (null == value) // how to handle?
value = ""; // or can throw an exception like one of the above.
// Now to the good stuff
XAttribute a = self.Attribute(name);
if (null == a)
self.Add(a = new XAttribute(name, value));
else
a.Value = value.ToString();
return a;
}
}
That is easily testable, very readable, and best of all it can be used over and over again with the same results!
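To show what I mean by testable, here is a quick sanity check of SetAttribute — a sketch using plain Debug.Assert rather than any particular test framework, and assuming the XAttributeExtensions class above is in scope:
using System.Diagnostics;
using System.Linq;
using System.Xml.Linq;

class SetAttributeTests
{
    static void Main()
    {
        var e = new XElement("Vat");

        e.SetAttribute("Amount", 42m);
        Debug.Assert((decimal)e.Attribute("Amount") == 42m); // attribute created

        e.SetAttribute("Amount", 7m);
        Debug.Assert((decimal)e.Attribute("Amount") == 7m);  // value updated in place
        Debug.Assert(e.Attributes("Amount").Count() == 1);   // no duplicate attribute
    }
}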
For example, the Amount property can be greatly simplified:
public decimal Amount
{
get { return (decimal)self.Attribute("Amount"); }
set { self.SetAttribute("Amount", value); }
}
I know this is a lot of boilerplate code, but I find it readable, extendable and, best of all, testable. If I want to add another value to Vat, I can just modify the class and not have to worry about whether I have added it in the right place. If Vat had children, I'd make another class that Vat had a property for.

The .xml file is clearly malformed. No browser or other program that reads XML files will be able to do anything with it. It doesn't matter that the XML becomes correct again after the first few lines.
So the error is almost certainly in whatever creates and/or edits your XML file. You should have a look there. Maybe the encoding is wrong. The most widely used encoding is UTF-8.
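If you need to salvage the current file before fixing the writer, something along these lines may recover the readable part — a minimal sketch, assuming the damage is limited to garbage bytes rather than missing markup:
using System;
using System.IO;
using System.Linq;
using System.Xml.Linq;

class XmlSalvage
{
    static void Main()
    {
        // Read the damaged file as text (path taken from the question).
        string raw = File.ReadAllText(@"C:\RegisterMaintenance\Transaction.xml");

        // Skip any leading garbage before the first markup character.
        int start = raw.IndexOf('<');
        if (start < 0) throw new InvalidDataException("No XML markup found.");

        // Drop control characters that are illegal in XML 1.0 text content.
        string candidate = new string(raw.Substring(start)
            .Where(c => c == '\t' || c == '\n' || c == '\r' || c >= ' ')
            .ToArray());

        // Parse throws an XmlException if the content is still malformed.
        XDocument doc = XDocument.Parse(candidate);
        doc.Save(@"C:\RegisterMaintenance\Transaction.recovered.xml");
    }
}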
Also, as a side note, XML is not really the best format for large databases (too much overhead), so switching to a binary format would be best. Even switching to JSON would bring a benefit.
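For comparison, here is what the JSON route could look like — a sketch only, using the Json.NET serializer and an illustrative Transaction class, not the classes from the question:
using System.Collections.Generic;
using System.IO;
using Newtonsoft.Json;

public class Transaction
{
    public int Serial { get; set; }
    public decimal Amount { get; set; }
}

public class JsonStore
{
    public static void Main()
    {
        var transactions = new List<Transaction>
        {
            new Transaction { Serial = 1, Amount = 1500m }
        };

        // Write the whole collection as indented JSON.
        File.WriteAllText("Transactions.json",
            JsonConvert.SerializeObject(transactions, Formatting.Indented));

        // Read it back into typed objects.
        var loaded = JsonConvert.DeserializeObject<List<Transaction>>(
            File.ReadAllText("Transactions.json"));
    }
}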


Do not return error since I expect null

Here I have an example of getting inventory from a JSON string.
inventory = JsonUtility.FromJson<InventoryModel>(GFile.GetPlayer(FileComponent.PlayerInventory));
Since I am loading that string from a file, it is possible that it is just blank. I want to first check for that, and I would do it like this:
if(GFile.GetPlayer(FileComponent.PlayerInventory) != " ")
{
inventory = JsonUtility.FromJson<InventoryModel>(GFile.GetPlayer(FileComponent.PlayerInventory));
}
So my question is: is there a more elegant way of doing this instead of writing an if statement like this?
Why not like this?
var player = GFile.GetPlayer(FileComponent.PlayerInventory);
if(!string.IsNullOrWhiteSpace(player)) {
inventory = JsonUtility.FromJson<InventoryModel>(player);
}
I'd suggest
string data = GFile.GetPlayer(FileComponent.PlayerInventory);
if(!string.IsNullOrWhiteSpace(data))
{
inventory = JsonUtility.FromJson<InventoryModel>(data);
}
This way you only call GetPlayer once, and it doesn't matter if the resulting data is an empty string or is full of whitespace - it still won't enter that block and set inventory.
Edit
For older versions of .NET, this will also work:
string data = GFile.GetPlayer(FileComponent.PlayerInventory);
if (data != null && data.Trim().Length > 0)
{
    inventory = JsonUtility.FromJson<InventoryModel>(data);
}
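If this guard shows up in many places, it could be wrapped in a small helper. This is a sketch only — TryFromJson is a made-up name, and JsonUtility is Unity's serializer as used in the question:
using UnityEngine;

public static class JsonLoadExtensions
{
    public static bool TryFromJson<T>(this string json, out T result)
    {
        result = default(T);
        if (string.IsNullOrWhiteSpace(json))
            return false;                        // blank file: leave result empty
        result = JsonUtility.FromJson<T>(json);
        return true;
    }
}

// Usage:
// InventoryModel inventory;
// if (GFile.GetPlayer(FileComponent.PlayerInventory).TryFromJson(out inventory))
// { /* use inventory */ }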

How can I improve the speed of loading xml documents

I need to load 2 types of XML documents; one has 50 sub-children and the other has the same 50 plus 800 additional ones. Performance is great with the smaller doc and acceptable with the larger doc until the number of children increases: 20k children * 50 sub-children = great performance, 20k children * 850 sub-children = slow performance. How would I skip looking for the extra descendants when they do not exist? My initial attempts led me to think I need separate classes, methods, viewmodels, and views for the small and large docs. Below is a condensed look at my code.
public class MyItem
{
private string layout;
private string column;
private string columnSpan;
private string row;
private string rowSpan;
private string background;
public MyItem(string layout, string column, string columnSpan, string row, string rowSpan, string background)
{
Layout = layout;
Column = column;
ColumnSpan = columnSpan;
Row = row;
RowSpan = rowSpan;
Background = background;
}
public string Layout
{
get { return this.layout; }
set { this.layout = value; }
}
(Not Shown - Column, ColumnSpan, Row, RowSpan, and Background which are handled the same way as Layout)
Just for this example, the code below shows only 6 sub-children; I am looking for a way to load XML docs using only the first 2 sub-children. That way I can use whatever load method is required by both the small and large XML docs.
internal class myDataSource
{
//Loads (MyList) xml file
public static List<MyItem> Load(string MyListFilename)
{
var myfiles = XDocument.Load(MyListFilename).Descendants("item").Select(
x => new MyItem(
(string)x.Element("layout"),
(string)x.Element("column"),
(string)x.Element("columnSpan"),
(string)x.Element("row"),
(string)x.Element("rowSpan"),
(string)x.Element("background")));
return myfiles.ToList();
}
}
public class MainViewModel : ViewModelBase
{
public void LoadMyList()
{
this.myfiles = new ObservableCollection<MyItemViewModel>();
List<MyItem> mybaseList = myDataSource.Load(MyListFilename);
foreach (MyItem myitem in mybaseList)
{
this.myfiles.Add(new MyItemViewModel(myitem));
}
this.mycollectionView = (ICollectionView)CollectionViewSource.GetDefaultView(myfiles);
if (this.mycollectionView == null)
throw new NullReferenceException("mycollectionView");
}
}
public class MyItemViewModel: ViewModelBase
{
private Models.MyItem myitem;
public MyItemViewModel(MyItem myitem)
{
if (myitem == null)
throw new NullReferenceException("myitem");
this.myitem = myitem;
}
public string Layout
{
get
{
return this.myitem.Layout;
}
set
{
this.myitem.Layout = value;
OnPropertyChanged("Layout");
}
}
(Not Shown - Column, ColumnSpan, Row, RowSpan, and Background which are handled the same way as Layout)
Instead of using Descendants, can you follow the direct path (i.e. use Elements)? That's the only way you'll keep from scanning nodes you know don't have items.
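For illustration, the direct-path version of the Load method could look like this — a sketch assuming the <item> elements are immediate children of the root; adjust the path if the real document nests them deeper:
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static List<MyItem> Load(string myListFilename)
{
    return XDocument.Load(myListFilename)
        .Root
        .Elements("item") // direct children only; Descendants would visit every node
        .Select(x => new MyItem(
            (string)x.Element("layout"),
            (string)x.Element("column"),
            (string)x.Element("columnSpan"),
            (string)x.Element("row"),
            (string)x.Element("rowSpan"),
            (string)x.Element("background")))
        .ToList();
}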
I think one thing you can do is not call ToList on the Select and keep it lazy, returning the IEnumerable that Select produces (sorry, I don't have a Windows box right now to test this on). When you do the foreach, you will iterate over it only once (instead of twice right now).
XDocument is handy, but if the problem is simply that the files are large and you only have to scan once through, XmlReader might be the better choice. It doesn't read the entire file, it reads one node at a time. You can manually skip through parts you are not interested in.
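A sketch combining both ideas — ReadToFollowing plus ReadSubtree over an XmlReader, so only one <item> subtree is materialized at a time (element names taken from the question's code):
using System.Collections.Generic;
using System.Xml;
using System.Xml.Linq;

public static List<MyItem> LoadStreaming(string myListFilename)
{
    var items = new List<MyItem>();
    using (XmlReader reader = XmlReader.Create(myListFilename))
    {
        while (reader.ReadToFollowing("item"))
        {
            // Load just this <item>; the other 20k items stay on disk.
            using (XmlReader sub = reader.ReadSubtree())
            {
                XElement x = XElement.Load(sub);
                items.Add(new MyItem(
                    (string)x.Element("layout"),
                    (string)x.Element("column"),
                    (string)x.Element("columnSpan"),
                    (string)x.Element("row"),
                    (string)x.Element("rowSpan"),
                    (string)x.Element("background")));
            }
        }
    }
    return items;
}
Each item's unused sub-children are still parsed, but they are discarded as soon as the item is mapped, so memory stays flat across the whole file.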

selection using LINQ

I have a sample xml file that looks like this:
<Books>
<Category Genre="Fiction" BookName="book_name" BookPrice="book_price_in_$" />
<Category Genre="Fiction" BookName="book_name" BookPrice="book_price_in_$" />
<Category Genre="NonFiction" BookName="book_name" BookPrice="book_price_in_$" />
<Category Genre="Children" BookName="book_name" BookPrice="book_price_in_$" />
</Books>
I need to collect all book names and book prices and pass them to some other method. Right now, I get all book names and book prices separately into two different List<string> collections using the following commands:
List<string> BookNameList = root.Elements("Category").Select(x => (string)x.Attribute("BookName")).ToList();
List<string> BookPriceList = root.Elements("Category").Select(x => (string)x.Attribute("BookPrice")).ToList();
I create a text file and send this back to the calling function (storing these results in a text file is a requirement; the text file has two fields, bookname and bookprice).
To write to the text file I use the following code:
for(int i = 0; i < BookNameList.Count; i++)
{
//write BookNameList[i] to file
// Write BookPriceList[i] to file
}
I somehow don't feel good about this approach. Suppose, for whatever reason, the two lists are not the same size; right now I do not take that into account. I also feel using foreach would be much more appropriate (I may be wrong). Is it possible to read both entries into a data structure (having two attributes, name and price) from LINQ? Then I could easily iterate over a list of that data structure with foreach.
I am using C# for programming.
Thanks,
[Edit]: Thanks everyone for the super quick responses; I chose the first answer I saw.
Selecting:
var books = root.Elements("Category").Select(x => new {
Name = (string)x.Attribute("BookName"),
Price = (string)x.Attribute("BookPrice")
}).ToList();
Looping:
foreach (var book in books)
{
// do something with
// book.Name
// book.Price
}
I think you could make it more tidy by some very simple means.
A somewhat simplified example follows.
First define the type Book:
public class Book
{
public Book(string name, string price)
{
Name = name;
Price = price;
}
public string Name { get; set; }
public string Price { get; set; } // could be decimal if we want a proper type.
}
Then project your XML data into a sequence of Books, like so:
var books = from category in root.Elements("Category")
            select new Book((string)category.Attribute("BookName"), (string)category.Attribute("BookPrice"));
If you want better efficiency I would advise using an XmlReader and writing to the file on every encountered Category, but it's quite involved compared to your approach. It depends on your requirements really; I don't think you have to worry about it too much unless speed is essential or the dataset is huge.
The streamed approach would look something like this:
using (var outputFile = OpenOutput())
using (XmlReader xml = OpenInput())
{
try
{
while (xml.ReadToFollowing("Category"))
{
if (xml.IsStartElement())
{
string name = xml.GetAttribute("BookName");
string price = xml.GetAttribute("BookPrice");
outputFile.WriteLine(string.Format("{0} {1}", name, price));
}
}
}
catch (XmlException xe)
{
// Parse error encountered. Would be possible to recover by checking
// ReadState and continue, this would obviously require some
// restructuring of the code.
// Catching parse errors is recommended because they could contain
// sensitive information about the host environment that we don't
// want to bubble up.
throw new XmlException("Uh-oh");
}
}
Bear in mind that if your nodes have XML namespaces you must register those with the XmlReader through a NameTable or it won't recognize the nodes.
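For example, a namespace-aware variant of the loop above might look like this — a sketch reusing the OpenInput helper from the snippet, with a made-up namespace URI; passing the URI to ReadToFollowing saves registering a prefix by hand:
const string ns = "http://example.com/books";
using (XmlReader xml = OpenInput())
{
    while (xml.ReadToFollowing("Category", ns))
    {
        // Unprefixed attributes carry no namespace, so the plain names still work.
        string name = xml.GetAttribute("BookName");
        string price = xml.GetAttribute("BookPrice");
    }
}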
You can do this with a single query and a foreach loop.
var namesAndPrices = from category in root.Elements("Category")
select new
{
Name = category.Attribute("BookName").Value,
Price = category.Attribute("BookPrice").Value
};
foreach (var nameAndPrice in namesAndPrices)
{
// TODO: Output to disk
}
To build on Jeff's solution, if you need to pass this collection into another function as an argument you can abuse the KeyValuePair data structure a little bit and do something along the lines of:
var namesAndPrices = from category in root.Elements("Category")
                     select new KeyValuePair<string, string>(
                         category.Attribute("BookName").Value,
                         category.Attribute("BookPrice").Value);
// looping that happens in another function
// Key = Name
// Value = Price
foreach (var nameAndPrice in namesAndPrices)
{
// TODO: Output to disk
}

What is the best way to parse html in C#? [closed]

I'm looking for a library/method to parse an html file with more html specific features than generic xml parsing libraries.
Html Agility Pack
This is an agile HTML parser that builds a read/write DOM and supports plain XPATH or XSLT (you actually don't HAVE to understand XPATH nor XSLT to use it, don't worry...). It is a .NET code library that allows you to parse "out of the web" HTML files. The parser is very tolerant with "real world" malformed HTML. The object model is very similar to what System.Xml proposes, but for HTML documents (or streams).
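A minimal usage sketch (nothing here beyond the library's documented HtmlDocument API; note that SelectNodes returns null when nothing matches, a common gotcha):
using System;
using HtmlAgilityPack;

class HapExample
{
    static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml("<html><body><a href='http://example.com'>link</a></body></html>");

        // XPath query over the parsed (and repaired) DOM.
        var links = doc.DocumentNode.SelectNodes("//a[@href]");
        if (links != null)
            foreach (HtmlNode link in links)
                Console.WriteLine(link.GetAttributeValue("href", ""));
    }
}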
You could use TidyNet.Tidy to convert the HTML to XHTML, and then use an XML parser.
Another alternative would be to use the builtin engine mshtml:
using mshtml;
...
object[] oPageText = { html };
HTMLDocument doc = new HTMLDocumentClass();
IHTMLDocument2 doc2 = (IHTMLDocument2)doc;
doc2.write(oPageText);
This allows you to use JavaScript-like functions such as getElementById().
I found a project called Fizzler that takes a jQuery/Sizzler approach to selecting HTML elements. It's based on HTML Agility Pack. It's currently in beta and only supports a subset of CSS selectors, but it's pretty damn cool and refreshing to use CSS selectors over nasty XPath.
http://code.google.com/p/fizzler/
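Usage looks roughly like this — a sketch assuming the QuerySelectorAll extension from the Fizzler.Systems.HtmlAgilityPack namespace in current releases; the beta mentioned above may differ:
using System;
using Fizzler.Systems.HtmlAgilityPack;
using HtmlAgilityPack;

class FizzlerExample
{
    static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml("<div class='content'><p>first</p><p>second</p></div>");

        // CSS selector instead of XPath.
        foreach (HtmlNode p in doc.DocumentNode.QuerySelectorAll("div.content > p"))
            Console.WriteLine(p.InnerText);
    }
}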
You can do a lot without going nuts on 3rd-party products and mshtml (i.e. interop): use the System.Windows.Forms.WebBrowser. From there, you can do such things as GetElementById on an HtmlDocument or GetElementsByTagName on HtmlElements. If you want to actually interface with the browser (simulate button clicks for example), you can use a little reflection (imo a lesser evil than interop) to do it:
var wb = new WebBrowser();
... tell the browser to navigate (tangential to this question). Then, in the DocumentCompleted event, you can simulate clicks like this:
var doc = wb.Document;
var elem = doc.GetElementById(elementId);
object obj = elem.DomElement;
System.Reflection.MethodInfo mi = obj.GetType().GetMethod("click");
mi.Invoke(obj, new object[0]);
You can do similar reflection stuff to submit forms, etc.
Enjoy.
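For instance, submitting a form works with the same trick — loginForm is a hypothetical element id, and the DOM form object exposes submit() just as links expose click():
var form = doc.GetElementById("loginForm");
object dom = form.DomElement;
System.Reflection.MethodInfo submit = dom.GetType().GetMethod("submit");
submit.Invoke(dom, new object[0]);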
I've written some code that provides "LINQ to HTML" functionality. I thought I would share it here. It is based on Majestic 12. It takes the Majestic-12 results and produces LINQ XML elements. At that point you can use all your LINQ to XML tools against the HTML. As an example:
IEnumerable<XNode> auctionNodes = Majestic12ToXml.Majestic12ToXml.ConvertNodesToXml(byteArrayOfAuctionHtml);
foreach (XElement anchorTag in auctionNodes.OfType<XElement>().DescendantsAndSelf("a")) {
if (anchorTag.Attribute("href") == null)
continue;
Console.WriteLine(anchorTag.Attribute("href").Value);
}
I wanted to use Majestic-12 because I know it has a lot of built-in knowledge with regards to HTML that is found in the wild. What I've found though is that to map the Majestic-12 results to something that LINQ will accept as XML requires additional work. The code I'm including does a lot of this cleansing, but as you use this you will find pages that are rejected. You'll need to fix up the code to address that. When an exception is thrown, check exception.Data["source"] as it is likely set to the HTML tag that caused the exception. Handling the HTML in a nice manner is at times not trivial...
So now that expectations are realistically low, here's the code :)
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using Majestic12;
using System.IO;
using System.Xml.Linq;
using System.Diagnostics;
using System.Text.RegularExpressions;
namespace Majestic12ToXml {
public class Majestic12ToXml {
static public IEnumerable<XNode> ConvertNodesToXml(byte[] htmlAsBytes) {
HTMLparser parser = OpenParser();
parser.Init(htmlAsBytes);
XElement currentNode = new XElement("document");
HTMLchunk m12chunk = null;
int xmlnsAttributeIndex = 0;
string originalHtml = "";
while ((m12chunk = parser.ParseNext()) != null) {
try {
Debug.Assert(!m12chunk.bHashMode); // popular default for Majestic-12 setting
XNode newNode = null;
XElement newNodesParent = null;
switch (m12chunk.oType) {
case HTMLchunkType.OpenTag:
// Tags are added as a child to the current tag,
// except when the new tag implies the closure of
// some number of ancestor tags.
newNode = ParseTagNode(m12chunk, originalHtml, ref xmlnsAttributeIndex);
if (newNode != null) {
currentNode = FindParentOfNewNode(m12chunk, originalHtml, currentNode);
newNodesParent = currentNode;
newNodesParent.Add(newNode);
currentNode = newNode as XElement;
}
break;
case HTMLchunkType.CloseTag:
if (m12chunk.bEndClosure) {
newNode = ParseTagNode(m12chunk, originalHtml, ref xmlnsAttributeIndex);
if (newNode != null) {
currentNode = FindParentOfNewNode(m12chunk, originalHtml, currentNode);
newNodesParent = currentNode;
newNodesParent.Add(newNode);
}
}
else {
XElement nodeToClose = currentNode;
string m12chunkCleanedTag = CleanupTagName(m12chunk.sTag, originalHtml);
while (nodeToClose != null && nodeToClose.Name.LocalName != m12chunkCleanedTag)
nodeToClose = nodeToClose.Parent;
if (nodeToClose != null)
currentNode = nodeToClose.Parent;
Debug.Assert(currentNode != null);
}
break;
case HTMLchunkType.Script:
newNode = new XElement("script", "REMOVED");
newNodesParent = currentNode;
newNodesParent.Add(newNode);
break;
case HTMLchunkType.Comment:
newNodesParent = currentNode;
if (m12chunk.sTag == "!--")
newNode = new XComment(m12chunk.oHTML);
else if (m12chunk.sTag == "![CDATA[")
newNode = new XCData(m12chunk.oHTML);
else
throw new Exception("Unrecognized comment sTag");
newNodesParent.Add(newNode);
break;
case HTMLchunkType.Text:
currentNode.Add(m12chunk.oHTML);
break;
default:
break;
}
}
catch (Exception e) {
var wrappedE = new Exception("Error using Majestic12.HTMLChunk, reason: " + e.Message, e);
// the original html is copied for tracing/debugging purposes
originalHtml = new string(htmlAsBytes.Skip(m12chunk.iChunkOffset)
.Take(m12chunk.iChunkLength)
.Select(B => (char)B).ToArray());
wrappedE.Data.Add("source", originalHtml);
throw wrappedE;
}
}
while (currentNode.Parent != null)
currentNode = currentNode.Parent;
return currentNode.Nodes();
}
static XElement FindParentOfNewNode(Majestic12.HTMLchunk m12chunk, string originalHtml, XElement nextPotentialParent) {
string m12chunkCleanedTag = CleanupTagName(m12chunk.sTag, originalHtml);
XElement discoveredParent = null;
// Get a list of all ancestors
List<XElement> ancestors = new List<XElement>();
XElement ancestor = nextPotentialParent;
while (ancestor != null) {
ancestors.Add(ancestor);
ancestor = ancestor.Parent;
}
// Check if the new tag implies a previous tag was closed.
if ("form" == m12chunkCleanedTag) {
discoveredParent = ancestors
.Where(XE => m12chunkCleanedTag == XE.Name)
.Take(1)
.Select(XE => XE.Parent)
.FirstOrDefault();
}
else if ("td" == m12chunkCleanedTag) {
discoveredParent = ancestors
.TakeWhile(XE => "tr" != XE.Name)
.Where(XE => m12chunkCleanedTag == XE.Name)
.Take(1)
.Select(XE => XE.Parent)
.FirstOrDefault();
}
else if ("tr" == m12chunkCleanedTag) {
discoveredParent = ancestors
.TakeWhile(XE => !("table" == XE.Name
|| "thead" == XE.Name
|| "tbody" == XE.Name
|| "tfoot" == XE.Name))
.Where(XE => m12chunkCleanedTag == XE.Name)
.Take(1)
.Select(XE => XE.Parent)
.FirstOrDefault();
}
else if ("thead" == m12chunkCleanedTag
|| "tbody" == m12chunkCleanedTag
|| "tfoot" == m12chunkCleanedTag) {
discoveredParent = ancestors
.TakeWhile(XE => "table" != XE.Name)
.Where(XE => m12chunkCleanedTag == XE.Name)
.Take(1)
.Select(XE => XE.Parent)
.FirstOrDefault();
}
return discoveredParent ?? nextPotentialParent;
}
static string CleanupTagName(string originalName, string originalHtml) {
string tagName = originalName;
tagName = tagName.TrimStart(new char[] { '?' }); // for nodes <?xml >
if (tagName.Contains(':'))
tagName = tagName.Substring(tagName.LastIndexOf(':') + 1);
return tagName;
}
static readonly Regex _startsAsNumeric = new Regex(@"^[0-9]", RegexOptions.Compiled);
static bool TryCleanupAttributeName(string originalName, ref int xmlnsIndex, out string result) {
result = null;
string attributeName = originalName;
if (string.IsNullOrEmpty(originalName))
return false;
if (_startsAsNumeric.IsMatch(originalName))
return false;
//
// transform xmlns attributes so they don't actually create any XML namespaces
//
if (attributeName.ToLower().Equals("xmlns")) {
attributeName = "xmlns_" + xmlnsIndex.ToString(); ;
xmlnsIndex++;
}
else {
if (attributeName.ToLower().StartsWith("xmlns:")) {
attributeName = "xmlns_" + attributeName.Substring("xmlns:".Length);
}
//
// trim trailing \"
//
attributeName = attributeName.TrimEnd(new char[] { '\"' });
attributeName = attributeName.Replace(":", "_");
}
result = attributeName;
return true;
}
static Regex _weirdTag = new Regex(@"^<!\[.*\]>$"); // matches "<![if !supportEmptyParas]>"
static Regex _aspnetPrecompiled = new Regex(@"^<%.*%>$"); // matches "<%# ... %>"
static Regex _shortHtmlComment = new Regex(@"^<!-.*->$"); // matches "<!-Extra_Images->"
static XElement ParseTagNode(Majestic12.HTMLchunk m12chunk, string originalHtml, ref int xmlnsIndex) {
if (string.IsNullOrEmpty(m12chunk.sTag)) {
if (m12chunk.sParams.Length > 0 && m12chunk.sParams[0].ToLower().Equals("doctype"))
return new XElement("doctype");
if (_weirdTag.IsMatch(originalHtml))
return new XElement("REMOVED_weirdBlockParenthesisTag");
if (_aspnetPrecompiled.IsMatch(originalHtml))
return new XElement("REMOVED_ASPNET_PrecompiledDirective");
if (_shortHtmlComment.IsMatch(originalHtml))
return new XElement("REMOVED_ShortHtmlComment");
// Nodes like "<br <br>" will end up with a m12chunk.sTag==""... We discard these nodes.
return null;
}
string tagName = CleanupTagName(m12chunk.sTag, originalHtml);
XElement result = new XElement(tagName);
List<XAttribute> attributes = new List<XAttribute>();
for (int i = 0; i < m12chunk.iParams; i++) {
if (m12chunk.sParams[i] == "<!--") {
// an HTML comment was embedded within a tag. This comment and its contents
// will be interpreted as attributes by Majestic-12... skip this attributes
for (; i < m12chunk.iParams; i++) {
if (m12chunk.sTag == "--" || m12chunk.sTag == "-->")
break;
}
continue;
}
if (m12chunk.sParams[i] == "?" && string.IsNullOrEmpty(m12chunk.sValues[i]))
continue;
string attributeName = m12chunk.sParams[i];
if (!TryCleanupAttributeName(attributeName, ref xmlnsIndex, out attributeName))
continue;
attributes.Add(new XAttribute(attributeName, m12chunk.sValues[i]));
}
// If attributes are duplicated with different values, we complain.
// If attributes are duplicated with the same value, we remove all but 1.
var duplicatedAttributes = attributes.GroupBy(A => A.Name).Where(G => G.Count() > 1);
foreach (var duplicatedAttribute in duplicatedAttributes) {
if (duplicatedAttribute.GroupBy(DA => DA.Value).Count() > 1)
throw new Exception("Attribute value was given different values");
attributes.RemoveAll(A => A.Name == duplicatedAttribute.Key);
attributes.Add(duplicatedAttribute.First());
}
result.Add(attributes);
return result;
}
static HTMLparser OpenParser() {
HTMLparser oP = new HTMLparser();
// The code+comments in this function are from the Majestic-12 sample documentation.
// ...
// This is optional, but if you want high performance then you may
// want to set chunk hash mode to FALSE. This would result in tag params
// being added to string arrays in HTMLchunk object called sParams and sValues, with number
// of actual params being in iParams. See code below for details.
//
// When TRUE (and its default) tag params will be added to hashtable HTMLchunk (object).oParams
oP.SetChunkHashMode(false);
// if you set this to true then original parsed HTML for given chunk will be kept -
// this will reduce performance somewhat, but may be desireable in some cases where
// reconstruction of HTML may be necessary
oP.bKeepRawHTML = false;
// if set to true (it is false by default), then entities will be decoded: this is essential
// if you want to get strings that contain final representation of the data in HTML, however
// you should be aware that if you want to use such strings into output HTML string then you will
// need to do Entity encoding or same string may fail later
oP.bDecodeEntities = true;
// we have option to keep most entities as is - only replace stuff like &nbsp;
// this is called Mini Entities mode - it is handy when HTML will need
// to be re-created after it was parsed, though in this case really
// entities should not be parsed at all
oP.bDecodeMiniEntities = true;
if (!oP.bDecodeEntities && oP.bDecodeMiniEntities)
oP.InitMiniEntities();
// if set to true, then in case of Comments and SCRIPT tags the data set to oHTML will be
// extracted BETWEEN those tags, rather than include complete RAW HTML that includes tags too
// this only works if auto extraction is enabled
oP.bAutoExtractBetweenTagsOnly = true;
// if true then comments will be extracted automatically
oP.bAutoKeepComments = true;
// if true then scripts will be extracted automatically:
oP.bAutoKeepScripts = true;
// if this option is true then whitespace before start of tag will be compressed to single
// space character in string: " ", if false then full whitespace before tag will be returned (slower)
// you may only want to set it to false if you want exact whitespace between tags, otherwise it is just
// a waste of CPU cycles
oP.bCompressWhiteSpaceBeforeTag = true;
// if true (default) then tags with attributes marked as CLOSED (/ at the end) will be automatically
// forced to be considered as open tags - this is no good for XML parsing, but I keep it for backwards
// compatibility for my stuff as it makes it easier to avoid checking for same tag which is both closed
// or open
oP.bAutoMarkClosedTagsWithParamsAsOpen = false;
return oP;
}
}
}
The Html Agility Pack has been mentioned before - if you are going for speed, you might also want to check out the Majestic-12 HTML parser. Its handling is rather clunky, but it delivers a really fast parsing experience.
I think @Erlend's use of HTMLDocument is the best way to go. However, I have also had good luck using this simple library:
SgmlReader
No 3rd-party lib: a WebBrowser class solution that can run in Console and ASP.NET apps
using System;
using System.Collections.Generic;
using System.Text;
using System.Windows.Forms;
using System.Threading;
class ParseHTML
{
public ParseHTML() { }
private string ReturnString;
public string doParsing(string html)
{
Thread t = new Thread(TParseMain);
t.SetApartmentState(ApartmentState.STA);
t.Start((object)html);
t.Join();
return ReturnString;
}
private void TParseMain(object html)
{
WebBrowser wbc = new WebBrowser();
wbc.DocumentText = "feces of a dummy"; //;magic words
HtmlDocument doc = wbc.Document.OpenNew(true);
doc.Write((string)html);
this.ReturnString = doc.Body.InnerHtml + " do here something";
return;
}
}
usage:
string myhtml = "<HTML><BODY>This is a new HTML document.</BODY></HTML>";
Console.WriteLine("before:" + myhtml);
myhtml = (new ParseHTML()).doParsing(myhtml);
Console.WriteLine("after:" + myhtml);
The trouble with parsing HTML is that it isn't an exact science. If it were XHTML that you were parsing, then things would be a lot easier (as you mention, you could use a general XML parser). Because HTML isn't necessarily well-formed XML, you will run into lots of problems trying to parse it. It almost needs to be done on a site-by-site basis.
I've used ZetaHtmlTidy in the past to load random websites and then hit against various parts of the content with XPath (e.g. /html/body//p[@class='textblock']). It worked well, but there were some exceptional sites that it had problems with, so I don't know if it's the absolute best solution.
You could use an HTML DTD and the generic XML parsing libraries.
Use WatiN if you need to see the impact of JS on the page [and you're prepared to start a browser]
Depending on your needs you might go for the more feature-rich libraries. I tried most/all of the solutions suggested, but what stood out head & shoulders was Html Agility Pack. It is a very forgiving and flexible parser.
Try this script.
http://www.biterscripting.com/SS_URLs.html
When I use it with this url,
script SS_URLs.txt URL("http://stackoverflow.com/questions/56107/what-is-the-best-way-to-parse-html-in-c")
It shows me all the links on the page for this thread.
http://sstatic.net/so/all.css
http://sstatic.net/so/favicon.ico
http://sstatic.net/so/apple-touch-icon.png
.
.
.
You can modify that script to check for images, variables, whatever.
I wrote some classes for parsing HTML tags in C#. They are nice and simple if they meet your particular needs.
You can read an article about them and download the source code at http://www.blackbeltcoder.com/Articles/strings/parsing-html-tags-in-c.
There's also an article about a generic parsing helper class at http://www.blackbeltcoder.com/Articles/strings/a-text-parsing-helper-class.

