Checking multiple XML files - c#

I am re-wording this from my original post:
I have two XML files, each relating to a given year, for example 18/19 and 17/18. They conform to the same structure, and below is a small sample from one of these files. What I want, in C#, is to compare all records in these files where the Given Name, the Family Name, the NI Number and the Date of Birth are the same BUT the Learner Ref Number is different. I need to compare them, then push only these records into a data table so I can then push them into a spreadsheet (the spreadsheet bit I can do). I currently have the below as a starting point, but am still very much stuck.
Firstly, I have my Import button click handler:
private void Btn_Import_Click(object sender, RoutedEventArgs e)
{
ILRChecks.ILRReport.CrossYear();
}
This then calls the class that eventually writes the file to my location:
using System.Data;
using System.IO;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using ILRValidation;
using InfExcelExtension;
namespace ILRChecks
{
internal static partial class ILRReport
{
internal static void CrossYear()
{
DataSet ds_CrossYearChecks =
ILRValidation.Validation.CrossYearChecks(Global.fileNames);
string output = Path.Combine(Global.foldername, "ULIN_Issues" +
".xlsx");
ds_CrossYearChecks.ToWorkBook(output);
}
}
}
And this is the bit I'm stuck on, which is actually finding the differences:
using System;
using System.Collections.Generic;
using System.Data;
using System.IO;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
namespace ILRValidation
{
public static partial class Validation
{
public static DataSet CrossYearChecks(DataSet ds_CrossYearChecks)
{
return CrossYearChecks(ds_CrossYearChecks);
}
public static DataSet CrossYearChecks(string[] xmlPath)
{
DataSet ds_xmlCrossYear = new DataSet();
return CrossYearChecks(ds_xmlCrossYear);
}
}
}
XML:
<Learner>
<LearnRefNumber></LearnRefNumber>
<ULN></ULN>
<FamilyName></FamilyName>
<GivenNames></GivenNames>
<DateOfBirth></DateOfBirth>
<Ethnicity></Ethnicity>
<Sex></Sex>
<LLDDHealthProb></LLDDHealthProb>
<NINumber></NINumber>
<PriorAttain></PriorAttain>
<MathGrade></MathGrade>
<EngGrade></EngGrade>
<PostcodePrior></PostcodePrior>
<Postcode></Postcode>
<AddLine1></AddLine1>
<AddLine3></AddLine3>
<Email></Email>

Well, you can traverse both XML files recursively and write down all the encountered changes. Something like this should be helpful:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml;
using System.Xml.Linq;
static string AppendPrefix(string oldPrefix, string addition) =>
oldPrefix == "" ? addition : $"{oldPrefix}.{addition}";
static void CompareElements(string prefix, XElement d1, XElement d2)
{
// 1. compare names
var newPrefix = AppendPrefix(prefix, d1.Name.ToString());
if (d1.Name != d2.Name)
{
Console.WriteLine(
$"Name mismatch: {newPrefix} != {AppendPrefix(prefix, d2.Name.ToString())}");
return;
}
// 2. compare attributes
var attrs = d1.Attributes().OrderBy(a => a.Name);
var unpairedAttributes = new HashSet<XAttribute>(d2.Attributes());
foreach (var attr in attrs)
{
var otherAttr = d2.Attributes(attr.Name).SingleOrDefault();
if (otherAttr == null)
{
Console.WriteLine($"No new attr: {newPrefix}/{attr.Name}");
continue;
}
unpairedAttributes.Remove(otherAttr);
if (attr.Value != otherAttr.Value)
Console.WriteLine(
$"Attr value mismatch: {newPrefix}/{attr.Name}: {attr.Value} != {otherAttr.Value}");
}
foreach (var attr in unpairedAttributes)
Console.WriteLine($"No old attr: {newPrefix}/{attr.Name}");
// 3. compare subelements
var leftNodes = d1.Nodes().ToList();
var rightNodes = d2.Nodes().ToList();
var smallerCount = Math.Min(leftNodes.Count, rightNodes.Count);
for (int i = 0; i < smallerCount; i++)
CompareNodes(newPrefix, i, leftNodes[i], rightNodes[i]);
if (leftNodes.Count > smallerCount)
Console.WriteLine($"Extra {leftNodes.Count - smallerCount} nodes at old file");
if (rightNodes.Count > smallerCount)
Console.WriteLine($"Extra {rightNodes.Count - smallerCount} nodes at new file");
}
static void CompareNodes(string prefix, int index, XNode n1, XNode n2)
{
if (n1.NodeType != n2.NodeType)
{
Console.WriteLine($"Node type mismatch: {prefix}/[{index}]");
return;
}
switch (n1.NodeType)
{
case XmlNodeType.Element:
CompareElements(prefix, (XElement)n1, (XElement)n2);
break;
case XmlNodeType.Text:
CompareText(prefix, index, (XText)n1, (XText)n2);
break;
}
}
static void CompareText(string prefix, int index, XText t1, XText t2)
{
if (t1.Value != t2.Value)
Console.WriteLine($"Text mismatch at {prefix}[{index}]");
}
Usage:
XDocument d1 = <get document #1 from somewhere>,
d2 = <get document #2 from somewhere>;
CompareNodes("", 0, d1.Root, d2.Root);
Obviously, instead of writing to console you should write to the appropriate spreadsheet.
Note that I'm ignoring the attribute reorder but not subnode reorder (which seems to be right).
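For the specific check in the question (same names, NI number and date of birth, but a different LearnRefNumber), a keyed join may be simpler than a full tree diff. The following is only a minimal sketch under some assumptions: the class and method names (CrossYearSketch, FindMismatches) are made up for illustration, elements are matched by local name so that a default namespace on the files does not matter, and learners whose key fields are all blank would match each other, so you may want to skip those.

```csharp
using System;
using System.Collections.Generic;
using System.Data;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper, not part of the asker's ILRValidation library.
public static class CrossYearSketch
{
    // Reads a child element's text by local name, ignoring any XML namespace.
    private static string Field(XElement learner, string name) =>
        ((string)learner.Elements().FirstOrDefault(e => e.Name.LocalName == name) ?? "").Trim();

    // Builds the matching key from the four identifying fields (case-insensitive).
    private static string Key(XElement learner) =>
        string.Join("|",
            Field(learner, "GivenNames"),
            Field(learner, "FamilyName"),
            Field(learner, "NINumber"),
            Field(learner, "DateOfBirth")).ToUpperInvariant();

    private static IEnumerable<XElement> Learners(XDocument doc) =>
        doc.Descendants().Where(e => e.Name.LocalName == "Learner");

    // Returns one row per pair of learners that share a key
    // but carry different LearnRefNumber values.
    public static DataTable FindMismatches(XDocument yearA, XDocument yearB)
    {
        var table = new DataTable("CrossYear");
        foreach (var col in new[] { "GivenNames", "FamilyName", "NINumber",
                                    "DateOfBirth", "LearnRefNumberA", "LearnRefNumberB" })
            table.Columns.Add(col, typeof(string));

        // Index the second file by key so each learner is looked up once.
        var lookupB = Learners(yearB).ToLookup(l => Key(l));

        foreach (var a in Learners(yearA))
        {
            foreach (var b in lookupB[Key(a)])
            {
                var refA = Field(a, "LearnRefNumber");
                var refB = Field(b, "LearnRefNumber");
                if (refA == refB) continue; // same learner ref, nothing to report
                table.Rows.Add(Field(a, "GivenNames"), Field(a, "FamilyName"),
                               Field(a, "NINumber"), Field(a, "DateOfBirth"), refA, refB);
            }
        }
        return table;
    }
}
```

Usage would be along the lines of CrossYearSketch.FindMismatches(XDocument.Load(path1), XDocument.Load(path2)), with the resulting DataTable added to the DataSet that is written to the workbook.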

Seems to me you're having trouble extracting the values you want from the xml, correct?
As the others have mentioned in the comments, without knowing the layout of your XML it's impossible to give a specific example for your case. If you edit your question to include an example of your XML, we can help more.
Here are some general examples of how to extract values from xml:
private static bool CheckXmlDocument(string xmlPathCheck)
{
// if you have multiple files from which you need to extract values, pass in an array or List<string> and loop over it, fetching the values
// XmlDocument will allow you to edit the document as well as read it
// there's another option to use XPathDocument and XPathNavigator but it's read-only
var doc = new XmlDocument();
// this can throw various exceptions so might want to add some handling
doc.Load(xmlPathCheck);
// getting the elements, you have some options depending on the layout of the document
// if the nodes you want are identified by 'id' (this only works when a DTD declares the attribute as type ID), use this:
var nameElement = doc.GetElementById("name");
// if the nodes you want are identified by 'tag', use this:
var nameNodeList = doc.GetElementsByTagName("name");
// if you know the xpath to the specific node you want, use this:
var selectNameNode = doc.SelectSingleNode("the/xpath/to/the/node");
// if there are several nodes that have the same xpaths, use this:
var selectNameList = doc.SelectNodes("the/xpath/that/may/match/many/nodes");
// getting the value depends on whether you have an XmlNode, XmlElement or XmlNodeList
// if you have a single XmlElement or XmlNode you can get the value one of these ways depending on the layout of your document:
var name = nameElement.InnerText;
name = nameElement.InnerXml;
// if you have an XmlNodeList, you'll have to iterate through the nodes to find the one you want, like this:
foreach (var node in nameNodeList)
{
// here use some condition that will determine if it's the element/node you want or not (depends on your xml layout)
if (node is XmlNode n)
{
name = n.InnerText;
}
}
// do that for all the values you want to compare, then compare them
return CheckValues(/*the values to compare*/);
}
See also: XmlDocument, XmlNode, XmlElement

Related

C# Comparing an Input to a String Exactly

I have a list of most of the elements in the periodic table in order of their placement on the table:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
namespace Chemistry_Element_Calculator
{
class Program
{
static void Main(string[] args)
{
// Declare all numbers
int electronNumber;
int protonNumber;
float neutronNumber;
int i;
// Declare all strings
string elementRequest;
// Create a list for all elements
List<string> elementNameList = new List<string>();
List<string> elementSymbolList = new List<string>();
// Add all elements to the list
elementNameList.Add("Hydrogen"); elementSymbolList.Add("H");
elementNameList.Add("Helium"); elementSymbolList.Add("He");
elementNameList.Add("Lithium"); elementSymbolList.Add("Li");
elementNameList.Add("Beryllium"); elementSymbolList.Add("Be");
elementNameList.Add("Boron"); elementSymbolList.Add("B");
elementNameList.Add("Carbon"); elementSymbolList.Add("C");
elementNameList.Add("Nitrogen"); elementSymbolList.Add("N");
elementNameList.Add("Oxygen"); elementSymbolList.Add("O");
elementNameList.Add("Fluorine"); elementSymbolList.Add("F");
elementNameList.Add("Neon"); elementSymbolList.Add("Ne");
elementNameList.Add("Sodium"); elementSymbolList.Add("Na");
elementNameList.Add("Magnesium"); elementSymbolList.Add("Mg");
elementNameList.Add("Aluminium"); elementSymbolList.Add("Al");
elementNameList.Add("Silicon"); elementSymbolList.Add("Si");
elementNameList.Add("Phosphorus"); elementSymbolList.Add("P");
elementNameList.Add("Sulfur"); elementSymbolList.Add("S");
elementNameList.Add("Chlorine"); elementSymbolList.Add("Cl");
elementNameList.Add("Argon"); elementSymbolList.Add("Ar");
elementNameList.Add("Potassium"); elementSymbolList.Add("K");
elementNameList.Add("Calcium"); elementSymbolList.Add("Ca");
elementNameList.Add("Scandium"); elementSymbolList.Add("Sc");
elementNameList.Add("Titanium"); elementSymbolList.Add("Ti");
elementNameList.Add("Vanadium"); elementSymbolList.Add("V");
elementNameList.Add("Chromium"); elementSymbolList.Add("Cr");
elementNameList.Add("Manganese"); elementSymbolList.Add("Mn");
elementNameList.Add("Iron"); elementSymbolList.Add("Fe");
elementNameList.Add("Cobalt"); elementSymbolList.Add("Co");
elementNameList.Add("Nickel"); elementSymbolList.Add("Ni");
elementNameList.Add("Copper"); elementSymbolList.Add("Cu");
elementNameList.Add("Zinc"); elementSymbolList.Add("Zn");
elementNameList.Add("Gallium"); elementSymbolList.Add("Ga");
elementNameList.Add("Germanium"); elementSymbolList.Add("Ge");
elementNameList.Add("Arsenic"); elementSymbolList.Add("As");
elementNameList.Add("Selenium"); elementSymbolList.Add("Se");
elementNameList.Add("Bromine"); elementSymbolList.Add("Br");
elementNameList.Add("Krypton"); elementSymbolList.Add("Kr");
elementNameList.Add("Rubidium"); elementSymbolList.Add("Rb");
elementNameList.Add("Strontium"); elementSymbolList.Add("Sr");
elementNameList.Add("Yttrium"); elementSymbolList.Add("Y");
elementNameList.Add("Zirconium"); elementSymbolList.Add("Zr");
elementNameList.Add("Niobium"); elementSymbolList.Add("Nb");
elementNameList.Add("Molybdenum"); elementSymbolList.Add("Mo");
elementNameList.Add("Technetium"); elementSymbolList.Add("Tc");
elementNameList.Add("Ruthenium"); elementSymbolList.Add("Ru");
elementNameList.Add("Rhodium"); elementSymbolList.Add("Rh");
elementNameList.Add("Palladium"); elementSymbolList.Add("Pd");
elementNameList.Add("Silver"); elementSymbolList.Add("Ag");
elementNameList.Add("Cadmium"); elementSymbolList.Add("Cd");
elementNameList.Add("Indium"); elementSymbolList.Add("In");
elementNameList.Add("Tin"); elementSymbolList.Add("Sn");
elementNameList.Add("Antimony"); elementSymbolList.Add("Sb");
elementNameList.Add("Tellurium"); elementSymbolList.Add("Te");
elementNameList.Add("Iodine"); elementSymbolList.Add("I");
elementNameList.Add("Xenon"); elementSymbolList.Add("Xe");
elementNameList.Add("Caesium"); elementSymbolList.Add("Cs");
elementNameList.Add("Barium"); elementSymbolList.Add("Ba");
elementNameList.Add("Lanthanum"); elementSymbolList.Add("La");
elementNameList.Add("Cerium"); elementSymbolList.Add("Ce");
elementNameList.Add("Praseodymium"); elementSymbolList.Add("Pr");
elementNameList.Add("Neodymium"); elementSymbolList.Add("Nd");
elementNameList.Add("Promethium"); elementSymbolList.Add("Pm");
elementNameList.Add("Samarium"); elementSymbolList.Add("Sm");
elementNameList.Add("Europium"); elementSymbolList.Add("Eu");
elementNameList.Add("Gadolinium"); elementSymbolList.Add("Gd");
elementNameList.Add("Terbium"); elementSymbolList.Add("Tb");
elementNameList.Add("Dysprosium"); elementSymbolList.Add("Dy");
elementNameList.Add("Holmium"); elementSymbolList.Add("Ho");
elementNameList.Add("Erbium"); elementSymbolList.Add("Er");
elementNameList.Add("Thulium"); elementSymbolList.Add("Tm");
elementNameList.Add("Ytterbium"); elementSymbolList.Add("Yb");
elementNameList.Add("Lutetium"); elementSymbolList.Add("Lu");
elementNameList.Add("Hafnium"); elementSymbolList.Add("Hf");
elementNameList.Add("Tantalum"); elementSymbolList.Add("Ta");
elementNameList.Add("Tungsten"); elementSymbolList.Add("W");
elementNameList.Add("Rhenium"); elementSymbolList.Add("Re");
elementNameList.Add("Osmium"); elementSymbolList.Add("Os");
elementNameList.Add("Iridium"); elementSymbolList.Add("Ir");
elementNameList.Add("Platinum"); elementSymbolList.Add("Pt");
elementNameList.Add("Gold"); elementSymbolList.Add("Au");
elementNameList.Add("Mercury"); elementSymbolList.Add("Hg");
elementNameList.Add("Thallium"); elementSymbolList.Add("Tl");
elementNameList.Add("Lead"); elementSymbolList.Add("Pb");
elementNameList.Add("Bismuth"); elementSymbolList.Add("Bi");
elementNameList.Add("Polonium"); elementSymbolList.Add("Po");
elementNameList.Add("Astatine"); elementSymbolList.Add("At");
elementNameList.Add("Radon"); elementSymbolList.Add("Rn");
elementNameList.Add("Francium"); elementSymbolList.Add("Fr");
elementNameList.Add("Radium"); elementSymbolList.Add("Ra");
elementNameList.Add("Actinium"); elementSymbolList.Add("Ac");
elementNameList.Add("Thorium"); elementSymbolList.Add("Th");
elementNameList.Add("Protactinium"); elementSymbolList.Add("Pa");
elementNameList.Add("Uranium"); elementSymbolList.Add("U");
elementNameList.Add("Neptunium"); elementSymbolList.Add("Np");
elementNameList.Add("Plutonium"); elementSymbolList.Add("Pu");
elementNameList.Add("Americium"); elementSymbolList.Add("Am");
elementNameList.Add("Curium"); elementSymbolList.Add("Cm");
elementNameList.Add("Berkelium"); elementSymbolList.Add("Bk");
elementNameList.Add("Californium"); elementSymbolList.Add("Cf");
elementNameList.Add("Einsteinium"); elementSymbolList.Add("Es");
elementNameList.Add("Fermium"); elementSymbolList.Add("Fm");
elementNameList.Add("Mendelevium"); elementSymbolList.Add("Md");
elementNameList.Add("Nobelium"); elementSymbolList.Add("No");
elementNameList.Add("Lawrencium"); elementSymbolList.Add("Lr");
elementNameList.Add("Rutherfordium"); elementSymbolList.Add("Rf");
elementNameList.Add("Dubnium"); elementSymbolList.Add("Db");
elementNameList.Add("Seaborgium"); elementSymbolList.Add("Sg");
elementNameList.Add("Bohrium"); elementSymbolList.Add("Bh");
elementNameList.Add("Hassium"); elementSymbolList.Add("Hs");
elementNameList.Add("Meitnerium"); elementSymbolList.Add("Mt");
elementNameList.Add("Darmstadtium"); elementSymbolList.Add("Ds");
elementNameList.Add("Roentgenium"); elementSymbolList.Add("Rg");
elementNameList.Add("Copernicium"); elementSymbolList.Add("Cn");
Console.WriteLine("What element do you want? Either input its full name, with a capital letter, or its element symbol. E.g. N for Nitrogen");
elementRequest = Console.ReadLine();
elementNameList.ForEach(delegate (String elementName)
{
if (elementRequest == elementName)
{
Console.WriteLine("Hydrogen");
}
else
{
Console.WriteLine("Not Hydrogen");
};
});
Console.Read();
}
}
}
However, when I run the program, and input Hydrogen, both Helium and Hydrogen are said to be hydrogen. How can I fix this?
Also, if anyone has an idea on how to compress the 2 lists so they're smaller, let me know :)
Thanks :)
What you want your code to do is take the index from the first list, and return the string at the same index in the second list.
Your code currently does none of that: it loops over every name and prints "Hydrogen" for the entry that matches the input and "Not Hydrogen" for all the others, whichever element was requested.
You can trivially fix that by actually looking up the index:
int indexOfElementName = elementNameList.IndexOf(elementRequest);
string elementSymbol = elementSymbolList[indexOfElementName];
Note that that does not handle casing and requested elements that aren't in the list (or typos).
But keeping the two lists in sync, and this code in general, is a maintenance disaster waiting to happen.
Instead you could use a dictionary where the element name is the key and the symbol the value:
var elementDictionary = new Dictionary<string, string>
{
{ "Hydrogen", "H" },
{ "Helium", "He" },
// ...
};
Then look it up:
string elementSymbol = elementDictionary[elementRequest];
Do note that this still doesn't handle case-insensitivity and elements that are not found, but I'll leave that as an exercise to you.
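For what it's worth, the exercise mentioned above could be approached roughly like this. It is only a sketch: the class name ElementLookupSketch and the method FindSymbol are invented for illustration, and only three elements are filled in. The dictionary is given a case-insensitive comparer, and TryGetValue avoids the KeyNotFoundException that the indexer throws for unknown names.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration only.
public static class ElementLookupSketch
{
    // The comparer makes key lookups ignore case ("hydrogen" finds "Hydrogen").
    private static readonly Dictionary<string, string> SymbolsByName =
        new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase)
        {
            { "Hydrogen", "H" },
            { "Helium", "He" },
            { "Lithium", "Li" },
            // ... remaining elements
        };

    // Returns the symbol, or null when the name is not in the dictionary.
    public static string FindSymbol(string name)
    {
        if (name == null) return null;
        return SymbolsByName.TryGetValue(name.Trim(), out var symbol) ? symbol : null;
    }
}
```

The caller can then print a "not found" message when FindSymbol returns null instead of letting the program crash on a typo.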
I would use a custom class and a list as suggested by ElectricRouge.
I don't favor the dictionary because you need to search both by name and symbol. Also, the data set is small (118 elements to date).
See comments for the explanation of the code.
using System;
using System.Collections.Generic;
namespace Chemistry_Element_Calculator
{
// Create a chemical element class
// You can add more properties such as number of electrons, etc.
public class ChemicalElement
{
public string Symbol
{
get; set;
}
public string Name
{
get; set;
}
}
class Program
{
static void Main(string[] args)
{
// Create a list of the elements, populate their properties
var elements = new List<ChemicalElement>()
{
new ChemicalElement {Name = "Hydrogen", Symbol = "H"},
new ChemicalElement {Name = "Helium", Symbol = "He"},
new ChemicalElement {Name = "Lithium", Symbol = "Li"}
// etc.
};
Console.WriteLine("What element do you want? Either input its full name, with a capital letter, or its element symbol. E.g. N for Nitrogen");
var elementRequest = Console.ReadLine();
// Use find to get a matching element, compare both name and symbol
var foundElement = elements.Find(element => element.Symbol == elementRequest || element.Name == elementRequest);
if (foundElement == null)
{
// Output if no element is found
Console.WriteLine("Element Not Found");
}
else
{
// Output if the element is found
Console.WriteLine("Found element {0}.", foundElement.Name);
}
Console.WriteLine("[Press any key to finish]");
Console.ReadKey();
}
}
}
The code uses the method List.Find to get the result; it only returns the first match.
I would also suggest switching to String.Compare (or String.Equals with a StringComparison argument) if you want a case-insensitive comparison.
And finally, I would read the data from a file or a database, but that is beyond the scope of the question.
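A case-insensitive version of the Find call might look like the sketch below. ChemicalElement is repeated here only so that the snippet stands alone, and ElementSearchSketch.FindElement is a made-up name; it uses String.Equals with StringComparison.OrdinalIgnoreCase rather than String.Compare, which is one of several ways to do it.

```csharp
using System;
using System.Collections.Generic;

public class ChemicalElement
{
    public string Symbol { get; set; }
    public string Name { get; set; }
}

// Hypothetical helper for illustration only.
public static class ElementSearchSketch
{
    // Returns the first element whose name or symbol equals the request,
    // ignoring case and surrounding whitespace; null when nothing matches.
    public static ChemicalElement FindElement(List<ChemicalElement> elements, string request)
    {
        var text = (request ?? "").Trim();
        return elements.Find(e =>
            string.Equals(e.Symbol, text, StringComparison.OrdinalIgnoreCase) ||
            string.Equals(e.Name, text, StringComparison.OrdinalIgnoreCase));
    }
}
```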

How to extract specific data from XML data

I am using the following code snippet to parse and convert some XML data to CSV. I can convert the entire XML data and dump it into a file, however my requirements have changed and now I'm confused.
public void xmlToCSVfiltered(string p, int e)
{
string all_lines1 = File.ReadAllText(p);
all_lines1 = "<Root>" + all_lines1 + "</Root>";
XmlDocument doc_all = new XmlDocument();
doc_all.LoadXml(all_lines1);
StreamWriter write_all = new StreamWriter(FILENAME2);
XmlNodeList rows_all = doc_all.GetElementsByTagName("XML");
List<string[]> filtered = new List<string[]>();
foreach (XmlNode rowtemp in rows_all)
{
List<string> children_all = new List<string>();
foreach (XmlNode childtemp in rowtemp.ChildNodes)
{
children_all.Add(Regex.Replace(childtemp.InnerText, "\\s+", " ")); // <------- Fixed the Bug , Advisories dont span
}
string.Join(",", children_all.ToArray());
//write_all.WriteLine(string.Join(",", children_all.ToArray()));
if (children_all.Contains(e.ToString()))
{
filtered.Add(children_all.ToArray());
write_all.WriteLine(children_all);
}
}
write_all.Flush();
write_all.Close();
foreach (var res in filtered)
{
Console.WriteLine(string.Join(",", res));
}
}
My input looks something like the following. My objective now is to convert only those "events" that carry a certain number and compile them into a CSV. Let's say, for example, that I only want to convert events whose 2nd data value under the <EVENT> element is 4627. For the input below, both records qualify, so both would be converted.
<XML><HEADER>1.0,770162,20121009133435,3,</HEADER>20121009133435,721,5,1,0,0,0,00:00,00:00,<EVENT>00032134826064957,4627,</EVENT><DRUG>1,1872161156,7,0,10000</DRUG><DOSE>1,0,5000000,0,10000000,0</DOSE><CAREAREA>1 </CAREAREA><ENCOUNTER></ENCOUNTER><ADVISORY>Keep it simple or spell
tham ALL out. For some reason
that is not the case
please press the on button
when trying to activate
device codes also available on
list</ADVISORY><CAREGIVER></CAREGIVER><PATIENT></PATIENT><LOCATION>20121009133435,00-1d-71-0a-71-80,-66</LOCATION><ROUTE></ROUTE><SITE></SITE><POWER>0,50</POWER></XML>
<XML><HEADER>2.0,773162,20121009133435,3,</HEADER>20121004133435,761,5,1,0,0,0,00:00,00:00,<EVENT>00032134826064957,4627,</EVENT><DRUG>1,18735166156,7,0,10000</DRUG><DOSE>1,0,5000000,0,10000000,0</DOSE><CAREAREA>1 </CAREAREA><ENCOUNTER></ENCOUNTER><ADVISORY>Keep it simple or spell
tham ALL out. For some reason
that is not the case
please press the on button
when trying to activate
device codes also available on
list</ADVISORY><CAREGIVER></CAREGIVER><PATIENT></PATIENT><LOCATION>20121009133435,00-1d-71-0a-71-80,-66</LOCATION><ROUTE></ROUTE><SITE></SITE><POWER>0,50</POWER></XML>
.. goes on
What my approach has been so far is to convert everything to CSV and store it in some sort of data structure and then query that data structure line by line and look if that number exists and if yes, write it to a file line by line. My function takes the path of the XML file and the number we are looking for in the XML data as parameters. I'm new to C# and I cannot understand how I would go about changing my function above. Any help will be appreciated!
EDIT:
Sample Input:
<XML><HEADER>1.0,770162,20121009133435,3,</HEADER>20121009133435,721,5,1,0,0,0,00:00,00:00,<EVENT>00032134826064957,4627,</EVENT><DRUG>1,1872161156,7,0,10000</DRUG><DOSE>1,0,5000000,0,10000000,0</DOSE><CAREAREA>1 </CAREAREA><ENCOUNTER></ENCOUNTER><ADVISORY>Keep it simple or spell
tham ALL out. For some reason
that is not the case
please press the on button
when trying to activate
device codes also available on
list</ADVISORY><CAREGIVER></CAREGIVER><PATIENT></PATIENT><LOCATION>20121009133435,00-1d-71-0a-
<XML><HEADER>1.0,770162,20121009133435,3,</HEADER>20121009133435,721,5,1,0,0,0,00:00,00:00,<EVENT>00032134826064957,4623,</EVENT><DRUG>1,1872161156,7,0,10000</DRUG><DOSE>1,0,5000000,0,10000000,0</DOSE><CAREAREA>1 </CAREAREA><ENCOUNTER></ENCOUNTER><ADVISORY>Keep it simple or spell
tham ALL out. For some reason
that is not the case
please press the on button
when trying to activate
device codes also available on
list</ADVISORY><CAREGIVER></CAREGIVER><PATIENT></PATIENT><LOCATION>20121009133435,00-1d-71-0a-
Required Output:
1.0,770162,20121009133435,3,,20121009133435,721,5,1,0,0,0,00:00,00:00,,00032134 26064957,4627,1,,1872161156,7,0,10000,1,0,5000000,0,10000000,0,1 ,,Keep it simple or spell
tham ALL out. For some reason
that is not the case
please press the on button
when trying to activate
device codes also available on
list,,,20121009133435,00-1d-71-0a-71-80,-66,,,0,50
The above will be the case if I call xmlToCSVfiltered(file, 4627);
Also note that, the output will be a single horizontal line as in CSV files but I can't really format it here for it to look like that.
I changed XmlDocument to XDocument so I can use LINQ to XML. For testing, I also used a StringReader to read the string instead of reading from a file; you can switch the code back to your original File.ReadAllText.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Xml;
using System.Xml.Linq;
using System.IO;
using System.Text.RegularExpressions;
namespace ConsoleApplication1
{
class Program
{
const string FILENAME2 = @"c:\temp\test.txt";
static void Main(string[] args)
{
string input =
"<XML><HEADER>1.0,770162,20121009133435,3,</HEADER>20121009133435,721,5,1,0,0,0,00:00,00:00,<EVENT>00032134826064957,4627,</EVENT><DRUG>1,1872161156,7,0,10000</DRUG><DOSE>1,0,5000000,0,10000000,0</DOSE><CAREAREA>1 </CAREAREA><ENCOUNTER></ENCOUNTER><ADVISORY>Keep it simple or spell\n" +
"tham ALL out. For some reason \n" +
"that is not the case\n" +
"please press the on button\n" +
"when trying to activate\n" +
"device codes also available on\n" +
"list</ADVISORY><CAREGIVER></CAREGIVER><PATIENT></PATIENT><LOCATION>20121009133435,00-1d-71-0a-71-80,-66</LOCATION><ROUTE></ROUTE><SITE></SITE><POWER>0,50</POWER></XML>\n" +
"<XML><HEADER>2.0,773162,20121009133435,3,</HEADER>20121004133435,761,5,1,0,0,0,00:00,00:00,<EVENT>00032134826064957,4627,</EVENT><DRUG>1,18735166156,7,0,10000</DRUG><DOSE>1,0,5000000,0,10000000,0</DOSE><CAREAREA>1 </CAREAREA><ENCOUNTER></ENCOUNTER><ADVISORY>Keep it simple or spell\n" +
"tham ALL out. For some reason\n" +
"that is not the case\n" +
"please press the on button\n" +
"when trying to activate\n" +
"device codes also available on\n" +
"list</ADVISORY><CAREGIVER></CAREGIVER><PATIENT></PATIENT><LOCATION>20121009133435,00-1d-71-0a-71-80,-66</LOCATION><ROUTE></ROUTE><SITE></SITE><POWER>0,50</POWER></XML>\n";
xmlToCSVfiltered(input, 4627);
}
static public void xmlToCSVfiltered(string p, int e)
{
//string all_lines1 = File.ReadAllText(p);
StringReader reader = new StringReader(p);
string all_lines1 = reader.ReadToEnd();
all_lines1 = "<Root>" + all_lines1 + "</Root>";
XDocument doc_all = XDocument.Parse(all_lines1);
StreamWriter write_all = new StreamWriter(FILENAME2);
List<XElement> rows_all = doc_all.Descendants("XML").Where(x => x.Element("EVENT").Value.Split(new char[] {','}).Skip(1).Take(1).FirstOrDefault() == e.ToString()).ToList();
List<string[]> filtered = new List<string[]>();
foreach (XElement rowtemp in rows_all)
{
List<string> children_all = new List<string>();
foreach (XElement childtemp in rowtemp.Elements())
{
children_all.Add(Regex.Replace(childtemp.Value, "\\s+", " ")); // <------- Fixed the Bug , Advisories dont span
}
// rows_all was already filtered on the EVENT code above, so keep every row.
// (children_all.Contains(e.ToString()) is unlikely to match, because each
// child value is a whole comma-separated field list, not the bare code.)
filtered.Add(children_all.ToArray());
write_all.WriteLine(string.Join(",", children_all));
}
write_all.Flush();
write_all.Close();
foreach (var res in filtered)
{
Console.WriteLine(string.Join(",", res));
}
}
}
}
I have made some assumptions, since a few things were not clear to me from the question.
Assumptions
1. You know which node to check (here, EVENT) and that you need the value in the second position of that node.
2. You know the delimiter between the values in the node, e.g. ',' for EVENT here.
public void xmlToCSVfiltered(string p, int e, string nodeName, char delimiter)
{
//get the xml node
XDocument xml = XDocument.Load(p);
//get the required node. I am assuming you would know. For eg. Event Node
var requiredNode = xml.Descendants(nodeName);
foreach (var node in requiredNode)
{
if (node == null)
continue;
//Also here, I am assuming you have the delimiter knowledge.
var valueSplit = node.Value.Split(delimiter);
foreach (var value in valueSplit)
{
if (value == e.ToString())
{
AddToCSV();
}
}
}
}
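Putting the pieces from the answers together, a filter on the second EVENT value could be sketched as below. This is a rough sketch under a few assumptions: EventFilterSketch and FilterToCsv are invented names, the input is wrapped in a <Root> element as in the question because the file has many top-level <XML> elements, and the data is assumed to contain no characters (such as a bare '&') that would make it invalid XML. It walks Nodes() rather than Elements() so that the loose text between <HEADER> and <EVENT> is kept.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;
using System.Xml.Linq;

// Hypothetical helper for illustration only.
public static class EventFilterSketch
{
    // Returns one CSV line per <XML> record whose <EVENT> element has
    // eventCode as its second comma-separated value.
    public static List<string> FilterToCsv(string rawText, int eventCode)
    {
        // The input has many top-level <XML> elements, so wrap it in a single root.
        var doc = XDocument.Parse("<Root>" + rawText + "</Root>");
        var wanted = eventCode.ToString();
        var lines = new List<string>();

        foreach (var record in doc.Root.Elements("XML"))
        {
            var ev = record.Element("EVENT");
            if (ev == null) continue;

            var parts = ev.Value.Split(',');
            if (parts.Length < 2 || parts[1].Trim() != wanted) continue;

            // Take element values and loose text nodes in document order,
            // collapsing line breaks so each record stays on one CSV line.
            var cells = record.Nodes()
                .Select(n => (n is XElement el) ? el.Value : ((n as XText)?.Value ?? ""))
                .Select(v => Regex.Replace(v, @"\s+", " "));
            lines.Add(string.Join(",", cells));
        }
        return lines;
    }
}
```

The returned lines can be written out with File.WriteAllLines, and the method would be called as EventFilterSketch.FilterToCsv(File.ReadAllText(path), 4627).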

How do I merge two XDocuments removing duplicates

I have two XML files (*.resx files) that I am trying to merge into one while removing duplicates, but I am unable to do so. I've tried the following without any success:
var resource1 = XDocument.Load("C:\\Resources.resx");
var resource2 = XDocument.Load("C:\\Resources2.resx");
// This results in a file with all the nodes from the second file included inside
// the root element of the first file to form a properly formatted, concatenated file.
resource1.Descendants().FirstOrDefault().Add(resource2.Descendants().FirstOrDefault().Nodes());
var nodeContent = new List<string>();
foreach (XElement node in resource1.Root.Elements())
{
if (nodeContent.Contains(node.ToString()))
resource1.Remove();
else
nodeContent.Add(node.ToString());
}
resource1.Save("C:\\FinalResources.resx");
On the remove statement I get an InvalidOperationException - "The parent is missing."
Am I doing something wrong?
You need to define an EqualityComparer<XElement> that will enable you to use the standard LINQ operators.
So, as a simple example I created this:
public class ElementComparer : EqualityComparer<XElement>
{
public override int GetHashCode(XElement xe)
{
return xe.Name.GetHashCode() ^ xe.Value.GetHashCode();
}
public override bool Equals(XElement xe1, XElement xe2)
{
var @return = xe1.Name.Equals(xe2.Name);
if (@return)
{
@return = xe1.Value.Equals(xe2.Value);
}
return @return;
}
}
So I can then start with these two XML documents:
<xs>
<x>D</x>
<x>A</x>
<x>B</x>
</xs>
<xs>
<x>E</x>
<x>B</x>
<x>C</x>
</xs>
And do this:
var xml1 = XDocument.Parse(@"<xs><x>D</x><x>A</x><x>B</x></xs>");
var xml2 = XDocument.Parse(@"<xs><x>E</x><x>B</x><x>C</x></xs>");
xml1.Root.Add(
xml2.Root.Elements("x")
.Except(xml1.Root.Elements("x"), new ElementComparer()));
Then xml1 will look like this:
<xs>
<x>D</x>
<x>A</x>
<x>B</x>
<x>E</x>
<x>C</x>
</xs>
Well, the most straightforward way is:
var resource1 = XDocument.Load("C:\\Resources.resx");
var resource2 = XDocument.Load("C:\\Resources2.resx");
foreach (XElement node in resource2.Root.Elements())
{
if (resource1.Root.Contains(node)) continue;
// add under the root element; an XDocument cannot hold a second root
resource1.Root.Add(node);
}
resource1.Save("C:\\FinalResources.resx");
public static class XElementExtensions
{
public static bool Contains(this XElement root, XElement e)
{
// or whatever equality logic you need
return root.Elements().Any(x => x.ToString().Equals(e.ToString()));
}
}
This will only merge first-level entries, though. If you need a deep merge, you will have to set up a simple recursion (using the same loop for child elements).
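resource1.Remove() acts on the document itself, not on the duplicate node. Remove() detaches a node from its parent, and an XDocument has no parent, so the call throws the "The parent is missing." exception. Call node.Remove() on the duplicate element instead, and iterate over a copy of the elements (for example with ToList()) so that removing nodes does not interfere with the loop.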
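A corrected version of the loop from the question might look like the following sketch. The names ResxMergeSketch and MergeDistinct are made up, and "duplicate" here means that the element's serialized text is identical, which is the same test the question used; you may want a looser comparison (for example on the name attribute only) for real .resx files.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper for illustration only.
public static class ResxMergeSketch
{
    // Appends second's top-level elements to first, then drops exact duplicates.
    public static XDocument MergeDistinct(XDocument first, XDocument second)
    {
        first.Root.Add(second.Root.Elements());

        var seen = new HashSet<string>();
        // ToList() takes a snapshot so that removing nodes does not disturb the loop.
        foreach (var node in first.Root.Elements().ToList())
        {
            var text = node.ToString(SaveOptions.DisableFormatting);
            if (!seen.Add(text))
                node.Remove(); // remove the duplicate element, not the document
        }
        return first;
    }
}
```

After the call, first.Save("C:\\FinalResources.resx") writes the merged file as in the question.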

How can I transform XML into a List<string> or String[]?

How can I transform the following XML into a List<string> or String[]:
<Ids>
<id>1</id>
<id>2</id>
</Ids>
It sounds like you're more after just parsing rather than full XML serialization/deserialization. If you can use LINQ to XML, this is pretty easy:
using System;
using System.Linq;
using System.Xml.Linq;
public class Test
{
static void Main()
{
string xml = "<Ids><id>1</id><id>2</id></Ids>";
XDocument doc = XDocument.Parse(xml);
var list = doc.Root.Elements("id")
.Select(element => element.Value)
.ToList();
foreach (string value in list)
{
Console.WriteLine(value);
}
}
}
In fact the call to Elements could omit the argument as there are only id elements, but I thought I'd demonstrate how to specify which elements you want.
Likewise I'd normally not bother calling ToList unless I really needed a List<string> - without it, the result is IEnumerable<string> which is fine if you're just iterating over it once. To create an array instead, use ToArray.
Here is a way using XmlDocument:
// A string containing the XML data
string xml = "<Ids><id>1</id><id>2</id></Ids>";
// The list you want to fill
List<string> list = new List<string>();
XmlDocument doc = new XmlDocument();
// Loading from a XML string (use Load() for file)
doc.LoadXml(xml);
// Selecting node using XPath syntax
XmlNodeList idNodes = doc.SelectNodes("Ids/id");
// Filling the list
foreach (XmlNode node in idNodes)
list.Add(node.InnerText);
This works with any type of collection.
For example:
<Ids>
<id>C:\videotest\file0.txt</id>
<id>C:\videotest\file1.txt</id>
</Ids>
class FileFormat
{
public FileFormat(string path)
{
this.path = path;
}
public string FullPath
{
get { return Path.GetFullPath(path); }
}
public string FileName
{
get { return Path.GetFileName(path); }
}
private string path;
}
List<FileFormat> Files =
selectedNode
.Descendants("Ids")
.Elements("id") // element names are case-sensitive; the XML above uses lowercase id
.Select(x => new FileFormat(x.Value))
.Where(s => s.FileName!=string.Empty)
.ToList();
Here is the code to convert a string array to XML (items is a string array):
items.Aggregate("", (current, x) => current + ("<item>" + x + "</item>"))
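One caveat with plain string concatenation is that characters such as & or < in the items are not escaped. A variant that lets LINQ to XML do the escaping could look like this sketch; ItemsXmlSketch and ToXml are invented names, and the wrapping <items> root element is an addition that the one-liner above does not produce.

```csharp
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper for illustration only.
public static class ItemsXmlSketch
{
    // Builds <items><item>...</item>...</items>; XElement escapes special characters.
    public static string ToXml(string[] items)
    {
        var root = new XElement("items", items.Select(x => new XElement("item", x)));
        return root.ToString(SaveOptions.DisableFormatting);
    }
}
```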
This sample will work with the .NET framework 3.5:
System.Xml.Linq.XElement element = System.Xml.Linq.XElement.Parse("<Ids> <id>1</id> <id>2</id></Ids>");
System.Collections.Generic.List<string> ids = new System.Collections.Generic.List<string>();
foreach (System.Xml.Linq.XElement childElement in element.Descendants("id"))
{
ids.Add(childElement.Value);
}
This sample will work with the .NET framework 4.0:
Into a List<string>:
List<string> Ids = selectedNode.Descendants("Ids").Elements("id").Select(x => x.Value).Where(s => s != string.Empty).ToList();
Into a string[]:
string[] Ids = thisNode
.Descendants("Ids") // Ids element
.Elements("id") // id elements (names are case-sensitive)
.Select(x => x.Value) // XElement to string
.Where(s => s != string.Empty) // must not be empty
.ToArray(); // IEnumerable<string> to string[]
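You can use the Properties class (note that this is Java's java.util.Properties, not a .NET type).
Properties prop = new Properties();
prop.loadFromXML(stream);
Set set = prop.keySet();
You can then access the strings from the set for each key. Note that loadFromXML only accepts Java's own properties XML format, where each key is the key attribute of an <entry> element.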
Here's a way to get a typed array from XML by using DataSets (in this example, an array of doubles):
DataSet dataSet = new DataSet();
DoubleArray doubles = new DoubleArray(dataSet, 0);
dataSet.ReadXml("my.xml");
double a = doubles[0];
public class DoubleArray
{
DataSet dataSet;
int tableIndex;
public DoubleArray(DataSet dataSet, int tableIndex)
{
this.dataSet = dataSet;
this.tableIndex = tableIndex;
}
public double this[int index]
{
get
{
// Rows[index] is a DataRow; the value is assumed to be in its first column
object ret = dataSet.Tables[tableIndex].Rows[index][0];
if (ret is double)
return (double)ret;
else
return double.Parse((string)ret);
}
set
{
// "out" is a reserved word in C#, so the local is named "current"
object current = dataSet.Tables[tableIndex].Rows[index][0];
if (current is double)
dataSet.Tables[tableIndex].Rows[index][0] = value;
else
dataSet.Tables[tableIndex].Rows[index][0] = value.ToString();
}
}
}
Of course, parsing doubles and converting them back to strings all the time might be considered blasphemy by some programmers. Even for me it was hard not to think about such a waste of resources, but I guess sometimes it's better to just look the other way. Don't stress it :)
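If the repeated parsing is a concern, one alternative (swapping the DataSet for LINQ to XML, which is not what the answer above does) is to parse every value once into a plain double[]. This is only a sketch: `DoubleReader` is a hypothetical name, and it assumes every child element of the root holds one number written with a '.' decimal separator.

```csharp
using System.Globalization;
using System.Linq;
using System.Xml.Linq;

public static class DoubleReader
{
    // Parses every child element of the root as a double, once, up front.
    // InvariantCulture makes "1.5" parse the same way on every machine.
    public static double[] ReadDoubles(string xml)
    {
        return XElement.Parse(xml)
            .Elements()
            .Select(e => double.Parse(e.Value, CultureInfo.InvariantCulture))
            .ToArray();
    }
}
```

The trade-off is that changes to the array are not written back to the XML, which the DataSet-backed indexer does allow.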

What is the best way to parse html in C#? [closed]

I'm looking for a library/method to parse an HTML file with more HTML-specific features than generic XML parsing libraries.
Html Agility Pack
This is an agile HTML parser that builds a read/write DOM and supports plain XPATH or XSLT (you actually don't HAVE to understand XPATH nor XSLT to use it, don't worry...). It is a .NET code library that allows you to parse "out of the web" HTML files. The parser is very tolerant with "real world" malformed HTML. The object model is very similar to what proposes System.Xml, but for HTML documents (or streams).
You could use TidyNet.Tidy to convert the HTML to XHTML, and then use an XML parser.
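A sketch of the second half of that approach only: it assumes the HTML has already been converted to well-formed XHTML (the Tidy step is not shown), and reads it with LINQ to XML. `XhtmlLinks` and `GetHrefs` are illustrative names.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class XhtmlLinks
{
    // Returns the href of every <a> element in a well-formed XHTML string.
    // Matching on LocalName ignores the XHTML namespace if one is declared.
    public static List<string> GetHrefs(string xhtml)
    {
        return XDocument.Parse(xhtml)
            .Descendants()
            .Where(e => e.Name.LocalName == "a" && e.Attribute("href") != null)
            .Select(e => e.Attribute("href").Value)
            .ToList();
    }
}
```

Anchors without an href attribute (named anchors, for instance) are skipped rather than causing a null reference.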
Another alternative would be to use the builtin engine mshtml:
using mshtml;
...
object[] oPageText = { html };
HTMLDocument doc = new HTMLDocumentClass();
IHTMLDocument2 doc2 = (IHTMLDocument2)doc;
doc2.write(oPageText);
This allows you to use javascript-like functions like getElementById()
I found a project called Fizzler that takes a jQuery/Sizzle approach to selecting HTML elements. It's based on HTML Agility Pack. It's currently in beta and only supports a subset of CSS selectors, but it's pretty damn cool and refreshing to use CSS selectors over nasty XPath.
http://code.google.com/p/fizzler/
You can do a lot without going nuts on 3rd-party products and mshtml (i.e. interop). Use the System.Windows.Forms.WebBrowser. From there, you can do such things as "GetElementById" on an HtmlDocument or "GetElementsByTagName" on HtmlElements. If you want to actually interface with the browser (simulate button clicks, for example), you can use a little reflection (imo a lesser evil than interop) to do it:
var wb = new WebBrowser();
... tell the browser to navigate (tangential to this question). Then, in the DocumentCompleted event handler, you can simulate clicks like this:
var doc = wb.Document;
var elem = doc.GetElementById(elementId);
object obj = elem.DomElement;
System.Reflection.MethodInfo mi = obj.GetType().GetMethod("click");
mi.Invoke(obj, new object[0]);
You can do similar reflection stuff to submit forms, etc.
Enjoy.
I've written some code that provides "LINQ to HTML" functionality. I thought I would share it here. It is based on Majestic 12. It takes the Majestic-12 results and produces LINQ XML elements. At that point you can use all your LINQ to XML tools against the HTML. As an example:
IEnumerable<XNode> auctionNodes = Majestic12ToXml.Majestic12ToXml.ConvertNodesToXml(byteArrayOfAuctionHtml);
foreach (XElement anchorTag in auctionNodes.OfType<XElement>().DescendantsAndSelf("a")) {
if (anchorTag.Attribute("href") == null)
continue;
Console.WriteLine(anchorTag.Attribute("href").Value);
}
I wanted to use Majestic-12 because I know it has a lot of built-in knowledge with regards to HTML that is found in the wild. What I've found though is that to map the Majestic-12 results to something that LINQ will accept as XML requires additional work. The code I'm including does a lot of this cleansing, but as you use this you will find pages that are rejected. You'll need to fix up the code to address that. When an exception is thrown, check exception.Data["source"] as it is likely set to the HTML tag that caused the exception. Handling the HTML in a nice manner is at times not trivial...
So now that expectations are realistically low, here's the code :)
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using Majestic12;
using System.IO;
using System.Xml.Linq;
using System.Diagnostics;
using System.Text.RegularExpressions;
namespace Majestic12ToXml {
public class Majestic12ToXml {
static public IEnumerable<XNode> ConvertNodesToXml(byte[] htmlAsBytes) {
HTMLparser parser = OpenParser();
parser.Init(htmlAsBytes);
XElement currentNode = new XElement("document");
HTMLchunk m12chunk = null;
int xmlnsAttributeIndex = 0;
string originalHtml = "";
while ((m12chunk = parser.ParseNext()) != null) {
try {
Debug.Assert(!m12chunk.bHashMode); // popular default for Majestic-12 setting
XNode newNode = null;
XElement newNodesParent = null;
switch (m12chunk.oType) {
case HTMLchunkType.OpenTag:
// Tags are added as a child to the current tag,
// except when the new tag implies the closure of
// some number of ancestor tags.
newNode = ParseTagNode(m12chunk, originalHtml, ref xmlnsAttributeIndex);
if (newNode != null) {
currentNode = FindParentOfNewNode(m12chunk, originalHtml, currentNode);
newNodesParent = currentNode;
newNodesParent.Add(newNode);
currentNode = newNode as XElement;
}
break;
case HTMLchunkType.CloseTag:
if (m12chunk.bEndClosure) {
newNode = ParseTagNode(m12chunk, originalHtml, ref xmlnsAttributeIndex);
if (newNode != null) {
currentNode = FindParentOfNewNode(m12chunk, originalHtml, currentNode);
newNodesParent = currentNode;
newNodesParent.Add(newNode);
}
}
else {
XElement nodeToClose = currentNode;
string m12chunkCleanedTag = CleanupTagName(m12chunk.sTag, originalHtml);
while (nodeToClose != null && nodeToClose.Name.LocalName != m12chunkCleanedTag)
nodeToClose = nodeToClose.Parent;
if (nodeToClose != null)
currentNode = nodeToClose.Parent;
Debug.Assert(currentNode != null);
}
break;
case HTMLchunkType.Script:
newNode = new XElement("script", "REMOVED");
newNodesParent = currentNode;
newNodesParent.Add(newNode);
break;
case HTMLchunkType.Comment:
newNodesParent = currentNode;
if (m12chunk.sTag == "!--")
newNode = new XComment(m12chunk.oHTML);
else if (m12chunk.sTag == "![CDATA[")
newNode = new XCData(m12chunk.oHTML);
else
throw new Exception("Unrecognized comment sTag");
newNodesParent.Add(newNode);
break;
case HTMLchunkType.Text:
currentNode.Add(m12chunk.oHTML);
break;
default:
break;
}
}
catch (Exception e) {
var wrappedE = new Exception("Error using Majestic12.HTMLChunk, reason: " + e.Message, e);
// the original html is copied for tracing/debugging purposes
originalHtml = new string(htmlAsBytes.Skip(m12chunk.iChunkOffset)
.Take(m12chunk.iChunkLength)
.Select(B => (char)B).ToArray());
wrappedE.Data.Add("source", originalHtml);
throw wrappedE;
}
}
while (currentNode.Parent != null)
currentNode = currentNode.Parent;
return currentNode.Nodes();
}
static XElement FindParentOfNewNode(Majestic12.HTMLchunk m12chunk, string originalHtml, XElement nextPotentialParent) {
string m12chunkCleanedTag = CleanupTagName(m12chunk.sTag, originalHtml);
XElement discoveredParent = null;
// Get a list of all ancestors
List<XElement> ancestors = new List<XElement>();
XElement ancestor = nextPotentialParent;
while (ancestor != null) {
ancestors.Add(ancestor);
ancestor = ancestor.Parent;
}
// Check if the new tag implies a previous tag was closed.
if ("form" == m12chunkCleanedTag) {
discoveredParent = ancestors
.Where(XE => m12chunkCleanedTag == XE.Name)
.Take(1)
.Select(XE => XE.Parent)
.FirstOrDefault();
}
else if ("td" == m12chunkCleanedTag) {
discoveredParent = ancestors
.TakeWhile(XE => "tr" != XE.Name)
.Where(XE => m12chunkCleanedTag == XE.Name)
.Take(1)
.Select(XE => XE.Parent)
.FirstOrDefault();
}
else if ("tr" == m12chunkCleanedTag) {
discoveredParent = ancestors
.TakeWhile(XE => !("table" == XE.Name
|| "thead" == XE.Name
|| "tbody" == XE.Name
|| "tfoot" == XE.Name))
.Where(XE => m12chunkCleanedTag == XE.Name)
.Take(1)
.Select(XE => XE.Parent)
.FirstOrDefault();
}
else if ("thead" == m12chunkCleanedTag
|| "tbody" == m12chunkCleanedTag
|| "tfoot" == m12chunkCleanedTag) {
discoveredParent = ancestors
.TakeWhile(XE => "table" != XE.Name)
.Where(XE => m12chunkCleanedTag == XE.Name)
.Take(1)
.Select(XE => XE.Parent)
.FirstOrDefault();
}
return discoveredParent ?? nextPotentialParent;
}
static string CleanupTagName(string originalName, string originalHtml) {
string tagName = originalName;
tagName = tagName.TrimStart(new char[] { '?' }); // for nodes <?xml >
if (tagName.Contains(':'))
tagName = tagName.Substring(tagName.LastIndexOf(':') + 1);
return tagName;
}
static readonly Regex _startsAsNumeric = new Regex(@"^[0-9]", RegexOptions.Compiled);
static bool TryCleanupAttributeName(string originalName, ref int xmlnsIndex, out string result) {
result = null;
string attributeName = originalName;
if (string.IsNullOrEmpty(originalName))
return false;
if (_startsAsNumeric.IsMatch(originalName))
return false;
//
// transform xmlns attributes so they don't actually create any XML namespaces
//
if (attributeName.ToLower().Equals("xmlns")) {
attributeName = "xmlns_" + xmlnsIndex.ToString();
xmlnsIndex++;
}
else {
if (attributeName.ToLower().StartsWith("xmlns:")) {
attributeName = "xmlns_" + attributeName.Substring("xmlns:".Length);
}
//
// trim trailing \"
//
attributeName = attributeName.TrimEnd(new char[] { '\"' });
attributeName = attributeName.Replace(":", "_");
}
result = attributeName;
return true;
}
static Regex _weirdTag = new Regex(@"^<!\[.*\]>$"); // matches "<![if !supportEmptyParas]>"
static Regex _aspnetPrecompiled = new Regex(@"^<%.*%>$"); // matches "<%# ... %>"
static Regex _shortHtmlComment = new Regex(@"^<!-.*->$"); // matches "<!-Extra_Images->"
static XElement ParseTagNode(Majestic12.HTMLchunk m12chunk, string originalHtml, ref int xmlnsIndex) {
if (string.IsNullOrEmpty(m12chunk.sTag)) {
if (m12chunk.sParams.Length > 0 && m12chunk.sParams[0].ToLower().Equals("doctype"))
return new XElement("doctype");
if (_weirdTag.IsMatch(originalHtml))
return new XElement("REMOVED_weirdBlockParenthesisTag");
if (_aspnetPrecompiled.IsMatch(originalHtml))
return new XElement("REMOVED_ASPNET_PrecompiledDirective");
if (_shortHtmlComment.IsMatch(originalHtml))
return new XElement("REMOVED_ShortHtmlComment");
// Nodes like "<br <br>" will end up with a m12chunk.sTag==""... We discard these nodes.
return null;
}
string tagName = CleanupTagName(m12chunk.sTag, originalHtml);
XElement result = new XElement(tagName);
List<XAttribute> attributes = new List<XAttribute>();
for (int i = 0; i < m12chunk.iParams; i++) {
if (m12chunk.sParams[i] == "<!--") {
// an HTML comment was embedded within a tag. This comment and its contents
// will be interpreted as attributes by Majestic-12... skip these attributes
for (; i < m12chunk.iParams; i++) {
if (m12chunk.sParams[i] == "--" || m12chunk.sParams[i] == "-->")
break;
}
continue;
}
if (m12chunk.sParams[i] == "?" && string.IsNullOrEmpty(m12chunk.sValues[i]))
continue;
string attributeName = m12chunk.sParams[i];
if (!TryCleanupAttributeName(attributeName, ref xmlnsIndex, out attributeName))
continue;
attributes.Add(new XAttribute(attributeName, m12chunk.sValues[i]));
}
// If attributes are duplicated with different values, we complain.
// If attributes are duplicated with the same value, we remove all but 1.
var duplicatedAttributes = attributes.GroupBy(A => A.Name).Where(G => G.Count() > 1);
foreach (var duplicatedAttribute in duplicatedAttributes) {
if (duplicatedAttribute.GroupBy(DA => DA.Value).Count() > 1)
throw new Exception("Attribute value was given different values");
attributes.RemoveAll(A => A.Name == duplicatedAttribute.Key);
attributes.Add(duplicatedAttribute.First());
}
result.Add(attributes);
return result;
}
static HTMLparser OpenParser() {
HTMLparser oP = new HTMLparser();
// The code+comments in this function are from the Majestic-12 sample documentation.
// ...
// This is optional, but if you want high performance then you may
// want to set chunk hash mode to FALSE. This would result in tag params
// being added to string arrays in HTMLchunk object called sParams and sValues, with number
// of actual params being in iParams. See code below for details.
//
// When TRUE (and its default) tag params will be added to hashtable HTMLchunk (object).oParams
oP.SetChunkHashMode(false);
// if you set this to true then original parsed HTML for given chunk will be kept -
// this will reduce performance somewhat, but may be desireable in some cases where
// reconstruction of HTML may be necessary
oP.bKeepRawHTML = false;
// if set to true (it is false by default), then entities will be decoded: this is essential
// if you want to get strings that contain final representation of the data in HTML, however
// you should be aware that if you want to use such strings into output HTML string then you will
// need to do Entity encoding or same string may fail later
oP.bDecodeEntities = true;
// we have option to keep most entities as is - only replace stuff like
// this is called Mini Entities mode - it is handy when HTML will need
// to be re-created after it was parsed, though in this case really
// entities should not be parsed at all
oP.bDecodeMiniEntities = true;
if (!oP.bDecodeEntities && oP.bDecodeMiniEntities)
oP.InitMiniEntities();
// if set to true, then in case of Comments and SCRIPT tags the data set to oHTML will be
// extracted BETWEEN those tags, rather than include complete RAW HTML that includes tags too
// this only works if auto extraction is enabled
oP.bAutoExtractBetweenTagsOnly = true;
// if true then comments will be extracted automatically
oP.bAutoKeepComments = true;
// if true then scripts will be extracted automatically:
oP.bAutoKeepScripts = true;
// if this option is true then whitespace before start of tag will be compressed to single
// space character in string: " ", if false then full whitespace before tag will be returned (slower)
// you may only want to set it to false if you want exact whitespace between tags, otherwise it is just
// a waste of CPU cycles
oP.bCompressWhiteSpaceBeforeTag = true;
// if true (default) then tags with attributes marked as CLOSED (/ at the end) will be automatically
// forced to be considered as open tags - this is no good for XML parsing, but I keep it for backwards
// compatibility for my stuff as it makes it easier to avoid checking for same tag which is both closed
// or open
oP.bAutoMarkClosedTagsWithParamsAsOpen = false;
return oP;
}
}
}
The Html Agility Pack has been mentioned before - if you are going for speed, you might also want to check out the Majestic-12 HTML parser. Its handling is rather clunky, but it delivers a really fast parsing experience.
I think @Erlend's use of HTMLDocument is the best way to go. However, I have also had good luck using this simple library:
SgmlReader
A solution with no 3rd-party library, using the WebBrowser class, that can run in a console application and in ASP.NET:
using System;
using System.Collections.Generic;
using System.Text;
using System.Windows.Forms;
using System.Threading;
class ParseHTML
{
public ParseHTML() { }
private string ReturnString;
public string doParsing(string html)
{
Thread t = new Thread(TParseMain);
t.SetApartmentState(ApartmentState.STA); // the ApartmentState property setter is obsolete
t.Start((object)html);
t.Join();
return ReturnString;
}
private void TParseMain(object html)
{
WebBrowser wbc = new WebBrowser();
wbc.DocumentText = "feces of a dummy"; // magic words
HtmlDocument doc = wbc.Document.OpenNew(true);
doc.Write((string)html);
this.ReturnString = doc.Body.InnerHtml + " do here something";
return;
}
}
usage:
string myhtml = "<HTML><BODY>This is a new HTML document.</BODY></HTML>";
Console.WriteLine("before:" + myhtml);
myhtml = (new ParseHTML()).doParsing(myhtml);
Console.WriteLine("after:" + myhtml);
The trouble with parsing HTML is that it isn't an exact science. If it was XHTML that you were parsing, then things would be a lot easier (as you mention, you could use a general XML parser). Because HTML isn't necessarily well-formed XML, you will run into lots of problems trying to parse it. It almost needs to be done on a site-by-site basis.
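To illustrate the well-formedness problem, here is a small sketch that checks whether a piece of markup can be loaded by a strict XML parser. `WellFormedCheck` is a hypothetical helper, not part of any library mentioned here.

```csharp
using System.Xml;
using System.Xml.Linq;

public static class WellFormedCheck
{
    // Returns true when the markup can be loaded by a strict XML parser.
    public static bool IsWellFormedXml(string markup)
    {
        try
        {
            XDocument.Parse(markup);
            return true;
        }
        catch (XmlException)
        {
            return false;
        }
    }
}
```

Perfectly ordinary HTML such as `<p>one<br>two</p>` fails this check because `<br>` is never closed, while the XHTML form `<p>one<br/>two</p>` passes.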
I've used ZetaHtmlTidy in the past to load random websites and then hit against various parts of the content with XPath (e.g. /html/body//p[@class='textblock']). It worked well but there were some exceptional sites that it had problems with, so I don't know if it's the absolute best solution.
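A sketch of the XPath step using the framework's own System.Xml.XPath extensions. It assumes the page has already been tidied into well-formed XHTML (the tidy call itself is not shown) and that the elements are in no namespace; with a default xmlns declared, this expression would match nothing without a namespace manager. `TextBlocks` is an illustrative name.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;
using System.Xml.XPath;

public static class TextBlocks
{
    // Returns the text of every <p class="textblock"> anywhere under /html/body.
    public static List<string> GetTextBlocks(string xhtml)
    {
        return XDocument.Parse(xhtml)
            .XPathSelectElements("/html/body//p[@class='textblock']")
            .Select(p => p.Value)
            .ToList();
    }
}
```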
You could use an HTML DTD and the generic XML parsing libraries.
Use WatiN if you need to see the impact of JS on the page [and you're prepared to start a browser]
Depending on your needs you might go for the more feature-rich libraries. I tried most/all of the solutions suggested, but what stood out head & shoulders was Html Agility Pack. It is a very forgiving and flexible parser.
Try this script.
http://www.biterscripting.com/SS_URLs.html
When I use it with this url,
script SS_URLs.txt URL("http://stackoverflow.com/questions/56107/what-is-the-best-way-to-parse-html-in-c")
It shows me all the links on the page for this thread.
http://sstatic.net/so/all.css
http://sstatic.net/so/favicon.ico
http://sstatic.net/so/apple-touch-icon.png
.
.
.
You can modify that script to check for images, variables, whatever.
I wrote some classes for parsing HTML tags in C#. They are nice and simple if they meet your particular needs.
You can read an article about them and download the source code at http://www.blackbeltcoder.com/Articles/strings/parsing-html-tags-in-c.
There's also an article about a generic parsing helper class at http://www.blackbeltcoder.com/Articles/strings/a-text-parsing-helper-class.
