Parse an xml document with "dynamic" nodes - c#

I am parsing XML via an XDocument. How can I retrieve all languages, i.e. <en> or <de> or <CodeCountry>, and their child elements?
<en>
<descriptif>In the historic area, this 16th century Town House on 10,764 sq. ft. features 10 rooms and 3 shower-rooms. Period features include a spiral staircase. 2-room annex house with a vaulted cellar. Period orangery. Ref.: 2913.</descriptif>
<prox>NOGENT-LE-ROTROU.</prox>
<libelle>NOGENT-LE-ROTROU.</libelle>
</en>
<de>
<descriptif>In the historic area, this 16th century Town House on 10,764 sq. ft. features 10 rooms and 3 shower-rooms. Period features include a spiral staircase. 2-room annex house with a vaulted cellar. Period orangery. Ref.: 2913.</descriptif>
<prox>NOGENT-LE-ROTROU.</prox>
</de>
...
<lang>
<descriptif></descriptif>
<prox></prox>
<libelle></libelle>
</lang>

As your XML document is not well-formed (it has no single root element), you should first add a root element.
You could do something like this:
var content = File.ReadAllText(@"<path to your xml>");
var test = XDocument.Parse("<Language>" + content + "</Language>");
Then, since the top-level nodes are "dynamic", you can work from their children (which don't seem to be dynamic), assuming every language node has at least a "descriptif" child. (If it's not "descriptif", it may be "prox" or "libelle".) **
//this will give you all parents, <en>, <de> etc. nodes
var parents = test.Descendants("descriptif").Select(m => m.Parent);
Then you can select the language and its children.
I used an anonymous type, you can of course project to a custom class.
var allNodes = parents.Select(m => new
{
name = m.Name.LocalName,
Descriptif = m.Element("descriptif") == null ? string.Empty : m.Element("descriptif").Value,
Prox = m.Element("prox") == null ? string.Empty : m.Element("prox").Value ,
Label = m.Element("libelle") == null ? string.Empty : m.Element("libelle").Value
});
Of course, this code will not perform well on a big file, but that's another problem.
**
Worst case, you may do
var parents = test.Descendants("descriptif").Select(m => m.Parent)
.Union(test.Descendants("prox").Select(m => m.Parent))
.Union(test.Descendants("libelle").Select(m => m.Parent));
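
As a self-contained illustration of the approach above, here is a minimal sketch. The inline XML string and the variable names are assumptions made for the demo. It wraps the fragment in a root element and, as an alternative to going through Descendants("descriptif"), simply enumerates the direct children of that root.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

// Hypothetical sample standing in for the file content (no single root element).
var content =
    "<en><descriptif>Town House</descriptif><prox>NOGENT-LE-ROTROU.</prox><libelle>NOGENT-LE-ROTROU.</libelle></en>" +
    "<de><descriptif>Stadthaus</descriptif><prox>NOGENT-LE-ROTROU.</prox></de>";

// Wrap the fragment in an artificial root so it parses as a well-formed document.
var doc = XDocument.Parse("<Language>" + content + "</Language>");

// Every direct child of the root is one language node (<en>, <de>, ...).
var languages = doc.Root.Elements()
    .Select(lang => new
    {
        Name = lang.Name.LocalName,
        // The (string) cast returns null for a missing element instead of throwing.
        Descriptif = (string)lang.Element("descriptif") ?? string.Empty,
        Prox = (string)lang.Element("prox") ?? string.Empty,
        Label = (string)lang.Element("libelle") ?? string.Empty
    })
    .ToList();

foreach (var l in languages)
    Console.WriteLine($"{l.Name}: {l.Descriptif} / {l.Prox} / {l.Label}");
```

This only works if every direct child of the root really is a language node; if other elements can appear at that level, the Descendants-based selection above is the safer filter.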

Related

How to merge Parent Element and Child element with " : " (colon) in C#

Input Xml:
<title>Discourse interaction between <italic>The New York Times</italic> and <italic>China Daily</italic></title> <subtitle>The case of Google's departure</subtitle>
Required Output:
Discourse interaction between The New York Times and China Daily: The case of Google's departure
My code:
String x = xml.Element("title").Value.Trim();
Now I am getting :
Discourse interaction between The New York Times and China Daily:
<subtitle> is not a child element of <title>, it is a sibling element. You can see this by formatting your containing element xml with indentation:
<someOuterElementNotShown>
<title>Discourse interaction between <italic>The New York Times</italic> and <italic>China Daily</italic></title>
<subtitle>The case of Google's departure</subtitle>
</someOuterElementNotShown>
To get the sibling elements following a given element, use ElementsAfterSelf():
var title = xml.Element("title"); // Add some null check here?
var subtitles = string.Concat(title.ElementsAfterSelf().TakeWhile(e => e.Name == "subtitle").Select(e => e.Value)).Trim();
var x = subtitles.Length > 0 ? string.Format("{0}: {1}", title.Value.Trim(), subtitles) : title.Value.Trim();
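
For reference, a minimal runnable sketch of the sibling-based merge. The <article> wrapper element is an assumption, since the question does not show the real parent element.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

// Hypothetical wrapper element; the question does not show the real parent.
var xml = XElement.Parse(
    "<article>" +
    "<title>Discourse interaction between <italic>The New York Times</italic> and <italic>China Daily</italic></title>" +
    "<subtitle>The case of Google's departure</subtitle>" +
    "</article>");

var title = xml.Element("title");
if (title == null) throw new InvalidOperationException("No <title> element found.");

// Take only the <subtitle> siblings that directly follow the title.
var subtitle = string.Concat(
    title.ElementsAfterSelf()
         .TakeWhile(e => e.Name == "subtitle")
         .Select(e => e.Value)).Trim();

// Fall back to the bare title when there is no subtitle.
var merged = subtitle.Length > 0
    ? title.Value.Trim() + ": " + subtitle
    : title.Value.Trim();

Console.WriteLine(merged);
// prints: Discourse interaction between The New York Times and China Daily: The case of Google's departure
```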

XML linq need detail info on exception

I am using LINQ to XML in my project. I am dealing with very large XML documents; for ease of understanding, here is a small sample.
<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
<soap:Body>
<StackOverflowReply xmlns="http://xml.stack.com/RRAND01234">
<processStatus>
<statusCode1>P</statusCode1>
<statusCode2>P</statusCode2>
<statusCode3>P</statusCode3>
<statusCode4>P</statusCode4>
</processStatus>
</StackOverflowReply>
</soap:Body>
</soap:Envelope>
The following is my C# LINQ to XML query:
XNamespace x = "http://xml.stack.com/RRAND01234";
var result = from StackOverflowReply in XDocument.Parse(Myxml).Descendants(x + "StackOverflowReply")
select new
{
status1 = StackOverflowReply.Element(x + "processStatus").Element(x + "statusCode1").Value,
status2 = StackOverflowReply.Element(x + "processStatus").Element(x + "statusCode2").Value,
status3 = StackOverflowReply.Element(x + "processStatus").Element(x + "statusCode3").Value,
status4 = StackOverflowReply.Element(x + "processStatus").Element(x + "statusCode4").Value,
status5 = StackOverflowReply.Element(x + "processStatus").Element(x + "statusCode5").Value,
};
Here I get an exception, "Object reference not set to an instance of an object.", because the <statusCode5> tag is not in my XML. In this case I want a detailed exception message such as "Missing tag statusCode5". Please guide me on how to get this message from my exception.
There's no easy way (that I'm aware of) to find out exactly which element was missing in a LINQ to XML statement. What you can do, however, is use the explicit (string) cast on the element, which returns null instead of throwing when the element is missing. That can get tricky if you have a chain of elements, though.
A chained call like this one is still not fully safe:
status5 = (string)StackOverflowReply.Element(x + "processStatus").Element(x + "statusCode5")
That is because the (string) cast only applies to the last element in the chain. If an earlier element (here, processStatus) is missing, Element() is called on null and still throws.
You could change your LINQ to focus only on the subnodes, like this:
XNamespace x = "http://xml.stack.com/RRAND01234";
var result = from StackOverflowReply in XDocument.Parse(Myxml).Descendants(x + "processStatus")
select new
{
status1 = (string)StackOverflowReply.Element(x + "statusCode1"),
status2 = (string)StackOverflowReply.Element(x + "statusCode2"),
status3 = (string)StackOverflowReply.Element(x + "statusCode3"),
status4 = (string)StackOverflowReply.Element(x + "statusCode4"),
status5 = (string)StackOverflowReply.Element(x + "statusCode5"),
};
However, if your XML is complex and you have different depths (nested elements), you'll need a more robust solution to avoid a bunch of conditional operator checks or multiple queries.
I have something that might help if that is the case - I'll have to dig it up.
EDIT For More Complex XML
I've had similar challenges with some XML I have to deal with at work. In lieu of an easy way to determine what node was the offending node, and not wanting to have hideously long ternary operators, I wrote an extension method that worked recursively from the specified starting node down to the one I was looking for.
Here's a somewhat simple and contrived example to demonstrate.
<SomeXML>
<Tag1>
<Tag1Child1>Value1</Tag1Child1>
<Tag1Child2>Value2</Tag1Child2>
<Tag1Child3>Value3</Tag1Child3>
<Tag1Child4>Value4</Tag1Child4>
</Tag1>
<Tag2>
<Tag2Child1>
<Tag2Child1Child1>SomeValue1</Tag2Child1Child1>
<Tag2Child1Child2>SomeValue2</Tag2Child1Child2>
<Tag2Child1Child3>SomeValue3</Tag2Child1Child3>
<Tag2Child1Child4>SomeValue4</Tag2Child1Child4>
</Tag2Child1>
<Tag2Child2>
<Tag2Child2Child1>
<Tag2Child2Child1Child1 />
<Tag2Child2Child1Child2 />
</Tag2Child2Child1>
</Tag2Child2>
</Tag2>
</SomeXML>
In the above XML, I had no way of knowing (prior to parsing) whether any of the child elements were missing or empty, so after some searching and fiddling I came up with the following extension method:
public static XElement GetChildFromPath(this XElement currentElement, List<string> elementNames, int position = 0)
{
if (currentElement == null || !currentElement.HasElements)
{
return currentElement;
}
if (position == elementNames.Count - 1)
{
return currentElement.Element(elementNames[position]);
}
else
{
XElement nextElement = currentElement.Element(elementNames[position]);
return GetChildFromPath(nextElement, elementNames, position + 1);
}
}
Basically, the method takes the XElement it's called on, plus a List<string> of the element names in path order (with the one I want as the last one) and a position (the index in the list), and then works its way down the path until it finds the element in question or runs out of elements in the path. It's not as elegant as I would like it to be, but I haven't had time to refactor it.
I would use it like this (based on the sample XML above):
MyClass myObj = (from x in XDocument.Parse(myXML).Descendants("SomeXML")
select new MyClass() {
Tag1Child1 = (string)x.GetChildFromPath(new List<string>() {
"Tag1", "Tag1Child1" }),
Tag2Child1Child4 = (string)x.GetChildFromPath(new List<string>() {
"Tag2", "Tag2Child1", "Tag2Child1Child4" }),
Tag2Child2Child1Child2 = (string)x.GetChildFromPath(new List<string>() {
"Tag2", "Tag2Child2", "Tag2Child2Child1",
"Tag2Child2Child1Child2" })
}).SingleOrDefault();
Not as elegant as I'd like it to be, but at least it allows me to parse an XML document that may have missing nodes without blowing chunks. Another option was to do something like:
Tag2Child2Child1Child2 = x.Element("Tag2") == null ?
"" : x.Element("Tag2").Element("Tag2Child2") == null ?
"" : x.Element("Tag2").Element("Tag2Child2").Element("Tag2Child2Child1") == null ?
"" : x.Element("Tag2").Element("Tag2Child2").Element("Tag2Child2Child1").Element("Tag2Child2Child1Child2") == null ?
"" : x.Element("Tag2")
.Element("Tag2Child2")
.Element("Tag2Child2Child1")
.Element("Tag2Child2Child1Child2").Value
That would get really ugly for an object that had dozens of properties.
Anyway, if this is of use to you feel free to use/adapt/modify as you need.
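
Coming back to the original request for a message like "Missing tag statusCode5": one possible approach, sketched below, is a small helper that looks up a child and throws a descriptive exception when it is absent. The helper name Required and the exact message format are assumptions for illustration, not part of LINQ to XML.

```csharp
using System;
using System.Xml.Linq;

// Hypothetical helper: like Element(), but throws a descriptive exception
// instead of returning null when the child is missing.
static XElement Required(XElement parent, XName name)
{
    var child = parent.Element(name);
    if (child == null)
        throw new InvalidOperationException(
            $"Missing tag {name.LocalName} under {parent.Name.LocalName}");
    return child;
}

XNamespace x = "http://xml.stack.com/RRAND01234";
var reply = XElement.Parse(
    "<StackOverflowReply xmlns='http://xml.stack.com/RRAND01234'>" +
    "<processStatus><statusCode1>P</statusCode1></processStatus>" +
    "</StackOverflowReply>");

// Present element: behaves like the chained Element() calls.
var status1 = Required(Required(reply, x + "processStatus"), x + "statusCode1").Value;
Console.WriteLine(status1); // prints: P

// Missing element: the exception names the tag that was not found.
var message = "";
try
{
    _ = Required(Required(reply, x + "processStatus"), x + "statusCode5").Value;
}
catch (InvalidOperationException ex)
{
    message = ex.Message;
}
Console.WriteLine(message); // prints: Missing tag statusCode5 under processStatus
```

The trade-off is that the first missing tag aborts the whole query, so this fits cases where a missing element really is an error rather than optional data.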

C# to get data from a website

I would like to get the data from this website and put them into a dictionary.
Basically these are prices and quantities for some financial instruments.
I have this source code for the page (here is just an extract of the whole text):
<tr>
<td class="quotesMaxTime1414148558" id="notation115602071"><span>4,000.00</span></td>
<td><span>0</span></td>
<td class="icon red"><span id="domhandler:8.consumer:VALUE-2CCLASS.comp:PREV.gt:green.eq:ZERO.lt:red.resetLt:.resetGt:.resetEq:ZERO.mdgObj:prices-2Fquote-3FVERSION-3D2-26CODE_SELECTOR_PREVIOUS_LAST-3DLATEST-26ID_TYPE_PERFORMANCE-3D7-26ID_TYPE_PRICE-3D1-26ID_QUALITY_PRICE-3D5-26ID_NOTATION-3D115602071.attr:PERFORMANCE_PCT.wtkm:options_options_snapshot_1">-3.87%</span></td>
<td><span id="domhandler:9.consumer:VALUE-2CCLASS.comp:PREV.gt:green.eq:ZERO.lt:red.resetLt:.resetGt:.resetEq:ZERO.mdgObj:prices-2Fquote-3FVERSION-3D2-26CODE_SELECTOR_PREVIOUS_LAST-3DLATEST-26ID_TYPE_PERFORMANCE-3D7-26ID_TYPE_PRICE-3D1-26ID_QUALITY_PRICE-3D5-26ID_NOTATION-3D115602071.attr:PRICE.wtkm:options_options_snapshot_1">960.40</span></td>
</tr>
Now I would like to extract the following information:
The value "4000" from the second line;
The value "-3.87%" from the fourth line;
The value "960.40" from the fifth line.
I have tried to use the following to extract the first information (the value 4000):
string url = "http://www.eurexchange.com/action/exchange-en/4744-19066/19068/quotesSingleViewOption.do?callPut=Put&maturityDate=201411";
var webGet = new HtmlWeb();
var document = webGet.Load(url);
var firstData = from x in document.DocumentNode.Descendants()
where x.Name == "td" && x.Attributes.Contains("class")
select x.InnerText;
but firstData doesn't contain the value I want (4000); instead it shows this:
System.Linq.Enumerable+WhereSelectEnumerableIterator`2[HtmlAgilityPack.HtmlNode,System.String]
How can I get this data? I would also need to repeat this task several times, because the page has more than one row containing similar information. Is the HTML Agility Pack useful in this context? Thanks.
This is somewhat ugly, since it was thrown together quickly and could probably be cleaned up a lot, but it returns all of the values you are looking for from the Prices/Quotes table on that page. Hope it helps.
var url = "http://www.eurexchange.com/action/exchange-en/4744-19066/19068/quotesSingleViewOption.do?callPut=Put&maturityDate=201411";
var webGet = new HtmlWeb();
var document = webGet.Load(url);
var pricesAndQuotesDataTable =
(from elem in
document.DocumentNode.Descendants()
.Where(
d =>
d.Attributes["class"] != null && d.Attributes["class"].Value == "toggleTitle" &&
d.ChildNodes.Any(h => h.InnerText != null && h.InnerText == "Prices/Quotes"))
select
elem.Descendants()
.FirstOrDefault(
d => d.Attributes["class"] != null && d.Attributes["class"].Value == "dataTable")).FirstOrDefault();
if (pricesAndQuotesDataTable != null)
{
var dataRows = from elem in pricesAndQuotesDataTable.Descendants()
where elem.Name == "tr" && elem.ParentNode.Name == "tbody"
select elem;
var dataPoints = new List<object>();
foreach (var row in dataRows)
{
var dataColumns = (from col in row.ChildNodes.Where(n => n.Name == "td")
select col).ToList();
dataPoints.Add(
new
{
StrikePrice = dataColumns[0].InnerText,
DifferenceToPreviousDay = dataColumns[9].InnerText,
LastPrice = dataColumns[10].InnerText
});
}
}
That's because your LINQ query hasn't executed yet: firstData holds the query itself, not its results. If you check the Results View in the debugger and run the query, you'll get all the items, the first being the value you are looking for.
So, this will get you 4,000.00
var firstData = (from x in document.DocumentNode.Descendants()
where x.Name == "td" && x.Attributes.Contains("class")
select x.InnerText).First();
If you want them all, call ToList() instead of First().
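
To see the deferred execution in isolation, here is a minimal sketch that needs no HTML parsing at all. The list of strings is hypothetical data standing in for the InnerText values of the <td> nodes.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Stand-in for the InnerText values of the <td> nodes (hypothetical data).
var cells = new List<string> { "4,000.00", "0", "-3.87%", "960.40" };

// Building the query does not run it; 'query' is just an iterator object.
IEnumerable<string> query = from c in cells
                            where c.Length > 0
                            select c;

// Printing the query itself shows the iterator's type name, not the data.
Console.WriteLine(query.ToString());

// First() and ToList() enumerate the query and return actual values.
string first = query.First();
List<string> all = query.ToList();

Console.WriteLine(first);     // prints: 4,000.00
Console.WriteLine(all.Count); // prints: 4
```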
If you are open to using CsQuery, try this:
static void Main()
{
CsQuery.CQ cq = CsQuery.CQ.CreateFromUrl("http://www.eurexchange.com/action/exchange-en/4744-19066/19068/quotesSingleViewOption.do?callPut=Put&maturityDate=201411");
string str = cq["#notation115602071 span"].Text();
}
You could use the Html Agility Pack. Unlike XmlDocument or XDocument, the Html Agility Pack is tolerant of malformed HTML (which exists all over the internet and probably on the site you are trying to parse).
Not all HTML pages can be assumed to be valid XML.
With the Html Agility Pack you can load your page and parse it with XPath or an object model similar to System.Xml.
Html Agility Pack
Alternatively, you could use a PDF-to-text converter and parse a text file with much better accuracy, since the website you linked offers a PDF export of that same data:
PDF Export Link
Convert PDF to Text
We did a similar project a few years back to spider all the major online betting websites and create a comparison tool to get the best prices for each type of event, e.g. display all the major bookmakers with betting odds for a particular football game in order of best return.
It turned out to be a complete nightmare: the rendered HTML output of the websites kept changing almost daily and was quite often poorly formed, which could sometimes crash the spider daemon, so we had to constantly maintain the system to keep it working properly.
With these sorts of things it's often more economical to subscribe to a data feed, which requires much less maintenance and is easier to integrate.

linq to xml query based on multiple statements in order to serve a previous / next button

I am a newbie to LINQ and am having difficulty solving an easy problem, as I have never done this before.
The scenario is a single XML table with books, like:
<?xml version="1.0" encoding="utf-8"?>
<dbproject>
<books_dataset>
<book>
<id>23</id>
<isbn>075221912X</isbn>
<title>Big Brother: The Unseen Story</title>
<author>Jean Ritchie</author>
<publicationYr>2000</publicationYr>
<publisher>Pan Macmillan</publisher>
<pages>169</pages>
<imageBigLink>/images/P/075221912X.01.LZZZZZZZ.jpg</imageBigLink>
<priceActual>0</priceActual>
<numberOfBids>0</numberOfBids>
<sf>kw</sf>
<df></df>
<ef></ef>
<description>Lorem ipsum dolor sit amet</description>
</book>
</books_dataset>
</dbproject>
I am trying to create a query which gives me the ID (next one / first one) of the next/previous book which has a "kw" string in the node.
The IDs are not continuous and there is no index. So, for instance, a Next button looks for an ID as follows:
Next (higher) ID = next book
which has a "kw" string in it
I've tried many solutions but just got confused :/
I am able to jump to the next/previous node, but to be honest I am sure it isn't the best approach to achieve the task.
I am able to list the books which have a "kw" string, but these two requirements do not work together :/
I use this query to ask for a next ID :
var btnNextEval = (from databack in xmlData.Element("dbproject").Elements(QRY).Elements(QRY_sub)
where databack.Element(fid1).Value == trgtCounter.ToString()
select databack).Single().ElementsAfterSelf().First().Element("id").Value;
trgtCounter = Convert.ToInt16(btnNextEval);
I tried to use && to combine multiple where conditions, but it didn't work :/
Please help and show me possible solutions for this silly problem.
Thanks!
Try this:
var nextId = (
from book in xmlData.Descendants("book")
let id = (int)book.Element("id")
where ((string)book.Element("sf")) == "kw"
&& id > currentId
select id
).DefaultIfEmpty(-1).Min();
This returns the next ID. To get the book with the next ID, do the following:
var nextBook = (
from book in xmlData.Descendants("book")
where (int)book.Element("id") == nextId
select book
).First();
Notes:
This assumes there is a variable currentId of type int containing the current ID.
You need the DefaultIfEmpty in case there are no IDs greater than the current one. In that case, Min would throw an exception on the empty sequence. With DefaultIfEmpty(-1) the sequence contains the single value -1 instead, so Min returns -1.
First will also throw if used on an empty sequence, so check for nextId == -1 before looking up the book.
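
Put together, the Next and Previous buttons could be served by two small functions, sketched below. The trimmed-down data set and the names NextId and PreviousId are assumptions for the demo; -1 is used as the "no such book" marker, as in the answer above.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

// Hypothetical, trimmed-down data set with non-continuous ids.
var xmlData = XDocument.Parse(
    "<dbproject><books_dataset>" +
    "<book><id>23</id><sf>kw</sf></book>" +
    "<book><id>31</id><sf></sf></book>" +
    "<book><id>40</id><sf>kw</sf></book>" +
    "</books_dataset></dbproject>");

// Smallest id above currentId among "kw" books, or -1 when there is none.
int NextId(int currentId) =>
    xmlData.Descendants("book")
        .Where(b => (string)b.Element("sf") == "kw")
        .Select(b => (int)b.Element("id"))
        .Where(id => id > currentId)
        .DefaultIfEmpty(-1)
        .Min();

// Mirror image for the Previous button: largest id below currentId, or -1.
int PreviousId(int currentId) =>
    xmlData.Descendants("book")
        .Where(b => (string)b.Element("sf") == "kw")
        .Select(b => (int)b.Element("id"))
        .Where(id => id < currentId)
        .DefaultIfEmpty(-1)
        .Max();

Console.WriteLine(NextId(23));     // prints: 40 (31 is skipped, it has no "kw")
Console.WriteLine(NextId(40));     // prints: -1 (no later book)
Console.WriteLine(PreviousId(40)); // prints: 23
```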

Linq to XML: I am not able to compare the nested element

Thank you in advance, this is a great resource.
I believe the code explains itself, but just in case I am being arrogant I will explain myself.
My program lists movies in a TreeView according to the genre selected in the drop-down list. Each movie has a few genres, ergo the nested genres.
This is the XML:
<movie>
<title>2012</title>
<director>Roland Emmerich</director>
<writtenBy>
<writter>Roland Emmerich,</writter>
<writter>Harald Kloser</writter>
</writtenBy>
<releaseDate>12-Nov-2009</releaseDate>
<actors>
<actor>John Cusack,</actor>
<actor>Thandie Newton, </actor>
<actor>Chiwetel Ejiofor</actor>
</actors>
<filePath>H:\2012\2012.avi</filePath>
<picPath>~\image\2012.jpg</picPath>
<runningTime>158 min</runningTime>
<plot>Dr. Adrian Helmsley, part of a worldwide geophysical team investigating the effect on the earth of radiation from unprecedented solar storms, learns that the earth's core is heating up. He warns U.S. President Thomas Wilson that the crust of the earth is becoming unstable and that without proper preparations for saving a fraction of the world's population, the entire race is doomed. Meanwhile, writer Jackson Curtis stumbles on the same information. While the world's leaders race to build "arks" to escape the impending cataclysm, Curtis struggles to find a way to save his family. Meanwhile, volcanic eruptions and earthquakes of unprecedented strength wreak havoc around the world. </plot>
<trailer>http://2012-movie-trailer.blogspot.com/</trailer>
<genres>
<genre>Action</genre>
<genre>Adventure</genre>
<genre>Drama</genre>
</genres>
<rated>PG-13</rated>
</movie>
This is the code:
string selectedGenre = this.ddlGenre.SelectedItem.ToString();
XDocument xmldoc = XDocument.Load(Server.MapPath("~/App_Data/movie.xml"));
List<Movie> movies =
(from movie in xmldoc.Descendants("movie")
// The treeView doesn't exist
where movie.Elements("genres").Elements("genre").ToString() == selectedGenre
select new Movie
{
Title = movie.Element("title").Value
}).ToList();
foreach (var movie in movies)
{
TreeNode myNode = new TreeNode();
myNode.Text = movie.Title;
TreeView1.Nodes.Add(myNode);
}
Change your code to
List<Movie> movies =
(from movie in xmldoc.Descendants("movie")
where movie.Elements("genres").Elements("genre").Any(e => e.Value == selectedGenre)
select new Movie
{
Title = movie.Element("title").Value
}).ToList();
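This is because Elements("genre") returns a sequence of genre nodes, and calling ToString() on that sequence gives its type name rather than a genre value. Since there is more than one genre node, you have to check whether any of them matches.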
List<Movie> movies =
(from movie in xmldoc.Descendants("movie")
where movie.Elements("genres")
.Any(e => e.Elements("genre").Any(g => g.Value == selectedGenre))
select new Movie
{
Title = movie.Element("title").Value
}).ToList();
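
The difference between the two filters can be checked with a small stand-alone sketch. The two-movie document and the title "Sample Movie" are made up for the demo, and plain strings are projected instead of the Movie class, which is not shown in the question.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

// Hypothetical two-movie document; only the elements used by the query are included.
var xmldoc = XDocument.Parse(
    "<movies>" +
    "<movie><title>2012</title><genres><genre>Action</genre><genre>Drama</genre></genres></movie>" +
    "<movie><title>Sample Movie</title><genres><genre>Comedy</genre></genres></movie>" +
    "</movies>");

var selectedGenre = "Drama";

// ToString() on the sequence yields a type name, so it never equals a genre.
var wrong = xmldoc.Descendants("movie")
    .Where(m => m.Elements("genres").Elements("genre").ToString() == selectedGenre)
    .Select(m => (string)m.Element("title"))
    .ToList();

// Any() compares the value of each <genre> node.
var right = xmldoc.Descendants("movie")
    .Where(m => m.Elements("genres").Elements("genre").Any(g => g.Value == selectedGenre))
    .Select(m => (string)m.Element("title"))
    .ToList();

Console.WriteLine(wrong.Count);            // prints: 0
Console.WriteLine(string.Join(", ", right)); // prints: 2012
```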
