Parse XML with quoted attributes - c#

I'm using a 3rd party XML parser (not my decision) and found it does something bad. Here is the inner part of an XML tag:
"Date=""2014-01-01"" Amounts=""100717.72 100717.72 100717.72 100717.72"""
To parse the attributes, the code does a .split on spaces, ignoring the quotes. This works fine as long as there's no strings with spaces, but here we are. It returns proper Date=2014-01-01 and semi-proper Amounts=100717.72, but then four more entries of just the numbers.
I have the C# code for the parser, and thought about replacing the spaces-inside-quotes with some other character, splitting, and the changing them back. But then I thought I should ask here first.
Is there a way to parse this text into two entries properly?
UPDATE: original XML follows (typed in from another computer, forgive me!)
<DetailAmounts Date="2014-01-01" Amounts="100717.72 100717.72 100717.72 100717.72" />

You should just use XmlSerializer to deserialize the data:
public class DetailAmounts
{
[XmlAttribute]
public DateTime Date { get; set; }
[XmlAttribute]
public string Amounts { get; set; }
}
// ...
var xml = "<DetailAmounts Date=\"2014-01-01\" Amounts=\"100717.72 100717.72 100717.72 100717.72\" />";
var serializer = new XmlSerializer(typeof(DetailAmounts));
using (var reader = new StringReader(xml))
{
var detailAmounts = (DetailAmounts)serializer.Deserialize(reader);
}
Or, you can use XElement to parse each individual values:
var xml = "<DetailAmounts Date=\"2014-01-01\" Amounts=\"100717.72 100717.72 100717.72 100717.72\" />";
var element = XElement.Parse(xml);
var detailAmounts = new
{
Date = (DateTime)element.Attribute("Date"),
Amounts = element.Attribute("Amounts").Value.Split(' ')
.Select(x => decimal.Parse(x, CultureInfo.InvariantCulture))
.ToArray()
};

Related

XMLinvalide chars replacement C#

I have a string that is displayed in XML but in it I have some invalid chars like string
s = <root> something here <XMLElement>hello</XMLElement> somethig here too </root>
where XMLElement is a List like XMLElement = {"bold", "italic",...} .
What I need is to replace the < and </ if followed by any of the XMLElements to be replaced by > or < depending on the cases.
The <root> is to keep
I have tried so far some regEx
strAux = Regex.Replace(strAux, "bold=\"[^\"]*\"",
match => match.Value.Replace("<", "<").Replace(">", ">"));
or
List<string> startsWith = new List<string> { "<", "</"};
foreach(var stw in startsWith)
{
int nextLt = 0;
while ((nextLt = strAux.IndexOf(stw, nextLt)) != -1)
{
bool isMatch = strAux.Substring(nextLt + 1).StartsWith(BoldElement); // needs to ckeck all the XMLElements
//is element, leave it
if (isMatch)
{
//its not, replace
strAux = string.Format(#"{0}<{1}", strAux.Substring(0, nextLt), strAux.Substring(nextLt +1, strAux.Length - (nextLt + 1)));
}
nextLt++;
}
}
Also tried
XmlDocument doc = new XmlDocument();
XmlElement element = doc.CreateElement("root");
element.InnerText = strAux;
Console.WriteLine(element.OuterXml);
strAux = element.OuterXml.Replace("<root>", "").Replace("</root>", "");
return strAux; But it will repeat the `<root>` too
But nothing worked like I suposed. Is there any different ideias .Thanks
What you have is well-formed XML, so you can use the XML APIs to help you:
Using LINQ to XML (which is generally the better API):
var element = XElement.Parse(s);
element.Value = string.Concat(element.Nodes());
var result = element.ToString();
Or using the older XmlDocument API:
var doc = new XmlDocument();
doc.LoadXml(s);
var root = doc.DocumentElement;
root.InnerText = root.InnerXml;
var result = root.OuterXml;
The result for both is:
<root> something here <XMLElement>hello</XMLElement> somethig here too </root>
See this fiddle for a demo.
You should be using the XmlWriter class.
Sample from the documentation:
XmlWriterSettings settings = new XmlWriterSettings();
settings.OmitXmlDeclaration = true;
settings.ConformanceLevel = ConformanceLevel.Fragment;
settings.CloseOutput = false;
// Create the XmlWriter object and write some content.
MemoryStream strm = new MemoryStream();
XmlWriter writer = XmlWriter.Create(strm, settings);
writer.WriteElementString("someNode", "someValue");
writer.Flush();
writer.Close();
https://msdn.microsoft.com/en-us/library/system.xml.xmlwriter(v=vs.110).aspx
It sounds like your input is well-formed XML, but you want to escape some of the tags. The issue here is that there's no way for the code to know which tags are valid and which aren't.
One way to do this is to create a list of valid tags.
List<string> validTags = new List<string>() { "root", "..." };
Then use regex to pick out all instances of <tag> or </tag>and replace them if they're not in the list.
Another way which is faster and easier, but requires more information up front, is to create a list of tags which aren't valid.
List<string> invalidTags = new List<string>() { "XMLElement", "..." };
Simple string manipulation will do, now.
string s = GetYourXMLString();
invalidTags.ForEach(t => s = s.Replace($"</{t}>",$"<{t}>")
.Replace($"<{t}>",$"</{t}>"));
The second way should really only be used if you know which foreign tags are making (or will ever make) an appearance. If not the first approach should be used. One clever possibility is to dynamically create the list of valid tags using reflection or a data contract so that changes to the XML spec will be automatically reflected in your code.
For example, if each element is a property of an object, you might get the list like this:
var validTags = typeof(MyObjectType).GetProperties()
.Select(p => p.PropertyName)
.ToList();
Of course, the property names likely won't be the actual tag names, AND often you'll want to only include certain properties. So you make an attribute class to designate the desired properties (let's call it XMLTagName) and then you can do this:
var validTags = typeof(MyObjectType).GetProperties()
.Select(p => p.GetCustomAttribute<XMLTagName>()?.TagName)
.Where(tagName => tagName != null) //gets rid of properties that aren't tagged
.ToList();
Even with all that, you'll still committing the crime of string manipulation on raw XML. After all, the best real solution here is to figure out how to fix the incoming XML to actually contain the data you want. But if that's not a possibility, the above should do the job.

XML Deserialization of many types

I am deserializing a large xml doc into a C# object.
I've run into an issue where there are multiple xml elements on the same line, and am having trouble re-constructing them properly in code.
A snippet example as so:
<parent>
<ce:para view="all">
Text <ce:cross-ref refid="123">[1]</ce:cross-ref> More Text <ce:italic>Italicized text</ce:italic> and more text here
</ce:para>
<ce:para>...</ce:para>
</parent>
The generated C# class looks like this
[XmlRoot(ElementName = "para", Namespace = "namespace")]
public class Para
{
[XmlElement(ElementName = "cross-ref", Namespace = "namespace")]
public List<Crossref> Crossref { get; set; }
[XmlText]
public List<string> Text { get; set; }
[XmlElement(ElementName = "italic", Namespace = "namespace")]
public List<Italic> Italic { get; set; }
}
I want to be able to loop over this object and re-construct the sentence as a plain string.
Text [1] More Text Italicized Text and more text here
The only problem is though when the deserialization happens, the order is lost as each bit is stuck into it's respective object. This means I have no way of knowing how to reconstruct the string back to how it is supposed to be.
Text: {"Text", "More Text", "and more text here"}
Crossref: {"[1]"}
Italic: {"Italicized Text"}
I've thought about bringing in the whole element in as a string, and then scrubbing the tags out of it, but I'm not sure how to properly get it deserialized. Or if there is a better way to go about it.
Disclaimer: I am not able to alter the XML document as it is coming in from a 3rd party.
Thanks
Once you have deserialized the 3rd party XML into an object that directly matches the XML's schema (as you have done already in your example above) you should be able to use XmlNode.InnerText() on the <ce:para node to extract what you're looking for without having to write any parsing code.
At that point, you could do a translation from the object you deserialized into from the raw 3rd party XML into an object which flattens out the <ce:para node into a simple string.
As per Chris' request, I'm posting my solution. It probably could used refining as I'm not very experienced with linq queries.
XDocument xdoc = xmlAdapter.GetAsXDoc(xmlstring);
IEnumerable<XElement> body = from b in xdoc.Descendants()
where b.Name.LocalName == "body"
select b;
IEnumerable<XElement> sections = from s in body.Descendants()
where s.Name.LocalName == "sections"
select s;
IEnumerable<XElement> paragraphs = from p in sections.Descendants()
where p.Name.LocalName == "para"
select p;
string bodytext = "";
if (paragraphs.Count() > 0)
{
StringBuilder text = new StringBuilder();
foreach (XElement p in paragraphs)
{
text.AppendFormat("{0} ", p.Value);
}
}
bodytext = text.ToString();

Using an XmlTextAttribute on an array of mixed types removes whitespace strings

I am writing a set of objects that must serialize to and from Xml, following a strict specification that I cannot change. One element in this specification can contain a mix of strings and elements in-line.
A simple example of this Xml output would be this:
<root>Leading text <tag>tag1</tag> <tag>tag2</tag></root>
Note the whitespace characters between the closing of the first tag, and the start of the second tag. Here are the objects that represents this structure:
[XmlRoot("root")]
public class Root
{
[XmlText(typeof(string))]
[XmlElement("tag", typeof(Tag))]
public List<object> Elements { get; set; }
//this is simply for the sake of example.
//gives us four objects in the elements array
public static Root Create()
{
Root root = new Root();
root.Elements.Add("Leading text ");
root.Elements.Add(new Tag() { Text = "tag1" });
root.Elements.Add(" ");
root.Elements.Add(new Tag() { Text = "tag2" });
return root;
}
public Root()
{
Elements = new List<object>();
}
}
public class Tag
{
[XmlText]
public string Text {get;set;}
}
Calling Root.Create(), and saving to a file using this method looks perfect:
public XDocument SerializeToXml(Root obj)
{
XmlSerializer serializer = new XmlSerializer(typeof(Root));
XDocument doc = new XDocument();
using (var writer = doc.CreateWriter())
{
serializer.Serialize(writer, obj);
}
return doc;
}
Serialization looks exactly like the xml structure at the beginning of this post.
Now when I want to serialize an xml file back into a Root object, I call this:
public static Root FromFile(string file)
{
XmlSerializer serializer = new XmlSerializer(typeof(Root));
XmlReaderSettings settings = new XmlReaderSettings();
XmlReader reader = XmlTextReader.Create(file, settings);
//whitespace gone here
Root root = serializer.Deserialize(reader) as Root;
return root;
}
The problem is here. The whitespace string is eliminated. When I call Root.Create(), there are four objects in the Elements array. One of them is a space. This serializes just fine, but when deserializing, there are only 3 objects in Elements. The whitespace string gets eliminated.
Any ideas on what I'm doing wrong? I've tried using xml:space="preserve", as well as a host of XmlReader, XmlTextReader, etc. variations. Note that when I use a StringBuilder to read the XmlTextReader, the xml contains the spaces as I'd expect. Only when calling Deserialize(stream) do I lose the spaces.
Here's a link to an entire working example. It's LinqPad friendly, just copy/paste: http://pastebin.com/8MkUQviB The example opens two files, one a perfect serialized xml file, the second being a deserialized then reserialized version of the first file. Note you'll have to reference System.Xml.Serialization.
Thanks for reading this novel. I hope someone has some ideas. Thank you!
It looks like a bug. Workaround seems to be replace all whitespaces and crlf in XML text nodes by
entities. Semantic equal entities (
) does not work.
<root>Leading text <tag>tag1</tag> <tag>tag2</tag></root>
is working for me.

Parsing XML document using .Descendents(value)

I am trying to parse an xml document that I have created. However xml.Descendants(value) doesn't work if value has certain characters (including space, which is my problem).
My xml is structured like this:
<stockists>
<stockistCountry country="Great Britain">
<stockist>
<name></name>
<address></address>
</stockist>
</stockistCountry>
<stockistCountry country="Germany">
<stockist>
<name></name>
<address></address>
</stockist>
</stockistCountry>
...
</stockists>
And my C# code for parsing looks like this:
string path = String.Format("~/Content/{0}/Content/Stockists.xml", Helper.Helper.ResolveBrand());
XElement xml = XElement.Load(Server.MapPath(path));
var stockistCountries = from s in xml.Descendants("stockistCountry")
select s;
StockistCountryListViewModel stockistCountryListViewModel = new StockistCountryListViewModel
{
BrandStockists = new List<StockistListViewModel>()
};
foreach (var stockistCountry in stockistCountries)
{
StockistListViewModel stockistListViewModel = new StockistListViewModel()
{
Country = stockistCountry.FirstAttribute.Value,
Stockists = new List<StockistDetailViewModel>()
};
var stockist = from s in xml.Descendants(stockistCountry.FirstAttribute.Value) // point of failure for 'Great Britain'
select s;
foreach (var stockistDetail in stockist)
{
StockistDetailViewModel stockistDetailViewModel = new StockistDetailViewModel
{
StoreName = stockistDetail.FirstNode.ToString(),
Address = stockistDetail.LastNode.ToString()
};
stockistListViewModel.Stockists.Add(stockistDetailViewModel);
}
stockistCountryListViewModel.BrandStockists.Add(stockistListViewModel);
}
return View(stockistCountryListViewModel);
I am wondering if I am approaching the Xml parsing correctly, whether I shouldn't have spaces in my attributes etc? How to fix it so that Great Britain will parse
However xml.Descendants(value) doesn't work if value has certain characters
XElement.Descendants() expects an XName for the tag, not for the value.
And XML tags are indeed not allowed to contain spaces.
Your sample XML however only contains a value for an attribute, and the space there is fine.
Update:
I think you need
//var stockist = from s in xml.Descendants(stockistCountry.FirstAttribute.Value)
// select s;
var stockists = stockistCountry.Descendants("stockist");

C# newbie: reading repetitive XML to memory

I'm new to C#. I'm building an application that persists an XML file with a list of elements. The structure of my XML file is as follows:
<Elements>
<Element>
<Name>Value</Name>
<Type>Value</Type>
<Color>Value</Color>
</Element>
<Element>
<Name>Value</Name>
<Type>Value</Type>
<Color>Value</Color>
</Element>
<Element>
<Name>Value</Name>
<Type>Value</Type>
<Color>Value</Color>
</Element>
</Elements>
I have < 100 of those items, and it's a single list (so I'm considering a DB solution to be overkill, even SQLite). When my application loads, I want to read this list of elements to memory. At present, after browsing the web a bit, I'm using XmlTextReader.
However, and maybe I'm using it in the wrong way, I read the data tag-by-tag, and thus expect the tags to be in a certain order (otherwise the code will be messy). What I would like to do is read complete "Element" structures and extract tags from them by name. I'm sure it's possible, but how?
To clarify, the main difference is that the way I'm using XmlTextReader today, it's not tolerant to scenarios such as wrong order of tags (e.g. Type comes before Name in a certain Element).
What's the best practice for loading such structures to memory in C#?
It's really easy to do in LINQ to XML. Are you using .NET 3.5? Here's a sample:
using System;
using System.Xml.Linq;
using System.Linq;
class Test
{
[STAThread]
static void Main()
{
XDocument document = XDocument.Load("test.xml");
var items = document.Root
.Elements("Element")
.Select(element => new {
Name = (string)element.Element("Name"),
Type = (string)element.Element("Type"),
Color = (string)element.Element("Color")})
.ToList();
foreach (var x in items)
{
Console.WriteLine(x);
}
}
}
You probably want to create your own data structure to hold each element, but you just need to change the "Select" call to use that.
Any particular reason you're not using XmlDocument?
XmlDocument myDoc = new XmlDocument()
myDoc.Load(fileName);
foreach(XmlElement elem in myDoc.SelectNodes("Elements/Element"))
{
XmlNode nodeName = elem.SelectSingleNode("Name/text()");
XmlNode nodeType = elem.SelectSingleNode("Type/text()");
XmlNode nodeColor = elem.SelectSingleNode("Color/text()");
string name = nodeName!=null ? nodeName.Value : String.Empty;
string type = nodeType!=null ? nodeType.Value : String.Empty;
string color = nodeColor!=null ? nodeColor.Value : String.Empty;
// Here you use the values for something...
}
It sounds like XDocument, and XElement might be better suited for this task. They might not have the absolute speed of XmlTextReader, but for your cases they sound like they would be appropriate and it would make dealing with fixed structures a lot easier. Parsing out elements would work like so:
XDocument xml;
foreach (XElement el in xml.Element("Elements").Elements("Element")) {
var name = el.Element("Name").Value;
// etc.
}
You can even get a bit fancier with Linq:
XDocument xml;
var collection = from el in xml.Element("Elements").Elements("Element")
select new { Name = el.Element("Name").Value,
Color = el.Element("Color").Value,
Type = el.Element("Type").Value
};
foreach (var item in collection) {
// here you can use item.Color, item.Name, etc..
}
You could use XmlSerializer class (http://msdn.microsoft.com/en-us/library/system.xml.serialization.xmlserializer.aspx)
public class Element
{
public string Name { get; set; }
public string Type { get; set; }
public string Color { get; set; }
}
class Program
{
static void Main(string[] args)
{
string xml =
#"<Elements>
<Element>
<Name>Value</Name>
<Type>Value</Type>
<Color>Value</Color>
</Element>(...)</Elements>";
XmlSerializer serializer = new XmlSerializer(typeof(Element[]), new XmlRootAttribute("Elements"));
Element[] result = (Element[])serializer.Deserialize(new StringReader(xml));}
You should check out Linq2Xml, http://www.hookedonlinq.com/LINQtoXML5MinuteOverview.ashx

Categories

Resources