Custom XML-like Syntax Parsing - c#

I'm attempting to replicate a dialogue system from a game that has control codes, which are HTML/XML-like tags that dictate behavior of a text bubble. For example, changing the color of a piece of text would be like <co FF0000FF>Hello World!</co>. These control codes are not required in the text, so Hello <co FF0000FF>World!</co> or simply Hello World should parse as well.
I've attempted to make it similar to XML to ease parsing, but XML requires a root-level tag to parse successfully, and the text may or may not have any control codes. For example, I'm able to parse the following fine with XElement.
string Text = "<co value=\"FF0000FF\">Hello World!</co>";
XElement.Parse(Text);
However, the following fails with an XmlException ("Data at the root level is invalid. Line 1, position 1."):
string Text = "Hello <co value=\"FF0000FF\">World!</co>";
XElement.Parse(Text);
What would be a good approach to handling this? Is there a way to handle parsing XML elements in a string without requiring a strict XML syntax, or is there another type of parser I can use to achieve what I want?

If the only difference between your XML-like fragments and real XML is the absence of a root element, then simply wrap the fragment in a dummy root element before parsing:
parse("<dummy>" + fragment + "</dummy>")
If there are other differences, for example attributes not being in quotes, or attribute names starting with a digit, then an XML parser isn't going to be much use to you and you will need to write your own. Alternatively, an HTML parser such as validator.nu might handle it, if you're lucky.
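A minimal sketch of the dummy-root approach in C#, assuming the fragment is well-formed apart from the missing root. The names DialogueMarkup and ParseWrapped are made up for illustration, and the element name dummy is arbitrary.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class DialogueMarkup
{
    // Wraps the fragment in a throwaway root element and returns the root's
    // child nodes: XText for plain text runs, XElement for control codes.
    public static List<XNode> ParseWrapped(string fragment)
    {
        XElement root = XElement.Parse("<dummy>" + fragment + "</dummy>",
                                       LoadOptions.PreserveWhitespace);
        return root.Nodes().ToList();
    }

    public static void Main()
    {
        foreach (XNode node in ParseWrapped("Hello <co value=\"FF0000FF\">World!</co>"))
        {
            Console.WriteLine($"{node.GetType().Name}: {node}");
        }
    }
}
```

Text with no control codes at all comes back as a single XText node, so the caller can treat both cases the same way.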

You can try HtmlAgilityPack.
Install the NuGet package by running this command: Install-Package HtmlAgilityPack
The following sample returns all descendant nodes. I did not filter the result of Descendants, but you can add further code as needed.
It will parse your custom format.
string Text = "Hello <co value=\"FF0000FF\">World!</co>";
Text = System.Net.WebUtility.HtmlDecode(Text);
HtmlDocument result = new HtmlDocument();
result.LoadHtml(Text);
List<HtmlNode> nodes = result.DocumentNode.Descendants().ToList();

If the XML elements within your text will always be well-formed, then you can use the XML libraries to do this.
You can either wrap your text inside a root element and use XElement.Parse and read the child nodes, or you can use some lower level bits to allow you to parse the nodes in an XML fragment:
public static IEnumerable<XNode> Parse(string text)
{
var settings = new XmlReaderSettings
{
ConformanceLevel = ConformanceLevel.Fragment
};
using (var sr = new StringReader(text))
using (var xr = XmlReader.Create(sr, settings))
{
xr.MoveToContent();
while (xr.EOF == false)
{
yield return XNode.ReadFrom(xr);
}
}
}
Using it like this:
foreach (var node in Parse("Hello <co value=\"FF0000FF\">World!</co>"))
{
Console.WriteLine($"{node.GetType().Name}: {node}");
}
Would output this:
XText: Hello
XElement: <co value="FF0000FF">World!</co>
See this fiddle for a working demo.

Related

How to read xml string ignoring header?

I want to read an XML string while ignoring the header and the comments.
Ignoring the comments is simple, and I found a solution here.
But I can't find any way to ignore the header.
Let me give an example:
Consider this xml:
<?xml version="1.0" encoding="iso-8859-1"?>
<!-- Some comments -->
<Tag Attribute="3">
...
</Tag>
I want to read the XML into a string containing just the element "Tag" and any other elements, but without the "xml version" declaration and the comments.
The element "Tag" is only an example. Could exist many others.
So, I want only this:
<Tag Attribute="3">
...
</Tag>
The code that I've come so far:
XmlReaderSettings settings = new XmlReaderSettings();
settings.IgnoreComments = true;
XmlReader reader = XmlReader.Create("...", settings);
xmlDoc.Load(reader);
And I can't find anything on XmlReaderSettings to do that.
Do I need to go node by node, choosing only the ones I want? Does no such setting exist?
EDIT 1:
Just to summarize my problem: I need the contents of the XML to use in a CDATA section for a web service. When I send the comments or the XML version, I get a specific error about that part of the XML. So I assume that if I read the XML without the version header and the comments, I'll be good to go.
Here's a really simple solution.
using (var reader = XmlReader.Create(/*reader, stream, etc.*/))
{
reader.MoveToContent();
string content = reader.ReadOuterXml();
}
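A self-contained sketch of the MoveToContent approach, reading from a string instead of a file. MoveToContent skips the declaration and any comments that precede the root element, but comments nested inside the element are kept unless IgnoreComments is also set, so the sketch sets it. The names XmlBody and RootElementOnly are illustrative.

```csharp
using System;
using System.IO;
using System.Xml;

public static class XmlBody
{
    // Returns the markup of the root element only: no XML declaration,
    // no leading comments, and (because of IgnoreComments) no inner comments.
    public static string RootElementOnly(string xml)
    {
        var settings = new XmlReaderSettings { IgnoreComments = true };
        using (XmlReader reader = XmlReader.Create(new StringReader(xml), settings))
        {
            reader.MoveToContent();        // positions the reader on the root element
            return reader.ReadOuterXml();  // the element, its attributes and its children
        }
    }

    public static void Main()
    {
        string xml =
            "<?xml version=\"1.0\" encoding=\"iso-8859-1\"?>\n" +
            "<!-- Some comments -->\n" +
            "<Tag Attribute=\"3\"><!-- inner --><Child>text</Child></Tag>";
        Console.WriteLine(RootElementOnly(xml));
    }
}
```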
Well, it seems that there is no setting to ignore the declaration, so I had to ignore it myself.
Here's the code I've written for those who might be interested:
private string _GetXmlWithoutHeadersAndComments(XmlDocument doc)
{
string xml = null;
// Loop through the child nodes and consider all but comments and declaration
if (doc.HasChildNodes)
{
StringBuilder builder = new StringBuilder();
foreach (XmlNode node in doc.ChildNodes)
if (node.NodeType != XmlNodeType.XmlDeclaration && node.NodeType != XmlNodeType.Comment)
builder.Append(node.OuterXml);
xml = builder.ToString();
}
return xml;
}
If you want to only get the Tag elements, you should just read the XML as normal, then find them using the XmlDocument's XPath capabilities.
For your xmlDoc object:
var nodes = xmlDoc.SelectNodes("//Tag");
You can then iterate through these like so:
foreach (XmlNode node in nodes) { }
Or, obviously, you could just put your SelectNodes query into the foreach loop, if you're never going to reuse the nodes object.
This will return all Tag elements within your XML document, and you can do whatever you see fit with them.
There's no need to ever encounter comments while using XmlDocument if you don't want to, and you're not going to end up getting results including either the header or the comments. Is there a particular reason you're trying to remove pieces of the XML before you begin parsing it?
Edit: Based on your edit, it seems like you're having a problem with the header giving an error when you try to pass it. You probably shouldn't straight-up remove the header, so your best option might be to change the header to one that you know works. You can change the header (declaration) like so:
XmlDeclaration xmlDeclaration;
xmlDeclaration = yourDocument.CreateXmlDeclaration(
yourVersion,
yourEncoding,
isStandalone);
yourDocument.ReplaceChild(xmlDeclaration, yourDocument.FirstChild);

How to read in text file and parse it visibly as XML format using C#

I have a data file that comes from a client, and it is not formatted the way you would expect a human-readable file to be. The tags in it are <statement> tags, and there are no line breaks. So it looks like the example below:
<statement><tag1></tag1><tag2></tag2>...and so on until </statement>
<statement><tag1></tag1><tag2></tag2>...and so on until </statement>
<statement><tag1></tag1><tag2></tag2>...and so on until </statement>
Is there a quick way to parse this by defining the root element, and re-save the data file so it is formatted the way you would expect an XML document to look, such as the following:
<statement>
<tag1></tag1>
<tag2>
<tag2A></tag2A>
</tag2>
</statement>
Thanks in advance. I am new to working with XML, and so I am still learning the tools for it. Currently, I am reading the file in line by line with File.ReadLines, and then looping through, calling XElement.Parse() on each line like the following:
foreach (String item in lines)
{
XElement xElement = XElement.Parse(item);
sr.WriteLine(xElement.ToString().Trim());
}
This is taking over half of the processing time! Is there a quicker or better way to handle this?
Instead of reading the lines as strings and then parsing them, it may be a better idea to parse the whole document at once, as an XDocument (assuming it's a valid XML document):
var doc = XDocument.Load(fileName);
foreach (var xElement in doc.Root.Elements())
{
sr.WriteLine(xElement.ToString().Trim());
}
Or, if you want to include the root element's tags:
var doc = XDocument.Load(fileName);
sr.WriteLine(doc.Root);
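If the file really is a sequence of top-level <statement> elements with no enclosing root, as in the question's sample, XDocument.Load will reject it because of the multiple root elements. A hedged sketch of an alternative is to stream it with an XmlReader in fragment mode. The names StatementFormatter and Reformat are made up for illustration.

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.Linq;

public static class StatementFormatter
{
    // Reads every top-level element from a rootless XML stream and writes
    // each one indented (XElement.ToString() indents by default).
    public static void Reformat(TextReader input, TextWriter output)
    {
        var settings = new XmlReaderSettings
        {
            ConformanceLevel = ConformanceLevel.Fragment, // allow several top-level elements
            IgnoreWhitespace = true                       // drop the line breaks between them
        };
        using (XmlReader reader = XmlReader.Create(input, settings))
        {
            reader.MoveToContent();
            while (!reader.EOF)
            {
                if (reader.NodeType == XmlNodeType.Element)
                {
                    // ReadFrom consumes the whole element and leaves the reader
                    // on the node that follows it, so no extra Read() is needed.
                    var element = (XElement)XNode.ReadFrom(reader);
                    output.WriteLine(element.ToString());
                }
                else
                {
                    reader.Read(); // skip anything that is not an element
                }
            }
        }
    }

    public static void Main()
    {
        string sample =
            "<statement><tag1></tag1><tag2><tag2A></tag2A></tag2></statement>\n" +
            "<statement><tag1></tag1></statement>";
        Reformat(new StringReader(sample), Console.Out);
    }
}
```

For a real file, a StreamReader over the input and a StreamWriter over the output can be passed in place of the StringReader and Console.Out, which avoids holding the whole file in memory.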

C# XMLDocument Encoding?

I'm trying to code a function that validates an XML settings file, so if a node does not exist on the file, it should create it.
I have this function
private void addMissingSettings() {
XmlDocument xmldocSettings = new XmlDocument();
xmldocSettings.Load("settings.xml");
XmlNode xmlMainNode = xmldocSettings.SelectSingleNode("settings");
XmlNode xmlChildNode = xmldocSettings.CreateElement("ExampleNode");
xmlChildNode.InnerText = "Hello World!";
//add to parent node
xmlMainNode.AppendChild(xmlChildNode);
xmldocSettings.Save("settings.xml");
}
But on my XML file, if I have
<rPortSuffix desc="Read Suffix">
</rPortSuffix>
<wPortSuffix desc="Write Suffix"></wPortSuffix>
When I save the document, it saves those lines as
<rPortSuffix desc="Read Suffix">
</rPortSuffix>
<wPortSuffix desc="Write Suffix"></wPortSuffix>
<ExampleNode>Hello World!</ExampleNode>
Is there a way to prevent this behaviour? Like setting a working charset or something like that?
The two files are equivalent, and should be treated as being equivalent by all XML parsers, I believe.
Additionally, Unicode character U+0003 isn't a valid XML character, so you've fundamentally got other problems if you're trying to represent it in your file. Even though that particular .NET XML parser doesn't seem to object, other parsers may well do so.
If you need to represent absolutely arbitrary characters in your XML, I suggest you do so in some other form - e.g.
<rPortSuffix desc="Read Suffix">\u000c\u000a</rPortSuffix>
<wPortSuffix desc="Write Suffix">\u0003</wPortSuffix>
Obviously you'll then need to parse that text appropriately, but at least the XML parser won't get in the way, and you'll be able to represent any UTF-16 code unit.
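A rough sketch of what that escaping and parsing could look like, assuming the \uXXXX convention shown above. The names ControlCharEscaper, Escape and Unescape are hypothetical. Backslashes are escaped too, so that text which already contains a literal \u sequence still round-trips.

```csharp
using System;
using System.Globalization;
using System.Text;
using System.Text.RegularExpressions;

public static class ControlCharEscaper
{
    // Replaces control characters (below U+0020) and backslashes with \uXXXX,
    // so the result contains only characters that are legal in XML text.
    public static string Escape(string raw)
    {
        var sb = new StringBuilder();
        foreach (char c in raw)
        {
            if (c < 0x20 || c == '\\')
                sb.Append("\\u").Append(((int)c).ToString("x4"));
            else
                sb.Append(c);
        }
        return sb.ToString();
    }

    // Turns every \uXXXX sequence back into the UTF-16 code unit it names.
    public static string Unescape(string escaped)
    {
        return Regex.Replace(escaped, @"\\u([0-9a-fA-F]{4})",
            m => ((char)int.Parse(m.Groups[1].Value, NumberStyles.HexNumber)).ToString());
    }

    public static void Main()
    {
        string escaped = Escape("\f\n");
        Console.WriteLine(escaped);
        Console.WriteLine(Unescape(escaped) == "\f\n");
    }
}
```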

How to iterate over xml using linq2xml or Xquery

I have an incoming file with data as
<root><![CDATA[<defs><elements>
<element><item>aa</item><int>1</int></element>
<element><item>bb</item><int>2</int></element>
<element><item>cc</item><int>3</int></element>
</elements></defs>]]></root>
Writing multiple foreach (XElement x in root.Elements()) loops seems superfluous!
I'm looking for a less verbose method, preferably using C#.
UPDATE: Yes, the input is in a CDATA section. Rest assured it's not my design, and I have zero control over it!
Assuming that nasty CDATA section is intentional, and you're only interested in the text content of your leaf elements, you can do something like:
XElement root = XElement.Load(yourFile);
var data = from element in XElement.Parse(root.Value).Descendants("element")
select new {
Item = element.Elements("item").First().Value,
Value = element.Elements("int").First().Value
};
That said, if the code that generates your input file is under your control, consider getting rid of the CDATA section. Storing XML within XML that way is not the way to go most of the time, as it defeats the purpose of the markup language (and requires multiple parser passes, as shown above).

How do I work with an XML tag within a string?

I'm working in Microsoft Visual C# 2008 Express.
Let's say I have a string and the contents of the string is: "This is my <myTag myTagAttrib="colorize">awesome</myTag> string."
I'm telling myself that I want to do something to the word "awesome" - possibly call a function that does something called "colorize".
What is the best way in C# to go about detecting that this tag exists and getting that attribute? I've worked a little with XElements and such in C#, but mostly to do with reading in and out XML files.
Thanks!
-Adeena
Another solution:
var myString = "This is my <myTag myTagAttrib='colorize'>awesome</myTag> string.";
try
{
var document = XDocument.Parse("<root>" + myString + "</root>");
// Evaluate relative to the wrapper element, so that "myTag" matches its direct children
var matches = ((System.Collections.IEnumerable)document.Root.XPathEvaluate("myTag|myTag2")).Cast<XElement>();
foreach (var element in matches)
{
switch (element.Name.ToString())
{
case "myTag":
//do something with myTag like lookup attribute values and call other methods
break;
case "myTag2":
//do something else with myTag2
break;
}
}
}
catch (Exception e)
{
//string was not well-formed XML
}
I also took into account your comment to Dabblernl, where you say you want to parse multiple attributes on multiple elements.
You can extract the XML with a regular expression, load the extracted xml string in a XElement and go from there:
string text=@"This is my<myTag myTagAttrib='colorize'>awesome</myTag> text.";
Match match=Regex.Match(text,@"(<myTag.*</myTag>)");
string xml=match.Captures[0].Value;
XElement element=XElement.Parse(xml);
XAttribute attribute=element.Attribute("myTagAttrib");
if(attribute.Value=="colorize") DoSomethingWith(element.Value);// Value=awesome
This code will throw an exception if no MyTag element was found, but that can be remedied by inserting a line of:
if(match.Captures.Count!=0)
{...}
It gets even more interesting if the string could hold more than just the MyTag Tag...
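One hedged way to handle a string holding several different tags is to let the regular expression capture the tag name and require the same name in the closing tag. This is only a sketch: the names InlineTags and Extract are made up, the pattern does not cope with self-closing tags, with a tag nested inside another tag of the same name, or with attribute values containing '>', and XElement.Parse will still throw an XmlException if the matched piece is not well-formed.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;
using System.Xml.Linq;

public static class InlineTags
{
    // <name ...> ... </name>, where the closing name must equal the opening name.
    private static readonly Regex TagPattern =
        new Regex(@"<(?<name>\w+)\b[^>]*>.*?</\k<name>>", RegexOptions.Singleline);

    // Returns one XElement per tagged span found in the text.
    public static List<XElement> Extract(string text)
    {
        var elements = new List<XElement>();
        foreach (Match match in TagPattern.Matches(text))
        {
            elements.Add(XElement.Parse(match.Value));
        }
        return elements;
    }

    public static void Main()
    {
        string text = "This is my <myTag myTagAttrib='colorize'>awesome</myTag> string " +
                      "with <other kind='bold'>two</other> tags.";
        foreach (XElement element in Extract(text))
        {
            Console.WriteLine($"{element.Name}: {element.Value}");
        }
    }
}
```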
I'm a little confused about your example, because you switch between the string (text content), tags, and attributes. But I think what you want is XPath.
So if your XML stream looks like this:
<adeena><parent><child x="this is my awesome string">This is another awesome string</child></parent></adeena>
You'd use an XPath expression that looks like this to find the attribute:
//child/@x
and one like this to find the text value under the child tag:
//child
I'm a Java developer, so I don't know what XML libraries you'd use to do this. But you'll need a DOM parser to create a W3C Document class instance for you by reading in the XML file and then using XPath to pluck out the values.
There's a good XPath tutorial at W3Schools if you need it.
UPDATE:
If you're saying that you already have an XML stream as String, then the answer is to not read it from a file but from the String itself. Java has abstractions called InputStream and Reader that handle streams of bytes and chars, respectively. The source can be a file, a string, etc. Check your C# DOM API to see if it has something similar. You'll pass the string to a parser that will give back a DOM object that you can manipulate.
Since the input is not well-formed XML you won't be able to parse it with any of the built in XML libraries. You'd need a regular expression to extract the well-formed piece. You could probably use one of the more forgiving HTML parsers like HtmlAgilityPack on CodePlex.
This is my solution to match any type of xml using Regex:
C# Better way to detect XML?
The XmlTextReader can parse XML fragments with a special constructor which may help in this situation, but I'm not positive about that.
There's an in-depth article here:
http://geekswithblogs.net/kobush/archive/2006/04/20/75717.aspx
