Validate or parse a string for XML?

Validate or parse a string for XML? - c#

I have a Description textbox on the page. When enter the data in that and submit the page. I will pass that string to XML tag in the XML file.
If user enter any invalid characters in textbox which are not allowed for xml. How to remove or parse them from string? I need to validate string for XML data.

If you're using the XmlDocument or XDocument classes to build the XML then you don't need to worry as they'll do the encoding for you.
Otherwise, if you generating the XML by hand you can use the SecurityElement.Escape method to encode invalid XML characters

That depends on how you are creating the XML. If you are assembling the XML string yourself, there are A LOT of things you should do and take into consideration.
Thus, you should not be doing that (assembling the string yourself).
.NET provides you with abstraction layers so you don't have to deal with that. Example: XDocument

Related

How to check if string is valid xml name

I want quick function which may be part of my xml parser, I do not want to parse whole string and check if it correct xml.

This is not really doable without parsing, or at least—in a limited form—without using a regular expression. Names in XML permit different characters as the first character and as second and further characters — see the Name production.
Should you implement IsValidXmlChar without a context, i.e. just checking if the given character is a NameChar, as per the XML specification, the output of your example would be GridAttributeStuff.
So you should at least tokenize the input text to retrieve valid names, and parse the input to retrieve element names, i.e. output Grid in your example.
To check if a string is a XML name, the XmlReader class offers the IsName static method. To categorize characters in an XML text, there is the XmlCharType struct in .NET Framework as well as in .NET Core, but it's internal.

how to read xml fragment and validate schema

how to parse an xml document and validate the fragment that is not valid using c# ignoring the binary data at the end. is it possible to only parse the xml elements enclosed between the root elements and ignore the binary data.

You can use the XDocument validation methods to validate the document as a whole, then as long as you use the override that embeds the validation information in the XDcoument, you can go back over specific elements and get their validity.
Sorry I don't have any code to hand for this at the moment...

Reading XML file with Invalid character

I am using Dataset.ReadXML() to read an XML string. I get an error as the XML string contains the Invalid Character 0x1F which is 'US' - Unit seperator. This is contained within fully formed tags.
The data is extracted from an Oracle DB, using a Perl script. How would be the best way to escape this character so that the XML is read correctly.
EDIT: XML String:
<RESULT>
<DEPARTMENT>Oncology</DEPARTMENT>
<DESCRIPTION>Oncology</DESCRIPTION>
<STUDY_NAME>**7360C hsd**</STUDY_NAME>
<STUDY_ID>27</STUDY_ID>
</RESULT>
Is between the C and h in the bold part, is where there is a US seperator, which when pasted into this actually shows a space. So I want to know how can I ignore that in an XML string?

If you look at section 2.2 of the XML recommendation, you'll see that x01F is not in the range of characters allowed in XML documents. So while the string you're looking at may look like an XML document to you, it isn't one.
You have two problems. The relatively small one is what to do about this document. I'd probably preprocess the string and discard any character that's not legal in well-formed XML, but then I don't know anything about the relatively large problem.
And the relatively large problem is: what's this data doing in there in the first place? What purpose (if any) do non-visible ASCII characters in the middle of a (presumably) human-readable data field serve? Why is it doesn't the Perl script that produces this string failing when it encounters an illegal character?
I'll bet you one American dollar that it's because the person who wrote that script is using string manipulation and not an XML library to emit the XML document. Which is why, as I've said time and again, you should never use string manipulation to produce XML. (There are certainly exceptions. If you're writing a throwaway application, for instance, or an XML parser. Or if your name's Tim Bray.)

Your XmlReader/TextReader must be created with correct encoding. You can create it as below and pass to your Dataaset:
StreamReader reader = new StreamReader("myfile.xml",Encoding.ASCII); // or correct encoding
myDataset.ReadXml(reader);

How do I work with an XML tag within a string?

I'm working in Microsoft Visual C# 2008 Express.
Let's say I have a string and the contents of the string is: "This is my <myTag myTagAttrib="colorize">awesome</myTag> string."
I'm telling myself that I want to do something to the word "awesome" - possibly call a function that does something called "colorize".
What is the best way in C# to go about detecting that this tag exists and getting that attribute? I've worked a little with XElements and such in C#, but mostly to do with reading in and out XML files.
Thanks!
-Adeena

Another solution:
var myString = "This is my <myTag myTagAttrib='colorize'>awesome</myTag> string.";
try
{
var document = XDocument.Parse("<root>" + myString + "</root>");
var matches = ((System.Collections.IEnumerable)document.XPathEvaluate("myTag|myTag2")).Cast<XElement>();
foreach (var element in matches)
{
switch (element.Name.ToString())
{
case "myTag":
//do something with myTag like lookup attribute values and call other methods
break;
case "myTag2":
//do something else with myTag2
break;
}
}
}
catch (Exception e)
{
//string was not not well formed xml
}
I also took into account your comment to Dabblernl where you want parse multiple attributes on multiple elements.

You can extract the XML with a regular expression, load the extracted xml string in a XElement and go from there:
string text=#"This is my<myTag myTagAttrib='colorize'>awesome</myTag> text.";
Match match=Regex.Match(text,#"(<MyTag.*</MyTag>)");
string xml=match.Captures[0].Value;
XElement element=XElement.Parse(xml);
XAttribute attribute=element.Attribute("myTagAttrib");
if(attribute.Value=="colorize") DoSomethingWith(element.Value);// Value=awesome
This code will throw an exception if no MyTag element was found, but that can be remedied by inserting a line of:
if(match.Captures.Count!=0)
{...}
It gets even more interesting if the string could hold more than just the MyTag Tag...

I'm a little confused about your example, because you switch between the string (text content), tags, and attributes. But I think what you want is XPath.
So if your XML stream looks like this:
<adeena/><parent><child x="this is my awesome string">This is another awesome string<child/><adeena/>
You'd use an XPath expression that looks like this to find the attribute:
//child/#x
and one like this to find the text value under the child tag:
//child
I'm a Java developer, so I don't know what XML libraries you'd use to do this. But you'll need a DOM parser to create a W3C Document class instance for you by reading in the XML file and then using XPath to pluck out the values.
There's a good XPath tutorial from the W3C schools if you need it.
UPDATE:
If you're saying that you already have an XML stream as String, then the answer is to not read it from a file but from the String itself. Java has abstractions called InputStream and Reader that handle streams of bytes and chars, respectively. The source can be a file, a string, etc. Check your C# DOM API to see if it has something similar. You'll pass the string to a parser that will give back a DOM object that you can manipulate.

Since the input is not well-formed XML you won't be able to parse it with any of the built in XML libraries. You'd need a regular expression to extract the well-formed piece. You could probably use one of the more forgiving HTML parsers like HtmlAgilityPack on CodePlex.

This is my solution to match any type of xml using Regex:
C# Better way to detect XML?

The XmlTextReader can parse XML fragments with a special constructor which may help in this situation, but I'm not positive about that.
There's an in-depth article here:
http://geekswithblogs.net/kobush/archive/2006/04/20/75717.aspx

Protecting from XSLT injection

I use a xsl tranform to convert a xml file to html in dotNet. I transform the node values in the xml to html tag contents and attributes.
I compose the xml by using .Net DOM manipulation, setting the InnerText property of the nodes with the arbitrary and possibly malicious text.
Right now, maliciously crafted input strings will make my html unsafe. Unsafe in the sense that some javascript might come from the the user and find its way to a link href attribute in the output html, for example.
The question is simple, what is the sanitizing, if any, that I have to do with my text before assigning it to the InnerText property? I thought that assigning to InnerText instead of InnerXml would do all the needed sanitization of the text, but that seems to not be the case.
Does my transform have to have any special characteristics to make this work safely? Any .net specific caveats that I should be aware?
Thanks!

You should sanitize your XML before transforming it with XSLT. You probably will need something like:
string encoded = HttpUtility.HtmlEncode("<script>alert('hi')</script>");
XmlElement node = xml.CreateElement("code");
node.InnerText = encoded;
Console.WriteLine(encoded);
Console.WriteLine(node.OuterXml);
With this, you'll get
<script>alert('hi')</script>
When you add this text into your node, you'll get
<code>&lt;script&gt;alert('hi')&lt;/script&gt;</code>
Now, if you run your XSLT, this encoded HTML will not cause any problems in your output.

It turns out that the problem came from the xsl itself, wich used disable-output-escaping. Without that the Tranform itself will do all the encoding necessary.
If you must use disable-output-escaping, you have to use the appriate encodeinf function for each element. HtmlEncode for tag contents, HtmlAttributeEncode for attribute values and UrlEncode for html attribute values (e.g href)

Develop Reference

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.

Validate or parse a string for XML? - c#

If you're using the XmlDocument or XDocument classes to build the XML then you don't need to worry as they'll do the encoding for you. Otherwise, if you generating the XML by hand you can use the SecurityElement.Escape method to encode invalid XML characters

Related

How to check if string is valid xml name

how to read xml fragment and validate schema

Reading XML file with Invalid character

How do I work with an XML tag within a string?

Protecting from XSLT injection

Categories

Resources