How to replace xml special chars manually?

My application produces an XML file that is then transformed with XSLT into a nice HTML report. I have a problem with \n, however. There are some XSLT techniques to handle it, but they are pretty awkward and time-consuming.
So my solution was to string.Replace \n with <br /> and then force the XmlWriter to write the result with WriteRaw(text). The problem is that the text sometimes contains illegal characters like >.
I am unable to find any utility method in .NET that just takes a string and transforms it into an XML-friendly string. I looked with Reflector, and the class that handles this logic is not public.
Any ideas (besides writing my own code to do this)?

Never, ever use string manipulation to produce XML. It's not just that it makes poorly-socialized people laugh at you: it leads to code that has bugs in it that you don't know exist.
Think about it from a test-driven perspective. You've written a method that uses string manipulation to generate XML. Okay, now write a test case for it that demonstrates that it will never emit poorly-formed XML. In order for the test to prove this, you have to test every possible scenario outlined in the XML recommendation. Do you know the XML recommendation well enough to be able to assert that your test does this?
No, you don't. And you don't really want to, not unless you're writing a framework for XML generation. That's why you use the classes in System.Xml to generate XML. The people who wrote them did that work so that you don't have to.
Tomalak showed how to do what you're trying to do with XSLT. If you're using an XmlWriter to generate the XML, use this pattern:
string s = "replace\nnewlines\nwith\nbreaks";
string[] lines = s.Split('\n');
for (int i = 0; i < lines.Length; i++)
{
    xw.WriteString(lines[i]);
    if (i < lines.Length - 1)
    {
        xw.WriteElementString("br", "", "");
    }
}
This uses string manipulation where it's appropriate - when manipulating string data outside of XML - and doesn't where it's not - when producing XML text.
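To see the pattern end to end, here is a minimal self-contained sketch. The names BreakWriter, WriteWithBreaks and Render, and the wrapping p element, are made up for this illustration and are not part of the answer above; the point is only that WriteString does the escaping, so WriteRaw is never needed.

```csharp
using System;
using System.IO;
using System.Xml;

public static class BreakWriter
{
    // Writes text as XML character data, with an empty <br/> element between lines.
    public static void WriteWithBreaks(XmlWriter xw, string text)
    {
        string[] lines = text.Split('\n');
        for (int i = 0; i < lines.Length; i++)
        {
            xw.WriteString(lines[i]);           // the writer escapes markup characters
            if (i < lines.Length - 1)
            {
                xw.WriteElementString("br", "");
            }
        }
    }

    // Hypothetical helper: wraps the text in a <p> element and returns the markup.
    public static string Render(string text)
    {
        XmlWriterSettings settings = new XmlWriterSettings();
        settings.OmitXmlDeclaration = true;
        StringWriter sw = new StringWriter();
        using (XmlWriter xw = XmlWriter.Create(sw, settings))
        {
            xw.WriteStartElement("p");
            WriteWithBreaks(xw, text);
            xw.WriteEndElement();
        }
        return sw.ToString();
    }

    public static void Main()
    {
        // The < and & in the input come out as entity references.
        Console.WriteLine(Render("1 < 2 & 3\nsecond line"));
    }
}
```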

I think the solution to this question will help you:
xslt replace \n with <br/> only in one node?
You can incorporate the provided template into your transformation process, and you're done without getting your hands dirty.

Related

Parse XML with regards to one namespace only

I need to parse XML files with regards to only one namespace.
By "with regards to only one namespace" I mean that if I have document like this:
<xc:document xmlns:xc="asdasd">
<asdf>
<xc:abcd />
</asdf>
</xc:document>
I would like <asdf>, </asdf> to be treated as text.
The structure of this document should look like this:
document
|
|- text (<asdf>)
|- abcd
|- text (</asdf>)
What is the simplest method to achieve this?

Transform the document with xslt first so that the nodes you want treated as text actually are text.

Pretty much any XML parser is going to lose distinctions like whether single or double quotes were used, or CDATA sections were used, or whitespace inside tags (not between tags).
So:
<boy socks="black"
></boy>
might come back as <boy socks='black'/>
If you want to treat the input as not XML, you'll have to fall back on non-XML tools, or rethink your situation entirely, as this is a very unusual thing to want to do.
It's fairly easy in a text-processing language such as Perl, if you are careful. For example,
perl -p -e 's#<(/?[^:]+[\s>])#\&lt;$1#g'
will go a long way, by changing the < signs you want to treat as text into &lt; instead. This approach actually works best if you read the whole file in Perl rather than (as in this example) a line at a time, so that you can match close tags spread over multiple lines,
</boy
> like this.
But, best to parse XML with an XML parser, not regular expressions, so if the sort of changes I mentioned above are OK, this is really easy to do in XSLT.

What are the possible special characters which need to be handled in creating XML?

I am writing an XML parser; my application creates XML files. For this I have to handle special characters -- for example I know that < should be replaced with &lt;, similarly > should be replaced with &gt;, and so on. What are all the different characters which need to be handled in this way?

See this wikipedia article:
http://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references
(unless you're doing it for academic purposes, I recommend you use the existing .Net Xml parsing libraries, such as those in the System.Xml namespace, or System.Xml.Linq. If you are trying to serialize/deserialize objects, use the built in Xml serialization)

For XML parsing you don't need to perform those replacements - you'd need to perform them when creating XML. You'd also want to consider replacing & with &amp; where required - see the XML specification for details.
However, I would strongly advise you not to write your own XML API. .NET already contains several of them, including the excellent LINQ to XML. Use that instead of building your own. The chances of you independently creating your own XML API which is of a similar quality are very low, and you'll spend an awful lot of time getting there to start with.
Using a decent XML API, you don't need to worry about character conversions etc - the API will handle them for you.
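As a rough illustration of letting the API do the conversions, here is a minimal LINQ to XML sketch. The class name LinqEscapeDemo and the element name note are assumptions made up for the example.

```csharp
using System;
using System.Xml.Linq;

public static class LinqEscapeDemo
{
    // Builds an element whose text content is the raw string;
    // XElement escapes it when the element is serialized.
    public static string Build(string text)
    {
        XElement element = new XElement("note", text);
        return element.ToString();
    }

    public static void Main()
    {
        string xml = Build("a < b & c");
        Console.WriteLine(xml);

        // Parsing the markup again gives back the original text.
        Console.WriteLine(XElement.Parse(xml).Value);
    }
}
```

The raw string goes in and comes back out unchanged; the entity references exist only in the serialized form.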

There is a list of XML escape codes listed here.
Use the System.Xml.XmlConvert class to handle special characters for you:
using System;

class Program
{
    static void Main(string[] args)
    {
        string s;
        s = System.Xml.XmlConvert.EncodeName("valid XML --> !@#$%^&*()");
        Console.WriteLine("Encoded: {0}", s);
        Console.WriteLine("Decoded: {0}", System.Xml.XmlConvert.DecodeName(s));
        Console.ReadLine();
    }
}
Will yield this result:
Encoded: valid_x0020_XML_x0020_--_x003E__x0020__x0021__x0040__x0023__x0024__x0025__x005E__x0026__x002A__x0028__x0029_
Decoded: valid XML --> !@#$%^&*()

There is a built in .NET method SecurityElement.Escape for escaping certain (not all) invalid XML characters. Check out this link:
http://msdn.microsoft.com/en-us/library/system.security.securityelement.escape%28v=VS.80%29.aspx
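A minimal sketch of calling it follows. The wrapper class EscapeDemo and the method EscapeText are made-up names for illustration. As far as I know, the method only replaces the five markup characters (<, >, &, and the two quote characters), which is the "not all" caveat above: control characters that are illegal in XML pass through untouched.

```csharp
using System;
using System.Security;

public static class EscapeDemo
{
    // Thin wrapper around the built-in method, so the call site is easy to see.
    public static string EscapeText(string raw)
    {
        return SecurityElement.Escape(raw);
    }

    public static void Main()
    {
        Console.WriteLine(EscapeText("a < b & c"));   // prints "a &lt; b &amp; c"
    }
}
```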

Reading XML file with Invalid character

I am using DataSet.ReadXml() to read an XML string. I get an error because the XML string contains the invalid character 0x1F, which is 'US' (unit separator). It is contained within fully formed tags.
The data is extracted from an Oracle DB using a Perl script. What would be the best way to escape this character so that the XML is read correctly?
EDIT: XML String:
<RESULT>
<DEPARTMENT>Oncology</DEPARTMENT>
<DESCRIPTION>Oncology</DESCRIPTION>
<STUDY_NAME>**7360C hsd**</STUDY_NAME>
<STUDY_ID>27</STUDY_ID>
</RESULT>
Between the C and the h in the bold part there is a US separator, which shows up as a space when pasted here. So I want to know: how can I ignore it in an XML string?

If you look at section 2.2 of the XML recommendation, you'll see that 0x1F is not in the range of characters allowed in XML documents. So while the string you're looking at may look like an XML document to you, it isn't one.
You have two problems. The relatively small one is what to do about this document. I'd probably preprocess the string and discard any character that's not legal in well-formed XML, but then I don't know anything about the relatively large problem.
And the relatively large problem is: what's this data doing in there in the first place? What purpose (if any) do non-visible ASCII characters in the middle of a (presumably) human-readable data field serve? Why doesn't the Perl script that produces this string fail when it encounters an illegal character?
I'll bet you one American dollar that it's because the person who wrote that script is using string manipulation and not an XML library to emit the XML document. Which is why, as I've said time and again, you should never use string manipulation to produce XML. (There are certainly exceptions. If you're writing a throwaway application, for instance, or an XML parser. Or if your name's Tim Bray.)
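The preprocessing step mentioned above (discarding any character that's not legal in XML) could be sketched like this. It is only an illustration: the class and method names are invented, and the ranges are the XML 1.0 Char production from section 2.2 as I read it. It treats the symptom, not the script that emits the bad data.

```csharp
using System;
using System.Text;

public static class XmlTextCleaner
{
    // True if a single UTF-16 code unit is a legal XML 1.0 character by itself.
    // Surrogates are excluded here and handled as pairs below.
    private static bool IsLegalXmlChar(char c)
    {
        return c == '\t' || c == '\n' || c == '\r'
            || (c >= '\u0020' && c <= '\uD7FF')
            || (c >= '\uE000' && c <= '\uFFFD');
    }

    // Returns the input with every character outside the XML 1.0 range dropped.
    public static string StripInvalidXmlChars(string input)
    {
        StringBuilder sb = new StringBuilder(input.Length);
        for (int i = 0; i < input.Length; i++)
        {
            char c = input[i];
            if (i + 1 < input.Length && char.IsSurrogatePair(c, input[i + 1]))
            {
                // A valid pair encodes a code point above U+FFFF, which is legal.
                sb.Append(c).Append(input[i + 1]);
                i++;
            }
            else if (IsLegalXmlChar(c))
            {
                sb.Append(c);
            }
        }
        return sb.ToString();
    }

    public static void Main()
    {
        string dirty = "<STUDY_NAME>7360C\u001Fhsd</STUDY_NAME>";
        Console.WriteLine(StripInvalidXmlChars(dirty));
    }
}
```

The cleaned string can then be wrapped in a StringReader and handed to DataSet.ReadXml.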

Your XmlReader/TextReader must be created with the correct encoding. You can create it as below and pass it to your DataSet:
StreamReader reader = new StreamReader("myfile.xml",Encoding.ASCII); // or correct encoding
myDataset.ReadXml(reader);

Regex or XML Parser C#

I have some Word template (dot/dotx) files that contain XML tags along with plain text.
At run time, I need to replace the XML tags with their respective mail merge fields.
So I need to parse the document for these XML tags and replace them with merge fields.
I was using Regex to find and replace these XML tags, but it was suggested that I use an XML parser to find them instead (Regex for string enclosed in <*>, C#).
The sample document looks like:
Solicitor Letter
<Tfirm/>
<Tbuilding/>
<TstreetNumber/> <TstreetName/>
For the attention of: <TContact1/> <TEmail/>
Dear <TContact1/>
RE: <Pbuilding/> <PstreetNumber/> <PstreetName/> <Pvillage/> <PTown/>
We were pleased to hear that contracts have now been exchanged in the sale of the
above property on behalf of our mutual client/s. We now have pleasure in enclosing a
copy of our invoice for your kind attention upon completion.
....
One more note: the angle brackets are typed manually by the end user in the template.
I tried using XmlReader, but got an error because my documents have no root tags of their own.
Please advise whether I should stick to Regex or whether there is any way to use an XML parser.
Thank you!

Unless you can get it structured as an XML document, the tools in the .NET Libraries to read XML are going to be entirely useless.
What you have is not XML. Having a tag or two that would qualify as XML does not an XML document make. The problem is that it simply does not follow any of the rules of XML.
Moral of the story is that you will have to come up with your own method to parse this. If you like to drink the RegEx kool-aid, that'll be the best solution for ya. Of course, there are plenty of ways to skin this cat.

It looks like you aren't actually using XML, just using a token that looks similar to XML as a placeholder for replacement.
If that's the case, you should be using Regex.
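A minimal sketch of that kind of token replacement is below. It assumes the template text has already been extracted into a plain string (getting text in and out of a Word file is a separate problem), and the names PlaceholderMerger and Merge are made up for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class PlaceholderMerger
{
    // Matches tokens such as <TContact1/> and captures the name between < and />.
    private static readonly Regex Placeholder = new Regex(@"<(\w+)\s*/>");

    // Replaces each known token with its value; unknown tokens are left untouched.
    public static string Merge(string template, IDictionary<string, string> values)
    {
        return Placeholder.Replace(template, delegate(Match m)
        {
            string value;
            return values.TryGetValue(m.Groups[1].Value, out value) ? value : m.Value;
        });
    }

    public static void Main()
    {
        Dictionary<string, string> values = new Dictionary<string, string>();
        values["TContact1"] = "Ms Smith";
        values["TEmail"] = "smith@example.com";

        Console.WriteLine(Merge("For the attention of: <TContact1/> <TEmail/>", values));
    }
}
```

Leaving unknown tokens in place makes typos in a hand-typed template easy to spot in the output.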

I would suggest neither. Microsoft has a free library in C# specifically for modifying open xml format documents without an installation of Microsoft Office.
OpenXML SDK

Doesn't seem like XML processing to me. It's not an XML doc. It looks like straight string replacement, and for that, you're better off with a regular expression.

An XML parser doesn't help you locate XML; it only helps you understand a given piece of XML. You will need some other mechanism, perhaps a Regex, to find the XML.

It seems that the authors of most replies didn't read the question carefully.
inutan is asking for something that will parse Word documents. If a Word document is saved in docx format, its content is actually XML that can be read with an XmlReader or XPathReader; however, I would not recommend doing that.
Normally, mail merge with Word doesn't require any programming or XML parsing; see http://helpdesk.ua.edu/training/word/merg07.html
However, if you still want to have XML-like fields in your Word templates and replace them with values, I would suggest using Word automation objects.
Below is an example in VBA; for similar code in other languages, please refer to the MS Office development site http://msdn.microsoft.com/en-us/library/bb726434.aspx . For example, if you use .NET, you should use the Office interop assemblies, and best of all is to install MS Visual Studio Tools for Office http://msdn.microsoft.com/en-us/library/5s12ew2x.aspx
With Selection.Find
    .Text = "<TContact1/>"
    .Replacement.Text = "TContact1"
    .Forward = True
    .Wrap = wdFindContinue
    .Format = False
    .MatchCase = False
    .MatchWholeWord = False
    .MatchWildcards = False
    .MatchSoundsLike = False
    .MatchAllWordForms = False
End With
Selection.Find.Execute Replace:=wdReplaceAll

Best method of Textfile Parsing in C#?

I want to parse a config file sorta thing, like so:
[KEY:Value]
[SUBKEY:SubValue]
Now I started with a StreamReader, converting lines into character arrays, when I figured there's gotta be a better way. So I ask you, humble reader, to help me.
One restriction is that it has to work in a Linux/Mono environment (1.2.6 to be exact). I don't have the latest 2.0 release (of Mono), so try to restrict language features to C# 2.0 or C# 1.0.

I considered it, but I'm not going to use XML. I am going to be writing this stuff by hand, and hand editing XML makes my brain hurt. :')
Have you looked at YAML?
You get the benefits of XML without all the pain and suffering. It's used extensively in the ruby community for things like config files, pre-prepared database data, etc
here's an example
customer:
  name: Orion
  age: 26
  addresses:
    - type: Work
      number: 12
      street: Bob Street
    - type: Home
      number: 15
      street: Secret Road
There appears to be a C# library here, which I haven't used personally, but yaml is pretty simple, so "how hard can it be?" :-)
I'd say it's preferable to inventing your own ad-hoc format (and dealing with parser bugs)

I was looking at almost this exact problem the other day: this article on string tokenizing is exactly what you need. You'll want to define your tokens as something like:
@"(?<level>\s) | " +
@"(?<term>[^:\s]) | " +
@"(?<separator>:)"
The article does a pretty good job of explaining it. From there you just start eating up tokens as you see fit.
Protip: For an LL(1) parser (read: easy), tokens cannot share a prefix. If you have abc as a token, you cannot have ace as a token
Note: The article's missing the | characters in its examples, just throw them in.

There is another YAML library for .NET which is under development. Right now it supports reading YAML streams and has been tested on Windows and Mono. Write support is currently being implemented.

Using a library is almost always preferable to rolling your own. Here's a quick list of "Oh I'll never need that/I didn't think about that" points which will end up coming back to bite you later down the line:
Escaping characters. What if you want a : in the key or ] in the value?
Escaping the escape character.
Unicode
Mix of tabs and spaces (see the problems with Python's white space sensitive syntax)
Handling different return character formats
Handling syntax error reporting
Like others have suggested, YAML looks like your best bet.

You can also use a stack, and use a push/pop algorithm. This one matches open/closing tags.
// Requires: using System.Collections;
public string check()
{
    // getTags() is assumed to return the tags in document order, e.g. "<a>", "</a>".
    ArrayList tags = getTags();
    Stack stack = new Stack(tags.Count);
    foreach (string tag in tags)
    {
        if (tag.IndexOf('/') < 0)
        {
            // Opening tag: remember it until its closing tag shows up.
            stack.Push(tag);
        }
        else
        {
            if (stack.Count > 0)
            {
                string startTag = (string)stack.Pop();
                // Strip the leading "<" and "</" so both strings read "name>".
                startTag = startTag.Substring(1, startTag.Length - 1);
                string endTag = tag.Substring(2, tag.Length - 2);
                if (!startTag.Equals(endTag))
                {
                    return "Error: no matching end tag";
                }
            }
            else
            {
                return "Error: no matching opening tag";
            }
        }
    }
    if (stack.Count > 0)
    {
        return "Error: no matching end tag";
    }
    return "Xml is valid";
}
You can probably adapt it so that it reads the contents of your file. Regular expressions are also a good idea.

It looks to me that you would be better off using an XML based config file as there are already .NET classes which can read and store the information for you relatively easily. Is there a reason that this is not possible?
@Bernard: It is true that hand editing XML is tedious, but the structure that you are presenting already looks very similar to XML.
Then yes, has a good method there.

@Gishu
Actually, once I'd accommodated escaped characters, my regex ran slightly slower than my hand-written top-down recursive parser, and that's without the nesting (linking sub-items to their parents) and error reporting the hand-written parser had.
The regex was slightly faster to write (though I do have a bit of experience with hand parsers), but that's without good error reporting. Once you add that, it becomes slightly harder and longer to do.
I also find the intention of the hand-written parser easier to understand. For instance, here is a snippet of the code:
private static Node ParseNode(TextReader reader)
{
    Node node = new Node();
    int indentation = ParseWhitespace(reader);
    Expect(reader, '[');
    node.Key = ParseTerminatedString(reader, ':');
    node.Value = ParseTerminatedString(reader, ']');
    return node;
}

Regardless of the persisted format, using a Regex would be the fastest way of parsing.
In ruby it'd probably be a few lines of code.
\[KEY:(.*)\]
\[SUBKEY:(.*)\]
These two would get you the Value and SubValue in the first group. Check out MSDN on how to match a regex against a string.
This is something everyone should have in their kitty. Pre-Regex days would seem like the Ice Age.
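For the C# side, a minimal sketch of the regex approach might look like the following. The pattern is generalized from the two above so that any key is accepted, and the names BracketConfig and TryParseLine are invented for the example. It sticks to C# 2.0 features, but it has not been tried on Mono 1.2.6, and it does not handle escaped : or ] characters.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BracketConfig
{
    // Optional indentation, then [KEY:Value]. The key ends at the first colon
    // and the value ends at the closing bracket.
    private static readonly Regex Line =
        new Regex(@"^\s*\[([^:\]]+):([^\]]*)\]\s*$");

    // Returns false (and null outputs) when the line is not in [KEY:Value] form.
    public static bool TryParseLine(string line, out string key, out string value)
    {
        Match m = Line.Match(line);
        key = m.Success ? m.Groups[1].Value : null;
        value = m.Success ? m.Groups[2].Value : null;
        return m.Success;
    }

    public static void Main()
    {
        string[] lines = { "[KEY:Value]", "    [SUBKEY:SubValue]", "not a config line" };
        foreach (string line in lines)
        {
            string key, value;
            if (TryParseLine(line, out key, out value))
            {
                Console.WriteLine("{0} = {1}", key, value);
            }
            else
            {
                Console.WriteLine("Syntax error: {0}", line);
            }
        }
    }
}
```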
