Using C# Regular expression to replace XML element content

Using C# Regular expression to replace XML element content - c#

I'm writing some code that handles logging xml data and I would like to be able to replace the content of certain elements (eg passwords) in the document. I'd rather not serialize and parse the document as my code will be handling a variety of schemas.
Sample input documents:
doc #1:
<user>
<userid>jsmith</userid>
<password>myPword</password>
</user>
doc #2:
<secinfo>
<ns:username>jsmith</ns:username>
<ns:password>myPword</ns:password>
</secinfo>
What I'd like my output to be:
output doc #1:
<user>
<userid>jsmith</userid>
<password>XXXXX</password>
</user>
output doc #2:
<secinfo>
<ns:username>jsmith</ns:username>
<ns:password>XXXXX</ns:password>
</secinfo>
Since the documents I'll be processing could have a variety of schemas, I was hoping to come up with a nice generic regular expression solution that could find elements with password in them and mask the content accordingly.
Can I solve this using regular expressions and C# or is there a more efficient way?

This problem is best solved with XSLT:
<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:template match="#* | node()">
<xsl:copy>
<xsl:apply-templates select="#* | node()"/>
</xsl:copy>
</xsl:template>
<xsl:template match="//password">
<xsl:copy>
<xsl:text>XXXXX</xsl:text>
</xsl:copy>
</xsl:template>
</xsl:stylesheet>
This will work for both inputs as long as you handle the namespaces properly.
Edit : Clarification of what I mean by "handle namespaces properly"
Make sure your source document that has the ns name prefix has as namespace defined for the document like so:
<?xml version="1.0" encoding="utf-8"?>
<secinfo xmlns:ns="urn:foo">
<ns:username>jsmith</ns:username>
<ns:password>XXXXX</ns:password>
</secinfo>

I'd say you're better off parsing the content with a .NET XmlDocument object and finding password elements using XPath, then changing their innerXML properties. It has the advantage of being more correct (since XML isn't regular in the first place), and it's conceptually easy to understand.

From experience with systems that try to parse and/or modify XML without proper parsers, let me say: DON'T DO IT. Use an XML parser (There are other answers here that have ways to do that quickly and easily).
Using non-xml methods to parse and/or modify an XML stream will ALWAYS lead you to pain at some point in the future. I know, because I have felt that pain.
I know that it seems like it would be quicker-at-runtime/simpler-to-code/easier-to-understand/whatever if you use the regex solution. But you're just going to make someone's life miserable later.

You can use regular expressions if you know enough about what you are trying to match. For example if you are looking for any tag that has the word "password" in it with no inner tags this regex expression would work:
(<([^>]*?password[^>]*?)>)([^<]*?)(<\/\2>)
You could use the same C# replace statement in zowat's answer as well but for the replace string you would want to use "$1XXXXX$4" instead.

Regex is the wrong approach for this, I've seen it go so badly wrong when you least expect it.
XDocument is way more fun anyway:
XDocument doc = XDocument.Parse(#"
<user>
<userid>jsmith</userid>
<password>password</password>
</user>");
doc.Element("user").Element("password").Value = "XXXX";
// Temp namespace just for the purposes of the example -
XDocument doc2 = XDocument.Parse(#"
<secinfo xmlns:ns='http://tempuru.org/users'>
<ns:userid>jsmith</ns:userid>
<ns:password>password</ns:password>
</secinfo>");
doc2.Element("secinfo").Element("{http://tempuru.org/users}password").Value = "XXXXX";

Here is what I came up with when I went with XMLDocument, it may not be as slick as XSLT, but should be generic enough to handle a variety of documents:
//input is a String with some valid XML
XmlDocument doc = new XmlDocument();
doc.LoadXml(input);
XmlNodeList nodeList = doc.SelectNodes("//*");
foreach (XmlNode node in nodeList)
{
if (node.Name.ToUpper().Contains("PASSWORD"))
{
node.InnerText = "XXXX";
}
else if (node.Attributes.Count > 0)
{
foreach (XmlAttribute a in node.Attributes)
{
if (a.LocalName.ToUpper().Contains("PASSWORD"))
{
a.InnerText = "XXXXX";
}
}
}
}

The main reason that XSLT exist is to be able to transform XML-structures, this means that an XSLT is a type of stylesheet that can be used to alter the order of elements och change content of elements. Therefore this is a typical situation where it´s highly recommended to use XSLT instead of parsing as Andrew Hare said in a previous post.

Related

xpath to return value from xml doc

I'm wondering if there is a way to do the following with one xpath expression:
I have an XML doc similar to this but with many 'results',
<result>
<id>1</id>
<name>joe</name>
</result>
<result>
<id>2</id>
<name>jim</name>
</result>
I'm passing a variable into a C# utility along with the xml, and want to return the name where the id = the variable.
I could loop through the xml until reach what I'm after but if there's a handy xpath way to do it I'm listening...
thanks

Assuming you have a root element in there like "results" that XPath can validate, and that you don't have any other nodes named "result"...
//result[id=1]/name
Or you could get the text outright, instead of it being returned in a node
//result[id=1]/name/text()
And if you want to make sure that there's only one result, you could surround it with parens and put a [1] after
(//result[id=1]/name/text())[1]
I would also recommend testing with one of the xpath test sites out there like this one, but beware that different xpath/xml parsers sometimes behave differently.

xpath return string instead of nodelist

I am working on a biztalk project and I need to copy (filtered) content from 1 xml to another.
I have to do this with xpath, I can't use xsl transformation.
So my xpath to get the content from the source xml file is this:
//*[not(ancestor-or-self::IN1_Insurance)]|//IN1_Insurance[2]/descendant-or-self::*
Now this returns an xmlNodelist. Is it possible to return a string with all the nodes in it like:
"<root><node>text</node></root>"
If I put string() before my xpath it returns the values, but I want the whole xml in a string (with nodes..), so I could load that string in another xmldocument. I think this is the best method for my problem.
I know I can loop over the xmlnodelist and append the nodes to the new xmldocument, but it's a bit tricky to loop in a biztalk orchestration and I want to avoid this.
The code I can use is C#.
I've tried to just assign the nodelist to the xmldocument, but this throws a cast error (obvious..).
The way I see it is that I have 2 solutions:
assign the nodelist to the xmldocument without a loop (not possible i think in C#)
somehow convert the nodelist to string and load this in the xmldocument
load the xpath directly in the new xmldocument (don't know if this is possible since it returns a nodelist)
Thanks for your help
edit:
sample input:
<root>
<Patient>
<PatientId></PatientId>
<name></name>
</Patient>
<insurance>
<id>1</id>
<billing></billing>
</insurance
<insurance>
<id>2</id>
<billing></billing>
</insurance>
<insurance>
<id>3</id>
<billing></billing>
</insurance>
</root>
Now I want to copy this sample to another xmldocument, but without insurance node 2 and 3 (this is dynamically, so it could be unsurance node 1 and 2 to delete, or 1 and 3...)
So this has to be the output:
<root>
<Patient>
<PatientId></PatientId>
<name></name>
</Patient>
<insurance>
<id>1</id>
<billing></billing>
</insurance>
</root>
What I am doing now is use the xpath to get the nodes I want. Then I want to assign the result to the new xmldocument, but this is not possible since I get the castException
string xpath = "//*[not(ancestor-or-self::IN1_Insurance)]|//IN1_Insurance[2]/descendant-or-self::*";
xmlDoc = new System.Xml.XmlDocument();
xmlDoc = xpath(sourceXml, strXpath); <= cast error (cannot cast xmlnodelist to xmldocuemnt)
I know the syntax is a bit strange, but it is biztalk c# code..

The most straightforward solution would indeed be to "loop over the xmlnodelist and append (import) the nodes to the new xmldocument", but since you can't loop, what other basic things can/can't you do?
To serialize the nodelist, you could try using XmlNodeList.toString(). If that worked, you'd get a strange beast, because it could be duplicating parts of the XML document several times over. Especially since you're explicitly including ancestors and descendants directly in the nodelist. It would not be something that you could parse back in and have a result that resembled the nodelist you started with.
In other words, it would be best to loop over the XmlNodeList and import the nodes to the new XmlDocument.
But even so, I would be really surprised if you wanted to put all these ancestor and descendant nodes:
//*[not(ancestor-or-self::IN1_Insurance)]|//IN1_Insurance[2]/descendant-or-self::
directly into the new XML document. If you post some sample input and the desired output, we can probably help determine if that's the case.
Update:
I see what you're trying to do: copy an XML document, omitting all <insurance> elements (and their descendants) except the one you want.
This can be done without a loop if the output is as simple as your sample output: only one <Patient> and one <insurance> element, with their descendants, under one top-level element.
Something like (I can't test this as I don't have a biztalk server):
string xpathPatient = "/*/Patient";
string xpathInsuran = "/*/insurance[id = " + insId + "]"; // insId is a parameter
xmlDoc = new System.Xml.XmlDocument();
xmlPatient = xpath(sourceXml, xpathPatient);
xmlInsuran = xpath(sourceXml, xpathInsuran);
XmlElement rootNode = xmlDoc.CreateElement("root");
xmlDoc.AppendChild(rootNode);
//**Update: use [0] to get an XmlNode from the returned XmlNodeList (presumably)
rootNode.AppendChild(xmlDoc.ImportNode(xmlPatient[0], true));
rootNode.AppendChild(xmlDoc.ImportNode(xmlInsuran[0], true));
I confess though, I'm curious why you can't use XSLT. You're approaching tasks that would be more easily done in XSLT than in XPath + C# XmlDocument.
Update: since the xpath() function probably returns an XmlNodeList rather than an XmlNode, I added [0] to the first argument to ImportNode() above. Thanks to #Martin Honnen for alerting me to that.

XPath is a query language (only) for XML documents.
It operates on an abstract model -- the XML INFOSET, and cannot either modify the structure of the XML document(s) it operates on or serialize the INFOSET information items back to XML.
Therefore, the only way to achieve such serialization is to use the language that is hosting XPath.
Apart from this, there are obvious problems with yout question, for example these is no element named IN1_Insurance in the provided XML document -- therefore the XPath expression provided:
//*[not(ancestor-or-self::IN1_Insurance)]|//IN1_Insurance[2]/descendant-or-self::*
selects all elements in the document.
Note:
The described task is elementary to fulfil using XSLT.
Finally: If you are allowed to use C# then you can use the XslCompiledTransform (or XslTransform) class. Use its Transform() method to carry out the following transformation against the XML document:
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output omit-xml-declaration="yes" indent="yes"/>
<xsl:strip-space elements="*"/>
<xsl:template match="node()|#*">
<xsl:copy>
<xsl:apply-templates select="node()|#*"/>
</xsl:copy>
</xsl:template>
<xsl:template match="insurance[not(id=1)]"/>
</xsl:stylesheet>
This produces exactly the wanted result:
<root>
<Patient>
<PatientId></PatientId>
<name></name>
</Patient>
<insurance>
<id>1</id>
<billing></billing>
</insurance>
</root>

Force XML character entities into XmlDocument

I have some XML that looks like this:
<abc x="{"></abc>
I want to force XmlDocument to use the XML character entities of the brackets, ie:
<abc x="{"></abc>
MSDN says this:
In order to assign an attribute value
that contains entity references, the
user must create an XmlAttribute node
plus any XmlText and
XmlEntityReference nodes, build the
appropriate subtree and use
SetAttributeNode to assign it as the
value of an attribute.
CreateEntityReference sounded promising, so I tried this:
XmlDocument doc = new XmlDocument();
doc.LoadXml("<abc />");
XmlAttribute x = doc.CreateAttribute("x");
x.AppendChild(doc.CreateEntityReference("#123"));
doc.DocumentElement.Attributes.Append(x);
And I get the exception Cannot create an 'EntityReference' node with a name starting with '#'.
Any reason why CreateEntityReference doesn't like the '#' - and more importantly how can I get the character entity into XmlDocument's XML? Is it even possible? I'm hoping to avoid string manipulation of the OuterXml...

You're mostly out of luck.
First off, what you're dealing with are called Character References, which is why CreateEntityReference fails. The sole reason for a character reference to exist is to provide access to characters that would be illegal in a given context or otherwise difficult to create.
Definition: A character reference
refers to a specific character in the
ISO/IEC 10646 character set, for
example one not directly accessible
from available input devices.
(See section 4.1 of the XML spec)
When an XML processor encounters a character reference, if it is referenced in the value of an attribute (that is, if the &#xxx format is used inside an attribute), it is set to "Included" which means its value is looked up and the text is replaced.
The string "AT&T;" expands to "
AT&T;" and the remaining ampersand is
not recognized as an entity-reference
delimiter
(See section 4.4 of the XML spec)
This is baked into the XML spec and the Microsoft XML stack is doing what it's required to do: process character references.
The best I can see you doing is to take a peek at these old XML.com articles, one of which uses XSL to disable output escaping so &#123; would turn into { in the output.
http://www.xml.com/pub/a/2001/03/14/trxml10.html
<!DOCTYPE stylesheet [
<!ENTITY ntilde
"<xsl:text disable-output-escaping='yes'>&ntilde;</xsl:text>">
]>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
version="1.0">
<xsl:output doctype-system="testOut.dtd"/>
<xsl:template match="test">
<testOut>
The Spanish word for "Spain" is "España".
<xsl:apply-templates/>
</testOut>
</xsl:template>
</xsl:stylesheet>
And this one which uses XSL to convert specific character references into other text sequences (to accomplish the same goal as the previous link).
http://www.xml.com/lpt/a/1426
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
version="2.0">
<xsl:output use-character-maps="cm1"/>
<xsl:character-map name="cm1">
<xsl:output-character character=" " string="&nbsp;"/>
<xsl:output-character character="é" string="&233;"/> <!-- é -->
<xsl:output-character character="ô" string="&#244;"/>
<xsl:output-character character="—" string="--"/>
</xsl:character-map>
<xsl:template match="#*|node()">
<xsl:copy>
<xsl:apply-templates select="#*|node()"/>
</xsl:copy>
</xsl:template>
</xsl:stylesheet>

You should always manipulate your strings with the preceding # like so #"My /?.,<> STRING". I don't know if that will solve your issue though.
I would approach the problem using XmlNode class from the XmlDocument. You can use the Attributes property and it'll be way easier. Check it out here:
http://msdn.microsoft.com/en-us/library/system.xml.xmlnode.attributes.aspx

How to embed xml in xml

I need to embed an entire well-formed xml document within another xml document. However, I would rather avoid CDATA (personal distaste) and also I would like to avoid the parser that will receive the whole document from wasting time parsing the embedded xml. The embedded xml could be quite significant, and I would like the code that will receive the whole file to treat the embedded xml as arbitrary data.
The idea that immediately came to mind is to encode the embedded xml in base64, or to zip it. Does this sound ok?
I'm coding in C# by the way.

You could convert the XML to a byte array, then convert it to binary64 format. That will allow you to nest it in an element, and not have to use CDATA.

The W3C-approved way of doing this is XInclude. There is an implementation for .Net at http://mvp-xml.sourceforge.net/xinclude/

Just a quick note, I have gone the base64 route and it works just fine but it does come with a stiff performance penalty, especially under heavy usage. We do this with document fragments upto 20MB and after base64 encoding they can take upwards of 65MB (with tags and data), even with zipping.
However, the bigger issue is that .NET base64 encoding can consume up-to 10x the memory when performing the encoding/decoding and can frequently cause OOM exceptions if done repeatedly and/or done on multiple threads.
Someone, on a similar question recommended ProtoBuf as an option, as well as Fast InfoSet as another option.

Depending on how you construct the XML, one way is to not care about it and let the framework handle it.
XmlDocument doc = new XmlDocument();
doc.LoadXml("<?xml version=\"1.0\" encoding=\"utf-8\" ?><helloworld></helloworld>");
string xml = "<how><are><you reply=\"i am fine\">really</you></are></how>";
doc.GetElementsByTagName("helloworld")[0].InnerText = xml;
The output will be something like a HTMLEncoded string:
<?xml version="1.0" encoding="utf-8"?>
<helloworld><how><are><you
reply="i am fine">really</you></are></how>
</helloworld>

I would encode it in your favorite way (e.g. base64 or HttpServerUtility::UrlEncode, ...) and then embed it.

If you don't need the xml declaration (first line of the document), just insert the root element (with all childs) into the tree of the other xml document as a child of an existing element. Use a different namespace to seperate the inserted elements.

It seems that serialization is the recommended method.

Can't you use XSLT for this? Perhaps using xsl:copy or xsl:copy-of? This is what XSLT is for.

I use Comments for this :
<!-- your xml text -->
[EDITED]
If the embedded xml with comments, replace it with a different syntax.
<?xml version="1.0" encoding="iso-8859-1" ?>
<xml>
<status code="0" msg="" cause="" />
<data>
<order type="07" user="none" attrib="..." >
<xmlembeded >
<!--
<?xml version="1.0" encoding="iso-8859-1" ?>
<xml>
<status ret="000 "/>
<data>
<allxml_here />
<!** embedeb comments **>
</data>
<xml>
-->
</xmlembeded >
</order>
<context sessionid="12345678" scriptname="/from/..." attrib="..." />
</data>
</xml>

C# XmlDocument SelectNodes

I have an xml document with a root element, two child elements, 'diagnostic' and 'results'. The 'results' element then has an arbitrary number of elements with the name 'result'
When this is loaded into an XmlDocument it is easy to navigate the structure and see that this is exactly how things operate. I can write a recursive function that picks out all the "result" elements. The XmlDocument.SelectNodes("//results") finds a node no problem.
However,
* XmlDocument.SelectNodes("//results/result") finds nothing.
* XmlDocument.SelectNodes("//result") finds nothing.
I've talked to a co-worker and he's had grief using Xpath in XmlDocument.SelectNodes. Anyone else run into this kind of problem? Any solutions?
XML FILE:
<?xml version="1.0" encoding="UTF-8"?>
<query xmlns:yahoo="http://www.yahooapis.com/v1/base.rng" yahoo:count="10" yahoo:created="2009-08-07T10:19:59Z" yahoo:lang="en-US" yahoo:updated="2009-08-07T10:19:59Z" yahoo:uri="http://query.yahooapis.com/v1/yql?q=select+*+from+search.news+where+query%3D%22Tanzania%22">
<diagnostics>
<publiclyCallable>true</publiclyCallable>
<url execution-time="47"><![CDATA[http://boss.yahooapis.com/ysearch/news/v1/Tanzania?format=xml&start=0&count=10]]></url>
<user-time>49</user-time>
<service-time>47</service-time>
<build-version>2579</build-version>
</diagnostics>
<results>
<result xmlns="http://www.inktomi.com/">
<abstract>Kakungulu Cup winners SC Villa face Tanzania’s Simba SC this afternoon at the National stadium in Dar es salaam. “We had a very tiresome journey. The road was so bad and the road blocks were so many. However, we finally reached but the boys were so tired,” said Kato.</abstract>
<clickurl>http://lrd.yahooapis.com/_ylc=X3oDMTQ4cXAxcnRoBF9TAzIwMjMxNTI3MDIEYXBwaWQDb0pfTWdwbklrWW5CMWhTZnFUZEd5TkouTXNxZlNMQmkEY2xpZW50A2Jvc3MEc2VydmljZQNCT1NTBHNsawN0aXRsZQRzcmNwdmlkA21VVGlta2dlQXUzeEYuM0xGQkQzR1pUU1FIS0dORXA4cUk4QUJJX1U-/SIG=12vhpskdd/**http%3A//www.monitor.co.ug/artman/publish/sports/SC_Villa_face_Simba_in_Tanzania_89289.shtml</clickurl>
<date>2009/08/07</date>
<language>english</language>
<source>The Monitor</source>
<sourceurl>http://www.monitor.co.ug/</sourceurl>
<time>20:22:32</time>
<title>SC Villa face Simba in Tanzania</title>
<url>http://www.monitor.co.ug/artman/publish/sports/SC_Villa_face_Simba_in_Tanzania_89289.shtml</url>
</result>
XPATH
doc.SelectNodes("//result") produces no hits.

Rob and Marc's answers are probably going in the right direction - XmlDocument + namespaces + XPath can be a bit of a pain.
If you're able to use .NET 3.5, I suggest you use LINQ to XML instead. That would make it really easy:
XDocument doc = XDocument.Load("foo.xml");
XNamespace ns = "bar";
var results = doc.Descendants(ns + "result");
foreach (var result in results)
{
...
}
Basically LINQ to XML is a superior API in almost every way, in my experience :) (I believe there are some capabilities it's missing, but if you have access to .NET 3.5 it's definitely worth at least trying.)

It sounds to me like namespaces are the issues; you generally need to enlist the help of an XmlNamespaceManager for this, and use an alias in your queries, i.e.
doc.SelectNodes("//x:results/x:result", nsmgr);
(where x is defined in nsmgr as an alias to the given namespace)

Develop Reference

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.

Using C# Regular expression to replace XML element content - c#

I'd say you're better off parsing the content with a .NET XmlDocument object and finding password elements using XPath, then changing their innerXML properties. It has the advantage of being more correct (since XML isn't regular in the first place), and it's conceptually easy to understand.

Related

xpath to return value from xml doc

xpath return string instead of nodelist

Force XML character entities into XmlDocument

How to embed xml in xml

C# XmlDocument SelectNodes

Categories

Resources