Need to remove XML nodes in a string and leave the text - C#

I have a string that is part of an XML document:
a<b>b</b>c<i>d</i>e<b>f</b>g
I want to extract the parts of the string that are not inside any tags.
So I need to extract "aceg" from this string and discard the characters "bdf".
How can this be done?
Edit:
This was part of an XML document. Let's assume it is:
<div>a<b>b</b>c<i>d</i>e<b>f</b>g</div>
Now it's valid XML :)

The following regular expression will remove all tags from the string. The text inside the tags is kept, so the result is "abcdefg":
Regex.Replace("a<b>b</b>c<i>d</i>e<b>f</b>g", "<[^>]+>", string.Empty);

That string is not valid XML.
However, assuming you had a valid XML string, then you could do something like this:
using System;
using System.Xml;

class Program
{
    static void Main(string[] args)
    {
        string contents = string.Empty;
        XmlDocument document = new XmlDocument();
        document.LoadXml("<outer>a<b>b</b>c<i>d</i>e<b>f</b>g</outer>");
        // Collect the text of every child element (b, i, b).
        foreach (XmlNode child in document.DocumentElement.ChildNodes)
        {
            if (child.NodeType == XmlNodeType.Element)
            {
                contents += child.InnerText;
            }
        }
        Console.WriteLine(contents);
        Console.ReadKey();
    }
}
This will print out the string "bdf"
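Since the question asks for "aceg" rather than "bdf", the same loop can collect the direct text nodes instead of the element nodes. This is a minimal sketch, assuming the fragment has already been wrapped in a root element; the class and method names are made up for illustration.

```csharp
using System;
using System.Text;
using System.Xml;

public static class DirectTextDemo
{
    // Concatenates only the text that sits directly under the root element,
    // skipping the contents of child elements such as <b> and <i>.
    public static string DirectText(string xml)
    {
        var document = new XmlDocument();
        document.LoadXml(xml);

        var result = new StringBuilder();
        foreach (XmlNode child in document.DocumentElement.ChildNodes)
        {
            if (child.NodeType == XmlNodeType.Text)
            {
                result.Append(child.Value);
            }
        }
        return result.ToString();
    }

    public static void Main()
    {
        Console.WriteLine(DirectText("<div>a<b>b</b>c<i>d</i>e<b>f</b>g</div>")); // aceg
    }
}
```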

Following on from @Stoo's answer, you should be able to omit the tag contents as well with something like this:
Regex.Replace("a<b>b</b>c<i>d</i>e<b>f</b>g", "<[^>]+>[^<]+</[^>]+>", string.Empty);
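A quick sketch of what that pattern does and where it stops working. It assumes flat, non-nested elements like the ones in the question; the helper name StripElements is hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class StripElementsDemo
{
    // Removes each "<tag>text</tag>" run, keeping only the text between elements.
    public static string StripElements(string input)
    {
        return Regex.Replace(input, "<[^>]+>[^<]+</[^>]+>", string.Empty);
    }

    public static void Main()
    {
        // Flat elements, as in the question.
        Console.WriteLine(StripElements("a<b>b</b>c<i>d</i>e<b>f</b>g")); // aceg

        // Nested elements: only the innermost element matches the pattern,
        // so markup is left behind.
        Console.WriteLine(StripElements("a<b>x<i>y</i>z</b>c")); // a<b>xz</b>c
    }
}
```

For anything beyond flat fragments, an XML parser is the more robust choice.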

Preserve &#xA; when reading XML

The XML content looks like this:
<xml>
<item content="abcd &#xD; abcd&#xA;abcd" />
</xml>
When using XmlDocument to read the value of the content attribute, &#xD; and &#xA; are automatically escaped.
Code:
XmlDocument doc = new XmlDocument();
var content = doc.SelectSingleNode("/xml/item").Attributes["content"].Value;
How can I get the raw text without character escaping?
If these characters were written to the lexical XML stream without escaping, then they would be swallowed by the XML parser when the stream is read by the recipient, as a result of the XML line-ending normalisation rules. So you've got it the wrong way around: the reason they are escaped is in order to preserve them; if they weren't escaped, they would be lost.
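To illustrate the point, here is a small sketch that parses the same attribute value twice, once with the line feed written as a character reference and once written literally. It uses LINQ to XML rather than XmlDocument, and the class and method names are assumptions for illustration. The XML specification says literal line breaks in attribute values are normalised to spaces; the second result is printed rather than asserted, since it depends on the parser applying that rule.

```csharp
using System;
using System.Xml.Linq;

public static class CharRefDemo
{
    // Returns the parsed value of the "content" attribute on the root element.
    public static string ReadContent(string xml)
    {
        return XElement.Parse(xml).Attribute("content").Value;
    }

    public static void Main()
    {
        // Written as a character reference: the line feed survives parsing.
        string escaped = ReadContent("<item content=\"abcd&#xA;abcd\" />");
        Console.WriteLine(escaped == "abcd\nabcd"); // True

        // Written literally: the parser may normalise the line break away.
        string literal = ReadContent("<item content=\"abcd\nabcd\" />");
        Console.WriteLine(literal == "abcd\nabcd");
    }
}
```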
I got a workaround, it works for me:
private static string GetAttributeValue(XmlNode node, string attributeName)
{
    if (node == null || string.IsNullOrWhiteSpace(attributeName))
    {
        throw new ArgumentException();
    }
    const string CharLF = "&#xA;";
    const string CharCR = "&#xD;";
    string xmlContent = node.OuterXml;
    if (!xmlContent.Contains(CharLF) && !xmlContent.Contains(CharCR))
    {
        // no special char, return its original value directly
        return node.Attributes[attributeName].Value;
    }
    string value = string.Empty;
    if (xmlContent.Contains(attributeName))
    {
        // take the text between the first pair of quotes after the attribute name
        value = xmlContent.Substring(xmlContent.IndexOf(attributeName)).Trim();
        value = value.Substring(value.IndexOf("\"") + 1);
        value = value.Substring(0, value.IndexOf("\""));
    }
    return value;
}

Looking for tags in strings from right to left

string sample1 = "<SUCCESS><BUILDING>27</BUILDING></SUCCESS><CLEANED><LOCALITY>Value 1</LOCALITY></CLEANED>";
string sample2 = "<SUCCESS><BUILDING>14</BUILDING></SUCCESS> <SUCCESS><BUILDING>Value 2</BUILDING></SUCCESS>";
In both of the sample strings above I want to get the first "SUCCESS" tag, searching from right to left.
So for sample 1 I want this returned: <SUCCESS><BUILDING>27</BUILDING></SUCCESS>
and for sample 2 I want this returned: <SUCCESS><BUILDING>Value 2</BUILDING></SUCCESS>
I know I can use IndexOf to find the first occurrence, but I'm not sure how to find the last.
XDocument doc = XDocument.Parse("<xml>" + sample2 + "</xml>");
Text = doc.Root.Elements("SUCCESS").Last().ToString();
C# has a nice String method called LastIndexOf(String). It works exactly the same way as IndexOf(String), except that it gives you the last occurrence.
http://msdn.microsoft.com/en-us/library/1wdsy8fy(v=vs.110).aspx
Hope this helps,
Cheers
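A rough sketch of the LastIndexOf approach applied to the two samples. It assumes the tag has no attributes and is never nested inside itself; the helper name LastElement is hypothetical.

```csharp
using System;

public static class LastIndexOfDemo
{
    // Returns the last "<tag>...</tag>" span in the input, or null if there is none.
    public static string LastElement(string input, string tag)
    {
        string open = "<" + tag + ">";
        string close = "</" + tag + ">";

        int start = input.LastIndexOf(open, StringComparison.Ordinal);
        int end = input.LastIndexOf(close, StringComparison.Ordinal);
        if (start < 0 || end < start)
        {
            return null;
        }
        return input.Substring(start, end + close.Length - start);
    }

    public static void Main()
    {
        string sample1 = "<SUCCESS><BUILDING>27</BUILDING></SUCCESS><CLEANED><LOCALITY>Value 1</LOCALITY></CLEANED>";
        string sample2 = "<SUCCESS><BUILDING>14</BUILDING></SUCCESS> <SUCCESS><BUILDING>Value 2</BUILDING></SUCCESS>";

        Console.WriteLine(LastElement(sample1, "SUCCESS")); // <SUCCESS><BUILDING>27</BUILDING></SUCCESS>
        Console.WriteLine(LastElement(sample2, "SUCCESS")); // <SUCCESS><BUILDING>Value 2</BUILDING></SUCCESS>
    }
}
```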
If you're going to be parsing XML, you might be interested in using the XmlReader class.
Note that you need valid XML for the reader to work. In your example, you would need to wrap the partial XML in a unique root node (part of the XML spec). You might consider making some extension methods to help you:
using System.IO;
using System.Xml;

public static class XMLStringExtensions
{
    public static string LastTag(this string innerXml, string tag)
    {
        string previousTag = null;
        using (var reader = XmlReader.Create(new StringReader(innerXml.WrapInRoot())))
        {
            // Note: ReadOuterXml already advances past the element, so a matching
            // element that follows immediately (no whitespace between) would be skipped.
            while (reader.ReadToFollowing(tag)) previousTag = reader.ReadOuterXml();
        }
        return previousTag;
    }

    public static string WrapInRoot(this string partialXml)
    {
        return string.Format("<root>{0}</root>", partialXml);
    }
}
Then you can invoke it like this:
sample1.LastTag("SUCCESS"); //<SUCCESS><BUILDING>27</BUILDING></SUCCESS>
sample2.LastTag("SUCCESS"); //<SUCCESS><BUILDING>Value 2</BUILDING></SUCCESS>

What would be the best way of checking whether a string contains XML tags?

I know that the following would find potential tags, but is there a better way to check if a string contains XML tags to prevent exceptions when reading/writing the string between XML files?
string testWord = "test<a>";
bool foundTag = Regex.IsMatch(testWord, @"^*<*>*$");
I'd use another Regex for that:
Regex.IsMatch(testWord, @"<.+?>");
However, even if it does match, there is no guarantee that your string actually is XML, as the regex could also match strings like "<<a>", which is invalid, or "a <= b >= c", which is obviously not XML.
You should consider using the XmlDocument class instead. LoadXml parses a string (Load expects a file name or URL) and throws an XmlException if the string is not well-formed XML:
XmlDocument xmlDoc = new XmlDocument();
try
{
    xmlDoc.LoadXml(testWord);
}
catch (XmlException)
{
    // not well-formed XML
}
Why don't you HtmlEncode the string before sending it via XML? This way you can avoid difficulties with Regex parsing tags.
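A small sketch of that suggestion. The answer does not say which HtmlEncode it means, so this assumes System.Net.WebUtility.HtmlEncode; the class and method names are made up for illustration.

```csharp
using System;
using System.Net;

public static class HtmlEncodeDemo
{
    // Encodes markup characters so the string can be embedded as plain text.
    public static string Encode(string text)
    {
        return WebUtility.HtmlEncode(text);
    }

    public static void Main()
    {
        string testWord = "test<a>";
        string encoded = Encode(testWord);

        Console.WriteLine(encoded);                                    // test&lt;a&gt;
        Console.WriteLine(WebUtility.HtmlDecode(encoded) == testWord); // True
    }
}
```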

Loading an XML file when the tags are written in Greek doesn't work, why?

When I load XML files with English tags everything works fine but when I try to load an XML file with tags written in the Greek Language nothing works, why is this happening?
Do I have to change the encoding somewhere in the code?
This is the code I use:
XmlDocument xdoc = new XmlDocument();
xdoc.Load(filename);
XmlNode root = xdoc.DocumentElement;
if (root.HasChildNodes)
{
    for (int i = 0; i < root.ChildNodes.Count; i++)
    {
        richTextBox1.AppendText(root.ChildNodes[i].InnerXml + "\n");
    }
}
I downloaded your file and deserialized/displayed it successfully.
public class ΦΑΡΜΑΚΑ
{
    public string A;
    public string ΦΑΡΜ_ΑΓΩΓΗ;
    public string ΧΟΡΗΓΗΣΗ;
    public string ΛΗΞΗΣ;
    public string ΑMKA;
}
XmlSerializer xml = new XmlSerializer(typeof(ΦΑΡΜΑΚΑ[]), new XmlRootAttribute("dataroot"));
ΦΑΡΜΑΚΑ[] array = (ΦΑΡΜΑΚΑ[])xml.Deserialize(File.Open(@"D:\Downloads\bio3.xml", FileMode.Open));
richTextBox1.Text = String.Join(Environment.NewLine, array.Select(x => x.ΦΑΡΜ_ΑΓΩΓΗ));
Make sure your rich text box has its Multiline property set to true. The default is true, but you may have changed it. Also, instead of \n use Environment.NewLine.
Also .InnerText will get you the value without the tags. InnerXml gives you the markup as well.
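The difference between the two properties can be seen in a short sketch. The document below is a made-up stand-in that borrows the Greek element names from the answer above, and the value 2020 is invented for illustration.

```csharp
using System;
using System.Xml;

public static class InnerTextDemo
{
    // Loads the XML and returns the first child of the root element.
    public static XmlNode FirstChild(string xml)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xml);
        return doc.DocumentElement.ChildNodes[0];
    }

    public static void Main()
    {
        XmlNode first = FirstChild("<dataroot><ΦΑΡΜΑΚΑ><ΛΗΞΗΣ>2020</ΛΗΞΗΣ></ΦΑΡΜΑΚΑ></dataroot>");

        Console.WriteLine(first.InnerXml);  // <ΛΗΞΗΣ>2020</ΛΗΞΗΣ>
        Console.WriteLine(first.InnerText); // 2020
    }
}
```

Whether the Greek characters display correctly on a console depends on the console's output encoding; the strings themselves are unaffected.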

String escape into XML

Is there any C# function which could be used to escape and un-escape a string, which could be used to fill in the content of an XML element?
I am using VSTS 2008 + C# + .Net 3.0.
EDIT 1: I am concatenating simple, short XML files and I do not use serialization, so I need to escape XML characters explicitly by hand. For example, I need to put a<b into <foo></foo>, so I need to escape the string a<b and put it into the element foo.
SecurityElement.Escape(string s)
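Applied to the example from the question, that call could look like the sketch below. The wrapper method WrapInFoo is hypothetical and only there to show the escaped text inside the element.

```csharp
using System;
using System.Security;

public static class SecurityEscapeDemo
{
    // Escapes the text and places it inside a <foo> element by concatenation.
    public static string WrapInFoo(string text)
    {
        return "<foo>" + SecurityElement.Escape(text) + "</foo>";
    }

    public static void Main()
    {
        Console.WriteLine(WrapInFoo("a<b")); // <foo>a&lt;b</foo>
    }
}
```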
public static string XmlEscape(string unescaped)
{
    XmlDocument doc = new XmlDocument();
    XmlNode node = doc.CreateElement("root");
    node.InnerText = unescaped;
    return node.InnerXml;
}
public static string XmlUnescape(string escaped)
{
    XmlDocument doc = new XmlDocument();
    XmlNode node = doc.CreateElement("root");
    node.InnerXml = escaped;
    return node.InnerText;
}
EDIT: You say "I am concatenating simple and short XML file and I do not use serialization, so I need to explicitly escape XML character by hand".
I would strongly advise you not to do it by hand. Use the XML APIs to do it all for you - read in the original files, merge the two into a single document however you need to (you probably want to use XmlDocument.ImportNode), and then write it out again. You don't want to write your own XML parsers/formatters. Serialization is somewhat irrelevant here.
If you can give us a short but complete example of exactly what you're trying to do, we can probably help you to avoid having to worry about escaping in the first place.
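As a rough idea of what merging through the API might look like, here is a sketch using XmlDocument.ImportNode. The two input documents and the Merge helper are invented for illustration, since the question does not show the real files.

```csharp
using System;
using System.Xml;

public static class MergeDemo
{
    // Appends copies of the children of the source root to the target root.
    public static string Merge(string targetXml, string sourceXml)
    {
        var target = new XmlDocument();
        target.LoadXml(targetXml);
        var source = new XmlDocument();
        source.LoadXml(sourceXml);

        foreach (XmlNode child in source.DocumentElement.ChildNodes)
        {
            // ImportNode creates a deep copy that belongs to the target document.
            XmlNode copy = target.ImportNode(child, true);
            target.DocumentElement.AppendChild(copy);
        }
        return target.OuterXml;
    }

    public static void Main()
    {
        string merged = Merge(
            "<root><foo>a&lt;b</foo></root>",
            "<root><bar>c&amp;d</bar></root>");
        Console.WriteLine(merged); // <root><foo>a&lt;b</foo><bar>c&amp;d</bar></root>
    }
}
```

The escaping stays intact through the merge without any manual string handling.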
Original answer
It's not entirely clear what you mean, but normally XML APIs do this for you. You set the text in a node, and it will automatically escape anything it needs to. For example:
LINQ to XML example:
using System;
using System.Xml.Linq;
class Test
{
    static void Main()
    {
        XElement element = new XElement("tag",
                                        "Brackets & stuff <>");
        Console.WriteLine(element);
    }
}
DOM example:
using System;
using System.Xml;
class Test
{
    static void Main()
    {
        XmlDocument doc = new XmlDocument();
        XmlElement element = doc.CreateElement("tag");
        element.InnerText = "Brackets & stuff <>";
        Console.WriteLine(element.OuterXml);
    }
}
Output from both examples:
<tag>Brackets &amp; stuff &lt;&gt;</tag>
That's assuming you want XML escaping, of course. If you're not, please post more details.
Thanks to @sehe for the one-line escape:
var escaped = new System.Xml.Linq.XText(unescaped).ToString();
I add to it the one-line un-escape:
var unescapedAgain = System.Xml.XmlReader.Create(new StringReader("<r>" + escaped + "</r>")).ReadElementString();
George, it's simple. Always use the XML APIs to handle XML. They do all the escaping and unescaping for you.
Never create XML by appending strings.
And if you want, like me when I found this question, to escape XML node names, like for example when reading from an XML serialization, use the easiest way:
XmlConvert.EncodeName(string nameToEscape)
It will also escape spaces and any non-valid characters for XML elements.
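For instance, a name containing a space round-trips like this. The sample name is arbitrary and the class name is made up for illustration.

```csharp
using System;
using System.Xml;

public static class EncodeNameDemo
{
    public static void Main()
    {
        // A space is not allowed in an XML name, so it is encoded as _x0020_.
        string encoded = XmlConvert.EncodeName("Order Details");
        Console.WriteLine(encoded);                        // Order_x0020_Details

        // DecodeName reverses the encoding.
        Console.WriteLine(XmlConvert.DecodeName(encoded)); // Order Details
    }
}
```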
http://msdn.microsoft.com/en-us/library/system.security.securityelement.escape%28VS.80%29.aspx
Another take based on Jon Skeet's answer that doesn't return the tags:
void Main()
{
    XmlString("Brackets & stuff <> and \"quotes\"").Dump();
}
public string XmlString(string text)
{
    return new XElement("t", text).LastNode.ToString();
}
This returns just the value passed in, in XML encoded format:
Brackets &amp; stuff &lt;&gt; and "quotes"
WARNING: Necromancing
Still Darin Dimitrov's answer + System.Security.SecurityElement.Escape(string s) isn't complete.
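In XML 1.1, the simplest and safest way is to just encode EVERYTHING.
Like &#09; for \t.
Control characters below U+0020 other than tab, line feed and carriage return aren't supported at all in XML 1.0, not even as character references.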
For XML 1.0, one possible workaround is to base-64 encode the text containing the character(s).
// Usage:
//string EncodedXml = SpecialXmlEscape("привет мир");
//Console.WriteLine(EncodedXml);
//string DecodedXml = XmlUnescape(EncodedXml);
//Console.WriteLine(DecodedXml);
public static string SpecialXmlEscape(string input)
{
    if (string.IsNullOrEmpty(input))
        return input;

    System.Text.StringBuilder sb = new System.Text.StringBuilder();
    for (int i = 0; i < input.Length; ++i)
    {
        // Encode each code point as a decimal character reference.
        // A surrogate pair becomes one reference instead of two invalid ones.
        int codePoint = char.ConvertToUtf32(input, i);
        if (char.IsHighSurrogate(input[i]))
            ++i;
        sb.AppendFormat("&#{0};", codePoint);
    }
    return sb.ToString();
} // End Function SpecialXmlEscape
XML 1.0:
public static string Base64Encode(string plainText)
{
    var plainTextBytes = System.Text.Encoding.UTF8.GetBytes(plainText);
    return System.Convert.ToBase64String(plainTextBytes);
}
public static string Base64Decode(string base64EncodedData)
{
    var base64EncodedBytes = System.Convert.FromBase64String(base64EncodedData);
    return System.Text.Encoding.UTF8.GetString(base64EncodedBytes);
}
The following functions will do the job. I haven't benchmarked them against XmlDocument, but I'd guess they are much faster.
public static string XmlEncode(string value)
{
    System.Xml.XmlWriterSettings settings = new System.Xml.XmlWriterSettings
    {
        ConformanceLevel = System.Xml.ConformanceLevel.Fragment
    };
    StringBuilder builder = new StringBuilder();
    using (var writer = System.Xml.XmlWriter.Create(builder, settings))
    {
        writer.WriteString(value);
    }
    return builder.ToString();
}
public static string XmlDecode(string xmlEncodedValue)
{
    System.Xml.XmlReaderSettings settings = new System.Xml.XmlReaderSettings
    {
        ConformanceLevel = System.Xml.ConformanceLevel.Fragment
    };
    using (var stringReader = new System.IO.StringReader(xmlEncodedValue))
    {
        using (var xmlReader = System.Xml.XmlReader.Create(stringReader, settings))
        {
            xmlReader.Read();
            return xmlReader.Value;
        }
    }
}
Using a third-party library (Newtonsoft.Json) as an alternative. Note that this applies JSON string escaping rather than XML entity escaping, so the result has to be decoded as JSON again (with XmlUnescape below) by whoever reads it:
public static string XmlEscape(string unescaped)
{
    if (unescaped == null) return null;
    return JsonConvert.SerializeObject(unescaped);
}
public static string XmlUnescape(string escaped)
{
    if (escaped == null) return null;
    return JsonConvert.DeserializeObject(escaped, typeof(string)).ToString();
}
Examples of escaped string:
a<b ==> "a<b"
<foo></foo> ==> "foo></foo>"
NOTE:
In newer versions, the code written above may not work with escaping, so you need to specify how the strings will be escaped:
public static string XmlEscape(string unescaped)
{
    if (unescaped == null) return null;
    return JsonConvert.SerializeObject(unescaped, new JsonSerializerSettings()
    {
        StringEscapeHandling = StringEscapeHandling.EscapeHtml
    });
}
Examples of escaped string:
a<b ==> "a\u003cb"
<foo></foo> ==> "\u003cfoo\u003e\u003c/foo\u003e"
SecurityElement.Escape does this job for you. From the documentation:
Use this method to replace invalid characters in a string before using the string in a SecurityElement. If invalid characters are used in a SecurityElement without being escaped, an ArgumentException is thrown.
The following table shows the invalid XML characters and their escaped equivalents.
Invalid XML character | Replaced with
< | &lt;
> | &gt;
" | &quot;
' | &apos;
& | &amp;
https://learn.microsoft.com/en-us/dotnet/api/system.security.securityelement.escape?view=net-5.0
