When writing an XML documentation comment in a C# source file, do I need to replace " with &quot;? This question discusses how to replace characters, but does not establish which characters need replacing, aside from establishing that angle brackets do need replacing.
No, just like normal XML, you only need to escape the quotes when they would otherwise have a special meaning. So if your XML documentation contains an attribute and you want a double quote in the attribute value, then you'd either use the &quot; entity or use single quotes to delimit the value:
/// Foo <element attr="Bar&quot;Baz" />
/// Foo <element attr='Bar"Baz' />
But it's very rare to need attribute values with quotes within XML documentation, in my experience. They're almost always references to parameters, members, or list types.
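As a quick sanity check of the quoting rules, here is a minimal sketch that runs the same kind of fragments through LINQ to XML. The `summary` text and the `element`/`attr` names are made up for illustration, and this assumes documentation comments follow ordinary XML rules, as the answer states.

```csharp
using System;
using System.Xml.Linq;

// Double quotes in element text need no escaping at all.
var summary = XElement.Parse("<summary>Returns the \"quoted\" name.</summary>");
Console.WriteLine(summary.Value);

// Inside an attribute value, either escape the quote or switch delimiters.
var escaped = XElement.Parse("<element attr=\"Bar&quot;Baz\" />");
var singleQuoted = XElement.Parse("<element attr='Bar\"Baz' />");

// Both forms yield the same attribute value.
Console.WriteLine((string)escaped.Attribute("attr") == (string)singleQuoted.Attribute("attr"));
```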
I want a quick function, which may be part of my XML parser. I do not want to parse the whole string and check whether it is correct XML.
This is not really doable without parsing, or at least, in a limited form, without using a regular expression. Names in XML permit different characters as the first character than as the second and subsequent characters; see the Name production.
Should you implement IsValidXmlChar without a context, i.e. just checking if the given character is a NameChar, as per the XML specification, the output of your example would be GridAttributeStuff.
So you should at least tokenize the input text to retrieve valid names, and parse the input to retrieve element names, i.e. output Grid in your example.
To check if a string is an XML name, the XmlReader class offers the IsName static method. To categorize characters in an XML text, there is the XmlCharType struct in .NET Framework as well as in .NET Core, but it's internal.
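A short sketch of the name checks mentioned above. The sample strings are invented for illustration. The per-character `XmlConvert` methods are public helpers for NCNames (names without colons), shown here as a possible stand-in for the internal `XmlCharType`.

```csharp
using System;
using System.Xml;

// XmlReader.IsName checks a whole string against the XML Name production.
Console.WriteLine(XmlReader.IsName("Grid"));        // True
Console.WriteLine(XmlReader.IsName("1Grid"));       // False: a digit cannot start a name
Console.WriteLine(XmlReader.IsName("Grid Stuff"));  // False: whitespace is not a name character

// The first character follows stricter rules than the following ones.
Console.WriteLine(XmlConvert.IsStartNCNameChar('1')); // False
Console.WriteLine(XmlConvert.IsNCNameChar('1'));      // True
```

Because a context-free per-character check accepts any name character anywhere, it cannot tell where one name ends and the next begins, which is why tokenizing is still needed.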
I have occasionally run across XML with some junk characters tossed in between the elements, which appears to be confusing whatever internal XNode/XElement method handles prettifying the Element.
The following...
var badNode = XElement.Parse(@"<b>+
<inner1/>
<inner2/>
</b>");
prints out
<b>+
<inner1 /><inner2 /></b>
while this...
var badNode = XElement.Parse(@"<b>
<inner1/>
<inner2/>
</b>");
gives the expected
<b>
<inner1 />
<inner2 />
</b>
According to the debugger, the junk character gets parsed in as the XElement's "NextNode" property, which then apparently assigns the remaining XML as its "NextNode", causing the single line prettifying.
Is there any way to prevent/ignore this behavior, short of pre-screening the XML for any errant characters in between tag markers?
You are getting awkward indentation for badNode because, by adding the non-whitespace + character into the <b> element value, the element now contains mixed content, which is defined by the W3C as follows:
3.2.2 Mixed Content
[Definition: An element type has mixed content when elements of that type may contain character data, optionally interspersed with child elements.]
The presence of mixed content inside an element triggers special formatting rules for XmlWriter (which is used internally by XElement.ToString() to actually write itself to an XML string) that are explained in the documentation remarks for XmlWriterSettings.Indent:
This property only applies to XmlWriter instances that output text content; otherwise, this setting is ignored.
The elements are indented as long as the element does not contain mixed content. Once the WriteString or WriteWhitespace method is called to write out a mixed element content, the XmlWriter stops indenting. The indenting resumes once the mixed content element is closed.
This explains the behavior you are seeing.
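To see the rule in isolation, here is a minimal sketch that drives an XmlWriter directly, with and without a WriteString call inside <b>. The `Write` helper is a hypothetical name introduced for this illustration only.

```csharp
using System;
using System.Text;
using System.Xml;

// Hypothetical helper: writes <b> with two empty children, optionally preceded by text.
string Write(bool withText)
{
    var sb = new StringBuilder();
    var settings = new XmlWriterSettings { Indent = true, OmitXmlDeclaration = true };
    using (var writer = XmlWriter.Create(sb, settings))
    {
        writer.WriteStartElement("b");
        if (withText) writer.WriteString("+"); // turns <b> into mixed content
        writer.WriteStartElement("inner1"); writer.WriteEndElement();
        writer.WriteStartElement("inner2"); writer.WriteEndElement();
        writer.WriteEndElement();
    }
    return sb.ToString();
}

Console.WriteLine(Write(withText: false)); // children on separate, indented lines
Console.WriteLine(Write(withText: true));  // indentation is switched off inside <b>
```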
As a workaround, parsing your XML with LoadOptions.PreserveWhitespace, which preserves insignificant white space while parsing, might be what you want:
var badNode = XElement.Parse(@"<b>+
<inner1/>
<inner2/>
</b>",
LoadOptions.PreserveWhitespace);
Console.WriteLine(badNode);
Which outputs:
<b>+
<inner1 />
<inner2 />
</b>
Demo fiddle #1 here.
Alternatively, if you are sure that badNode should not have character data, you could strip it manually after parsing:
badNode.Nodes().OfType<XText>().Remove();
Now badNode will no longer contain mixed content and XmlWriter will indent it nicely.
Demo fiddle #2 here.
I'm looking to remove empty elements from an XML file because the reader expects a value. These are not nil elements (xsi:nil="true") or elements without content (<Element />), as in "Deserialize Xml with empty elements in C#", but elements where the inner part is simply missing: <Element></Element>.
I've tried writing my own code for removing these elements, but my code is too slow and the files too large. The end of every item will also contain this pattern. So the following regex would remove valid xml:
@"<.*></*>"
I need some sort of regex that makes sure the pattern of the two * are the same.
So:
<Item><One>1</One><Two></Two><Three>3</Three></Item>
Would change into:
<Item><One>1</One><Three>3</Three></Item>
So the fact that it's all on one line makes this harder, because it means the end of the item is right after the end of Three, producing the pattern I'd like to look for.
I don't have access to the original data that would allow recreating valid xml.
You want to capture one or more word characters inside <...> and match the closing tag by using the \1 backreference to what was captured by the first group.
<(\w+)></\1>
See demo at regex101
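In C#, the same pattern can be applied with Regex.Replace. This is a minimal sketch using the sample input from the question; it assumes tag names consist only of word characters and that empty elements carry no attributes.

```csharp
using System;
using System.Text.RegularExpressions;

string input = "<Item><One>1</One><Two></Two><Three>3</Three></Item>";

// \1 must repeat whatever (\w+) captured, so only a matching open/close pair is removed.
string cleaned = Regex.Replace(input, @"<(\w+)></\1>", "");

Console.WriteLine(cleaned); // <Item><One>1</One><Three>3</Three></Item>
```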
AFAIK there is no need to capture any group, because <a></b> (which a simple regex without capturing would match) is just invalid XML and it can't be in your file (unless you're parsing HTML, in which case, even if it can be done, I'd suggest not using regex). Capturing a group is required only if you're matching non-empty nodes, but that's not your case.
Note that you have a problem with your regex (besides the unescaped /) because you're matching any character with ., but XML tags can't contain arbitrary characters. If you absolutely want to use .* then it should be .*? and you should exclude /.
What I would do is to keep regex as simple as possible (still matching valid XML node names or - even better - only what you know is your data input):
<\w+><\/\w+>
You should/may have a better check for the tag name; for example, \s*[\w\d]+\s* may be slightly better. A regex with fewer steps will perform better for very large files. Also, you may want to allow an optional newline between the opening and closing tags.
Note that you may need to loop until no more replacements are done if, for example, you have <outer><inner></inner></outer> and you want it to be reduced to an empty string (especially in this case don't forget to compile your regex).
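The loop described above might look like the following sketch. The input string is invented for illustration, and the pattern is the backreference variant; the simpler <\w+><\/\w+> would be used the same way.

```csharp
using System;
using System.Text.RegularExpressions;

// Compile once, since the regex runs repeatedly over a large input.
var emptyPair = new Regex(@"<(\w+)></\1>", RegexOptions.Compiled);

string xml = "<root><outer><inner></inner></outer><keep>1</keep></root>";
string previous;
do
{
    previous = xml;
    xml = emptyPair.Replace(xml, ""); // each pass may expose new empty pairs
} while (xml != previous);            // stop once a pass changes nothing

Console.WriteLine(xml); // <root><keep>1</keep></root>
```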
Use LINQ to XML
string xml = "<Item><One>1</One><Two></Two><Three>3</Three></Item>";
XElement item = XElement.Parse(xml);
item = new XElement("Item", item.Descendants().Where(x => x.Value.Length != 0));
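A variation on the snippet above, offered as a sketch rather than a drop-in fix: instead of rebuilding the parent from its descendants (which would flatten nested structure), this removes childless, attribute-free elements in place. Note that it treats <Two /> the same as <Two></Two>, which may or may not be what you want.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

XElement item = XElement.Parse("<Item><One>1</One><Two></Two><Three>3</Three></Item>");

// Remove every descendant that has neither attributes nor child nodes (text or elements).
item.Descendants()
    .Where(e => !e.HasAttributes && !e.Nodes().Any())
    .Remove();

Console.WriteLine(item.ToString(SaveOptions.DisableFormatting));
// <Item><One>1</One><Three>3</Three></Item>
```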
I'm populating an XElement with information and writing it to an xml file using the XElement.Save(path) method. At some point, certain characters in the resulting file are being escaped - for example, > becomes &gt;.
This behaviour is unacceptable, since I need to store information in the XML that includes the > character as part of a password. How can I write the 'raw' content of my XElement object to XML without having these escaped?
Lack of this behavior is unacceptable.
A standalone unescaped > is invalid XML.
XElement is designed to produce valid XML.
If you want to get the unescaped content of the element, use the Value property.
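A minimal sketch of that round trip, using a made-up password value: the serialized form is escaped, but Value returns the original characters after parsing.

```csharp
using System;
using System.Xml.Linq;

var element = new XElement("password", "p>ss<word&1");

// Serialization escapes the reserved characters...
string serialized = element.ToString();
Console.WriteLine(serialized); // <password>p&gt;ss&lt;word&amp;1</password>

// ...and any XML parser undoes the escaping when reading the value back.
string roundTripped = XElement.Parse(serialized).Value;
Console.WriteLine(roundTripped); // p>ss<word&1
```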
The XML specification usually allows > to appear unescaped. XDocument plays it safe and escapes it although it appears in places where the escaping is not strictly required.
You can do a replace on the generated XML. Be aware per http://www.w3.org/TR/REC-xml#syntax, if this results in any ]]> sequences, the XML will not conform to the XML specification. Moreover, XDocument.Parse will actually reject such XML with the error "']]>' is not allowed in character data.".
XDocument doc = XDocument.Parse("<test>Test>Data</test>");
// Don't use this if it could result in any ]]> sequences!
string s = doc.ToString().Replace("&gt;", ">");
System.IO.File.WriteAllText(@"c:\path\test.xml", s);
In consideration that any spec-compliant XML parser must support &gt;, I'd highly recommend fixing the code that is processing the XML output of your program.
I use classes (autogenerated from a schema) to generate xml documents. It has worked fine until now, when I need to use inline HTML elements. I've tried several different methods, but as soon as I use the inline HTML, the "<" and ">" get replaced with &lt; etc.
Example:
<meta>
<name>test</name>
<value>test <br />new row</value>
</meta>
becomes "destroyed" later on when trying to get it as a string for database storage, the value is changed to:
<value>test &lt;br /&gt;new row</value>
How is it possible to keep the angle brackets intact?
You need to use CDATA sections for XML (or XML like) content.
The XML writer is escaping the reserved characters such as <, > etc. If you're reading the text back using an XML reader then your &lt; will be correctly read as <.
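A minimal sketch of the CDATA approach. It uses LINQ to XML rather than the schema-generated classes from the question, which is an assumption made to keep the example self-contained. With generated serializer classes, the member would have to be exposed as a CDATA node in some equivalent way.

```csharp
using System;
using System.Xml.Linq;

// Wrap the markup-like text in a CDATA section so the writer leaves it untouched.
var value = new XElement("value", new XCData("test <br />new row"));

string serialized = value.ToString(SaveOptions.DisableFormatting);
Console.WriteLine(serialized); // <value><![CDATA[test <br />new row]]></value>

// Reading it back yields the original string either way.
Console.WriteLine(XElement.Parse(serialized).Value); // test <br />new row
```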