& in XElement - c#

I want to generate <element>&amp;</element> using System.Xml.Linq.XElement.
I tried this: new XElement("element", "&amp;") but it escapes the ampersand and it generates: <element>&amp;amp;</element>
The only workaround I can think of is to create a custom class that inherits from XText, override the WriteTo method and use the XmlWriter => WriteEntityRef method. It seems to me that this is a bit of an overkill. Is there another way of doing this?

You've got the solution in comment already, just some background:
You must understand the difference between the content as it is (outside the XML) and its representation within the XML. Outside of the XML you see the plain content; when it is written into the XML it is escaped automatically, and on reading it is unescaped again.
I tried this: new XElement("element", "&amp;") but it escapes the ampersand and it generates: <element>&amp;amp;</element>
This shows clearly what's going on. By passing in &amp; you get &amp;amp;. The engine sees the & and replaces it with the entity.
Just use new XElement("element", "&") which should get you the result needed.

As suggested in the comments and in the answer, new XElement("element", "&") relies on the framework to escape the ampersand correctly.
Part of the issue I had with this was that I was initially trying to put &nbsp; in my XML element, without much success. I was unaware that unless my XML has a DTD which defines &nbsp;, I can't use &nbsp;.
Since then, I updated my question to use &amp; instead, and this in turn changed the behaviour of XElement because, unlike a space, & needs escaping in XML, and LINQ to XML does this automatically (as expected).
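To make the escaping round trip concrete, here is a minimal sketch written as C# top-level statements (C# 9 or later assumed). The variable names are only for illustration:

```csharp
using System;
using System.Xml.Linq;

// Pass the plain character; LINQ to XML escapes it when serializing.
string serialized = new XElement("element", "&").ToString();

// Passing already-escaped text escapes the ampersand a second time.
string doubled = new XElement("element", "&amp;").ToString();

// Reading the XML back restores the plain character.
string roundTripped = XElement.Parse(serialized).Value;

Console.WriteLine(serialized);   // <element>&amp;</element>
Console.WriteLine(doubled);      // <element>&amp;amp;</element>
Console.WriteLine(roundTripped); // &
```

The point is that the string you hand to XElement is the content, not its XML representation.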

Related

How to write '&' in xml?

I am using xmlTextWriter to create the xml.
writer.WriteStartElement("book");
writer.WriteAttributeString("author", "j.k.rowling");
writer.WriteAttributeString("year", "1990");
writer.WriteString("&");
writer.WriteEndElement();
But now I need to write '&', but XmlTextWriter automatically writes it as "&amp;".
So is there any workaround for this?
I am creating the XML by reading a doc file. So if I read "-", then in the XML I need to write "&ndash;", but it is written as "&amp;ndash;".
So, for example, if I am trying to write a node with the text good-bad, I actually need to write my XML as <node>good&ndash;bad</node>. This is a requirement of my project.
In a proper XML file, you cannot have a standalone & character unless it is an escape character. So if you need an XML node to contain good&ndash;bad, then it will have to be encoded as good&amp;ndash;bad. There is no workaround, as anything different would not be valid XML. The only way to make it work is to write the XML file as plain text exactly how you want it, but then it cannot be read by an XML parser because it is not proper XML.
Here's a code example of my suggested workaround (you didn't specify a language, so I am showing you in C#, but Java should have something similar):
using (var sw = new StreamWriter(stream))
{
    // other code to write XML-like data
    sw.WriteLine("<node>good&ndash;bad</node>");
    // other code to write XML-like data
}
As you discovered, another option is to use the WriteRaw() method on XmlTextWriter (in C#), which writes an unencoded string, but that does not change the fact that the result is not going to be a valid XML file.
But as I mentioned, if you tried to read this with an XML parser, it would fail because &ndash; is not a valid XML character entity, so it is not valid XML.
&ndash; is an HTML character entity, so escaping it in XML should not normally be necessary.
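As a rough check of the point about &ndash;, the sketch below (C# top-level statements, illustrative variable names) parses the same text twice. One version uses the numeric character reference &#x2013;, which needs no DTD, and the other uses the named HTML entity, which plain XML does not declare:

```csharp
using System;
using System.Xml;
using System.Xml.Linq;

// A numeric character reference parses to U+2013 (en dash) without any DTD.
string viaNumericReference = XElement.Parse("<node>good&#x2013;bad</node>").Value;

// The named HTML entity is undeclared in plain XML, so the parser rejects it.
bool namedEntityRejected = false;
try
{
    XElement.Parse("<node>good&ndash;bad</node>");
}
catch (XmlException)
{
    namedEntityRejected = true;
}

Console.WriteLine(viaNumericReference == "good\u2013bad");
Console.WriteLine(namedEntityRejected);
```

If the consumer is a real XML parser, writing the dash itself or its numeric reference is likely the safer route than forcing &ndash; into the file.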
In the XML language, & is the escape character, so &amp; is the appropriate string representation of &. You cannot use just a & character because the & character has a special meaning, and therefore a single & character would be misinterpreted by the parser.
You will see similar behavior with the <, >, ", and ' characters. All have meaning within the XML language, so they must be escaped if you need to represent them as text in a document.
Here's a reference to all of the character entities in XML (and HTML) from Wikipedia. Each will always be represented by the escape character and the name (&gt;, &lt;, &quot;, &apos;).
In XML, & must be escaped as &amp;. The & character is reserved for entities and thus not allowed otherwise. Entities are used to escape characters with special meanings in XML.
Any other software reading the XML has to decode the entity again. Other examples are &lt; for < and &gt; for >; some related markup languages, such as HTML, provide many more of these.
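For illustration, the five predefined XML entities can be seen in one go with System.Security.SecurityElement.Escape, which replaces exactly these characters. This is only a sketch as C# top-level statements with a made-up sample string; an XmlWriter does the equivalent escaping for you when you call WriteString:

```csharp
using System;
using System.Security;

// Hypothetical sample containing all five characters that XML treats specially.
string raw = "<a> & \"b\" 'c'";

// Escape replaces <, >, ", ' and & with their predefined entities.
var escaped = SecurityElement.Escape(raw);

Console.WriteLine(escaped); // &lt;a&gt; &amp; &quot;b&quot; &apos;c&apos;
```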
I think you will need to encode it. Like so:
colTest = "&"
writer.WriteEncodedText(colTest)

Remove strange hidden characters from my JSON before deserializing

I have some JSON being sent to me that breaks when it is trying to be deserialized. It seems to contain a black diamond with a ? in it. I cannot see the character but it is obviously there and it is failing on my system.
How do I get rid of this and still leave my JSON intact for deserialization?
UPDATE:
Here is a example of what will be in the middle of my JSON:
"UDF5" : "�65",
I am even open to just removing this property from my JSON altogether via RegEx.
As answered for: remove piece of string (JSON string ) with regex and based on the formatting you provide in that question (and I am assuming will edit into this one):
Assuming I can rely on the formatting you show above and it is one of these per regex being run this can be accomplished as simply as something like
([\S\s]*\"])\"UDF5\" : \"[\S\s]*?\",([\S\s]*)
Using the back reference $1$2 referencing the parts before and after the UDF5 field to write back out.
If there is a newline there to remove I am not doing it right now. This could be better - if someone else has time to correct or provide an additional answer. But in the interests of getting you an emergency fix I hope this helps.
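A narrower variant of the same idea is sketched below as C# top-level statements. The payload and the neighbouring property names (UDF4, UDF6) are made up. The pattern assumes that the UDF5 value contains no escaped quotes and that the property is followed by a comma, as in the sample from the question:

```csharp
using System;
using System.Text.Json;
using System.Text.RegularExpressions;

// Hypothetical payload; \uFFFD is the replacement character shown as a black diamond.
string json = "{ \"UDF4\" : \"ok\", \"UDF5\" : \"\uFFFD65\", \"UDF6\" : \"ok\" }";

// Drop the UDF5 property together with the comma and whitespace that follow it.
string cleaned = Regex.Replace(json, "\"UDF5\"\\s*:\\s*\"[^\"]*\"\\s*,\\s*", "");

// Parsing the result confirms it is still well-formed JSON.
bool stillValid = JsonDocument.Parse(cleaned).RootElement.TryGetProperty("UDF6", out _);

Console.WriteLine(cleaned);
Console.WriteLine(stillValid);
```

If UDF5 can be the last property in its object, the trailing-comma assumption breaks and the pattern would need adjusting.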

RSS Feeds replacing characters?

I am currently creating a RSS feed linked to a custom built news column. The news column uses a series of query strings in order to direct the user to a specific post or posts! However the problem I am facing is that the rss feed is replacing some of these query strings with random numbers. For instance:
http://www.correlatesearch.com/news.aspx?cat=BusinessManagementControls&nw=
&nw= is being replaced with
&
Can anyone direct to a way around this??
Many thanks!
My guess is that you're looking at the raw RSS, which is XML. Within XML, & has to be escaped as &amp;. This is far from "random numbers".
I suspect you'll find that &nw= is actually being escaped to &amp;nw=, in which case it's not actually changing your content at all. It's representing the text of your URL in an XML-appropriate way. When the XML is read by a client, it will (or should) understand it appropriately.
Is the feed an XML document? Then the replacement should take place. It is called escaping character entities.
And I don't see any "random numbers" that you referred to...
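To see that the escaping does not change the URL itself, here is a small sketch as C# top-level statements. The "link" element name is just an assumption standing in for whatever element the feed uses:

```csharp
using System;
using System.Xml.Linq;

string url = "http://www.correlatesearch.com/news.aspx?cat=BusinessManagementControls&nw=";

// In the raw XML the ampersand appears in escaped form.
string rawXml = new XElement("link", url).ToString();

// A client that parses the XML gets the original URL back.
string readBack = XElement.Parse(rawXml).Value;

Console.WriteLine(rawXml);
Console.WriteLine(readBack == url);
```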

How to prevent illegal characters to appear in my XML when retrieving it from SQL Server

Sometimes the string values of Properties in my Classes become odd. They contain illegal characters and are displayed like this (with boxes):
123[]45[]6789
I'm assuming those are illegal/unrecognized characters. I serialize all my objects to XML and then upload them via Web Service. When I retrieve them again, some characters are replaced with oddities. This happens most often with hyphens and dashes that have been typed using Word. Is that the cause of it?
Is there any way I can check to see if the string contains any of these unrecognized characters via regex or something?
The first thing to remember, is that there is no such thing as a "special character" or an "illegal character". There are characters that are special in certain circumstances, there are non-characters, but there are no generally "special characters" or "illegal characters".
What you have here is either:
Perfectly normal characters for which your font doesn't have a glyph.
Perfectly normal characters that aren't printable (e.g. control characters).
An artefact of how the debugger works.
The first thing is to find out what that character is. Find the integer value of the character, and then look it up.
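One way to find those integer values is sketched below as C# top-level statements. The sample string is hypothetical: an en dash and a control character stand in for whatever your data really contains.

```csharp
using System;
using System.Linq;

// Hypothetical sample: "123", an en dash (U+2013), "45", a control character (U+0002), "6789".
string sample = "123\u201345\u00026789";

// List every character outside printable ASCII as a U+XXXX code point.
var suspects = sample
    .Where(c => c < 0x20 || c > 0x7E)
    .Select(c => $"U+{(int)c:X4}")
    .ToList();

Console.WriteLine(string.Join(", ", suspects)); // U+2013, U+0002
```

This inspects UTF-16 code units, so characters outside the Basic Multilingual Plane would show up as two surrogate values.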
An important one to look out for is U+FFFD (�), as it is sometimes used when a decoder has received a sequence of bytes that makes no sense in the encoding it is trying to use. For example, 0x80 followed by 0x20 makes no sense in UTF-8, and one possible response is to use U+FFFD as a "something strange here" marker. Other possible responses are throwing an error, silently ignoring the error, or trying to guess at the intent, though those last two bring security issues.
Once you have this figured out, you can begin to reason about why it's getting in there if it isn't expected. Could it be an encoding issue (the charset written in is not the charset read in)? Could it actually be intended to be there? Could it be something else? You can't begin to answer that until you have more information on the bug.
Finally, there's the matter of what to do about it. This will hopefully be obvious from the answers you've found in your research above. Possibly the answer will be "nothing it's fine", possibly something simple or something hard. Can't say yet.
Do not just filter with a regular expression. Maybe that will turn out to be the correct solution, but you don't know yet, so maybe you're making a deeper bug harder to find than it is now, or damaging perfectly good data.
Personally I don't think using a Regex to check for these characters is the correct solution. If you aren't storing those characters then there is obviously some sort of encoding issue.
Verify that the XML document itself is stored using the correct encoding to support the characters you need to store. Then verify that when you read the file in, you are using the same encoding as the document, i.e. if your XML document is stored as UTF-8, then you need to make sure that you are decoding it as UTF-8 when you read it in.
Take a deeper look at the characters themselves: what are the actual char values?
When a character shows up as a square, it means it can't be represented visually. This is either because it's a non-visual character, or because it's outside of your current char set.
edit, nope
In your example I'd venture a guess that you're seeing embedded newline characters.
Define the allowed characters and block everything else, i.e.:
// only lowercase letters and digits
if (Regex.IsMatch(yourString, @"^[a-z0-9]*$"))
{
    // allowed
}
But I think your problem may lie somewhere else, because you say it comes from serializing (valid) strings and then deserializing (invalid) strings. It is possible that you use default serialization and that you don't apply a proper ISerializable implementation for your classes (or proper use of the Serializable attributes), resulting in properties or fields being serialized that you don't want to be serialized.
PS: others have mentioned encoding issues, which is a possible cause and might mean you cannot read back the data at all. About encoding there's one simple rule: use the same encoding everywhere (streams, database, xml) and be specific. If you are not, the default encoding is used, which can be different from system to system.
Edit: possible solution
Based on new information (see thread under original question), it is pretty clear that the issue has to do with encoding. The OP mentions that it appears with dashes, which are often replaced with pretty dashes like "—" (&mdash;) when used in some fancy editing environment. Since there seems to be some uncertainty about how to get SQL Server to accept properly encoded strings, you can also solve this in your XML.
When you create your XML, simply change the encoding to the most basic possible (US-ASCII). This will automatically force the XML writer to use the proper numerical entities. When you deserialize, this will be properly parsed in your strings without further ado. Something along these lines:
Stream stream = new MemoryStream();
XmlWriterSettings settings = new XmlWriterSettings();
settings.Encoding = Encoding.ASCII;
XmlWriter writer = XmlWriter.Create(stream, settings);
// make sure to output the xml-prolog header
But beware of using StringBuilder or StringWriter, because they are fixed to UTF-16, and the XmlWriter will always write in that encoding, which is not compatible with SQL Server (more info on that issue at my blog).
Note: when using the ASCII encoding, any character higher than 0x7F will be encoded. So, é will look like &#xE9; and the dash may look like &#x2014;, but this means just the same and you should not worry about that. Every XML-capable tool will properly interpret this input.
Note 2: the location where you want to change the way XML is written is the Web Service you talk of, that receives XML and then stores it into the SQL Server database. Before storing into SQL Server, the change must be applied. Earlier on in the chain is useless.
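A fuller sketch of the ASCII approach, written as C# top-level statements, is below. The element name "value" and the sample text are made up. The code only checks that the output bytes are plain ASCII and that the text survives a round trip, since the exact form of the character references is up to the writer:

```csharp
using System;
using System.IO;
using System.Linq;
using System.Text;
using System.Xml;
using System.Xml.Linq;

// Hypothetical text containing é (U+00E9) and an em dash (U+2014).
string original = "caf\u00E9 \u2014 bar";

var settings = new XmlWriterSettings { Encoding = Encoding.ASCII };
var stream = new MemoryStream();
using (XmlWriter writer = XmlWriter.Create(stream, settings))
{
    // Characters the encoding cannot represent are written as character references.
    writer.WriteElementString("value", original);
}

byte[] bytes = stream.ToArray();
bool pureAscii = bytes.All(b => b < 0x80);

// Loading the bytes back decodes the references to the original characters.
string restored = XElement.Load(new MemoryStream(bytes)).Value;

Console.WriteLine(Encoding.ASCII.GetString(bytes));
Console.WriteLine(pureAscii && restored == original);
```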
// requires: using System.IO; using System.Xml; using System.Xml.Serialization;
public static T DeserializeFromXml<T>(string xml)
{
    T result;
    XmlSerializerFactory serializerFactory = new XmlSerializerFactory();
    XmlSerializer serializer = serializerFactory.CreateSerializer(typeof(T));
    using (StringReader sr3 = new StringReader(xml))
    {
        XmlReaderSettings settings = new XmlReaderSettings()
        {
            // default value is true; false skips the check for characters outside the legal XML range
            CheckCharacters = false
        };
        using (XmlReader xr3 = XmlReader.Create(sr3, settings))
        {
            result = (T)serializer.Deserialize(xr3);
        }
    }
    return result;
}

How can I generate XML with CR, instead of CRLF in XmlTextWriter

I'm generating XML via XmlTextWriter.
The file looks good to my eyes, validates (at W3C), and was accepted by the client.
But a client vendor is complaining that the line-endings are CRLF, instead of just CR.
Well, I'm on a Win32 machine using C#, and CRLF is the Win32 standard line-ending.
Is there any way to change the line-endings in the XmlTextWriter?
Also -- shouldn't the line-endings not matter to a proper XML parser?
see also: What are carriage return, linefeed, and form feed?
NOTE: looks like the only answer is a sideways solution -- you have to use the XmlWriter instead of the XmlTextWriter
of course, moments after asking, I find a clue on MSDN (that I couldn't find via google) that refers to XmlWriterSettings.NewLineChars
which then led me to the unaccepted answer on SO: Writing XMLDocument to file with specific newline character (c#)
It's all in the terminology.....
Use the XmlWriterSettings to set what you want as your end of line char.
XmlWriterSettings mySettings = new XmlWriterSettings();
mySettings.NewLineChars = "\r";
XmlWriter writer = XmlWriter.Create(
    new StreamWriter(@"c:\temp\hello.xml"), mySettings);
I don't know where end of line characters would matter. I haven't run into it before.
What line ending is used should not matter to a properly implemented parser (see the spec), I quote (emphasis mine):
To simplify the tasks of applications, the XML processor must behave as if it normalized all line breaks in external parsed entities (including the document entity) on input, before parsing, by translating both the two-character sequence #xD #xA and any #xD that is not followed by #xA to a single #xA character.
Therefore, you should be fine with the way you have it right now. You might want to ask what the client vendor is actually doing there, chances are that they are Doing it Wrong.
Use the XmlWriterSettings.NewLineChars property.
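For reference, a minimal sketch of NewLineChars in use, written as C# top-level statements. The element names are made up, and Indent is switched on only so that the writer actually emits line breaks between elements:

```csharp
using System;
using System.IO;
using System.Xml;

var settings = new XmlWriterSettings
{
    Indent = true,              // makes the writer emit line breaks between elements
    NewLineChars = "\r",        // use a bare CR instead of the platform default
    OmitXmlDeclaration = true
};

var output = new StringWriter();
using (XmlWriter writer = XmlWriter.Create(output, settings))
{
    writer.WriteStartElement("root");
    writer.WriteElementString("child", "value");
    writer.WriteEndElement();
}

string xml = output.ToString();
Console.WriteLine(xml.Contains("\r") && !xml.Contains("\n"));
```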
