C# XmlWriter and invalid UTF8 characters

We created a unit test that uses the following methods to generate random UTF8 text:
private static Random _rand = new Random(Environment.TickCount);

public static byte CreateByte()
{
    return (byte)_rand.Next(byte.MinValue, byte.MaxValue + 1);
}

public static byte[] CreateByteArray(int length)
{
    return Repeat(CreateByte, length).ToArray();
}

public static string CreateUtf8String(int length)
{
    return Encoding.UTF8.GetString(CreateByteArray(length));
}

private static IEnumerable<T> Repeat<T>(Func<T> func, int count)
{
    for (int i = 0; i < count; i++)
    {
        yield return func();
    }
}
In sending the random UTF8 strings to our business logic, XmlWriter writes the generated string and can fail with the error:
Test method UnitTest.Utf8 threw exception:
System.ArgumentException: ' ', hexadecimal value 0x0E, is an invalid character.
System.Xml.XmlUtf8RawTextWriter.InvalidXmlChar(Int32 ch, Byte* pDst, Boolean entitize)
System.Xml.XmlUtf8RawTextWriter.WriteAttributeTextBlock(Char* pSrc, Char* pSrcEnd)
System.Xml.XmlUtf8RawTextWriter.WriteString(String text)
System.Xml.XmlUtf8RawTextWriterIndent.WriteString(String text)
System.Xml.XmlWellFormedWriter.WriteString(String text)
System.Xml.XmlWriter.WriteAttributeString(String localName, String value)
We want to support any possible string to be passed in, and need these invalid characters escaped somehow.
XmlWriter already escapes things like &, <, >, etc. How can we deal with other invalid characters, such as control characters?
PS - let me know if our UTF8 generator is flawed (I'm already seeing where I shouldn't let it generate '\0')

The XmlConvert Class has a lot of useful methods (like EncodeName, IsXmlChar, ...) for making sure you're building valid Xml.
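For example, a minimal sketch (assuming .NET 4.0 or later, where XmlConvert.IsXmlChar is available, and that System.Linq is in scope) that strips every character XML 1.0 forbids before writing:
public static string RemoveInvalidXmlChars(string text)
{
    // Keep only characters that are legal in an XML 1.0 document.
    // Note: IsXmlChar(char) rejects lone surrogate halves; if you need
    // to keep valid pairs, check XmlConvert.IsXmlSurrogatePair as well.
    return new string(text.Where(XmlConvert.IsXmlChar).ToArray());
}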

There are two problems:
Not all characters are valid for XML, even escaped. For XML 1.0, the only characters with a Unicode codepoint value of less than 0x0020 that are valid are TAB (0x09), LF (0x0A), and CR (0x0D). See XML 1.0, Section 2.2, Characters.
For XML 1.1, which relatively few systems support, any character except NUL can be escaped in this manner.
Not all sequences of bytes are valid for UTF-8. For example, according to the specification, "The octet values C0, C1, F5 to FF never appear." You would probably be better off just creating strings of characters and ignoring UTF-8, or creating the string and then converting it to UTF-8 and back if you really care about exercising the encoding.
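A minimal sketch of that round trip (the variable names are illustrative): by default the UTF-8 encoder substitutes a replacement character for unpaired surrogates rather than emitting invalid bytes, so whatever comes back is a well-formed string:
string original = "any text, even with a lone surrogate like \uD800";
byte[] utf8 = Encoding.UTF8.GetBytes(original);      // the bad half is replaced here
string roundTripped = Encoding.UTF8.GetString(utf8); // well-formed UTF-16 again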

Your UTF-8 generator appears to be flawed. There are many byte sequences which are invalid UTF-8 encodings.
A better way to generate valid random UTF-8 encodings is to generate random characters, put them into a string and then encode the string to UTF-8.
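A sketch of that approach (names are illustrative; the XmlConvert.IsXmlChar filter also skips surrogate halves and keeps the result safe for XML 1.0):
private static readonly Random _rand = new Random();

public static string CreateRandomXmlString(int length)
{
    var sb = new StringBuilder(length);
    while (sb.Length < length)
    {
        // Draw a random UTF-16 code unit; keep it only if XML 1.0 allows it.
        char c = (char)_rand.Next(char.MinValue, char.MaxValue + 1);
        if (XmlConvert.IsXmlChar(c))
            sb.Append(c);
    }
    return sb.ToString();
}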

Mark points out that not every byte sequence is a valid UTF-8 sequence.
I'd like to add that not every character can exist in an XML document. Only some characters are valid, and this is true even if they are encoded as a numeric character reference.
Update: If you want to encode arbitrary binary data in XML, then use Base64 or some other encoding before writing them to XML.
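For example, a minimal sketch (writer is an illustrative XmlWriter; CreateByteArray is the generator from the question):
byte[] data = CreateByteArray(100);        // arbitrary binary payload
writer.WriteStartElement("payload");
writer.WriteBase64(data, 0, data.Length);  // XmlWriter's built-in Base64 helper
writer.WriteEndElement();
// or, for attribute values:
writer.WriteAttributeString("payload", Convert.ToBase64String(data));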

Related

How to perform multiple Replace calls at once

I have a bit of a weird question at hand. I have a text that's encoded in such a way that each character is replaced by another character, and I'm creating an application that will replace each character with the correct one. But I've come across a problem that I'm having trouble solving. Let me show with an example:
Original text: This is a line.
Encoded text: (.T#*T#*%*=T50;
Now, as I said, each character represents another character, '(' is 'T', '.' is actually a 'h' and so on.
Now I could just go with
string decoded = encoded.Replace('(','T'); //T.T#*T#*%*=T50;
And that will solve one problem, but when I reach the character 'T', which is actually the encoded character 'i', I will have to replace all 'T's with 'i', which means that all previously decoded 'T's (the ones that were once '(') will also change along with the encoded 'T'.
//T.T#*T#*%*=T50; -> i.i#*i#*%*=i50;
In this situation it's obvious that I should've just gone the other way around, first changing 'T' to 'i' and then '(' to 'T', but in the text I'm changing that kind of analysis is not an option.
What's the alternative here that I could do to perform the task correctly?
Thank you!
One possible solution is to not use the string Replace method at all.
Instead, you can create a method which, for every encoded character, outputs the decoded one; then go through your string as through an array of chars, and for every character in that array use the "decryption" method to get the decoded character. Thus you'll receive the decoded string.
For example (using a StringBuilder to create the new string):
private static char Decode(char source)
{
    if (source == '(')
        return 'T';
    else if (source == '.')
        return 'h';
    //.... and so on

    // Fall through for characters that have no mapping,
    // so that every code path returns a value.
    return source;
}
string source = "ABC";
var builder = new StringBuilder();
foreach (var c in source)
    builder.Append(Decode(c));
var result = builder.ToString();
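If there are many mappings, a lookup table reads better than a long if/else chain. Here is a sketch of the same Decode backed by a Dictionary (the two mappings shown are just the hypothetical ones from the question):
private static readonly Dictionary<char, char> _map = new Dictionary<char, char>
{
    { '(', 'T' },
    { '.', 'h' },
    //.... and so on
};

private static char Decode(char source)
{
    // Unmapped characters pass through unchanged.
    char decoded;
    return _map.TryGetValue(source, out decoded) ? decoded : source;
}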
Using .Replace() probably isn't the way to go in the first place since, as you're finding, it covers the whole string every time. And once you've modified the whole string once, the encoding is lost.
Instead, loop over the string one time and replace characters individually.
Create a function that accepts a char and returns the replaced char. For simplicity, I'll just show the signature:
private char Decode(char c);
Then just loop over the string and call that function on each character. LINQ can make short work of that:
var decodedString = new string(encodedString.Select(c => Decode(c)).ToArray());
(This is freehand and untested, but you do need that .ToArray() call: the string constructor takes a char[], not an IEnumerable<char>. You get the idea.)
If it's easier to read you can also just loop manually over the string and perhaps use a StringBuilder with each successive char to build the final decoded result.
Without knowledge of your encryption algorithm, this answer assumes that it's a simple character translation akin to the Caesar Cipher.
Pass in your encrypted string, the method loops over each character, adjusting it by the value of shiftDelta and returns the resulting string.
private string Decrypt(string input)
{
    const int shiftDelta = 10;
    var inputChars = input.ToCharArray();
    var outputChars = new char[inputChars.Length];
    for (var i = 0; i < outputChars.Length; i++)
    {
        // Shift each character up by shiftDelta to reverse the cipher.
        outputChars[i] = (char)(inputChars[i] + shiftDelta);
    }
    // Note: char[].ToString() would just return "System.Char[]";
    // new string(char[]) builds the actual text.
    return new string(outputChars);
}
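A hypothetical usage example, assuming the original encoding shifted every character down by 10 so that shifting up by 10 restores it:
string plain = Decrypt(">[bbe");  // '>'+10='H', '['+10='e', 'b'+10='l', ... yields "Hello"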

Converting from System.String to System.Byte while reading from MySQL

I have a BLOB field which I'm reading from the database. Following is the code:
byte[] bytes;
sdr.Read();
bytes = (byte[])sdr["proposalDoc"];
But the below exception occurs:
"unable to convert from system.string to system.byte"
I wrote the following before noticing your clarification that the value returned as a string is really just a binary blob. If that's correct, then the link provided by the other commenters looks like what you need. But if the "blob" is actually a series of ASCII characters transformed to Unicode (or a stream of bytes where each byte was transformed into a word by setting the high-order byte to 0), then something like the following would apply.
Assuming that the field returned by sdr["proposalDoc"] is really just an ASCII string converted to Unicode, and that all you're trying to do is reconstruct the ASCII byte string (nul-terminated), you could do something like the following. (Note, there may be more optimal ways of doing this, but this could get you started.)
// Get record field (the cast from object is required to compile).
string tempString = (string)sdr["proposalDoc"];

// Create byte array to hold one ASCII character per Unicode character
// in the field, plus a terminating nul.
bytes = new byte[tempString.Length + 1];

// Copy each character from the field to the byte array,
// keeping the low byte of the character.
int i = 0;
foreach (char c in tempString)
{
    bytes[i++] = (byte)c;
}

// Store the terminating nul character, assuming a
// nul-terminated ASCII string is the desired outcome.
bytes[i] = 0;
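A shorter equivalent sketch under the same one-byte-per-character assumption: Latin-1 (ISO-8859-1) maps every character U+0000 through U+00FF to the identical byte value, so the encoder can do the low-byte copy for you (though characters above U+00FF would become '?' here, where the loop above keeps their low byte):
// The trailing '\0' preserves the nul terminator from the version above.
byte[] bytes = Encoding.GetEncoding("ISO-8859-1").GetBytes(tempString + '\0');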

How to fix encoding, I'm getting ASCII values of 63 when it should be regular whitespace

In my C# code I am extracting text from a PDF, but the text it gives back contains some weird characters. If I search for "CLE action" when I know the text "CLE action" is in the PDF document, it returns false; I found out that after extracting the text, the space between the two words has an ASCII byte value of 63 (a question mark)...
Is there a quick way to fix the encoding on the text?
Currently I am using this method, but I think it's slow and only works for that one character. Is there some fast method that works for all characters?
public static string fix_encoding(string src)
{
    StringWriter return_str = new StringWriter();
    byte[] byte_array = Encoding.ASCII.GetBytes(src); // Substring(0, src.Length) was a no-op
    int len = byte_array.Length;
    byte byt;
    for (var i = 0; i < len; i += 1)
    {
        byt = byte_array[i];
        if (byt == 63)
        {
            return_str.Write(" ");
        }
        else
        {
            return_str.Write(Encoding.ASCII.GetString(byte_array, i, 1));
        }
    }
    return return_str.ToString();
}
This is how I call this method:
StringWriter output = new StringWriter();
output.WriteLine(PdfTextExtractor.GetTextFromPage(reader, page, new SimpleTextExtractionStrategy()));
currentText = fix_encoding(output.ToString());
It is possible that the spaces you extract from the PDF file are not real spaces (" "), but other kinds of spaces defined in Unicode, for example an "em space" or a "non-breaking space"; see this list or here for an overview.
If the extracted text contains such a space and you search the text for a normal space, you won't find it, because it is not identical.
Your fix_encoding function converts the string to ASCII. None of these unusual kinds of spaces exist in ASCII, and by default non-ASCII characters are converted to a question mark. So in your fix_encoding function you see a question mark, even though the original text has a different character.
This means that in your fix_encoding function, you should not convert to ASCII, but replace unusual spaces with a normal space. The following function replaces all non-ASCII characters with a space, but you could also use Char.IsWhiteSpace to decide which characters to replace, as shown in the sketch after the code.
public static string remove_non_ascii(string src)
{
    // Replace every character outside the ASCII range with a plain space.
    // (The original had #"..."; C# verbatim strings use the @ prefix.)
    return Regex.Replace(src, @"[^\u0000-\u007F]", " ");
}
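An alternative sketch using Char.IsWhiteSpace, which normalizes every Unicode space variant to a plain space while leaving other non-ASCII characters untouched:
public static string normalize_spaces(string src)
{
    var sb = new StringBuilder(src.Length);
    foreach (char c in src)
        sb.Append(char.IsWhiteSpace(c) ? ' ' : c);  // em space, NBSP, etc. become ' '
    return sb.ToString();
}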

How to prevent conversion of Windows-1252 argument into a Unicode string?

I've written my first COM classes. My unit tests work fine, but my first use of the COM objects has hit a snag.
The COM classes provide methods which accept a string, manipulate it and return a string. The consumer of the COM objects is a dBASE PLUS program.
When the input string contains common keyboard characters (ASCII 127 or lower), the COM methods work fine. However, if the string contains characters beyond the ASCII range, some of them get remapped from Windows-1252 to C#'s Unicode. This table shows the mapping that takes place: http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1252.TXT
For example, if the dBASE program calls the COM object with:
oMyComObject.MyMethod("It will cost€123") where the € is hex 80,
the C# method receives it as Unicode:
public string MyMethod(string source)
{
// source is Unicode and now the Euro symbol is hex 20AC
...
}
I would like to avoid this remapping because I want the original hex content of the string.
I've tried adding the following to MyMethod to convert the string back to Windows-1252, but the Euro symbol gets lost because it becomes a question mark:
byte[] UnicodeBytes = Encoding.Unicode.GetBytes(source.ToString());
byte[] Win1252Bytes = Encoding.Convert(Encoding.Unicode, Encoding.GetEncoding(1252), UnicodeBytes);
string Win1252 = Encoding.GetEncoding(1252).GetString(Win1252Bytes);
Is there a way to prevent this conversion of the "source" parameter to Unicode? Or, is there a way to convert it 100% from Unicode back to Windows-1252?
Yes, I'm answering my own question. The answer by "Jigsore" put me on the right track, but I want to explain more clearly in case someone else makes the same mistake I made.
I eventually figured out that I had misdiagnosed the problem. dBASE was passing the string fine and C# was receiving it fine. It was how I checked the contents of the string that was in error.
This turnkey example builds on Jigsore's answer:
void Main()
{
    string unicodeText = "\u20AC\u0160\u0152\u0161";
    byte[] unicodeBytes = Encoding.Unicode.GetBytes(unicodeText);
    byte[] win1252bytes = Encoding.Convert(Encoding.Unicode, Encoding.GetEncoding(1252), unicodeBytes);
    for (int i = 0; i < win1252bytes.Length; i++)
        Console.Write("0x{0:X2} ", win1252bytes[i]); // output: 0x80 0x8A 0x8C 0x9A

    // win1252String represents the string passed from dBASE to C#
    string win1252String = Encoding.GetEncoding(1252).GetString(win1252bytes);
    Console.WriteLine("\r\nWin1252 string is " + win1252String); // output: Win1252 string is €ŠŒš

    Console.WriteLine("looking at the code of the first character the wrong way: " + (int)win1252String[0]);
    // output: looking at the code of the first character the wrong way: 8364

    byte[] bytes = Encoding.GetEncoding(1252).GetBytes(win1252String[0].ToString());
    Console.WriteLine("looking at the code of the first character the right way: " + bytes[0]);
    // output: looking at the code of the first character the right way: 128

    // Warning: If your input contains character codes larger in value than what a byte
    // can hold (ex: multi-byte Chinese characters), then you will need to look at more than just bytes[0].
}
The reason the first method was wrong is that casting (int)win1252String[0] (or the converse, casting an integer j to a character with (char)j) involves an implicit conversion into the Unicode character set C# uses internally, so you get the Unicode codepoint (8364, i.e. 0x20AC) rather than the original Windows-1252 byte.
I consider this resolved and would like to thank each person who took the time to comment or answer for their time and trouble. It is appreciated!
Actually you're doing the Unicode to Win-1252 conversion correctly, but you're performing an extra step. The original Win1252 codes are in the Win1252Bytes array!
Check the following code:
string unicodeText = "\u20AC\u0160\u0152\u0161";
byte[] unicodeBytes = Encoding.Unicode.GetBytes(unicodeText);
byte[] win1252bytes = Encoding.Convert(Encoding.Unicode, Encoding.GetEncoding(1252), unicodeBytes);
for (int i = 0; i < win1252bytes.Length; i++)
    Console.Write("0x{0:X2} ", win1252bytes[i]);
The output shows the Win-1252 codes for the unicodeText string, you can check this by looking at the CP1252.TXT table.

Base64 in C# and PL/SQL?

In PL/SQL, how can I convert a string (a long HTML string with newlines, tags, etc.) to Base64 in a way that is easy to decode in C#?
In C# there are:
Convert.ToBase64String()
Convert.ToBase64CharArray()
BitConverter.ToString()
Which one is compatible with PL/SQL's utl_encode.base64_encode()?
I welcome any other suggestions :-)
You'll probably want to use this method:
Convert.ToBase64String()
It returns a Base64-encoded String based on an array of unsigned 8-bit integers (bytes).
As an alternative, you can use Convert.ToBase64CharArray(), but the output is a character array, which is a bit odd but may be useful in certain circumstances.
The method BitConverter.ToString() returns a String, but the bytes are represented in hexadecimal, not Base64-encoded.
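A quick sketch of the difference between the two (the comments show what these calls produce for this input):
byte[] data = { 1, 2, 3 };
string b64 = Convert.ToBase64String(data);  // "AQID" (Base64)
string hex = BitConverter.ToString(data);   // "01-02-03" (hex pairs, dash-separated)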
I did it :-)
PL/SQL
s1 varchar2(32767);
s2 varchar2(32767);
-- encode:
s2 := utl_raw.cast_to_varchar2(utl_encode.base64_encode(utl_raw.cast_to_raw(s1)));
-- decode:
s2 := utl_raw.cast_to_varchar2(utl_encode.base64_decode(utl_raw.cast_to_raw(s1)));
are compatible with C#
public static string ToBase64(string str)
{
    return Convert.ToBase64String(Encoding.UTF8.GetBytes(str));
}
//++++++++++++++++++++++++++++++++++++++++++++++
public static string FromBase64(string str)
{
    return Encoding.UTF8.GetString(Convert.FromBase64String(str));
}
hope you find it useful :-)
