SQL Server UDF SQLCLR Call Converts Characters Into Question Marks - c#

I've found nothing on Google or SO that quite lines up with my issue.
In SQL Server, I have a scalar function (we'll call it dbo.MySqlStringFunction).
What this function does is call a utility written in C# that calls an ASP.Net view and returns the HTML as a SqlString.
The function definition in SQL Server is:
RETURNS [nvarchar](max) WITH EXECUTE AS CALLER
AS EXTERNAL NAME [Utils.UserDefinedFunctions].[MySqlStringFunction]
The C# code simplified is:
var request = (HttpWebRequest)WebRequest.Create("http://www.mydomain.com");
using (var response = (HttpWebResponse)request.GetResponse())
using (var stream = response.GetResponseStream())
{
    using (var streamReader = new StreamReader(stream, Encoding.UTF8))
    {
        return new SqlString(streamReader.ReadToEnd());
    }
}
When I put the C# code into a console app and run it, I get everything exactly as it should be.
When I access the URL directly in my browser, it displays exactly as it should be.
When I do SELECT MySqlStringFunction() however, characters such as ™, §, ¤ display as 2 or 3 question marks each.
It appears that something is going wonky somewhere between the return new SqlString(..) and the SQL function returning the value, but I'm at a loss as to what it could be.

It seems that the issue was the location of the return. The current code (shown in the question) returns from the middle of 3 using blocks, one of which is the UTF-8 stream being read. This probably confused things, as SQLCLR memory is isolated from the main SQL Server memory and you usually can't return via a stream. It is best to close the open stream first and let the using blocks call Dispose(), as sketched below. Hence:
Create a string above the first using (i.e. string _TempReturn = String.Empty;)
Inside the inner-most using, replace return with: _TempReturn = streamReader.ReadToEnd();
Below the last using closing bracket, add: return new SqlString(_TempReturn);
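Putting those three steps together, a minimal sketch of the reworked method (reusing the request code from the question) would look something like this:
string _TempReturn = String.Empty;

var request = (HttpWebRequest)WebRequest.Create("http://www.mydomain.com");
using (var response = (HttpWebResponse)request.GetResponse())
using (var stream = response.GetResponseStream())
{
    using (var streamReader = new StreamReader(stream, Encoding.UTF8))
    {
        // Read everything while the stream is still open...
        _TempReturn = streamReader.ReadToEnd();
    }
}   // ...and let the using blocks call Dispose() before returning.

return new SqlString(_TempReturn);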
(old answer, will remove in the near future)
The problem is with the encoding difference between the web page and SQL Server. You are using Encoding.UTF8 for the web page (which is quite likely correct given that UTF-8 is the most common encoding for the interwebs), but SQL Server (along with .NET and Windows in general) is UTF-16 Little Endian. This is why you are getting 2 or 3 ?s for each character above Code Point 127: UTF-8 is a multi-byte encoding that uses 1 to 4 bytes per character (2 or 3 for the characters in question), whereas UTF-16 is always 2 bytes (well, supplementary characters are 4 bytes, but that is due to being a pair of double-byte values).
You need to convert the encoding to UTF-16 Little Endian before, or as, you pass back the stream. And, UTF-16 Little Endian is the Unicode encoding in .NET, while Big Endian Unicode refers to "UTF-16 Big Endian". So you want to convert to the Unicode encoding.
OR, it could be the reverse: that the web page is NOT UTF-8, in which case you have declared it incorrectly in the StreamReader. If this is true, then you need to specify the correct encoding in the StreamReader constructor.
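For example, if the page were actually a single-byte code page rather than UTF-8 (the name below is purely illustrative):
// Only if the page is genuinely not UTF-8; "Windows-1252" here is just an example.
using (var streamReader = new StreamReader(stream, Encoding.GetEncoding("Windows-1252")))
{
    // ...
}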

Related

WE8DEC (or MCS) encoding in c#

Okay, I have this big .NET project which uses multiple databases and a lot of already-written queries. The databases all use WE8DEC as the character set; until now, all the data was Latin and there was no problem.
But I now have the task of using a new database, again in WE8DEC, which stores Russian data written in Cyrillic. Using a tool like DBeaver, it shows data like ÇÎËÎÒÀ�Å instead of the actual Cyrillic text.
I know I can retrieve the byte data directly from the database using the DUMP function and then convert the bytes myself.
WORD | DUMP(WORD)
ÇÎËÎÒÀ�Å | Typ=1 Len=9: 199,206,203,206,210,192,208,197,194
But I don't feel like duplicating/altering all my queries and the way I retrieve the results in C#; I have a place just before sending the data as JSON where I could simply re-encode all the strings before sending them.
So I was looking for a way to retrieve the bytes just like in Oracle, and found one using this line of code:
byte[] bytes = Encoding.GetEncoding("Windows-1252").GetBytes(word);
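Put together, the re-encoding step I have in mind would look roughly like this (the second encoding is a stand-in only, since finding the right one is exactly my question):
// Round trip: recover the original single bytes from the mis-decoded string,
// then decode them with whatever the real .NET equivalent of WE8DEC turns out to be.
byte[] bytes = Encoding.GetEncoding("Windows-1252").GetBytes(word);
string reencoded = Encoding.GetEncoding("ISO-8859-5").GetString(bytes); // <- stand-in only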
But my main problem is this: I can't find any exact equivalent of Oracle's WE8DEC encoding in .NET; Windows-1252 is the closest I've found (but still incorrect).
So the question: is there an exact equivalent of WE8DEC, also called MCS, in C#?

Python/C#, Reading File into Byte Array - Not Quite the Same Result

I'm attempting to read a file and process it in both C# and IronPython, but I'm running into a slight problem.
When I read the file in either language, I get a byte array that's almost identical, but not quite.
For instance, the array has 1552 bytes. They're all the same except for one thing. Any time the value "10" appears in the Python implementation, the value "13" appears in the C# implementation. Aside from that, all other bytes are the same.
Here's roughly what I'm doing to get the bytes:
Python:
f = open('C:\myfile.blah')
contents = f.read()
bytes = bytearray(contents, 'cp1252')
C#:
var bytes = File.ReadAllBytes(@"C:\myfile.blah");
Perhaps I'm choosing the wrong encoding? Though I wouldn't think so, since the Python implementation behaves as I would expect and processes the file successfully.
Any idea what's going on here?
(I don't know Python.) But it looks like you need to pass the 'rb' flag:
open('C:\myfile.blah', 'rb')
Reference:
On Windows, 'b' appended to the mode opens the file in binary mode, so
there are also modes like 'rb', 'wb', and 'r+b'. Python on Windows
makes a distinction between text and binary files; the end-of-line
characters in text files are automatically altered slightly when data
is read or written.
Note that the values 10 and 13 give clues as to what the problem is:
Line feed is 10 in decimal and Carriage return is 13 in decimal.

Printing time to binary file in C# .net

I've got the following problem.
I have a binary file to which I write vital system data.
One of the fields is time, for which I use DateTime.Now.ToString("HHmmssffffff"), i.e. formatted down to microseconds. I convert this string with ToCharArray() (and checked it while debugging; it is fine): it contains a valid time down to the microseconds.
Then I write it and flush it to the file. When opening the file with PSPad, which translates binary to ASCII, I see that the data is corrupted in this field and another one, but the rest of the message is fine.
The code:
void Write(string strData) {
    char[] cD = strData.ToCharArray();
    bw.Write(cD);   // bw is of type BinaryWriter
    bw.Flush();
}
You're writing out the data as Unicode characters, not ASCII bytes. If you want ASCII bytes, you should change this to use the Encoding class.
byte[] data = Encoding.ASCII.GetBytes(strData);
bw.Write(data);
I strongly recommend reading Joel Spolsky's article on character sets and encoding. It may help you understand why your current code is not working properly.
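Put together, a minimal sketch of the corrected Write method (assuming bw is the same BinaryWriter field as in the question) would be:
void Write(string strData)
{
    // Encode the string as ASCII bytes so a plain-ASCII viewer shows it correctly.
    byte[] data = Encoding.ASCII.GetBytes(strData);
    bw.Write(data);   // bw is the BinaryWriter from the question
    bw.Flush();
}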

Reading an mbox file in C#

One of our staff members has lost his mailbox but luckily has a dump of his email in mbox format. I need to somehow get all the messages inside the mbox file and squirt them into our tech support database (as it's a custom tool there are no import tools available).
I've found SharpMimeTools, which breaks down a message but does not allow you to iterate through a bunch of messages in an mbox file.
Does anyone know of a decent parser that's open, without me having to learn the RFC and write one myself?
I'm working on a MIME & mbox parser in C# called MimeKit.
It's based on earlier MIME & mbox parsers I've written (such as GMime) which were insanely fast (could parse every message in a 1.2GB mbox file in about 1 second).
I haven't tested MimeKit for performance yet, but I am using many of the same techniques in C# that I used in C. I suspect it'll be slower than my C implementation, but since the bottleneck is I/O and MimeKit is written to do optimal (4k) reads like GMime is, they should be pretty close.
The reasons you are finding your current approach (StreamReader.ReadLine(), combining the text, then passing it off to SharpMimeTools) to be slow are the following:
StreamReader.ReadLine() is not a very optimal way of reading data from a file. While I'm sure StreamReader() does internal buffering, it needs to do the following steps:
A) Convert the block of bytes read from the file into Unicode (this requires iterating over the byte[] read from disk to produce a Unicode char[]).
B) Then it needs to iterate over its internal char[], copying each char into a StringBuilder until it finds a '\n'.
So right there, with just reading lines, you have at least 2 passes over your mbox input stream. Not to mention all of the memory allocations going on...
Then you combine all of the lines you've read into a single mega-string. This requires another pass over your input (copying every char from each string read from ReadLine() into a StringBuilder, presumably?).
We are now up to 3 iterations over the input text and no parsing has even happened yet.
Now you hand off your mega-string to SharpMimeTools which uses a SharpMimeMessageStream which... (/facepalm) is a ReadLine()-based parser that sits on top of another StreamReader that does charset conversion. That makes 5 iterations before anything at all is even parsed. SharpMimeMessageStream also has a way to "undo" a ReadLine() if it discovers it has read too far. So it is reasonable to assume that he is scanning over some of those lines at least twice. Not to mention all of the string allocations going on... ugh.
For each header, once SharpMimeTools has its line buffer, it splits into field & value. That's another pass. We are up to 6 passes so far.
SharpMimeTools then uses string.Split() (which is a pretty good indication that this mime parser is not standards compliant) to tokenize address headers by splitting on ',' and parameterized headers (such as Content-Type and Content-Disposition) by splitting on ';'. That's another pass. (We are now up to 7 passes.)
Once it splits those it runs a regex match on each string returned from the string.Split() and then more regex passes per rfc2047 encoded-word token before finally making another pass over the encoded-word charset and payload components. We're talking at least 9 or 10 passes over much of the input by this point.
I give up going any farther with my examination because it's already more than 2x as many passes as GMime and MimeKit need and I know my parsers could be optimized to make at least 1 less pass than they do.
Also, as a side-note, any MIME parser that parses strings instead of byte[] (or sbyte[]) is never going to be very good. The problem with email is that so many mail clients/scripts/etc in the wild will send undeclared 8bit text in headers and message bodies. How can a unicode string parser possibly handle that? Hint: it can't.
using (var stream = File.OpenRead ("Inbox.mbox")) {
    var parser = new MimeParser (stream, MimeFormat.Mbox);

    while (!parser.IsEndOfStream) {
        var message = parser.ParseMessage ();

        // At this point, you can do whatever you want with the message.
        // As an example, you could save it to a separate file based on
        // the message subject:
        message.WriteTo (message.Subject + ".eml");

        // You also have the ability to get access to the mbox marker:
        var marker = parser.MboxMarker;

        // You can also get the exact byte offset in the stream where the
        // mbox marker was found:
        var offset = parser.MboxMarkerOffset;
    }
}
2013-09-18 Update: I've gotten MimeKit to the point where it is now usable for parsing mbox files and have successfully managed to work out the kinks, but it's not nearly as fast as my C library. This was tested on an iMac so I/O performance is not as good as it would be on my old Linux machine (which is where GMime is able to parse similar sized mbox files in ~1s):
[fejj@localhost MimeKit]$ mono ./mbox-parser.exe larger.mbox
Parsed 14896 messages in 6.16 seconds.
[fejj@localhost MimeKit]$ ./gmime-mbox-parser larger.mbox
Parsed 14896 messages in 3.78 seconds.
[fejj@localhost MimeKit]$ ls -l larger.mbox
-rw-r--r-- 1 fejj staff 1032555628 Sep 18 12:43 larger.mbox
As you can see, GMime is still quite a bit faster, but I have some ideas on how to improve the performance of MimeKit's parser. It turns out that C#'s fixed statements are quite expensive, so I need to rework my usage of them. For example, a simple optimization I did yesterday shaved about 2-3s from the overall time (if I remember correctly).
Optimization Update: Just improved performance by another 20% by replacing:
while (*inptr != (byte) '\n')
    inptr++;
with:
do {
    // XOR each of the 4 bytes in the word with '\n' (0x0A); any byte that was
    // '\n' becomes zero, and the bit-twiddling below sets its high bit in mask.
    mask = *dword++ ^ 0x0A0A0A0A;
    mask = ((mask - 0x01010101) & (~mask & 0x80808080));
} while (mask == 0);

// Back up to the word that contained the '\n' and finish byte-by-byte.
inptr = (byte*) (dword - 1);

while (*inptr != (byte) '\n')
    inptr++;
Optimization Update: I was able to finally make MimeKit as fast as GMime by switching away from my use of Enum.HasFlag() and using direct bit masking instead.
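As a rough illustration of the kind of change (the enum and method here are made up for the example, not MimeKit's actual internals):
[Flags]
enum ParserState { None = 0, SawNewline = 1, MidLine = 2 }

static bool HasNewline (ParserState state)
{
    // Before: Enum.HasFlag () boxes its argument, which adds up in a hot parser loop:
    //     return state.HasFlag (ParserState.SawNewline);

    // After: direct bit masking, no boxing:
    return (state & ParserState.SawNewline) != 0;
}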
MimeKit can now parse the same mbox stream in 3.78s.
For comparison, SharpMimeTools takes more than 20 minutes (to test this, I had to split the emails apart into separate files because SharpMimeTools can't parse mbox files).
Another Update: I've gotten it down to 3.00s flat via various other tweaks throughout the code.
I don't know any parser, but mbox is really a very simple format. A new email begins on lines starting with "From " (From+Space) and an empty line is attached to the end of each mail. Should there be any occurrence of "From " at the beginning of a line in the email itself, this is quoted out (by prepending a '>').
Also see Wikipedia's entry on the topic.
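If you wanted to roll a quick-and-dirty splitter along those lines (no MIME parsing, just cutting the file at "From " lines), a minimal sketch might look like this; the method name and the handling of ">From " quoting are just illustrative:
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

static IEnumerable<string> SplitMbox(string path)
{
    var current = new StringBuilder();
    foreach (var line in File.ReadLines(path))
    {
        // A line starting with "From " marks the beginning of a new message.
        if (line.StartsWith("From ", StringComparison.Ordinal) && current.Length > 0)
        {
            yield return current.ToString();
            current.Clear();
        }

        // Lines quoted as ">From " inside a message body get un-escaped.
        current.AppendLine(line.StartsWith(">From ", StringComparison.Ordinal)
            ? line.Substring(1)
            : line);
    }

    if (current.Length > 0)
        yield return current.ToString();
}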
If you can stretch to using Python, there is one in the standard library. I'm unable to find any for .NET sadly.
To read .mbox files, you can use a third-party library, Aspose.Email.
This library is a complete set of email processing APIs to build cross-platform applications with the ability to create, manipulate, convert, and transmit emails without using Microsoft Outlook.
Please take a look at the example I have provided below.
using (FileStream stream = new FileStream("ExampleMbox.mbox", FileMode.Open, FileAccess.Read))
{
    using (MboxrdStorageReader reader = new MboxrdStorageReader(stream, false))
    {
        // Start reading messages
        MailMessage message = reader.ReadNextMessage();

        // Read all messages in a loop
        while (message != null)
        {
            // Manipulate message - show contents
            Console.WriteLine("Subject: " + message.Subject);

            // Save this message in EML or MSG format
            message.Save(message.Subject + ".eml", SaveOptions.DefaultEml);
            message.Save(message.Subject + ".msg", SaveOptions.DefaultMsgUnicode);

            // Get the next message
            message = reader.ReadNextMessage();
        }
    }
}
It is easy to use. I hope this approach will satisfy you and other searchers.
I am working as a Developer Evangelist at Aspose.

Is there a better way to convert to ASCII from an arbitrary input?

I need to be able to take an arbitrary text input that may have a byte order mark (BOM) on it to mark its encoding, and output it as ASCII. We have some old tools that don't understand BOMs and I need to send them ASCII-only data.
Now, I just got done writing this code and I just can't quite believe the inefficiency here. Four copies of the data, not to mention any intermediate buffers internally in StreamReader. Is there a better way to do this?
// i_fileBytes is an incoming byte[]
string unicodeString = new StreamReader(new MemoryStream(i_fileBytes)).ReadToEnd();
byte[] unicodeBytes = Encoding.Unicode.GetBytes(unicodeString.ToCharArray());
byte[] ansiBytes = Encoding.Convert(Encoding.Unicode, Encoding.ASCII, unicodeBytes);
string ansiString = Encoding.ASCII.GetString(ansiBytes);
I need the StreamReader() because it has an internal BOM detector to choose the encoding to read the rest of the file. Then the rest is just to make it convert into the final ASCII string.
Is there a better way to do this?
If you've got i_fileBytes in memory already, you can just check whether or not it starts with a BOM, and then convert either the whole of it or just the bit after the BOM using Encoding.Unicode.GetString. (Use the overload which lets you specify an index and length.)
So as code:
int start = (i_fileBytes[0] == 0xff && i_fileBytes[1] == 0xfe) ? 2 : 0;
string text = Encoding.Unicode.GetString(i_fileBytes, start, i_fileBytes.Length-start);
Note that that assumes a genuinely little-endian UTF-16 encoding, however. If you really need to detect the encoding first, you could either reimplement what StreamReader does, or perhaps just build a StreamReader from the first (say) 10 bytes, and use the CurrentEncoding property to work out what you should use for the encoding.
EDIT: Now, as for the conversion to ASCII - if you really only need it as a .NET string, then presumably all you want to do is replace any non-ASCII characters with "?" or something similar. (Alternatively it might be better to throw an exception... that's up to you, of course.)
EDIT: Note that when detecting the encoding, it would be a good idea to just call Read() a single time to read one character. Don't call ReadToEnd() as by picking 10 bytes as an arbitrary amount of data, it might end mid-character. I don't know offhand whether that would throw an exception, but it has no benefits anyway...
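Putting that together, a rough sketch of the detect-then-convert approach (this assumes the input genuinely starts with whatever BOM the reader detects; otherwise skipping the preamble length would eat real data):
// Let StreamReader sniff the BOM from the first few bytes.
Encoding detected;
using (var reader = new StreamReader(new MemoryStream(i_fileBytes)))
{
    reader.Read();                     // a single Read() is enough to trigger BOM detection
    detected = reader.CurrentEncoding;
}

// Skip the BOM (assumed present) and decode with the detected encoding.
int bom = detected.GetPreamble().Length;
string text = detected.GetString(i_fileBytes, bom, i_fileBytes.Length - bom);

// Lossy conversion: anything outside ASCII becomes '?'.
string ansiString = Encoding.ASCII.GetString(Encoding.ASCII.GetBytes(text));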
System.Text.Encoding.ASCII.GetBytes(new StreamReader(new MemoryStream(i_fileBytes)).ReadToEnd())
That should save a few round-trips.
