We have an old MFC C++ application that writes into a SQLite database and another C# application that reads from that database.
The C++ application writes "шаг потока работы" (Cyrillic characters) in the database from a Russian Windows computer.
When the same C++ application reads it on a Western European (Latin) Windows computer, it reads it as "øàã ïîòîêà ðàáîòû" (Latin representation).
When the C# application reads it, it reads it as "��� ������ ����" (Unicode representation).
None of these applications specify encoding type in the database. I want to read the original text in C#.
I couldn't find a proper way to specify the Encoding type when reading the text.
I've tried connection strings such as Data Source=c:\mydb.db;Version=3;UTF8Encoding=True; but no luck so far.
Also I tried to get the byte array from "��� ������ ����" and convert to Cyrillic but failed.
Does anyone happen to know how to read the original Russian text back from a SQLite database?
All the normal functions in the SQLite C API use UTF-8. The C# SQLite driver automatically converts between UTF-8 and the C# string encoding.
If you do not get correct data from your C# program, then it's likely that the C++ application did not actually write UTF-8. This is confirmed by the fact that the C++ application gives different results with different code pages.
If possible, fix the C++ application, or convert the data in the database from the original encoding to UTF-8. As a last resort, you could change your C# application to read all the strings as blobs and then convert them from the original encoding to UTF-8. (In any case, you need to know what the original encoding is.)
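As a rough sketch of that last resort, assuming the C++ app wrote the text with the Russian ANSI code page (Windows-1251) and that the C# side uses the System.Data.SQLite provider (the table and column names below are placeholders):

using System;
using System.Data.SQLite;   // System.Data.SQLite provider (assumption)
using System.Text;

// Read the column as raw bytes and decode them with the code page the C++
// app actually used. Windows-1251 is only a guess for Russian Windows;
// verify it against your data.
using (var conn = new SQLiteConnection(@"Data Source=c:\mydb.db;Version=3;"))
{
    conn.Open();
    var cmd = new SQLiteCommand("SELECT CAST(MyTextColumn AS BLOB) FROM MyTable", conn);
    using (var reader = cmd.ExecuteReader())
    {
        while (reader.Read())
        {
            byte[] raw = (byte[])reader[0];
            string text = Encoding.GetEncoding(1251).GetString(raw);
            Console.WriteLine(text);   // should show the original Cyrillic
        }
    }
}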
Related
The SPSS DLL I have does not support encoding. It reads French and German OK, but any double-byte language string, e.g. Chinese, Japanese, or Russian, reads back as "???".
// At the binary level, the buffer actually contains "3f" for each "?". This may be the result of a failed Unicode conversion.
When we write into a .sav file, the interface only allows us to set the interface encoding (not the file-level encoding or code page). I suspect the DLL may not support double-byte languages.
I have created an SPSSDocument to write data but cannot find any encoding option while creating the file.
Can anyone help me with this issue?
Working with DynamoDB and AWS (.NET C#). For some reason, when saving, strings containing "é" get the "é" replaced with a question mark.
How can I prevent it from happening?
DynamoDB stores strings in UTF-8 encoding. Somewhere in your application you must be encoding that string as something other than UTF-8.
I'm using Java (which uses UTF-16). I don't do anything special when storing strings. I just tried storing and retrieving "é" in DynamoDB using the Java SDK and there was no problem.
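This is not the DynamoDB SDK itself, just a minimal C# sketch of the failure mode described above: pushing a non-ASCII string through a 7-bit encoding is one common way "é" ends up stored as "?".

using System;
using System.Text;

string original = "café";
// 'é' has no ASCII representation, so it is replaced by '?'
string mangled = Encoding.ASCII.GetString(Encoding.ASCII.GetBytes(original)); // "caf?"
// UTF-8 round-trips the character cleanly
string intact  = Encoding.UTF8.GetString(Encoding.UTF8.GetBytes(original));   // "café"
Console.WriteLine($"{mangled} vs {intact}");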
Just to add to the previous answer, your IDE will generally have encoding settings. You'll want to change the string encoding for your project to UTF-8 to minimize the chances of an encoding error, which can produce what looks like an unknown string.
For example, in Eclipse editors, you can see this answer to change your encoding.
I have several folders with files; some folders contain non-Latin symbols in their names (Russian in my case). These folders are added to a zip archive (via Windows Explorer) at "D:\test.zip".
Then I execute the method
ZipFile.ExtractToDirectory(@"D:\test.zip", @"D:\result");
and it successfully unzips all content, but all non-Latin symbols turn into something wrong.
For example, instead of "D:\result\каскады\file.txt" I got "D:\result\Є бЄ ¤л\file.txt".
The default encoding of my system is windows-1251, which I verified by passing Encoding.GetEncoding("windows-1251") as the third parameter of ExtractToDirectory and getting the same result. I also tried UTF-8, but got different artifacts inside the path ("D:\result\��᪠��\file.txt"). Trying Unicode returned a message about the encoding not being supported.
When I create the same archive through code by executing the method
ZipFile.CreateFromDirectory(@"D:\zipdata", @"D:\test.zip");
everything then unzips fine with the same line of code as at the top of the question, even without specifying a particular encoding.
The question is: how do I get the correct encoding from the archive to apply in the ExtractToDirectory method, given that in the real task the archive comes from an external source and I cannot rely on whether it was created by hand or programmatically?
Edit
There is a question where non-Latin symbols (Chinese) also cause problems, but there that fact was presented as the resolution of the question, whereas for my situation it is exactly the problem.
There is no formally standardized ZIP specification. However, the de facto standard is the PKZIP "application note" document, which as of 2006 documents only code page 437 ("OEM United States") and UTF8 as legal text encodings for file entries in the archive:
D.1 The ZIP format has historically supported only the original IBM PC character
encoding set, commonly referred to as IBM Code Page 437. This limits storing
file name characters to only those within the original MS-DOS range of values
and does not properly support file names in other character encodings, or
languages. To address this limitation, this specification will support the
following change.
D.2 If general purpose bit 11 is unset, the file name and comment should conform
to the original ZIP character encoding. If general purpose bit 11 is set, the
filename and comment must support The Unicode Standard, Version 4.1.0 or
greater using the character encoding form defined by the UTF-8 storage
specification. The Unicode Standard is published by the The Unicode
Consortium (www.unicode.org). UTF-8 encoded data stored within ZIP files
is expected to not include a byte order mark (BOM).
In other words, it's a bug in any ZIP authoring tool to use any text encoding other than code page 437 or UTF8. Based on your experience, it appears Windows Explorer has this bug. :(
Unfortunately, the "general purpose bit 11" is the only official mechanism for indicating the actual text encoding used in the archive, and this allows only for either the original 437 code page or UTF8. Even this bit was not supported by .NET until .NET 4.5. In any case, even since that time it is not possible for .NET or any other ZIP archive-aware software to reliably determine a non-standard, unsupported encoding used to encode the file entry names in the archive.
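For what it's worth, you can peek at that flag yourself. A rough sketch, assuming the archive starts with an ordinary local file header (which Explorer-created archives do):

using System;
using System.IO;

// Check whether the UTF-8 flag (general purpose bit 11, mask 0x0800) is set
// on the first entry's local file header (signature PK\x03\x04).
using (var fs = File.OpenRead(@"D:\test.zip"))
using (var br = new BinaryReader(fs))
{
    uint signature = br.ReadUInt32();            // 0x04034b50 = local file header
    if (signature == 0x04034b50)
    {
        ushort versionNeeded = br.ReadUInt16();  // skip "version needed to extract"
        ushort flags = br.ReadUInt16();          // general purpose bit flag
        bool utf8 = (flags & 0x0800) != 0;
        Console.WriteLine(utf8
            ? "Entry names are declared as UTF-8"
            : "Entry names use code page 437 or some undeclared ANSI/OEM page");
    }
}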
However, you can, if the source machine used to create the archive is known and available, determine the default code page installed on that machine, via the CultureInfo class. The following expression will return the code page identifier installed on the machine where the expression is executed (assuming the process hasn't changed its current culture from the default, of course):
System.Globalization.CultureInfo.CurrentCulture.TextInfo.OEMCodePage
This gives you the code page ID that can be passed to Encoding.GetEncoding(Int32) to retrieve an Encoding object that can then be passed to the appropriate ZipArchive constructor when opening an existing archive, to ensure that the file entry names are decoded correctly.
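A sketch along those lines (the OEMCodePage expression returns the local machine's value, which only helps if the local and source machines use the same code page):

using System.IO.Compression;
using System.Text;

// On .NET Core / .NET 5+ you may first need:
// Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
int oemCodePage = System.Globalization.CultureInfo.CurrentCulture.TextInfo.OEMCodePage;
Encoding entryNameEncoding = Encoding.GetEncoding(oemCodePage);

// Decode entry names with that code page while extracting
ZipFile.ExtractToDirectory(@"D:\test.zip", @"D:\result", entryNameEncoding);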
If you are unable to retrieve the actual text encoding from the machine that is the origin of the archive, then you're stuck enumerating the encodings, trying each one until you find one that reports entry names in a legible format.
As I understand it, Windows 8 and later can support the UTF8 flag in the ZIP archive file. I haven't tried it, but it's possible that such versions of Windows also write archives using that flag. If so, that would (one hopes) mitigate the pain of the earlier Windows bug.
Finally note that a custom tool could record the encoding in a special file entry placed in the archive itself. Of course, only that tool would be able to recognize the special file and use it to determine the correct encoding (the tool would have to open the archive twice: once to retrieve the file, and then a second time once the tool has determined the encoding). This is not an ideal solution and of course is no help for archives created by Windows Explorer. I mention it only for the sake of completeness.
I am exporting a file via an HTTP GET response, using ASP.NET Web API.
For that, I am returning a FileContentResult object, as in:
return File(Encoding.UTF8.GetBytes(fileContents.ToString()), "text/plain; charset=UTF-8");
After several minutes stuck on encoding issues, I used Google's Advanced REST Client to perform the GET against the Web API controller's action, and the file downloads just fine.
Well, not exactly. I originally wanted it to be sent/downloaded as a .csv file.
If I set the HTTP request's content type to "text/csv" and the File() call sets the response's content type to "text/csv" as well, Advanced REST Client will show the contents properly, but Excel will open it as gibberish data.
If I simply change the content type to "text/plain", save it as a .txt file (I have to rename it after saving; I don't know why it is being saved as _.text-plain, while as a csv it is being saved with a .csv extension), and finally perform an import in Excel as described here Excel Import Text Wizard, then Excel opens the file correctly.
Why is the .csv opened as gibberish, while the .txt is not? For opening a .csv there is no import wizard like there is for a .txt file (not that I am aware of).
Providing a bit of the source below:
StringBuilder fileContents = new StringBuilder();
//csv header
fileContents.AppendLine(String.Join(CultureInfo.CurrentCulture.TextInfo.ListSeparator, fileData.Select(fileRecord => fileRecord.Name)));
//csv records
foreach (ExportFileField fileField in fileData)
fileContents.AppendLine(fileField.Value);
return File(Encoding.UTF8.GetBytes(fileContents.ToString()), "text/plain; charset=UTF-8");
As requested, the binary contents of both files.
The text-plain (.txt) version (the one that will open in excel, using import):
and the .csv one (the one that excel will open with junk data):
(The files are the same; the cropping of the screenshots was not.)
I was able to reproduce the issue by saving a file containing Greek characters with BOM. Double clicking attempts to import the file using the system's locale (Greek). When manually importing, Excel detects the codepage and offers to use the 65001 (UTF8) codepage.
This behavior is strange but not a bug. Text files contain no indication that would help detect their codepage, nor is it possible to guess. An ASCII file containing only A-Z characters saved as 1252 is identical to one saved using 1253. That's why Windows uses the system codepage, which is the locale used for all non-Unicode programs and files.
When you double click on a text file, Excel can't ask you for the correct encoding - this could get tedious very quickly. Instead, it opens the file using your regional settings and the system codepage. ASCII files created on your machine are saved using your system's codepage so this behaviour is logical. Files given to you by non-programmers will probably be saved using your country's codepage as well. Programmers typically switch everything to US English and that's how problems start. Your REST client may have saved the text as ASCII using the Latin encoding used by most programmers.
When you import the text file to an empty sheet though, Excel can ask you what to do. It tries to detect the codepage by checking for a BOM or a codepage that may match the file's contents, and presents the guess in the import dialog box, together with a preview. The decimal and column separators are still those provided by your regional settings (it can't guess those). UTF8 is generally easy to guess - the file starts with a BOM or contains NUL entries.
ASCII codepages are harder though. Saving my Greek file as ASCII results in a Japanese guess. That's English humour for you I guess.
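A small illustration of the point above: the same byte values decode to different characters under different ANSI code pages, and nothing in the file says which one is right (the Greek sample string is just an example).

using System;
using System.Text;

// On .NET Core / .NET 5+, register CodePagesEncodingProvider before calling
// GetEncoding for these ANSI pages.
byte[] bytes = Encoding.GetEncoding(1253).GetBytes("αβγ");          // Greek ANSI
Console.WriteLine(Encoding.GetEncoding(1253).GetString(bytes));     // αβγ (right guess)
Console.WriteLine(Encoding.GetEncoding(1252).GetString(bytes));     // áâã (wrong guess)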
To my surprise, trying to perform the request via a browser instead of using Google's Advanced REST Client, clicking on the file that is downloaded just works! Excel opens it correctly. So the problem must be with ARC.
In any case, since the process is not going to be done using an HTTP client other than a browser... my problem is gone. Again, in ARC's output screen the file is displayed correctly. I do not know why, upon clicking it to open it in Excel, it "gets corrupted".
Strange.
The binary contents of the file show a correctly UTF-8 encoded CSV file with Hebrew characters. If, as you state in the comments, Excel does not allow you to change its guessed file encoding when opening a CSV file, that is rather a misbehavior in Excel itself (call it a bug if you want).
Your options are: use LibreOffice (http://www.libreoffice.org/), whose spreadsheet component does allow you to customize the settings for opening a CSV file.
Another is to write a small program to explicitly convert your file to the encoding Excel is expecting; if you have a Python 3 interpreter installed, you could for example type:
python -c "open('correct.csv', 'wt', encoding='cp1255').write(open('utf8.csv', encoding='utf8').read())"
However, if your default Windows encoding is not cp1255 for handling Hebrew, as I assume above, that won't help Excel; it will just give you different gibberish :-) In that case, you should resort to using programs that can correctly deal with different encodings.
(NB. there is a Python call to return the default system encoding in Windows, but I forgot which it is, and it is not easily googleable)
I have a C++ program that sends data via FTP in ASCII mode to an IBM mainframe. I am now doing this via C#.
When it gets there and is viewed, the file looks like garbage.
I cannot see anything in the C++ code that does anything special to encode the file into something like EBCDIC. When the C++ files are sent, they are viewed OK. The only difference I see is \015 & \012 for line endings, whereas C# is using \r\n.
Would these characters have an effect and if so how can I get my C# app to use \015?
Do I have to do any special encoding to make it appear ok?
It sounds like you should indeed be using an EBCDIC encoding, and then probably transferring the text in binary. I have an EBCDIC encoding class you can use, should you wish.
Note that \015\012 is \r\n - they're characters 13 and 10 in decimal, just different ways of representing them. If you think the C++ code really is producing the same files as C#, compare two files which should be the same in a binary file editor.
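A minimal sketch of that idea using the EBCDIC code pages built into .NET rather than a custom class; IBM037 (US/Canada EBCDIC) is an assumption, so pick the code page your mainframe actually uses. The resulting bytes would then be transferred in binary mode.

using System.IO;
using System.Text;

// On .NET Core / .NET 5+, register CodePagesEncodingProvider first.
string text = "HELLO MAINFRAME\r\n";
Encoding ebcdic = Encoding.GetEncoding("IBM037");   // EBCDIC code page 37
File.WriteAllBytes(@"C:\outbox\report.ebcdic", ebcdic.GetBytes(text));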
Make sure you issue an ASCII-mode command (TYPE A) instead of a binary-mode command (TYPE I) before you transfer the file.
If you are truly sending the files in ASCII mode, then the mainframe itself will convert that to EBCDIC (it's receiver-makes-good).
The fact that you're getting apparent garbage at the mainframe end, and character codes \015 and \012 (which are CR and LF respectively) means that you're not transferring in ASCII mode.
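A hedged sketch with FtpWebRequest: setting UseBinary = false requests an ASCII (TYPE A) transfer, so the server side performs the ASCII-to-EBCDIC translation as described above. The host, credentials, and paths are placeholders.

using System.IO;
using System.Net;

var request = (FtpWebRequest)WebRequest.Create("ftp://mainframe.example.com/'MY.DATA.SET'");
request.Method = WebRequestMethods.Ftp.UploadFile;
request.Credentials = new NetworkCredential("user", "password");
request.UseBinary = false;   // ASCII mode instead of the default binary mode

byte[] data = File.ReadAllBytes(@"C:\outbox\report.txt");
using (Stream stream = request.GetRequestStream())
{
    stream.Write(data, 0, data.Length);
}
request.GetResponse().Close();   // complete the transfer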
As an aside, the ISPF editor has been able to view ASCII data sets for quite a few versions now. Open up the file and enter the commands source ascii and lf.
The first converts the characters from ASCII to EBCDIC so you can see what they are; the second goes through and pads out "lines" so that linefeed markers are replaced with enough spaces to reach the record length.
Invaluable commands when dealing with mixed-encoding environments, which is where I do a lot of my work.