I have several folders with files; some of the folder names contain non-Latin (in my case Russian) symbols. These folders are packed into a zip archive (by Windows Explorer) at "D:\test.zip".
Then I execute the method
ZipFile.ExtractToDirectory(@"D:\test.zip", @"D:\result");
and it successfully unzips all the content, but all non-Latin symbols come out wrong.
For example, instead of "D:\result\каскады\file.txt" I get "D:\result\Є бЄ ¤л\file.txt".
The default encoding of my system is windows-1251, which I verified by passing Encoding.GetEncoding("windows-1251") as the third parameter of ExtractToDirectory and getting the same result. I also tried UTF-8, but got different artifacts in the path ("D:\result\��᪠��\file.txt"). Trying Unicode returns a message about an unsupported encoding.
When I create the same archive in code by executing the method
ZipFile.CreateFromDirectory(@"D:\zipdata", @"D:\test.zip");
everything then unzips fine with the same line of code as at the top of the question, even without specifying a particular encoding.
The question is: how do I determine the correct encoding of an archive so I can pass it to ExtractToDirectory, given that in the real task the archive comes from an external source and I cannot rely on whether it was created by hand or programmatically?
Edit
There is a question where non-Latin (Chinese) symbols also cause problems, but there that behavior was presented as the resolution of the question, whereas it is exactly the problem in my situation.
There is no formally standardized ZIP specification. However, the de facto standard is the PKZIP "application note" document, which as of 2006 documents only code page 437 ("OEM United States") and UTF8 as legal text encodings for file entries in the archive:
D.1 The ZIP format has historically supported only the original IBM PC character
encoding set, commonly referred to as IBM Code Page 437. This limits storing
file name characters to only those within the original MS-DOS range of values
and does not properly support file names in other character encodings, or
languages. To address this limitation, this specification will support the
following change.
D.2 If general purpose bit 11 is unset, the file name and comment should conform
to the original ZIP character encoding. If general purpose bit 11 is set, the
filename and comment must support The Unicode Standard, Version 4.1.0 or
greater using the character encoding form defined by the UTF-8 storage
specification. The Unicode Standard is published by the The Unicode
Consortium (www.unicode.org). UTF-8 encoded data stored within ZIP files
is expected to not include a byte order mark (BOM).
In other words, it's a bug in any ZIP authoring tool to use any text encoding other than code page 437 or UTF8. Based on your experience, it appears Windows Explorer has this bug. :(
Unfortunately, the "general purpose bit 11" is the only official mechanism for indicating the actual text encoding used in the archive, and this allows only for either the original 437 code page or UTF8. Even this bit was not supported by .NET until .NET 4.5. In any case, even since that time it is not possible for .NET or any other ZIP archive-aware software to reliably determine a non-standard, unsupported encoding used to encode the file entry names in the archive.
However, you can, if the source machine used to create the archive is known and available, determine the default code page installed on that machine, via the CultureInfo class. The following expression will return the code page identifier installed on the machine where the expression is executed (assuming the process hasn't changed its current culture from the default, of course):
System.Globalization.CultureInfo.CurrentCulture.TextInfo.OEMCodePage
This gives you the code page ID that can be passed to Encoding.GetEncoding(Int32) to retrieve an Encoding object that can then be passed to the appropriate ZipArchive constructor when opening an existing archive, to ensure that the file entry names are decoded correctly.
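Putting those pieces together, a minimal sketch (paths borrowed from the question; on .NET Core/.NET 5+ you would first need to register CodePagesEncodingProvider to make the OEM code pages available):

using System.Globalization;
using System.IO.Compression;
using System.Text;

// Decode entry names using this machine's default OEM code page.
int oemCodePage = CultureInfo.CurrentCulture.TextInfo.OEMCodePage;
Encoding entryNameEncoding = Encoding.GetEncoding(oemCodePage);

// ExtractToDirectory accepts the entry-name encoding as its third parameter.
ZipFile.ExtractToDirectory(@"D:\test.zip", @"D:\result", entryNameEncoding);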
If you are unable to retrieve the actual text encoding from the machine that is the origin of the archive, then you're stuck enumerating the encodings, trying each one until you find one that reports entry names in a legible format.
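A hedged sketch of that brute-force enumeration (the candidate code-page list here is an assumption; adjust it to the languages you expect):

// Print the entry names under each candidate code page so a human
// can pick the one that looks legible.
foreach (int codePage in new[] { 437, 866, 1251, 65001 })
{
    Console.WriteLine("--- code page {0} ---", codePage);
    using (var archive = ZipFile.Open(@"D:\test.zip", ZipArchiveMode.Read, Encoding.GetEncoding(codePage)))
    {
        foreach (var entry in archive.Entries)
            Console.WriteLine(entry.FullName);
    }
}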
As I understand it, Windows 8 and later can support the UTF8 flag in the ZIP archive file. I haven't tried it, but it's possible that such versions of Windows also write archives using that flag. If so, that would (one hopes) mitigate the pain of the earlier Windows bug.
Finally note that a custom tool could record the encoding in a special file entry placed in the archive itself. Of course, only that tool would be able to recognize the special file and use it to determine the correct encoding (the tool would have to open the archive twice: once to retrieve the file, and then a second time once the tool has determined the encoding). This is not an ideal solution and of course is no help for archives created by Windows Explorer. I mention it only for the sake of completeness.
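For completeness, a sketch of that convention using a hypothetical marker entry named "__encoding" (an ASCII-only name decodes identically under code page 437 and the common ANSI/OEM code pages, so the marker itself can always be read):

// Writing side of the hypothetical convention:
using (var archive = ZipFile.Open(@"D:\test.zip", ZipArchiveMode.Update))
{
    var marker = archive.CreateEntry("__encoding");
    using (var writer = new StreamWriter(marker.Open()))
        writer.Write("1251"); // code page used for this archive's entry names
}
// The reading tool opens the archive once (any encoding) to read "__encoding",
// then reopens it with Encoding.GetEncoding(recordedCodePage).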
Related
I am exporting a file via an HTTP GET response, using ASP.NET Web API.
For that, I am returning a FileContentResult object, as in:
return File(Encoding.UTF8.GetBytes(fileContents.ToString()), "text/plain; charset=UTF-8");
After being stuck on encoding issues for several minutes, I used Google's Advanced REST Client to perform the GET against the Web API controller's action, and the file downloads just fine.
Well, not exactly. I originally wanted it to be sent/downloaded as a .csv file.
If I set the HTTP request's content type to "text/csv", and the File() call likewise sets the response's content type to "text/csv", Advanced REST Client will show the contents properly, but Excel will open it as gibberish.
If I simply change the content type to "text/plain", save it as a .txt file (I have to rename it after saving; I don't know why it is saved as _.text-plain, while as a CSV it is saved with a .csv extension), and finally perform an import in Excel as described here: Excel Import Text Wizard, then Excel opens the file correctly.
Why is the .csv opened as gibberish, while the .txt is not? For opening a .csv there is no import wizard like there is for a .txt file (not that I am aware of).
Providing a bit of the source below:
StringBuilder fileContents = new StringBuilder();
//csv header
fileContents.AppendLine(String.Join(CultureInfo.CurrentCulture.TextInfo.ListSeparator, fileData.Select(fileRecord => fileRecord.Name)));
//csv records
foreach (ExportFileField fileField in fileData)
    fileContents.AppendLine(fileField.Value);
return File(Encoding.UTF8.GetBytes(fileContents.ToString()), "text/plain; charset=UTF-8");
As requested, the binary contents of both files were posted as screenshots (omitted here): the text/plain (.txt) version, which opens in Excel via import, and the .csv version, which Excel opens as junk data. The files themselves are identical; only the cropping of the screenshots differed.
I was able to reproduce the issue by saving a file containing Greek characters with a BOM. Double-clicking attempts to import the file using the system's locale (Greek). When importing manually, Excel detects the codepage and offers to use the 65001 (UTF8) codepage.
This behavior is strange but not a bug. Text files contain no indication that would help detect their codepage, nor is it possible to guess it reliably. An ASCII file containing only A-Z characters saved as 1252 is identical to one saved using 1253. That's why Windows uses the system codepage, which is the locale used for all non-Unicode programs and files.
When you double click on a text file, Excel can't ask you for the correct encoding - this could get tedious very quickly. Instead, it opens the file using your regional settings and the system codepage. ASCII files created on your machine are saved using your system's codepage so this behaviour is logical. Files given to you by non-programmers will probably be saved using your country's codepage as well. Programmers typically switch everything to US English and that's how problems start. Your REST client may have saved the text as ASCII using the Latin encoding used by most programmers.
When you import the text file to an empty sheet though, Excel can ask you what to do. It tries to detect the codepage by checking for a BOM or a codepage that may be matching the file's contents and presents the guess in the import dialog box, together with a preview. The decimal and column separators are still those provided by your regional settings (can't guess those). UTF8 is generally easy to guess - the file starts with a BOM or contains NUL entries.
ASCII codepages are harder though. Saving my Greek file as ASCII results in a Japanese guess. That's English humour for you I guess.
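As an aside not in the original answer: since Excel's double-click path keys off the BOM, one hedged workaround for the question's code would be to prepend the UTF-8 preamble to the payload:

// fileContents is the StringBuilder from the question's code; Concat needs System.Linq.
byte[] bom = Encoding.UTF8.GetPreamble();                       // EF BB BF
byte[] body = Encoding.UTF8.GetBytes(fileContents.ToString());
return File(bom.Concat(body).ToArray(), "text/csv; charset=UTF-8");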
To my surprise, performing the request via a browser instead of Google's Advanced REST Client and then clicking on the downloaded file just works! Excel opens it correctly. So the problem must be with ARC.
In any case, since the process is not going to be done using an HTTP client other than a browser, my problem is gone. Again, in ARC's output screen the file is displayed correctly; I do not know why it "gets corrupted" when clicked to be opened in Excel.
Strange.
The binary contents of the file show a correctly UTF-8 encoded CSV file with Hebrew characters. If, as you state in the comments, Excel does not allow you to change its guessed file encoding when opening a CSV file, that is rather a misbehavior in Excel itself (call it a bug if you want).
One of your options is to use LibreOffice (http://www.libreoffice.org/), whose spreadsheet component does allow you to customize the settings for opening a CSV file.
Another is to write a small program to explicitly convert your file to the encoding Excel is expecting; if you have a Python 3 interpreter installed, you could for example type:
python -c "open('correct.csv', 'wt', encoding='cp1255').write(open('utf8.csv', encoding='utf8').read())"
However, if your default Windows encoding for handling Hebrew is not cp1255, as I assumed above, that won't help Excel; it will just give you different gibberish :-) In that case, you should resort to programs that can correctly deal with different encodings.
(NB: the Python call that returns the default system encoding on Windows is locale.getpreferredencoding().)
In my case, I recently picked up the irrKlang library, which lets me work with audio files without doing too much work. Then I ran into the issue that Unicode characters in file paths are not supported by the library. It either reads them incorrectly (I would have thought that even if the path was read wrong, it could still find the file) or simply ignores them, leaving me with invalid file paths.
I searched their support forums for a solution, but all I got out of it was a "unicode? uhhh why not just use ascii?" kind of attitude towards Unicode, which I suppose is not uncommon.
What are some techniques that I could use to reliably pass unicode strings to libraries that don't have unicode support?
Simply put, you don't. You can pass the data through as a byte array and interpret it back as a Unicode string on the other end, but if the library doesn't do Unicode, it doesn't do Unicode.
There is no point in passing a unicode string to a library that is incapable of interpreting it.
If you need to do something specific (like using a load command on a filesystem with Unicode paths, e.g. HFS+), then don't. Rather, use the system-provided file APIs and push the data into the uncooperative library's constructor.
If you're seriously having problems with this Unicode file path business because you don't work well with passing addresses and bit streams around, then a simple solution is to make your own function:
obj_ptr* loadObjFromUnicodePath(const wchar_t* path)
{
    // Create a temporary ASCII-named symlink to the file at `path`.
    // Call the load API of irrKlang on the symlink.
    // Delete the temporary symlink and return the loaded object.
}
We have parsers for various Microsoft languages (VB6, VB.net, C#, MS dialects of C/C++).
They are Unicode enabled to the extent that we all agree on what Unicode is. Where we don't agree, our lexers object.
Recent MS IDEs all seem to read/write their source code files in UTF-8... but I'm not sure this is always true. Is there some reference document that makes it clear how MS will write a source code file? With or without byte order marks? Does it vary from IDE version to version? (I can't imagine that the old VB6 dev environment wrote anything other than an 8-bit character set, and I'd guess it would be in the CP-xxxx encoding established by the locale, right?)
For C# (and I assume other modern language dialects supported by MS), the character code \uFEFF can actually be found in the middle of a file. This code is defined as a zero-width no-break space. It appears to be ignored by VS 2010 when found in the middle of an identifier or in whitespace, but is significant in keywords and numbers. So, what are the rules? Or does MS have some kind of identifier normalization to handle things like composed characters, which would allow different identifier strings to be treated as identical?
This is in a way a non-answer, because it does not tell what Microsoft says but what the standards say. Hope it will be of assistance anyway.
U+FEFF as a regular character
As you stated, U+FEFF should be treated as a BOM (byte order mark) at the beginning of a file. Theoretically it could also appear in the middle of text, since it actually is a character denoting a zero-width non-breaking space (ZWNBSP). In some languages/writing systems all words in a line are joined (written together), and in such cases this character could be used as a separator, just like a regular space in English, but without causing a typographically visible gap. I'm not actually familiar with such scripts, so my view might not be fully correct.
U+FEFF should only appear as a BOM
However, the usage of U+FEFF as a ZWNBSP has been deprecated as of Unicode version 3.2, and currently the purpose of U+FEFF is to act as a BOM. Instead of ZWNBSP as a separator, the U+2060 (word joiner) character is strongly preferred by the Unicode Consortium. Their FAQ also suggests that any U+FEFF occurring in the middle of a file can be treated as an unsupported character that should be displayed as invisible. Another possible solution that comes to mind would be to replace any U+FEFF occurring in the middle of a file with U+2060, or just ignore it.
Accidentally added U+FEFF
I guess the most probable reason for U+FEFF to appear in the middle of text is that it is an erroneous result (or side effect) of string concatenation. RFC 3629, which incorporated the usage of a BOM, notes that stripping the leading U+FEFF is necessary when concatenating strings. This also implies that the character could simply be removed when found in the middle of text.
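A one-line sketch of that cleanup in C# (the variable name is illustrative):

// Drop any U+FEFF that ended up mid-string, e.g. from concatenating BOM-prefixed fragments.
string cleaned = text.Replace("\uFEFF", string.Empty);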
U+FEFF and UTF-8
U+FEFF as a BOM has no real effect when the text is encoded as UTF-8, since UTF-8 always has the same byte order. A BOM in UTF-8 interferes with systems that rely on the presence of certain leading characters and with protocols that explicitly mandate the encoding or an encoding identification method. Real-world experience has also shown that some applications choke on UTF-8 with a BOM. Therefore the usage of a BOM is generally discouraged when using UTF-8. Removing the BOM from a UTF-8 encoded file should not cause incorrect interpretation of the file (unless there is some checksum or digital signature related to the byte stream of the file).
On "how MS will write a souce code file" : VS can save files with and without BOM, as well in whole bunch of other encodings. The default is UTF-8 with BOM. You can try it yourself by going File -> Save ... as -> click triangle on "Save" button and chose "save with encoding".
On the usage of U+FEFF in actual code: I've never seen anyone use it in code... Wikipedia suggests that it should be treated as a zero-width space if it occurs anywhere but the first position (http://en.wikipedia.org/wiki/Byte_order_mark).
For C++, the file is either Unicode with BOM, or will be interpreted as ANSI (meaning the system code page, not necessarily 1252). Yes, you can save with whatever encoding you want, but the compiler will choke if you try to compile a Shift-JIS file (Japanese, code page 932) on an OS with 1252 as system code page.
In fact, even the editor will get it wrong. You can save it as Shift-JIS on a 1252 system and it will look OK. But close the project and reopen it, and the text looks like junk. So the info is not preserved anywhere.
So that's your best guess: if there is no BOM, assume ANSI. That is what the editor/compiler do.
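A hedged C# sketch of that heuristic (the helper name is illustrative):

// No BOM => assume ANSI (Encoding.Default is the system code page).
static Encoding GuessSourceEncoding(string path)
{
    byte[] head = new byte[3];
    int n;
    using (var fs = File.OpenRead(path))
        n = fs.Read(head, 0, 3);
    bool utf8Bom = n == 3 && head[0] == 0xEF && head[1] == 0xBB && head[2] == 0xBF;
    return utf8Bom ? Encoding.UTF8 : Encoding.Default;
}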
Also: this applies to VS 2008 and VS 2010; older editors were not so Unicode-friendly.
And C++ has different rules than C# (for C++ the files are ANSI by default, for C# they are UTF-8).
I have a C++ program that sends data via FTP in ASCII mode to an IBM mainframe. I am now doing this via C#.
When it gets there and is viewed, the file looks like garbage.
I cannot see anything in the C++ code that does anything special to encode the file into something like EBCDIC. When the C++ files are sent, they are viewed OK. The only thing I see different is \015 & \012 for line feeds, whereas C# is using \r\n.
Would these characters have an effect and if so how can I get my C# app to use \015?
Do I have to do any special encoding to make it appear ok?
It sounds like you should indeed be using an EBCDIC encoding, and then probably transferring the text in binary. I have an EBCDIC encoding class you can use, should you wish.
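(As a hedged illustration, not the answerer's own class: .NET itself ships EBCDIC code pages, e.g. IBM037 for US/Canada; the right code page depends on the mainframe's locale.)

// Convert the text to EBCDIC before transferring the file in binary mode.
Encoding ebcdic = Encoding.GetEncoding("IBM037");
byte[] ebcdicBytes = ebcdic.GetBytes(textToSend); // textToSend is hypothetical
File.WriteAllBytes("output.ebcdic", ebcdicBytes);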
Note that \015\012 is \r\n; they're characters 13 and 10 in decimal (\015 and \012 are the octal forms), just different ways of representing the same characters. If you think the C++ code really is producing the same files as C#, compare two files which should be the same in a binary file editor.
Make sure you have the TYPE TEXT instead of TYPE BINARY command before you transfer the file.
If you are truly sending the files in ASCII mode, then the mainframe itself will convert that to EBCDIC (it's receiver-makes-good).
The fact that you're getting apparent garbage at the mainframe end, and character codes \015 and \012 (which are CR and LF respectively) means that you're not transferring in ASCII mode.
As an aside, the ISPF editor has been able to view ASCII data sets for quite a few versions now. Open up the file and enter the commands source ascii and lf.
The first converts the characters from ASCII to EBCDIC so you can see what they are; the second goes through and pads out "lines" so that line-feed markers are replaced with enough spaces to reach the record length.
Invaluable commands when dealing with mixed-encoding environments, which is where I do a lot of my work.
I've got a FileUpload control in an ASP.NET web page which is used to upload a file, the contents of which (in a stream) are processed in the C# code behind and output on the page later, using HtmlEncode.
But, some of this output is becoming mangled, specifically the symbol '£' is output as the Unicode FFFD REPLACEMENT CHARACTER. I've tracked this down to the input file, which is Windows 1252 ('ANSI') encoded.
The question is,
How do I determine whether the file is encoded as 1252 or UTF8? It could be either, and
How do I convert it to UTF8 if it is in Windows 1252, preserving the symbol £ etc?
I've looked online but cannot find a satisfactory answer.
If you know that the file is encoded with Windows 1252, you can open the file with a StreamReader and pass the proper encoding. That is:
StreamReader reader = new StreamReader("filename", Encoding.GetEncoding("Windows-1252"), true);
The "true" tells it to set the encoding based on the byte order marks at the front of the file, if they're there. Otherwise it opens it as Windows-1252.
You can then read the file and, if you want to convert to UTF-8, write to a file that you've opened with that encoding.
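A minimal sketch of that conversion (file names are illustrative):

// Read as Windows-1252 (honoring a BOM if present), write back out as UTF-8.
using (var reader = new StreamReader("input.txt", Encoding.GetEncoding("Windows-1252"), true))
using (var writer = new StreamWriter("output.txt", false, Encoding.UTF8))
{
    writer.Write(reader.ReadToEnd());
}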
The short answer to your first question is that there isn't a 100% satisfactory way to determine the encoding of a file. If there are byte order marks, you can determine what flavor of Unicode it is, but without the BOM, you're stuck with using heuristics to determine the encoding.
I don't have a good reference for the heuristics. You might search for "how does Notepad determine the character set". I recall seeing something about that some time ago.
In practice, I've found the following to work for most of what I do:
StreamReader reader = new StreamReader("filename", Encoding.Default, true);
Most of the files I read are those that I create with .NET's StreamWriter, and they're in UTF-8 with the BOM. Other files that I get are typically written with some tool that doesn't understand Unicode or code pages, and I just treat it as a stream of bytes, which Encoding.Default does well.