How to convert UTF-8 to cp1252 in C#?

I need to convert an XML file that is being sent as UTF-8 to cp1252. How is this possible?
My consumer application only supports ANSI characters in its controls, not Unicode. It also does a memcpy of the bytes received.

Try the Encoding.Convert method. See here: http://msdn.microsoft.com/en-us/library/kdcak6ye(v=vs.100).aspx
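A short sketch of that approach (file paths are placeholders; it assumes the input really is UTF-8, and any character with no Windows-1252 equivalent is replaced with '?' by the default encoder fallback):

using System.IO;
using System.Text;

byte[] utf8Bytes = File.ReadAllBytes(@"C:\data\input.xml");
byte[] cp1252Bytes = Encoding.Convert(Encoding.UTF8, Encoding.GetEncoding(1252), utf8Bytes);
File.WriteAllBytes(@"C:\data\output.xml", cp1252Bytes);

Note that if the XML declaration says encoding="utf-8", it should be updated to encoding="windows-1252" as well, otherwise parsers may misread non-ASCII characters.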

Got an answer from jonp from the Xamarin team - http://forums.xamarin.com/discussion/337/how-to-send-byte-encoded-as-ansi
It turns out "You need to add the West internationalization assemblies to your application.":
By default, Mono for Android only includes a limited list of collation tables and encodings. In order to include additional data in your application, you can set the MandroidI18n project property to West.
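A hedged sketch of where that setting lives (the property name and the West value come from the Xamarin answer above; the surrounding .csproj structure is assumed):

<PropertyGroup>
  <MandroidI18n>West</MandroidI18n>
</PropertyGroup>

With the West data included, Encoding.GetEncoding(1252) should become available on the device.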

Related

DynamoDB automatically converting special characters

Working with DynamoDB and AWS (.NET C#). For some reason, when saving strings containing "é", the "é" gets replaced with a question mark.
How can I prevent it from happening?
DynamoDB stores strings in UTF-8 encoding. Somewhere in your application you must be encoding that string as something other than UTF-8.
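As an illustration of that kind of lossy step (not DynamoDB-specific), round-tripping "é" through an encoding that cannot represent it produces exactly the question mark described:

using System;
using System.Text;

byte[] bytes = Encoding.ASCII.GetBytes("é");         // 'é' has no ASCII representation, so the encoder substitutes '?'
Console.WriteLine(Encoding.ASCII.GetString(bytes));  // prints "?"

The fix is to find where the non-UTF-8 conversion happens and remove it.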
I'm using Java (which uses UTF-16). I don't do anything special when storing strings. I just tried storing and retrieving "é" in DynamoDB using the Java SDK and there was no problem.
Just to add to the previous answer, your IDE will generally have encoding settings. You'll want to change the string encoding for your project to UTF-8 to minimize the chance of an encoding error, which can produce what looks like an unknown string.
For example, for Eclipse editors, see this answer on how to change your encoding.

Read non-Unicode, non-English text from SQLite database

We have an old MFC C++ application that writes into a SQLite database and another C# application that reads from the database.
The C++ application writes "шаг потока работы" (Cyrillic characters) in the database from a Russian Windows computer.
When the same C++ application reads it on a Western European (Latin) Windows computer, it reads it as "øàã ïîòîêà ðàáîòû" (Latin representation).
When the C# application reads it, it reads it as "��� ������ ����" (Unicode representation).
None of these applications specify encoding type in the database. I want to read the original text in C#.
I couldn't find a proper way to specify the Encoding type when reading the text.
I've tried connection strings such as Data Source=c:\mydb.db;Version=3;UTF8Encoding=True; but no luck so far.
Also I tried to get the byte array from "��� ������ ����" and convert to Cyrillic but failed.
Does anyone happen to know how to read the original Russian text back from a SQLite database?
All the normal functions in the SQLite C API use UTF-8. The C# SQLite driver automatically converts between UTF-8 and the C# string encoding.
If you do not get correct data from your C# program, then it's likely that the C++ application did not actually write UTF-8. This is confirmed by the fact that the C++ application gives different results with different code pages.
If possible, fix the C++ application, or convert the data in the database from the original encoding to UTF-8. As a last resort, you could change your C# application to read all the strings as blobs and then convert them from the original encoding to UTF-8. (In any case, you need to know what the original encoding is.)
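As a rough sketch of that last-resort approach (assuming the System.Data.SQLite provider; the table and column names are placeholders, and Windows-1251 stands in for whatever code page the C++ application actually used):

using System;
using System.Data.SQLite;
using System.Text;

using (var conn = new SQLiteConnection(@"Data Source=c:\mydb.db;Version=3;"))
{
    conn.Open();
    // CAST(... AS BLOB) returns the raw bytes instead of letting the driver decode them as UTF-8.
    using (var cmd = new SQLiteCommand("SELECT CAST(Name AS BLOB) FROM WorkflowSteps", conn))
    using (var reader = cmd.ExecuteReader())
    {
        while (reader.Read())
        {
            byte[] raw = (byte[])reader[0];
            // Decode with the code page the C++ application actually wrote (Windows-1251 here).
            string text = Encoding.GetEncoding(1251).GetString(raw);
            Console.WriteLine(text);
        }
    }
}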

Unzipping with ExtractToDirectory method distorts non-latin symbols

I have several folders with files; some folder names contain non-Latin symbols (Russian in my case). These folders are packed into a ZIP archive (by Windows Explorer) at "D:\test.zip".
Then I execute the method
ZipFile.ExtractToDirectory(@"D:\test.zip", @"D:\result");
and it successfully unzips all the content, but all non-Latin symbols come out wrong.
For example, instead of "D:\result\каскады\file.txt" I got "D:\result\Є бЄ ¤л\file.txt".
The default encoding of my system is windows-1251, which I verified by passing Encoding.GetEncoding("windows-1251") as the third parameter of ExtractToDirectory and getting the same result. I also tried UTF-8, but got different artifacts in the path ("D:\result\��᪠��\file.txt"). Trying Unicode returns a message about an unsupported encoding.
When I create the same archive through code by executing the method
ZipFile.CreateFromDirectory(@"D:\zipdata", @"D:\test.zip");
everything then unzips fine with the same line of code as at the top of the question, even without specifying a particular encoding.
The question is: how do I get the correct encoding from the archive to use with ExtractToDirectory, given that in the real task the archive comes from an external source and I cannot rely on whether it was created by hand or programmatically?
Edit
There is a question where non-Latin symbols (Chinese) also cause problems, but there that fact was presented as the resolution, whereas it is exactly the problem in my situation.
There is no formally standardized ZIP specification. However, the de facto standard is the PKZIP "application note" document, which as of 2006 documents only code page 437 ("OEM United States") and UTF8 as legal text encodings for file entries in the archive:
D.1 The ZIP format has historically supported only the original IBM PC character
encoding set, commonly referred to as IBM Code Page 437. This limits storing
file name characters to only those within the original MS-DOS range of values
and does not properly support file names in other character encodings, or
languages. To address this limitation, this specification will support the
following change.
D.2 If general purpose bit 11 is unset, the file name and comment should conform
to the original ZIP character encoding. If general purpose bit 11 is set, the
filename and comment must support The Unicode Standard, Version 4.1.0 or
greater using the character encoding form defined by the UTF-8 storage
specification. The Unicode Standard is published by the The Unicode
Consortium (www.unicode.org). UTF-8 encoded data stored within ZIP files
is expected to not include a byte order mark (BOM).
In other words, it's a bug in any ZIP authoring tool to use any text encoding other than code page 437 or UTF8. Based on your experience, it appears Windows Explorer has this bug. :(
Unfortunately, the "general purpose bit 11" is the only official mechanism for indicating the actual text encoding used in the archive, and this allows only for either the original 437 code page or UTF8. Even this bit was not supported by .NET until .NET 4.5. In any case, even since that time it is not possible for .NET or any other ZIP archive-aware software to reliably determine a non-standard, unsupported encoding used to encode the file entry names in the archive.
However, you can, if the source machine used to create the archive is known and available, determine the default code page installed on that machine, via the CultureInfo class. The following expression will return the code page identifier installed on the machine where the expression is executed (assuming the process hasn't changed its current culture from the default, of course):
System.Globalization.CultureInfo.CurrentCulture.TextInfo.OEMCodePage
This gives you the code page ID that can be passed to Encoding.GetEncoding(Int32) to retrieve an Encoding object that can then be passed to the appropriate ZipArchive constructor when opening an existing archive, to ensure that the file entry names are decoded correctly.
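Putting those pieces together, a minimal sketch (the paths are the ones from the question; using the current machine's OEM code page only helps if it matches the machine that created the archive):

using System.Globalization;
using System.IO.Compression;
using System.Text;

int oemCodePage = CultureInfo.CurrentCulture.TextInfo.OEMCodePage;
Encoding entryNameEncoding = Encoding.GetEncoding(oemCodePage);
ZipFile.ExtractToDirectory(@"D:\test.zip", @"D:\result", entryNameEncoding);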
If you are unable to retrieve the actual text encoding from the machine that is the origin of the archive, then you're stuck enumerating the encodings, trying each one until you find one that reports entry names in a legible format.
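A rough sketch of that brute-force approach (the candidate list below is just an example; add whichever OEM/ANSI code pages are plausible for your sources):

using System;
using System.IO.Compression;
using System.Text;

int[] candidates = { 437, 866, 1251, 1252 };   // OEM US, OEM Cyrillic, ANSI Cyrillic, ANSI Western European
foreach (int codePage in candidates)
{
    Console.WriteLine($"--- code page {codePage} ---");
    using (var archive = ZipFile.Open(@"D:\test.zip", ZipArchiveMode.Read, Encoding.GetEncoding(codePage)))
    {
        foreach (var entry in archive.Entries)
            Console.WriteLine(entry.FullName);   // inspect which listing looks legible
    }
}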
As I understand it, Windows 8 and later can support the UTF8 flag in the ZIP archive file. I haven't tried it, but it's possible that such versions of Windows also write archives using that flag. If so, that would (one hopes) mitigate the pain of the earlier Windows bug.
Finally note that a custom tool could record the encoding in a special file entry placed in the archive itself. Of course, only that tool would be able to recognize the special file and use it to determine the correct encoding (the tool would have to open the archive twice: once to retrieve the file, and then a second time once the tool has determined the encoding). This is not an ideal solution and of course is no help for archives created by Windows Explorer. I mention it only for the sake of completeness.

How to make C# show Arabic?

I have a problem: while writing C# code, the output is sometimes words in the Arabic language, and they appear as strange symbols. How can I make C# read and show Arabic?
I don't know the precise problem you're having, but would suggest you read The Absolute Minimum Every Programmer Should Know About Unicode to give yourself a solid grounding in this often confusing topic.
Arabic Console Output/Input is not possible on Windows Platforms, according to Microsoft:
http://www.microsoft.com/middleeast/msdn/arabicsupp.aspx#12
C#/.NET will display Arabic characters without a problem, as it represents strings internally as UTF-16.
The issue is with how you display the characters.
If you are on the web, you need to ensure that you are including the correct charset encoding header or meta tag for the output.
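For example, in ASP.NET Web Forms that could look like the sketch below (an assumption about your setup; it only helps if the strings themselves are intact and the response encoding is the problem):

using System;
using System.Web.UI;

public partial class ArabicPage : Page
{
    protected void Page_Load(object sender, EventArgs e)
    {
        // Send and advertise UTF-8 so the browser decodes the Arabic text correctly.
        Response.ContentEncoding = System.Text.Encoding.UTF8;
        Response.Charset = "utf-8";
    }
}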
Please provide more information on where you don't see the characters, and how you are outputting the strings.
Maybe it is a problem with your system language. Go to Control Panel, then Language options, and try changing your system locale to Arabic; also ensure that the language for non-Unicode programs is Arabic.
Please make sure that you have correct fonts installed. If you have them on your system, it could be a fallback mechanism problem.
For web pages (ASP.NET), please make sure that:
You are using (and declaring) correct encoding.
You have correct fonts declared in your style definition.
I know that it sounds strange, but for Internet Explorer, it helps to set language for non-Unicode programs to what you need to support (in my case it was Chinese Simplified on German Windows 2003).

Unicode strings in my C# App are shown with question marks

I have a header file in a C++/CLI project, which contains some strings in different languages:
Arabic, English, German, Chinese, French, Japanese, etc...
I have a second project written in C#.
Here I access the strings stored in the header file of the C++/CLI project.
The encoding of the header file is Unicode - Codepage 1200 or UTF-8.
The Visual Studio editor is able to display the strings correctly.
At runtime I access these strings and assign them into a local String variable.
Here I noticed that many strings are not shown correctly, whether I assign them or not. Inspecting the original location (while debugging) shows me all the foreign strings with question marks; especially Chinese is just question marks.
Example : "So?e St?ange ?ext in Ch?n?se"
(This is not the best example, I know)
What is the problem?
I read that C# uses UTF-16 by default, and my header file containing the strings is UTF-16 or UTF-8.
I must be able to handle strings in different languages. What am I doing wrong?
The '?" means that the text is read as Unicode(UTF16) and somehwere there is a conversion to your current code page. Since your current codepage is NOT chinese the chinese chararcters will get transformed to '?'
It would be helpful to see the code.
