Preserve encoding of each byte while converting from stream to string - c#

I have a scenario where the data in a stream is encoded with a mix of encodings such as UTF-8, Shift-JIS, etc. Some of the fields are encoded in UTF-8 and some in a different encoding.
When converting from the stream to a string using the StreamReader class, everything is decoded with one default encoding, and hence we sometimes lose some characters in the string.
I am using the lines below for the conversion:
StreamReader reader = new StreamReader(stream,Encoding.Default);
string s = reader.ReadToEnd();
but I am losing some data in the string.

Related

streamwriter does not save unicode files correctly

I'm opening a text file and removing the first line to prepare it for importing in a database using bulk insert. Here is my code:
string tempFile = Path.GetTempFileName();
using (var sr = new StreamReader("F:\\Upload\\File.txt", System.Text.Encoding.UTF8))
{
    using (var sw = new StreamWriter(tempFile, true, System.Text.Encoding.UTF8))
    {
        string line;
        while ((line = sr.ReadLine()) != null)
        {
            if (line.Substring(0, 8) != "Nr. Crt.")
                sw.WriteLine(line);
        }
    }
}
System.IO.File.Delete("F:\\Upload\\File.txt");
System.IO.File.Move(tempFile, "F:\\Upload\\File.txt");
After this, if I open the resulting file, Unicode characters are replaced with other characters. For example, strings containing a non-breaking space (Unicode U+00A0), such as "Value " (note the trailing non-breaking space), are transformed into "Value�".
How can I avoid this?
Edit:
Notepad++ is set to 'Encode in UTF-8'
are transformed in Value�
The byte values for those 3 odd characters are 0xef 0xbf 0xbd, which is the UTF-8 encoding of code point \ufffd, the replacement character �. It is produced when a decoder reads UTF-encoded text and the text contains an invalid byte sequence.
That points squarely at an issue with File.txt: it was probably not encoded in UTF-8. If you have no idea what encoding was used for that file, then the first guess is to pass Encoding.Default to the StreamReader constructor.
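To make the mechanism above concrete, here is a minimal sketch of what happens when a single-byte "ANSI" non-breaking space (0xA0) is read as UTF-8 and written back out. The class and method names are hypothetical and only serve the illustration.

```csharp
using System.Text;

public static class ReplacementCharDemo
{
    // Decodes the input as UTF-8 and re-encodes the resulting string as UTF-8.
    // Any byte sequence that is not valid UTF-8 becomes U+FFFD on decode,
    // so the original bytes cannot be recovered afterwards.
    public static byte[] RoundTripAsUtf8(byte[] input)
    {
        string decoded = Encoding.UTF8.GetString(input);
        return Encoding.UTF8.GetBytes(decoded);
    }
}
```

Passing the bytes of "Value" followed by a lone 0xA0 should return "Value" followed by 0xEF 0xBF 0xBD, which is the damage described in the question.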
It looks to me like it is writing fine, but the tool you are reading with is not expecting UTF-8. In many cases, you need to explicitly tell the tool what encoding to expect. However, a common approach is to prepend a BOM ("byte order mark"). This is simple - just use new UTF8Encoding(true) as the encoding and it will happen automatically. In tools that don't expect a BOM this will display as a few mangled chars at the start - but most modern tools will know what it means, and will switch to UTF-8 automatically. The point is: the BOM for UTF-8, UTF-16 LE and UTF-16 BE etc are all slightly different, but recognisable. A more complete list is on wikipedia.
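As a rough illustration of the BOM suggestion, the sketch below writes text through a StreamWriter configured with new UTF8Encoding(true) and returns the raw bytes so the preamble can be inspected. It writes to a MemoryStream rather than a file purely to keep the example self-contained; the class and method names are made up.

```csharp
using System.IO;
using System.Text;

public static class BomDemo
{
    // Writes the text as UTF-8 with a byte order mark and returns the bytes.
    public static byte[] WriteWithBom(string text)
    {
        using (var ms = new MemoryStream())
        {
            // UTF8Encoding(true) asks the writer to emit the UTF-8 preamble.
            using (var sw = new StreamWriter(ms, new UTF8Encoding(true)))
            {
                sw.Write(text);
            }
            // ToArray still works after the stream has been closed.
            return ms.ToArray();
        }
    }
}
```

For non-empty text, the returned array should start with 0xEF 0xBB 0xBF, followed by the encoded characters.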

C# - converting UTF-8 to Ukranian encoding

I was trying to convert the encoding of this string from UTF-8 to Ukrainian: "ÐÑайвеÑ-длÑ-пÑинÑеÑа-Pixma-ip-2000-длÑ-Windows-7-64-биÑ".
Whenever I convert it from UTF-8 to Ukrainian, I get a corrupted string.
The correct string should look like "Драйвер-для-принтера-Pixma-ip-2000-для-Windows-7-64-бит".
Please advise. Thanks.
EDIT: Here is how I convert it:
private string EncodeUTF8toOther(string inputString, string to)
{
    try
    {
        // Create two different encodings.
        byte[] myBytes = Encoding.Unicode.GetBytes(inputString);
        // Perform the conversion from one encoding to the other.
        byte[] convertedBytes = Encoding.Convert(Encoding.Unicode, Encoding.GetEncoding(to), myBytes);
        return Encoding.GetEncoding("ISO-8859-1").GetString(convertedBytes);
    }
    catch
    {
        return inputString;
    }
}
The Ukrainian character set is "KOI8-U".
More info: I have a similar problem to this question:
c# HttpWebResponse Header encoding
The Location header is giving me this corrupted string. I need to decode it correctly in order to perform the redirection.
Encoding.Unicode is UTF-16, not UTF-8. If you're sure your source data is encoded in UTF-8, use Encoding.UTF8 instead.
Also, returning a re-encoded string doesn't make sense: a .NET string is always UTF-16 in memory. You should worry about the encoding only when reading and writing your string.
When reading, use Encoding.UTF8.GetString to create a UTF-16 string from the binary data.
When writing, either use Encoding.GetEncoding(destinationEncoding).GetBytes to get the binary data and write it directly, or use the overload of your StreamWriter constructor (or whatever object you're using) to specify the encoding.
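The read-then-write pattern described above can be sketched as a small helper that takes raw bytes in one encoding and returns raw bytes in another, with a string only as the intermediate form. The helper name is hypothetical. The test case targets UTF-16 because it is always available; on newer .NET runtimes, legacy code pages such as KOI8-U may first need the code pages encoding provider to be registered, which is an assumption to verify for your target framework.

```csharp
using System.Text;

public static class TranscodeDemo
{
    // Decodes the source bytes with the encoding they were written in,
    // then encodes the resulting string with the destination encoding.
    public static byte[] Transcode(byte[] source, Encoding from, Encoding to)
    {
        string text = from.GetString(source);   // bytes -> UTF-16 string
        return to.GetBytes(text);               // string -> destination bytes
    }
}
```

The key design point is that the function never returns a string "in" some encoding: encodings only exist at the byte boundary.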
You need to decode the string properly on input, like so:
StreamReader rdr = new StreamReader( args[0], Encoding.UTF8 );
string str = rdr.ReadToEnd();
rdr.Close();
The stream is physical, and you must know what encoding it is in. The string, on the other hand, is logical. The encoding used for strings internally is of no concern to you, other than which characters it can represent, and it can represent all characters because the internal encoding is Unicode. (If the internal encoding were KOI-8, German or French characters couldn't be represented.)
It is on output that you have to worry about the encoding again. If you don't specify the encoding on input and output, the platform default is assumed, which might not be what you want. It's good practice to know and specify the encoding on input and output.
"ÐÑайвеÑ-длÑ-пÑинÑеÑа-Pixma-ip-2000-длÑ-Windows-7-64-биÑ".
It's already UTF-8! You don't have to do any conversion. Just let Windows know it's UTF-8. Something like this will do the job:
wb.Encoding = Encoding.UTF8;
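To illustrate why the garbled string is "already UTF-8", the sketch below reproduces the corruption by decoding UTF-8 bytes as ISO-8859-1, and then reverses it. The names are hypothetical, and the repair only works when no bytes were lost along the way (text copied from a web page has often dropped the invisible control characters, so fixing the decoding at the source, as suggested above, is the more reliable route).

```csharp
using System.Text;

public static class MojibakeDemo
{
    // ISO-8859-1 maps every byte 0x00-0xFF to the code point of the same value.
    private static readonly Encoding Latin1 = Encoding.GetEncoding("ISO-8859-1");

    // Simulates the bug: UTF-8 bytes misread as ISO-8859-1.
    public static string Garble(string text)
    {
        return Latin1.GetString(Encoding.UTF8.GetBytes(text));
    }

    // Reverses the bug: recover the original bytes, then decode them as UTF-8.
    public static string Repair(string garbled)
    {
        return Encoding.UTF8.GetString(Latin1.GetBytes(garbled));
    }
}
```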

How to write some text as bytes (without encoding)

I need to generate a file that will be read by another system. For this reason, it should be in binary, not text with some encoding.
Here's the code I'm using:
using (FileStream stream = new FileStream(fileName, FileMode.Create))
{
    using (BinaryWriter writer = new BinaryWriter(stream))
    {
        writer.Write("Some text" + Environment.NewLine);
        writer.Write("Some more text" + Environment.NewLine);
    }
}
When I open the file and look at it, I can see some special character at the start of each line, similar to this (hard to paste it here, since it doesn't show the same):
~Some text
~Some more text
What am I doing wrong/forgetting?
Thanks for your help.
There's no such concept as text without an encoding. It's like wanting to save an abstract image to disk without specifying any image format. (Even "raw" is a kind of encoding for images - you need to agree on a way of communicating the width, height, byte order, colour depth somehow.)
I suggest you just fix on one encoding (e.g. Encoding.Unicode or Encoding.UTF8) and write the text that way.
As for why BinaryWriter.Write(text) is putting "special characters" at the start of each line, did you check the documentation for what it does?
Writes a length-prefixed string to this stream in the current encoding of the BinaryWriter, and advances the current position of the stream in accordance with the encoding used and the specific characters being written to the stream.
and
A length-prefixed string represents the string length by prefixing to the string a single byte or word that contains the length of that string. This method first writes the length of the string as a UTF-7 encoded unsigned integer, and then writes that many characters to the stream by using the BinaryWriter instance's current encoding.
So what you're seeing is the length-prefix... but then it will use whatever encoding you've set up for the BinaryWriter.
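A small sketch comparing the two behaviours: BinaryWriter.Write(string) emits the length prefix, while encoding the text yourself and calling the byte[] overload writes only the encoded characters. UTF-8 is chosen here as an assumption; the consuming system may expect a different encoding. The class and method names are made up.

```csharp
using System.IO;
using System.Text;

public static class BinaryWriterDemo
{
    // Uses the string overload: output starts with the encoded length.
    public static byte[] WriteLengthPrefixed(string text)
    {
        using (var ms = new MemoryStream())
        {
            using (var writer = new BinaryWriter(ms, Encoding.UTF8))
            {
                writer.Write(text);
            }
            return ms.ToArray();
        }
    }

    // Encodes explicitly and uses the byte[] overload: no prefix is written.
    public static byte[] WriteRaw(string text)
    {
        using (var ms = new MemoryStream())
        {
            using (var writer = new BinaryWriter(ms))
            {
                writer.Write(Encoding.UTF8.GetBytes(text));
            }
            return ms.ToArray();
        }
    }
}
```

For "Some text" (9 characters), the first method should produce 10 bytes with a leading 0x09, and the second should produce exactly the 9 text bytes.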

Converting a binary file to a string and vice versa

I created a webservice which returns a (binary) file. Unfortunately, I cannot use byte[] so I have to convert the byte array to a string.
What I do at the moment is the following (but it does not work):
Convert file to string:
byte[] arr = File.ReadAllBytes(fileName);
System.Text.UnicodeEncoding enc = new System.Text.UnicodeEncoding();
string fileAsString = enc.GetString(arr);
To check if this works properly, I convert it back via:
System.Text.UnicodeEncoding enc = new System.Text.UnicodeEncoding();
byte[] file = enc.GetBytes(fileAsString);
But at the end, the original byte array and the byte array created from the string aren't equal. Do I have to use another method to read the file to a byte array?
Use Convert.ToBase64String to convert it to text, and Convert.FromBase64String to convert back again.
Encoding is used to convert from text to a binary representation, and from a binary representation of text back to text again. In this case you don't have a binary representation of text - you just have arbitrary binary data... so Encoding is inappropriate. Even if you use an encoding which can "sort of" handle any binary data (e.g. ISO Latin 1) you'll find that many ways of transmitting text will fail when you've got control characters etc.
Base64 encoding will give you text which is just ASCII, and much easier to handle.
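The Base64 suggestion can be sketched as a pair of thin wrappers around Convert.ToBase64String and Convert.FromBase64String. The wrapper names are hypothetical; in the web service scenario, the byte array would come from File.ReadAllBytes.

```csharp
using System;

public static class Base64Demo
{
    // Arbitrary binary data -> ASCII-only text that survives text transports.
    public static string ToText(byte[] data)
    {
        return Convert.ToBase64String(data);
    }

    // Text produced by ToText -> the exact original bytes.
    public static byte[] FromText(string text)
    {
        return Convert.FromBase64String(text);
    }
}
```

Unlike the UnicodeEncoding round trip in the question, this one is lossless for any byte sequence, at the cost of roughly a third more characters than input bytes.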

Converting text file from ANSI to ASCII using C#

I have an ANSI-encoded file, and I want to convert the lines I read from the file to ASCII.
How do I go about doing this in C#?
EDIT: What if I used BinaryReader?
BinaryReader reader = new BinaryReader(input, Encoding.Default);
This reader takes (Stream, Encoding), but Stream is abstract! And where should I put the path of the file it will read from?
A direct conversion from ANSI to ASCII might not always be possible, since ANSI is a superset of ASCII.
You can try converting to UTF-8 using Encoding, though:
Encoding ANSI = Encoding.GetEncoding(1252);
byte[] ansiBytes = ANSI.GetBytes(str);
byte[] utf8Bytes = Encoding.Convert(ANSI, Encoding.UTF8, ansiBytes);
String utf8String = Encoding.UTF8.GetString(utf8Bytes);
Of course you can replace UTF8 with ASCII, but that doesn't really make sense, since:
if the original string doesn't contain any byte > 127, then it's already ASCII
if the original string does contain one or more bytes > 127, then those bytes will be lost
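The two cases above can be checked with a short sketch that pushes a string through Encoding.ASCII and back. With the default settings of Encoding.ASCII, characters outside the ASCII range come back as '?'. The helper name is hypothetical.

```csharp
using System.Text;

public static class AsciiDemo
{
    // Encodes to ASCII bytes and decodes again, showing what survives.
    public static string ToAsciiText(string text)
    {
        byte[] asciiBytes = Encoding.ASCII.GetBytes(text);
        return Encoding.ASCII.GetString(asciiBytes);
    }
}
```

Pure ASCII input is returned unchanged, while a string such as "café" should come back as "caf?".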
UPDATE:
In response to the updated question, you can use BinaryReader like this:
BinaryReader reader = new BinaryReader(File.Open("foo.txt", FileMode.Open),
Encoding.GetEncoding(1252));
Basically, you need to specify an Encoding when reading/writing the file. For example:
// read with the **local** system default ANSI page
string text = File.ReadAllText(path, Encoding.Default);
// ** I'm not sure you need to do this next bit - it sounds like
// you just want to read it? **
// write as ASCII (if you want to do this)
File.WriteAllText(path2, text, Encoding.ASCII);
Note that once you have read it, text is actually Unicode (UTF-16) in memory.
You can choose different code-pages using Encoding.GetEncoding.
