Converting .txt files into Unicode - C#

Is there a way I can convert a .txt file into Unicode using C#?

Only if you know the original encoding used to produce the .txt file (and that's not a restriction of C# or the .NET platform, it's a general problem).
Read The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) to learn why "plain text" is meaningless if you don't know the encoding.
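If you do know the source encoding, the conversion itself is only a few lines. Here is a minimal sketch (not from the original answer) that assumes the source file happens to be Windows-1252 and that UTF-8 output is wanted; the file names and the 1252 code page are placeholders:
using System.IO;
using System.Text;

class ConvertToUnicode
{
    static void Main()
    {
        // Assumption: the source file is Windows-1252; substitute whatever encoding
        // actually produced your .txt file.
        Encoding sourceEncoding = Encoding.GetEncoding(1252);

        // Decode the bytes with the known source encoding...
        string text = File.ReadAllText("input.txt", sourceEncoding);

        // ...and write the same characters back out as UTF-8 (use Encoding.Unicode for UTF-16).
        File.WriteAllText("output.txt", text, Encoding.UTF8);
    }
}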

Provided you're only using ASCII characters in your text file, they're already Unicode, encoded as UTF-8.
If you want a different encoding of the characters (UTF-16/UCS-2, etc.), any language that supports Unicode should be able to read in one encoding and write out another.
The System.Text.Encoding classes will do it, as in the following example - it takes a UTF-16 string, encodes it as both UTF-8 and ASCII, and then converts those bytes back again (code gratuitously stolen from here).
using System;
using System.IO;
using System.Text;

class Test
{
    public static void Main()
    {
        using (StreamWriter output = new StreamWriter("practice.txt"))
        {
            string srcString = "Area = \u03A0r^2"; // PI.R.R

            // Convert the UTF-16 encoded source string to UTF-8 and ASCII.
            byte[] utf8String = Encoding.UTF8.GetBytes(srcString);
            byte[] asciiString = Encoding.ASCII.GetBytes(srcString);

            // Write the UTF-8 and ASCII encoded byte arrays.
            output.WriteLine("UTF-8 Bytes: {0}", BitConverter.ToString(utf8String));
            output.WriteLine("ASCII Bytes: {0}", BitConverter.ToString(asciiString));

            // Convert the UTF-8 and ASCII encoded bytes back to UTF-16 strings and write them.
            output.WriteLine("UTF-8 Text : {0}", Encoding.UTF8.GetString(utf8String));
            output.WriteLine("ASCII Text : {0}", Encoding.ASCII.GetString(asciiString));

            Console.WriteLine(Encoding.UTF8.GetString(utf8String));
            Console.WriteLine(Encoding.ASCII.GetString(asciiString));
        }
    }
}

Here is an example:
using System;
using System.Collections.Generic;
using System.Text;
using System.IO;

namespace utf16
{
    class Program
    {
        static void Main(string[] args)
        {
            using (StreamReader sr = new StreamReader(args[0], Encoding.UTF8))
            using (StreamWriter sw = new StreamWriter(args[1], false, Encoding.Unicode))
            {
                string line;
                while ((line = sr.ReadLine()) != null)
                {
                    sw.WriteLine(line);
                }
            }
        }
    }
}

There is a nice page on MSDN about this, including a whole example:
// Specify the code page to correctly interpret byte values.
Encoding encoding = Encoding.GetEncoding(737); // (DOS) Greek code page
byte[] codePageValues = System.IO.File.ReadAllBytes(@"greek.txt");

// Same content is now encoded as UTF-16.
string unicodeValues = encoding.GetString(codePageValues);

If you do really need to change the encoding (see Pax's answer about UTF-8 being valid Unicode), then yes, you can do that quite easily. Check out the System.Text.Encoding class.
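If the bytes are already in memory, Encoding.Convert re-encodes a byte array from one encoding to another in a single call. A small sketch, with Windows-1252 assumed as the source encoding purely for illustration:
using System;
using System.Text;

class EncodingConvertExample
{
    static void Main()
    {
        // Assumption: the input bytes are Windows-1252; use whatever encoding your data really has.
        Encoding source = Encoding.GetEncoding(1252);
        Encoding target = Encoding.UTF8;

        byte[] sourceBytes = source.GetBytes("Caf\u00E9");

        // Re-encode the same characters from the source encoding to the target encoding.
        byte[] targetBytes = Encoding.Convert(source, target, sourceBytes);

        Console.WriteLine(BitConverter.ToString(targetBytes)); // 43-61-66-C3-A9
    }
}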

Related

Why am I getting GZip compression size of a string more than the original size after compression when using SharpZipLib in C#

My string is a JSON file (test.json) with the content below:
{
  "objectId": "bbad4cc8-bce8-438e-8683-3e603d746dee",
  "timestamp": "2021-04-28T14:02:42.247Z",
  "variable": "temperatureArray",
  "model": "abc.abcdefg.abcdef",
  "quality": 5,
  "value": [ 43.471600438222104, 10.00940101687303, 39.925500606152, 32.34369812176735, 33.07786476010357 ]
}
I am compressing it as below
using ICSharpCode.SharpZipLib.GZip;
using System;
using System.Diagnostics;
using System.IO;
using System.Reflection;
using System.Text;

namespace GZipTest
{
    public static class SharpZipLibCompression
    {
        public static void Test()
        {
            Trace.WriteLine("****************SharpZipLib Test*****************************");
            var testFile = Path.Combine(Path.GetDirectoryName(Assembly.GetExecutingAssembly().Location), "test.json");
            var text = File.ReadAllText(testFile);
            var ipStringSize = System.Text.UTF8Encoding.Unicode.GetByteCount(text);
            var compressedString = CompressString(text);
            var opStringSize = System.Text.UTF8Encoding.Unicode.GetByteCount(compressedString);
            float stringCompressionRatio = (float)opStringSize / ipStringSize;
            Trace.WriteLine("String Compression Ratio using SharpZipLib" + stringCompressionRatio);
        }

        public static string CompressString(string text)
        {
            if (string.IsNullOrEmpty(text))
                return null;

            byte[] buffer = Encoding.UTF8.GetBytes(text);
            using (var compressedStream = new MemoryStream())
            {
                GZip.Compress(new MemoryStream(buffer), compressedStream, false);
                byte[] compressedData = compressedStream.ToArray();
                return Convert.ToBase64String(compressedData);
            }
        }
    }
}
But my compressed string size (opStringSize) is more than the original string size (ipStringSize). Why?
Your benchmark has some fairly fundamental problems:
You're using UTF-16 to encode the input string to bytes when calculating its length (UTF8Encoding.Unicode is just an unclear way of writing Encoding.Unicode, which is UTF-16). That encodes to 2 bytes per character, but most of those bytes will be 0.
You're base64-encoding your output. While this is a way to print arbitrary binary data as text, it uses 4 characters to represent 3 bytes of data, so you're increasing the size of your output by 33%.
You're then using UTF-16 to turn the base64-encoded string into bytes again, which takes 2 bytes per character again. So that's an artificial 2x added to your result...
It so happens that the two uses of UTF-16 more-or-less cancel out, but the base64 encoding is still responsible for most of the discrepancy you're seeing.
Take that out, and you get a compression ratio of: 0.80338985.
That's not bad, given that compression introduces overheads: there's data which always needs to appear in a GZip stream, and it's there regardless of how well your data compresses. You can only really expect compression to make any significant difference on larger inputs.
See here.
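For comparison, here is a rough sketch of the same measurement with those distortions removed: count the UTF-8 bytes that are actually fed to the compressor and keep the compressed output as bytes instead of base64. It reuses the GZip.Compress call from the question's code.
using System.IO;
using System.Text;
using ICSharpCode.SharpZipLib.GZip;

static class CompressionRatio
{
    // Compressed size divided by input size, measured on the UTF-8 bytes
    // that actually go into the compressor (no UTF-16, no base64).
    public static float Measure(string text)
    {
        byte[] inputBytes = Encoding.UTF8.GetBytes(text);
        using (var compressedStream = new MemoryStream())
        {
            GZip.Compress(new MemoryStream(inputBytes), compressedStream, false);
            byte[] compressedBytes = compressedStream.ToArray();
            return (float)compressedBytes.Length / inputBytes.Length;
        }
    }
}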

Byte array read from a file and byte array converted from string read from same file differs

If I read a byte array from a file and write it back using the code below
byte[] bytes = File.ReadAllBytes(filePath);
File.WriteAllBytes(filePath, bytes);
it works perfectly fine. I can open and view the written file properly.
But if I read the file contents into a string and then convert that string to a byte array using the code below
string s = File.ReadAllText(filePath);
var byteArr = System.Text.Encoding.UTF8.GetBytes(s);
the byte array is larger than the one read directly from the file and the values are also different, so if I write the file using this array it cannot be read when opened.
Note: the file is UTF-8 encoded.
I found out the file's encoding using the code below
using (StreamReader reader = new StreamReader(filePath, Encoding.UTF8, true))
{
    reader.Peek(); // you need this!
    var encoding = reader.CurrentEncoding;
}
I am unable to understand why the two arrays differ.
I was using the code shown in the attached image for converting and then writing.
With
using (StreamReader reader = new StreamReader(filePath, Encoding.UTF8, true))
{
    reader.Peek(); // you need this!
    var encoding = reader.CurrentEncoding;
}
your var encoding will just echo the Encoding.UTF8 parameter. You are deceiving yourself there.
A binary file just has no text encoding.
Need to save a file, which may be anything, an image or a text.
Then just use ReadAllBytes/WriteAllBytes. A text file is always also a byte[], but not all file types are text. You would need Base64 encoding first and that just adds to the size.
The safest way to convert byte arrays to strings is indeed encoding it in something like base64.
Like:
string s = Convert.ToBase64String(bytes);
byte[] bytes = Convert.FromBase64String(s);
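As a small illustration of why the text round trip breaks (a sketch, not part of the original answer): decoding arbitrary bytes as UTF-8 turns invalid sequences into U+FFFD, so re-encoding produces different, longer output, while a base64 round trip preserves the bytes exactly.
using System;
using System.Text;

class RoundTripDemo
{
    static void Main()
    {
        // 0xFF and 0xFE are not valid UTF-8, so a text round trip cannot preserve them.
        byte[] original = { 0x48, 0x69, 0xFF, 0xFE };

        string asText = Encoding.UTF8.GetString(original);   // invalid bytes decode as U+FFFD
        byte[] reEncoded = Encoding.UTF8.GetBytes(asText);    // each U+FFFD re-encodes as EF BF BD

        Console.WriteLine(BitConverter.ToString(original));   // 48-69-FF-FE
        Console.WriteLine(BitConverter.ToString(reEncoded));  // 48-69-EF-BF-BD-EF-BF-BD

        // A base64 round trip never interprets the bytes as text, so it is exact.
        byte[] roundTripped = Convert.FromBase64String(Convert.ToBase64String(original));
        Console.WriteLine(BitConverter.ToString(roundTripped)); // 48-69-FF-FE
    }
}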

How can I change the encoding of a file 'without BOM' to a 'Windows-1252' encoded file?

This is my function to convert the encoding of a file.
Before conversion I opened the file in Notepad++ and checked the encoding using the Encoding menu; it showed that the encoding is UTF-8. I tried to convert the file using the following function, but it did not convert to ASCII.
Please have a look at the function.
public static void ConvertFileEncoding(string srcFile, Encoding srcEncoding, string tempFile)
{
    try
    {
        using (var reader = new StreamReader(srcFile))
        using (var writer = new StreamWriter(tempFile, false, Encoding.ASCII))
        {
            char[] buf = new char[1024];
            while (true)
            {
                int count = reader.Read(buf, 0, buf.Length);
                if (count == 0)
                {
                    break;
                }
                writer.Write(buf, 0, count);
            }
        }

        System.IO.File.Copy(tempFile, srcFile, true); // Source file is replaced with temp file
        DeleteTempFile(tempFile);
        // TODO -- log success details
    }
    catch (Exception e)
    {
        throw new IOException("Encoding conversion failed.", e);
        // TODO -- log failure details
    }
}
Please help me understand what goes wrong when I convert the file without a BOM to Windows-1252.
Characters that have values less than 128 in ASCII are all the same when encoded in UTF-8 or ASCII. If your file consists only of these (it is likely) then the file is identical as UTF-8 or ASCII.
A program can't be expected to distinguish these, because they are identical. UTF-8 is very commonly used now, so it's a reasonable choice when a program has no information other than the content of a file to guess from and it wants to display the encoding.
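If the goal really is a Windows-1252 output file rather than ASCII, pass that encoding to the StreamWriter explicitly. A minimal sketch along the lines of the function above (characters that don't exist in Windows-1252 will come out as '?'):
using System.IO;
using System.Text;

class ConvertToWindows1252
{
    public static void Convert(string srcFile, string destFile)
    {
        // On .NET Core / .NET 5+ code page 1252 also needs:
        // Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        Encoding windows1252 = Encoding.GetEncoding(1252);

        // StreamReader detects a UTF-8/UTF-16 BOM automatically and otherwise assumes UTF-8.
        using (var reader = new StreamReader(srcFile, Encoding.UTF8, true))
        using (var writer = new StreamWriter(destFile, false, windows1252))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                writer.WriteLine(line);
            }
        }
    }
}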

How can I get bytes to display their character representations from different codepages?

I've been trying to create a consistent method to take the bytes of characters and display the byte representation in alternative text code pages. For example, hex D1 in Windows-1251, KOI8-U, etc. The idea is to take text that appears scrambled, because it is being interpreted and displayed in the wrong character set, and transform it to the correct display. Below is a shortened portion of the code I've used. I've gotten it to work on ideone, but can't get it to work with Add-Type in PowerShell or when compiling with csc. I just get question marks or incorrect characters.
The output of the below code from ideone, which is the correct transformation, is:
D1-00-C1-00
СБ
windows-1251
When compiled with PowerShell or csc it is (incorrect):
D1-00-C1-00
?A
windows-1251
Is there a way to make this work in the Windows environment?
using System;
using System.Text;

public class Test
{
    public static void Main()
    {
        string str = "ÑÁ";
        byte[] bytes = new byte[str.Length * sizeof(char)];
        System.Buffer.BlockCopy(str.ToCharArray(), 0, bytes, 0, bytes.Length);

        Encoding enc = Encoding.GetEncoding(1251);
        char[] ca = enc.GetChars(bytes);

        Console.WriteLine(BitConverter.ToString(bytes));
        Console.WriteLine(ca);
        Console.WriteLine(enc.HeaderName);
    }
}
First of all, the best way to solve this problem is to avoid it—make sure that when you have bytes, you always know which character set was used to encode these bytes.
To answer the question: you can't. There is no consistent method to make this work anywhere. This will always involve guesswork.
What you see is a string which was encoded to bytes with some encoding, and then decoded using a different encoding. Here is how you fix these strings:
1. Figure out (or guess) what encoding was originally used to encode the string to bytes.
2. Figure out (or guess) what encoding was used when displaying the string.
3. Reverse the operations: encode the mojibake using the encoding from step (2) and decode the resulting bytes with the encoding from step (1).
If you already have the bytes, you only do step (1) and use that encoding to decode the bytes to a string.
A program doing that would look like this:
using System;
using System.Text;

public class Test
{
    public static void Main()
    {
        // our corrupted string
        string str = "ÑÁ";

        // encoding from step (2)
        Encoding enc1 = Encoding.GetEncoding(1252);
        byte[] bytes = enc1.GetBytes(str);

        // encoding from step (1)
        Encoding enc2 = Encoding.GetEncoding(1251);
        string originalStr = enc2.GetString(bytes);

        Console.WriteLine(originalStr);
    }
}
UPDATE/SOLUTION
As roeland notes, there's quite a bit of guesswork involved with this. On Windows the C# solution also has two parts: the console's display code page doesn't change automatically with the encoding object (it seems to on Mac with the Mono Framework), so the console display has to be set manually with SetConsoleCP and SetConsoleOutputCP. I also had to create multiple encodings and use an inner loop to get the proper intersection of code pages. The link below pointed towards the display issue's resolution.
UTF-8 output from PowerShell
The example below is focused on a scenario where Russian is the suspected language.
CODE
using System;
using System.Text;
using System.Runtime.InteropServices;

namespace Language
{
    public class Test
    {
        // Imports to set the console display code page
        [DllImport("kernel32.dll")]
        public static extern bool SetConsoleCP(int codepage);

        [DllImport("kernel32.dll")]
        public static extern bool SetConsoleOutputCP(int codepage);

        public static void Main()
        {
            string s = "ÑÁÅ";
            byte[] bytes = new byte[s.Length * sizeof(char)];
            System.Buffer.BlockCopy(s.ToCharArray(), 0, bytes, 0, bytes.Length);
            Console.WriteLine(BitConverter.ToString(bytes));

            // produce possible combinations
            foreach (Encoding encw in Russian.GetCps())
            {
                bool cp = SetConsoleOutputCP(encw.CodePage);
                bool cp2 = SetConsoleCP(encw.CodePage);
                foreach (Encoding enc in Russian.GetCps())
                {
                    char[] ca = enc.GetChars(bytes);
                    Console.WriteLine(ca);
                }
            }
        }
    }

    public class Russian
    {
        public static Encoding[] GetCps()
        {
            // get applicable Cyrillic code pages
            Encoding[] russian = new Encoding[8];
            russian[0] = Encoding.GetEncoding(855);
            russian[1] = Encoding.GetEncoding(866);
            russian[2] = Encoding.GetEncoding(1251);
            russian[3] = Encoding.GetEncoding(10007);
            russian[4] = Encoding.GetEncoding(20866);
            russian[5] = Encoding.GetEncoding(21866);
            russian[6] = Encoding.GetEncoding(20880);
            russian[7] = Encoding.GetEncoding(28595);
            return russian;
        }
    }
}
The output is lengthy, but gives a string with the correct output as one member of a list.
I made a shorter version in PowerShell, which appears to change the display code page automatically and requires fewer iterations:
function Get-Language ([string]$source) {
    $encodings = [System.Text.Encoding]::GetEncoding(855), [System.Text.Encoding]::GetEncoding(866),
                 [System.Text.Encoding]::GetEncoding(1251), [System.Text.Encoding]::GetEncoding(10007),
                 [System.Text.Encoding]::GetEncoding(20866), [System.Text.Encoding]::GetEncoding(21866),
                 [System.Text.Encoding]::GetEncoding(20880), [System.Text.Encoding]::GetEncoding(28595)
    $C = ""
    $bytes = gc $source -Encoding Byte
    for ($i = 0; $i -le $encodings.Length - 1; $i++) {
        $bytes | %{ $C = $C + $encodings[$i].GetChars($_) }
        Write-Host $C
        $C = ""
    }
}

C# writes a ZERO WIDTH NO-BREAK SPACE at the beginning of a txt file

I have a text file that is written from C# using ASCII encoding, and when I attempt to read the file from a Java project I get a ZERO WIDTH NO-BREAK SPACE character at the beginning of the file. Has anybody ever had this happen to them?
private static void SavePrivateKey(object key)
{
    if (logger.IsInfoEnabled) logger.Info("SavePrivateKey - Begin");

    string privatekey = (string)key;
    string strDirName = Utility.RCTaskDirectory;
    string strFileName = "PrivateKey.PPK";
    string strKeyPathandName = Path.Combine(strDirName, strFileName);

    //if (File.Exists(strKeyPathandName))
    //{
    //    File.Create(strKeyPathandName);
    //}

    if (!string.IsNullOrEmpty(privatekey))
    {
        // Save private key file
        if (!Directory.Exists(strDirName))
            Directory.CreateDirectory(strDirName);

        FileStream fileStream = new FileStream(strKeyPathandName, FileMode.OpenOrCreate);

        // TODO: Save file as ASCII
        using (StreamWriter sw = new StreamWriter(fileStream, Encoding.ASCII))
        {
            if (logger.IsDebugEnabled) logger.DebugFormat("Saving the private key to {0}.", strKeyPathandName);
            sw.Write(privatekey);
            sw.Close();
            if (logger.IsDebugEnabled) logger.DebugFormat("Saved private key to {0}.", strKeyPathandName);
        }
    }

    if (logger.IsInfoEnabled) logger.Info("SavePrivateKey() - End");
}
It seems that the text is written with a BOM, which is usually done when you write Unicode files... this specific character (U+FEFF) is the Unicode BOM, so there must be something in your C# code writing this file with a BOM...
see http://de.wikipedia.org/wiki/Byte_Order_Mark
As others have said, it is almost certainly a Unicode Byte Order Mark. If you have a look at the actual bytes in the file (not the characters) you can tell which encoding was used to write the file:
UTF-8 -> EF BB BF
UTF-16 BE -> FE FF
UTF-16 LE -> FF FE
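A quick way to check is to dump the first few bytes yourself. A minimal sketch (the file path is a placeholder for whatever file the Java side is reading):
using System;
using System.IO;

class BomCheck
{
    static void Main()
    {
        byte[] b = File.ReadAllBytes("PrivateKey.PPK");

        if (b.Length >= 3 && b[0] == 0xEF && b[1] == 0xBB && b[2] == 0xBF)
            Console.WriteLine("UTF-8 BOM");
        else if (b.Length >= 2 && b[0] == 0xFE && b[1] == 0xFF)
            Console.WriteLine("UTF-16 BE BOM");
        else if (b.Length >= 2 && b[0] == 0xFF && b[1] == 0xFE)
            Console.WriteLine("UTF-16 LE BOM");
        else
            Console.WriteLine("No BOM");
    }
}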
Yes, it's quite normal; see Wikipedia. It's an optional character which you simply should handle. So most likely you didn't write the file correctly as ASCII, since a BOM should only appear if the file is encoded as Unicode.
That's a Byte Order Mark, indicating it's a Unicode-encoded text file.
Clearly it's not writing the file in true ASCII; your code is probably just copying bytes, even though they are outside of the ASCII range. Can you post your code?

Categories

Resources