I am trying to create a method that can detect the encoding scheme of a text file. I know there are many encodings out there, but I know for sure my text file will be either ASCII, UTF-8, or UTF-16. I only need to detect these three. Does anyone know a way to do this?
First, open the file in binary mode and read it into memory.
For UTF-8 (or plain ASCII), do a validation check. Decode the bytes using Encoding.GetEncoding("UTF-8", EncoderFallback.ExceptionFallback, DecoderFallback.ExceptionFallback).GetString(bytes) and catch the exception. If none is thrown, the data is valid UTF-8. Here is the code:
private bool detectUTF8Encoding(string filename)
{
    byte[] bytes = File.ReadAllBytes(filename);
    try
    {
        // Decoding with an exception fallback throws on any invalid byte
        // sequence, so a successful decode means the data is valid UTF-8.
        Encoding.GetEncoding("UTF-8", EncoderFallback.ExceptionFallback,
                             DecoderFallback.ExceptionFallback).GetString(bytes);
        return true;
    }
    catch (DecoderFallbackException)
    {
        return false;
    }
}
For UTF-16, check for the BOM (FE FF or FF FE, depending on byte order).
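For example, here is a minimal sketch of that BOM check, assuming you already have the file's bytes in memory (the helper name is mine, not from the original):

static bool HasUtf16Bom(byte[] bytes)
{
    // FE FF = UTF-16 big-endian BOM, FF FE = UTF-16 little-endian BOM.
    return bytes.Length >= 2 &&
           ((bytes[0] == 0xFE && bytes[1] == 0xFF) ||
            (bytes[0] == 0xFF && bytes[1] == 0xFE));
}

Note that a UTF-16 file written without a BOM will slip past this check.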
Alternatively, you can let a StreamReader identify the encoding for you.
Example:
using (var r = new StreamReader(filename, Encoding.Default))
{
    richTextBox1.Text = r.ReadToEnd();
    // CurrentEncoding is only meaningful after the first read: the reader
    // switches encodings at that point if it finds a byte order mark.
    var encoding = r.CurrentEncoding;
}
This is my function for converting the encoding of a file.
Before converting, I opened the file in Notepad++ and checked the encoding via the Encoding menu; it showed the encoding as UTF-8. I tried to convert the file using the following function, but it did not convert to ASCII.
Please have a look at the function.
public static void ConvertFileEncoding(string srcFile, Encoding srcEncoding, string tempFile)
{
    try
    {
        using (var reader = new StreamReader(srcFile, srcEncoding))
        using (var writer = new StreamWriter(tempFile, false, Encoding.ASCII))
        {
            char[] buf = new char[1024];
            while (true)
            {
                int count = reader.Read(buf, 0, buf.Length);
                if (count == 0)
                {
                    break;
                }
                writer.Write(buf, 0, count);
            }
        }
        System.IO.File.Copy(tempFile, srcFile, true); // Source file is replaced with the temp file
        DeleteTempFile(tempFile);
        // TODO -- log success details
    }
    catch (Exception e)
    {
        throw new IOException("Encoding conversion failed.", e);
        // TODO -- log failure details
    }
}
Please help me understand what goes wrong when I convert the file without a BOM to Windows-1252.
Characters with values below 128 are encoded identically in ASCII and UTF-8. If your file consists only of these characters (which is likely), the file is byte-for-byte the same whether you call it UTF-8 or ASCII.
A program can't be expected to distinguish them, because there is nothing to distinguish. UTF-8 is very commonly used now, so it's a reasonable default when a program has nothing but the file's contents to guess from and it wants to display an encoding.
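You can see this for yourself with a quick sketch (the sample string is just an illustration; SequenceEqual needs using System.Linq):

// Text made only of code points below 128 encodes to identical bytes
// under both encodings.
string sample = "Hello, world!";
byte[] asAscii = Encoding.ASCII.GetBytes(sample);
byte[] asUtf8 = Encoding.UTF8.GetBytes(sample);
Console.WriteLine(asAscii.SequenceEqual(asUtf8)); // True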
I'm trying to get all the bytes from a file, but the issue is that I only have access to the file once. So what I'm doing is saving the file's contents as a string in a .txt file; when I need it, I read the saved string back and convert it to a byte array. But something is not working.
Basically, what I need is the equivalent of System.IO.File.ReadAllBytes(filePath), which works fine but is not what I need in this case.
I've tried this so far:
public byte[] getByteArray(string fileString)
{
    Encoding utf8 = Encoding.UTF8;
    Encoding ascii = Encoding.ASCII;
    Encoding unicode = Encoding.Unicode;
    return utf8.GetBytes(fileString);
}
I have tried every encoding in that class, but it didn't work.
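Decoding arbitrary bytes to a string with a text encoding and then re-encoding is lossy, because not every byte sequence is valid text in a given encoding. If the goal is just to stash the bytes and recover them exactly, one lossless alternative (a sketch, not from the original post) is to store them as Base64:

// Save the raw bytes as a Base64 string while the file is accessible...
string saved = Convert.ToBase64String(File.ReadAllBytes(filePath));
// ...and later recover exactly the same bytes from the saved string.
byte[] bytes = Convert.FromBase64String(saved);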
I am dealing with files in many formats, including Shift-JIS and UTF-8 without a BOM. Using a bit of language knowledge, I can detect whether a file is being interpreted correctly as UTF-8 or Shift-JIS, but if I detect that the file is not of the type I read in, I was wondering if there is a way to just reinterpret my in-memory array without having to re-read the file with a new encoding specified.
Right now, I read in the file assuming Shift-JIS, as such:
using (StreamReader sr = new StreamReader(path, Encoding.GetEncoding("shift-jis"), true))
{
    String line = sr.ReadToEnd();
    // Detection must be done AFTER you read from the file. Silly rabbit.
    fileFormatCertain = !sr.CurrentEncoding.Equals(Encoding.GetEncoding("shift-jis"));
    codingFromBOM = sr.CurrentEncoding;
}
and after I do my magic to determine whether it is a known format (has a BOM) or the data makes sense as Shift-JIS, all is well. If the data is garbage, though, then I re-read the file via:
using (StreamReader sr = new StreamReader(path, Encoding.UTF8))
{
    String line = sr.ReadToEnd();
}
I am trying to avoid this re-read step and reinterpret the data in memory if possible.
Or is the magic already happening, and am I needlessly worrying about double I/O access?
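One approach: read the raw bytes once with File.ReadAllBytes and do the reinterpretation entirely in memory, falling back to Shift-JIS only when the UTF-8 decode shows damage: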
var buf = File.ReadAllBytes(path);
// Try UTF-8 first; invalid byte sequences decode to U+FFFD,
// the Unicode replacement character.
var text = Encoding.UTF8.GetString(buf);
if (text.Contains("\uFFFD"))
{
    // Fall back to Shift-JIS (code page 932) over the same in-memory bytes.
    text = Encoding.GetEncoding(932).GetString(buf);
}
I have a text file that is written in C# using ASCII encoding, and when I attempt to read the file using a Java project, I get a ZERO WIDTH NO-BREAK SPACE character at the beginning of the file. Has anybody ever had this happen to them?
private static void SavePrivateKey(object key)
{
    if (logger.IsInfoEnabled) logger.Info("SavePrivateKey - Begin");
    string privatekey = (string)key;
    string strDirName = Utility.RCTaskDirectory;
    string strFileName = "PrivateKey.PPK";
    string strKeyPathandName = Path.Combine(strDirName, strFileName);
    //if (File.Exists(strKeyPathandName))
    //{
    //    File.Create(strKeyPathandName);
    //}
    if (!string.IsNullOrEmpty(privatekey))
    {
        // Save private key file
        if (!Directory.Exists(strDirName))
            Directory.CreateDirectory(strDirName);
        FileStream fileStream = new FileStream(strKeyPathandName, FileMode.OpenOrCreate);
        //TODO: Save file as ASCII
        using (StreamWriter sw = new StreamWriter(fileStream, Encoding.ASCII))
        {
            if (logger.IsDebugEnabled) logger.DebugFormat("Saving the private key to {0}.", strKeyPathandName);
            sw.Write(privatekey);
            sw.Close();
            if (logger.IsDebugEnabled) logger.DebugFormat("Saved private key to {0}.", strKeyPathandName);
        }
    }
    if (logger.IsInfoEnabled) logger.Info("SavePrivateKey() - End");
}
It seems the text is written with a BOM, which is usually done when you write Unicode files. This specific character is the BOM for UTF-16 files, so something in your C# must be writing this file as UTF-16.
See http://de.wikipedia.org/wiki/Byte_Order_Mark
As others have said, it is almost certainly a Unicode Byte Order Mark. If you have a look at the actual bytes in the file (not the characters) you can tell which encoding was used to write the file:
UTF-8 -> EF BB BF
UTF-16 BE -> FE FF
UTF-16 LE -> FF FE
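A minimal sketch of that byte check in C# (it only recognizes the three signatures above; the helper name is mine):

static string DetectBom(string path)
{
    // Read up to the first three bytes and compare them to the known BOMs.
    byte[] b = new byte[3];
    int n;
    using (var fs = File.OpenRead(path))
        n = fs.Read(b, 0, 3);
    if (n >= 3 && b[0] == 0xEF && b[1] == 0xBB && b[2] == 0xBF) return "UTF-8";
    if (n >= 2 && b[0] == 0xFE && b[1] == 0xFF) return "UTF-16 BE";
    if (n >= 2 && b[0] == 0xFF && b[1] == 0xFE) return "UTF-16 LE";
    return "no BOM";
}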
Yes, it's quite normal; see Wikipedia. It's an optional character which you simply should handle. So most likely you didn't write the file correctly as ASCII, since a BOM should only appear if the file is encoded as Unicode.
That's a byte order mark indicating it's a UTF-16 encoded text file.
Clearly it's not writing the file in true ASCII; probably your code is simply copying bytes even though they are outside the ASCII range. Can you post your code?
How can I read the text below?
‰€ˆ‡‰�#îõ‘þüŠ ꑯõù ‚†ƒ� -#�ª÷‘þü “‘
ª“îù )øþ¦ùý ¤ª—ùý î‘õ•þø—¤(#•¢þ¢�
ø¤÷¢ù ꑯõù îõ‘þü#^a—ú¤�ö^b•¦øû÷¢ð‘ö
¤�ù ¢�÷©^cˆˆƒ�#‚€� «.: õ¬ø¤Š
›¢øñ#…�…ˆí/Š…/
…€�…Š}TK{^aˆˆƒ�#†/„€€#}BF{#ª“îùû‘ý
î‘õ•þø—¤ý#ª“îùû‘ý î‘õ•þø—¤ý --
�¥õøöû‘#^c
I use this code, but it does not display all the characters:
FileStream fs = new FileStream(open.FileName, FileMode.Open, FileAccess.Read);
System.Text.Encoding enc = System.Text.Encoding.UTF8;
byte[] data = new byte[fs.Length];
fs.Read(data, 0, data.Length);
string text = enc.GetString(data);
and it shows this text:
†‰€ˆ‡‰Â�#îõâ� �˜Ã¾Ã¼Å
ꑯõù ‚†ƒ� -#�ª÷‘þü
“‘ ª“îù )øþ¦ùý
¤ª—ùý î‘õ•þø—¤(#�
�¢þ¢� ø¤÷¢ù ꑯõù
îõ‘þü#^a—ú¤�ö
^b•¦øû÷¢ð‘ö ¤�ù
¢Â�֩^cˆˆƒÂ�#‚₠¬Â� «.:
õ¬ø¤Š›¢øñ#…Â�…ˆÃ/Å
…/ …€Â�…Š}TK{^aˆˆƒÂ�#â€
/„€€#}BF{#ª“îùû ‘ý
î‘õ•þø—¤ý#�
�“îùû‘ý î‘õ
This is a DOS text file,
and its encoding is one of the following:
IBM037
IBM437
IBM500
ASMO-708
DOS-720
ibm737
ibm775
ibm850
ibm852
IBM855
ibm857
IBM00858
IBM860
ibm861
DOS-862
IBM863
IBM864
IBM865
cp866
ibm869
IBM870
windows-874
cp875
shift_jis
gb2312
ks_c_5601-1987
big5
IBM1026
IBM01047
IBM01140
IBM01141
IBM01142
IBM01143
IBM01144
IBM01145
IBM01146
IBM01147
IBM01148
IBM01149
utf-16
unicodeFFFE
windows-1250
windows-1251
Windows-1252
windows-1253
windows-1254
windows-1255
windows-1256
windows-1257
windows-1258
Johab
macintosh
x-mac-japanese
x-mac-chinesetrad
x-mac-korean
x-mac-arabic
x-mac-hebrew
x-mac-greek
x-mac-cyrillic
x-mac-chinesesimp
x-mac-romanian
x-mac-ukrainian
x-mac-thai
x-mac-ce
x-mac-icelandic
x-mac-turkish
x-mac-croatian
utf-32
utf-32BE
x-Chinese-CNS
x-cp20001
x-Chinese-Eten
x-cp20003
x-cp20004
x-cp20005
x-IA5
x-IA5-German
x-IA5-Swedish
x-IA5-Norwegian
us-ascii
x-cp20261
x-cp20269
IBM273
IBM277
IBM278
IBM280
IBM284
IBM285
IBM290
IBM297
IBM420
IBM423
IBM424
x-EBCDIC-KoreanExtended
IBM-Thai
koi8-r
IBM871
IBM880
IBM905
IBM00924
EUC-JP
x-cp20936
x-cp20949
cp1025
koi8-u
iso-8859-1
iso-8859-2
iso-8859-3
iso-8859-4
iso-8859-5
iso-8859-6
iso-8859-7
iso-8859-8
iso-8859-9
iso-8859-13
iso-8859-15
x-Europa
iso-8859-8-i
iso-2022-jp
csISO2022JP
iso-2022-jp
iso-2022-kr
x-cp50227
euc-jp
EUC-CN
euc-kr
hz-gb-2312
GB18030
x-iscii-de
x-iscii-be
x-iscii-ta
x-iscii-te
x-iscii-as
x-iscii-or
x-iscii-ka
x-iscii-ma
x-iscii-gu
x-iscii-pa
utf-7
utf-8
To read the file you need to know what encoding is used in it.
If you don't know, you can iterate through all the encodings and see if you can find one that works:
const string FileName = "FileName";

foreach (var encodingInfo in Encoding.GetEncodings())
{
    try
    {
        var encoding = encodingInfo.GetEncoding();
        var text = File.ReadAllText(FileName, encoding);
        // Guard against files shorter than 20 characters.
        Console.WriteLine("{0} - {1}", encodingInfo.Name, text.Substring(0, Math.Min(20, text.Length)));
        // put a breakpoint here and check whether the text is readable
    }
    catch (Exception ex)
    {
        Console.WriteLine("Failed: {0} ({1})", encodingInfo.Name, ex.Message);
    }
}
Disclaimer: this assumes it is a text file and that the file isn't huge.
Well, it looks like you're trying to open a .dat file, which by the looks of it was written as raw bytes.
Try the following code:
File readThis = new File("file directory");
byte[] aByte = new byte[(int) readThis.length()];
FileInputStream fis = new FileInputStream(readThis);
fis.read(aByte);
// Arrays.toString shows the byte values; printing the array directly
// would only show its reference.
System.out.println("Contents: " + Arrays.toString(aByte));
fis.close();
Let me know how it goes :)