Effective way to find any file's Encoding - c#

Yes, this is a very frequent question, and the matter is vague to me since I don't know much about it.
But I would like a very precise way to find a file's encoding.
As precise as Notepad++ is.

The StreamReader.CurrentEncoding property rarely returns the correct text file encoding for me. I've had greater success determining a file's endianness by analyzing its byte order mark (BOM). If the file does not have a BOM, this cannot determine the file's encoding.
*UPDATED 4/08/2020 to include UTF-32LE detection and return correct encoding for UTF-32BE
/// <summary>
/// Determines a text file's encoding by analyzing its byte order mark (BOM).
/// Defaults to ASCII when detection of the text file's endianness fails.
/// </summary>
/// <param name="filename">The text file to analyze.</param>
/// <returns>The detected encoding.</returns>
public static Encoding GetEncoding(string filename)
{
// Read the BOM
var bom = new byte[4];
using (var file = new FileStream(filename, FileMode.Open, FileAccess.Read))
{
file.Read(bom, 0, 4);
}
// Analyze the BOM
if (bom[0] == 0x2b && bom[1] == 0x2f && bom[2] == 0x76) return Encoding.UTF7;
if (bom[0] == 0xef && bom[1] == 0xbb && bom[2] == 0xbf) return Encoding.UTF8;
if (bom[0] == 0xff && bom[1] == 0xfe && bom[2] == 0 && bom[3] == 0) return Encoding.UTF32; //UTF-32LE
if (bom[0] == 0xff && bom[1] == 0xfe) return Encoding.Unicode; //UTF-16LE
if (bom[0] == 0xfe && bom[1] == 0xff) return Encoding.BigEndianUnicode; //UTF-16BE
if (bom[0] == 0 && bom[1] == 0 && bom[2] == 0xfe && bom[3] == 0xff) return new UTF32Encoding(true, true); //UTF-32BE
// We actually have no idea what the encoding is if we reach this point, so
// you may wish to return null instead of defaulting to ASCII
return Encoding.ASCII;
}
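For illustration, a minimal usage sketch of the helper above (the path is a placeholder, not from the original answer):
// Hypothetical usage of GetEncoding; adjust the path to a real file.
string path = @"C:\temp\example.txt";
Encoding encoding = GetEncoding(path);
using (var reader = new StreamReader(path, encoding))
{
    Console.WriteLine("{0}: {1}", encoding.EncodingName, reader.ReadLine());
}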

The following code works fine for me, using the StreamReader class:
using (var reader = new StreamReader(fileName, defaultEncodingIfNoBom, true))
{
reader.Peek(); // you need this!
var encoding = reader.CurrentEncoding;
}
The trick is the Peek call; otherwise .NET has not done anything (and it has not read the preamble, the BOM). Of course, any other ReadXXX call before checking the encoding works too.
If the file has no BOM, then the defaultEncodingIfNoBom encoding will be used. There is also a StreamReader constructor overload without this argument (in that case, the encoding defaults to UTF8 before any read), but I recommend defining what you consider the default encoding in your context.
I have tested this successfully with files with a BOM for UTF8, UTF16/Unicode (LE & BE) and UTF32 (LE & BE). It does not work for UTF7.
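For illustration, the same approach with a concrete fallback filled in (the choice of ISO-8859-1 is only an assumption for the sketch; pick whatever default suits your context):
// Detect the encoding from the BOM, falling back to an assumed default when none is present.
Encoding defaultEncodingIfNoBom = Encoding.GetEncoding("iso-8859-1");
using (var reader = new StreamReader(fileName, defaultEncodingIfNoBom, detectEncodingFromByteOrderMarks: true))
{
    reader.Peek(); // forces the BOM (if any) to be consumed
    Encoding encoding = reader.CurrentEncoding;
}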

Providing the implementation details for the steps proposed by @CodesInChaos:
1) Check if there is a Byte Order Mark
2) Check if the file is valid UTF8
3) Use the local "ANSI" codepage (ANSI as Microsoft defines it)
Step 2 works because most non-ASCII sequences in codepages other than UTF8 are not valid UTF8. https://stackoverflow.com/a/4522251/867248 explains the tactic in more detail.
using System;
using System.IO;
using System.Text;
// Using encoding from BOM or UTF8 if no BOM found,
// check if the file is valid, by reading all lines
// If decoding fails, use the local "ANSI" codepage
public string DetectFileEncoding(Stream fileStream)
{
var Utf8EncodingVerifier = Encoding.GetEncoding("utf-8", new EncoderExceptionFallback(), new DecoderExceptionFallback());
using (var reader = new StreamReader(fileStream, Utf8EncodingVerifier,
detectEncodingFromByteOrderMarks: true, leaveOpen: true, bufferSize: 1024))
{
string detectedEncoding;
try
{
while (!reader.EndOfStream)
{
var line = reader.ReadLine();
}
detectedEncoding = reader.CurrentEncoding.BodyName;
}
catch (Exception e)
{
// Failed to decode the file using the BOM/UTF-8.
// Assume it's local ANSI
detectedEncoding = "ISO-8859-1";
}
// Rewind the stream
fileStream.Seek(0, SeekOrigin.Begin);
return detectedEncoding;
}
}
[Test]
public void Test1()
{
Stream fs = File.OpenRead(@".\TestData\TextFile_ansi.csv");
var detectedEncoding = DetectFileEncoding(fs);
using (var reader = new StreamReader(fs, Encoding.GetEncoding(detectedEncoding)))
{
// Consume your file
var line = reader.ReadLine();
...

Check this out:
UDE
This is a port of the Mozilla Universal Charset Detector and you can use it like this...
public static void Main(String[] args)
{
string filename = args[0];
using (FileStream fs = File.OpenRead(filename)) {
Ude.CharsetDetector cdet = new Ude.CharsetDetector();
cdet.Feed(fs);
cdet.DataEnd();
if (cdet.Charset != null) {
Console.WriteLine("Charset: {0}, confidence: {1}",
cdet.Charset, cdet.Confidence);
} else {
Console.WriteLine("Detection failed.");
}
}
}
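If you then need a System.Text.Encoding instance rather than just the charset name, something along these lines should work, assuming the detected name is one that .NET recognizes:
// Map the detected charset name to an Encoding (GetEncoding may throw for exotic names).
if (cdet.Charset != null)
{
    Encoding detected = Encoding.GetEncoding(cdet.Charset);
}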

I'd try the following steps:
1) Check if there is a Byte Order Mark
2) Check if the file is valid UTF8
3) Use the local "ANSI" codepage (ANSI as Microsoft defines it)
Step 2 works because most non-ASCII sequences in codepages other than UTF8 are not valid UTF8 (see the sketch below).
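A minimal sketch of those three steps (the BOM table is abbreviated, the method name is illustrative, and Encoding.Default stands in for the local "ANSI" codepage):
public static Encoding DetectEncoding(string path)
{
    byte[] bytes = File.ReadAllBytes(path);

    // 1) Check for a byte order mark (abbreviated; see the fuller BOM table above).
    if (bytes.Length >= 3 && bytes[0] == 0xef && bytes[1] == 0xbb && bytes[2] == 0xbf) return Encoding.UTF8;
    if (bytes.Length >= 2 && bytes[0] == 0xff && bytes[1] == 0xfe) return Encoding.Unicode;          // UTF-16LE
    if (bytes.Length >= 2 && bytes[0] == 0xfe && bytes[1] == 0xff) return Encoding.BigEndianUnicode; // UTF-16BE

    // 2) Check whether the whole file decodes as valid UTF-8.
    try
    {
        var strictUtf8 = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true);
        strictUtf8.GetCharCount(bytes);
        return Encoding.UTF8;
    }
    catch (DecoderFallbackException)
    {
        // 3) Not valid UTF-8: fall back to the local "ANSI" codepage.
        return Encoding.Default;
    }
}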

.NET is not very helpful, but you can try the following algorithm:
try to find the encoding from the BOM (byte order mark)... very likely it will not be found
try parsing with different encodings
Here is the call:
var encoding = FileHelper.GetEncoding(filePath);
if (encoding == null)
throw new Exception("The file encoding is not supported. Please choose one of the following encodings: UTF8/UTF7/iso-8859-1");
Here is the code:
public class FileHelper
{
/// <summary>
/// Determines a text file's encoding by analyzing its byte order mark (BOM) and, if no BOM is found, by trying to parse it with different encodings.
/// Returns null when no encoding can be determined.
/// </summary>
/// <param name="filename">The text file to analyze.</param>
/// <returns>The detected encoding or null.</returns>
public static Encoding GetEncoding(string filename)
{
var encodingByBOM = GetEncodingByBOM(filename);
if (encodingByBOM != null)
return encodingByBOM;
// BOM not found :(, so try to parse characters into several encodings
var encodingByParsingUTF8 = GetEncodingByParsing(filename, Encoding.UTF8);
if (encodingByParsingUTF8 != null)
return encodingByParsingUTF8;
var encodingByParsingLatin1 = GetEncodingByParsing(filename, Encoding.GetEncoding("iso-8859-1"));
if (encodingByParsingLatin1 != null)
return encodingByParsingLatin1;
var encodingByParsingUTF7 = GetEncodingByParsing(filename, Encoding.UTF7);
if (encodingByParsingUTF7 != null)
return encodingByParsingUTF7;
return null; // no encoding found
}
/// <summary>
/// Determines a text file's encoding by analyzing its byte order mark (BOM)
/// </summary>
/// <param name="filename">The text file to analyze.</param>
/// <returns>The detected encoding.</returns>
private static Encoding GetEncodingByBOM(string filename)
{
// Read the BOM
var byteOrderMark = new byte[4];
using (var file = new FileStream(filename, FileMode.Open, FileAccess.Read))
{
file.Read(byteOrderMark, 0, 4);
}
// Analyze the BOM
if (byteOrderMark[0] == 0x2b && byteOrderMark[1] == 0x2f && byteOrderMark[2] == 0x76) return Encoding.UTF7;
if (byteOrderMark[0] == 0xef && byteOrderMark[1] == 0xbb && byteOrderMark[2] == 0xbf) return Encoding.UTF8;
if (byteOrderMark[0] == 0xff && byteOrderMark[1] == 0xfe) return Encoding.Unicode; //UTF-16LE
if (byteOrderMark[0] == 0xfe && byteOrderMark[1] == 0xff) return Encoding.BigEndianUnicode; //UTF-16BE
if (byteOrderMark[0] == 0 && byteOrderMark[1] == 0 && byteOrderMark[2] == 0xfe && byteOrderMark[3] == 0xff) return Encoding.UTF32;
return null; // no BOM found
}
private static Encoding GetEncodingByParsing(string filename, Encoding encoding)
{
var encodingVerifier = Encoding.GetEncoding(encoding.BodyName, new EncoderExceptionFallback(), new DecoderExceptionFallback());
try
{
using (var textReader = new StreamReader(filename, encodingVerifier, detectEncodingFromByteOrderMarks: true))
{
while (!textReader.EndOfStream)
{
textReader.ReadLine(); // in order to increment the stream position
}
// all text parsed ok
return textReader.CurrentEncoding;
}
}
catch (Exception) { } // parsing with this encoding failed
return null;
}
}

Look here for C#:
https://msdn.microsoft.com/en-us/library/system.io.streamreader.currentencoding%28v=vs.110%29.aspx
string path = @"path\to\your\file.ext";
using (StreamReader sr = new StreamReader(path, true))
{
while (sr.Peek() >= 0)
{
Console.Write((char)sr.Read());
}
//Test for the encoding after reading, or at least
//after the first read.
Console.WriteLine("The encoding used was {0}.", sr.CurrentEncoding);
Console.ReadLine();
Console.WriteLine();
}

This seems to work well.
First create a helper method:
private static Encoding TestCodePage(Encoding testCode, byte[] byteArray)
{
try
{
var encoding = Encoding.GetEncoding(testCode.CodePage, EncoderFallback.ExceptionFallback, DecoderFallback.ExceptionFallback);
var a = encoding.GetCharCount(byteArray);
return testCode;
}
catch (Exception e)
{
return null;
}
}
Then create code to test the source. In this case, I've got a byte array I need to get the encoding of:
public static Encoding DetectCodePage(byte[] contents)
{
if (contents == null || contents.Length == 0)
{
return Encoding.Default;
}
return TestCodePage(Encoding.UTF8, contents)
?? TestCodePage(Encoding.Unicode, contents)
?? TestCodePage(Encoding.BigEndianUnicode, contents)
?? TestCodePage(Encoding.GetEncoding(1252), contents) // Western European
?? TestCodePage(Encoding.GetEncoding(28591), contents) // ISO Western European
?? TestCodePage(Encoding.ASCII, contents)
?? TestCodePage(Encoding.Default, contents); // system default (ANSI codepage on .NET Framework)
}
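A possible usage, reading the raw bytes first (the path is a placeholder):
byte[] contents = File.ReadAllBytes(@"C:\temp\input.txt"); // placeholder path
Encoding encoding = DetectCodePage(contents);
string text = encoding.GetString(contents);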

The following is my PowerShell code to determine whether some .cpp, .h or .ml files are encoded in ISO-8859-1 (Latin-1) or UTF-8 without a BOM; if neither, it assumes GB18030. I am Chinese and work in France, and MSVC saves files as Latin-1 on a French computer and as GB on a Chinese computer, so this helps me avoid encoding problems when exchanging source files between my system and my colleagues'.
The approach is simple: if all characters are in the range x00-x7E, then ASCII, UTF-8 and Latin-1 are identical; but if I read a non-ASCII file as UTF-8, the replacement character � shows up, so I then try reading it as Latin-1. In Latin-1, the range \x7F-\xAF is empty, while GB uses the full range x00-xFF, so if I find anything in that range, it is not Latin-1.
The code is written in PowerShell, but it uses .NET, so it is easy to translate into C# or F# (a rough C# sketch follows the script).
$Utf8NoBomEncoding = New-Object System.Text.UTF8Encoding($False)
foreach($i in Get-ChildItem .\ -Recurse -include *.cpp,*.h, *.ml) {
$openUTF = New-Object System.IO.StreamReader -ArgumentList ($i, [Text.Encoding]::UTF8)
$contentUTF = $openUTF.ReadToEnd()
[regex]$regex = '�'
$c=$regex.Matches($contentUTF).count
$openUTF.Close()
if ($c -ne 0) {
$openLatin1 = New-Object System.IO.StreamReader -ArgumentList ($i, [Text.Encoding]::GetEncoding('ISO-8859-1'))
$contentLatin1 = $openLatin1.ReadToEnd()
$openLatin1.Close()
[regex]$regex = '[\x7F-\xAF]'
$c=$regex.Matches($contentLatin1).count
if ($c -eq 0) {
[System.IO.File]::WriteAllLines($i, $contentLatin1, $Utf8NoBomEncoding)
$i.FullName
}
else {
$openGB = New-Object System.IO.StreamReader -ArgumentList ($i, [Text.Encoding]::GetEncoding('GB18030'))
$contentGB = $openGB.ReadToEnd()
$openGB.Close()
[System.IO.File]::WriteAllLines($i, $contentGB, $Utf8NoBomEncoding)
$i.FullName
}
}
}
Write-Host -NoNewLine 'Press any key to continue...';
$null = $Host.UI.RawUI.ReadKey('NoEcho,IncludeKeyDown');
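Since the script only uses .NET types, a rough C# translation of the same heuristic might look like this (an untested sketch; GB18030 and other codepages need the System.Text.Encoding.CodePages provider on .NET Core/5+):
using System.IO;
using System.Text;

// Sketch: guess UTF-8 (no BOM) vs. Latin-1 vs. GB18030, mirroring the script above.
static Encoding GuessEncoding(string path)
{
    // 1) Read as UTF-8; replacement characters (U+FFFD) mean the bytes were not valid UTF-8.
    string asUtf8 = File.ReadAllText(path, Encoding.UTF8);
    if (asUtf8.IndexOf('\uFFFD') < 0)
        return new UTF8Encoding(false);

    // 2) Read as Latin-1; characters in the 0x7F-0xAF range suggest it is not Latin-1.
    string asLatin1 = File.ReadAllText(path, Encoding.GetEncoding("ISO-8859-1"));
    foreach (char c in asLatin1)
    {
        if (c >= '\x7F' && c <= '\xAF')
            return Encoding.GetEncoding("GB18030"); // 3) otherwise assume GB18030
    }
    return Encoding.GetEncoding("ISO-8859-1");
}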

The solution proposed by @nonoandy is really interesting; I have successfully tested it and it seems to work perfectly.
The NuGet package needed is Microsoft.ProgramSynthesis.Detection (version 8.17.0 at the moment).
I suggest using EncodingTypeUtils.GetDotNetName instead of a switch for getting the Encoding instance:
using System.Text;
using Microsoft.ProgramSynthesis.Detection.Encoding;
...
public Encoding? DetectEncoding(Stream stream)
{
try
{
if (stream.CanSeek)
{
// Read from the beginning if possible
stream.Seek(0, SeekOrigin.Begin);
}
// Detect encoding type (enum)
var encodingType = EncodingIdentifier.IdentifyEncoding(stream);
// Get the corresponding encoding name to be passed to System.Text.Encoding.GetEncoding
var encodingDotNetName = EncodingTypeUtils.GetDotNetName(encodingType);
if (!string.IsNullOrEmpty(encodingDotNetName))
{
return Encoding.GetEncoding(encodingDotNetName);
}
}
catch (Exception e)
{
// Handle exception (log, throw, etc...)
}
// In case of error return null or a default value
return null;
}
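A possible usage (the path is a placeholder):
using (var stream = File.OpenRead(@"C:\temp\sample.txt")) // placeholder path
{
    Encoding encoding = DetectEncoding(stream) ?? Encoding.UTF8; // fall back if detection fails
}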

I have tried a few different ways to detect encoding and hit issues with most of them.
I made the following, leveraging a Microsoft NuGet package, and it seems to work for me so far, but it needs a lot more testing.
Most of my testing has been on UTF8, UTF8 with BOM and ANSI.
static void Main(string[] args)
{
var path = Directory.GetCurrentDirectory() + "\\TextFile2.txt";
List<string> contents = File.ReadLines(path, GetEncoding(path)).Where(w => !string.IsNullOrWhiteSpace(w)).ToList();
int i = 0;
foreach (var line in contents)
{
i++;
Console.WriteLine(line);
if (i > 100)
break;
}
}
public static Encoding GetEncoding(string filename)
{
using (var file = new FileStream(filename, FileMode.Open, FileAccess.Read))
{
var detectedEncoding = Microsoft.ProgramSynthesis.Detection.Encoding.EncodingIdentifier.IdentifyEncoding(file);
switch (detectedEncoding)
{
case Microsoft.ProgramSynthesis.Detection.Encoding.EncodingType.Utf8:
return Encoding.UTF8;
case Microsoft.ProgramSynthesis.Detection.Encoding.EncodingType.Utf16Be:
return Encoding.BigEndianUnicode;
case Microsoft.ProgramSynthesis.Detection.Encoding.EncodingType.Utf16Le:
return Encoding.Unicode;
case Microsoft.ProgramSynthesis.Detection.Encoding.EncodingType.Utf32Le:
return Encoding.UTF32;
case Microsoft.ProgramSynthesis.Detection.Encoding.EncodingType.Ascii:
return Encoding.ASCII;
case Microsoft.ProgramSynthesis.Detection.Encoding.EncodingType.Iso88591:
case Microsoft.ProgramSynthesis.Detection.Encoding.EncodingType.Unknown:
case Microsoft.ProgramSynthesis.Detection.Encoding.EncodingType.Windows1252:
default:
return Encoding.Default;
}
}
}

This may be useful (but note that, as mentioned above, CurrentEncoding only reflects the detected encoding after the first read or Peek):
string path = @"address/to/the/file.extension";
using (StreamReader sr = new StreamReader(path))
{
Console.WriteLine(sr.CurrentEncoding);
}

Related

I can't read all Rtf file content

I have an RTF file and I need to read the file for a parser.
The file contains some special characters because it has embedded images.
When I read all text from the file, the content after the special characters cannot be read.
I tried reading the file with ReadAllText using Encoding.UTF8 and Encoding.ASCII.
public class ReadFile
{
public static string GetFileContent(string path)
{
if (!File.Exists(path))
{
throw new FileNotFoundException();
}
else
{
// I also tried
// return File.ReadAllText(path, Encoding.ASCII);
string text = string.Empty;
var fileStream = new FileStream(path, FileMode.Open, FileAccess.Read);
using (var streamReader = new StreamReader(fileStream, Encoding.UTF8))
{
string line;
while ((line = streamReader.ReadLine()) != null)
{
text += line;
}
}
return text;
}
}
}
Actually my result is all the text up to the start of the special characters.
{\rtf1\ansi\ansicpg1252\deff0\deftab720{\fonttbl{\f0\fnil Times New Roman;}{\f1\fnil Arial;}}{\colortbl;\red000\green000\blue000;\red255\green000\blue000;\red128\green128\blue128;}\paperw11905\paperh16837\margl360\margr360\margt360\margb360
\sectd \sectdefaultcl \marglsxn360\margrsxn360\margtsxn360\margbsxn360{ {*\do\dobxpage\dobypage\dodhgt8192\dptxbx{\dptxbxtext\pard\plain {\pict\wmetafile8\picw19499\pich1746\picwgoal1305695\pichgoal116957
\bin342908
The Rtf file is here.
I made it work.
To read the file I used File.ReadAllBytes(path), and in the resulting byte array I replace byte 0 with (nul) and byte 27 with (esc).
public static string GetFileContent(string path)
{
byte[] fileBytes = File.ReadAllBytes(path);
StringBuilder sb = new StringBuilder();
foreach (var b in fileBytes)
{
// handle printable characters
if ((b >= 32) || (b == 10) || (b == 13) || (b == 9)) // lf, cr, tab
sb.Append((char)b);
else
{
// handle control characters
switch (b)
{
case 0: sb.Append("(nul)"); break;
case 27: sb.Append("(esc)"); break;
// etc.
}
}
}
return sb.ToString();
}
I found the help in

Recreate text file from byte array with correct encoding [duplicate]


read unicode string from text file in UWP app

In a Windows 10 app I am trying to read a string from a .txt file and set the text on a RichEditBox:
Code variant 1:
var read = await FileIO.ReadTextAsync(file, Windows.Storage.Streams.UnicodeEncoding.Utf8);
txt.Document.SetText(Windows.UI.Text.TextSetOptions.None, read);
Code variant 2:
var stream = await file.OpenAsync(Windows.Storage.FileAccessMode.ReadWrite);
ulong size = stream.Size;
using (var inputStream = stream.GetInputStreamAt(0))
{
using (var dataReader = new Windows.Storage.Streams.DataReader(inputStream))
{
dataReader.UnicodeEncoding = Windows.Storage.Streams.UnicodeEncoding.Utf8;
uint numBytesLoaded = await dataReader.LoadAsync((uint)size);
string text = dataReader.ReadString(numBytesLoaded);
txt.Document.SetText(Windows.UI.Text.TextSetOptions.FormatRtf, text);
}
}
On some files I get this error: "No mapping for the Unicode character exists in the target multi-byte code page"
I found one solution:
IBuffer buffer = await FileIO.ReadBufferAsync(file);
DataReader reader = DataReader.FromBuffer(buffer);
byte[] fileContent = new byte[reader.UnconsumedBufferLength];
reader.ReadBytes(fileContent);
string text = Encoding.UTF8.GetString(fileContent, 0, fileContent.Length);
txt.Document.SetText(Windows.UI.Text.TextSetOptions.None, text);
But with this code the text shows up as question marks inside rhombi (replacement characters).
How can I read and display the same text files with the correct encoding?
The challenge here is the encoding, and it depends how much accuracy you need for your application.
If you need something fast and simple you can adapt this answer:
public static Encoding GetEncoding(byte[] bom)
{
// Analyze the BOM
if (bom[0] == 0x2b && bom[1] == 0x2f && bom[2] == 0x76) return Encoding.UTF7;
if (bom[0] == 0xef && bom[1] == 0xbb && bom[2] == 0xbf) return Encoding.UTF8;
if (bom[0] == 0xff && bom[1] == 0xfe) return Encoding.Unicode; //UTF-16LE
if (bom[0] == 0xfe && bom[1] == 0xff) return Encoding.BigEndianUnicode; //UTF-16BE
if (bom[0] == 0 && bom[1] == 0 && bom[2] == 0xfe && bom[3] == 0xff) return Encoding.UTF32;
return Encoding.ASCII;
}
async System.Threading.Tasks.Task MyMethod()
{
FileOpenPicker openPicker = new FileOpenPicker();
StorageFile file = await openPicker.PickSingleFileAsync();
IBuffer buffer = await FileIO.ReadBufferAsync(file);
DataReader reader = DataReader.FromBuffer(buffer);
byte[] fileContent = new byte[reader.UnconsumedBufferLength];
reader.ReadBytes(fileContent);
string text = GetEncoding(new byte[4] {fileContent[0], fileContent[1], fileContent[2], fileContent[3] }).GetString(fileContent);
txt.Document.SetText(Windows.UI.Text.TextSetOptions.None, text);
//..
}
If you need something more accurate you should consider porting to UWP one of the .NET ports of the Mozilla charset detector, as already mentioned in this answer.
Please note that the code above is just a sample; it is missing the using statements for types implementing IDisposable and it also should have been written in a more consistent way.
hth
-g
Solution:
1) I made a port of Mozilla Universal Charset Detector to UWP (added to Nuget)
ICharsetDetector cdet = new CharsetDetector();
cdet.Feed(fileContent, 0, fileContent.Length);
cdet.DataEnd();
2) Nuget library Portable.Text.Encoding
if (cdet.Charset != null)
{
string text = Portable.Text.Encoding.GetEncoding(cdet.Charset).GetString(fileContent, 0, fileContent.Length);
}
That's all. Now the encodings (including cp1251 and cp1252) work well.
StorageFile file = await StorageFile.GetFileFromApplicationUriAsync(new Uri("ms-appx:///Assets/FontFiles/" + fileName));
using (var inputStream = await file.OpenReadAsync())
using (var classicStream = inputStream.AsStreamForRead())
using (var streamReader = new StreamReader(classicStream))
{
while (streamReader.Peek() >= 0)
{
string line = streamReader.ReadLine();
}
}

Not able to read special character "£" using Streamreader in c#

I am trying to read a character (£) from a text file, using the following code.
public static List<string> ReadAllLines(string path, bool discardEmptyLines, bool doTrim)
{
var retVal = new List<string>();
if (string.IsNullOrEmpty(path) || !File.Exists(path)) {
Comm.File.Log.LogError("ReadAllLines", string.Format("Could not load file: {0}", path));
return retVal;
}
//StreamReader sr = null;
StreamReader sr = new StreamReader(path, Encoding.Default);
try {
sr = File.OpenText(path);
while (sr.Peek() >= 0) {
var line = sr.ReadLine();
if (discardEmptyLines && (line == null || string.IsNullOrEmpty(line.Trim()))) {
continue;
}
if (line != null) {
retVal.Add(doTrim ? line.Trim() : line);
}
}
}
catch (Exception ex) {
Comm.File.Log.LogGeneralException("ReadAllLines", ex);
}
finally {
if (sr != null) {
sr.Close();
}
}
return retVal;
}
But my code is not reading £ correctly; it reads the character as �. Please guide me on what needs to be done to read this special character.
Thanks in advance.
The file you are reading is not encoded the same as Encoding.Default. It is likely UTF-8. Try using UTF-8 for this particular file. For more generic usage, you should see Determining the Encoding of a text file.
Try replacing Encoding.Default with Encoding.GetEncoding(437).
Looks like an encoding problem. Try creating your StreamReader with a UTF-8 (or Unicode) encoding instead of the default:
StreamReader sr = new StreamReader(path, Encoding.UTF8);
Encoding information can be provided to a StreamReader in two ways:
1) Save your file using the Save As option and select the appropriate encoding from the dropdown in Windows (see screenshot).
2) If your files are dynamic in nature, use Encoding.GetEncoding() with StreamReader (see the sketch below).
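For the second option, a short sketch, assuming the file is actually in the Windows-1252 ANSI codepage (which maps £ to byte 0xA3):
// Read with an explicit single-byte codepage instead of relying on the default.
using (var sr = new StreamReader(path, Encoding.GetEncoding(1252)))
{
    string content = sr.ReadToEnd(); // £ decodes correctly if the file really is Windows-1252
}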

How can I use the DeflateStream class on one line in a file?

I have a file which contains plaintext mixed in with some compressed text, for example:
Version 01
Maker SomeCompany
l 73
mark
h�22V0P���w�/�+Q0���L)�66□ // This line was compressed using DeflateZLib
endmark
It seems that Microsoft has a solution, the DeflateStream class, but their examples show how to use it on an entire file, whereas I can't figure out how to just use it on one line in my file.
So far I have the following:
bool isDeflate = false;
using (var fs = new FileStream(@"C:\Temp\MyFile.dat", FileMode.Open))
using (var reader = new StreamReader(fs))
{
string line;
while ((line = reader.ReadLine()) != null)
{
if (isDeflate)
{
if (line == "endmark")
{
isDeflate = false;
}
else
{
line = DeflateSomehow(line);
}
}
if (line == "mark")
{
isDeflate = true;
}
Console.WriteLine(line);
}
}
public string DeflateSomehow(string line)
{
// How do I deflate just that string?
}
Since the file is not created by me (we're only reading it in), we have no control over its structure... but I'm not tied down to the code I have right now. If I need to change more of it than simply figuring out how to implement the DeflateSomehow method, then I'm fine with that as well.
A deflate stream works on binary data. An arbitrary binary chunk in the middle of a text file is also known as: a corrupt text file. There is no sane way of decoding this:
you can't read "lines", because there is no definition of a "line" when talking about binary data; any combination of CR/LF/CRLF/etc could occur completely by random in the binary data
you can't read a "string line", because that suggests you are running the data through an Encoding; but since this isn't text data, again: that will simply give you gibberish that cannot be processed (it will have lost data when reading)
Now, the second of these two problems is solvable by reading via the Stream API rather than the StreamReader API, so that you are only ever reading binary; you would then need to look for the line endings yourself, using an Encoding to probe what you can (noting that this isn't as simple as it sounds if you are using multi/variable-byte encodings such as UTF-8).
However, the first of these two problems is inherently not solvable by itself. To do this reliably, you would need some kind of binary framing protocol - which again, does not exist in a text file. It looks like the example is using "mark" and "endmark" - again, there is technically a chance that these would occur at random, but you'll probably get away with it for the 99.999% case. The trick, then, would be to read the entire file manually using Stream and Encoding, looking for "mark" and "endmark" - and stripping the bits that are encoded as text from the bits that are compressed data. Then run the encoded-as-text piece through the correct Encoding.
However! At the point when you are reading binary, then it is simple: you simply buffer the right amount (using whatever framing/sentinel protocol the data is written in), and use something like:
using(var ms = new MemoryStream(bytes))
using(var inflate = new GZipStream(ms, CompressionMode.Decompress))
{
// now read from 'inflate'
}
With the addition of the l 73 marker, and the information that it is ASCII, it becomes a little more viable.
This won't work for me because the data here on SO is already corrupted (posting binary as text does that), but basically something like:
using System;
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;
using System.Text;
using System.Text.RegularExpressions;
class Program
{
static void Main()
{
using (var file = File.OpenRead("my.txt"))
using (var buffer = new MemoryStream())
{
List<string> lines = new List<string>();
string line;
while ((line = ReadToCRLF(file, buffer)) != null)
{
lines.Add(line);
Console.WriteLine(line);
if (line == "mark" && lines.Count >= 2)
{
var match = Regex.Match(lines[lines.Count - 2], "^l ([0-9]+)$");
int bytes;
if (match.Success && int.TryParse(match.Groups[1].Value, out bytes))
{
ReadBytes(file, buffer, bytes);
string inflated = Inflate(buffer);
lines.Add(inflated); // or something similar
Console.WriteLine(inflated);
}
}
}
}
}
static string Inflate(Stream source)
{
using (var deflate = new DeflateStream(source, CompressionMode.Decompress, true))
using (var reader = new StreamReader(deflate, Encoding.ASCII))
{
return reader.ReadToEnd();
}
}
static void ReadBytes(Stream source, MemoryStream buffer, int count)
{
buffer.SetLength(count);
int read, offset = 0;
while (count > 0 && (read = source.Read(buffer.GetBuffer(), offset, count)) > 0)
{
count -= read;
offset += read;
}
if (count != 0) throw new EndOfStreamException();
buffer.Position = 0;
}
static string ReadToCRLF(Stream source, MemoryStream buffer)
{
buffer.SetLength(0);
int next;
bool wasCr = false;
while ((next = source.ReadByte()) >= 0)
{
if(next == 10 && wasCr) { // CRLF
// end of line (minus the CR)
return Encoding.ASCII.GetString(
buffer.GetBuffer(), 0, (int)buffer.Length - 1);
}
buffer.WriteByte((byte)next);
wasCr = next == 13;
}
// end of file
if (buffer.Length == 0) return null;
return Encoding.ASCII.GetString(buffer.GetBuffer(), 0, (int)buffer.Length);
}
}
