C#: Cycle through encodings

I am reading files in various formats and languages, and I am currently using a small encoding library to attempt to detect the proper encoding (http://www.codeproject.com/KB/recipes/DetectEncoding.aspx).
It's pretty good, but it still misses occasionally. (Multilingual files)
Most of my potential users have very little understanding of encoding (the best I can hope for is "it has something to do with characters") and are very unlikely to be able to choose the right encoding in a list, so I would like to let them cycle through different encodings until the right one is found just by clicking on a button.
Display problems? Click here to try a different encoding! (Well that's the concept anyway)
What would be the best way to implement something like that?
Edit: Looks like I didn't express myself clearly enough. By "cycling through the encoding", I don't mean "how to loop through encodings?"
What I meant was "how to let the user try different encodings in sequence without reloading the file?"
The idea is more like this: Let's say the file is loaded with the wrong encoding, and some strange characters are displayed. The user would click a "Next encoding" or "Previous encoding" button, and the string would be converted using a different encoding. The user just needs to keep clicking until the right encoding is found (whatever encoding looks good to the user will do fine). As long as the user can click "Next", he has a reasonable chance of solving his problem.
What I have found so far involves converting the string to bytes using the current encoding, converting those bytes using the next encoding into chars, then converting the chars into a string... Doable, but I wonder if there isn't an easier way to do it.
For instance, if there was a method that would read a string and return it using a different encoding, something like "render(string, encoding)".
Thanks a lot for the answers!

Read the file as bytes and then use the Encoding.GetString method.
byte[] data = System.IO.File.ReadAllBytes(path);
Console.WriteLine(Encoding.UTF8.GetString(data));
Console.WriteLine(Encoding.UTF7.GetString(data));
Console.WriteLine(Encoding.ASCII.GetString(data));
So you have to load the file only once. You can apply any encoding to the original bytes of the file. The user can select the correct one, and you can use the result of Encoding.GetEncoding(...).GetString(data) for further processing.

(removed original answer following question update)
For instance, if there was a method
that would read a string and return it
using a different encoding, something
like "render(string, encoding)".
I don't think you can re-use the string data. The fact is: if the encoding was wrong, the string can be considered corrupt. It may very easily contain gibberish among the likely-looking characters. In particular, many encodings may forgive the presence/absence of a BOM/preamble, but would you re-encode with it? Without it?
If you are happy to risk it (I wouldn't be), you could just re-encode your local string with the last encoding:
// I DON'T RECOMMEND THIS!!!!
byte[] preamble = lastEncoding.GetPreamble(),
       content = lastEncoding.GetBytes(text);
byte[] raw = new byte[preamble.Length + content.Length];
Buffer.BlockCopy(preamble, 0, raw, 0, preamble.Length);
Buffer.BlockCopy(content, 0, raw, preamble.Length, content.Length);
text = nextEncoding.GetString(raw);
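To make the risk concrete, here is a small sketch (the byte values are chosen purely for illustration): decoding with the wrong encoding replaces invalid sequences with U+FFFD, so re-encoding can never restore the original bytes.

```csharp
using System;
using System.Text;

class RoundTripDemo {
    static void Main() {
        // 0xFF is invalid as a UTF-8 lead byte; decoding replaces it with U+FFFD.
        byte[] original = { 0x41, 0xFF, 0x42 };           // 'A', <invalid>, 'B'
        string decoded = Encoding.UTF8.GetString(original);
        byte[] roundTrip = Encoding.UTF8.GetBytes(decoded);
        // The replacement character re-encodes as EF BF BD, not FF:
        Console.WriteLine(BitConverter.ToString(original));  // 41-FF-42
        Console.WriteLine(BitConverter.ToString(roundTrip)); // 41-EF-BF-BD-42
    }
}
```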
In reality, I believe the best you can do is to keep the original byte[] - keep offering different renderings (via different encodings) until they like one. Something like:
using System;
using System.IO;
using System.Text;
using System.Windows.Forms;
class MyForm : Form {
    [STAThread]
    static void Main() {
        Application.EnableVisualStyles();
        Application.Run(new MyForm());
    }
    ComboBox encodings;
    TextBox view;
    Button load, next;
    byte[] data = null;
    void ShowData() {
        if (data != null && encodings.SelectedIndex >= 0) {
            try {
                Encoding enc = Encoding.GetEncoding(
                    (string)encodings.SelectedValue);
                view.Text = enc.GetString(data);
            } catch (Exception ex) {
                view.Text = ex.ToString();
            }
        }
    }
    public MyForm() {
        load = new Button();
        load.Text = "Open...";
        load.Dock = DockStyle.Bottom;
        Controls.Add(load);
        next = new Button();
        next.Text = "Next...";
        next.Dock = DockStyle.Bottom;
        Controls.Add(next);
        view = new TextBox();
        view.ReadOnly = true;
        view.Dock = DockStyle.Fill;
        view.Multiline = true;
        Controls.Add(view);
        encodings = new ComboBox();
        encodings.Dock = DockStyle.Bottom;
        encodings.DropDownStyle = ComboBoxStyle.DropDown;
        encodings.DataSource = Encoding.GetEncodings();
        encodings.DisplayMember = "DisplayName";
        encodings.ValueMember = "Name";
        Controls.Add(encodings);
        next.Click += delegate { encodings.SelectedIndex++; };
        encodings.SelectedValueChanged += delegate { ShowData(); };
        load.Click += delegate {
            using (OpenFileDialog dlg = new OpenFileDialog()) {
                if (dlg.ShowDialog(this) == DialogResult.OK) {
                    data = File.ReadAllBytes(dlg.FileName);
                    Text = dlg.FileName;
                    ShowData();
                }
            }
        };
    }
}

How about something like this:
public string LoadFile(string path)
{
    stream = GetMemoryStream(path);
    return TryEncoding(Encoding.UTF8);
}
public string TryEncoding(Encoding e)
{
    stream.Seek(0, SeekOrigin.Begin);
    StreamReader reader = new StreamReader(stream, e);
    return reader.ReadToEnd();
}
private MemoryStream stream = null;
private MemoryStream GetMemoryStream(string path)
{
    byte[] buffer = System.IO.File.ReadAllBytes(path);
    return new MemoryStream(buffer);
}
Use LoadFile on your first try; then use TryEncoding subsequently.

Could you let the user enter some words (with "special" characters) that are supposed to occur in the file?
You can search all encodings yourself to see if these words are present.
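That idea could be sketched like this (the sample data and keyword are made up; on .NET Core you may also need to register the System.Text.Encoding.CodePages provider to see the full encoding list): decode the raw bytes with each known encoding and keep the ones in which every word the user entered appears.

```csharp
using System;
using System.Linq;
using System.Text;

class EncodingGuesser {
    // Returns the encodings whose decoding of the data contains every keyword.
    public static Encoding[] CandidatesContaining(byte[] data, string[] keywords) {
        return Encoding.GetEncodings()
            .Select(info => info.GetEncoding())
            .Where(enc => {
                string text = enc.GetString(data);
                return keywords.All(word => text.Contains(word));
            })
            .ToArray();
    }
    static void Main() {
        byte[] data = Encoding.GetEncoding("iso-8859-1").GetBytes("café crème");
        foreach (Encoding enc in CandidatesContaining(data, new[] { "café" }))
            Console.WriteLine(enc.WebName); // iso-8859-1 should be among the matches
    }
}
```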

Beware of the infamous 'Notepad bug'. It's going to bite you whatever you try, though... You can find some good discussions about encodings and their challenges on MSDN (and other places).

You have to keep the original data as a byte array or MemoryStream, which you can then translate to the new encoding. Once you have converted your data to a string, you can't reliably return to the original representation.

Related

Convert.ToBase64String throws 'System.OutOfMemoryException' for byte [] (file: large size)

I am trying to convert a byte[] to base64 string format so that I can send that information to a third party. My code is as below:
byte[] ByteArray = System.IO.File.ReadAllBytes(path);
string base64Encoded = System.Convert.ToBase64String(ByteArray);
I am getting the below error:
Exception of type 'System.OutOfMemoryException' was thrown.
Can you help me please?
Update
I just spotted @PanagiotisKanavos' comment pointing to Is there a Base64Stream for .NET?. This does essentially the same thing as my code below attempts to achieve (i.e. allows you to process the file without having to hold the whole thing in memory in one go), but without the overhead/risk of self-rolled code, rather using a standard .NET library method for the job.
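For completeness, a rough sketch of that standard-library route (the file paths and method name are illustrative): a CryptoStream wrapping a ToBase64Transform encodes the data as it streams through, so the whole file is never held in memory at once.

```csharp
using System.IO;
using System.Security.Cryptography;

class Base64Streamer {
    // Streams inputPath into outputPath as Base64 without buffering the whole file.
    public static void EncodeFile(string inputPath, string outputPath) {
        using (var input = File.OpenRead(inputPath))
        using (var output = File.Create(outputPath))
        using (var transform = new ToBase64Transform())
        using (var base64Stream = new CryptoStream(output, transform, CryptoStreamMode.Write)) {
            input.CopyTo(base64Stream); // transformed in small blocks as it is copied
        }
    }
}
```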
Original
The below code will create a new temporary file containing the Base64 encoded version of your input file.
This should have a lower memory footprint, since rather than doing all data at once, we handle it several bytes at a time.
To avoid holding the output in memory, I've pushed that back to a temp file, which is returned. When you later need to use that data for some other process, you'd need to stream it (i.e. so that again you're not consuming all of this data at once).
You'll also notice that I've used WriteLine instead of Write; which will introduce non base64 encoded characters (i.e. the line breaks). That's deliberate, so that if you consume the temp file with a text reader you can easily process it line by line.
However, you can amend per your needs.
void Main()
{
    var inputFilePath = @"c:\temp\bigfile.zip";
    var convertedDataPath = ConvertToBase64TempFile(inputFilePath);
    Console.WriteLine($"Take a look in {convertedDataPath} for your converted data");
}
//inputFilePath = where your source file can be found. This is not impacted by the below code
//bufferSizeInBytesDiv3 = how many bytes to read at a time (divided by 3); the larger this value the more memory is required, but the better you'll find performance. The Div3 part is because we later multiply this by 3 / this ensures we never have to deal with remainders (i.e. since 3 bytes = 4 base64 chars)
public string ConvertToBase64TempFile(string inputFilePath, int bufferSizeInBytesDiv3 = 1024)
{
    var tempFilePath = System.IO.Path.GetTempFileName();
    using (var fileStream = File.Open(inputFilePath, FileMode.Open))
    {
        using (var reader = new BinaryReader(fileStream))
        {
            using (var writer = new StreamWriter(tempFilePath))
            {
                byte[] data;
                while ((data = reader.ReadBytes(bufferSizeInBytesDiv3 * 3)).Length > 0)
                {
                    //NB: using WriteLine rather than Write; so when consuming this content consider removing line breaks (I've used this instead of Write so you can easily stream the data in chunks later)
                    writer.WriteLine(System.Convert.ToBase64String(data));
                }
            }
        }
    }
    return tempFilePath;
}

Memory Issue in string C#

I have a little test program:
public class Test
{
public string Response { get; set; }
}
My console app simply calls the Test class:
class Program
{
static void Main(string[] args)
{
Test t = new Test();
using (StreamReader reader = new StreamReader("C:\\Test.txt"))
{
t.Response = reader.ReadToEnd();
}
t.Response = t.Response.Substring(0, 5);
Console.WriteLine(t.Response);
Console.Read();
}
}
I have approx 60 MB of data in my Test.txt file. When the program executes, it takes a lot of memory because string is immutable. What is a better way to handle this kind of scenario using string?
I know that I can use StringBuilder, but I created this program to replicate a scenario in one of my production applications which uses string.
When I tried GC.Collect(), the memory was released immediately. I am not sure whether I can call GC in code.
Please help. Thanks.
UPDATE:
I think I did not explain it clearly; sorry for the confusion.
I am just reading data from a file to get huge data, as I don't want to create 60MB of data in code.
My pain point is the below line of code, where I have huge data in the Response field.
t.Response = t.Response.Substring(0, 5);
You could limit your reads to a block of bytes (a buffer). Loop through, read the next block into your buffer, and write that buffer out. This prevents a large chunk of data being stored in memory.
using (StreamReader reader = new StreamReader(@"C:\Test.txt", true))
{
    char[] buffer = new char[1024];
    int charsRead;
    while ((charsRead = reader.ReadBlock(buffer, 0, buffer.Length)) > 0)
    {
        Console.Write(buffer, 0, charsRead);
    }
}
Can you read your file line by line? If so, I would recommend calling:
IEnumerable<string> lines = File.ReadLines(path)
When you iterate this collection using
foreach(string line in lines)
{
// do something with line
}
the collection will be iterated using lazy evaluation. That means the entire contents of the file won't need to be kept in memory while you do something with each line.
StreamReader provides just the version of Read that you are looking for - Read(Char[], Int32, Int32) - which lets you pick out the first characters of the stream. Alternatively you can read char-by-char with the regular StreamReader.Read() until you decide that you have enough.
var textBuffer = new char[5];
reader.Read(textBuffer, 0, 5); // TODO: check if it actually read enough
t.Response = new string(textBuffer);
Note that if you know the encoding of the stream, you may use lower-level reading into a byte array and use the System.Text.Encoding classes to construct strings with the encoding yourself instead of relying on StreamReader.
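A minimal sketch of that lower-level route (the file path, helper name, and byte budget are all made up for illustration). Note that a fixed byte budget can split a multi-byte sequence at the boundary; a Decoder would handle that incrementally.

```csharp
using System;
using System.IO;
using System.Text;

class FirstChars {
    // Decodes at most maxBytes from the start of a file with a known encoding.
    public static string ReadPrefix(string path, Encoding encoding, int maxBytes) {
        using (var stream = File.OpenRead(path)) {
            var buffer = new byte[maxBytes];
            int read = stream.Read(buffer, 0, buffer.Length);
            return encoding.GetString(buffer, 0, read);
        }
    }
    static void Main() {
        string path = Path.GetTempFileName();  // illustrative file
        File.WriteAllBytes(path, Encoding.UTF8.GetBytes("Hello, world!"));
        Console.WriteLine(ReadPrefix(path, Encoding.UTF8, 5)); // prints "Hello"
    }
}
```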

Binary file read pre-defined number of Byte/ Bytes using C#

Scenario
I have a binary file which is an output from a certain system. The vendor has provided us with a description of the file encoding. It's very complicated, because the encoding follows a certain methodology. For example, the first Byte is ISO coded; we need to decode it first, and if the value matches the provided list then it has some meaning. The next 15 Bytes are also ISO encoded; we need to decode and compare them. Similarly, after a certain position, a few Bytes are binary encoded, and so on and so forth.
Action so far
I will be using a C# WinForms application. So far I have looked at various documents, and all point to the FileStream/BinaryReader combination, since my file sizes are in the range of 1G to 1.8G. I cannot put the whole file in a Byte[] either.
Problem
I am facing an issue reading the file Byte by Byte. According to the above scenario, first I need to read only 1 Byte, then 15 Bytes, then 10 Bytes, and so on and so forth. How do I accomplish this? Thanks in advance for your help.
BinaryReader is the way to go; as it uses a stream, the memory usage will be low.
Now you can do something like below :
internal struct MyHeader
{
    public byte FirstByte;
    // etc
}
internal class MyFormat
{
    private readonly string _fileName;
    private MyFormat(string fileName)
    {
        _fileName = fileName;
    }
    public MyHeader Header { get; private set; }
    public string FileName
    {
        get { return _fileName; }
    }
    public static MyFormat FromFileName(string fileName)
    {
        if (fileName == null) throw new ArgumentNullException("fileName");
        // read the header of your file
        var header = new MyHeader();
        using (var reader = new BinaryReader(File.OpenRead(fileName)))
        {
            byte b1 = reader.ReadByte();
            if (b1 != 0xAA)
            {
                // return null or throw an exception
            }
            header.FirstByte = b1;
            // you can also read a block of bytes with a BinaryReader
            var readBytes = reader.ReadBytes(10);
            // etc ... whenever something's wrong return null or throw an exception
        }
        // when you're done reading your header create and return the object
        var myFormat = new MyFormat(fileName);
        myFormat.Header = header;
        // the rest of the object is delivered only when needed, see method below
        return myFormat;
    }
    public object GetBigContent()
    {
        var content = new object();
        // use FileName and Header property to get your big content and return it
        // again, use a BinaryReader with 'using' statement here
        return content;
    }
}
Explanations
Call MyFormat.FromFileName to create one of these objects. Inside it:
you parse the header, whenever an error occurs return null or throw an exception
once your header is parsed you create the object and return it and that's it
Since you just read the header, provide a way for reading the bigger parts of the file.
Pseudo-example:
Use GetBigContent or whatever you want to call it whenever you need to read a large part of it.
Using Header and FileName inside that method you will have everything you need to return a content from this file on-demand.
By using this approach,
you quickly return a valid object by only parsing its header
you do not consume 1.8Gb at first call
you return only what the user needs, on-demand
For your encoding-related stuff, the Encoding class will probably be helpful to you:
http://msdn.microsoft.com/en-us/library/system.text.encoding.aspx
BinaryReader.ReadBytes Method
Reads the specified number of bytes from the current stream into a byte array and advances the current position by that number of bytes.
public virtual byte[] ReadBytes(int count)
http://msdn.microsoft.com/en-us/library/system.io.binaryreader.readbytes(v=vs.110).aspx

How to tell if a file is text-readable in C#

Part of a list of projects I'm doing is a little text-editor.
At one point, you can load all the sub directories and files in a given directory. The program will add each as a node in a TreeView.
What I want the functionality to be is to only add the files that are readable by a normal text reader.
This code currently adds it to the tree:
TreeNode navNode = new TreeNode();
navNode.Text = file.Name;
navNode.Tag = file.FullName;
directoryNode.Nodes.Add(navNode);
I know I could easily create an if statement with something like:
if (file.Extension.Equals(".txt"))
but I would have to expand that statement to contain every single extension that it could possibly be.
Is there an easier way to do this? I'm thinking it may have something to do with the mime types or file encoding.
There is no general way of figuring out the type of information stored in a file.
Even if you know in advance that it is some sort of text, if you don't know what encoding was used to create the file, you may not be able to load it properly.
Note that HTTP gives you some hints about the type of file via the content-type header, but there is no such information on the file system.
There are a few methods you could use to "best guess" whether or not the file is a text file. Of course, the more encodings you support, the harder this becomes, especially if you plan to support CJK (Chinese, Japanese, Korean) scripts. Let's just start with Encoding.ASCII and Encoding.UTF8 for now.
Fortunately, most non-text files (executables, images, and the like) have a lot of non-parsable characters in their first couple of kilobytes.
What you could do is take a file and scan the first 1-4KB (up to you) and see if any "non-printable" characters come up. This operation shouldn't take much time and will at least give you some certainty of the contents of the file.
public static async Task<bool> IsValidTextFileAsync(string path,
    int scanLength = 4096)
{
    using (var stream = File.Open(path, FileMode.Open, FileAccess.Read, FileShare.Read))
    using (var reader = new StreamReader(stream, Encoding.UTF8))
    {
        var bufferLength = (int)Math.Min(scanLength, stream.Length);
        var buffer = new char[bufferLength];
        // Multi-byte sequences decode to fewer chars than the bytes read, so
        // the number of chars returned may legitimately be smaller than bufferLength.
        var charsRead = await reader.ReadBlockAsync(buffer, 0, bufferLength);
        for (int i = 0; i < charsRead; i++)
        {
            var c = buffer[i];
            if (char.IsControl(c))
                return false;
        }
        return true;
    }
}
My approach, based on @Rubenisme's comment and @Erik's answer:
public static bool IsValidTextFile(string path)
{
    using (var stream = System.IO.File.Open(path, System.IO.FileMode.Open, System.IO.FileAccess.Read, System.IO.FileShare.Read))
    using (var reader = new System.IO.StreamReader(stream, System.Text.Encoding.UTF8))
    {
        var text = reader.ReadToEnd();
        return text.All(c =>      // Are all the characters either a:
            c == (char)10         // Line feed
            || c == (char)13      // Carriage return
            || c == (char)9       // Tab
            || !char.IsControl(c) // Non-control (regular) character
        );
    }
}
A hacky way to do it would be to see if the file contains any of the lower control characters (0-31) that aren't forms of white space (carriage return, tab, vertical tab, line feed, and, just to be safe, null and end-of-text). If it does, then it is probably binary; if it does not, it probably isn't. I haven't done any testing to see what happens when applying this rule to non-ASCII encodings, so you'd have to investigate further yourself :)
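That rule might be sketched like this (the exact whitelist of control bytes is an assumption based on the description above, and the sample inputs are made up):

```csharp
using System;
using System.Linq;

class BinarySniffer {
    // Control bytes below 32 that we still consider "texty":
    // 0 = NUL, 3 = ETX (end of text), 9 = tab, 10 = LF, 11 = vertical tab, 13 = CR
    static readonly byte[] AllowedControls = { 0, 3, 9, 10, 11, 13 };

    public static bool LooksLikeText(byte[] data) {
        // Reject the file if any low control byte outside the whitelist appears.
        return data.All(b => b >= 32 || AllowedControls.Contains(b));
    }

    static void Main() {
        Console.WriteLine(LooksLikeText(new byte[] { 72, 105, 10 })); // "Hi\n" -> True
        Console.WriteLine(LooksLikeText(new byte[] { 0x7F, 1, 2 }));  // -> False
    }
}
```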

Determining text file encoding schema

I am trying to create a method that can detect the encoding scheme of a text file. I know there are many out there, but I know for sure my text file will be either ASCII, UTF-8, or UTF-16. I only need to detect these three. Does anyone know a way to do this?
First, open the file in binary mode and read it into memory.
For UTF-8 (or ASCII), do a validation check. You can decode the text using Encoding.GetEncoding("UTF-8", EncoderFallback.ExceptionFallback, DecoderFallback.ExceptionFallback).GetString(bytes) and catch the exception. If you don't get one, the data is valid UTF-8. Here is the code:
private bool detectUTF8Encoding(string filename)
{
    byte[] bytes = File.ReadAllBytes(filename);
    try
    {
        Encoding.GetEncoding("UTF-8", EncoderFallback.ExceptionFallback, DecoderFallback.ExceptionFallback).GetString(bytes);
        return true;
    }
    catch
    {
        return false;
    }
}
For UTF-16, check for the BOM (FE FF or FF FE, depending on byte order).
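The BOM check itself is only a few lines; a sketch (the method name is mine):

```csharp
using System.IO;

class BomDetector {
    // Returns "UTF-16BE", "UTF-16LE", or null if no UTF-16 BOM is present.
    public static string DetectUtf16Bom(string filename) {
        byte[] bom = new byte[2];
        using (var stream = File.OpenRead(filename)) {
            if (stream.Read(bom, 0, 2) < 2) return null; // file too short for a BOM
        }
        if (bom[0] == 0xFE && bom[1] == 0xFF) return "UTF-16BE"; // big-endian
        if (bom[0] == 0xFF && bom[1] == 0xFE) return "UTF-16LE"; // little-endian
        return null;
    }
}
```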
Use the StreamReader to identify the encoding. Note that CurrentEncoding only reflects the detected encoding after the first read.
Example:
using (var r = new StreamReader(filename, Encoding.Default))
{
    richtextBox1.Text = r.ReadToEnd();
    var encoding = r.CurrentEncoding;
}
