How to tell if a file is text-readable in C#

One of the projects on my list is a little text editor.
At one point, you can load all the subdirectories and files in a given directory, and the program will add each as a node in a TreeView.
What I want is to only add the files that are readable by a normal text reader.
This code currently adds it to the tree:
TreeNode navNode = new TreeNode();
navNode.Text = file.Name;
navNode.Tag = file.FullName;
directoryNode.Nodes.Add(navNode);
I know I could easily create an if statement with something like:
if (file.Extension.Equals(".txt"))
but I would have to expand that statement to contain every single extension that it could possibly be.
Is there an easier way to do this? I'm thinking it may have something to do with MIME types or file encoding.

There is no general way of figuring out what type of information is stored in a file.
Even if you know in advance that it is some sort of text, if you don't know what encoding was used to create the file, you may not be able to load it properly.
Note that HTTP gives you a hint about the type of a file via the Content-Type header, but there is no such information on the file system.
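One partial exception: some Unicode text files start with a byte order mark (BOM), which you can sniff to at least recognize those cases. A minimal sketch (the helper name DetectBomEncoding is mine, and a missing BOM tells you nothing):

static Encoding DetectBomEncoding(string path)
{
    // The first 2-3 bytes are enough for the common BOMs.
    byte[] bom = new byte[3];
    int read;
    using (var fs = File.OpenRead(path))
        read = fs.Read(bom, 0, bom.Length);

    if (read >= 3 && bom[0] == 0xEF && bom[1] == 0xBB && bom[2] == 0xBF)
        return Encoding.UTF8;
    if (read >= 2 && bom[0] == 0xFF && bom[1] == 0xFE)
        return Encoding.Unicode;          // UTF-16 little-endian
    if (read >= 2 && bom[0] == 0xFE && bom[1] == 0xFF)
        return Encoding.BigEndianUnicode; // UTF-16 big-endian
    return null; // no BOM: the file may still be text in any encoding
}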

There are a few methods you could use to "best guess" whether or not the file is a text file. Of course, the more encodings you support, the harder this becomes, especially if you plan to support CJK (Chinese, Japanese, Korean) scripts. Let's just start with Encoding.ASCII and Encoding.UTF8 for now.
Fortunately, most non-text files (executables, images, and the like) have a lot of non-parsable characters in their first couple of kilobytes.
What you could do is take a file and scan the first 1-4KB (up to you) and see if any "non-printable" characters come up. This operation shouldn't take much time and will at least give you some certainty of the contents of the file.
public static async Task<bool> IsValidTextFileAsync(string path, int scanLength = 4096)
{
    using (var stream = File.Open(path, FileMode.Open, FileAccess.Read, FileShare.Read))
    using (var reader = new StreamReader(stream, Encoding.UTF8))
    {
        // stream.Length is in bytes; UTF-8 may decode to fewer chars than that
        // (multi-byte sequences, BOM), so treat it only as an upper bound.
        var bufferLength = (int)Math.Min(scanLength, stream.Length);
        var buffer = new char[bufferLength];
        var charsRead = await reader.ReadBlockAsync(buffer, 0, bufferLength);

        for (int i = 0; i < charsRead; i++)
        {
            var c = buffer[i];
            // Allow the control characters that legitimately occur in text.
            if (char.IsControl(c) && c != '\r' && c != '\n' && c != '\t')
                return false;
        }
        return true;
    }
}

My approach, based on @Rubenisme's comment and @Erik's answer:
// Note: All() requires "using System.Linq;".
public static bool IsValidTextFile(string path)
{
    using (var stream = System.IO.File.Open(path, System.IO.FileMode.Open, System.IO.FileAccess.Read, System.IO.FileShare.Read))
    using (var reader = new System.IO.StreamReader(stream, System.Text.Encoding.UTF8))
    {
        var content = reader.ReadToEnd();
        return content.All(c =>     // Are all the characters either a:
            c == (char)10           // Line feed
            || c == (char)13        // Carriage return
            || c == (char)9         // Tab
            || c == (char)11        // Vertical tab
            || !char.IsControl(c)   // Non-control (regular) character
        );
    }
}

A hacky way to do it would be to see if the file contains any of the lower control characters (0-31) that aren't forms of white space (carriage return, tab, vertical tab, line feed, and, just to be safe, null and end of text). If it does, then it is probably binary. If it does not, it probably isn't. I haven't done any testing or anything to see what happens when applying this rule to non-ASCII encodings, so you'd have to investigate further yourself :)
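A minimal sketch of that rule, reading raw bytes so no encoding assumptions are made (the method name is mine, and the whitelist follows the list above):

static bool LooksLikeBinary(string path, int scanLength = 4096)
{
    // Control bytes that are plausible in plain text: tab, LF, VT, CR,
    // plus NUL and ETX "just to be safe", per the rule above.
    byte[] allowed = { 0x09, 0x0A, 0x0B, 0x0D, 0x00, 0x03 };

    byte[] buffer = new byte[scanLength];
    int read;
    using (var fs = File.OpenRead(path))
        read = fs.Read(buffer, 0, buffer.Length);

    // Any other low control byte (0-31) suggests a binary file.
    // Take() / Any() / Contains() need "using System.Linq;".
    return buffer.Take(read).Any(b => b < 32 && !allowed.Contains(b));
}

As the answer says, this is only a heuristic and has not been validated against non-ASCII encodings; you would want to pair it with a BOM check for those.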

Related

Convert a file from Shift-JIS to UTF8 No BOM without re-reading from disk

I am dealing with files in many formats, including Shift-JIS and UTF-8 without a BOM. Using a bit of language knowledge, I can detect whether the files are being interpreted correctly as UTF-8 or Shift-JIS, but if I detect that the file is not of the type I read in, I was wondering if there is a way to just reinterpret my in-memory array without having to re-read the file with a new encoding specified.
Right now, I read in the file assuming Shift-JIS as such:
using (StreamReader sr = new StreamReader(path, Encoding.GetEncoding("shift-jis"), true))
{
    String line = sr.ReadToEnd();
    // Detection must be done AFTER you read from the file. Silly rabbit.
    fileFormatCertain = !sr.CurrentEncoding.Equals(Encoding.GetEncoding("shift-jis"));
    codingFromBOM = sr.CurrentEncoding;
}
and after I do my magic to determine if it is either a known format (has a BOM) or that the data makes sense as Shift-JIS, all is well. If the data is garbage though, then I am re-reading the file via:
using (StreamReader sr = new StreamReader(path, Encoding.UTF8))
{
    String line = sr.ReadToEnd();
}
I am trying to avoid this re-read step and reinterpret the data in memory if possible.
Or is magic already happening and I am needlessly worrying about double I/O access?
Read the file once as raw bytes; you can then decode those same bytes with as many encodings as you like without touching the disk again:

var buf = File.ReadAllBytes(path);
var text = Encoding.UTF8.GetString(buf);
if (text.Contains("\uFFFD")) // Unicode replacement character: the UTF-8 decode failed
{
    text = Encoding.GetEncoding(932).GetString(buf); // code page 932 = Shift-JIS
}

Writing a string to MemoryStream corrupts the input

I have an ASP.net project and want to return a CSV file when an AJAX post is sent (yes it works. See Handle file download from AJAX post).
The special thing is that I want to create the result in a MemoryStream and return it as a FileResult.
My problem is that German umlauts (ä, ö, ü) get corrupted. So here is my code:
public ActionResult Download(FormCollection form)
{
    string[] v = new string[16];
    MemoryStream stream = new MemoryStream();
    StreamWriter writer = new StreamWriter(stream,
        System.Text.Encoding.GetEncoding("Windows-1252"));
    SqlCommand cmd = dbconn.CreateCommand();
    //create SQL command
    while (rs.Read())
    {
        v = new string[16];
        v[0] = rs.GetString("IstAktiv");
        v[1] = rs.GetString("Haus");
        //cache all the values
        ...
        //write cached values
        for (int i = 0; i < v.Length; i++)
        {
            if (i > 0) writer.Write(";");
            writer.Write(v[i]);
            writer.Flush();
        }
        writer.Write("\r\n");
        writer.Flush();
    } //end while rs.Read()
    FileContentResult ret = new FileContentResult(stream.ToArray(), "text/csv");
    ret.FileDownloadName = "Kontakte.csv";
    writer.Close();
    return ret;
} //end method
When I open the resulting file in Excel, the umlauts are converted into something strange. For example, the upper-case letter "Ä" is changed to "�".
Is there any possibility to solve this issue?
Best regards
For Excel to read CSV files correctly, it expects the CSV file to be in the UTF-8 encoding (with a BOM).
So, without a doubt, your StreamWriter would have to be set this way:
StreamWriter writer = new StreamWriter(stream,
    System.Text.Encoding.GetEncoding("UTF-8"));
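If you prefer to make the BOM explicit rather than rely on the name lookup, a UTF8Encoding constructed with its BOM flag set should behave the same way; StreamWriter writes the encoding's preamble when it starts on an empty stream:

// emits the 3-byte UTF-8 BOM at the start of the stream
StreamWriter writer = new StreamWriter(stream,
    new UTF8Encoding(encoderShouldEmitUTF8Identifier: true));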
However, if that doesn't work for you, then it's very likely that it's because the characters are being corrupted before you even get a chance to write them to the stream. You may be facing an encoding conversion problem as you are reading the data from the database.
v = new string[16];
v[0] = rs.GetString("IstAktiv");
v[1] = rs.GetString("Haus");
To validate that, place a breakpoint as you read the values into the 'v' array, and check that the characters still look ok at this step. If they are corrupted, then you know that the problem is between the code and the database, and the writing to the CSV is not the problem.
EDIT: Here is an isolated test case you can use to prove that UTF-8 is the correct encoding to write CSVs. Perhaps you can try that first:
Encoding enc = Encoding.GetEncoding("UTF-8");
using (StreamWriter writer = new StreamWriter(@"d:\test\test.csv", false, enc))
{
    writer.Write(@"""hello ä, ö, ü world""");
}

Advanced TextReader to EndOfFile

I have a TextReader that, in a specific instance, I want to be able to advance to the end of the file quickly, so that other classes holding a reference to this object will get null when they call tr.ReadLine().
This is a large file, so I cannot use TextReader.ReadToEnd(), as it will often lead to an OutOfMemoryException.
I thought I would ask the community if there was a way to SEEK the stream without using TextReader.ReadToEnd(), which returns a string of all the data in the file.
Current method, inefficient.
The following example code is a mock up. Obviously I am not opening a file with an if statement directly following it asking if I want to read to the end.
TextReader tr = new StreamReader("Largefile");
if (needToAdvanceToEndOfFile)
{
    while (tr.ReadLine() != null) { }
}
Desired solution (note this code block contains fake 'concept' methods, or methods that cannot be used due to the risk of an OutOfMemoryException):
TextReader tr = new StreamReader("Largefile");
if (needToAdvanceToEndOfFile)
{
    tr.SeekToEnd(); // A method that does not return anything. This method does not exist.
    // tr.ReadToEnd() is not acceptable, as it can lead to an OutOfMemoryException on a very large file.
}
A possible alternative is to read through the file in bigger chunks using tr.ReadBlock(args).
I poked around ((StreamReader)tr).BaseStream but could not find anything that worked.
As I am new to the community I figured I would see if someone knew the answer off the top of their head.
You have to discard any buffered data if you have read any file content: since data is buffered, you might get content even after you seek the underlying stream to the end. Working example:
StreamReader sr = new StreamReader(fileName);
string sampleLine = sr.ReadLine();
//discard all buffered data and seek to end
sr.DiscardBufferedData();
sr.BaseStream.Seek(0, SeekOrigin.End);
The problem, as mentioned in the documentation, is:
The StreamReader class buffers input from the underlying stream when you call one of the Read methods. If you manipulate the position of the underlying stream after reading data into the buffer, the position of the underlying stream might not match the position of the internal buffer. To reset the internal buffer, call the DiscardBufferedData method.
Use
reader.BaseStream.Seek(0, SeekOrigin.End);
Test:
using (StreamReader reader = new StreamReader(@"Your Large File"))
{
    reader.BaseStream.Seek(0, SeekOrigin.End);
    int read = reader.Read(); // read will be -1 since you are at the end of the stream
}
Edit: Test it with your code:
using (TextReader tr = new StreamReader("C:\\test.txt")) // test.txt is a file that has data and lines
{
    ((StreamReader)tr).BaseStream.Seek(0, SeekOrigin.End);
    string foo = tr.ReadLine();
    Debug.WriteLine(foo ?? "foo is null"); // foo is null
    int read = tr.Read();
    Debug.WriteLine(read); // -1
}

Adding text to beginning and end of file in C#

I have a process which picks up a series of "xml" files. The reason I put xml in quotes is that the text in the files does not have a root element, which makes it invalid XML. In my processing, I want to correct this: open up each file, add a root node at the beginning and end, and then close it up. Here is what I had in mind, but this involves opening the file, reading the entire file, tacking on the nodes, and then writing the entire file out. These files may be more than 20 MB in size.
foreach (FileInfo file in files)
{
    //open the file
    StreamReader sr = new StreamReader(file.FullName);
    // add the opening and closing tags
    string text = "<root>" + sr.ReadToEnd() + "<root>";
    sr.Close();
    // now open the same file for writing
    StreamWriter sw = new StreamWriter(file.FullName, false);
    sw.Write(text);
    sw.Close();
}
Any recommendations?
To avoid holding the whole file in memory, rename the original file, then open it with StreamReader. Then open the original filename with StreamWriter to create a new file.
Write the <root> prefix to the file, then copy data in large-ish chunks from the reader to the writer. When you've transferred all the data, write the closing </root> (note the forward slash if you want it to be XML). Then close both files and delete the renamed original.
char[] buffer = new char[10000];
string renamedFile = file.FullName + ".orig";
File.Move(file.FullName, renamedFile);
using (StreamReader sr = new StreamReader(renamedFile))
using (StreamWriter sw = new StreamWriter(file.FullName, false))
{
    sw.Write("<root>");
    int read;
    while ((read = sr.Read(buffer, 0, buffer.Length)) > 0)
        sw.Write(buffer, 0, read);
    sw.Write("</root>");
}
File.Delete(renamedFile);
20 MB is not terribly much, but when you read it as a string, it will use about 40 MB of memory. That's not terribly much either, but it's processing that you don't need to do. You can handle it as raw bytes to reduce the memory usage, and to avoid decoding and re-encoding the data:
byte[] start = Encoding.UTF8.GetBytes("<root>");
byte[] ending = Encoding.UTF8.GetBytes("</root>");
byte[] data = File.ReadAllBytes(file.FullName);
int bom = (data[0] == 0xEF) ? 3 : 0; // skip a UTF-8 BOM if present
using (FileStream s = File.Create(file.FullName))
{
    if (bom > 0)
    {
        s.Write(data, 0, bom);
    }
    s.Write(start, 0, start.Length);
    s.Write(data, bom, data.Length - bom);
    s.Write(ending, 0, ending.Length);
}
If you need to reduce the memory usage much more, use a second file as Earwicker suggested.
Edit: Added code to handle the BOM (byte order mark).
I can't see any real improvement on this... which is kind of a bummer. Since there's no way to "shift" a file, you'll always have to move the bytes of the entire file to inject anything at the top.
You may find some performance benefit by using raw streams rather than a StreamReader, which has to actually decode the stream as text.
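A sketch of that suggestion, reusing the rename-and-copy pattern from the first answer but with raw FileStreams so nothing is decoded or re-encoded (it assumes the same file variable as the question's loop, and unlike the byte-array answer above it does not preserve a BOM):

byte[] prefix = Encoding.UTF8.GetBytes("<root>");
byte[] suffix = Encoding.UTF8.GetBytes("</root>");
byte[] buffer = new byte[81920];

string renamed = file.FullName + ".orig";
File.Move(file.FullName, renamed);
using (FileStream src = File.OpenRead(renamed))
using (FileStream dst = File.Create(file.FullName))
{
    dst.Write(prefix, 0, prefix.Length);
    int read;
    while ((read = src.Read(buffer, 0, buffer.Length)) > 0)
        dst.Write(buffer, 0, read); // copy the original bytes untouched
    dst.Write(suffix, 0, suffix.Length);
}
File.Delete(renamed);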
If you do not want to do this in C#, it would be easy to handle at the command line or in a batch file.
ECHO ^<root^> > outfile.xml
TYPE temp.xml >> outfile.xml
ECHO ^</root^> >> outfile.xml
This would assume that you have some existing process for getting the data files that this could be hooked into.

C#: Cycle through encodings

I am reading files in various formats and languages, and I am currently using a small encoding library to attempt to detect the proper encoding (http://www.codeproject.com/KB/recipes/DetectEncoding.aspx).
It's pretty good, but it still misses occasionally. (Multilingual files)
Most of my potential users have very little understanding of encoding (the best I can hope for is "it has something to do with characters") and are very unlikely to be able to choose the right encoding in a list, so I would like to let them cycle through different encodings until the right one is found just by clicking on a button.
Display problems? Click here to try a different encoding! (Well that's the concept anyway)
What would be the best way to implement something like that?
Edit: Looks like I didn't express myself clearly enough. By "cycling through the encodings", I don't mean "how do I loop through encodings?"
What I meant was "how do I let the user try different encodings in sequence without reloading the file?"
The idea is more like this: let's say the file is loaded with the wrong encoding and some strange characters are displayed. The user would click a "Next encoding" or "Previous encoding" button, and the string would be converted using a different encoding. The user just needs to keep clicking until the right encoding is found (whatever encoding looks right to the user will do fine). As long as the user can click "next", he has a reasonable chance of solving his problem.
What I have found so far involves converting the string back to bytes using the current encoding, then decoding those bytes into chars with the next encoding, then turning the chars into a string... Doable, but I wonder if there isn't an easier way to do that.
For instance, if there was a method that would read a string and return it using a different encoding, something like "render(string, encoding)".
Thanks a lot for the answers!
Read the file as bytes and then use the Encoding.GetString method.
byte[] data = System.IO.File.ReadAllBytes(path);
Console.WriteLine(Encoding.UTF8.GetString(data));
Console.WriteLine(Encoding.UTF7.GetString(data));
Console.WriteLine(Encoding.ASCII.GetString(data));
So you have to load the file only once. You can apply every encoding to the original bytes of the file. The user can select the correct one, and you can use the result of Encoding.GetEncoding(...).GetString(data) for further processing.
(removed original answer following question update)
For instance, if there was a method that would read a string and return it using a different encoding, something like "render(string, encoding)".
I don't think you can re-use the string data. The fact is: if the encoding was wrong, this string can be considered corrupt. It may very easily contain gibberish among the likely looking characters. In particular, many encodings may forgive the presence/absence of a BOM/preamble, but would you re-encode with it? without it?
If you are happy to risk it (I wouldn't be), you could just re-encode your local string with the last encoding:
// I DON'T RECOMMEND THIS!!!!
byte[] preamble = lastEncoding.GetPreamble(),
       content = lastEncoding.GetBytes(text);
byte[] raw = new byte[preamble.Length + content.Length];
Buffer.BlockCopy(preamble, 0, raw, 0, preamble.Length);
Buffer.BlockCopy(content, 0, raw, preamble.Length, content.Length);
text = nextEncoding.GetString(raw);
In reality, I believe the best you can do is to keep the original byte[] and keep offering different renderings (via different encodings) until they like one. Something like:
using System;
using System.IO;
using System.Text;
using System.Windows.Forms;

class MyForm : Form
{
    [STAThread]
    static void Main()
    {
        Application.EnableVisualStyles();
        Application.Run(new MyForm());
    }

    ComboBox encodings;
    TextBox view;
    Button load, next;
    byte[] data = null;

    void ShowData()
    {
        if (data != null && encodings.SelectedIndex >= 0)
        {
            try
            {
                Encoding enc = Encoding.GetEncoding(
                    (string)encodings.SelectedValue);
                view.Text = enc.GetString(data);
            }
            catch (Exception ex)
            {
                view.Text = ex.ToString();
            }
        }
    }

    public MyForm()
    {
        load = new Button();
        load.Text = "Open...";
        load.Dock = DockStyle.Bottom;
        Controls.Add(load);

        next = new Button();
        next.Text = "Next...";
        next.Dock = DockStyle.Bottom;
        Controls.Add(next);

        view = new TextBox();
        view.ReadOnly = true;
        view.Dock = DockStyle.Fill;
        view.Multiline = true;
        Controls.Add(view);

        encodings = new ComboBox();
        encodings.Dock = DockStyle.Bottom;
        encodings.DropDownStyle = ComboBoxStyle.DropDown;
        encodings.DataSource = Encoding.GetEncodings();
        encodings.DisplayMember = "DisplayName";
        encodings.ValueMember = "Name";
        Controls.Add(encodings);

        // wrap around at the end of the list rather than running past it
        next.Click += delegate {
            encodings.SelectedIndex =
                (encodings.SelectedIndex + 1) % encodings.Items.Count;
        };
        encodings.SelectedValueChanged += delegate { ShowData(); };
        load.Click += delegate {
            using (OpenFileDialog dlg = new OpenFileDialog())
            {
                if (dlg.ShowDialog(this) == DialogResult.OK)
                {
                    data = File.ReadAllBytes(dlg.FileName);
                    Text = dlg.FileName;
                    ShowData();
                }
            }
        };
    }
}
How about something like this:
public string LoadFile(string path)
{
    stream = GetMemoryStream(path);
    return TryEncoding(Encoding.UTF8);
}

public string TryEncoding(Encoding e)
{
    stream.Seek(0, SeekOrigin.Begin);
    StreamReader reader = new StreamReader(stream, e);
    return reader.ReadToEnd();
}

private MemoryStream stream = null;

private MemoryStream GetMemoryStream(string path)
{
    byte[] buffer = System.IO.File.ReadAllBytes(path);
    return new MemoryStream(buffer);
}
Use LoadFile on your first try; then use TryEncoding subsequently.
Could you let the user enter some words (with "special" characters) that are supposed to occur in the file?
You can search all encodings yourself to see if these words are present.
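A sketch of that idea (the method name and marker words are illustrative, not from the original post): decode the raw bytes once per candidate encoding and keep the encodings under which all of the user's words appear:

// needs using System.IO, System.Linq, System.Text
static Encoding[] GuessEncodings(string path, params string[] expectedWords)
{
    byte[] data = File.ReadAllBytes(path);
    return Encoding.GetEncodings()
        .Select(info => info.GetEncoding())
        .Where(enc => {
            string text = enc.GetString(data);
            return expectedWords.All(word => text.Contains(word));
        })
        .ToArray();
}

// e.g. var candidates = GuessEncodings(path, "Straße", "Äpfel");

Several encodings will usually survive the filter, so this narrows the cycling rather than replacing it.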
Beware of the infamous 'Notepad bug'. It's going to bite you whatever you try, though... You can find some good discussions about encodings and their challenges on MSDN (and other places).
You have to keep the original data as a byte array or MemoryStream, which you can then translate to the new encoding; once you have converted your data to a string, you can't reliably get back to the original representation.
