I'm trying to read a large file from a disk and report percentage while it's loading. The problem is FileInfo.Length is reporting different size than my Encoding.ASCII.GetBytes().Length.
public void loadList()
{
    string ListPath = InnerConfig.dataDirectory + core.operation[operationID].Operation.Trim() + "/List.txt";
    FileInfo f = new FileInfo(ListPath);
    int bytesLoaded = 0;
    using (FileStream fs = File.Open(ListPath, FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
    using (BufferedStream bs = new BufferedStream(fs))
    using (StreamReader sr = new StreamReader(bs))
    {
        string line;
        while ((line = sr.ReadLine()) != null)
        {
            byte[] array = Encoding.ASCII.GetBytes(line);
            bytesLoaded += array.Length;
        }
    }
    MessageBox.Show(bytesLoaded + "/" + f.Length);
}
The result is
13357/15251
There are 1900 bytes 'missing'. The file contains a list of short strings. Any tips on why it's reporting different sizes? Does it have anything to do with the '\r' and '\n' characters in the file? In addition, I have the following line:
int bytesLoaded = 0;
If the file is, let's say, 1GB, do I have to use 'long' instead? Thank you for your time!
Your intuition is correct; the difference in the reported sizes is due to the newline characters. Per the MSDN documentation on StreamReader.ReadLine:
The string that is returned does not contain the terminating carriage return or line feed.
Depending on the source which created your file, each newline would consist of either one or two characters (most commonly: \r\n on Windows; just \n on Linux).
That said, if your intention is to read the file as a sequence of bytes (without regard to lines), you should use the FileStream.Read method, which avoids the overhead of ASCII encoding (and returns the correct total count):
byte[] array = new byte[1024]; // buffer
int total = 0;
using (FileStream fs = File.Open(ListPath, FileMode.Open,
                                 FileAccess.Read, FileShare.ReadWrite))
{
    int read;
    while ((read = fs.Read(array, 0, array.Length)) > 0)
    {
        total += read;
        // process "array" here, up to index "read"
    }
}
Edit: spender raises an important point about character encodings; your code should only be used on ASCII text files. If your file was written using a different encoding – the most popular today being UTF-8 – then results may be incorrect.
Consider, for example, the three-byte hex sequence E2-98-BA. StreamReader, which uses UTF8Encoding by default, would decode this as a single character, ☺. However, this character cannot be represented in ASCII; thus, calling Encoding.ASCII.GetBytes("☺") would return a single byte corresponding to the ASCII value of the fallback character, ?, thereby leading to loss in character count (as well as incorrect processing of the byte array).
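For illustration, a minimal sketch of that fallback behavior (the character and byte values are the ones from the example above):

using System;
using System.Text;

class AsciiFallbackDemo
{
    static void Main()
    {
        string smiley = "☺"; // U+263A, stored as the three bytes E2-98-BA in UTF-8
        Console.WriteLine(Encoding.UTF8.GetByteCount(smiley)); // 3 bytes in UTF-8
        byte[] ascii = Encoding.ASCII.GetBytes(smiley);        // not representable in ASCII
        Console.WriteLine(ascii.Length);                       // 1 byte
        Console.WriteLine((char)ascii[0]);                     // '?' fallback character
    }
}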
Finally, there is also the possibility of an encoding preamble (such as a Unicode byte order mark) at the beginning of the text file, which would also be stripped by ReadLine, resulting in a further discrepancy of a few bytes.
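If you suspect a preamble, here is a minimal sketch (the path is hypothetical) for checking whether a UTF-8 BOM is present at the start of the file:

using System;
using System.IO;
using System.Linq;
using System.Text;

class BomCheck
{
    static void Main()
    {
        string path = "List.txt"; // hypothetical path
        byte[] preamble = Encoding.UTF8.GetPreamble(); // EF BB BF
        byte[] head = new byte[preamble.Length];
        int read;
        using (var fs = File.OpenRead(path))
            read = fs.Read(head, 0, head.Length);
        bool hasBom = read == preamble.Length && head.SequenceEqual(preamble);
        Console.WriteLine(hasBom
            ? "UTF-8 BOM present: FileInfo.Length counts 3 bytes that ReadLine never returns."
            : "No UTF-8 BOM at the start of the file.");
    }
}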
It's the line endings that get swallowed by ReadLine; it could also be that your source file is in a more verbose encoding than ASCII (perhaps it's UTF-8?).
int.MaxValue is 2147483647, so you're going to run into problems using an int for bytesLoaded if your file is >2GB. Switch to a long. After all, FileInfo.Length is defined as a long.
The ReadLine method removes the trailing line termination character.
The abc.txt file contents are
ABCDEFGHIJ•XYZ
Now, the character shown is fine if I use this code (i.e. seek to position 9):
string filePath = "D:\\abc.txt";
FileStream fs = new FileStream(filePath, FileMode.Open);
StreamReader sr = new StreamReader(fs, new UTF8Encoding(true), true);
sr.BaseStream.Seek(9, SeekOrigin.Begin);
char[] oneChar = new char[1];
sr.Read(oneChar, 0, 1); // read one character into oneChar
MessageBox.Show(oneChar[0].ToString());
But if the seek position is just after that special dot character, then I get a junk character.
So, I get a junk character if I seek to position 11 (i.e. just after the dot position):
sr.BaseStream.Seek(11, SeekOrigin.Begin);
This should give 'X', because the character at the 11th position is X.
I think the file contents are valid UTF-8.
There is also one more thing:
The StreamReader BaseStream length and the StreamReader contents length are different.
MessageBox.Show(sr.BaseStream.Length.ToString());
MessageBox.Show(sr.ReadToEnd().Length.ToString());
Why are StreamReader and sr.BaseStream.Seek() giving junk characters even with UTF-8 encoding?
It is exactly because of UTF-8 that sr.BaseStream is giving junk characters. :)
StreamReader is a relatively "smarter" stream. It understands how strings work, whereas FileStream (i.e. sr.BaseStream) doesn't. FileStream only knows about bytes.
Since your file is encoded in UTF-8 (a variable-length encoding), letters like A, B and C are encoded with 1 byte, but the • character needs 3 bytes. You can get how many bytes a character needs by doing:
Console.WriteLine(Encoding.UTF8.GetByteCount("•"));
So when you move the stream to "the position just after •", you haven't actually moved past the •, you are just on the second byte of it.
The reason why the Lengths are different is similar: StreamReader gives you the number of characters, whereas sr.BaseStream gives you the number of bytes.
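If you do want to land on 'X', a minimal sketch (assuming the file from the question is UTF-8 with a BOM; drop the preamble length if it has none) is to compute the offset in bytes and discard the reader's buffer after seeking:

using System;
using System.IO;
using System.Text;

class SeekByBytes
{
    static void Main()
    {
        string filePath = "D:\\abc.txt"; // path from the question
        using (var fs = new FileStream(filePath, FileMode.Open, FileAccess.Read))
        using (var sr = new StreamReader(fs, new UTF8Encoding(true), true))
        {
            // Byte offset of 'X' = BOM length + byte count of everything before it.
            int offset = Encoding.UTF8.GetPreamble().Length
                       + Encoding.UTF8.GetByteCount("ABCDEFGHIJ•");
            sr.BaseStream.Seek(offset, SeekOrigin.Begin);
            sr.DiscardBufferedData(); // drop anything StreamReader buffered before the seek

            char[] oneChar = new char[1];
            sr.Read(oneChar, 0, 1);
            Console.WriteLine(oneChar[0]); // X
        }
    }
}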
I am merging 3 files, for example, but at the end there are no line breaks between the files...
MemoryStream m = new MemoryStream();
File.OpenRead("c:\file1.txt").CopyTo(m);
File.OpenRead("c:\file2.txt").CopyTo(m);
File.OpenRead("c:\file3.txt").CopyTo(m);
m.Position = 0;
Console.WriteLine(new StreamReader(m).ReadToEnd());
How can I add a line break to a memory stream?
You can write the line break to the stream. You need to decide which one you want. Probably, you want Encoding.Xxx.GetBytes(Environment.NewLine). You also need to decide which encoding to use (which must match the encoding of the other files).
Since the line break string is ASCII, what matters is only the distinction between single-byte encodings and ones that use more; UTF-16 (Encoding.Unicode), for example, uses two bytes per newline character.
If you need to guess, you should probably go with UTF-8 without a BOM.
You can also try a fully text-based approach:
var result = File.ReadAllText(a) + Environment.NewLine + File.ReadAllText(b);
Let me also point out that you need to dispose the streams that you open.
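Putting both points together, a minimal sketch (file names taken from the question, encoding assumed to be UTF-8): each stream is wrapped in a using block so it gets disposed, and a newline is written between the files.

using System;
using System.IO;
using System.Text;

class MergeFiles
{
    static void Main()
    {
        byte[] newline = Encoding.UTF8.GetBytes(Environment.NewLine);
        string[] files = { @"c:\file1.txt", @"c:\file2.txt", @"c:\file3.txt" };

        using (var m = new MemoryStream())
        {
            foreach (string path in files)
            {
                using (FileStream fs = File.OpenRead(path)) // disposed after each copy
                    fs.CopyTo(m);
                m.Write(newline, 0, newline.Length); // line break between files
            }

            m.Position = 0;
            using (var reader = new StreamReader(m))
                Console.WriteLine(reader.ReadToEnd());
        }
    }
}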
Quick and dirty:
MemoryStream m = new MemoryStream();
File.OpenRead("c:\file1.txt").CopyTo(m);
m.WriteByte(0x0A); // this is the ASCII code for \n line feed
// You might want or need \n\r in which case you'd
// need to write 0x0D as well.
File.OpenRead("c:\file2.txt").CopyTo(m);
m.WriteByte(0x0A);
File.OpenRead("c:\file3.txt").CopyTo(m);
m.Position = 0;
Console.WriteLine(new StreamReader(m).ReadToEnd());
But as @usr points out, you really should think about the encoding.
Assuming you know the encoding, for example UTF-8, you can do:
using (var ms = new MemoryStream())
{
    // Do stuff ...
    var newLineBytes = Encoding.UTF8.GetBytes(Environment.NewLine);
    ms.Write(newLineBytes, 0, newLineBytes.Length);
    // Do more stuff ...
}
I have a MySQL dump with some special characters ("Ä, ä, Ö, ö, Ü, ü, ß"). I have to reimport this dump into the latest MySQL version. This breaks the special characters because of the encoding; the dump is not encoded in UTF-8.
Within this dump there are also some binary attachments which should not be overwritten, otherwise the attachments will be broken.
I have to overwrite every special character with the bytes that are valid UTF-8.
I'm currently trying it this way (this changes the ANSI ü to the UTF-8 byte sequence for ü):
newByteArray[y] = 195;
if (bytesFromLine[i] == 252)
{
newByteArray[y + 1] = 188;
}
newByteArray[y + 2] = bytesFromLine[y + 1];
252 displays as 'ü' in Encoding.Default; 195 188 displays as 'ü' in Encoding.UTF8.
Now I need help with searching for these specific characters in the dump file and overwriting those bytes with the right ones. I can't replace every '252' with '195 188' because the attachments would then get broken.
Thanks in advance.
DISCLAIMER: This might corrupt your data. The best way of dealing with this is to get a proper mysqldump from the source database. This solution should only be use when you don't have that option and stuck with a potentially broken dump file.
Assuming all strings in the dump file are in quotes (using single quote ') and quotes can be escaped as \':
INSERT INTO `some_table` VALUES (123, 'this is a string', ...
Not too sure how binary data is represented. That might need more checks; you need to examine your dump file and see if these assumptions are correct.
const char quote = '\'';
const char escape = '\\';
using (var dumpOut = new FileStream("dump_out.txt", FileMode.Create, FileAccess.Write))
using (var dumpIn = new FileStream("dump_in.txt", FileMode.Open, FileAccess.Read))
{
    bool inquotes = false;
    byte previousByte = 0;
    var stringBytes = new List<byte>();
    while (true)
    {
        int readByte = dumpIn.ReadByte();
        if (readByte == -1) break;
        var b = (byte)readByte;
        if (b == quote && previousByte != escape)
        {
            if (inquotes) // closing quote
            {
                var buffer = stringBytes.ToArray();
                stringBytes.Clear();
                byte[] converted = Encoding.Convert(Encoding.Default, Encoding.UTF8, buffer);
                dumpOut.Write(converted, 0, converted.Length);
                dumpOut.WriteByte(b);
            }
            else // opening quote
            {
                dumpOut.WriteByte(b);
            }
            inquotes = !inquotes;
            continue;
        }
        previousByte = b;
        if (inquotes)
            stringBytes.Add(b);
        else
            dumpOut.WriteByte(b);
    }
}
I need some assistance. I am writing a method to read a text file, and if any exception occurs I append a line to the text file, e.g. "**".
So what I need to know is how I can check for that specific line of text without reading every line of the file, like a peek method or something.
Any help would be appreciated.
Thanks in advance.
You can use File.ReadLines in combination with Any:
bool isExcFile = System.IO.File.ReadLines(path).Any(l => l == "**");
The ReadLines and ReadAllLines methods differ as follows: When you use
ReadLines, you can start enumerating the collection of strings before
the whole collection is returned; when you use ReadAllLines, you must
wait for the whole array of strings be returned before you can access
the array. Therefore, when you are working with very large files,
ReadLines can be more efficient.
I have found a solution: the line I have appended to the file will always be the last line in the file, so I created a method to read the last line. See below:
public string ReadLastLine(string path)
{
    string returnValue = "";
    FileStream fs = new FileStream(path, FileMode.Open);
    for (long pos = fs.Length - 2; pos > 0; --pos)
    {
        fs.Seek(pos, SeekOrigin.Begin);
        StreamReader ts = new StreamReader(fs);
        returnValue = ts.ReadToEnd();
        int eol = returnValue.IndexOf("\n");
        if (eol >= 0)
        {
            fs.Close();
            return returnValue.Substring(eol + 1);
        }
    }
    fs.Close();
    return returnValue;
}
You will need to maintain a separate file with indexes (such as comma-delimited byte offsets) of where your special markers are. Then you can read only those indexes and use the Seek method to jump to that point in the file stream.
If your file is relatively small, let's say <50MB, this is overkill. Beyond that you can consider maintaining the index file. You basically have to weigh the cost of an extra IO call (reading the index file) against simply reading each line from the file stream.
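A minimal sketch of that idea (file names and helpers are hypothetical): record the byte offset of each "**" marker as it is appended, and later consult only the index file.

using System;
using System.IO;
using System.Linq;

class MarkerIndex
{
    // Appends the "**" marker to the data file and records its byte offset in the index file.
    static void AppendMarker(string dataPath, string indexPath)
    {
        long offset = File.Exists(dataPath) ? new FileInfo(dataPath).Length : 0;
        File.AppendAllText(dataPath, "**" + Environment.NewLine);
        File.AppendAllText(indexPath, offset + ","); // comma-delimited offsets
    }

    // True if the index file records at least one marker; the data file is never read.
    static bool HasMarker(string indexPath)
    {
        return File.Exists(indexPath) &&
               File.ReadAllText(indexPath)
                   .Split(new[] { ',' }, StringSplitOptions.RemoveEmptyEntries)
                   .Any();
    }

    static void Main()
    {
        AppendMarker("data.txt", "data.idx");     // hypothetical file names
        Console.WriteLine(HasMarker("data.idx")); // True
    }
}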
From what I understand you want to process some files and after the processing find out which files contain the "**" symbol, without reading every line of the file.
If you append the "**" to the end of the file you could do something like:
using (StreamReader sr = new StreamReader(File.OpenRead(fileName)))
{
    sr.BaseStream.Seek(-3, SeekOrigin.End); // assumes the file ends with "**\n"
    string endToken = sr.ReadToEnd();
    if (endToken == "**\n")
    {
        // if needed, go back to start of file:
        sr.BaseStream.Seek(0, SeekOrigin.Begin);
        sr.DiscardBufferedData();
        // do something with the file
    }
}
I am parsing a very large file of records (one per line, each of varying length), and I'd like to keep track of the number of bytes I've read in the file so that I may recover in the event of a failure.
I wrote the following:
using (TextReader myTextReader = CreateTextReader())
{
    string record = myTextReader.ReadLine();
    bytesRead += record.Length;
    ParseRecord(record);
}
However this doesn't work since ReadLine() strips any CR/LF characters in the line. Furthermore, a line may be terminated by either CR, LF, or CRLF characters, which means I can't just add 1 to bytesRead.
Is there an easy way to get the actual line length, or do I write my own ReadLine() method in terms of the granular Read() operations?
Getting the current position of the underlying stream won't help, since the StreamReader will buffer data read from the stream.
Essentially you can't do this without writing your own StreamReader. But do you really need to?
I would simply count the number of lines read.
Of course, this means that to position to a specific line you will need to read N lines rather than simply seeking to an offset, but what's wrong with that? Have you determined that performance will be unacceptable?
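A minimal sketch of that recovery strategy (names are hypothetical; ParseRecord stands in for the question's processing): persist the number of lines processed, then skip that many lines on restart.

using System;
using System.IO;

class LineCountRecovery
{
    // Resumes processing at the first unprocessed line, re-reading (but not
    // re-processing) the lines that were already handled before the failure.
    static void ProcessFrom(string path, int linesAlreadyProcessed)
    {
        using (var reader = new StreamReader(path))
        {
            for (int i = 0; i < linesAlreadyProcessed; i++)
                if (reader.ReadLine() == null) return; // file is shorter than expected

            string line;
            while ((line = reader.ReadLine()) != null)
            {
                // ParseRecord(line); // hypothetical, as in the question
            }
        }
    }
}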
A TextReader reads strings, which are made of characters, which [depending on the encoding] are not the same as bytes.
How about just storing the number of lines read, and skipping that many lines when recovering? I guess it's all about not processing those lines, not necessarily avoiding reading them from the stream.
If you are reading a string, you can use regular expression matches and count the number of characters.
var regex = new Regex("^(.*)$", RegexOptions.Compiled | RegexOptions.Multiline);
var matches = regex.Matches(text);
var count = matches.Count;
for (var matchIndex = 0; matchIndex < count; ++matchIndex)
{
    var match = matches[matchIndex];
    var group = match.Groups[1];
    var value = group.Captures[0].Value;
    Console.WriteLine($"Line {matchIndex + 1} (pos={match.Index}): {value}");
}
Come to think of it, I can use a StreamReader and get the current position of the underlying stream as follows.
using (StreamReader myTextReader = CreateStreamReader())
{
    string record = myTextReader.ReadLine();
    bytesRead = myTextReader.BaseStream.Position;
    ParseRecord(record);
    // ...
}
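Bear in mind that BaseStream.Position reports how far the reader has buffered ahead, not how far you have parsed, so the totals will overshoot. If exact byte counts matter, here is a minimal sketch (assuming a single-byte encoding such as ASCII, and a hypothetical file name) of the "write your own ReadLine() in terms of the granular Read() operations" idea from the question:

using System;
using System.IO;
using System.Text;

class ByteCountingLineReader
{
    // Returns null at end of stream; bytesConsumed includes the CR/LF/CRLF terminator.
    static string ReadLineCountingBytes(Stream s, out long bytesConsumed)
    {
        bytesConsumed = 0;
        var sb = new StringBuilder();
        int b = s.ReadByte();
        if (b == -1) return null;

        while (b != -1)
        {
            bytesConsumed++;
            if (b == '\n') break;                  // LF ends the line
            if (b == '\r')                         // CR, possibly followed by LF
            {
                int next = s.ReadByte();
                if (next == '\n') bytesConsumed++;
                else if (next != -1) s.Position--; // push back the lookahead byte
                break;
            }
            sb.Append((char)b);                    // valid for single-byte encodings only
            b = s.ReadByte();
        }
        return sb.ToString();
    }

    static void Main()
    {
        long total = 0;
        using (var fs = File.OpenRead("records.txt")) // hypothetical file name
        {
            string line;
            long consumed;
            while ((line = ReadLineCountingBytes(fs, out consumed)) != null)
            {
                total += consumed;
                Console.WriteLine(total + " bytes read so far: " + line);
            }
        }
    }
}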