How do I count the number of bytes read by TextReader.ReadLine()? - c#

I am parsing a very large file of records (one per line, each of varying length), and I'd like to keep track of the number of bytes I've read in the file so that I may recover in the event of a failure.
I wrote the following:
using (TextReader myTextReader = CreateTextReader())
{
    string record = myTextReader.ReadLine();
    bytesRead += record.Length;
    ParseRecord(record);
}
However this doesn't work since ReadLine() strips any CR/LF characters in the line. Furthermore, a line may be terminated by either CR, LF, or CRLF characters, which means I can't just add 1 to bytesRead.
Is there an easy way to get the actual line length, or do I write my own ReadLine() method in terms of the granular Read() operations?

Getting the current position of the underlying stream won't help, since the StreamReader will buffer data read from the stream.
Essentially you can't do this without writing your own StreamReader. But do you really need to?
I would simply count the number of lines read.
Of course, this means that to position to a specific line you will need to read N lines rather than simply seeking to an offset, but what's wrong with that? Have you determined that performance will be unacceptable?
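If you do decide to write your own reader, a minimal sketch might look like the following. It assumes a UTF-8 (or ASCII) file, where the bytes 0x0D and 0x0A only ever occur as CR and LF, so lines can be split at the byte level before decoding. The class name ByteCountingLineReader and its members are hypothetical, not part of the framework.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper: reads lines byte by byte so the exact number of
// bytes consumed (terminators included) is always known.
public sealed class ByteCountingLineReader
{
    private readonly Stream _stream;
    private int _pushedBack = -1; // one byte of look-ahead, or -1 if none

    public ByteCountingLineReader(Stream stream)
    {
        _stream = stream;
    }

    // Total bytes consumed so far, including CR, LF, or CRLF terminators.
    public long BytesRead { get; private set; }

    private int NextByte()
    {
        int b;
        if (_pushedBack >= 0) { b = _pushedBack; _pushedBack = -1; }
        else { b = _stream.ReadByte(); }
        if (b >= 0) BytesRead++;
        return b;
    }

    // Returns the next line without its terminator, or null at end of stream.
    public string ReadLine()
    {
        int b = NextByte();
        if (b < 0) return null;

        var buffer = new MemoryStream();
        while (b >= 0)
        {
            if (b == '\n') break;
            if (b == '\r')
            {
                // Treat CRLF as one terminator; otherwise un-read the byte.
                int next = NextByte();
                if (next >= 0 && next != '\n')
                {
                    _pushedBack = next;
                    BytesRead--;
                }
                break;
            }
            buffer.WriteByte((byte)b);
            b = NextByte();
        }
        // Assumption: the file is UTF-8 (ASCII is a subset).
        return Encoding.UTF8.GetString(buffer.ToArray());
    }
}
```

After each record is parsed, BytesRead can be saved as a checkpoint; on recovery, seeking the underlying stream to that offset before constructing a new reader should resume at the next record. Wrapping the source in a BufferedStream keeps the per-byte reads cheap.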

A TextReader reads strings, which are made of characters, and (depending on the encoding) a character count isn't the same as a byte count.
How about just storing the number of lines read, and skipping that many lines when recovering? I guess it's all about not processing those lines again, not necessarily about avoiding reading them from the stream.
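A minimal sketch of that line-counting recovery, assuming the caller persists the number of fully processed lines somewhere; the class and method names here are hypothetical:

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class LineCheckpoint
{
    // Yields only the records after the first linesAlreadyProcessed lines.
    // Skipped lines are still read from the reader, just not returned.
    public static IEnumerable<string> RemainingRecords(TextReader reader, long linesAlreadyProcessed)
    {
        string line;
        long lineNumber = 0;
        while ((line = reader.ReadLine()) != null)
        {
            lineNumber++;
            if (lineNumber > linesAlreadyProcessed)
                yield return line;
        }
    }
}
```

Because ReadLine() already handles CR, LF, and CRLF, the checkpoint does not depend on which terminator the file uses.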

If you are reading a string, you can use regular expression matches and count the number of characters.
var regex = new Regex("^(.*)$", RegexOptions.Compiled | RegexOptions.Multiline);
var matches = regex.Matches(text);
var count = matches.Count;
for (var matchIndex = 0; matchIndex < count; ++matchIndex)
{
    var match = matches[matchIndex];
    var group = match.Groups[1];
    var value = group.Captures[0].Value;
    Console.WriteLine($"Line {matchIndex + 1} (pos={match.Index}): {value}");
}

Come to think of it, I can use a StreamReader and get the current position of the underlying stream as follows.
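using (StreamReader myTextReader = CreateStreamReader())
{
    string record = myTextReader.ReadLine();
    // Position is already cumulative, so assign it rather than adding to the total.
    // It reflects how far the reader has buffered, not where this line ended.
    bytesRead = myTextReader.BaseStream.Position;
    ParseRecord(record);
    // ...
}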

Related

c# how to add a line break to a memory stream

I am merging three files, for example, but in the end there are no line breaks between the files:
MemoryStream m = new MemoryStream();
File.OpenRead(@"c:\file1.txt").CopyTo(m);
File.OpenRead(@"c:\file2.txt").CopyTo(m);
File.OpenRead(@"c:\file3.txt").CopyTo(m);
m.Position = 0;
Console.WriteLine(new StreamReader(m).ReadToEnd());
How can I add a line break to a memory stream?
You can write the line break to the stream. You need to decide which one you want; you probably want Encoding.Xxx.GetBytes(Environment.NewLine). You also need to decide which encoding to use (it must match the encoding of the other files).
Since the line break string is ASCII, all that matters is the distinction between encodings that use one byte for ASCII characters and ones that use more. Encoding.Unicode (UTF-16), for example, uses two bytes per newline character.
If you have to guess, you should probably go with UTF-8 without a BOM.
You can also try a fully text-based approach:
var result = File.ReadAllText(a) + Environment.NewLine + File.ReadAllText(b);
Let me also point out that you need to dispose of the streams that you open.
Quick and dirty:
MemoryStream m = new MemoryStream();
File.OpenRead(@"c:\file1.txt").CopyTo(m);
m.WriteByte(0x0A); // this is the ASCII code for \n (line feed)
// You might want or need \r\n, in which case you'd
// need to write 0x0D before the 0x0A.
File.OpenRead(@"c:\file2.txt").CopyTo(m);
m.WriteByte(0x0A);
File.OpenRead(@"c:\file3.txt").CopyTo(m);
m.Position = 0;
Console.WriteLine(new StreamReader(m).ReadToEnd());
But as @usr points out, you really should think about the encoding.
Assuming you know the encoding, for example UTF-8, you can do:
using (var ms = new MemoryStream())
{
    // Do stuff ...
    var newLineBytes = Encoding.UTF8.GetBytes(Environment.NewLine);
    ms.Write(newLineBytes, 0, newLineBytes.Length);
    // Do more stuff ...
}
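Putting the pieces together, a sketch of the whole merge could look like this. It assumes all input files share the encoding passed in and carry no BOM (a BOM would be copied into the middle of the output as raw bytes). FileMerger and Merge are hypothetical names.

```csharp
using System;
using System.IO;
using System.Text;

public static class FileMerger
{
    // Copies each file into one MemoryStream, writing the platform
    // newline (encoded with the given encoding) between files.
    public static MemoryStream Merge(Encoding encoding, params string[] paths)
    {
        byte[] newLine = encoding.GetBytes(Environment.NewLine);
        var merged = new MemoryStream();
        for (int i = 0; i < paths.Length; i++)
        {
            if (i > 0)
                merged.Write(newLine, 0, newLine.Length);

            // Dispose each input stream as soon as it has been copied.
            using (FileStream input = File.OpenRead(paths[i]))
            {
                input.CopyTo(merged);
            }
        }
        merged.Position = 0; // rewind so the caller can read from the start
        return merged;
    }
}
```

A call such as FileMerger.Merge(new UTF8Encoding(false), path1, path2, path3) returns a stream that can be wrapped in a StreamReader; the caller is responsible for disposing it.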

TextReader.Read() not returning correct integer?

So my method should in theory work; I'm just not getting my expected result back.
I have a function that creates a new TextReader, reads in a character (as an int) from my text file and adds it to a list.
The textfile data looks like the following (48 x 30):
111111111111111111111111111111111111111111111111
111111111111111111111111111111111111111111111111
111111111111100000000001111111111000000111111111
111111110000000000000000000000000000000011111111
100000000000000000000000000000000000000001111111
000000000000001111111111111111111111000001111111
100000001111111111111111111112211221000001111111
100000111111122112211221122111111111000001111111
111111111221111111111111111112211110000011111111
111112211111111111111111111111111100000111221111
122111111111111122111100000000000000001111111111
111111111111111111100000000000000000011111111111
111111111111111111000000000000000001112211111111
111111111111221110000001111110000111111111111111
111111111111111100000111112211111122111111111111
111111112211110000001122111111221111111111111111
111122111111000000011111111111111111112211221111
111111110000000011111111112211111111111111111111
111111000000001111221111111111221122111100000011
111111000000011111111111000001111111110000000001
111111100000112211111100000000000000000000000001
111111110000111111100000000000000000000000000011
111111111000011100000000000000000000000011111111
111111111100000000000000111111111110001111111111
111111111110000000000011111111111111111111111111
111111111111100000111111111111111111111111111111
111111111111111111111111111111111111111111111111
111111111111111111111111111111111111111111111111
111111111111111111111111111111111111111111111111
111111111111111111111111111111111111111111111111
My method is as follows:
private void LoadReferenceMap(string FileName)
{
    FileName = Path.Combine(Environment.CurrentDirectory, FileName);
    List<int> ArrayMapValues = new List<int>();
    if (File.Exists(FileName))
    {
        // Open the file for reading
        using (TextReader reader = File.OpenText(FileName))
        {
            for (int i = 0; i < 48; i++)
            {
                for (int j = 0; j < 30; j++)
                {
                    int x = reader.Read();
                    if (x == -1)
                        break;
                    ArrayMapValues.Add(x);
                }
            }
        }
        level.SetFieldMap(ArrayMapValues);
    }
}
In the returned values, once it reaches the end of the first line, Read() returns 13 and then 10 before moving on to the next row. Why?
A different approach that avoids both problems, the conversion of chars to integers and the skipping of Environment.NewLine characters:
private void LoadReferenceMap(string FileName)
{
    List<int> ArrayMapValues = new List<int>();
    if (File.Exists(FileName))
    {
        foreach (string line in File.ReadLines(FileName))
        {
            var lineMap = line.ToCharArray()
                              .Select(x => Convert.ToInt32(x.ToString()));
            ArrayMapValues.AddRange(lineMap);
        }
        level.SetFieldMap(ArrayMapValues);
    }
}
The file is small, so it seems to be convenient to read a line as a string (this removes the Environment.NewLine), process the line converting it to a char array and applying the conversion to integer for each char. Finally the List of integers of a single line could be added to your List of integers for all the file.
I have not inserted any check on the length of a single line (48 chars) and the total number of lines (30) because you say that every file has this format. However, adding a small check on the total lines loaded and their lengths, should be pretty simple.
This is because you need to convert the value you've got to a char, like this:
(char)reader.Read();
After that you can parse it as an int in a number of ways, for example:
int.Parse(((char)reader.Read()).ToString());
More information on MSDN.
As you can see once it reaches the end of the first line Read() returns 13 and then 10 before moving on to the next row?
On Windows, the line break looks like this: \r\n, and not just \n (check the Environment.NewLine property).
The actual text file has line breaks in it. This means that once you have read the first 48 characters, the next thing in the file is a line break. In this case it is a standard Windows newline, which is a carriage return (character 13) followed by a line feed (character 10).
You need to deal with these line breaks in your code somehow. My preferred way of doing this would be the method outlined by Steve above (using File.ReadLines). Alternatively, you could check for the 13/10 character combination at the end of each set of 48 character reads. Note, though, that some systems use just a line feed without the carriage return to indicate new lines. Depending on the source of these files, you may need to handle the different possible line breaks. Using ReadLines lets something else deal with this issue, as would using reader.ReadLine().
If you are also unsure why it is returning 49 instead of 1, then you need to understand character encoding. The file is stored as bytes, which are interpreted by the reading program. In this case you are reading out the values of the characters as integers (which is how .NET stores them internally). You need to convert this to a character; here you can just cast to char (i.e. (char)x), which gives you the char '1'. If you want this as an integer, you would then need to use int.Parse to parse from text into an integer.
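If you would rather keep the character-by-character Read() loop, here is a sketch that handles both issues at once: it skips CR and LF and turns each digit character into its numeric value by subtracting '0'. MapLoader and ReadDigits are hypothetical names, and the sketch assumes the file contains only the digits 0 to 9 plus line breaks.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class MapLoader
{
    // Reads every digit character from the reader as an int (0-9),
    // ignoring carriage returns (13) and line feeds (10).
    public static List<int> ReadDigits(TextReader reader)
    {
        var values = new List<int>();
        int c;
        while ((c = reader.Read()) != -1)
        {
            if (c == '\r' || c == '\n')
                continue; // line break, not map data

            if (c < '0' || c > '9')
                throw new FormatException($"Unexpected character code {c}.");

            values.Add(c - '0'); // '1' is 49 and '0' is 48, so 49 - 48 = 1
        }
        return values;
    }
}
```

This works regardless of whether the file uses \r\n or \n, because both characters are simply skipped.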

reading large file, wrong file size

I'm trying to read a large file from a disk and report percentage while it's loading. The problem is FileInfo.Length is reporting different size than my Encoding.ASCII.GetBytes().Length.
public void loadList()
{
    string ListPath = InnerConfig.dataDirectory + core.operation[operationID].Operation.Trim() + "/List.txt";
    FileInfo f = new FileInfo(ListPath);
    int bytesLoaded = 0;
    using (FileStream fs = File.Open(ListPath, FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
    using (BufferedStream bs = new BufferedStream(fs))
    using (StreamReader sr = new StreamReader(bs))
    {
        string line;
        while ((line = sr.ReadLine()) != null)
        {
            byte[] array = Encoding.ASCII.GetBytes(line);
            bytesLoaded += array.Length;
        }
    }
    MessageBox.Show(bytesLoaded + "/" + f.Length);
}
The result is
13357/15251
About 1,900 bytes are 'missing'. The file contains a list of short strings. Any tips on why it's reporting different file sizes? Does it have anything to do with the '\r' and '\n' characters in the file? In addition, I have the following line:
int bytesLoaded = 0;
If the file is, let's say, 1 GB large, do I have to use 'long' instead? Thank you for your time!
Your intuition is correct; the difference in the reported sizes is due to the newline characters. Per the MSDN documentation on StreamReader.ReadLine:
The string that is returned does not contain the terminating carriage return or line feed.
Depending on the source which created your file, each newline would consist of either one or two characters (most commonly: \r\n on Windows; just \n on Linux).
That said, if your intention is to read the file as a sequence of bytes (without regard to lines), you should use the FileStream.Read method, which avoids the overhead of ASCII encoding and also returns the correct total count:
byte[] array = new byte[1024]; // buffer
long total = 0; // long, so files larger than 2 GB don't overflow
using (FileStream fs = File.Open(ListPath, FileMode.Open,
                                 FileAccess.Read, FileShare.ReadWrite))
{
    int read;
    while ((read = fs.Read(array, 0, array.Length)) > 0)
    {
        total += read;
        // process "array" here, up to index "read"
    }
}
Edit: spender raises an important point about character encodings; your code should only be used on ASCII text files. If your file was written using a different encoding – the most popular today being UTF-8 – then results may be incorrect.
Consider, for example, the three-byte hex sequence E2-98-BA. StreamReader, which uses UTF8Encoding by default, would decode this as a single character, ☺. However, this character cannot be represented in ASCII; thus, calling Encoding.ASCII.GetBytes("☺") would return a single byte corresponding to the ASCII value of the fallback character, ?, thereby leading to loss in character count (as well as incorrect processing of the byte array).
Finally, there is also the possibility of an encoding preamble (such as a Unicode byte order mark) at the beginning of the text file, which StreamReader also strips, resulting in a further discrepancy of a few bytes.
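The E2-98-BA case can be reproduced with a small sketch that decodes UTF-8 bytes and re-encodes the text as ASCII, mirroring what the question's loop does per line. The class and method names are hypothetical.

```csharp
using System;
using System.Text;

public static class EncodingMismatchDemo
{
    // Decodes the bytes as UTF-8, then re-encodes the resulting text as ASCII.
    // Characters outside ASCII are replaced by the fallback character '?'.
    public static byte[] RoundTripThroughAscii(byte[] utf8Bytes)
    {
        string text = Encoding.UTF8.GetString(utf8Bytes);
        return Encoding.ASCII.GetBytes(text);
    }
}
```

Passing new byte[] { 0xE2, 0x98, 0xBA } yields a single byte, 0x3F ('?'): three bytes on disk are counted as one.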
It's the line endings which get swallowed by ReadLine, and could also possibly be because your source file is in a more verbose encoding than ASCII (perhaps it's UTF8?).
int.MaxValue is 2147483647, so you're going to run into problem using an int for bytesLoaded if your file is >2GB. Switch to a long. After all, FileInfo.Length is defined as a long.
The ReadLine method removes the trailing line termination character.

How to add linebreaks to a stream reader if conditions are met

So I have code that needs to check if the file has already been split every 50 characters. 99% of the time it will come to me already split, where each line is 50 characters, however there is an off chance that it may come to me as a single line, and I need to add a linebreak every 50 characters. This file will always come to me as a stream.
Once I have the properly formatted file, I process it as needed.
However, I am uncertain how I can check if the stream is properly formatted.
Here is the code I have to check if the first line is larger than 50 characters (an indicator it may need to be split).
var streamReader = new StreamReader(s);
var firstLineCount = streamReader.ReadLine().Length;
if (firstLineCount > 50)
{
    //code to add line breaks
}
//once the file is good
using (var trackReader = new TrackingTextReader(streamReader))
{
    //do biz logic
}
How can I add linebreaks to a stream reader?
I would add all lines to a List<string>, line by line.
Then check each item in the list (using for, not foreach, because we will be inserting items).
If an item in the list has more than 50 characters:
Insert a new item at the next index of the list using item.Substring(50) (everything after the 50th character).
Then trim the item at the current index using YourList[i] = YourList[i].Substring(0, 50).
A comment someone made also helped here:
You can also create a StreamWriter to write out the Stream you're reading, with the corrections applied.
Then you take the resulting Stream and pass it on to whatever needs it.
You can't write anything to a TextReader, because... it is a reader. The option here is to make a well-formed copy of the data:
private IEnumerable<string> GetWellFormedData(Stream s)
{
    using (var reader = new StreamReader(s))
    {
        while (!reader.EndOfStream)
        {
            var nextLine = reader.ReadLine();
            if (nextLine.Length > 50)
            {
                // break the line into fragments of at most 50 chars and yield each one
                for (var i = 0; i < nextLine.Length; i += 50)
                    yield return nextLine.Substring(i, Math.Min(50, nextLine.Length - i));
            }
            else
                yield return nextLine;
        }
    }
}
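An alternative sketch ignores the existing line breaks entirely and regroups the characters into fixed-width records, so the split and unsplit inputs go through the same path. It assumes every record is exactly the given width (50 in the question); if pre-split lines can be shorter, adjacent records would run together. FixedWidthRecords is a hypothetical name.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class FixedWidthRecords
{
    // Yields records of exactly `width` characters, skipping any CR/LF
    // found in the input. A shorter trailing remainder is yielded last.
    public static IEnumerable<string> Read(TextReader reader, int width)
    {
        var current = new StringBuilder(width);
        int c;
        while ((c = reader.Read()) != -1)
        {
            if (c == '\r' || c == '\n')
                continue; // existing line breaks carry no information here

            current.Append((char)c);
            if (current.Length == width)
            {
                yield return current.ToString();
                current.Clear();
            }
        }
        if (current.Length > 0)
            yield return current.ToString();
    }
}
```

Calling FixedWidthRecords.Read(new StreamReader(s), 50) gives the same sequence whether the stream arrived as one long line or already split.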

Seeking for a line in a text file

I need some assistance. I am writing a method to read a text file, and if any exception occurs I append a line to the text file, e.g. "**".
So what I need to know is how I can check for that specific line of text in the text file without reading every line of the text file, like a peek method or something.
Any help would be appreciated.
Thanks in advance.
You can use File.ReadLines in combination with Any:
bool isExcFile = System.IO.File.ReadLines(path).Any(l => l == "**");
The ReadLines and ReadAllLines methods differ as follows: When you use ReadLines, you can start enumerating the collection of strings before the whole collection is returned; when you use ReadAllLines, you must wait for the whole array of strings be returned before you can access the array. Therefore, when you are working with very large files, ReadLines can be more efficient.
I have found a solution, the line I have appended to the file will always be the last line in the file, so I created a method to read the last line. See below:
public string ReadLastLine(string path)
{
    string returnValue = "";
    FileStream fs = new FileStream(path, FileMode.Open);
    for (long pos = fs.Length - 2; pos > 0; --pos)
    {
        fs.Seek(pos, SeekOrigin.Begin);
        StreamReader ts = new StreamReader(fs);
        returnValue = ts.ReadToEnd();
        int eol = returnValue.IndexOf("\n");
        if (eol >= 0)
        {
            fs.Close();
            return returnValue.Substring(eol + 1);
        }
    }
    fs.Close();
    return returnValue;
}
You will need to maintain a separate file with indexes (such as a comma-delimited list) of where your special markers are. Then you can read only those indexes and use the Seek method to jump to that point in the FileStream.
If your file is relatively small, let's say <50MB, this is overkill. Beyond that, you can consider maintaining the index file. You basically have to weigh the cost of an extra IO call (reading the index file) against that of simply reading each line from the FileStream.
From what I understand you want to process some files and after the processing find out which files contain the "**" symbol, without reading every line of the file.
If you append the "**" to the end of the file you could do something like:
using (StreamReader sr = File.OpenText(fileName))
{
    // 3 bytes = "**" plus a single-byte \n terminator
    sr.BaseStream.Seek(-3, SeekOrigin.End);
    string endToken = sr.ReadToEnd();
    if (endToken == "**\n")
    {
        // if needed, go back to start of file:
        sr.BaseStream.Seek(0, SeekOrigin.Begin);
        sr.DiscardBufferedData();
        // do something with the file
    }
}
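A variation on the same idea, sketched at the byte level so it tolerates a missing terminator as well as \n or \r\n. It assumes the marker is plain ASCII and is the last thing in the file; it does not verify that the marker sits on a line of its own. MarkerCheck and EndsWithMarker are hypothetical names.

```csharp
using System;
using System.IO;
using System.Text;

public static class MarkerCheck
{
    // Returns true if the file ends with the marker, optionally followed
    // by a trailing CR and/or LF. Only the last few bytes are read.
    public static bool EndsWithMarker(string path, string marker)
    {
        byte[] expected = Encoding.ASCII.GetBytes(marker);
        using (FileStream fs = File.OpenRead(path))
        {
            // Marker plus room for a two-byte line terminator.
            int tailLength = (int)Math.Min(fs.Length, expected.Length + 2);
            byte[] tail = new byte[tailLength];
            fs.Seek(-tailLength, SeekOrigin.End);

            int read = 0;
            while (read < tailLength)
            {
                int n = fs.Read(tail, read, tailLength - read);
                if (n == 0) break;
                read += n;
            }

            // Ignore trailing line terminator bytes.
            int end = read;
            while (end > 0 && (tail[end - 1] == '\n' || tail[end - 1] == '\r'))
                end--;

            if (end < expected.Length) return false;
            for (int i = 0; i < expected.Length; i++)
            {
                if (tail[end - expected.Length + i] != expected[i])
                    return false;
            }
            return true;
        }
    }
}
```

Usage would be along the lines of bool isExcFile = MarkerCheck.EndsWithMarker(path, "**"); the cost stays constant no matter how large the file is.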
