My app 'reads' text files line by line and processes the data as/when needed.
The reason I read one line at a time is, although most files are fairly small, some are actually rather large.
So, rather than reading everything into a single string and possibly running out of memory, I read the file line by line.
...
const string filename = /*get filename*/
using (var sr = new StreamReader(filename))
{
string line;
while ((line = await sr.ReadLineAsync().ConfigureAwait(false)) != null)
{
// process 'line'
}
}
...
But looking at the .NET 4.7.2 source for ReadLineAsync(...), a single line could still exhaust all the available memory (for example, a large file with no '\n' or '\r' in it).
The beauty of ReadLineAsync(...) is that it takes care of the encoding and so on.
So my question is, how can I safely read a line up to a certain number of characters?
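One possible direction, as a minimal sketch rather than a definitive answer: wrap the reader and stop as soon as a line grows past a limit. The class name BoundedLineReader and the choice to throw InvalidDataException are assumptions made for illustration; the wrapped StreamReader still does all the decoding.

```csharp
using System;
using System.IO;
using System.Text;
using System.Threading.Tasks;

// Hypothetical helper (not part of .NET): reads lines like ReadLineAsync,
// but refuses to buffer more than maxChars characters for a single line.
public sealed class BoundedLineReader
{
    private readonly TextReader _reader;
    private readonly int _maxChars;
    private readonly char[] _one = new char[1];
    private bool _skipLineFeed; // true right after a '\r' ended a line

    public BoundedLineReader(TextReader reader, int maxChars)
    {
        if (reader == null) throw new ArgumentNullException("reader");
        if (maxChars < 1) throw new ArgumentOutOfRangeException("maxChars");
        _reader = reader;
        _maxChars = maxChars;
    }

    // Returns the next line without its terminator, or null at end of input.
    public async Task<string> ReadLineAsync()
    {
        var sb = new StringBuilder();
        while (true)
        {
            int read = await _reader.ReadAsync(_one, 0, 1).ConfigureAwait(false);
            if (read == 0)
                return sb.Length > 0 ? sb.ToString() : null;

            char c = _one[0];
            bool isSecondHalfOfCrLf = _skipLineFeed && c == '\n';
            _skipLineFeed = false;
            if (isSecondHalfOfCrLf) continue;

            if (c == '\r') { _skipLineFeed = true; return sb.ToString(); }
            if (c == '\n') return sb.ToString();

            if (sb.Length >= _maxChars)
                throw new InvalidDataException(
                    "Line is longer than " + _maxChars + " characters.");
            sb.Append(c);
        }
    }
}
```

It would be used as new BoundedLineReader(new StreamReader(filename), 100000) in place of the plain reader in the loop above. Reading one character per await is simple but not fast; a larger char buffer would be the obvious next step if throughput matters.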
Related
In .NET Core, I have huge text files that need to be converted from Unix to Windows line endings.
Since I can't load a file completely into memory (the files are too big), I read each byte one after the other, and when I encounter an LF, I output CR+LF. This process works, but it takes a long time for huge files. Is there a more efficient way to do this?
I thought about using a StreamReader, but the problem I'm having is that we don't know the source file encoding.
Any idea?
Thank you
Without knowing more about the specific files you're trying to process, I'd probably start off with something like the below and see if that gets me the results I want.
Depending on the specifics of your situation you may be able to do something more efficient, but if you're handling truly large datasets with unstructured text then it's usually a matter of throwing more powerful hardware at the problem if speed is still an issue.
You don't have to specify the Encoding to make use of the StreamReader class. Was there a specific problem with the reader you encountered?
const string inputFilePath = "";
const string outputFilePath = "";
using var sr = new StreamReader(inputFilePath);
using var sw = new StreamWriter(outputFilePath);
string line;
// Buffers each line into memory, but not the newline characters.
while ((line = await sr.ReadLineAsync()) != null)
{
// Write the contents of the string out to the "fixed" file (manually
// specifying the line ending you want).
await sw.WriteAsync(line + "\r\n");
}
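If the encoding really is unknown, an alternative is to stay at the byte level but work in large blocks instead of one byte per call. The sketch below assumes an ASCII-compatible encoding (UTF-8, Latin-1 and similar), where the byte 0x0A can only mean LF; it would corrupt UTF-16 or UTF-32 files. The class and method names are made up for illustration.

```csharp
using System.IO;

public static class LineEndingConverter
{
    // Copies input to output, turning every bare LF (0x0A) into CR LF.
    // Assumption: ASCII-compatible encoding, so 0x0A is always a line feed.
    // An LF that already follows a CR is left alone.
    public static void UnixToWindows(Stream input, Stream output)
    {
        const byte CR = 0x0D, LF = 0x0A;
        var inBuf = new byte[64 * 1024];
        var outBuf = new byte[inBuf.Length * 2]; // worst case: every byte is LF
        byte previous = 0;
        int read;

        while ((read = input.Read(inBuf, 0, inBuf.Length)) > 0)
        {
            int n = 0;
            for (int i = 0; i < read; i++)
            {
                byte b = inBuf[i];
                if (b == LF && previous != CR)
                    outBuf[n++] = CR;
                outBuf[n++] = b;
                previous = b; // remembered across buffer boundaries
            }
            output.Write(outBuf, 0, n);
        }
    }
}
```

With files, the two streams would come from File.OpenRead(inputFilePath) and File.Create(outputFilePath), each in a using block.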
For the following operation:
Open a text file
Search and replace all searching characters with new characters
I'd like to achieve above in c#, here is my code:
using (StreamReader sr = new StreamReader(@"S:\Personal Folders\A\TESTA.txt"))
{
using (StreamWriter sw = new StreamWriter(@"S:\Personal Folders\A\TESTB.txt"))
{
string line;
while ((line = sr.ReadLine())!= null)
{
if (!line.Contains("before"))
{
sw.WriteLine(line);
}
else if (line.Contains("before"))
{
sw.WriteLine(line.Replace("before", "after"));
}
}
}
}
Basically, the above code generates a new file with the desired replacements, but as you can see, it reads each line of the original file and writes it to a new file. This achieves my goal, but I worry it may cause system IO issues because it reads and writes line by line. Also, I cannot read all the lines into an array first and then write them, because the file is large: if I load everything into a string[], replace, and then write the array to the file, I run out of memory.
Is there any way to locate just the specific lines, replace only those, and keep all the rest? Or what is the best way to solve this problem? Thanks
I don't know what IO issue you are worried about, but your code should work ok. You can code more concisely as follows:
using (StreamReader sr = new StreamReader(@"S:\Personal Folders\A\TESTA.txt"))
{
using (StreamWriter sw = new StreamWriter(@"S:\Personal Folders\A\TESTB.txt"))
{
string line; // must be declared outside the while condition
while ((line = sr.ReadLine()) != null)
{
sw.WriteLine(line.Replace("before", "after"));
}
}
}
This will run a bit faster because it searches for "before" only once per line. By default the StreamWriter buffers your writes and does not flush to the disk each time you call WriteLine, and the operating system caches file writes on top of that, so don't worry so much about IO.
In general, what you are doing is correct, possibly followed by some renames to replace the original file. If you do want to replace the original file, you should rename the original file to a temporary name, rename the new file to the original name, and then either leave or delete the original file. You must handle conflicts with your temporary name and errors in all renames.
Consider you are replacing a six character string with a five character string - if you write back to the original file, what will you do with the extra characters? Files are stored on disk as raw bytes of data, there is no "text" file on disk. What if you replace a string with a longer one - you then potentially have to move the entire rest of the file to make room to write the longer line.
You can imagine the file on disk as letters written on graph paper in the boxes. The end of each line is noted by a special character (or characters - in Windows, that is CRLF), the characters fill all the boxes horizontally. If you tried to replace words on the graph paper you would have to erase and re-write lots of letters. Writing on a new sheet will be easiest.
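The rename sequence described above can be sketched roughly as follows. The names FileSwap and SwapIn and the GUID-based backup name are illustrative assumptions; real code would need whatever error handling fits the application.

```csharp
using System;
using System.IO;

public static class FileSwap
{
    // Puts the file at newPath in place of originalPath using renames only.
    // Both paths are assumed to be on the same volume.
    public static void SwapIn(string originalPath, string newPath)
    {
        // A random suffix keeps the temporary name from colliding.
        string backupPath =
            originalPath + "." + Guid.NewGuid().ToString("N") + ".bak";

        File.Move(originalPath, backupPath);      // 1. original -> backup
        try
        {
            File.Move(newPath, originalPath);     // 2. new -> original name
        }
        catch
        {
            File.Move(backupPath, originalPath);  // undo step 1 on failure
            throw;
        }
        File.Delete(backupPath);                  // 3. drop the backup
    }
}
```

For the example in the question this would be called as FileSwap.SwapIn(pathToTESTA, pathToTESTB) after both the reader and the writer have been disposed.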
Well, your approach is basically fine... but I wouldn't check if the line contains the word before... the trade-off is not good enough:
using (StreamReader sr = new StreamReader(@"S:\Personal Folders\A\TESTA.txt"))
{
using (StreamWriter sw = new StreamWriter(@"S:\Personal Folders\A\TESTB.txt"))
{
String line;
while ((line = sr.ReadLine()) != null)
sw.WriteLine(line.Replace("before", "after"));
}
}
Try the following:
else if (line.Contains("before"))
{
sw.WriteLine(line.Replace("before", "after"));
sw.Write(sr.ReadToEnd());
break;
}
I want to remove blank lines from my file; for that I am using the code below.
private void ReadFile(string Address)
{
var tempFileName = Path.GetTempFileName();
try
{
//using (var streamReader = new StreamReader(Server.MapPath("~/Images/") + FileName))
using (var streamReader = new StreamReader(Address))
using (var streamWriter = new StreamWriter(tempFileName))
{
string line;
while ((line = streamReader.ReadLine()) != null)
{
if (!string.IsNullOrWhiteSpace(line))
streamWriter.WriteLine(line);
}
}
File.Copy(tempFileName, Address, true);
}
finally
{
File.Delete(tempFileName);
}
Response.Write("Completed");
}
But the problem is that my file is too large (8 lakh, i.e. 800,000, lines), so it's taking a lot of time. Is there any other way to do it faster?
Instead of doing a ReadLine(), I would do a StreamReader.ReadToEnd() to load the entire file into memory as one string, then do a text.Replace("\n\n", "\n") on it (repeated until no doubled newlines remain, and using "\r\n" pairs if the file has Windows line endings), and then do a streamWriter.Write(text) to the file. That way there is not a lot of thrashing, either memory or disk, going on.
The best solution may well depend on the disk type - SSDs and spinning rust behave differently. Your current approach has the advantage over Steve's answer of being able to do processing (such as encoding text data back as binary) while data is still coming off the disk. (With buffering and background IO, there's a lot of potential asynchrony here.) It's definitely worth trying both approaches. (Obviously your approach uses less memory, too.)
However, there's one aspect of your code which is definitely suboptimal: creating a copy of the results. You don't need to do that. You can use file moves instead which are a lot more efficient, assuming they're all in the same drive. To make sure you don't lose data, you can do two moves and a delete:
Move the old file to a backup filename
Move the new file to the old filename
Delete the backup filename
It looks like this is what File.Replace does for you, which makes it considerably simpler, and also preserves the original metadata.
If something goes wrong after the first move, you're left without the "proper" file from either old or new, but you can detect that and use the backup filename to read next time.
Of course, if this is meant to happen as part of a web request, you may want to do all the processing in a background task - processing 800,000 lines of text is likely to take longer than you really want a web request to take...
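As a rough sketch of the File.Replace variant of the question's code: the temporary file is created next to the original rather than via Path.GetTempFileName(), on the assumption that this keeps both files on the same volume. The ".tmp" and ".bak" suffixes and the class name are illustrative choices only.

```csharp
using System.IO;

public static class BlankLineRemover
{
    // Rewrites the file at path without blank or whitespace-only lines.
    public static void RemoveBlankLines(string path)
    {
        string tempPath = path + ".tmp";   // assumed not to exist already
        string backupPath = path + ".bak"; // assumed free to overwrite

        using (var reader = new StreamReader(path))
        using (var writer = new StreamWriter(tempPath))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                if (!string.IsNullOrWhiteSpace(line))
                    writer.WriteLine(line);
            }
        }

        // Swaps the new contents in and keeps the old file as a backup,
        // instead of copying the whole file a second time.
        File.Replace(tempPath, path, backupPath);
        File.Delete(backupPath);
    }
}
```

Whether the final File.Delete is wanted depends on how long the backup should survive; leaving it out keeps the old version around until the next run.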
I want to read a big TXT file (500 MB).
First I used:
var file = new StreamReader(_filePath).ReadToEnd();
var lines = file.Split(new[] { '\n' });
but it throws an out-of-memory exception. Then I tried to read line by line, but again, after reading around 1.5 million lines, it throws an out-of-memory exception:
using (StreamReader r = new StreamReader(_filePath))
{
while ((line = r.ReadLine()) != null)
_lines.Add(line);
}
or I used
foreach (var l in File.ReadLines(_filePath))
{
_lines.Add(l);
}
but again I received:
An exception of type 'System.OutOfMemoryException' occurred in
mscorlib.dll but was not handled in user code
My machine is a powerful one with 8GB of RAM, so it shouldn't be a problem with my machine.
P.S.: I tried to open this file in Notepad++ and received a 'the file is too big to be opened' error.
Just use File.ReadLines, which returns an IEnumerable<string> and doesn't load all the lines into memory at once.
foreach (var line in File.ReadLines(_filePath))
{
//Don't put "line" into a list or collection.
//Just make your processing on it.
}
The cause of the exception seems to be the growing _lines collection, not reading the big file. You are reading each line and adding it to the _lines collection, which takes memory and causes the out-of-memory exception. You can apply filters so that only the required lines go into _lines.
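A minimal sketch of that filtering idea, assuming the caller can express "required" as a predicate (the helper name LoadMatching is made up for the example):

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class LineFilter
{
    // Streams the file and keeps only the lines the predicate accepts,
    // so memory use grows with the number of matches, not the file size.
    public static List<string> LoadMatching(string path, Func<string, bool> isRequired)
    {
        return File.ReadLines(path).Where(isRequired).ToList();
    }
}
```

For example, LineFilter.LoadMatching(_filePath, l => l.StartsWith("ERROR")) would keep only lines with that prefix; the prefix is of course just a placeholder condition.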
I know this is an old post but Google sent me here in 2021..
Just to emphasize igrimpe's comments above:
I've run into an OutOfMemoryException on StreamReader.ReadLine() recently looping through folders of giant text files.
As igrimpe mentioned, you can sometimes encounter this where your input file exhibits a lack of uniformity in line breaks. If you are looping through a text file and encounter this, double-check your input file for unexpected characters, ASCII-encoded hex or binary strings, etc.
In my case, I split the problematic 60 GB file into 256 MB chunks, had my file iterator stash the problematic text files as part of the exception trap, and later remedied the problem text files by removing the problematic lines.
Edit:
Loading the whole file into memory will cause objects to grow, and .NET will throw OOM exceptions if it cannot allocate enough contiguous memory for an object.
The answer is still the same: you need to stream the file, not read the entire contents. That may require a rearchitecture of your application; however, using IEnumerable<> methods you can stack up business processes in different areas of the application and defer processing.
A "powerful" machine with 8GB of RAM still can't be relied on to hold a 500 MB text file in memory as strings. You don't get all 8GB, as the operating system will be holding some; you can't allocate all memory in .NET; a 32-bit process has a 2GB limit; .NET strings are UTF-16, so mostly-ASCII text roughly doubles in size once decoded; reading the file and then storing the lines holds the data twice; and there is a per-object size overhead.
You can't load the whole thing into memory to process, you will have to stream the file through your processing.
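To illustrate what stacking up IEnumerable<> steps can look like, here is a small sketch with two iterator methods. Both stages (skipping blank lines and splitting on a separator) are invented examples of "business processes"; nothing is read from the source until the final foreach pulls items through.

```csharp
using System.Collections.Generic;

public static class LinePipeline
{
    // Stage 1: drop blank or whitespace-only lines, one at a time.
    public static IEnumerable<string> SkipBlank(IEnumerable<string> lines)
    {
        foreach (var line in lines)
        {
            if (!string.IsNullOrWhiteSpace(line))
                yield return line;
        }
    }

    // Stage 2: split each remaining line into fields.
    public static IEnumerable<string[]> SplitFields(
        IEnumerable<string> lines, char separator)
    {
        foreach (var line in lines)
            yield return line.Split(separator);
    }
}
```

A caller would write something like foreach (var fields in LinePipeline.SplitFields(LinePipeline.SkipBlank(File.ReadLines(_filePath)), ',')) and handle one record per iteration, so only the current line is held in memory.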
You have to count the lines first.
It is slower, but you can read up to 2,147,483,647 lines.
int intNoOfLines = 0;
using (StreamReader oReader = new StreamReader(MyFilePath))
{
while (oReader.ReadLine() != null) intNoOfLines++;
}
string[] strArrLines = new string[intNoOfLines];
int intIndex = 0;
using (StreamReader oReader = new StreamReader(MyFilePath))
{
string strLine;
while ((strLine = oReader.ReadLine()) != null)
{
strArrLines[intIndex++] = strLine;
}
}
For anyone else having this issue:
If you're running out of memory while using StreamReader.ReadLine(), I'd be willing to bet your file doesn't have multiple lines to begin with. You're just assuming it does. It's an easy mistake to make because you can't just open a 10GB file with Notepad.
One time I received a 10GB file from a client that was supposed to be a list of numbers and instead of using '\n' as a separator, he used commas. The whole file was a single line which obviously caused ReadLine() to blow up.
Try reading a few thousand characters from your stream using StreamReader.Read() and see if you can find a '\n'. Odds are you won't.
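That check could look something like the sketch below. The helper name and the 4096-character buffer are arbitrary choices; it treats either '\n' or '\r' as a line break.

```csharp
using System;
using System.IO;

public static class NewlineProbe
{
    // Returns true if a '\n' or '\r' occurs within the first maxChars
    // characters of the reader, false if none is found (or input ends).
    public static bool HasLineBreakWithin(TextReader reader, int maxChars)
    {
        var buffer = new char[4096];
        int remaining = maxChars;

        while (remaining > 0)
        {
            int read = reader.Read(buffer, 0, Math.Min(buffer.Length, remaining));
            if (read == 0)
                return false; // end of input, no line break seen

            if (Array.IndexOf(buffer, '\n', 0, read) >= 0 ||
                Array.IndexOf(buffer, '\r', 0, read) >= 0)
                return true;

            remaining -= read;
        }
        return false;
    }
}
```

Calling it on a fresh StreamReader for the file with, say, maxChars = 100000 before starting the ReadLine() loop gives an early warning that the file may be one giant line.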
I am trying to read from a pbx file using StreamReader, edit the contents, and write the contents to a new file using a TextWriter in C#.
This is my first development task in C#.
I studied Java at uni and my new job uses C#.
Basically I have to read through a list of records contained in a pbx file from a phone system. These records, however, have a line of good call records followed by a line with a few dodgy characters, followed by another line of good records.
My task is to read through this file line by line and then write a piece of code to ignore the lines with dodgy characters and output the good records into a new file on my C:\ drive, which I've called output.txt.
I can write the while loop to take out the dodgy characters, but I'm unsure of the C# code to read from the pbx file on my C drive and then output the edited contents to a new file called output.txt, also on my C drive.
I'm new to C# and have explored Google for hours on this. Just need a little guidance and I'm away...
You didn't mention the file encodings, so I'm sticking with the UTF-8 defaults here.
One option is the 'regular' method of a loop that reads, checks, and conditionally writes, like this:
var inputFilePath = @"C:\temp\input.txt";
var outputFilePath = @"C:\temp\output.txt";
using (var reader = File.OpenText(inputFilePath))
using (var writer = File.CreateText(outputFilePath))
{
string line;
while ((line = reader.ReadLine()) != null)
{
var isValidLine = IsLineValid(line);
if (isValidLine)
{
writer.WriteLine(line);
}
}
}
Since you tagged this VS2008, I'm guessing that means you're limited to .NET 3.5, but on 4.0 or later, you can read and write enumerables and then leverage that (in .NET 3.5 you'd have to read all the lines into memory, filter, then write all the lines).
var inputFilePath = @"C:\temp\input.txt";
var outputFilePath = @"C:\temp\output.txt";
var inputLines = File.ReadLines(inputFilePath);
var linesToWrite = inputLines
.Where(line => IsLineValid(line));
File.WriteAllLines(outputFilePath, linesToWrite);
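The IsLineValid check is left to you, since only you know what a "dodgy" line looks like. Purely as an illustration, the sketch below assumes a bad line is one that is empty or contains control characters other than tab; the real rule for your pbx records may well be different.

```csharp
public static class RecordValidator
{
    // Hypothetical rule: a line is valid if it is non-empty and contains
    // no control characters apart from tab.
    public static bool IsLineValid(string line)
    {
        if (string.IsNullOrEmpty(line))
            return false;

        foreach (char c in line)
        {
            if (char.IsControl(c) && c != '\t')
                return false;
        }
        return true;
    }
}
```

With that in place the filter would read .Where(line => RecordValidator.IsLineValid(line)).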