Read Big TXT File, Out of Memory Exception - c#

I want to read a big TXT file; its size is 500 MB.
First I used
var file = new StreamReader(_filePath).ReadToEnd();
var lines = file.Split(new[] { '\n' });
but it threw an OutOfMemoryException. Then I tried reading line by line, but after about 1.5 million lines it threw an OutOfMemoryException again:
using (StreamReader r = new StreamReader(_filePath))
{
    string line;
    while ((line = r.ReadLine()) != null)
        _lines.Add(line);
}
I also tried
foreach (var l in File.ReadLines(_filePath))
{
    _lines.Add(l);
}
but again I received
An exception of type 'System.OutOfMemoryException' occurred in
mscorlib.dll but was not handled in user code
My machine is powerful, with 8 GB of RAM, so the machine shouldn't be the problem.
P.S. I tried to open this file in Notepad++ and got a 'the file is too big to be opened' error.

Just use File.ReadLines, which returns an IEnumerable<string> and doesn't load all the lines into memory at once.
foreach (var line in File.ReadLines(_filePath))
{
    // Don't put "line" into a list or collection.
    // Just do your processing on it here.
}

The exception seems to be caused by the growing _lines collection, not by reading the big file. You read each line and add it to _lines, which keeps every line in memory until you run out. If you only need some of the lines, apply a filter and add only those to _lines.
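A minimal sketch of that idea, assuming a made-up filter (lines containing "ERROR") and a hypothetical helper named LoadMatchingLines. File.ReadLines streams the file, so only the lines that pass the filter are kept in memory:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

// Hypothetical helper: keeps only the lines that pass the filter.
List<string> LoadMatchingLines(string path, Func<string, bool> keep)
{
    // File.ReadLines is lazy: each line is read and tested in turn,
    // and rejected lines are never stored.
    return File.ReadLines(path).Where(keep).ToList();
}

// Usage with a small temporary file standing in for the real one.
string demoPath = Path.GetTempFileName();
File.WriteAllLines(demoPath, new[] { "INFO start", "ERROR disk full", "INFO end" });
List<string> errors = LoadMatchingLines(demoPath, l => l.Contains("ERROR"));
Console.WriteLine(errors.Count); // 1
File.Delete(demoPath);
```

If even the filtered lines are too many to hold, the same Where(...) sequence can be consumed in a foreach without calling ToList().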

I know this is an old post but Google sent me here in 2021..
Just to emphasize igrimpe's comments above:
I've run into an OutOfMemoryException on StreamReader.ReadLine() recently looping through folders of giant text files.
As igrimpe mentioned, you can sometimes encounter this when your input file has irregular line breaks. If you are looping through a text file and hit this, double-check the input for unexpected characters, ASCII-encoded hex or binary strings, etc.
In my case, I split the problematic 60 GB file into 256 MB chunks, had my file iterator set aside the problematic text files in the exception handler, and later fixed them by removing the problematic lines.

Edit:
Loading the whole file into memory makes objects grow very large, and .NET will throw an OutOfMemoryException if it cannot allocate enough contiguous memory for an object.
The answer is still the same: you need to stream the file, not read its entire contents. That may require rearchitecting your application; however, using IEnumerable<> methods you can stack up business processes in different areas of the application and defer processing.
A "powerful" machine with 8 GB of RAM is no guarantee that a 500 MB file fits in memory as strings. You don't get all 8 GB, as the operating system holds some; you can't allocate all of memory in .NET; a 32-bit process has a 2 GB limit; .NET strings are UTF-16, so mostly-ASCII text roughly doubles in size once loaded; reading the file and then storing the lines holds the data twice; and there is a per-object size overhead.
You can't load the whole thing into memory to process it; you will have to stream the file through your processing.
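As a rough sketch of stacking deferred steps, assume a hypothetical comma-separated file whose second field is a number (DataLines, ParseAmounts and the file layout are made up for illustration). Each stage is an iterator, so nothing is read until the final Sum() enumerates the chain, and only one line is in flight at a time:

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.IO;
using System.Linq;

// Stage 1 (hypothetical): drop blank lines and comment lines.
IEnumerable<string> DataLines(IEnumerable<string> lines)
{
    foreach (string line in lines)
    {
        if (line.Length > 0 && !line.StartsWith("#", StringComparison.Ordinal))
            yield return line;
    }
}

// Stage 2 (hypothetical): parse the second comma-separated field.
IEnumerable<decimal> ParseAmounts(IEnumerable<string> lines)
{
    foreach (string line in lines)
    {
        string[] fields = line.Split(',');
        if (fields.Length > 1 &&
            decimal.TryParse(fields[1], NumberStyles.Number,
                             CultureInfo.InvariantCulture, out decimal amount))
            yield return amount;
    }
}

string demoPath = Path.GetTempFileName();
File.WriteAllLines(demoPath, new[] { "# id,amount", "1,10.5", "", "2,4.5" });

// Nothing is read from disk until Sum() pulls lines through both stages.
decimal total = ParseAmounts(DataLines(File.ReadLines(demoPath))).Sum();
Console.WriteLine(total);
File.Delete(demoPath);
```

The stages can live in different parts of the application, as long as each accepts and returns an IEnumerable<> and nothing in between calls ToList() or ToArray().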

Count the lines first, so that the array can be allocated once at its final size instead of growing as you read.
It is slower, but the number of lines is then limited only by the int array index (2,147,483,647 in principle) and by the memory the lines themselves need.
int intNoOfLines = 0;
using (StreamReader oReader = new StreamReader(MyFilePath))
{
    while (oReader.ReadLine() != null) intNoOfLines++;
}
string[] strArrLines = new string[intNoOfLines];
int intIndex = 0;
using (StreamReader oReader = new StreamReader(MyFilePath))
{
    string strLine;
    while ((strLine = oReader.ReadLine()) != null)
    {
        strArrLines[intIndex++] = strLine;
    }
}

For anyone else having this issue:
If you're running out of memory while using StreamReader.ReadLine(), I'd be willing to bet your file doesn't have multiple lines to begin with. You're just assuming it does. It's an easy mistake to make because you can't just open a 10GB file with Notepad.
One time I received a 10GB file from a client that was supposed to be a list of numbers and instead of using '\n' as a separator, he used commas. The whole file was a single line which obviously caused ReadLine() to blow up.
Try reading a few thousand characters from your stream using StreamReader.Read() and see if you can find a '\n'. Odds are you won't.
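A quick way to run that check, as a sketch; HasNewlineNearStart is a hypothetical helper name and the 5000-character limit is arbitrary:

```csharp
using System;
using System.IO;

// Hypothetical helper: reports whether a '\n' occurs within the first maxChars characters.
bool HasNewlineNearStart(string path, int maxChars)
{
    char[] buffer = new char[4096];
    int remaining = maxChars;
    using (var reader = new StreamReader(path))
    {
        while (remaining > 0)
        {
            // Read returns the number of characters actually read (0 at end of file).
            int read = reader.Read(buffer, 0, Math.Min(buffer.Length, remaining));
            if (read == 0)
                break;
            if (Array.IndexOf(buffer, '\n', 0, read) >= 0)
                return true;
            remaining -= read;
        }
    }
    return false;
}

// Usage: a comma-separated file with no line breaks at all.
string demoPath = Path.GetTempFileName();
File.WriteAllText(demoPath, "1,2,3,4,5,6,7,8,9");
Console.WriteLine(HasNewlineNearStart(demoPath, 5000)); // False
File.Delete(demoPath);
```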

Related

How to convert huge text file from Unix to Windows quickly in .NET core

In .NET Core, I have huge text files that need to be converted from Unix to Windows line endings.
Since I can't load a file completely into memory (the files are too big), I read the bytes one at a time, and when I encounter an LF, I output CR+LF. This works, but it takes a long time for huge files. Is there a more efficient way to do it?
I thought about using a StreamReader, but the problem is that we don't know the source file's encoding.
Any ideas?
Thank you
Without knowing more about the specific files you're trying to process, I'd probably start off with something like the below and see if that gets me the results I want.
Depending on the specifics of your situation you may be able to do something more efficient, but if you're handling truly large datasets with unstructured text then it's usually a matter of throwing more powerful hardware at the problem if speed is still an issue.
You don't have to specify the Encoding to make use of the StreamReader class. Was there a specific problem with the reader you encountered?
const string inputFilePath = "";
const string outputFilePath = "";
using var sr = new StreamReader(inputFilePath);
using var sw = new StreamWriter(outputFilePath);
string line;
// Buffers each line into memory, but not the newline characters.
while ((line = await sr.ReadLineAsync()) != null)
{
    // Write the contents of the string out to the "fixed" file (manually
    // specifying the line ending you want).
    await sw.WriteAsync(line + "\r\n");
}
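If the encoding really is unknown, one alternative is to stay at the byte level but read in large blocks instead of one byte at a time. The sketch below assumes an ASCII-compatible encoding (ASCII, UTF-8, Latin-1 and similar, where the byte 0x0A always means LF); it would corrupt UTF-16 or UTF-32 text. ConvertLfToCrLf and the 64 KB buffer size are illustrative choices, not something from the question.

```csharp
using System;
using System.IO;

// Hypothetical helper: copies input to output, turning bare LF into CR LF.
void ConvertLfToCrLf(Stream input, Stream output)
{
    const byte CR = 0x0D, LF = 0x0A;
    byte[] inBuf = new byte[64 * 1024];
    byte[] outBuf = new byte[inBuf.Length * 2]; // worst case: every byte is a bare LF
    byte previous = 0;                          // carries over between blocks
    int read;
    while ((read = input.Read(inBuf, 0, inBuf.Length)) > 0)
    {
        int outLen = 0;
        for (int i = 0; i < read; i++)
        {
            byte b = inBuf[i];
            // Insert CR only before a bare LF, so existing CR LF pairs are left alone.
            if (b == LF && previous != CR)
                outBuf[outLen++] = CR;
            outBuf[outLen++] = b;
            previous = b;
        }
        output.Write(outBuf, 0, outLen);
    }
}

// Usage with in-memory streams; real code would pass File.OpenRead / File.Create streams.
var source = new MemoryStream(new byte[] { (byte)'a', 0x0A, (byte)'b', 0x0D, 0x0A });
var target = new MemoryStream();
ConvertLfToCrLf(source, target);
Console.WriteLine(BitConverter.ToString(target.ToArray())); // 61-0D-0A-62-0D-0A
```

Because the disk is touched once per block rather than once per byte, this is usually much faster than a byte-at-a-time loop, while still holding only two fixed-size buffers in memory.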

Input from any file in WPF Application is very Slow

I have a WPF application that takes an input file path from the user, then opens the text file in the back end and reads it one character at a time.
fs = File.OpenRead(fileName);
var sr = new StreamReader(fs);
int c;
while ((c = sr.Read()) != -1)
{
    Console.Write((char)c); //to check character read from file
    try
    {
        frequencyMap.Add((char)c, 1);
    }
    catch
    {
        frequencyMap[(char)c] += 1;
    }
}
Here frequencyMap is the dictionary in which each character and its frequency are stored.
This is one method; no matter what I do, reading from the file is always slow, even if I try to read the whole text at once. In the Output window I see this:
[Screenshot of the Output window; the selected area is the part of the input read from the file.]
Files up to 2 KB are fine, but reading files of around 20 KB is painfully slow.
I have read that using threads can solve this problem, but I don't know how.
My question is: how can I read data from files quickly? If threads are the solution, how do I implement them?
I am new to this, so kindly help me.
Thanks
Don't read it character by character; read it, for example, line by line, and process each string in a loop. Also, catching an exception is not the way to check whether a key exists in the Dictionary.
using (var sr = new StreamReader(fileName))
{
    while (!sr.EndOfStream)
    {
        string s = sr.ReadLine();
        Debug.WriteLine(s); //to check string read from file
        foreach (char c in s)
        {
            if (frequencyMap.ContainsKey(c))
                frequencyMap[c]++;
            else
                frequencyMap.Add(c, 1);
        }
    }
}
Firstly, I hope the Console.Write call is purely test code. Writing to the console for every character will slow down your processing considerably.
Secondly, it appears from the screenshot you shared that your application is throwing a lot of exceptions. Throwing exceptions in a tight loop is not cheap either.
Thirdly, I would recommend profiling your application (Visual Studio provides a profiler) to help you pinpoint exactly where it is spending its time.
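Putting the points from these answers together, here is one possible sketch: no per-character console output, no exceptions for control flow, and block reads through a buffer. CountCharacters is a hypothetical helper name. Unlike the ReadLine version, it also counts newline characters, as the original character-by-character loop did.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper: counts how often each character occurs in the reader's text.
Dictionary<char, int> CountCharacters(TextReader reader)
{
    var frequencyMap = new Dictionary<char, int>();
    char[] buffer = new char[8192];
    int read;
    while ((read = reader.Read(buffer, 0, buffer.Length)) > 0)
    {
        for (int i = 0; i < read; i++)
        {
            // TryGetValue leaves count at 0 when the key is missing.
            frequencyMap.TryGetValue(buffer[i], out int count);
            frequencyMap[buffer[i]] = count + 1;
        }
    }
    return frequencyMap;
}

// Usage with a StringReader; a StreamReader over the file works the same way.
var counts = CountCharacters(new StringReader("hello\nworld"));
Console.WriteLine(counts['l']);  // 3
Console.WriteLine(counts['\n']); // 1
```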

How do I remove blank lines from text File in c#.net

I want to remove blank lines from my file; for that I am using the code below.
private void ReadFile(string Address)
{
    var tempFileName = Path.GetTempFileName();
    try
    {
        //using (var streamReader = new StreamReader(Server.MapPath("~/Images/") + FileName))
        using (var streamReader = new StreamReader(Address))
        using (var streamWriter = new StreamWriter(tempFileName))
        {
            string line;
            while ((line = streamReader.ReadLine()) != null)
            {
                if (!string.IsNullOrWhiteSpace(line))
                    streamWriter.WriteLine(line);
            }
        }
        File.Copy(tempFileName, Address, true);
    }
    finally
    {
        File.Delete(tempFileName);
    }
    Response.Write("Completed");
}
But the problem is that my file is very large (8 lakh lines, i.e. 800,000), so it takes a lot of time. Is there any other way to do it faster?
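Instead of calling ReadLine(), I would call StreamReader.ReadToEnd() to load the entire file into memory, then do a line.Replace("\n\n", "\n"), and then write the result to the file with streamWriter.Write(line). That way there is not a lot of thrashing, either of memory or of disk. (Note that a single Replace pass leaves blank lines behind wherever three or more newlines occur in a row, and it does not match "\r\n" line endings, so it may need to be repeated or adapted.)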
The best solution may well depend on the disk type - SSDs and spinning rust behave differently. Your current approach has the advantage over Steve's answer of being able to do processing (such as encoding text data back as binary) while data is still coming off the disk. (With buffering and background IO, there's a lot of potential asynchrony here.) It's definitely worth trying both approaches. (Obviously your approach uses less memory, too.)
However, there's one aspect of your code which is definitely suboptimal: creating a copy of the results. You don't need to do that. You can use file moves instead, which are a lot more efficient, assuming the files are all on the same drive. To make sure you don't lose data, you can do two moves and a delete:
Move the old file to a backup filename
Move the new file to the old filename
Delete the backup filename
It looks like this is what File.Replace does for you, which makes it considerably simpler, and also preserves the original metadata.
If something goes wrong after the first move, you're left without the "proper" file from either old or new, but you can detect that and use the backup filename to read next time.
Of course, if this is meant to happen as part of a web request, you may want to do all the processing in a background task - processing 800,000 lines of text is likely to take longer than you really want a web request to take...
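The File.Replace swap described in this answer might look like the sketch below. RemoveBlankLines and the ".tmp" / ".bak" suffixes are assumptions for illustration; the temporary file is created next to the original so that the swap stays on one volume.

```csharp
using System;
using System.IO;

// Hypothetical helper: rewrites the file without blank lines, then swaps it in.
void RemoveBlankLines(string path)
{
    string tempPath = path + ".tmp";   // filtered copy, same folder as the original
    string backupPath = path + ".bak"; // receives the old contents during the swap

    using (var reader = new StreamReader(path))
    using (var writer = new StreamWriter(tempPath))
    {
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            if (!string.IsNullOrWhiteSpace(line))
                writer.WriteLine(line);
        }
    }

    // Replaces path with tempPath and keeps the old contents as backupPath.
    File.Replace(tempPath, path, backupPath);
    File.Delete(backupPath);
}

// Usage with a throwaway file.
string demoPath = Path.Combine(Path.GetTempPath(), Path.GetRandomFileName());
File.WriteAllLines(demoPath, new[] { "first", "", "   ", "second" });
RemoveBlankLines(demoPath);
Console.WriteLine(File.ReadAllLines(demoPath).Length); // 2
File.Delete(demoPath);
```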

Get all lines after the last print of line with 'keyword' in C#

I am working on a C# project.
I am trying to send a log file by email whenever the application crashes.
However, the log file is rather large,
so I thought I should include only a specific portion of it.
For that, I am trying to read all the lines after the last line that contains a specified keyword (in my case "Application Started").
Since the application gets restarted many times (due to crashing), "Application Started" is printed many times in the file. So I want only the last line containing "Application Started" and the lines after it, up to the end of the file.
I need help figuring out how to do this.
So far I have only this basic code:
System.IO.StreamReader file = new System.IO.StreamReader("c:\\mylogfile.txt");
string line;
while ((line = file.ReadLine()) != null)
{
    if (line.Contains("keyword"))
    {
    }
}
Read the file, line-by-line, until you find your keyword. Once you find your keyword, start pushing every line after that into a List<string>. If you find another line with your keyword, just Clear your list and start refilling it from that point.
Something like:
List<string> buffer = new List<string>();
using (var sin = new StreamReader("pathtomylogfile"))
{
    string line;
    bool read = false; // must be initialized before it is tested below
    while ((line = sin.ReadLine()) != null)
    {
        if (line.Contains("keyword"))
        {
            buffer.Clear();
            read = true;
        }
        if (read)
        {
            buffer.Add(line);
        }
    }
    // now buffer has the last entry
    // you could use string.Join to put it back together in a single string
    var lastEntry = string.Join("\n", buffer);
}
If the number of lines in each entry is very large, it might be more efficient to scan the file first to find the last entry and then loop again to extract it. If the whole log file isn't that large, it might be more efficient to just ReadToEnd and then use LastIndexOf to find the start of the last entry.
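The two-pass variant mentioned above could look roughly like this. ReadLastEntry is a hypothetical helper name; the first pass stores nothing but a line index, and the second pass keeps only the lines from that index onward.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper: returns the lines from the last line containing keyword
// to the end of the file, or an empty list if the keyword never appears.
List<string> ReadLastEntry(string path, string keyword)
{
    // Pass 1: remember only the index of the last matching line.
    long lastMatch = -1;
    long index = 0;
    foreach (string line in File.ReadLines(path))
    {
        if (line.Contains(keyword))
            lastMatch = index;
        index++;
    }
    if (lastMatch < 0)
        return new List<string>();

    // Pass 2: ignore everything before that line and keep the rest.
    var entry = new List<string>();
    long current = 0;
    foreach (string line in File.ReadLines(path))
    {
        if (current++ >= lastMatch)
            entry.Add(line);
    }
    return entry;
}

// Usage with a tiny stand-in log file.
string demoPath = Path.GetTempFileName();
File.WriteAllLines(demoPath, new[] { "Application Started", "old error", "Application Started", "crash!" });
Console.WriteLine(string.Join("\n", ReadLastEntry(demoPath, "Application Started")));
File.Delete(demoPath);
```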
Read everything from the file and then select the portion you want.
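string lines = System.IO.File.ReadAllText("c:\\mylogfile.txt");
int start_index = lines.LastIndexOf("Application Started");
// LastIndexOf returns -1 when the text is not found; fall back to the whole file.
string needed_portion = start_index >= 0 ? lines.Substring(start_index) : lines;
SendEmail(needed_portion);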
I advise you to use a proper logger, like log4net or NLog.
You can configure it to save to multiple files: one containing the complete logs, another containing errors/exceptions only. You can also set a maximum size for the log files, or configure it to send you an email when an exception occurs.
Of course, this does not solve your current problem; for that, there are solutions above.
But I would try simpler methods first, like Notepad++, which can handle bigger files (last time I formatted a 30 MB XML document with it, it took about 20 minutes, but it did it! With simple text files, performance should be much better). Also, if you open the file for reading only (not for editing), you may get much better performance (in Windows).

C# - Read in a large (150MB) text file into a Rich Text Box

I'm trying to read a 150 MB text file into a RichTextBox.
Currently, I am using a StreamReader to iterate through each line in the file, appending every line to a StringBuilder instance.
This works for smaller files, but I get a System.OutOfMemoryException when trying to read large files.
I don't see why reading a 150 MB file should be a problem: there is plenty of physical memory, and that's well within the Windows 32-bit application address space.
If anyone here has any idea how to do this, it would be greatly appreciated.
I'll attach my code at the end.
Thanks.
StringBuilder sb = new StringBuilder();
using (StreamReader sr = new StreamReader(fileLocation))
{
    string line;
    while ((line = sr.ReadLine()) != null)
    {
        sb.AppendLine(line);
    }
}
return sb;
Use RichTextBox.LoadFile
http://msdn.microsoft.com/en-us/library/system.windows.forms.richtextbox.loadfile.aspx
I'm not sure why you would want to load the entire text to a StringBuilder. Alternatively you could pass a FileStream to LoadFile which would render the large file for you.
I guess you should split up the input somehow: say, break it into several smaller files and navigate between the parts programmatically.
A 150 MB text file in a text box sounds abnormal. Maybe you should look at stream-style data processing rather than loading the whole file.
