I've got an application tailing a logfile. Every time the logfile updates (which is usually a series of updates in a row) the memory use balloons out of control.
I've tracked down the problem to this call:
if (File.Exists(Path + "\\logfile.txt"))
    Data = File.ReadAllText(Path + "\\logfile.txt");
That call lives inside LoadAllData, which is invoked from this handler:
private void FileChangeNotificationHandler(object source, FileSystemEventArgs e)
{
    this.Dispatcher.BeginInvoke(new Action(delegate()
    {
        Logfile.GetPath();
        Logfile.LoadAllData();
        LogText.Clear();
        LogText.Text = Logfile.Data;
        if (CheckFollowTail.IsChecked == true) LogText.ScrollToEnd();
    }));
}
Does anyone have insight on why this is occurring? I assume it's related to the delegate or the handler.
It's probably just down to the amount and frequency with which you are loading log file data into memory.
GC takes time, so if you are repeating this in quick succession, chances are you'll have several files' worth of data in memory until the next collection. This seems very inefficient. You should consider using a stream-based reader, to avoid keeping all the data in memory. If you do use a stream reader, make sure you dispose of it afterwards to avoid introducing another leak.
The other thing to check is that you're not subscribing to a static event somewhere and thereby preventing your object tree from being collected. Is it a web app?
First of all, checking if the file exists is wrong. This is because the file system is volatile and because there is more than just existence at play (permissions, for example). The correct way to do this is to just open the file, and then handle the exception if it fails.
Now, on to your stated problem. What I suspect is happening is that the log is growing large enough to land on the Large Object Heap (85,000 bytes is all that's needed, iirc, and remember that .Net uses UTF-16, i.e. 2-byte, characters). A 43K ASCII log file is all you need to start causing problems, because at that size your .Net string is no longer garbage collected in the normal way. Every time you read the file you end up adding another copy of the entire log to memory.
To best recommend how to get around this, it will be helpful to know what kind of component you use for your LogText variable. But pending that information, I can at least suggest a few pointers:
Ideally, you would just keep the file open (using FileShare.ReadWrite) and read from the stream every time you get a change notification. But that's not always possible.
If you have to re-open the file each time, at least read the text line by line (using a StreamReader) rather than pulling it all in at once using File.ReadAllText(). This will help you keep your log file broken up into smaller pieces that won't end up on the large object heap.
Unfortunately, I suspect that in the end you're stuck building one big string to assign to a plain textbox. If this is the case, I strongly recommend that you either only ever build and show the last part of the log (less than 85000 bytes worth) or that you search for a Large Object Heap-safe Textbox component to use.
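For what it's worth, here's a minimal sketch of that first option (keeping the file open). The class and member names are illustrative, not from the question, and error handling is omitted:

using System;
using System.IO;

public class LogTailer : IDisposable
{
    private readonly FileStream _logStream;
    private readonly StreamReader _logReader;

    public LogTailer(string path)
    {
        // FileShare.ReadWrite lets the logger keep appending while we read.
        _logStream = new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.ReadWrite);
        _logReader = new StreamReader(_logStream);
    }

    // Call this from the FileSystemWatcher handler: it returns only the text
    // appended since the last call, so the whole file is never re-read.
    public string ReadNewText()
    {
        return _logReader.ReadToEnd();
    }

    public void Dispose()
    {
        _logReader.Dispose();
        _logStream.Dispose();
    }
}

In the handler you would then do something like LogText.AppendText(tailer.ReadNewText()); instead of re-assigning LogText.Text each time.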
I need to process an XML file with the following structure:
<FolderSizes>
  <Version></Version>
  <DateTime Un=""></DateTime>
  <Summary>
    <TotalSize Bytes=""></TotalSize>
    <TotalAllocated Bytes=""></TotalAllocated>
    <TotalAvgFileSize Bytes=""></TotalAvgFileSize>
    <TotalFolders Un=""></TotalFolders>
    <TotalFiles Un=""></TotalFiles>
  </Summary>
  <DiskSpaceInfo>
    <Drive Type="" Total="" TotalBytes="" Free="" FreeBytes="" Used="" UsedBytes=""><![CDATA[ ]]></Drive>
  </DiskSpaceInfo>
  <Folder ScanState="">
    <FullPath Name=""><![CDATA[ ]]></FullPath>
    <Attribs Int=""></Attribs>
    <Size Bytes=""></Size>
    <Allocated Bytes=""></Allocated>
    <AvgFileSz Bytes=""></AvgFileSz>
    <Folders Un=""></Folders>
    <Files Un=""></Files>
    <Depth Un=""></Depth>
    <Created Un=""></Created>
    <Accessed Un=""></Accessed>
    <LastMod Un=""></LastMod>
    <CreatedCalc Un=""></CreatedCalc>
    <AccessedCalc Un=""></AccessedCalc>
    <LastModCalc Un=""></LastModCalc>
    <Perc><![CDATA[ ]]></Perc>
    <Owner><![CDATA[ ]]></Owner>
    <!-- Special element; see paragraph below -->
    <Folder></Folder>
  </Folder>
</FolderSizes>
The <Folder> element is special in that it repeats within the <FolderSizes> element but can also appear within itself; I reckon up to about 5 levels.
The problem is that the file is really big, at a whopping 11GB, so I'm having difficulty processing it. I have experience with XML documents, but nothing on this scale.
What I would like to do is to import the information into a SQL database because then I will be able to process the information in any way necessary without having to concern myself with this immense, impractical file.
Here are the things I have tried:
Simply load the file and attempt to process it with a simple C# program using an XmlDocument or XDocument object
Before I even started I knew this would not work, as I'm sure everyone would agree, but I tried it anyway and ran the application on a VM (since my notebook only has 4GB RAM) with 30GB of memory. The application ended up using 24GB of memory and taking very, very long, so I just cancelled it.
Attempt to process the file using an XmlReader object
This approach worked better in that it didn't use as much memory, but I still had a few problems:
It was taking really long because I was reading the file one line at a time.
Processing the file one line at a time makes it difficult to really work with the data contained in the XML, because now you have to detect the start of a tag, then the end of that tag (hopefully), then create a document from that information, read the info, and attempt to determine which parent tag it belongs to, because we have multiple levels... This sounds prone to problems and errors.
Did I mention that it takes really long to read the file one line at a time? And that's still without actually processing each line; literally just reading it.
Import the information using SQL Server
I created a stored procedure using XQuery and ran it recursively within itself to process the <Folder> elements. This went quite well - I think better than the other two approaches - until one of the <Folder> elements ended up being rather big, producing an "XML operation resulted an XML data type exceeding 2GB in size. Operation aborted." error. I read up about it and I don't think it's an adjustable limit.
Here are more things I think I should try:
Re-write my C# application to use unmanaged code
I don't have much experience with unmanaged code, so I'm not sure how well it will work and how to make it as unmanaged as possible.
I once wrote a little application that works with my webcam, receiving the image, inverting the colours, and painting it to a panel. Using normal managed code didn't work - the result was about 2 frames per second. Re-writing the colour inversion method to use unmanaged code solved the problem. That's why I thought that unmanaged might be a solution.
Rather go for C++ instead of C#
Not sure if this is really a solution. Would it necessarily be better than C#? Better than unmanaged C#?
The problem here is that I haven't actually worked with C++ before, so I'll need to get to know a few things about C++ before I can really start working with it, and then probably not very efficiently yet.
I thought I'd ask for some advice before I go any further, possibly wasting my time.
Thanks in advance for your time and assistance.
EDIT
So before I start processing the file I run through it and check the size, in an attempt to provide the user with feedback as to how long the processing might take; I made a screenshot of the calculation:
That's about 1,500 lines per second; if the average line length is about 50 characters, that's 50 bytes per line, which is 75 kilobytes per second, so an 11GB file should take about 40 hours, if my maths is correct. But this is only stepping through each line; it's not actually processing the line or doing anything with it, so when that starts, the processing rate drops significantly.
This is the method that runs during the size calculation:
private int _totalLines = 0;
private bool _cancel = false; // set to true when the cancel button is clicked

private void CalculateFileSize()
{
    xmlStream = new StreamReader(_filePath);
    xmlReader = new XmlTextReader(xmlStream);

    while (xmlReader.Read())
    {
        if (_cancel)
            return;

        if (xmlReader.LineNumber > _totalLines)
            _totalLines = xmlReader.LineNumber;

        InterThreadHelper.ChangeText(
            lblLinesRemaining,
            string.Format("{0} lines", _totalLines));

        string elapsed = string.Format(
            "{0}:{1}:{2}:{3}",
            timer.Elapsed.Days.ToString().PadLeft(2, '0'),
            timer.Elapsed.Hours.ToString().PadLeft(2, '0'),
            timer.Elapsed.Minutes.ToString().PadLeft(2, '0'),
            timer.Elapsed.Seconds.ToString().PadLeft(2, '0'));

        InterThreadHelper.ChangeText(lblElapsed, elapsed);

        if (_cancel)
            return;
    }

    xmlStream.Dispose();
}
Still running, 27 minutes in :(
You can read the XML as a logical stream of elements instead of trying to read it line-by-line and piecing it back together yourself; see the code sample at the end of this article.
Also, your question has already been asked here.
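As a rough sketch of what that looks like (the element names come from the XML in the question, the file path is made up, and the database work is stubbed out):

using System;
using System.Xml;
using System.Xml.Linq;

class FolderSizesImporter
{
    static void Main()
    {
        using (XmlReader reader = XmlReader.Create(@"c:\FolderSizes.xml"))
        {
            // Jump from one <Folder> to the next without ever holding the
            // whole 11GB document in memory.
            while (reader.ReadToFollowing("Folder"))
            {
                using (XmlReader subtree = reader.ReadSubtree())
                {
                    subtree.MoveToContent();
                    // Loads just this folder's subtree, nested <Folder>
                    // children included. If a single subtree is itself huge,
                    // walk its children with the reader instead.
                    XElement folder = XElement.Load(subtree);
                    XElement fullPath = folder.Element("FullPath");
                    string name = fullPath != null ? (string)fullPath.Attribute("Name") : null;
                    Console.WriteLine(name); // ... replace with the database insert ...
                }
            }
        }
    }
}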
Environment: Any .Net Framework welcomed.
I have a log file that gets written to 24/7.
I am trying to create an application that will read the log file and process the data.
What's the best way to read the log file efficiently? I imagine monitoring the file with something like FileSystemWatcher. But how do I make sure I don't read the same data once it's been processed by my application? Or say the application aborts for some unknown reason, how would it pick up where it left off last?
There's usually a header and footer around the payload that's in the log file. Maybe an id field in the content as well. Not sure yet though about the id field being there.
I also imagined maybe saving the lines read count somewhere to maybe use that as bookmark.
For obvious reasons, reading the whole content of the file, as well as removing lines from the log file (after loading them into the application), is out of the question.
What I can think of as a partial solution is having a small database (probably something much smaller than a full-blown MySQL/MS SQL/PostgreSQL instance) and populating a table with what has been read from the log file. I am pretty sure that even if there is a power cut and the machine is booted again, most relational databases should be able to restore their state with ease. This solution requires some data that can be used to identify the row from the log file (for example: the exact time of the action logged, the machine on which the action took place, etc.)
Well, you will have to figure out the magic for your particular case yourself. If you are using a well-known text encoding it may be pretty simple, though. Look toward System.IO.StreamReader and its ReadLine() and DiscardBufferedData() methods and BaseStream property. You should be able to remember your last position in the file, rewind to that position later, and start reading again, given that you are sure the file is only ever appended to. There are other things to consider, though, and there is no single universal answer to this.
Just as a naive example (you may still need to adjust a lot to make it work):
static void Main(string[] args)
{
    string filePath = @"c:\log.txt";
    using (var stream = new FileStream(filePath, FileMode.Open, FileAccess.Read, FileShare.Read))
    {
        using (var streamReader = new StreamReader(stream, Encoding.Unicode))
        {
            long pos = 0;
            if (File.Exists(@"c:\log.txt.lastposition"))
            {
                string strPos = File.ReadAllText(@"c:\log.txt.lastposition");
                pos = Convert.ToInt64(strPos);
            }
            streamReader.BaseStream.Seek(pos, SeekOrigin.Begin); // rewind to the last saved position
            streamReader.DiscardBufferedData(); // clear the reader's buffer after seeking
            for (;;)
            {
                string line = streamReader.ReadLine();
                if (line == null) break;
                ProcessLine(line); // ProcessLine is your own handler for each log line
            }
            // when everything is read, the position is at the end of the file
            File.WriteAllText(@"c:\log.txt.lastposition", streamReader.BaseStream.Position.ToString());
        }
    }
}
I think you will find the File.ReadLines(filename) function in conjunction with LINQ very handy for something like this. ReadAllLines() loads the entire text file into memory as a string[] array, but ReadLines lets you begin enumerating the lines immediately as it traverses the file. This not only saves you time but keeps memory usage very low, as each line is processed one at a time. The using statements are important because, if this program is interrupted, they will close the file streams, flushing the writer and saving the unwritten content to the file. Then, when the program starts up again, it will skip all the lines that have already been read.
int readCount = File.ReadLines("readLogs.txt").Count();
using (FileStream readLogs = new FileStream("readLogs.txt", FileMode.Append))
using (StreamWriter writer = new StreamWriter(readLogs))
{
    IEnumerable<string> lines = File.ReadLines("bigLogFile.txt").Skip(readCount);
    foreach (string line in lines)
    {
        // do something with line, or batch them if you need more than one
        writer.WriteLine(line);
    }
}
As MaciekTalaska mentioned, I would strongly recommend using a database if this is something written to 24/7 and will get quite large. File systems are simply not equipped to handle such volume and you will spend a lot of time trying to invent solutions where a database could do it in a breeze.
Is there a reason why it logs to a file? Files are great because they are simple to use and, being the lowest common denominator, there is relatively little that can go wrong. However, files are limited. As you say, there's no guarantee a write to the file will be complete when you read the file. Multiple applications writing to the log can interfere with each other. There is no easy sorting or filtering mechanism. Log files can grow very big very quickly and there's no easy way to move old events (say those more than 24 hours old) into separate files for backup and retention.
Instead, I would consider writing the logs to a database. The table structure can be very simple, but you get the advantage of transactions (so you can extract or back up with ease) and of searching, sorting and filtering using an almost universally understood syntax. If you are worried about load spikes, use a message queue, like http://msdn.microsoft.com/en-us/library/ms190495.aspx for SQL Server.
To make the transition easier, consider using a logging framework like log4net. It abstracts much of this away from your code.
Another alternative is to use a system like syslog or, if you have multiple servers and a large volume of logs, flume. By moving the log files away from the source computer, you can store them or inspect them on a different machine far more effectively. However, these are probably overkill for your current problem.
Short Question
I have a loop that runs 180,000 times. At the end of each iteration it is supposed to append the results to a TextBox, which is updated real-time.
Using MyTextBox.Text += someValue is causing the application to eat huge amounts of memory, and it runs out of available memory after a few thousand records.
Is there a more efficient way of appending text to a TextBox.Text 180,000 times?
Edit: I really don't care about the result of this specific case; however, I want to know why this seems to be a memory hog, and whether there is a more efficient way to append text to a TextBox.
Long (Original) Question
I have a small app which reads a list of ID numbers from a CSV file and generates a PDF report for each one. After each PDF file is generated, the ResultsTextBox.Text gets appended with the ID number of the report that was processed and a note that it was processed successfully. The process runs on a background thread, so the ResultsTextBox gets updated in real time as items are processed.
I am currently running the app against 180,000 ID numbers, but the memory the application takes up grows exponentially as time goes by. It starts at around 90K, but by about 3,000 records it is taking up roughly 250MB, and by 4,000 records about 500MB.
If I comment out the update to the Results TextBox, the memory stays relatively stationary at roughly 90K, so I can assume that writing ResultsText.Text += someValue is what is causing it to eat memory.
My question is, why is this? What is a better way of appending data to a TextBox.Text that doesn't eat memory?
My code looks like this:
try
{
    report.SetParameterValue("Id", id);
    report.ExportToDisk(ExportFormatType.PortableDocFormat,
        string.Format(@"{0}\{1}.pdf", new object[] { outputLocation, id }));
    // ResultsText.Text += string.Format("Exported {0}\r\n", id);
}
catch (Exception ex)
{
    ErrorsText.Text += string.Format("Failed to export {0}: {1}\r\n",
        new object[] { id, ex.Message });
}
It is also worth mentioning that the app is a one-time thing, and it doesn't matter that it will take a few hours (or days :)) to generate all the reports. My main concern is that if it hits the system memory limit, it will stop running.
I'm fine with leaving the line updating the Results TextBox commented out to run this thing, but I would like to know if there is a more memory efficient way of appending data to a TextBox.Text for future projects.
I suspect the reason the memory usage is so large is because textboxes maintain a stack so that the user can undo/redo text. That feature doesn't seem to be required in your case, so try setting IsUndoEnabled to false.
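For reference, that is a one-line change (ResultsText is the name from the question's code; this assumes a WPF TextBox):

// WPF: disable the undo stack so the TextBox doesn't retain old values.
ResultsText.IsUndoEnabled = false;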
Use TextBox.AppendText(someValue) instead of TextBox.Text += someValue. It's easy to miss since it's on TextBox, not TextBox.Text. Like StringBuilder, this will avoid creating copies of the entire text each time you add something.
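For illustration, applied to the commented-out line from the question it would look something like this:

// AppendText modifies the underlying document in place, whereas += first
// materializes the entire existing text as a new string and replaces it.
ResultsText.AppendText(string.Format("Exported {0}\r\n", id));
// instead of:
// ResultsText.Text += string.Format("Exported {0}\r\n", id);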
It would be interesting to see how this compares to the IsUndoEnabled flag from keyboardP's answer.
Don't append directly to the Text property. Use a StringBuilder for the appending; then, when done, set the .Text to the finished string from the StringBuilder.
Instead of using a text box I would do the following (a sketch of both points follows below):
Open up a text file and stream the errors to a log file, just in case.
Use a ListBox control to present the errors, to avoid copying potentially massive strings.
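A minimal sketch of both points, where _errorLog is a hypothetical writer opened once in the form and ErrorsList is a hypothetical ListBox on it:

// Flushed on every write so the on-disk log survives a crash.
private readonly StreamWriter _errorLog =
    new StreamWriter("errors.log", true) { AutoFlush = true };

private void ReportError(string id, Exception ex)
{
    string message = string.Format("Failed to export {0}: {1}", id, ex.Message);
    _errorLog.WriteLine(message);   // durable copy on disk, just in case
    ErrorsList.Items.Add(message);  // a ListBox stores items, not one huge string
}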
Personally, I always use string.Concat*. I remember reading a question here on Stack Overflow years ago that had profiling statistics comparing the commonly-used methods, and I seem to recall that string.Concat won out.
Nonetheless, the best I can find is this reference question and this specific String.Format vs. StringBuilder question, which mentions that String.Format uses a StringBuilder internally. This makes me wonder if your memory hog lies elsewhere.
*Based on James' comment, I should mention that I never do heavy string formatting, as I focus on web-based development.
Maybe reconsider the TextBox? A ListBox holding string items will probably perform better.
But the main problem seems to be the requirements: showing 180,000 items cannot be aimed at a (human) user, and neither is updating the display in "real time".
The preferable way would be to show a sample of the data or a progress indicator.
When you do want to dump it all at the poor user, batch the string updates. No user could discern more than 2 or 3 changes per second, so if you produce 100 per second, make groups of 50.
Some responses have alluded to it, but nobody has outright stated it, which is surprising.
Strings are immutable, which means a String cannot be modified after it is created. Therefore, every time you concatenate to an existing String, a new String object needs to be created. The memory associated with that String object also obviously needs to be allocated, which can get expensive as your Strings become larger and larger. In college, I once made the amateur mistake of concatenating Strings in a Java program that did Huffman coding compression. When you're concatenating extremely large amounts of text, String concatenation can really hurt you, when you could have simply used StringBuilder, as some here have mentioned.
Use the StringBuilder as suggested.
Try to estimate the final string size, then use that number when instantiating the StringBuilder: StringBuilder sb = new StringBuilder(estSize);
When updating the TextBox, just use assignment, e.g.: textbox.Text = sb.ToString();
Watch for cross-thread operations as above. However, use BeginInvoke; there is no need to block the background thread while the UI updates. (A rough sketch follows below.)
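Here is a loose sketch of those points together; it assumes WPF, and names like estimatedFinalSize, ids and ProcessReport are placeholders rather than anything from the question. In practice you would also throttle the updates, as the next answer shows:

var sb = new StringBuilder(estimatedFinalSize); // pre-size to avoid regrowing

foreach (string id in ids)
{
    ProcessReport(id);                        // the expensive work (placeholder)
    sb.AppendFormat("Exported {0}\r\n", id);

    // Snapshot on the background thread, then post the update. BeginInvoke
    // returns immediately, so the worker never blocks waiting on the UI.
    string snapshot = sb.ToString();
    Dispatcher.BeginInvoke(new Action(() => ResultsText.Text = snapshot));
}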
A) Intro: already mentioned, use StringBuilder
B) Point: don't update too frequently, i.e.
DateTime dtLastUpdate = DateTime.MinValue;
while (condition)
{
    DoSomeWork();
    if (DateTime.Now - dtLastUpdate > TimeSpan.FromSeconds(2))
    {
        _form.Invoke((Action)(() => { textBox.Text = myStringBuilder.ToString(); }));
        dtLastUpdate = DateTime.Now;
    }
}
C) If it's a one-time job, target the x64 architecture to escape the 2Gb limit of a 32-bit process.
Keep a StringBuilder in the ViewModel to avoid the mess of repeated string rebinding, and bind MyTextBox.Text to it. This scenario will increase performance many times over and decrease memory usage.
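A loose sketch of that idea; the names are illustrative, and in practice the change notification would be throttled rather than raised on every append:

using System.ComponentModel;
using System.Text;

public class LogViewModel : INotifyPropertyChanged
{
    private readonly StringBuilder _log = new StringBuilder();

    public event PropertyChangedEventHandler PropertyChanged;

    // MyTextBox.Text binds one-way to this property.
    public string LogText
    {
        get { return _log.ToString(); }
    }

    public void Append(string line)
    {
        _log.AppendLine(line);
        PropertyChangedEventHandler handler = PropertyChanged;
        if (handler != null)
            handler(this, new PropertyChangedEventArgs("LogText"));
    }
}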
Something that has not been mentioned is that even if you're performing the operation in the background thread, the update of the UI element itself HAS to happen on the main thread itself (in WinForms anyway).
When updating your textbox, do you have any code that looks like
if (textbox.Dispatcher.CheckAccess())
{
    textbox.Text += "whatever";
}
else
{
    textbox.Dispatcher.Invoke(...);
}
If so, then your background op is definitely being bottlenecked by the UI Update.
I would suggest that your background op use StringBuilder as noted above, but instead of updating the textbox every cycle, try updating it at regular intervals to see if it increases performance for you.
EDIT NOTE: I have not used WPF.
You say the memory grows exponentially. No, it is quadratic growth, i.e. polynomial growth, which is not as dramatic as exponential growth.
You are creating strings holding the following number of items:
1 + 2 + 3 + 4 + 5 ... + n = (n^2 + n) /2.
With n = 180,000 you get total memory allocation for 16,200,090,000 items, i.e. 16.2 billion items! This memory will not be allocated at once, but it is a lot of cleanup work for the GC (garbage collector)!
Also, bear in mind, that the previous string (which is growing) must be copied into the new string 179,999 times. The total number of copied bytes goes with n^2 as well!
As others have suggested, use a ListBox instead. There you can append new strings without creating one huge string. A StringBuilder does not help, since you want to display the intermediate results as well.
I have a method which uses a BinaryWriter to write a record consisting of a few uints and a byte array to a file. This method executes about a dozen times a second as part of my program. The code is below:
iLogFileMutex.WaitOne();
using (BinaryWriter iBinaryWriter = new BinaryWriter(
    File.Open(iMainLogFilename, FileMode.OpenOrCreate, FileAccess.Write)))
{
    iBinaryWriter.Seek(0, SeekOrigin.End);
    foreach (ViewerRecord vR in aViewerRecords)
    {
        iBinaryWriter.Write(vR.Id);
        iBinaryWriter.Write(vR.Timestamp);
        iBinaryWriter.Write(vR.PayloadLength);
        iBinaryWriter.Write(vR.Payload);
    }
}
iLogFileMutex.ReleaseMutex();
The above code works fine, but if I remove the line with the Seek call, the resulting binary file is corrupted. For example, certain records are completely missing, or parts of them are simply not present, although the vast majority of records are written just fine. So I imagine the cause of the bug is that, when I repeatedly open and close the file, the current position in the file isn't always at the end, and things get overwritten.
So my question is: Why isn't C# ensuring that the current position is at the end when I open the file?
PS: I have ruled out threading issues from causing this bug
If you want to append to the file, you must use FileMode.Append in your Open call, otherwise the file will open with its position set to the start of the file, not the end.
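Applied to the code from the question, that would look like this (the explicit Seek then becomes unnecessary, since an Append stream always opens positioned at the end):

using (BinaryWriter iBinaryWriter = new BinaryWriter(
    File.Open(iMainLogFilename, FileMode.Append, FileAccess.Write)))
{
    foreach (ViewerRecord vR in aViewerRecords)
    {
        iBinaryWriter.Write(vR.Id);
        iBinaryWriter.Write(vR.Timestamp);
        iBinaryWriter.Write(vR.PayloadLength);
        iBinaryWriter.Write(vR.Payload);
    }
}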
The problem is a combination of FileMode.OpenOrCreate and the type of the ViewerRecord members. One or more of them isn't of a fixed size type, probably a string.
Things go wrong when the file already exists. You'll start writing data at the start of the file, overwriting existing data. But what you write only lines up with an existing record by chance; the string would have to be exactly the same size. If you don't write enough records, you won't overwrite all of the old records, and you'll get into trouble when you read the file: you'll read part of an old record after you've read the last written record, and you'll get junk for a while.
Making the record a fixed size doesn't really solve the problem; you'll read a good record, but it will be an old one. Which particular set of old records you get depends on how much new data you wrote. This should be just as bad as reading garbled data.
If you really do need to preserve the old records then you should append to the file, FileMode.Append. If you don't then you should rewrite the file, FileMode.Create.
I am writing a log of lots and lots of formatted text to a textbox in a .NET Windows Forms app.
It is slow once the data gets over a few megs. Since I am appending, the string has to be reallocated every time, right? I only need to set the value of the text box once, but in my code I am doing line += data tens of thousands of times.
Is there a faster way to do this? Maybe a different control? Is there a linked list string type I can use?
StringBuilder will not help if the text box is appended to incrementally, like log output for example.
But if that is your situation, and if your updates are frequent enough, it may behoove you to cache some number of updates and then append them in one step (rather than appending constantly). That would save you many string reallocations... and then StringBuilder would be helpful.
Notes:
1. Create a class-scoped StringBuilder member (_sb)
2. Start a timer (or use a counter)
3. Append text updates to _sb
4. When the timer ticks or a certain count is reached, reset and append to the text box
5. Restart the process from #1
No one has mentioned virtualization yet, which is really the only way to provide predictable performance for massive volumes of data. Even using a StringBuilder and converting it to a string every half a second will be very slow once the log gets large enough.
With data virtualization, you would only hold the necessary data in memory (i.e. what the user can see, and perhaps a little more on either side) whilst the rest would be stored on disk. Old data would "roll out" of memory as new data comes in to replace it.
In order to make the TextBox appear as though it has a lot of data in it, you would tell it that it does. As the user scrolls around, you would replace the data in the buffer with the relevant data from the underlying source (using random file access). So your UI would be monitoring a file, not listening for logging events.
Of course, this is all a lot more work than simply using a StringBuilder, but I thought it worth mentioning just in case.
Build your string together with a StringBuilder, then convert it to a String using ToString(), and assign this to the textbox.
I have found that setting the textbox's WordWrap property to false greatly improves performance, as long as you're ok with having to scroll to the right to see all of your text. In my case, I wanted to paste a 20-50 MB file into a MultiLine textbox to do some processing on it. That took several minutes with WordWrap on, and just several seconds with WordWrap off.