Edit2: I just want to make sure my question is clear: why does the application use 15 MB more (the size of the original log file) on each iteration of AppendToLog()?
I've got a function called AppendToLog() which receives the file path of an HTML document, does some parsing and appends it to a file. It gets called this way:
this.user_email = uemail;
string wanted_user = wemail;
string[] logPaths;
logPaths = this.getLogPaths(wanted_user);
foreach (string path in logPaths)
{
this.AppendToLog(path);
}
On every iteration, the RAM usage increases by 15mb or so. This is the function: (looks long but it's simple)
public void AppendToLog(string path)
{
Encoding enc = Encoding.GetEncoding("ISO-8859-2");
StringBuilder fb = new StringBuilder();
FileStream sourcef;
string[] messages;
try
{
sourcef = new FileStream(path, FileMode.Open);
}
catch (IOException)
{
throw new IOException("The chat log is in use by another process.");
}
using (StreamReader sreader = new StreamReader(sourcef, enc))
{
string file_buffer;
while ((file_buffer = sreader.ReadLine()) != null)
{
fb.Append(file_buffer);
}
}
//Array of each line's content
messages = parseMessages(fb.ToString());
fb = null;
string destFileName = String.Format("{0}_log.txt",System.IO.Path.GetFileNameWithoutExtension(path));
FileStream destf = new FileStream(destFileName, FileMode.Append);
using (StreamWriter swriter = new StreamWriter(destf, enc))
{
foreach (string message in messages)
{
if (message != null)
{
swriter.WriteLine(message);
}
}
}
messages = null;
sourcef.Dispose();
destf.Dispose();
sourcef = null;
destf = null;
}
I've been stuck on this for days and I don't know what to do :(
Edit: This is ParseMessages, a function that uses HtmlAgilityPack to strip parts of an HTML log.
public string[] parseMessages(string what)
{
StringBuilder sb = new StringBuilder();
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(what);
HtmlNodeCollection messageGroups = doc.DocumentNode.SelectNodes("//body/div[@class='mplsession']");
int messageCount = doc.DocumentNode.SelectNodes("//tbody/tr").Count;
doc = null;
string[] buffer = new string[messageCount];
int i = 0;
foreach (HtmlNode sessiongroup in messageGroups)
{
HtmlNode tablegroup = sessiongroup.SelectSingleNode("table/tbody");
string sessiontime = sessiongroup.Attributes["id"].Value;
HtmlNodeCollection messages = tablegroup.SelectNodes("tr");
if (messages != null)
{
foreach (HtmlNode htmlNode in messages)
{
sb.Append(
ParseMessageDate(
sessiontime,
htmlNode.ChildNodes[0].ChildNodes[0].InnerText
)
); //Date
sb.Append(" ");
try
{
foreach (HtmlTextNode node in htmlNode.ChildNodes[0].SelectNodes("text()"))
{
sb.Append(node.Text.Trim()); //Name
}
}
catch (NullReferenceException)
{
/*
* We ignore this exception, it just means there's extra text
* and that means that it's not a normal message
* but a system message instead
* (i.e. "John logged off")
* Therefore we add the "::" mark for future organizing
*/
sb.Append("::");
}
sb.Append(" ");
string message = htmlNode.ChildNodes[1].InnerHtml;
message = message.Replace("&quot;", "'");
message = message.Replace("&nbsp;", " ");
message = RemoveMedia(message);
sb.Append(message); //Message
buffer[i] = sb.ToString();
sb = new StringBuilder();
i++;
}
}
}
messageGroups = null;
what = null;
return buffer;
}
As many have mentioned, this is probably just an artifact of the GC not cleaning up the memory storage as fast as you are expecting it to. This is normal for managed languages like C#, Java, etc. If you are interested in that usage, you really need to find out whether the memory allocated to your program is actually free or not. The questions to ask related to this are:
How long is your program running? Is it a service type program that runs continuously?
Over the span of execution does it continue to allocate memory from the OS or does it reach a steady-state? (Have you run it long enough to find out?)
Your code does not look like it will have a "memory leak". In managed languages you really don't get memory leaks like you would in C/C++ (unless you are using unsafe code or external libraries written in C/C++). What you do need to watch out for is references that stay around or are hidden (like a collection class that has been told to remove an item but does not set the element of its internal array to null). Generally, objects referenced only from the stack (locals and parameters) cannot 'leak' unless you store a reference to them in an object or class variable.
Some comments on your code:
You can reduce the allocation/deallocation of memory by pre-allocating the StringBuilder to at least the proper size. Since you know you will need to hold the entire file in memory, allocate it to the file size (this will actually give you a buffer that is just a little bigger than required since you are not storing new-line character sequences but the file probably has them):
FileInfo fi = new FileInfo(path);
StringBuilder fb = new StringBuilder((int) fi.Length);
You may want to ensure the file exists before getting its length, using fi to check for that. Note that I just down-cast the length to an int without error checking as your files are less than 2GB based on your question text. If that is not the case then you should verify the length before casting it, perhaps throwing an exception if the file is too big.
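A minimal sketch of the pre-allocation with the checks described above. The LogReader class and ReadAllLines method are names made up for illustration, and the encoding is passed in rather than hard-coded; treat it as one possible shape, not a drop-in replacement for AppendToLog().

```csharp
using System;
using System.IO;
using System.Text;

static class LogReader
{
    // Reads a whole text file into a StringBuilder sized from the file length.
    // Line terminators are dropped, matching the ReadLine loop in the question.
    public static StringBuilder ReadAllLines(string path, Encoding enc)
    {
        var fi = new FileInfo(path);
        if (!fi.Exists)
            throw new FileNotFoundException("Log file not found.", path);
        if (fi.Length > int.MaxValue)
            throw new IOException("Log file is too large to buffer in memory.");

        // The byte length is an upper bound on the character count here.
        var fb = new StringBuilder((int)fi.Length);
        using (var reader = new StreamReader(path, enc))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
                fb.Append(line);
        }
        return fb;
    }
}
```

The caller would then pass LogReader.ReadAllLines(path, enc).ToString() to parseMessages as before.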
I would recommend removing all the variable = null statements in your code. These are not necessary since these are stack allocated variables. As well, in this context, it will not help the GC since the method will not live for a long time. So, by having them you create additional clutter in the code and it is more difficult to understand.
In your ParseMessages method, you catch a NullReferenceException and assume that is just a non-text node. This could lead to confusing problems in the future. Since this is something you expect to normally happen as a result of something that may exist in the data you should check for the condition in the code, such as:
if (node.Text != null)
sb.Append(node.Text.Trim()); //Name
Exceptions are for exceptional/unexpected conditions in the code. Assigning significant meaning to NullReferenceException more than that there was a null reference can (likely will) hide errors in other parts of that same try block now or with future changes.
There is no memory leak. If you are using Windows Task Manager to measure the memory used by your .NET application you are not getting a clear picture of what is going on, because the GC manages memory in a complex way that Task Manager doesn't reflect.
An MS engineer wrote a great article about why .NET applications that seem to be leaking memory probably aren't, and it has links to very in-depth explanations of how the GC actually works. Every .NET programmer should read them.
I would look carefully at why you need to pass a string to parseMessages, i.e. fb.ToString().
Your code comment says that this returns an array of each line's content. However, you are actually reading all lines from the log file into fb and then converting that to a single string.
If you are parsing large files in parseMessages(), you could do this much more efficiently by passing the StringBuilder itself or the StreamReader into parseMessages(). This would allow loading only a portion of the file into memory at any time, as opposed to ToString(), which currently forces the entire log file into memory.
You are less likely to have a true memory leak in a .NET application thanks to garbage collection. You do not look to be using any large resources such as files, so it seems even less likely that you have an actual memory leak.
It looks like you have disposed of resources ok, however the GC is probably struggling to allocate and then deallocate the large memory chunks in time before the next iteration starts, and so you see the increasing memory usage.
While GC.Collect() may allow you to force memory deallocation, I would strongly advise looking into the suggestions above before resorting to trying to manually manage memory via GC.
[Update] Seeing your parseMessages() and the use of HtmlAgilityPack (a very useful library, by the way), it looks likely that some large and possibly numerous memory allocations are being performed for every log file.
HtmlAgilityPack allocates memory for various nodes internally. Combined with your buffer array and the allocations in the main function, I'm even more confident that the GC is being put under a lot of pressure to keep up.
To stop guessing and get some real metrics, I would run Process Explorer and add the columns that show the GC Gen 0, 1 and 2 collections. Then run your application and observe the number of collections. If you're seeing large numbers in these columns, then the GC is struggling and you should redesign to use fewer memory allocations.
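If you would rather log the same kind of numbers from inside the program, something along these lines should work. GcStats and Snapshot are hypothetical names; GC.GetTotalMemory and GC.CollectionCount are the standard framework calls. This is only a rough sketch for eyeballing trends, not a substitute for a profiler.

```csharp
using System;

static class GcStats
{
    // Formats the managed heap size and the per-generation collection counts.
    public static string Snapshot()
    {
        // false: report the current estimate without forcing a collection first
        long managedBytes = GC.GetTotalMemory(false);
        return string.Format(
            "managed={0:N0} bytes, gen0={1}, gen1={2}, gen2={3}",
            managedBytes,
            GC.CollectionCount(0),
            GC.CollectionCount(1),
            GC.CollectionCount(2));
    }
}
```

Calling Console.WriteLine(GcStats.Snapshot()) after each AppendToLog(path) shows whether the managed heap keeps growing or levels off once collections kick in.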
Alternatively, the free CLR Profiler 2.0 from Microsoft provides nice visual representation of .NET memory allocations within your application.
One thing you may want to try is temporarily forcing a GC.Collect after each run. The GC is very intelligent, and will not reclaim memory until it feels the expense of a collection is worth the value of any recovered memory.
Edit: I just wanted to add that it's important to understand that calling GC.Collect manually is bad practice for any normal use case (an abnormal one might be a load function for a game or some such). You should let the garbage collector decide what's best, as it will generally have more information than is available to you about system resources and the like on which to base its collection behaviour.
The try-catch block could use a finally (cleanup). If you look at what the using statement does, it is equivalent to a try/finally that calls Dispose. Yes, running the GC is a good idea too. Without compiling this code and giving it a try, it is hard to say for sure ...
Also, dispose this guy properly using a using:
FileStream destf = new FileStream(destFileName, FileMode.Append);
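As a rough sketch of what that could look like, here is the writing half of AppendToLog() pulled into a helper with both objects wrapped in using. LogWriter and AppendMessages are made-up names, and the encoding is a parameter instead of the hard-coded ISO-8859-2.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Text;

static class LogWriter
{
    // Appends the non-null messages to destFileName. Both the stream and the
    // writer are disposed when the using blocks exit, even if a write throws.
    public static void AppendMessages(string destFileName, IEnumerable<string> messages, Encoding enc)
    {
        using (var destf = new FileStream(destFileName, FileMode.Append))
        using (var swriter = new StreamWriter(destf, enc))
        {
            foreach (string message in messages)
            {
                if (message != null)
                    swriter.WriteLine(message);
            }
        }
    }
}
```

With this shape the explicit Dispose() calls and the null assignments at the end of the method are no longer needed.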
Look up Effective C# 2nd edition
I would manually clear the messages array and the StringBuilder before setting them to null.
Edit:
Looking at what the process seems to do, I have a suggestion, if it's not too late: instead of parsing an HTML file, create a DataSet schema, use it to write and read an XML log file, and use an XSL file to convert that into HTML.
I don't see any obvious memory leaks; my first guess would be that it's something in the library.
A good tool to figure this kind of thing out is the .NET Memory Profiler, by SciTech. They have a free two-week trial.
Short of that, you could try commenting out some of the library functions, and see if the problem goes away if you just read the files and do nothing with the data.
Also, where are you looking for memory use stats? Keep in mind that the stats reported by Task Manager aren't always very useful or reflective of actual memory use.
The HtmlDocument class (as far as I can determine) has a serious memory leak when used from managed code. I recommend using the XMLDOM parser instead (though this does require well-formed documents, but that's another plus).
I have data stored in several separate text files that I parse and analyze afterwards.
The size of the data processed differs a lot. It ranges from a few hundred megabytes (or less) to 10+ gigabytes.
I started out with storing the parsed data in a List<DataItem> because I wanted to perform a BinarySearch() during the analysis. However, the program throws an OutOfMemoryException if too much data is parsed. The exact amount the parser can handle depends on the fragmentation of the memory. Sometimes it's just 1.5 GB of the files and some other time it's 3 GB.
Currently I'm using a List<List<DataItem>> with a limited number of entries because I thought it would change something for the better. There weren't any significant improvements though.
Another way I tried was serializing the parsed data and then deserializing it when needed. The result of that approach was even worse. The whole process took much longer.
I looked into memory mapped files but I don't really know if they could help me because I never used them before. Would they?
So how can I quickly access the data from all the files without the danger of throwing an OutOfMemoryException and find DataItems depending on their attributes?
EDIT: The parser roughly works like this:
void Parse() {
LoadFile();
for (int currentLine = 1; currentLine < MAX_NUMBER_OF_LINES; ++currentLine) {
string line = GetLineOfFile(currentLine);
string[] tokens = SplitLineIntoTokens(line);
DataItem data = PutTokensIntoDataItem(tokens);
try {
List<DataItem>.Add(data);
} catch (OutOfMemoryException ex) {}
}
}
void LoadFile(){
    DirectoryInfo di = new DirectoryInfo(Path);
    FileInfo[] fileList = di.GetFiles();
    foreach(FileInfo fi in fileList)
    {
        //...
        StreamReader file = new StreamReader(fi.FullName);
        //...
        while(!file.EndOfStream)
            strHelp = file.ReadLine();
        //...
    }
}
There is no right answer for this I believe. The implementation depends on many factors that only you can rate pros and cons on.
If your primary purpose is to parse a large number of large files, keeping them all in memory should be a secondary option regardless of how much RAM is available, for various reasons, for example persistence when an unhandled exception occurs.
Initial profiling may encourage you to load everything into memory and keep it there for manipulation and search, but that will change as the number of files grows, and before long the people supporting your application will abandon the approach.
I would do the following:
Read and store each file's content in a document database such as RavenDB.
Run the parse routine on these documents and store the relevant relations in an RDBMS, if that is the requirement.
Search at will, full-text or otherwise, on either the document database (raw) or the relational one (your parse output).
By doing this, you take advantage of the research done by the creators of these systems into managing memory efficiently with a focus on performance.
I realise that this may not be the answer for you, but it may suit someone else better.
If the code in your question is representative of the actual code, it looks like you're reading all of the data from all of the files into memory, and then parsing. That is, you have:
Parse()
LoadFile();
for each line
....
And your LoadFile loads all of the files into memory. Or so it seems. That's very wasteful because you maintain a list of all the un-parsed lines in addition to the objects created when you parse.
You could instead load only one line at a time, parse it, and then discard the unparsed line. For example:
void Parse()
{
    foreach (var line in GetFileLines())
    {
        // tokenize the line and build a DataItem here
    }
}
IEnumerable<string> GetFileLines()
{
    foreach (var fileName in Directory.EnumerateFiles(Path))
    {
        foreach (var line in File.ReadLines(fileName))
        {
            yield return line;
        }
    }
}
That limits the amount of memory you use to hold the file names and, more importantly, the amount of memory occupied by un-parsed lines.
Also, if you have an upper limit to the number of lines that will be in the final data, you can pre-allocate your list so that adding to it doesn't cause a re-allocation. So if you know that your file will contain no more than 100 million lines, you can write:
void Parse()
{
    var dataItems = new List<DataItem>(100000000);
    foreach (var line in GetFileLines())
    {
        var data = tokenize_and_build(line);
        dataItems.Add(data);
    }
}
This reduces fragmentation and out of memory errors because the list is pre-allocated to hold the maximum number of lines you expect. If the pre-allocation works, then you know you have enough memory to hold references to the data items you're constructing.
If you still run out of memory, then you'll have to look at the structure of your data items. Perhaps you're storing too much information in them, or there are ways to reduce the amount of memory used to store those items. But you'll need to give us more information about your data structure if you need help reducing its footprint.
You can use:
Data Parallelism (Task Parallel Library)
Write a Simple Parallel.ForEach
I think it will reduce out-of-memory exceptions and make file handling faster.
I've found a memory leak in my parser. I don't know how to fix that problem.
Here is the basic routine.
private void parsePage() {
String[] tmp = null;
foreach (String row in rows) {
tmp = row.Split(new []{" "}, StringSplitOptions.None);
PrivateRow t = new PrivateRow();
t.Field1 = tmp[1];
t.Field2 = tmp[2];
t.Field3 = tmp[3];
t.Field4 = String.Join(" ", tmp);
myBigCollection.Add(t);
}
}
private void parseFromFile() {
String[] tmp = null;
foreach (String row in rows) {
PrivateRow t = new PrivateRow();
t.Field1 = "mystring1";
t.Field2 = "mystring2222";
t.Field3 = "mystring3333";
t.Field4 = "mystring1 xxx yy zzz";
myBigCollection.Add(t);
}
}
Running parsePage() on a collection (rows is a List of 100000 elements) makes my app grow from 20MB to 70MB.
Running parseFromFile(), which reads the SAME collection from a file but avoids the split/join, takes about 1MB.
Using a memory profiler, I see that the "t" fields and PrivateRow keep references to the String.Split() array and the String.Join() result.
I suppose that's because I assign a reference, not a copy, that can be garbage collected.
OK, using 70MB isn't a big deal, but when I launch in production, with a lot of sites, it can rise to 2.5-3GB...
Cheers
This isn't a memory leak per se. It's actually behaving properly. The reason your second function uses so much less memory, is simply because you only have four strings in use. Each of these four strings is allocated only once, and subsequent uses of the strings for new t.Fieldx instances actually refer to the same string values. Strings are immutable, so if you refer to the same string value more than once, it can be handled by the same string instances. See the paragraph labelled "Interning" at this article on String in .NET for some more detail on this.
In your first function, you have what are probably mostly different strings for each field, and each time through the loop. That simply is much more varied data. The fact that those strings are held on to is what you want to have happen for as long as your PrivateRow objects exist.
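A small sketch of the difference, if you want to see it for yourself. InterningDemo and its two methods are illustrative names only; object.ReferenceEquals and string.Intern are the standard framework calls.

```csharp
using System;

static class InterningDemo
{
    // True when both variables point at the same string object.
    public static bool SameInstance(string a, string b)
    {
        return object.ReferenceEquals(a, b);
    }

    // Builds "mystring1" at run time, so the result is a separate object
    // from the literal even though the characters are identical.
    public static string BuildAtRuntime()
    {
        return new string("mystring1".ToCharArray());
    }
}
```

Two "mystring1" literals in the same assembly share one instance, which is why parseFromFile() costs almost nothing per row. A string built at run time (as Split and Join do) compares equal by value but is a separate object, unless you pass it through string.Intern.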
You don't have a memory leak at all; it's just that the garbage collector takes time to process it.
I suppose that's because I assign a reference, not a copy, that can
be garbage collected.
That assumption is not correct. string is a reference type, but a somewhat special one in the BCL: assignment copies only the reference, and because strings are immutable that behaves much like copying a value.
Now, what about a possible solution in case you have intensive memory pressure? If you have a massive number of strings to process from a file, you may look at two options.
1) Process them in sequence by reading a stream (not loading everything at once), keeping as little data in memory as possible and as makes sense.
2) Use MemoryMappedFile to, again, load only chunks of data and process them in sequence.
The second can be combined with the first.
Like others have said, there is no evidence of a memory leak here, just delayed garbage collection. All memory should be cleaned up eventually.
That being said, there are a couple things you can do to help keep memory usage lower or recover it more quickly:
1)You should be able to replace
t.Field4 = String.Join(" ", tmp);
with
t.Field4 = row;
You created tmp by splitting row, then you're joining it back together. Avoid creating a new string by just using row.
2) Call GC.Collect(); at the end of the method to request immediate garbage collection. This won't reduce the memory used within the method, but it should free up memory more quickly.
If your application is memory-usage critical and there is a lot of repeating data you should replace string values with Enums.
I have a StringBuilder to which I append every pixel in an image, so the amount of data is extremely large. The first time I run my program everything goes well, but once I change a pixel color (ARGB) I get an OutOfMemoryException at the spot where I clear the StringBuilder. The problem is that I need to create an instance of StreamWriter, then add my text to it, and THEN set the file path. My current code is:
StringBuilder PixelFile = new StringBuilder("", 5000);
private void Render()
{
    //On the second run, I get an OutOfMemoryException
    PixelFile.Clear();
    //This is in a for loop, but I cut it out for brevity.
    PixelFile.Append(ArGBFormat);
}
I do not know what is causing this. I have tried PixelFile.Length = 0; and PixelFile.Capacity = 0;
OutOfMemory probably means you're building the string too big for StringBuilder, which is designed to handle a very different type of operation.
While I'm at a loss for how to make StringBuilder work, let me point you at a more intuitive implementation that will be less likely to fail.
You can read and write from a file using direct binary through the BinaryReader and BinaryWriter classes. This can also save you a lot of effort since you can make sure you're serializing bytes instead of character strings or entire words.
If you absolutely must use plaintext, consider the StreamReader and StreamWriter classes directly, as they won't throw exceptions for size. Remember, streams are intended for this sort of operation, StringBuilder is not, so Streams are far more likely to work with far less effort on your part.
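For the plaintext route, a minimal sketch might look like this. PixelExport and WritePixels are hypothetical names, the one-hex-value-per-line format is an assumption (the question does not show its ArGBFormat), and the sketch assumes the file path is known before writing starts.

```csharp
using System.IO;

static class PixelExport
{
    // Writes one ARGB value per line straight to the file, so the full text
    // never has to exist in memory as a single string.
    public static void WritePixels(string path, int[] argbPixels)
    {
        using (var writer = new StreamWriter(path, false)) // false: overwrite, do not append
        {
            foreach (int argb in argbPixels)
                writer.WriteLine(argb.ToString("X8")); // 8 hex digits: AARRGGBB
        }
    }
}
```

Because each value goes to the stream as soon as it is formatted, memory use stays roughly constant no matter how large the image is.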
EDIT:
When the maximum capacity is reached, no further memory can be allocated for the StringBuilder object, and trying to add characters or expand it beyond its maximum capacity throws either an ArgumentOutOfRangeException or an OutOfMemoryException exception.
Therefore, this is a limitation of the StringBuilder class and cannot be overcome with your current implementation.
EDIT: Additional implementation
In addition to StreamWriters which can write directly to files, you can also use the MemoryStream class to pipe information to memory instead of disk. Be aware this could lead to slow performance of the program, and I recommend instead trying to refactor the process to only need to perform a stream once.
That being said, it is still possible.
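var mem = new MemoryStream();
var memWriter = new StreamWriter(mem);
// TODO: use memWriter.Write as per StreamWriter
memWriter.Flush(); // Push any text still buffered in the writer into the MemoryStream
mem.Position = 0; // This ensures you are copying your stream from the beginning
// TODO: Show your file save dialog
// CopyTo needs a Stream as its target, so open a FileStream rather than a StreamWriter
using (var fileStream = new FileStream(fileNameFromDialog, FileMode.Create))
{
    mem.CopyTo(fileStream); // Perform the copy
}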
I am trying to export SQL table data to a text file with '~' delimiter in C# code.
When data is small its fine. When it's huge, it is throwing an Out of memory exception.
My Code:
public static void DataTableToTextFile(DataTable dtToText, string filePath)
{
int i = 0;
StreamWriter sw = null;
try
{
sw = new StreamWriter(filePath, false); /*For ColumnName's */
for (i = 0; i < dtToText.Columns.Count - 1; i++)
{
sw.Write(dtToText.Columns[i].ColumnName + '~');
}
sw.Write(dtToText.Columns[i].ColumnName + '~');
sw.WriteLine(); /*For Data in the Rows*/
foreach (DataRow row in dtToText.Rows)
{
object[] array = row.ItemArray;
for (i = 0; i < array.Length - 1; i++)
{
sw.Write(array[i].ToString() + '~');
}
sw.Write(array[i].ToString() + '~');
sw.WriteLine();
}
sw.Close();
}
catch (Exception ex)
{
throw new Exception("");
}
}
Is there a better way to do this in a stored procedure or BCP command?
If there's no specific reason for using the ~ delimiter format, you might try using the DataTable WriteXml function (http://msdn.microsoft.com/en-us/library/system.data.datatable.writexml.aspx)
For example:
dtToText.WriteXml(@"c:\data.xml");
If you need to convert this text back to a DataTable later you can use ReadXml (http://msdn.microsoft.com/en-us/library/system.data.datatable.readxml.aspx)
If you really need to make the existing code work, I'd probably try closing and calling Dispose on the StreamWriter at a set interval, then reopen and append to the existing text.
I realize that this question is years old, but I recently experienced a similar problem. The solution: briefly, I think that you're running into problems with the .NET Large Object Heap. A relevant link:
https://www.simple-talk.com/dotnet/.net-framework/the-dangers-of-the-large-object-heap/
To summarize the above article: when you allocate chunks of memory of 85,000 bytes or more (which seems likely to happen behind the scenes in your StreamWriter object if the values in your DataTable are large enough), they go onto a separate heap, the Large Object Heap (LOH). Memory chunks in the LOH are deallocated normally when their lifetime expires, but the heap is not compacted. The net result is that a System.OutOfMemoryException is thrown, not because there isn't actually enough memory, but because there isn't enough contiguous memory in the heap at some point.
If you're using .NET framework 4.5.1 or later (which won't work on Visual Studio 2010 or before; it might work on VS2012), you can use this command:
System.Runtime.GCSettings.LargeObjectHeapCompactionMode = GCLargeObjectHeapCompactionMode.CompactOnce;
This command forces LOH compaction to happen at the next garbage collection. Just put that command as the first line in your function; it will be set to CompactOnce every time this function is called, which will cause LOH compaction at some indeterminate point after the function is called.
If you don't have .NET 4.5.1, it gets uglier. The problem is that the memory allocation isn't explicit; it's happening behind the scenes in your StreamWriter, most likely. Try calling GC.Collect(), forcing garbage collection, from time to time--perhaps every 3rd time this function is called.
A warning: Lots of people will advise you that calling GC.Collect() directly is a bad idea and will slow down your application--and they're right. I just don't know a better way to handle this problem.
I am creating a downloading application and I wish to preallocate room on the hard drive for the files before they are actually downloaded, as they could potentially be rather large, and no one likes to see "This drive is full, please delete some files and try again." So, in that light, I wrote this.
// Quick, and very dirty
System.IO.File.WriteAllBytes(filename, new byte[f.Length]);
It works, at least until you download a file that is several hundred MB, or potentially even several GB, and you throw Windows into a thrashing frenzy, if not totally wipe out the pagefile and kill your system's memory altogether. Oops.
So, with a little more enlightenment, I set out with the following algorithm.
using (FileStream outFile = System.IO.File.Create(filename))
{
// 4194304 = 4MB; loops from 1 block in so that we leave the loop one
// block short
byte[] buff = new byte[4194304];
for (int i = buff.Length; i < f.Length; i += buff.Length)
{
outFile.Write(buff, 0, buff.Length);
}
outFile.Write(buff, 0, f.Length % buff.Length);
}
This works, well even, and doesn't suffer the crippling memory problem of the last solution. It's still slow though, especially on older hardware since it writes out (potentially GB's worth of) data out to the disk.
The question is this: Is there a better way of accomplishing the same thing? Is there a way of telling Windows to create a file of x size and simply allocate the space on the filesystem rather than actually write out a tonne of data. I don't care about initialising the data in the file at all (the protocol I'm using - bittorrent - provides hashes for the files it sends, hence worst case for random uninitialised data is I get a lucky coincidence and part of the file is correct).
FileStream.SetLength is the one you want. The syntax:
public override void SetLength(
long value
)
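A minimal sketch of how that might be used to create the file at its final size. Preallocate and CreateWithLength are made-up names; whether and how the file system physically reserves the blocks is outside what this shows.

```csharp
using System.IO;

static class Preallocate
{
    // Creates (or truncates) the file and sets its length without writing
    // the contents out from a buffer.
    public static void CreateWithLength(string path, long length)
    {
        using (var fs = new FileStream(path, FileMode.Create, FileAccess.Write, FileShare.None))
        {
            fs.SetLength(length);
        }
    }
}
```

For the downloader this would be called once per file, e.g. Preallocate.CreateWithLength(filename, f.Length), before any blocks arrive.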
If you have to create the file, I think that you can probably do something like this:
using (FileStream outFile = System.IO.File.Create(filename))
{
outFile.Seek(length_to_write - 1, SeekOrigin.Begin);
outFile.WriteByte(0);
}
Where length_to_write would be the size in bytes of the file to write. I'm not sure that I have the C# syntax correct (not on a computer to test), but I've done similar things in C++ in the past and it's worked.
Unfortunately, you can't really do this just by seeking to the end. That will set the file length to something huge, but may not actually allocate disk blocks for storage. So when you go to write the file, it will still fail.