I want to create a hex editor to open large binary files.
This is my code. It works well for small files, but when I open large files the hex editor runs into problems.
// data[] ... array of byte
string str = "";
byte[] temp = null;
int i;
for (i = 0; i < (data.Length - 16); i += 16)
{
    temp = _sub_array(data, i, 16);
    str += BitConverter.ToString(temp).Replace("-", "\t");
    str += "\n";
}
temp = _sub_array(data, i, (data.Length - i));
str += BitConverter.ToString(temp).Replace("-", "\t");
richTextBox.Text = str;
As has been said in the comments, you should try to avoid reading in the entire file at once. However, if you do need the entire file in memory at once, I think your main problem is the "stickiness" the program experiences while reading and converting the data. You are wiser to use a separate thread for the hex work and let the main thread focus on keeping your UI responsive. You could also use tasks instead of threads; either works. Using your code snippet, make it look more like this:
// data[] ... array of byte
private void button1_Click(object sender, EventArgs e)
{
    Thread t = new Thread(readHexFile);
    t.Start();
}

private void readHexFile()
{
    string str = "";
    byte[] temp = null;
    int i;
    for (i = 0; i < (data.Length - 16); i += 16)
    {
        temp = _sub_array(data, i, 16);
        str += BitConverter.ToString(temp).Replace("-", "\t");
        str += "\n";
    }
    temp = _sub_array(data, i, (data.Length - i));
    str += BitConverter.ToString(temp).Replace("-", "\t");
    BeginInvoke(new Action(() => richTextBox.Text = str));
}
You'll need to add "using System.Threading;" to get access to threads. Also note the BeginInvoke wrapping the richTextBox.Text assignment in a lambda expression. This is necessary when you run the data processing on a separate thread: if you try to access the textbox directly from that thread, Windows will complain about a cross-thread call, because only the thread that created a control is allowed to access it directly. BeginInvoke doesn't access the control directly, so you can use it from the data processing thread to get the text written to the control. This stops the data processing from "gumming up" the UI responsiveness.
This may seem intimidating at first if you have never done it, but trust me: once you get the hang of Threads and Tasks (which are different inside the machine, but can be manipulated with similar developer tools) you will never want to do heavy data processing on the UI thread again.
EDIT: I left the string from your code as it was, but I agree with the comment suggesting StringBuilder instead. Strings are immutable, so each time you concatenate to the string, internally what's happening is that the whole string is being scrapped and a new one is being made with the additional text. So yeah, do switch to a StringBuilder object as well.
So you've got working code for small files, but you face problems with large files. You don't mention what those problems are so here are a few guesses:
If you're loading the entire file into a byte[], then you could have a memory issue and possibly throw an OutOfMemoryException
You're concatenating a string repeatedly. This is not only a memory issue, but a performance one too (Reference Jon Skeet's article http://www.yoda.arachsys.com/csharp/stringbuilder.html)
Your _sub_array() is called repeatedly and returns a 16-length byte[], yet another memory and performance issue.
You call String.Replace() repeatedly (See bullet 2).
I consider these to be memory problems because we don't know when the Garbage Collector will clean up the memory.
So let's address these potential problems (a rough sketch putting them together follows this list):
Read your file 16 bytes at a time (per @EZI's comment); this also eliminates the need for your _sub_array(). Look into the FileStream class to read 16 bytes at a time.
BitConverter.ToString() these 16 bytes into a StringBuilder with StringBuilder.AppendLine() (My comment), but don't do the String.Replace() until you're done reading the file.
Once you're done reading the file, you can assign the StringBuilder to your RichTextBox like so (sb is a variable name used for StringBuilder): richTextBox.Text = sb.ToString();
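Putting those three points together, here's a rough, untested sketch. It assumes the path is in a filePath variable and keeps the tab-separated, 16-bytes-per-line layout from your original code (you'll need System.IO and System.Text):

StringBuilder sb = new StringBuilder();
byte[] buffer = new byte[16];

using (FileStream fs = new FileStream(filePath, FileMode.Open, FileAccess.Read))
{
    int bytesRead;
    while ((bytesRead = fs.Read(buffer, 0, buffer.Length)) > 0)
    {
        // Convert only the bytes actually read (the last line may be short).
        sb.AppendLine(BitConverter.ToString(buffer, 0, bytesRead));
    }
}

// Do the single Replace at the very end, then hand the result to the control.
richTextBox.Text = sb.ToString().Replace("-", "\t");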
Hope this helps...
Related
I want to read a file into a RichTextBox without using LoadFile (I might want to display the progress). The file contains only ASCII characters.
I was thinking of reading the file in chunks.
I have done the following (which is working):
const int READ_BUFFER_SIZE = 4 * 1024;

BinaryReader reader = new BinaryReader(File.Open("file.txt", FileMode.Open));
byte[] buf = new byte[READ_BUFFER_SIZE];
do {
    int ret = reader.Read(buf, 0, READ_BUFFER_SIZE);
    if (ret <= 0) {
        break;
    }
    // Convert only the bytes actually read, not the whole buffer.
    string text = Encoding.ASCII.GetString(buf, 0, ret);
    richTextBox.AppendText(text);
} while (true);
My concern is:
string text = Encoding.ASCII.GetString(buf);
I have seen that it is not possible to add a byte[] to a RichTextBox.
My questions are:
Will a new string object be allocated for every chunk which is read?
Isn't there a better way not to have to create a string object just for appending the text to the RichTextBox?
Or, is it more efficient to read lines from the file (StreamReader.ReadLine) and just add to the RichTextBox the string returned?
Will a new string object be allocated for every chunk which is read?
Yes.
Isn't there a better way not to have to create a string object just for appending the text to the RichTextBox?
No, AppendText requires a string
Or, is it more efficient to read lines from the file (StreamReader.ReadLine) and just add to the RichTextBox the string returned?
No, that's considerably less efficient. You'll now create a new string object much more frequently. Which is okay from the garbage collected heap perspective, you don't create more garbage. But it is absolute murder on the RichTextBox, it constantly needs to re-allocate its own buffer. Which includes moving all the text previously read. What you have is already good, you should just use a much larger READ_BUFFER_SIZE.
Unfortunately there are conflicting goals here. You don't want to make the buffer larger than 39,999 bytes or the strings end up in the Large Object Heap and clog it up until a gen# 2 garbage collection happens. But the RTB will be much happier if you go considerably past that size, like a megabyte if the file is so large that you need a progress bar.
If you want to make it really efficient then you need to replace RichTextBox.LoadFile(). The underlying Windows message is EM_STREAMIN, it uses a callback mechanism to stream in the text. You can technically replace the callback to do what the default one does in RichTextBox, plus update a progress bar. It does permit getting rid of the strings btw. The pinvoke is pretty unfriendly, use the Reference Source for guidance.
Take the easy route first, increase the buffer size. Only consider using the pinvoke route when your code is considerably slower than using File.ReadAllText().
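For reference, the easy route is just your existing loop with a much larger buffer and rough progress reporting. A sketch; the 1 MB figure and the progressBar control are assumptions, tune them to taste:

const int READ_BUFFER_SIZE = 1024 * 1024; // 1 MB per AppendText call

using (var reader = new BinaryReader(File.Open("file.txt", FileMode.Open)))
{
    long total = reader.BaseStream.Length;
    long done = 0;
    byte[] buf = new byte[READ_BUFFER_SIZE];
    int ret;
    while ((ret = reader.Read(buf, 0, buf.Length)) > 0)
    {
        richTextBox.AppendText(Encoding.ASCII.GetString(buf, 0, ret));
        done += ret;
        progressBar.Value = (int)(done * 100 / Math.Max(1, total)); // assumes a 0-100 ProgressBar
    }
}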
Try this:
richTextBox.AppendText(File.ReadAllText("file.txt"));
or
richTextBox.AppendText(File.ReadAllText("file.txt", Encoding.ASCII));
You can use a StreamReader. Then you can read each line of the file and display the progress while reading.
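A minimal sketch of that idea, assuming an ASCII file named file.txt and a 0-100 progressBar control on the form (both placeholders):

using (var sr = new StreamReader("file.txt", Encoding.ASCII))
{
    string line;
    while ((line = sr.ReadLine()) != null)
    {
        richTextBox.AppendText(line + Environment.NewLine);
        // Rough progress based on how far we've read into the underlying stream.
        progressBar.Value = (int)(100 * sr.BaseStream.Position / Math.Max(1, sr.BaseStream.Length));
    }
}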
I have a very big char array that I need to convert to string in order to use Regex on it.
But it's so big that I get OutOfMemoryException when I pass that to string constructor.
I know that string is immutable, and therefore it shouldn't be possible to specify its underlying character collection, but I need a way to use regular expressions on it without copying the whole thing.
How do I get that array?
I get it from a file using StreamReader. I know the starting position and the length of the content to read, Read and ReadBlock methods need me to supply a char[] buffer.
So here are the things I want to know:
Is there a way to specify a string's underlying collection? (Does it even keep its chars in an array?)
...or using Regex directly on a char array?
...or getting the part of the file directly as string?
If you have a character or pattern that you could search for that is guaranteed NOT to be in the pattern you're trying to find, you could scan the array for that character and create smaller strings to process individually. Process would be something like:
char token = '|';
int start = 0;
int length = 0;
for (int i = 0; i < charArray.Length; i++)
{
    if (charArray[i] == token)
    {
        string split = new string(charArray, start, length);
        // check the string using the regex
        // move the start past the token and reset the length
        start = i + 1;
        length = 0;
    }
    else
    {
        length++;
    }
}
That way you're only copying smaller segments of the string, which can be GCed after each attempt, instead of the entire thing.
I would think your best bet would be to read multiple char[] chunks into individual strings that overlap by a certain amount. This way you'd be able to perform your Regex on the individual chunks, and the overlap would ensure that a "break" between chunks doesn't break the search pattern. In pseudo-code:
int chunkSize = 100000;
int overlap = 2000;

// Skip/Take require a "using System.Linq;" directive.
for (int i = 0; i < myCharArray.Length; i += chunkSize - overlap)
{
    // Grab your array chunk into a partial string.
    // By having your iteration step slightly smaller than
    // your chunk size you guarantee not to miss any
    // character groupings. You just need to make sure
    // your overlap is sufficient to cover the expression.
    string chunk = new string(myCharArray.Skip(i).Take(chunkSize).ToArray());
    // run your regex
}
One rather ugly option would be to use an unmanaged RegEx library (like the POSIX regular expression library) and unsafe code. You can obtain a char* pointer to the char array with a fixed statement, pass it directly to the unmanaged library, then marshal the responses back.
fixed (char* pArray = largeCharArray)
{
    // call unmanaged code with pArray
}
If you are using .NET 4.0 or higher, what you should be using is a MemoryMappedFile. This class was designed specifically so you could manipulate very large files. From the MSDN documentation:
A memory-mapped file maps the contents of a file to an application's logical address space. Memory-mapped files enable programmers to work with extremely large files because memory can be managed concurrently, and they allow complete, random access to a file without the need for seeking. Memory-mapped files can also be shared across multiple processes.
Once you got your memory mapped file, check out this Stack Overflow answer on how to apply RegEx to the memory mapped file.
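For a feel of the shape, here's a minimal sketch of opening the file as a memory-mapped view and running the regex over windows of it. The file name, pattern, window size and ASCII encoding are all placeholders, and the linked answer covers the overlap handling that matches spanning a window boundary would need:

using System.IO;
using System.IO.MemoryMappedFiles;
using System.Text;
using System.Text.RegularExpressions;

using (var mmf = MemoryMappedFile.CreateFromFile("huge.txt", FileMode.Open))
using (var stream = mmf.CreateViewStream())
using (var reader = new StreamReader(stream, Encoding.ASCII))
{
    char[] window = new char[1 << 20]; // 1 MB window; must exceed the longest possible match
    int read;
    while ((read = reader.Read(window, 0, window.Length)) > 0)
    {
        foreach (Match m in Regex.Matches(new string(window, 0, read), @"pattern"))
        {
            // handle the match
        }
        // Matches spanning a window boundary need an overlap, as in the linked answer.
    }
}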
Hope this helps!
I am using a Windows Mobile Compact Edition 6.5 phone and am writing binary data out to a file from Bluetooth. These files get quite large, 16 MB+, and once the file is written I need to search it for a start character and then delete everything before it, thus eliminating garbage. I cannot do this inline as the data comes in, due to graphing issues and speed: I get a lot of data coming in and there are already too many if conditions on the incoming data, so I figured it was best to post-process.

Anyway, here is my dilemma: the search for the start bytes and the rewrite of the file sometimes takes 5 minutes or more. I basically move the file over to a temp file, parse through it and rewrite a whole new file, and I have to do this byte by byte.
private void closeFiles() {
    try {
        // Close file stream for raw data.
        if (this.fsRaw != null) {
            this.fsRaw.Flush();
            this.fsRaw.Close();

            // Move file, seek the first sync bytes,
            // write to fsRaw stream with sync byte and rest of data after it
            File.Move(this.s_fileNameRaw, this.s_fileNameRaw + ".old");
            FileStream fsRaw_Copy = File.Open(this.s_fileNameRaw + ".old", FileMode.Open);
            this.fsRaw = File.Create(this.s_fileNameRaw);

            int x = 0;
            bool syncFound = false;
            // search for sync byte algorithm
            while (x != -1) {
                ... logic to search for sync byte
                if (x != -1 && syncFound) {
                    this.fsPatientRaw.WriteByte((byte)x);
                }
            }

            this.fsRaw.Close();
            fsRaw_Copy.Close();
            File.Delete(this.s_fileNameRaw + ".old");
        }
    } catch (IOException e) {
        CLogger.WriteLog(ELogLevel.ERROR, "Exception in writing: " + e.Message);
    }
}
There has got to be a faster way than this!
------------ Testing times using the answer below ------------
Initial test, my way, with one byte read and one byte write:
27 KB/sec
Using the answer below and a 32768-byte buffer:
321 KB/sec
Using the answer below and a 65536-byte buffer:
501 KB/sec
You're doing a byte-wise copy of the entire file. That can't be efficient for a load of reasons. Search for the start offset (and end offset if you need both), then copy from one stream to another the entire contents between the two offsets (or the start offset and end of file).
EDIT
You don't have to read the entire contents to make the copy. Something like this (untested, but you get the idea) would work.
private void CopyPartial(string sourceName, byte syncByte, string destName)
{
    using (var input = File.OpenRead(sourceName))
    using (var reader = new BinaryReader(input))
    using (var output = File.Create(destName))
    {
        var start = 0;
        // seek to sync byte
        while (reader.ReadByte() != syncByte)
        {
            start++;
        }

        var buffer = new byte[4096]; // 4k page - adjust as you see fit
        do
        {
            var actual = reader.Read(buffer, 0, buffer.Length);
            output.Write(buffer, 0, actual);
        } while (reader.PeekChar() >= 0);
    }
}
EDIT 2
I actually needed something similar to this today, so I decided to write it without the PeekChar() call. Here's the kernel of what I did - feel free to integrate it with the second do...while loop above.
var buffer = new byte[1024];
var total = 0;
do
{
    var actual = reader.Read(buffer, 0, buffer.Length);
    writer.Write(buffer, 0, actual);
    total += actual;
} while (total < reader.BaseStream.Length);
Don't discount an approach because you're afraid it will be too slow. Try it! It'll only take 5-10 minutes to give it a try and may result in a much better solution.
If the detection process for the start of the data is not too complex/slow, then avoiding writing data until you hit the start may actually make the program skip past the junk data more efficiently.
How to do this:
Use a simple bool to know whether or not you have detected the start of the data. If you are reading junk, then don't waste time writing it to the output, just scan it to detect the start of the data. Once you find the start, then stop scanning for the start and just copy the data to the output. Just copying the good data will incur no more than an if (found) check, which really won't make any noticeable difference to your performance.
You may find that in itself solves the problem. But you can optimise it if you need more performance:
What can you do to minimise the work you do to detect the start of the data? Perhaps if you are looking for a complex sequence you only need to check for one particular byte value that starts the sequence, and it's only if you find that start byte that you need to do any more complex checking. There are some very simple but efficient string searching algorithms that may help in this sort of case too. Or perhaps you can allocate a buffer (e.g. 4kB) and gradually fill it with bytes from your incoming stream. When the buffer is filled, then and only then search for the end of the "junk" in your buffer. By batching the work you can make use of memory/cache coherence to make the processing considerably more efficient than it would be if you did the same work byte by byte.
Do all the other "conditions on the incoming data" need to be continually checked? How can you minimise the amount of work you need to do but still achieve the required results? Perhaps some of the ideas above might help here too?
Do you actually need to do any processing on the data while you are skipping junk? If not, then you can break the whole thing into two phases (skip junk, copy data), and skipping the junk won't cost you anything when it actually matters.
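A rough sketch of the bool-flag idea, folded into a buffered copy. The buffer size, the single-byte sync marker and the sourcePath/destPath/syncByte variables are assumptions; adapt it to your actual start sequence:

bool found = false;
byte[] buffer = new byte[65536];
int read;

using (var input = File.OpenRead(sourcePath))
using (var output = File.Create(destPath))
{
    while ((read = input.Read(buffer, 0, buffer.Length)) > 0)
    {
        int offset = 0;
        if (!found)
        {
            // Scan for the sync byte; skip the junk before it entirely.
            offset = Array.IndexOf(buffer, syncByte, 0, read);
            if (offset < 0)
                continue;      // whole buffer is junk, write nothing
            found = true;
        }
        output.Write(buffer, offset, read - offset);
    }
}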
I am trying to export SQL table data to a text file with '~' delimiter in C# code.
When the data is small it's fine. When it's huge, it throws an Out of Memory exception.
My Code:
public static void DataTableToTextFile(DataTable dtToText, string filePath)
{
    int i = 0;
    StreamWriter sw = null;
    try
    {
        sw = new StreamWriter(filePath, false);

        /* For ColumnNames */
        for (i = 0; i < dtToText.Columns.Count - 1; i++)
        {
            sw.Write(dtToText.Columns[i].ColumnName + '~');
        }
        sw.Write(dtToText.Columns[i].ColumnName + '~');
        sw.WriteLine();

        /* For Data in the Rows */
        foreach (DataRow row in dtToText.Rows)
        {
            object[] array = row.ItemArray;
            for (i = 0; i < array.Length - 1; i++)
            {
                sw.Write(array[i].ToString() + '~');
            }
            sw.Write(array[i].ToString() + '~');
            sw.WriteLine();
        }
        sw.Close();
    }
    catch (Exception ex)
    {
        throw new Exception("");
    }
}
Is there a better way to do this in a stored procedure or BCP command?
If there's no specific reason for using the ~ delimiter format, you might try using the DataTable WriteXml function (http://msdn.microsoft.com/en-us/library/system.data.datatable.writexml.aspx)
For example:
dtToText.WriteXml(@"c:\data.xml");
If you need to convert this text back to a DataTable later you can use ReadXml (http://msdn.microsoft.com/en-us/library/system.data.datatable.readxml.aspx)
If you really need to make the existing code work, I'd probably try closing and calling Dispose on the StreamWriter at a set interval, then reopen and append to the existing text.
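Something along these lines (completely untested; the flush interval of 1000 rows is an arbitrary guess) illustrates the close-and-reopen idea without changing the output:

int rowCount = 0;
StreamWriter sw = new StreamWriter(filePath, false);
try
{
    foreach (DataRow row in dtToText.Rows)
    {
        // ... write the row exactly as before ...

        if (++rowCount % 1000 == 0)
        {
            // Close and dispose periodically, then reopen in append mode.
            sw.Dispose();
            sw = new StreamWriter(filePath, true);
        }
    }
}
finally
{
    sw.Dispose();
}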
I realize that this question is years old, but I recently experienced a similar problem. Briefly: I think you're running into problems with the .NET Large Object Heap. A relevant link:
https://www.simple-talk.com/dotnet/.net-framework/the-dangers-of-the-large-object-heap/
To summarize the above article: When you allocate chunks of memory more than 85K long (which seems likely to happen behind the scenes in your StreamWriter object if the values in your DataTable are large enough), they go onto a separate heap, the Large Object Heap (LOH). Memory chunks in the LOH are deallocated normally when their lifetime expires, but the heap is not compacted. The net result is that a System.OutOfMemoryException is thrown, not because there isn't actually enough memory, but because there isn't enough contiguous memory in the heap at some point.
If you're using .NET framework 4.5.1 or later (which won't work on Visual Studio 2010 or before; it might work on VS2012), you can use this command:
System.Runtime.GCSettings.LargeObjectHeapCompactionMode = GCLargeObjectHeapCompactionMode.CompactOnce;
This command forces LOH compaction to happen at the next garbage collection. Just put that command as the first line in your function; it will be set to CompactOnce every time this function is called, which will cause LOH compaction at some indeterminate point after the function is called.
If you don't have .NET 4.5.1, it gets uglier. The problem is that the memory allocation isn't explicit; it's happening behind the scenes in your StreamWriter, most likely. Try calling GC.Collect(), forcing garbage collection, from time to time--perhaps every 3rd time this function is called.
A warning: Lots of people will advise you that calling GC.Collect() directly is a bad idea and will slow down your application--and they're right. I just don't know a better way to handle this problem.
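If you do go that route, the "every 3rd call" idea is just a static counter; something like this (the threshold of 3 is as arbitrary as it sounds):

private static int s_exportCallCount = 0;

public static void DataTableToTextFile(DataTable dtToText, string filePath)
{
    if (++s_exportCallCount % 3 == 0)
    {
        GC.Collect(); // forced collection; see the caveats above
    }
    // ... rest of the method as before ...
}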
Edit 2: I just want to make sure my question is clear: why does the application use 15 MB more on each iteration of AppendToLog()? (That's the size of the original log file.)
I've got a function called AppendToLog() which receives the file path of an HTML document, does some parsing and appends it to a file. It gets called this way:
this.user_email = uemail;
string wanted_user = wemail;
string[] logPaths;
logPaths = this.getLogPaths(wanted_user);

foreach (string path in logPaths)
{
    this.AppendToLog(path);
}
On every iteration, the RAM usage increases by 15 MB or so. This is the function (it looks long but it's simple):
public void AppendToLog(string path)
{
    Encoding enc = Encoding.GetEncoding("ISO-8859-2");
    StringBuilder fb = new StringBuilder();
    FileStream sourcef;
    string[] messages;

    try
    {
        sourcef = new FileStream(path, FileMode.Open);
    }
    catch (IOException)
    {
        throw new IOException("The chat log is in use by another process.");
    }

    using (StreamReader sreader = new StreamReader(sourcef, enc))
    {
        string file_buffer;
        while ((file_buffer = sreader.ReadLine()) != null)
        {
            fb.Append(file_buffer);
        }
    }

    // Array of each line's content
    messages = parseMessages(fb.ToString());
    fb = null;

    string destFileName = String.Format("{0}_log.txt", System.IO.Path.GetFileNameWithoutExtension(path));
    FileStream destf = new FileStream(destFileName, FileMode.Append);
    using (StreamWriter swriter = new StreamWriter(destf, enc))
    {
        foreach (string message in messages)
        {
            if (message != null)
            {
                swriter.WriteLine(message);
            }
        }
    }
    messages = null;

    sourcef.Dispose();
    destf.Dispose();
    sourcef = null;
    destf = null;
}
I've been at this for days and I don't know what to do :(
Edit: This is ParseMessages, a function that uses HtmlAgilityPack to strip parts of an HTML log.
public string[] parseMessages(string what)
{
    StringBuilder sb = new StringBuilder();
    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(what);

    HtmlNodeCollection messageGroups = doc.DocumentNode.SelectNodes("//body/div[@class='mplsession']");
    int messageCount = doc.DocumentNode.SelectNodes("//tbody/tr").Count;
    doc = null;

    string[] buffer = new string[messageCount];
    int i = 0;

    foreach (HtmlNode sessiongroup in messageGroups)
    {
        HtmlNode tablegroup = sessiongroup.SelectSingleNode("table/tbody");
        string sessiontime = sessiongroup.Attributes["id"].Value;
        HtmlNodeCollection messages = tablegroup.SelectNodes("tr");
        if (messages != null)
        {
            foreach (HtmlNode htmlNode in messages)
            {
                sb.Append(
                    ParseMessageDate(
                        sessiontime,
                        htmlNode.ChildNodes[0].ChildNodes[0].InnerText
                    )
                ); // Date
                sb.Append(" ");

                try
                {
                    foreach (HtmlTextNode node in htmlNode.ChildNodes[0].SelectNodes("text()"))
                    {
                        sb.Append(node.Text.Trim()); // Name
                    }
                }
                catch (NullReferenceException)
                {
                    /*
                     * We ignore this exception, it just means there's extra text
                     * and that means that it's not a normal message
                     * but a system message instead
                     * (i.e. "John logged off")
                     * Therefore we add the "::" mark for future organizing
                     */
                    sb.Append("::");
                }
                sb.Append(" ");

                string message = htmlNode.ChildNodes[1].InnerHtml;
                message = message.Replace("&quot;", "'");
                message = message.Replace("&nbsp;", " ");
                message = RemoveMedia(message);
                sb.Append(message); // Message

                buffer[i] = sb.ToString();
                sb = new StringBuilder();
                i++;
            }
        }
    }
    messageGroups = null;
    what = null;
    return buffer;
}
As many have mentioned, this is probably just an artifact of the GC not cleaning up the memory storage as fast as you are expecting it to. This is normal for managed languages, like C#, Java, etc. You really need to find out if the memory allocated to your program is free or not if you're are interested in that usage. The questions to ask related to this are:
How long is your program running? Is it a service type program that runs continuously?
Over the span of execution does it continue to allocate memory from the OS or does it reach a steady-state? (Have you run it long enough to find out?)
Your code does not look like it will have a "memory-leak". In managed languages you really don't get memory leaks like you would in C/C++ (unless you are using unsafe or external libraries that are C/C++). What happens though is that you do need to watch out for references that stay around or are hidden (like a Collection class that has been told to remove an item but does not set the element of the internal array to null). Generally, objects with references on the stack (locals and parameters) cannot 'leak' unless you store the reference of the object(s) into an object/class variables.
Some comments on your code:
You can reduce the allocation/deallocation of memory by pre-allocating the StringBuilder to at least the proper size. Since you know you will need to hold the entire file in memory, allocate it to the file size (this will actually give you a buffer that is just a little bigger than required since you are not storing new-line character sequences but the file probably has them):
FileInfo fi = new FileInfo(path);
StringBuilder fb = new StringBuilder((int) fi.Length);
You may want to ensure the file exists before getting its length, using fi to check for that. Note that I just down-cast the length to an int without error checking as your files are less than 2GB based on your question text. If that is not the case then you should verify the length before casting it, perhaps throwing an exception if the file is too big.
I would recommend removing all the variable = null statements in your code. These are not necessary since these are stack allocated variables. As well, in this context, it will not help the GC since the method will not live for a long time. So, by having them you create additional clutter in the code and it is more difficult to understand.
In your ParseMessages method, you catch a NullReferenceException and assume that is just a non-text node. This could lead to confusing problems in the future. Since this is something you expect to normally happen as a result of something that may exist in the data you should check for the condition in the code, such as:
if (node.Text != null)
sb.Append(node.Text.Trim()); //Name
Exceptions are for exceptional/unexpected conditions in the code. Assigning significant meaning to NullReferenceException more than that there was a null reference can (likely will) hide errors in other parts of that same try block now or with future changes.
There is no memory leak. If you are using Windows Task Manager to measure the memory used by your .NET application you are not getting a clear picture of what is going on, because the GC manages memory in a complex way that Task Manager doesn't reflect.
A MS engineer wrote a great article about why .NET applications that seem to be leaking memory probably aren't, and it has links to very in depth explanations of how the GC actually works. Every .NET programmer should read them.
I would look carefully at why you need to pass a string to parseMessages, ie fb.ToString().
Your code comment says that this returns an array of each lines content. However you are actually reading all lines from the log file into fb and then converting to a string.
If you are parsing large files in parseMessages() you could do this much more efficiently by passing the StringBuilder itself or the StreamReader into parseMessages(). This would enable only loading a portion of the file into memory at any time, as opposed to using ToString() which currently forces the entire logfile into memory.
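For illustration, a hypothetical reshaping of the signature along those lines. HtmlAgilityPack's HtmlDocument.Load(TextReader) can read from the stream directly, so the big fb.ToString() disappears; the body of the method would otherwise stay as it is:

public string[] parseMessages(TextReader what)
{
    HtmlDocument doc = new HtmlDocument();
    doc.Load(what);            // reads from the stream; no giant intermediate string
    // ... the existing XPath/parsing logic stays the same from here on ...
    return new string[0];      // placeholder; the real method builds the buffer as before
}

// Caller in AppendToLog:
// using (StreamReader sreader = new StreamReader(sourcef, enc))
//     messages = parseMessages(sreader);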
You are less likely to have a true memory leak in a .NET application thanks to garbage collection. You do not look to be using any large resources such as files, so it seems even less likely that you have an actual memory leak.
It looks like you have disposed of resources ok, however the GC is probably struggling to allocate and then deallocate the large memory chunks in time before the next iteration starts, and so you see the increasing memory usage.
While GC.Collect() may allow you to force memory deallocation, I would strongly advise looking into the suggestions above before resorting to trying to manually manage memory via GC.
[Update] Seeing your parseMessages() and the use of HtmlAgilityPack (a very useful library, by the way), it looks likely there are some large and possibly numerous allocations of memory being performed for every logfile.
HtmlAgility allocates memory for various nodes internally, when combined with your buffer array and the allocations in the main function I'm even more confident that the GC is being put under a lot of pressure to keep up.
To stop guessing and get some real metrics, I would run ProcessExplorer and add the columns to show the GC Gen 0,1,2 collections columns. Then run your application and observe the number of collections. If you're seeing large numbers in these columns then the GC is struggling and you should redesign to use less memory allocations.
Alternatively, the free CLR Profiler 2.0 from Microsoft provides nice visual representation of .NET memory allocations within your application.
One thing you may want to try, is temporarily forcing a GC.Collect after each run. The GC is very intelligent, and will not reclaim memory until is feels the expense of a collection is worth the value of any recovered memory.
Edit: I just wanted to add that it's important to understand that calling GC.Collect manually is bad practice (for any normal use case; abnormal == perhaps a load function for a game or somesuch). You should let the garbage collector decide what's best, as it will generally have more information than is available to you about system resources and the like on which to base its collection behaviour.
The try-catch block could use a finally (cleanup). If you look at what the using statement does, it is equivalent to try/finally. Yes, running GC is a good idea also. Without compiling this code and giving it a try it is hard to say for sure ...
Also, dispose this guy properly using a using:
FileStream destf = new FileStream(destFileName, FileMode.Append);
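That is, a minimal reshaping of the existing lines from the question:

using (FileStream destf = new FileStream(destFileName, FileMode.Append))
using (StreamWriter swriter = new StreamWriter(destf, enc))
{
    foreach (string message in messages)
    {
        if (message != null)
        {
            swriter.WriteLine(message);
        }
    }
}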
Look up Effective C# 2nd edition
I would manually clear the message array and the StringBuilder before setting them to null.
Edit:
Looking at what the process seems to do, I have a suggestion, if it's not too late: instead of parsing an HTML file, create a DataSet schema and use it to write and read an XML log file, then use an XSL file to convert it into an HTML file.
I don't see any obvious memory leaks; my first guess would be that it's something in the library.
A good tool to figure this kind of thing out is the .NET Memory Profiler, by SciTech. They have a free two-week trial.
Short of that, you could try commenting out some of the library functions, and see if the problem goes away if you just read the files and do nothing with the data.
Also, where are you looking for memory use stats? Keep in mind that the stats reported by Task Manager aren't always very useful or reflective of actual memory use.
The HtmlDocument class (as far as I can determine) has a serious memory leak when used from managed code. I recommend using the XMLDOM parser instead (though this does require well-formed documents, but that's another plus).