How to handle large amounts of data in C#?

How to handle large amounts of data in C#? - c#

I have data stored in several seperate text files that I parse and analyze afterwards.
The size of the data processed differs a lot. It ranges from a few hundred megabytes (or less) to 10+ gigabytes.
I started out with storing the parsed data in a List<DataItem> because I wanted to perform a BinarySearch() during the analysis. However, the program throws an OutOfMemory-Exception if too much data is parsed. The exact amount the parser can handle depends on the fragmentation of the memory. Sometimes it's just 1.5 gb of the files and some other time it's 3 gb.
Currently I'm using a List<List<DataItem>> with a limited number of entries because I thought it would change anything for the better. There weren't any significant improvements though.
Another way I tried was serializing the parser data and than deserializing it if needed. The result of that approach was even worse. The whole process took much longer.
I looked into memory mapped files but I don't really know if they could help me because I never used them before. Would they?
So how can I quickly access the data from all the files without the danger of throwing an OutOfMemoryException and find DataItems depending on their attributes?
EDIT: The parser roughly works like this:
void Parse() {
LoadFile();
for (int currentLine = 1; currentLine < MAX_NUMBER_OF_LINES; ++currentLine) {
string line = GetLineOfFile(currentLine);
string[] tokens = SplitLineIntoTokens(line);
DataItem data = PutTokensIntoDataItem(tokens);
try {
List<DataItem>.Add(data);
} catch (OutOfMemoryException ex) {}
}
}
void LoadFile(){
DirectoryInfo di = new DirectroyInfo(Path);
FileInfo[] fileList = di.GetFiles();
foreach(FileInfo fi in fileList)
{
//...
StreamReader file = new SreamReader(fi.FullName);
//...
while(!file.EndOfStram)
strHelp = file.ReadLine();
//...
}
}

There is no right answer for this I believe. The implementation depends on many factors that only you can rate pros and cons on.
If your primary purpose is to parse large files and large number of them, keeping these in memory irrespective of how much RAM is available should be a secondary option, for various reasons for e.g. like persistance at times when an unhandled exception occured.
Although when profiling under initial conditions you may be encouraged and inclined to load them to memory retain for manipulation and search, this will soon change as the number of files increase and in no time your application supporters will start ditching this.
I would do the below
Read and store each file content to a document database like Raven DB for e.g.
Perform parse routine on these documents and store the relevant relations in an rdbms db if that is the requirement
Search at will, fulltext or otherwise, on either the document db (raw) or relational (your parse output)
By doing this, you are taking advantage of research done by the creators of these systems in managing the memory efficiently with focus on performance
I realise that this may not be the answer for you, but for someone who may think this is better and suits perhaps yes.

If the code in your question is representative of the actual code, it looks like you're reading all of the data from all of the files into memory, and then parsing. That is, you have:
Parse()
LoadFile();
for each line
....
And your LoadFile loads all of the files into memory. Or so it seems. That's very wasteful because you maintain a list of all the un-parsed lines in addition to the objects created when you parse.
You could instead load only one line at a time, parse it, and then discard the unparsed line. For example:
void Parse()
{
foreach (var line in GetFileLines())
{
}
}
IEnumerable<string> GetFileLines()
{
foreach (var fileName in Directory.EnumerateFiles(Path))
{
foreach (var line in File.ReadLines(fileName)
{
yield return line;
}
}
}
That limits the amount of memory you use to hold the file names and, more importantly, the amount of memory occupied by un-parsed lines.
Also, if you have an upper limit to the number of lines that will be in the final data, you can pre-allocate your list so that adding to it doesn't cause a re-allocation. So if you know that your file will contain no more than 100 million lines, you can write:
void Parse()
{
var dataItems = new List<DataItem>(100000000);
foreach (var line in GetFileLines())
{
data = tokenize_and_build(line);
dataItems.Add(data);
}
}
This reduces fragmentation and out of memory errors because the list is pre-allocated to hold the maximum number of lines you expect. If the pre-allocation works, then you know you have enough memory to hold references to the data items you're constructing.
If you still run out of memory, then you'll have to look at the structure of your data items. Perhaps you're storing too much information in them, or there are ways to reduce the amount of memory used to store those items. But you'll need to give us more information about your data structure if you need help reducing its footprint.

You can use:
Data Parallelism (Task Parallel Library)
Write a Simple Parallel.ForEach
I think it will make it will reduce memory exception and make files handling faster.

Related

What's the fastest way to copy files responsive to multiple search terms?

Presently working on an application that allows the user to input a list of names/search terms and an folder path. The application then searches for each phrase and copies any responsive documents to an output path. Important to note that the use is often with directories containing from 100GB up to a few TB and can sometimes be required to run thousands of search terms.
Initially I simply used the System.IO.GetFiles() function for this, but I've found I have better results creating a data table of all documents in the input path and running my searches over that data table (see below).
//Constructing a data table of all files in the input path
foreach (var file in fileArray)
{
System.Data.DataRow row = searchTable.NewRow();
row[1] = file;
row[0] = System.IO.Path.GetFileName(file);
searchTable.Rows.Add(row);
}
//For each line inputted by the user, search the data table to find any responsive file names
foreach (var line in searchArray)
{
for (int i = 0; i < searchTable.Rows.Count; i++)
{
if (searchTable.Rows[i][0].ToString().Contains(line))
{
string file = searchTable.Rows[i][1].ToString();
string output = SwiftBank.CalculateOutputFilePath(outputPath,inputPath,file);
System.IO.File.Copy(file, output);
}
}
}
I've found that while this works, it isn't optimised and functions very slowly for large data sets. Obviously doing a lot of repeat work, searching the data table in full every search term. Wondering if someone on here might have a better idea?

In my experience, doing a handful of contains queries for a few thousands fairly short strings should take less than a second. If you are searching inside much larger data sets, like searching thru 100Gb of content, you should look at some more advanced library, like lucene.
There are a few things I would suggest changing
Use a regular list instead of a datatable. Something like List<(string filePath, string fileName)> would be much simpler, and contain the same information
Perform all checks for a specific file at once, i.e. reorder your loops the the file loop is the outer one. This should help cache usage a little bit.
However, the vast majority of the time will likely be spent on copying files. This is many orders of magnitude slower than doing some simple searching in a few kilobytes of memory. You might gain a little bit by doing more than one copy in parallel, since SSDs may be able to improve throughput at higher loads, but that is likely only true if the files are small. You might consider alternatives to the copying, like adding shortcuts, instead.

Fastest way to locate a line which contains a specific word in a large text file

I'm trying to locate a line which contains a specific text inside a large text file (18 MB), currently I'm using StreamReader to open the file and read it line by line checking if it contains the search string
while ((line = reader.ReadLine()) != null)
{
if (line.Contains("search string"))
{
//Do something with line
}
}
But unfortunately, because the file I'm using has more than 1 million records, this method is slow. What is the quickest way to achieve this?

In general, disk IO of this nature is just going to be slow. There is likely little you can do to improve over your current version in terms of performance, at least not without dramatically changing the format in which you store your data, or your hardware.
However, you could shorten the code and simplify it in terms of maintenance and readability:
var lines = File.ReadLines(filename).Where(l => l.Contains("search string"));
foreach(var line in lines)
{
// Do something here with line
}
Reading the entire file into memory causes the application to hang and is very slow, do you think there are any other alternative
If the main goal here is to prevent application hangs, you can do this in the background instead of in a UI thread. If you make your method async, this can become:
while ((line = await reader.ReadLineAsync()) != null)
{
if (line.Contains("search string"))
{
//Do something with line
}
}
This will likely make the total operation take longer, but not block your UI thread while the file access is occurring.

Get a hard drive with a faster read speed (moving to a solid state drive if you aren't already would likely help a lot).
Store the data across several files each on different physical drives. Search through those drives in parallel.
Use a RAID0 hard drive configuration. (This is sort of a special case of the previous approach.)
Create an index of the lines in the file that you can use to search for specific words. (Creating the index will be a lot more expensive than a single search, and will require a lot of disk space, but it will allow subsequent searches at much faster speeds.)

Binary Search - How to load +5M records from a file into Range<int>[] array?

This question is a follow up to my previous question regarding binary search (Fast, in-memory range lookup against +5M record table).
I have sequential text file, with over 5M records/lines, in the format below. I need to load it into Range<int>[] array. How would one do that in a timely fashion?
File format:
start int64,end int64,result int
start int64,end int64,result int
start int64,end int64,result int
start int64,end int64,result int
...

I'm going to assume you have a good disk. Scan through the file once and count the number of entries. If you can guarantee your file has no blank lines, then you can just count the number of newlines in it -- don't actually parse each line.
Now you can allocate your array once with exactly that many entries. This avoids excessive re-allocations of the array:
var numEntries = File.ReadLines(filepath).Count();
var result = new Range<int>[numEntries];
Now read the file again and create your range objects with code something like:
var i = 0;
foreach (var line in File.ReadLines(filepath))
{
var parts = line.Split(',');
result[i++] = new Range<int>(long.Parse(parts[0]), long.Parse(parts[1]), int.Parse(parts[2]);
}
return result;
Sprinkle in some error handling as desired. This code is easy to understand. Try it out in your target environment. If it is too slow, then you can start optimizing it. I wouldn't optimize prematurely though because that will lead to much more complex code that might not be needed.

This is a typical (?) producer-consumer problem which can be solved using multiple threads. In your case the producer is reading data from disk and the consumer is parsing the lines and populating the array. I can see two different cases:
Producer is (much) faster than the consumer: in this case you should try using more consumer threads;
Consumer is (much) faster than the producer: you can't do very much to speed up things other than affecting your hardware configuration such as buying a faster hard disk or using a RAID 0. In this case I wouldn't even use a multithreading solution because it's not worth the added complexity.
This question might help you implementing that in C#.

Removing redundant data from large file

I have a log file which has single strings on each line. I am trying to remove duplicate data from the file and save the file out as a new file. I had first thought of reading data into a HashSet and then saving the contents of the hashset out, however I get an "OutOfMemory" exception when attempting to do this (on the line that adds the string to the hashset).
There are around 32,000,000 lines in the files. It's not practical to re-read the entire file for each comparison.
Any ideas? My other thought was to output the entire contents into a SQLite database and selecting DISTINCT values, but I'm not sure that'd work either with that many values.
Thanks for any input!

First thing you need to think about - is high memory consumption is a problem?
If your application will always run on server with a lot of RAM available, or in any other case you know you'll have enough memory, you can do a lot of things you can't do if your application will run in a low-memory environment, or in an unknown environment. If memory isn't the problem, then make sure your application is running as a 64-bit application (of course, on 64-bit OS), otherwise you'll be limited to 2GB memory (4GB, if you'll use LARGEADDRESSAWARE flag). I guess then in this case this is your problem, and all you've got to do is change it - and it'll work great (assuming you have enough memory).
If memory is a problem, and you need not to use too much memory, you can as you suggested add all the data to database (i'm more familiar with databases like SQL Server, but i guess SQLite will do), make sure you have the right index on the column, and then select distinct value.
Another option, is to read the file as a stream, line by line, for each line calculate hash, and save the line into other file, and keep the hash in the memory. if the hash already exists, then moving to the next line (and, if you wish, adding to a counter of number of lines removed). in that case, you'll save less data in the memory (only hash for not duplicated items).
Best of luck.

Have you tried to use an array to intialize the HashSet. I assume that the doubling algorithm of HashSet is the reason for the OutOfMemoryException.
var uniqueLines = new HashSet<string>(File.ReadAllLines(#"C:\Temp\BigFile.log"));
Edit:
I am testing the result of the .Add() method to see if it
returns false to count the number of items that are redundant. I'd
like to keep this feature if possible.
Then you should try to initilize the HashSet with the correct(maximum) size of the file's lines:
int lineCount = File.ReadLines(path).Count();
List<string> fooList = new List<String>(lineCount);
var uniqueLines = new HashSet<string>(fooList);
fooList.Clear();
foreach (var line in File.ReadLines(path))
uniqueLines.Add(line);

I took a similar approach to Tim using HashSet. I did add manual line counting and comparison.
I read the setup log from my windows 8 install which was 58MB in size at 312248 lines and ran it in LinqPad in .993 seconds.
var temp=new List<string>(10000);
var uniqueHash=new HashSet<int>();
int lineCount=0;
int uniqueLineCount=0;
using(var fs=new FileStream(#"C:\windows\panther\setupact.log",FileMode.Open,FileAccess.Read))
using(var sr=new StreamReader(fs,true)){
while(!sr.EndOfStream){
lineCount++;
var line=sr.ReadLine();
var key=line.GetHashCode();
if(!uniqueHash.Contains(key) ){
uniqueHash.Add(key);
temp.Add(line);
uniqueLineCount++;
if(temp.Count()>10000){
File.AppendAllLines(#"c:\temp\output.txt",temp);
temp.Clear();
}
}
}
}
Console.WriteLine("Total Lines:"+lineCount.ToString());
Console.WriteLine("Lines Removed:"+ (lineCount-uniqueLineCount).ToString());

Where's the memory leak in this function?

Edit2: I just want to make sure my question is clear: Why, on each iteration of AppendToLog(), the application uses 15mb more? (the size of the original log file)
I've got a function called AppendToLog() which receives the file path of an HTML document, does some parsing and appends it to a file. It gets called this way:
this.user_email = uemail;
string wanted_user = wemail;
string[] logPaths;
logPaths = this.getLogPaths(wanted_user);
foreach (string path in logPaths)
{
this.AppendToLog(path);
}
On every iteration, the RAM usage increases by 15mb or so. This is the function: (looks long but it's simple)
public void AppendToLog(string path)
{
Encoding enc = Encoding.GetEncoding("ISO-8859-2");
StringBuilder fb = new StringBuilder();
FileStream sourcef;
string[] messages;
try
{
sourcef = new FileStream(path, FileMode.Open);
}
catch (IOException)
{
throw new IOException("The chat log is in use by another process."); ;
}
using (StreamReader sreader = new StreamReader(sourcef, enc))
{
string file_buffer;
while ((file_buffer = sreader.ReadLine()) != null)
{
fb.Append(file_buffer);
}
}
//Array of each line's content
messages = parseMessages(fb.ToString());
fb = null;
string destFileName = String.Format("{0}_log.txt",System.IO.Path.GetFileNameWithoutExtension(path));
FileStream destf = new FileStream(destFileName, FileMode.Append);
using (StreamWriter swriter = new StreamWriter(destf, enc))
{
foreach (string message in messages)
{
if (message != null)
{
swriter.WriteLine(message);
}
}
}
messages = null;
sourcef.Dispose();
destf.Dispose();
sourcef = null;
destf = null;
}
I've been days with this and I don't know what to do :(
Edit: This is ParseMessages, a function that uses HtmlAgilityPack to strip parts of an HTML log.
public string[] parseMessages(string what)
{
StringBuilder sb = new StringBuilder();
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(what);
HtmlNodeCollection messageGroups = doc.DocumentNode.SelectNodes("//body/div[#class='mplsession']");
int messageCount = doc.DocumentNode.SelectNodes("//tbody/tr").Count;
doc = null;
string[] buffer = new string[messageCount];
int i = 0;
foreach (HtmlNode sessiongroup in messageGroups)
{
HtmlNode tablegroup = sessiongroup.SelectSingleNode("table/tbody");
string sessiontime = sessiongroup.Attributes["id"].Value;
HtmlNodeCollection messages = tablegroup.SelectNodes("tr");
if (messages != null)
{
foreach (HtmlNode htmlNode in messages)
{
sb.Append(
ParseMessageDate(
sessiontime,
htmlNode.ChildNodes[0].ChildNodes[0].InnerText
)
); //Date
sb.Append(" ");
try
{
foreach (HtmlTextNode node in htmlNode.ChildNodes[0].SelectNodes("text()"))
{
sb.Append(node.Text.Trim()); //Name
}
}
catch (NullReferenceException)
{
/*
* We ignore this exception, it just means there's extra text
* and that means that it's not a normal message
* but a system message instead
* (i.e. "John logged off")
* Therefore we add the "::" mark for future organizing
*/
sb.Append("::");
}
sb.Append(" ");
string message = htmlNode.ChildNodes[1].InnerHtml;
message = message.Replace(""", "'");
message = message.Replace(" ", " ");
message = RemoveMedia(message);
sb.Append(message); //Message
buffer[i] = sb.ToString();
sb = new StringBuilder();
i++;
}
}
}
messageGroups = null;
what = null;
return buffer;
}

As many have mentioned, this is probably just an artifact of the GC not cleaning up the memory storage as fast as you are expecting it to. This is normal for managed languages, like C#, Java, etc. You really need to find out if the memory allocated to your program is free or not if you're are interested in that usage. The questions to ask related to this are:
How long is your program running? Is it a service type program that runs continuously?
Over the span of execution does it continue to allocate memory from the OS or does it reach a steady-state? (Have you run it long enough to find out?)
Your code does not look like it will have a "memory-leak". In managed languages you really don't get memory leaks like you would in C/C++ (unless you are using unsafe or external libraries that are C/C++). What happens though is that you do need to watch out for references that stay around or are hidden (like a Collection class that has been told to remove an item but does not set the element of the internal array to null). Generally, objects with references on the stack (locals and parameters) cannot 'leak' unless you store the reference of the object(s) into an object/class variables.
Some comments on your code:
You can reduce the allocation/deallocation of memory by pre-allocating the StringBuilder to at least the proper size. Since you know you will need to hold the entire file in memory, allocate it to the file size (this will actually give you a buffer that is just a little bigger than required since you are not storing new-line character sequences but the file probably has them):
FileInfo fi = new FileInfo(path);
StringBuilder fb = new StringBuilder((int) fi.Length);
You may want to ensure the file exists before getting its length, using fi to check for that. Note that I just down-cast the length to an int without error checking as your files are less than 2GB based on your question text. If that is not the case then you should verify the length before casting it, perhaps throwing an exception if the file is too big.
I would recommend removing all the variable = null statements in your code. These are not necessary since these are stack allocated variables. As well, in this context, it will not help the GC since the method will not live for a long time. So, by having them you create additional clutter in the code and it is more difficult to understand.
In your ParseMessages method, you catch a NullReferenceException and assume that is just a non-text node. This could lead to confusing problems in the future. Since this is something you expect to normally happen as a result of something that may exist in the data you should check for the condition in the code, such as:
if (node.Text != null)
sb.Append(node.Text.Trim()); //Name
Exceptions are for exceptional/unexpected conditions in the code. Assigning significant meaning to NullReferenceException more than that there was a null reference can (likely will) hide errors in other parts of that same try block now or with future changes.

There is no memory leak. If you are using Windows Task Manager to measure the memory used by your .NET application you are not getting a clear picture of what is going on, because the GC manages memory in a complex way that Task Manager doesn't reflect.
A MS engineer wrote a great article about why .NET applications that seem to be leaking memory probably aren't, and it has links to very in depth explanations of how the GC actually works. Every .NET programmer should read them.

I would look carefully at why you need to pass a string to parseMessages, ie fb.ToString().
Your code comment says that this returns an array of each lines content. However you are actually reading all lines from the log file into fb and then converting to a string.
If you are parsing large files in parseMessages() you could do this much more efficiently by passing the StringBuilder itself or the StreamReader into parseMessages(). This would enable only loading a portion of the file into memory at any time, as opposed to using ToString() which currently forces the entire logfile into memory.
You are less likely to have a true memory leak in a .NET application thanks to garbage collection. You do not look to be using any large resources such as files, so it seems even less likely that you have an actual memory leak.
It looks like you have disposed of resources ok, however the GC is probably struggling to allocate and then deallocate the large memory chunks in time before the next iteration starts, and so you see the increasing memory usage.
While GC.Collect() may allow you to force memory deallocation, I would strongly advise looking into the suggestions above before resorting to trying to manually manage memory via GC.
[Update] Seeing your parseMessages() and the use of HtmlAgilityPack (a very useful library, by the way) it looks likely there are some large and possibly numerous allocations of memory being performed for every logile.
HtmlAgility allocates memory for various nodes internally, when combined with your buffer array and the allocations in the main function I'm even more confident that the GC is being put under a lot of pressure to keep up.
To stop guessing and get some real metrics, I would run ProcessExplorer and add the columns to show the GC Gen 0,1,2 collections columns. Then run your application and observe the number of collections. If you're seeing large numbers in these columns then the GC is struggling and you should redesign to use less memory allocations.
Alternatively, the free CLR Profiler 2.0 from Microsoft provides nice visual representation of .NET memory allocations within your application.

One thing you may want to try, is temporarily forcing a GC.Collect after each run. The GC is very intelligent, and will not reclaim memory until is feels the expense of a collection is worth the value of any recovered memory.
Edit: I just wanted to add that its important to understand that calling GC.Collect manually is a bad practice (for any normal use case. Abnormal == perhaps a load function for a game or somesuch). You should let the garbage collector decide whats best, as it will generally have more information than avaliable to you about system resources and the like on which to base its collection behaviour.

The try-catch block could use a finally (cleanup). If you look at what the using statement does, it is equivalent to try catch finally. Yes, running GC is a good idea also. Without compiling this code and giving it a try it is hard to say for sure ...
Also, dispose this guy properly using a using:
FileStream destf = new FileStream(destFileName, FileMode.Append);
Look up Effective C# 2nd edition

I would manually clear the array of message and the stringbuilder before the setting them to null.
edit
looking at what the process seem to do I got a suggestion, if it's not too late instead of parsing an html file.
create a dataset schemas and use that to write and read an xml log file and use a xsl file to convert it into an html file.

I don't see any obvious memory leaks; my first guess would be that it's something in the library.
A good tool to figure this kind of thing out is the .NET Memory Profiler, by SciTech. They have a free two-week trial.
Short of that, you could try commenting out some of the library functions, and see if the problem goes away if you just read the files and do nothing with the data.
Also, where are you looking for memory use stats? Keep in mind that the stats reported by Task Manager aren't always very useful or reflective of actual memory use.

HtmlDocument class (as far as I can determin) has a serious memory leak when used from managed code. I reccomend using the XMLDOM parser instead (though this does require well formed documents, but thats another +).

Develop Reference

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.