My task is to provide random read access to a very large (50GB+) ASCII text file (processing requests for the nth line / the nth word in the nth line) in the form of a C# console app.
After googling and reading for a few days, I've arrived at the following vision of an implementation:
Since StreamReader is good at sequential access, use it to build an index of the lines/words in the file (a List<List<long>> map, where map[i][j] is the position where the jth word of the ith line starts), and then use that index to access the file through a MemoryMappedFile, since it is good at providing random access.
Are there some obvious flaws in the solution? Would it be optimal for a given task?
UPD: It will be executed on a 64-bit system.
It seems fine, but if you're using memory mapping then your program will only work on a 64-bit system, because you would be exceeding the effective 2GB address space available to a 32-bit process.
You'll be fine with just using a FileStream and calling .Seek() to jump to the selected offset as appropriate, so I don't see a need for using memory-mapped files.
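A minimal sketch of that index-then-seek approach, assuming plain ASCII text with '\n' line endings (the class and member names are illustrative): one sequential pass records the byte offset at which each line starts, and a request for the nth word is answered by seeking to the nth line and scanning it.

using System.Collections.Generic;
using System.IO;
using System.Text;

class LineIndexedFile
{
    readonly string _path;
    readonly List<long> _lineStarts = new List<long> { 0 };   // byte offset of each line start

    public LineIndexedFile(string path)
    {
        _path = path;
        using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read))
        {
            var buffer = new byte[1 << 20];
            long position = 0;
            int read;
            while ((read = fs.Read(buffer, 0, buffer.Length)) > 0)
            {
                for (int i = 0; i < read; i++)
                    if (buffer[i] == (byte)'\n')
                        _lineStarts.Add(position + i + 1);    // next line starts after the '\n'
                position += read;
            }
        }
    }

    public string GetLine(int lineNumber)                     // zero-based
    {
        using (var fs = new FileStream(_path, FileMode.Open, FileAccess.Read))
        {
            fs.Seek(_lineStarts[lineNumber], SeekOrigin.Begin);
            using (var reader = new StreamReader(fs, Encoding.ASCII))
                return reader.ReadLine();
        }
    }

    public string GetWord(int lineNumber, int wordNumber)     // scan the line for the word
    {
        return GetLine(lineNumber).Split(' ')[wordNumber];
    }
}

Keeping only the line-start offsets (rather than every word offset) also keeps the index for a 50GB file much smaller, which is essentially the trade-off discussed in the next answer.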
I believe your solution is a good start, even though List is not the best container for the map - Lists can be slow when reading arbitrary elements.
I would test whether a List<List<long>> map really is the best choice in terms of the memory/speed tradeoff - since the OS caches memory maps at page boundaries (4096 bytes on x86/x64), it might actually be faster to only look up the address of the start of each line, and then scan the line looking for words.
Obviously, this approach would only work on a 64-bit OS, but the performance benefit of an mmap is significant - this is one of the few places where going 64-bit matters a lot - database applications :)
I have a Lucene.net index with 10 fields, some stored and some indexed, with 460 million documents. The index is about 250GB. I'm using Lucene.net 3.0.3, and every time I do a search I easily eat up 2GB+ of RAM, which causes my 32-bit application to throw out-of-memory exceptions. Unfortunately, I cannot run the app as a 64-bit process due to other 32-bit dependencies.
As far as I know I'm following Lucene best practices:
One open index writer that writes documents in batches
A shared reader that doesn't close and reopen itself across searches
The index searcher has a termInfosIndexDivisor set to 4, which didn't seem to make a difference. I even tried setting it to something huge like 1000 but didn't notice any memory changes.
Fields that do not need to be subsearched aren't analyzed (i.e. full string searching only) and fields that don't need to be retrieved back from the search aren't stored.
I'm using the default StandardAnalyzer for both indexing and searching.
If I prune the data and make a smaller index, then things do work. When I have an index that is around 50GB in size, I can search it with only about 600MB of RAM.
However, I do have a sort applied on one of the fields, but even without the sort the memory usage is huge for any search. I don't particularly care about document score, more that the document exists in my index, but I'm not sure if somehow ignoring the score calculation will help with the memory usage.
I recently upgraded from Lucene.net 2.9.4 to Lucene.net 3.0.3 thinking that that might help, but the memory usage looks about the same between the two versions.
Frankly I'm not sure if this index is just too large for a single machine to feasibly search or not. Most examples I find talk about indexes 20-30GB in size or less, so maybe this isn't possible, but I wanted to at least ask.
If anyone has any suggestions on what I can do to make this useable that would be great. I am willing to sacrifice search speed for memory usage if possible.
You CAN run the app in 64-bit: make a separate process for the Lucene part and use remoting (or WCF) to communicate with it. Finished. Standard approach.
You are already thinking about splitting it, so heck, isolate it and put it in a 64-bit process.
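A minimal sketch of that split, assuming a named-pipe WCF endpoint; the contract, names and return type here are illustrative, not a prescribed design:

using System.Collections.Generic;
using System.ServiceModel;

[ServiceContract]
public interface ISearchService
{
    [OperationContract]
    IList<string> Search(string queryText, int maxHits);
}

// In the 64-bit host process that owns the Lucene index:
// var host = new ServiceHost(typeof(SearchService), new Uri("net.pipe://localhost/search"));
// host.AddServiceEndpoint(typeof(ISearchService), new NetNamedPipeBinding(), "");
// host.Open();

The 32-bit application then only marshals query strings and small result lists across the process boundary, while all of Lucene's memory lives in the 64-bit host.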
We have an application which logs its processing steps into text files. These files are used during implementation and testing to analyse problems. Each file is up to 10MB in size and contains up to 100,000 text lines.
Currently the analysis of these logs is done by opening a text viewer (Notepad++ etc) and looking for specific strings and data depending on the problem.
I am building an application which will help the analysis. It will enable a user to read files, search, highlight specific strings and other specific operations related to isolating relevant text.
The files will not be edited!
While playing a little with some concepts, I found out immediately that TextBox (or RichTextBox) don't handle the display of large text very well. I managed to implement a viewer using DataGridView with acceptable performance, but that control does not support color highlighting of specific strings.
I am now thinking of holding the entire text file in memory as a string, and only displaying a very limited number of records in the RichTextBox. For scrolling and navigating I thought of adding an independent scrollbar.
One problem I have with this approach is how to get specific lines from the stored string.
If anyone has any ideas or can highlight problems with my approach, thank you.
I would suggest loading the whole thing into memory, but as a collection of strings rather than a single string. It's very easy to do that:
string[] lines = File.ReadAllLines("file.txt");
Then you can search for matching lines with LINQ, display them easily etc.
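For example, a minimal sketch of a case-insensitive search that also keeps track of line numbers (searchTerm is a placeholder for whatever the user typed):

var matches = lines
    .Select((text, index) => new { text, index })
    .Where(l => l.text.IndexOf(searchTerm, StringComparison.OrdinalIgnoreCase) >= 0)
    .ToList();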
Here is an approach that scales well on modern CPUs with multiple cores.
You create an iterator block that yields the lines from the text file (or multiple text files if required):
IEnumerable<String> GetLines(String fileName) {
    using (var streamReader = File.OpenText(fileName)) {
        while (!streamReader.EndOfStream) {
            yield return streamReader.ReadLine();
        }
    }
}
You then use PLINQ to search the lines in parallel. Doing that can speed up the search considerably if you have a modern CPU.
GetLines(fileName)
    .AsParallel()
    .AsOrdered()
    .Where(line => ...)
    .ForAll(line => ...);
You supply a predicate in Where that matches the lines you need to extract. You then supply an action to ForAll that will send the lines to their final destination.
This is a simplified version of what you need to do. Your application is a GUI application, so you cannot perform the search on the main thread; you will have to start a background task for it. If you want this task to be cancellable, you need to check a cancellation token in the while loop in the GetLines method, as sketched below.
ForAll will call the action on threads from the thread pool. If you want to add the matching lines to a user interface control you need to make sure that this control is updated on the user interface thread. Depending on the UI framework you use there are different ways to do that.
This solution assumes that you can extract the lines you need by doing a single forward pass over the file. If you need to do multiple passes, perhaps based on user input, you may need to cache all lines from the file in memory instead. Caching 10 MB is not much, but let's say you decide to search multiple files: caching 1 GB can strain even a powerful computer, while using less memory and more CPU as I suggest will allow you to search very big files within a reasonable time on a modern desktop PC.
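A sketch of the cancellable variant mentioned above, assuming a CancellationToken is passed in from the UI layer:

IEnumerable<string> GetLines(string fileName, CancellationToken cancellation)
{
    using (var streamReader = File.OpenText(fileName))
    {
        while (!streamReader.EndOfStream)
        {
            cancellation.ThrowIfCancellationRequested();   // aborts the enumeration when cancellation is requested
            yield return streamReader.ReadLine();
        }
    }
}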
I suppose that when one has multiple gigabytes of RAM available, one naturally gravitates towards the "load the whole file into memory" path, but is anyone here really satisfied with such a shallow understanding of the problem? What happens when this guy wants to load a 4 gigabyte file? (Yeah, probably not likely, but programming is often about abstractions that scale and the quick fix of loading the whole thing into memory just isn't scalable.)
There are, of course, competing pressures: do you need a solution yesterday, or do you have the luxury of time to dig into the problem and learn something new? The framework also influences your thinking by presenting block-mode files as streams... you have to check the stream's BaseStream.CanSeek value and, if that is true, access the BaseStream.Seek() method to get random access. Don't get me wrong, I absolutely love the .NET framework, but I see a construction site where a bunch of "carpenters" can't put up the frame for a house because the air-compressor is broken and they don't know how to use a hammer. Wax-on, wax-off, teach a man to fish, etc.
So if you have time, look into a sliding window. You can probably do this the easy way by using a memory-mapped file (let the framework/OS manage the sliding window), but the fun solution is to write it yourself. The basic idea is that you only have a small chunk of the file loaded into memory at any one time (the part of the file that is visible in your interface with maybe a small buffer on either side). As you move forward through the file, you can save the offsets of the beginning of each line so that you can easily seek to any earlier section of the file.
Yes, there are performance implications... welcome to the real world where one is faced with various requirements and constraints and must find the acceptable balance between time and memory utilization. This is the fun of programming... figuring out the various ways that a goal can be reached and learning what the tradeoffs are between the various paths. This is how you grow beyond the skill levels of that guy in the office who sees every problem as a nail because he only knows how to use a hammer.
[/rant]
I would suggest using MemoryMappedFile in .NET 4 (or via DllImport in previous versions) to handle just the small portion of the file that is visible on screen, instead of wasting memory and time loading the entire file.
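A minimal sketch of that idea, assuming .NET 4, ASCII text, and that the caller keeps offset + length within the file size:

using System.IO;
using System.IO.MemoryMappedFiles;
using System.Text;

static string ReadWindow(string path, long offset, int length)
{
    // Map only the requested window; the OS pages it in on demand.
    using (var mmf = MemoryMappedFile.CreateFromFile(path, FileMode.Open, null, 0, MemoryMappedFileAccess.Read))
    using (var view = mmf.CreateViewStream(offset, length, MemoryMappedFileAccess.Read))
    using (var reader = new StreamReader(view, Encoding.ASCII))
        return reader.ReadToEnd();
}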
I'm writing a C# application that takes the friendly name of any process (say 'notepad') and reads that process's memory. It is fine for reading bytes, but I have no idea if those are Int32s, chars, bools or other types of data. One of the first steps to solving that is knowing how the data is padded. How can I determine the data alignment of the memory?
I've learned it isn't as simple as knowing the OS or processor. Different packings are supposedly possible even then: http://www.developerfusion.com/article/84519/mastering-structs-in-c/
So, is there some pinvoke I could use on the process handle to read some value or maybe an algorithm that reads some bytes and tests what it finds?
Motivation (in case someone has a better solution for my end goal): I don't want to look for potential Int32 values (or any other type) by looking at relative addresses 0,1,2,3 and then 1,2,3,4 and so on if I can help it. If memory is, say, 4-byte aligned, I'd be wasting a lot of effort for nothing when I could just check 0,1,2,3 and skip to 4,5,6,7.
I'm not quite sure what you're trying to do, but my best bet is that you're hoping to dig around in the process to find a bug or get an idea of what it's up to?
The best way to figure out the memory layout will be from the symbols (.pdb). Is this an app that you've written?
Assuming not, you might consider injecting a thread and then calling MiniDumpWriteDump(). This API can dump the memory to disk where you can browse it with windbg.
The idea here will be to use the Microsoft public symbols (!symfix) and then go rooting around the memory looking for whatever you need. Having the symbols for the Microsoft bits will help you; with those you'll be able to figure out where threads/heaps/handles/etc. are located.
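For reference, a hedged sketch of calling MiniDumpWriteDump from the inspecting process instead of from an injected thread (a simplification, but enough to get a dump you can open in windbg); the wrapper class and the dump type chosen here are illustrative:

using System;
using System.Diagnostics;
using System.IO;
using System.Runtime.InteropServices;

static class MiniDump
{
    const uint MiniDumpWithFullMemory = 0x00000002;   // MINIDUMP_TYPE flag

    [DllImport("dbghelp.dll", SetLastError = true)]
    static extern bool MiniDumpWriteDump(
        IntPtr hProcess, uint processId, SafeHandle hFile, uint dumpType,
        IntPtr exceptionParam, IntPtr userStreamParam, IntPtr callbackParam);

    // Writes a full-memory dump of the target process to dumpPath.
    public static void Write(Process target, string dumpPath)
    {
        using (var fs = new FileStream(dumpPath, FileMode.Create, FileAccess.Write))
        {
            if (!MiniDumpWriteDump(target.Handle, (uint)target.Id, fs.SafeFileHandle,
                                   MiniDumpWithFullMemory, IntPtr.Zero, IntPtr.Zero, IntPtr.Zero))
                throw new System.ComponentModel.Win32Exception(Marshal.GetLastWin32Error());
        }
    }
}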
I have written an application in C# that generates all the words that can exist as combinations of letters, numbers and a few special characters.
The problem is that it isn't memory efficient, as it relies on recursion and also collections like List.
Is there any way I can make it to run in limited memory environment?
Umair
Convert it to an iterative function.
Unfortunately, the C# compiler does not perform tail call optimization, which is something that you would want to happen in this case. The CLR supports it, kinda, but you shouldn't rely on it.
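For example, here is a rough sketch of an iterative, lazy generator that counts through the character set like an odometer; it uses O(maxLength) memory no matter how many strings it yields (the names and parameters are illustrative):

using System.Collections.Generic;

static IEnumerable<string> Combinations(char[] charset, int maxLength)
{
    for (int length = 1; length <= maxLength; length++)
    {
        var indices = new int[length];                 // the "odometer" over the charset
        var buffer = new char[length];
        while (true)
        {
            for (int i = 0; i < length; i++)
                buffer[i] = charset[indices[i]];
            yield return new string(buffer);

            int pos = length - 1;                      // advance the odometer, carrying leftwards
            while (pos >= 0 && ++indices[pos] == charset.Length)
            {
                indices[pos] = 0;
                pos--;
            }
            if (pos < 0)
                break;                                 // every string of this length is done
        }
    }
}

Because it is lazy, the caller can write each string straight out to disk (or to whatever consumes it) without ever holding the full result set in memory.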
Perhaps left of field, but maybe you can write the recursive part of your program in F#? This way you can leverage guaranteed tail call optimization and reuse bits of your C# code. Whilst it has a steep learning curve, F# is a more suitable language for these combinatorial tasks.
Well... I am not sure whose suggestion I ended up going with, but I got a solution. I am using more than one process: one that interacts with the user and another that finds the word combinations. The second process finds 5000 words, saves them and quits. Communication is achieved through WCF. This works out pretty well, since when the process quits its memory is freed.
Well, you obviously cannot store the intermediate results in memory (unless you've got some sort of absurd computer at your disposal); you will have to write the results to disk.
The recursion depth isn't a result of the number of characters considered - it's determined by the maximum string length you're willing to consider.
For instance, my install of Python 2.6.2 has its default recursion limit set to 1000. Arguably, I should be able to generate all possible strings of length 1-1000 over a given character set within this limitation (now, I think the recursion limit applies to total stack depth, so the actual limit may be less than 1000).
Edit (added python sample):
The following python snippet will produce what you're asking for (limiting itself to the given runtime stack limits):
from string import ascii_lowercase

def generate(base="", charset=ascii_lowercase):
    for c in charset:
        next = base + c
        yield next
        try:
            for s in generate(next, charset):
                yield s
        except:
            continue

for s in generate():
    print s
One could produce essentially the same thing in C# by try/catching on StackOverflowException. As I'm typing this update, the script is running, chewing up one of my cores. However, memory usage is constant at less than 7MB. Now, I'm only printing to stdout since I'm not interested in capturing the result, but I think it proves the point above. ;)
Addendum to the example:
Interesting note: looking closer at the running processes, Python is actually I/O bound with the above example. It's only using 7% of my CPU, while the rest of the core is busy rendering the results in my command window. Minimizing the window allows Python to climb to 40% of total CPU usage; this is on a 2-core machine.
One more consideration: When you concatenate or use some other method to generate a string in C#, it occupies its own memory and may stick around for a while. If you are generating millions of strings, you are likely to notice some performance drag.
If you don't need to keep your many strings around, I would see if there's a way to avoid generating the strings at all. For example, maybe you keep a character array that you update as you move through the character combinations, and if you're outputting them to a file, you output them one character at a time so you don't have to build the string.
For purposes of testing compression, I need to be able to create large files, ideally in text, binary, and mixed formats.
The content of the files should be neither completely random nor uniform.
A binary file with all zeros is no good. A binary file with totally random data is also not good. For text, a file with totally random sequences of ASCII is not good - the text files should have patterns and frequencies that simulate natural language, or source code (XML, C#, etc). Pseudo-real text.
The size of each individual file is not critical, but for the set of files, I need the total to be ~8gb.
I'd like to keep the number of files at a manageable level, let's say O(10).
For creating binary files, I can new a large buffer and do System.Random.NextBytes followed by FileStream.Write in a loop, like this:
Int64 bytesRemaining = size;
byte[] buffer = new byte[sz];          // sz = chunk size, e.g. 512k; _rnd is a shared System.Random
using (Stream fileStream = new FileStream(Filename, FileMode.Create, FileAccess.Write))
{
    while (bytesRemaining > 0)
    {
        int sizeOfChunkToWrite = (bytesRemaining > buffer.Length) ? buffer.Length : (int)bytesRemaining;
        if (!zeroes) _rnd.NextBytes(buffer);   // 'zeroes' switches between all-zero and random content
        fileStream.Write(buffer, 0, sizeOfChunkToWrite);
        bytesRemaining -= sizeOfChunkToWrite;
    }
    fileStream.Close();
}
With a large enough buffer, let's say 512k, this is relatively fast, even for files over 2 or 3gb. But the content is totally random, which is not what I want.
For text files, the approach I have taken is to use Lorem Ipsum, and repeatedly emit it via a StreamWriter into a text file. The content is non-random and non-uniform, but it does have many identical repeated blocks, which is unnatural. Also, because the Lorem Ipsum block is so small (<1k), it takes many loops and a very, very long time.
Neither of these is quite satisfactory for me.
I have seen the answers to Quickly create large file on a windows system?. Those approaches are very fast, but I think they just fill the file with zeroes, or random data, neither of which is what I want. I have no problem with running an external process like contig or fsutil, if necessary.
The tests run on Windows.
Rather than create new files, does it make more sense to just use files that already exist in the filesystem? I don't know of any that are sufficiently large.
What about starting with a single existing file (maybe c:\windows\Microsoft.NET\Framework\v2.0.50727\Config\enterprisesec.config.cch for a text file) and replicating its content many times? This would work with either a text or binary file.
Currently I have an approach that sort of works but it takes too long to run.
Has anyone else solved this?
Is there a much faster way to write a text file than via StreamWriter?
Suggestions?
EDIT: I like the idea of a Markov chain to produce a more natural text. Still need to confront the issue of speed, though.
For text, you could use the Stack Overflow community dump; there are 300 megs of data there. It will only take about 6 minutes to load into a db with the app I wrote, and probably about the same time to dump all the posts to text files, which would easily give you anywhere between 200K and 1 million text files, depending on your approach (with the added bonus of having source and XML mixed in).
You could also use something like the wikipedia dump, it seems to ship in MySQL format which would make it super easy to work with.
If you are looking for a big file that you can split up, for binary purposes, you could either use a VM vmdk or a DVD ripped locally.
Edit
Mark mentions the Project Gutenberg download; this is also a really good source for text (and audio) which is available for download via BitTorrent.
You could always code yourself a little web crawler...
UPDATE
Calm down guys, this would be a good answer, if he hadn't said that he already had a solution that "takes too long".
A quick check here would appear to indicate that downloading 8GB of anything would take a relatively long time.
I think you might be looking for something like a Markov chain process to generate this data. It's both stochastic (randomised), but also structured, in that it operates based on a finite state machine.
Indeed, Markov chains have been used for generating semi-realistic looking text in human languages. In general, they are not trivial things to analyse properly, but the fact that they exhibit certain properties should be good enough for you. (Again, see the Properties of Markov chains section of the page.) Hopefully you can see how to design one; to implement, it is actually quite a simple concept. Your best bet will probably be to create a framework for a generic Markov process and then analyse either natural language or source code (whichever you want your random data to emulate) in order to "train" your Markov process. In the end, this should give you very high quality data in terms of your requirements. Well worth going to the effort, if you need these enormous lengths of test data.
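To make the idea concrete, here is a rough order-1, word-level sketch in C# (the class and method names are illustrative, and a real generator would want smarter tokenisation and sentence handling):

using System;
using System.Collections.Generic;

class MarkovTextGenerator
{
    readonly Dictionary<string, List<string>> _transitions = new Dictionary<string, List<string>>();
    readonly Random _rng = new Random();

    // Record, for every word in the corpus, which words have followed it.
    public void Train(string corpus)
    {
        var words = corpus.Split(new[] { ' ', '\r', '\n', '\t' }, StringSplitOptions.RemoveEmptyEntries);
        for (int i = 0; i < words.Length - 1; i++)
        {
            List<string> followers;
            if (!_transitions.TryGetValue(words[i], out followers))
                _transitions[words[i]] = followers = new List<string>();
            followers.Add(words[i + 1]);
        }
    }

    // Walk the chain from a seed word; keeps going as long as the current word
    // has known followers, so the caller decides when to stop taking words.
    public IEnumerable<string> Generate(string seed)
    {
        var current = seed;
        while (_transitions.ContainsKey(current))
        {
            var followers = _transitions[current];
            current = followers[_rng.Next(followers.Count)];
            yield return current;
        }
    }
}

Train it on a natural-language or source-code corpus, then stream Generate(...) into a file (or straight into the compressor) until you reach the size you need.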
I think the Windows directory will probably be a good enough source for your needs. If you're after text, I would recurse through each of the directories looking for .txt files and loop through them, copying them to your output file as many times as needed to reach the right file size.
You could then use a similar approach for binary files by looking for .exes or .dlls.
For text files you might have some success taking an English word list and simply pulling words from it at random. This won't produce real English text, but I would guess it would produce a letter frequency similar to what you might find in English.
For a more structured approach you could use a Markov chain trained on some large, free English text.
Why don't you just take Lorem Ipsum and create a long string in memory before you output it? If you double the amount of text you have every time, the text expands to the target size in O(log n) steps. You can even calculate the total length of the data beforehand, which saves you from having to copy the contents to a new string/array.
Since your buffer is only 512k, or whatever you set it to be, you only need to generate that much data before writing it, since that is the most you can push to the file at one time. You are going to be writing the same text over and over again, so just use the original 512k that you created the first time.
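A small sketch of the doubling idea, using a StringBuilder so each append is a block copy rather than a new string allocation (ExpandToSize is an illustrative name):

using System;
using System.Text;

static string ExpandToSize(string seed, int targetLength)
{
    // Assumes a non-empty seed; each pass doubles the text, or tops it up to the target.
    var sb = new StringBuilder(seed, targetLength);
    while (sb.Length < targetLength)
    {
        int chunk = Math.Min(sb.Length, targetLength - sb.Length);
        sb.Append(sb.ToString(0, chunk));
    }
    return sb.ToString();
}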
Wikipedia is excellent for compression testing for mixed text and binary. If you need benchmark comparisons, the Hutter Prize site can provide a high water mark for the first 100mb of Wikipedia. The current record is a 6.26 ratio, 16 mb.
Thanks for all the quick input.
I decided to consider the problems of speed and "naturalness" separately. For the generation of natural-ish text, I have combined a couple of ideas.
To generate text, I start with a few text files from the Project Gutenberg catalog, as suggested by Mark Rushakoff.
I randomly select and download one document of that subset.
I then apply a Markov Process, as suggested by Noldorin, using that downloaded text as input.
I wrote a new Markov Chain in C# using Pike's economical Perl implementation as an example. It generates a text one word at a time.
For efficiency, rather than use the pure Markov Chain to generate 1gb of text one word at a time, the code generates a random text of ~1mb and then repeatedly takes random segments of that and globs them together.
UPDATE: As for the second problem, speed - I took the approach of eliminating as much IO as possible, since this is being done on my poor laptop with a 5400rpm mini-spindle. That led me to redefine the problem entirely: rather than generating a FILE with random content, what I really want is the random content itself. Using a Stream wrapped around a Markov chain, I can generate text in memory and stream it to the compressor, eliminating 8GB of writes and 8GB of reads. For this particular test I don't need to verify the compression/decompression round trip, so I don't need to retain the original content. The streaming approach worked well to speed things up massively; it cut 80% of the time required.
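For anyone curious, the stream wrapper amounts to something like this rough sketch: a read-only Stream that pulls words from any IEnumerable<string> (such as the Markov generator above), so the compressor can consume gigabytes of text that never exist in full anywhere.

using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

class GeneratedTextStream : Stream
{
    readonly IEnumerator<string> _words;
    byte[] _pending = new byte[0];
    int _pendingOffset;

    public GeneratedTextStream(IEnumerable<string> words) { _words = words.GetEnumerator(); }

    public override int Read(byte[] buffer, int offset, int count)
    {
        int written = 0;
        while (written < count)
        {
            if (_pendingOffset == _pending.Length)
            {
                if (!_words.MoveNext()) break;                    // generator exhausted: end of stream
                _pending = Encoding.ASCII.GetBytes(_words.Current + " ");
                _pendingOffset = 0;
            }
            int n = Math.Min(count - written, _pending.Length - _pendingOffset);
            Array.Copy(_pending, _pendingOffset, buffer, offset + written, n);
            _pendingOffset += n;
            written += n;
        }
        return written;
    }

    public override bool CanRead { get { return true; } }
    public override bool CanSeek { get { return false; } }
    public override bool CanWrite { get { return false; } }
    public override long Length { get { throw new NotSupportedException(); } }
    public override long Position { get { throw new NotSupportedException(); } set { throw new NotSupportedException(); } }
    public override void Flush() { }
    public override long Seek(long offset, SeekOrigin origin) { throw new NotSupportedException(); }
    public override void SetLength(long value) { throw new NotSupportedException(); }
    public override void Write(byte[] buffer, int offset, int count) { throw new NotSupportedException(); }
}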
I haven't yet figured out how to do the binary generation, but it will likely be something analogous.
Thank you all, again, for all the helpful ideas.