Lucene.net runs out of memory for large index - c#

I have a Lucene.net index with 10 fields, some stored and some indexed, with 460 million documents. The index is about 250GB. I'm using Lucene.net 3.0.3 and every time I do a search I easily eat up 2GB+ in RAM, which causes my 32 bit application to get out of memory exceptions. I unfortunately cannot run the app as a 64 bit process due to other 32 bit dependencies.
As far as I know I'm following Lucene best practices:
One open index writer that writes documents in batches
A shared reader that doesn't close and reopen itself across searches
The index searcher has a termInfosIndexDivisor set to 4, which didn't seem to make a difference. I even tried setting it to something huge like 1000 but didn't notice any memory changes.
Fields that do not need to be subsearched aren't analyzed (i.e. full string searching only) and fields that don't need to be retrieved back from the search aren't stored.
I'm using the default StandardAnalyzer for both indexing and searching.
If I prune the data and make a smaller index, then things do work. When I have an index that is around 50GB in size I can search it with only about 600MB of RAM
However, I do have a sort applied on one of the fields, but even without the sort the memory usage is huge for any search. I don't particularly care about document score, more that the document exists in my index, but I'm not sure if somehow ignoring the score calculation will help with the memory usage.
I recently upgraded from Lucene.net 2.9.4 to Lucene.net 3.0.3 thinking that that might help, but the memory usage looks about the same between the two versions.
Frankly I'm not sure if this index is just too large for a single machine to feasbily search or not. Most examples I find talk about indexes 20-30GB in size or less so maybe this isn't possible, but I wanted to at least ask.
If anyone has any suggestions on what I can do to make this useable that would be great. I am willing to sacrifice search speed for memory usage if possible.

You CAN run the app in 64 bit - make a separate process for the lucene part, use remoting to communicate with it (or WCF). Finished. Standard Approach.
You think about Splitting it already, so heck, Isolate it and put it on 64 bit.

Related

providing random read access to a very large (50GB+) ASCII file

My task is to provide random read access to a very large (50GB+) ASCII text file (processing requests for nth line/nth word in nth line) in a form of C# console app.
After googling and reading for a few days, I've come to such vision of implementation:
Since StreamReader is good at sequential access, use it to build an index of lines/words in file (List<List<long>> map, where map[i][j] is position where jth word of ith line starts). And then use the index to access file through MemoryMappedFile, since it good at providing random access.
Are there some obvious flaws in the solution? Would it be optimal for a given task?
UPD: It will be executed at 64bit system.
It seems fine, but if you're using MemoryMapping then your program will only work on a 64-bit system because you're excessing the effective 2GB address space available.
You'll be fine with just using a FileStream and calling .Seek() to jump to the selected offset as appropriate, so I don't see a need for using MemoryMapped files.
I believe your solution is a good start - even thou List container is not the best Map container - Lists are very slow to read arbitrary elements.
I would test whether doing List<List<long>> map is the best in terms of memory/speed tradeoff - since OS caches memory maps at page boundaries (4096 bytes on x86/x64), it might be actually faster to only look up the address of the start of each line, and then scan the line looking for words.
Obviously, this approach would only work on 64bit OS, but the performance benefit of an MMap is significant - this is one of the few places where going 64bit matters a lot - database applications :)

Application so slow to start from DVD

I have an application with a large database (about 3.5 GB) that I need to run it from a read only file system like DVD. So my program works well from a hard disk but it's so slow to start from a DVD.
My question is how do I optimize my program to run fast on a DVD?
You'll have to profile your application, there is no silver bullet that makes your app load three times as fast. Analyze, profile, see what data is causing the latency
This depends entirely on what kind of database you mean. I'll assume it's row based.
If you wish to make a database fast to read from, the first step is probably to sort the database. This is critical because it makes it possible to hunt down specific rows very quickly using a binary search.
Loading 3.5 Gb into ram to search through from a DVD is going to take as nearly as long as ripping a DVD, so that's why your program's going to be slow to start. Consider making an index that points to the locations of certain rows, like page numbers for the start of each letter in a dictionary. Then, you only need to load small portions of your database to find the rows you need. Then, slowly build up the dictionary in RAM by loading portions in the order of requirements (i.e, if you search for something, load that portion first).
Specifically to DVD, there's not much you can do to make it load faster. Consider a streaming compression type (GZip maybe, C# supports this natively) to allow you to pull data faster.
Again, it depends entirely on what you're doing, these are just general suggestions.

Finding Changes between 2 HUGE zone (text) files

I have access to the .com zone files. A zone file is a text file with a list of domain names and their nameservers. It follows a format such as:
mydomain NS ns.mynameserver.com.
mydomain NS ns2.mynameserver.com.
anotherdomain NS nameservers.com.
notinalphadomain NS ns.example.com.
notinalphadomain NS ns1.example.com.
notinalphadomain NS ns2.example.com.
As you can see, there can be multiple lines for each domain (when there are multiple nameservers), and the file is NOT in alpha order.
These files are about 7GB in size.
I'm trying to take the previous file and the new file, and compare them to find:
What domains have been Added
What domains have been Removed
What domains have had nameservers changed
Since 7GB is too much to load the entire file into memory, Obviously I need to read in a stream. The method I've currently thought up as the best way to do it is to make several passes over both files. One pass for each letter of the alphabet, loading all the domains in the first pass that start with 'a' for example.
Once I've got all the 'a' domains from the old and new file, I can do a pretty simple comparison in memory to find the changes.
The problem is, even reading char by char, and optimizing as much as I've been able to think of, each pass over the file takes about 200-300 seconds, with collecting all the domains for the current pass's letter. So, I figure in its current state I'm looking at about an hour to process the files, without even storing the changes in the database (which will take some more time). This is on a dual quad core xeon server, so throwing more horsepower at it isn't much of an option for me.
This timing may not be a dealbreaker, but I'm hoping someone has some bright ideas for how to speed things up... Admittedly I have not tried async IO yet, that's my next step.
Thanks in advance for any ideas!
Preparing your data may help, both in terms of the best kind of code: the unwritten kind, and in terms of execution speed.
cat yesterday-com-zone | tr A-Z a-z | sort > prepared-yesterday
cat today-com-zone | tr A-Z a-z | sort > prepared-today
Now, your program does a very simple differences algorithm, and you might even be able to use diff:
diff prepared-today prepared-yesterday
Edit:
And an alternative solution that removes some extra processing, at the possible cost of diff execution time. This also assumes the use of GnuWin32 CoreUtils:
sort -f <today-com-zone >prepared-today
sort -f <yesterday-com-zone >prepared-yesterday
diff -i prepared-today prepared-yesterday
The output from that will be a list of additions, removals, and changes. Not necessarily 1 change record per zone (consider what happens when two domains alphabetically in order are removed). You might need to play with the options to diff to force it to not check for as many lines of context, to avoid great swaths of false-positive changes.
You may need to write your program after all to take the two sorted input files and just run them in lock-step, per-zone. When a new zone is found in TODAY file, that's a new zone. When a "new" zone is found in YESTERDAY file (but missing in today), that's a removal. When the "same" zone is found in both files, then compare the NS records. That's either no-change, or a change in nameservers.
The question has been already answered, but I'll provide a more detailed answer, with facts that are good for everyone to understand. I'll try to cover the existing solutions, and even how to distribute , with explanations of why things turned out as they did.
You have a 7 GB text file. Your disk lets us stream data at, let's be pessimistic, 20 MB/second. This can stream the whole thing in 350 seconds. That is under 6 minutes.
If we suppose that an average line is 70 characters, we have 100 million rows. If our disk spins at 6000 rpm, the average rotation takes 0.01 seconds, so grabbing a random piece of data off of disk can take anywhere from 0 to 0.01 seconds, and on average will take 0.005 seconds. This is called our seek time. If you know exactly where every record is, and seek to each line, it will take you 0.005 sec * 100,000,000 = 500,000 sec which is close to 6 days.
Lessons?
When working with data on disk you really want to avoid seeking. You want to stream data.
When possible, you don't want your data to be on disk.
Now the standard way to address this issue is to sort data. A standard mergesort works by taking a block, sorting it, taking another block, sorting it, and then merging them together to get a larger block. The merge operation streams data in, and writes a stream out, which is exactly the kind of access pattern that disks like. Now in theory with 100 million rows you'll need 27 passes with a mergesort. But in fact most of those passes easily fit in memory. Furthermore a clever implementation - which nsort seems to be - can compress intermediate data files to keep more passes in memory. This dataset should be highly structured and compressible, in which all of the intermediate data files should be able to fit in RAM. Therefore you entirely avoid disk except for reading and writing data.
This is the solution you wound up with.
OK, so that tells us how to solve this problem. What more can be said?
Quite a bit. Let's analyze what happened with the database suggestions. The standard database has a table and some indexes. An index is just a structured data set that tells you where your data is in your table. So you walk the index (potentially doing multiple seeks, though in practice all but the last tend to be in RAM), which then tells you where your data is in the table, which you then have to seek to again to get the data. So grabbing a piece of data out of a large table potentially means 2 disk seeks. Furthermore writing a piece of data to a table means writing the data to the table, and updating the index. Which means writing in several places. That means more disk seeks.
As I explained at the beginning, disk seeks are bad. You don't want to do this. It is a disaster.
But, you ask, don't database people know this stuff? Well of course they do. They design databases to do what users ask them to do, and they don't control users. But they also design them to do the right thing when they can figure out what that is. If you're working with a decent database (eg Oracle or PostgreSQL, but not MySQL), the database will have a pretty good idea when it is going to be worse to use an index than it is to do a mergesort, and will choose to do the right thing. But it can only do that if it has all of the context, which is why it is so important to push work into the database rather than coding up a simple loop.
Furthermore the database is good about not writing all over the place until it needs to. In particular the database writes to something called a WAL log (write access log - yeah, I know that the second log is redundant) and updates data in memory. When it gets around to it it writes changes in memory to disk. This batches up writes and causes it to need to seek less. However there is a limit to how much can be batched. Thus maintaining indexes is an inherently expensive operation. That is why standard advice for large data loads in databases is to drop all indexes, load the table, then recreate indexes.
But all this said, databases have limits. If you know the right way to solve a problem inside of a database, then I guarantee that using that solution without the overhead of the database is always going to be faster. The trick is that very few developers have the necessary knowledge to figure out the right solution. And even for those who do, it is much easier to have the database figure out how to do it reasonably well than it is to code up the perfect solution from scratch.
And the final bit. What if we have a cluster of machines available? The standard solution for that case (popularized by Google, which uses this heavily internally) is called MapReduce. What it is based on is the observation that merge sort, which is good for disk, is also really good for distributing work across multiple machines. Thus we really, really want to push work to a sort.
The trick that is used to do this is to do the work in 3 basic stages:
Take large body of data and emit a stream of key/value facts.
Sort facts, partition them them into key/values, and send off for further processing.
Have a reducer that takes a key/values set and does something with them.
If need be the reducer can send the data into another MapReduce, and you can string along any set of these operations.
From the point of view of a user, the nice thing about this paradigm is that all you have to do is write a simple mapper (takes a piece of data - eg a line, and emits 0 or more key/value pairs) and a reducer (takes a key/values set, does something with it) and the gory details can be pushed off to your MapReduce framework. You don't have to be aware of the fact that it is using a sort under the hood. And it can even take care of such things as what to do if one of your worker machines dies in the middle of your job. If you're interested in playing with this, http://hadoop.apache.org/mapreduce/ is a widely available framework that will work with many other languages. (Yes, it is written in Java, but it doesn't care what language the mapper and reducer are written in.)
In your case your mapper could start with a piece of data in the form (filename, block_start), open that file, start at that block, and emit for each line a key/value pair of the form domain: (filename, registrar). The reducer would then get for a single domain the 1 or 2 files it came from with full details. It then only emits the facts of interest. Adds are that it is in the new but not the old. Drops are that it is in the old but not the new. Registrar changes are that it is in both but the registrar changed.
Assuming that your file is readily available in compressed form (so it can easily be copied to multiple clients) this can let you process your dataset much more quickly than any single machine could do it.
This is very similar to a Google interview question that goes something like "say you have a list on one-million 32-bit integers that you want to print in ascending order, and the machine you are working on only has 2 MB of RAM, how would you approach the problem?".
The answer (or rather, one valid answer) is to break the list up into manageable chunks, sort each chunk, and then apply a merge operation to generate the final sorted list.
So I wonder if a similar approach could work here. As in, starting with the first list, read as much data as you can efficiently work with in memory at once. Sort it, and then write the sorted chunk out to disk. Repeat this until you have processed the entire file, and then merge the chunks to construct a single sorted dataset (this step is optional...you could just do the final comparison using all the sorted chunks from file 1 and all the sorted chunks from file 2).
Repeat the above steps for the second file, and then open your two sorted datasets and read through them one line at a time. If the lines match then advance both to the next line. Otherwise record the difference in your result-set (or output file) and then advance whichever file has the lexicographically "smaller" value to the next line, and repeat.
Not sure how fast it would be, but it's almost certainly faster than doing 26 passes through each file (you've got 1 pass to build the chunks, 1 pass to merge the chunks, and 1 pass to compare the sorted datasets).
That, or use a database.
You should read each file once and save them into a database. Then you can perform whatever analysis you need using database queries. Databases are designed to quickly handle and process large amounts of data like this.
It will still be fairly slow to read all of the data into the database the first time, but you won't have to read the files more than once.

Reading lines from a .NET text/scintilla box without using too much memory?

I have to create a C# program that deals well with reading in huge files.
For example, I have a 60+ mB file. I read all of it into a scintilla box, let's call it sci_log. The program is using roughly 200mB of memory with this and other features. This is still acceptable (and less than the amount of memory used by Notepad++ to open this file).
I have another scintilla box, sci_splice. The user inputs a search term and the program searches through the file (or sci_log if the file length is small enough--it doesn't matter because it happens both ways) to find a regexp.match. When it finds a match, it concatenates that line with a string that has previous matches and increases a temporary count variable. When count is 100 (or 150, or 200, any number really), then I put the output in sci_splice, call GC.Collect(), and repeat for the next 100 lines (setting count = 0, nulling the string).
I don't have the code on me right now as I'm writing this from my home laptop, but the issue with this is it's using a LOT of memory. The 200mB mem usage jumps up to well over 1gB with no end in sight. This only happens on a search with a lot of regexp matches, so it's something with the string. But the issue is, wouldn't the GC free up that memory? Also, why does it go up so high? It doesn't make sense for why it would more than triple (worst possible case). Even if all of that 200mB was just the log in memory, all it's doing is reading each line and storing it (at worst).
After some more testing, it looks like there's something wrong with Scintilla using a lot of memory when adding lines. The initial read of the lines has a memory spike up to 850mB for a fraction of a second. Guess I need to just page the output.
Don't call GC.Collect. In this case I don't think it matters because I think this memory is going to end up on the Large Object Heap (LOH). But the point is .Net knows a lot more about memory management than you do; leave it alone.
I suspect you are looking at this using Task Manager just by the way you are describing it. You need to instead use at least Perfmon. Anticipating you have not used it before go here and do pretty much what Tess does to where it says Get a Memory Dump. Not sure you are ready for WinDbg but that maybe your next step.
Without seeing code there is almost no way to know what it is going on. The problem could be inside of Scintilla too, but I would check through what you are doing first. By running perfmon you may at least get more information to figure out what to do next.
If you are using System.String to store your matching lines, I suggest you try replacing it with System.Text.StringBuilder and see if this makes any difference.
Try http://msdn.microsoft.com/en-us/library/system.io.memorymappedfiles.memorymappedfile(VS.100).aspx

Optimal storage of data structure for fast lookup and persistence

Scenario
I have the following methods:
public void AddItemSecurity(int itemId, int[] userIds)
public int[] GetValidItemIds(int userId)
Initially I'm thinking storage on the form:
itemId -> userId, userId, userId
and
userId -> itemId, itemId, itemId
AddItemSecurity is based on how I get data from a third party API, GetValidItemIds is how I want to use it at runtime.
There are potentially 2000 users and 10 million items.
Item id's are on the form: 2007123456, 2010001234 (10 digits where first four represent the year).
AddItemSecurity does not have to perform super fast, but GetValidIds needs to be subsecond. Also, if there is an update on an existing itemId I need to remove that itemId for users no longer in the list.
I'm trying to think about how I should store this in an optimal fashion. Preferably on disk (with caching), but I want the code maintainable and clean.
If the item id's had started at 0, I thought about creating a byte array the length of MaxItemId / 8 for each user, and set a true/false bit if the item was present or not. That would limit the array length to little over 1mb per user and give fast lookups as well as an easy way to update the list per user. By persisting this as Memory Mapped Files with the .Net 4 framework I think I would get decent caching as well (if the machine has enough RAM) without implementing caching logic myself. Parsing the id, stripping out the year, and store an array per year could be a solution.
The ItemId -> UserId[] list can be serialized directly to disk and read/write with a normal FileStream in order to persist the list and diff it when there are changes.
Each time a new user is added all the lists have to updated as well, but this can be done nightly.
Question
Should I continue to try out this approach, or are there other paths which should be explored as well? I'm thinking SQL server will not perform fast enough, and it would give an overhead (at least if it's hosted on a different server), but my assumptions might be wrong. Any thought or insights on the matter is appreciated. And I want to try to solve it without adding too much hardware :)
[Update 2010-03-31]
I have now tested with SQL server 2008 under the following conditions.
Table with two columns (userid,itemid) both are Int
Clustered index on the two columns
Added ~800.000 items for 180 users - Total of 144 million rows
Allocated 4gb ram for SQL server
Dual Core 2.66ghz laptop
SSD disk
Use a SqlDataReader to read all itemid's into a List
Loop over all users
If I run one thread it averages on 0.2 seconds. When I add a second thread it goes up to 0.4 seconds, which is still ok. From there on the results are decreasing. Adding a third thread brings alot of the queries up to 2 seonds. A forth thread, up to 4 seconds, a fifth spikes some of the queries up to 50 seconds.
The CPU is roofing while this is going on, even on one thread. My test app takes some due to the speedy loop, and sql the rest.
Which leads me to the conclusion that it won't scale very well. At least not on my tested hardware. Are there ways to optimize the database, say storing an array of int's per user instead of one record per item. But this makes it harder to remove items.
[Update 2010-03-31 #2]
I did a quick test with the same data putting it as bits in memory mapped files. It performs much better. Six threads yields access times between 0.02s and 0.06s. Purely memory bound. The mapped files were mapped by one process, and accessed by six others simultaneously. And as the sql base took 4gb, the files on disk took 23mb.
After much testing I ended up using Memory Mapped Files, marking them with the sparse bit (NTFS), using code from NTFS Sparse Files with C#.
Wikipedia has an explanation of what a sparse file is.
The benefits of using a sparse file is that I don't have to care about what range my id's are in. If I only write id's between 2006000000 and 2010999999, the file will only allocate 625,000 bytes from offset 250,750,000 in the file. All space up to that offset is unallocated in the file system. Each id is stored as a set bit in the file. Sort of treated as an bit array. And if the id sequence suddenly changes, then it will allocate in another part of the file.
In order to retrieve which id's are set, I can perform a OS call to get the allocated parts of the sparse file, and then I check each bit in those sequences. Also checking if a particular id is set is very fast. If it falls outside the allocated blocks, then it's not there, if it falls within, it's merely one byte read and a bit mask check to see if the correct bit is set.
So for the particular scenario where you have many id's which you want to check on with as much speed as possible, this is the most optimal way I've found so far.
And the good part is that the memory mapped files can be shared with Java as well (which turned out to be something needed). Java also has support for memory mapped files on Windows, and implementing the read/write logic is fairly trivial.
I really think you should try a nice database before you make your decision. Something like this will be a challenge to maintain in the long run. Your user-base is actually quite small. SQL Server should be able to handle what you need without any problems.
2000 users isn't too bad but with 10 mil related items you really should consider putting this into a database. DBs do all the storage, persistence, indexing, caching etc. that you need and they perform very well.
They also allow for better scalability into the future. If you suddenly need to deal with two million users and billions of settings having a good db in place will make scaling a non-issue.

Categories

Resources