Optimal storage of data structure for fast lookup and persistence


Scenario
I have the following methods:
public void AddItemSecurity(int itemId, int[] userIds)
public int[] GetValidItemIds(int userId)
Initially I'm thinking of storage of the form:
itemId -> userId, userId, userId
and
userId -> itemId, itemId, itemId
AddItemSecurity is based on how I get data from a third party API, GetValidItemIds is how I want to use it at runtime.
There are potentially 2000 users and 10 million items.
Item ids are of the form: 2007123456, 2010001234 (10 digits, where the first four represent the year).
AddItemSecurity does not have to perform super fast, but GetValidItemIds needs to be subsecond. Also, if there is an update on an existing itemId, I need to remove that itemId for users no longer in the list.
I'm trying to think about how I should store this in an optimal fashion. Preferably on disk (with caching), but I want the code maintainable and clean.
If the item ids had started at 0, I thought about creating a byte array of length MaxItemId / 8 for each user, and setting a bit to true/false depending on whether the item was present or not. That would limit the array length to a little over 1 MB per user and give fast lookups as well as an easy way to update the list per user. By persisting this as Memory Mapped Files with the .NET 4 framework I think I would get decent caching as well (if the machine has enough RAM) without implementing caching logic myself. Parsing the id, stripping out the year, and storing an array per year could be a solution.
The ItemId -> UserId[] list can be serialized directly to disk and read/write with a normal FileStream in order to persist the list and diff it when there are changes.
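The update rule described above (an item must disappear from users who are no longer in its list) can be sketched as a diff between the old and new user sets. This is a minimal in-memory Python sketch, not the persisted version; the class name ItemSecurityIndex and its method names are hypothetical stand-ins for the two methods in the question.

```python
from collections import defaultdict


class ItemSecurityIndex:
    """Keeps itemId -> userIds and userId -> itemIds in sync (in memory only)."""

    def __init__(self):
        self.users_by_item = {}                # itemId -> set of userIds
        self.items_by_user = defaultdict(set)  # userId -> set of itemIds

    def add_item_security(self, item_id, user_ids):
        new_users = set(user_ids)
        old_users = self.users_by_item.get(item_id, set())
        # Users dropped from the list lose access to the item.
        for user_id in old_users - new_users:
            self.items_by_user[user_id].discard(item_id)
        # Users newly on the list gain access.
        for user_id in new_users - old_users:
            self.items_by_user[user_id].add(item_id)
        self.users_by_item[item_id] = new_users

    def get_valid_item_ids(self, user_id):
        return sorted(self.items_by_user.get(user_id, ()))


index = ItemSecurityIndex()
index.add_item_security(2007123456, [1, 2, 3])
index.add_item_security(2007123456, [2, 3, 4])  # user 1 removed, user 4 added
print(index.get_valid_item_ids(1))  # []
print(index.get_valid_item_ids(4))  # [2007123456]
```

Only the users in the two set differences are touched, so an update costs time proportional to the change rather than to the full user list.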
Each time a new user is added all the lists have to be updated as well, but this can be done nightly.
Question
Should I continue to try out this approach, or are there other paths which should be explored as well? I'm thinking SQL Server will not perform fast enough, and it would add overhead (at least if it's hosted on a different server), but my assumptions might be wrong. Any thoughts or insights on the matter are appreciated. And I want to try to solve it without adding too much hardware :)
[Update 2010-03-31]
I have now tested with SQL server 2008 under the following conditions.
Table with two columns (userid,itemid) both are Int
Clustered index on the two columns
Added ~800.000 items for 180 users - Total of 144 million rows
Allocated 4gb ram for SQL server
Dual Core 2.66ghz laptop
SSD disk
Use a SqlDataReader to read all itemid's into a List
Loop over all users
If I run one thread, a query averages 0.2 seconds. When I add a second thread it goes up to 0.4 seconds, which is still OK. From there on the results get worse. Adding a third thread brings a lot of the queries up to 2 seconds. A fourth thread, up to 4 seconds; a fifth spikes some of the queries up to 50 seconds.
The CPU is maxed out while this is going on, even with one thread. My test app takes some of it due to the tight loop, and SQL Server the rest.
Which leads me to the conclusion that it won't scale very well. At least not on my tested hardware. Are there ways to optimize the database, say by storing an array of ints per user instead of one record per item? That, however, makes it harder to remove items.
[Update 2010-03-31 #2]
I did a quick test with the same data, putting it as bits in memory mapped files. It performs much better. Six threads yield access times between 0.02s and 0.06s. Purely memory bound. The mapped files were mapped by one process, and accessed by six others simultaneously. And whereas the SQL database took 4 GB, the files on disk took 23 MB.

After much testing I ended up using Memory Mapped Files, marking them with the sparse bit (NTFS), using code from NTFS Sparse Files with C#.
Wikipedia has an explanation of what a sparse file is.
The benefit of using a sparse file is that I don't have to care about what range my ids are in. If I only write ids between 2006000000 and 2010999999, the file will only allocate 625,000 bytes, starting at offset 250,750,000 in the file. All space up to that offset is unallocated in the file system. Each id is stored as a set bit in the file, so the file is treated as a bit array. And if the id sequence suddenly changes, then it will allocate in another part of the file.
In order to retrieve which ids are set, I can perform an OS call to get the allocated parts of the sparse file, and then I check each bit in those ranges. Checking whether a particular id is set is also very fast. If it falls outside the allocated blocks, then it's not there; if it falls within, it's merely one byte read and a bit mask check to see if the correct bit is set.
So for the particular scenario where you have many ids which you want to check with as much speed as possible, this is the best approach I've found so far.
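The byte-read-plus-bit-mask check can be illustrated with a small memory-mapped bit file. This is a minimal Python sketch under stated assumptions: it does not mark the file as sparse (that part is NTFS-specific), it uses a small id range so the file stays tiny, and the helper names create_bit_file, set_id and has_id are hypothetical.

```python
import mmap
import os
import tempfile


def create_bit_file(path, max_id):
    """Create a zero-filled file with one bit per possible id (0..max_id)."""
    with open(path, "wb") as f:
        f.truncate(max_id // 8 + 1)


def set_id(mm, item_id):
    # Byte offset is id / 8, bit position within that byte is id % 8.
    mm[item_id // 8] |= 1 << (item_id % 8)


def has_id(mm, item_id):
    # One byte read and one mask check.
    return bool(mm[item_id // 8] & (1 << (item_id % 8)))


path = os.path.join(tempfile.mkdtemp(), "user_42.bits")
create_bit_file(path, max_id=100_000)
with open(path, "r+b") as f, mmap.mmap(f.fileno(), 0) as mm:
    set_id(mm, 71_234)
    print(has_id(mm, 71_234), has_id(mm, 71_235))  # True False
```

With the real ten-digit ids the same arithmetic puts id 2006000000 at byte offset 250,750,000, which is where a sparse file would start allocating.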
And the good part is that the memory mapped files can be shared with Java as well (which turned out to be something needed). Java also has support for memory mapped files on Windows, and implementing the read/write logic is fairly trivial.

I really think you should try a nice database before you make your decision. Something like this will be a challenge to maintain in the long run. Your user-base is actually quite small. SQL Server should be able to handle what you need without any problems.

2000 users isn't too bad but with 10 mil related items you really should consider putting this into a database. DBs do all the storage, persistence, indexing, caching etc. that you need and they perform very well.
They also allow for better scalability into the future. If you suddenly need to deal with two million users and billions of settings having a good db in place will make scaling a non-issue.

Related

Best way to manage a large amount of data in memory?

I'm trying to find the best way to manage a large amount of data in memory using C# without accessing a database. (The db will be used to store just part of this information once it becomes final.)
When I say "large amount of data" I'm talking about hundreds of megabytes, and I would like to manage a complex structure, not just something like a table with millions of records.
I need to search inside the data as fast as possible and I need to be able to remove parts of it when they become obsolete.
Luckily I can split this data into groups that don't need to be related to each other, so I don't need to find or update one row among millions. Instead I need to find one group among, let's say, 50,000 others, then search, add and remove data within this group, and delete the whole group when it becomes obsolete.
I have some projects that already manage data in memory but nothing so huge, so I don't know if these methods are also applicable in this situation:
-I used the .Net cache object but I never worked with more than 10 or 20 megabytes
-private static List data = new List(); in which I stored groups of data in xml format, but I never worked with more than a couple of megabytes
-datatable objects, one per group; in this case too I never worked with more than 10 megabytes, and I had problems managing access because datatables aren't thread safe
What could be the best way to manage this kind of situation? Is there any kind of limit in Windows or in the .Net framework that could cause me problems?

You should be fine using memory to store the data in a data structure and working with it there. The trade-off is that this memory will not be available to other applications running on the same server. Also, committing a large amount of data from memory to disk/DB takes longer.
The structure has to be defined according to your data. If the data is not interrelated, you should create different objects for each entity.
You will also need to devise a strategy to update/refresh your cache. It can be hourly, daily or weekly, depending on your needs.

Speed up UniVerse access times using UniObjects

I am accessing a UniVerse database and reading out all the records in it for the purpose of synchronizing it to a MySQL database which is used for compatibility with some other applications which use the data. Some of the tables are >250,000 records long with >100 columns and the server is rather old and still used by many simultaneous users and so it takes a very ... long ... time to read the records sometimes.
Example: I execute SSELECT <file> TO 0 and begin reading through the select list, parsing each record into our data abstraction type and putting it in a .NET List. Fetching each record can take anywhere from 250ms to 3/4 of a second, depending on database usage. Removing the methods for extraction only speeds it up marginally since I think it still downloads all of the record information anyway when I call UniFile.read even if I don't use it.
Reading 250,000 records at this speed is prohibitively slow, so does anyone know a way I can speed this up? Is there some option I should be setting somewhere?

Do you really need to use SSELECT (sorted select)? The sorting on record key will create an additional performance overhead. If you do not need to synchronise in a sorted manner just use a plain SELECT and this should improve the performance.
If this doesn't help then try to automate the synchronisation to run at a time of low system usage, when either few or no users are logged onto the UniVerse system, if at all possible.
Other than that it could be that some of the tables you are exporting are in need of a resize. If they are not dynamic files (automatic-resizing - type 30), they may have gone into overflow space on disk.
To find out the size of your biggest tables and to see if they have gone into overflow you can use commands such as FILE.STAT and HASH.HELP at the command line to retrieve more information. Use HELP FILE.STAT or HELP HASH.HELP to look at the documentation for these commands, in order to extract the information that you need.
If these commands show that your files are of type 30, then they are automatically resized by the database engine. If however the file types are anything from type 2 to 18, the HASH.HELP command may recommend changes you can make to the table size to increase its performance.
If none of this helps then you could check for useful indexes on the tables using LIST.INDEX TABLENAME ALL, which you could maybe use to speed up the selection.

Ensure your files are sized correctly using ANALYZE-FILE fileName. If not dynamic ensure there is not too much overflow.
Using SELECT instead of SSELECT will mean you are reading data from the database sequentially rather than randomly, and it will be significantly faster.
You should also investigate how you are extracting the data from each record and putting it into a list. Usually the pick data separators chars 254, 253 and 252 will not be compatible with the external database and need to be converted. How this is done can make an enormous difference to the performance.
It is not clear from the initial post, however a WRITESEQ would probably be the most efficient way to output the file data.

Database with a table containing 700 million records [duplicate]

Possible Duplicate:
What are the performance characteristics of sqlite with very large database files?
I want to create a .Net application that uses a database that will contain around 700 million records in one of its tables. I wonder if the performance of SQLite would satisfy this scenario or should I use SQL Server. I like the portability that SQLite gives me.

Go for SQL Server for sure. 700 million records in SQLite is too much.
With SQLite you have the following limitations:
Single process write.
No mirroring
No replication
Check out this thread: What are the performance characteristics of sqlite with very large database files?

700m is a lot.
To give you an idea. Let's say your record size was 4 bytes (essentially storing a single value), then your DB is going to be over 2GB. If your record size is something closer to 100 bytes then it's closer to 65GB... (that's not including space used by indexes, and transaction log files, etc).
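The size estimate above is plain multiplication and can be checked quickly. This is a minimal Python sketch that counts raw record bytes only; the function name raw_size_gib is hypothetical, and sizes are reported in GiB (2^30 bytes).

```python
ROWS = 700_000_000


def raw_size_gib(record_bytes, rows=ROWS):
    """Raw data size in GiB, ignoring indexes, logs and per-row overhead."""
    return rows * record_bytes / 2**30


print(f"{raw_size_gib(4):.1f} GiB")    # 2.6 GiB
print(f"{raw_size_gib(100):.1f} GiB")  # 65.2 GiB
```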
We do a lot of work with large databases and I'd never consider SQLite for anything of that size. Quite frankly, "Portability" is the least of your concerns here. In order to query a DB of that size with any sort of responsiveness you will need an appropriately sized database server. I'd start with 32GB of RAM and fast drives.
If it's write heavy 90%+, you might get away with smaller RAM. If it's read heavy then you will want to try and build it out so that the machine can load as much of the DB (or at least indexes) in RAM as possible. Otherwise you'll be dependent on disk spindle speeds.

SQLite SHOULD be able to handle this much data. However, you may have to configure it to allow it to grow to this size, and you shouldn't have this much data in an "in-memory" instance of SQLite, just on general principles.
For more detail, see this page which explains the practical limits of the SQLite engine. The relevant config settings are the page size (at most 64KB) and page count (up to a 32-bit signed int's max value of approx 2.1 billion). Do the math, and the entire database can take up roughly 140TB. A database consisting of a single table with 700m rows would be on the order of tens of gigs; easily manageable.
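The two settings can be read from a live connection rather than assumed, since the defaults differ between SQLite versions. This is a minimal Python sketch using the standard sqlite3 module and the page_size and max_page_count pragmas; the in-memory connection is used only to query the settings, and no particular output values are claimed here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
page_size = conn.execute("PRAGMA page_size").fetchone()[0]
max_pages = conn.execute("PRAGMA max_page_count").fetchone()[0]
conn.close()

print(f"page size {page_size} bytes, max pages {max_pages}")
print(f"ceiling at this page size: {page_size * max_pages / 2**40:.1f} TiB")
# Theoretical ceiling if the page size were raised to the 64 KiB maximum.
print(f"ceiling at 64 KiB pages: {65536 * max_pages / 2**40:.1f} TiB")
```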
However, just because SQLite CAN store that much data doesn't mean you SHOULD. The biggest drawback of SQLite for large datastores is that the SQLite code runs as part of your process, using the thread on which it's called and taking up memory in your sandbox. You don't get the tools that are available in server-oriented DBMSes to "divide and conquer" large queries or datastores, like replication/clustering. In dealing with a large table like this, insertion/deletion will take a very long time to put it in the right place and update all the indexes. Selection MAY be livable, but only in indexed queries; a page or table scan will absolutely kill you.

I've had tables with similar record counts and no problems retrieval wise.
For starters, the hardware and allocation to the server is where you can start. See this for examples: http://www.sqlservercentral.com/blogs/glennberry/2009/10/29/suggested-max-memory-settings-for-sql-server-2005_2F00_2008/
Regardless of size or number of records as long as you:
create indexes on foreign key(s),
store common queries in Views (http://en.wikipedia.org/wiki/View_%28database%29),
and maintain the database and tables regularly
you should be fine. Also, setting the proper column type/size for each column will help.

Memory size of a list of int

I have a list of int that's stored in the DB in the form of a string with commas in between (4345,324,24,2424,64567,33...). This string could become quite large and contain 2-3 thousand numbers. It's stored in the DB and used quite frequently.
I'm thinking that instead of reading it from the DB every time it's needed, it'd be better to store it in the session after it's loaded the first time.
How much memory would a list of 1,000 int require? Does the memory size also depend on the int itself such that storing a larger int (234,332) takes more space than a smaller int (544)?
Is it going to be better to read once and store in the session at the cost of memory space or better to read often and discard from memory after render.
Thanks for your suggestions.

I think you are heading in the wrong direction. Storing in the DB will likely be a better option, not in comma separated format, but as a table of int values.
Storing data in the session will reduce scalability significantly. You might start getting OutOfMemoryException and wonder why this is happening.
So my suggestion is read from DB when needed, apply appropriate indexes and it will be very fast.
The way you are heading is:
Day #1, 1 user - Hmm, should I store data in Session, why not. Should work fast. No need to query DB. Also easy to do.
Day #10, 5 users - Need to store another data structure, will put this to the session too, why not? Session is very fast.
Day #50, 10 users - There is a control that is heavy to render, I will make it smart, render once and then put it in the Session, will reuse it on every postback.
Day #100, 20 users - Sometimes the web site is slow, don't know why. But it is just sometimes, so not a big deal.
Day #150, 50 users - It's got slow. Need better CPU and memory? We need to buy a better server, the hardware is old.
Day #160, 60 users - Got a new server, works much faster. Problem solved.
Day #200, 100 users - slow again, why? This is the newest, most expensive server!
Day #250, 150 users - application pool is getting recycled all the time. Why? OutOfMemoryException? what is this? I will google.
Day #300, 200 users - Users complain, we lose customers. I read about WinDbg, need to try using it.
Day #350, 200 users - Should we start using network load balancing, we can buy two servers! Bought server, tried to use, didn't work, a lot of dependencies on Session.
Day #400, 200 users - Can't get new customers, old customers go away. Started using WinDbg found out that almost all the memory is used by Session.
Day #450, 200 users - Starting a big project called 'Get rid of Session'.
Day #500, 250 users - The server is so fast now.
I've been there, seen that. Basically my advice: don't go this way.

An int in C# is always 4 bytes (no matter what the value). A list of 1,000 ints is therefore ~4,000 bytes. I say approximately because the list structure will add some overhead. A few thousand ints in a list shouldn't be a problem for a modern computer.

I would not recommend storing it in the session, since that's going to cause memory pressure. If you have a series of integers tied to a single record, it sounds like you have a missing many to one relationship - why not store the ints in a separate table with a foreign key to the original table?
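The separate-table suggestion could look roughly like the following. This is a minimal Python sketch using an in-memory SQLite database purely for illustration; the table names record and record_value are hypothetical, and the composite primary key assumes each value appears at most once per record.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE record (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE record_value (
        record_id INTEGER NOT NULL REFERENCES record(id),
        value     INTEGER NOT NULL,
        PRIMARY KEY (record_id, value)
    );
""")
conn.execute("INSERT INTO record (id, name) VALUES (1, 'example')")

# One-off migration of the comma separated string into rows.
legacy_string = "4345,324,24,2424,64567,33"
conn.executemany(
    "INSERT INTO record_value (record_id, value) VALUES (1, ?)",
    [(int(v),) for v in legacy_string.split(",")],
)

values = [row[0] for row in conn.execute(
    "SELECT value FROM record_value WHERE record_id = ? ORDER BY value", (1,))]
print(values)  # [24, 33, 324, 2424, 4345, 64567]
```

If the original order of the numbers matters, an extra position column would be needed, since rows in a table have no inherent order.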

Integers are of a fixed size in .NET. Assuming you store it in an array instead of a List (since you are probably not adding to or removing from it), it would take up roughly 32 bits * the number of elements. 1000 ints in an array = roughly 32000 bits, or a little under 4 KB.

An int usually takes 32 bits (4 bytes), so 1000 of them would take about 4KB.
It doesn't matter how large the number is. They're always stored in the same space.
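The fixed-width point can be checked with a few lines. This is a minimal Python sketch that uses struct to pack values as 32-bit little-endian integers, mimicking a .NET int; Python's own int type is variable-sized, so it is not measured directly here.

```python
import struct

small = struct.pack("<i", 544)
large = struct.pack("<i", 234_332)
print(len(small), len(large))  # 4 4

# 1,000 packed 32-bit ints, with no container overhead.
packed = struct.pack("<1000i", *range(1000))
print(len(packed))  # 4000
```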

Is this list of int's unique to a session? If not, cache it at the server level and set an expiration on it. 1 copy of the list.
context.Cache.Add(...
I do this and refresh it every 5 minutes with a large amount of data. This way it's pretty "fresh" but only 1 connection takes the hit to populate it.
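The idea of one shared copy with an expiration can be sketched independently of ASP.NET. This is a minimal Python sketch, not the context.Cache API; the class ExpiringValue and the loader load_ids are hypothetical, the clock is injectable so the expiry can be demonstrated without waiting, and the class is not thread-safe as written.

```python
import time


class ExpiringValue:
    """Caches one shared value and reloads it after ttl_seconds."""

    def __init__(self, loader, ttl_seconds, clock=time.monotonic):
        self._loader = loader
        self._ttl = ttl_seconds
        self._clock = clock
        self._value = None
        self._expires_at = None

    def get(self):
        now = self._clock()
        if self._expires_at is None or now >= self._expires_at:
            # Only the caller that finds the value expired pays for the reload.
            self._value = self._loader()
            self._expires_at = now + self._ttl
        return self._value


loads = []


def load_ids():
    loads.append(1)  # stands in for the database read
    return [4345, 324, 24]


fake_now = [0.0]
cache = ExpiringValue(load_ids, ttl_seconds=300, clock=lambda: fake_now[0])
cache.get()
cache.get()
fake_now[0] = 301.0  # five minutes later
cache.get()
print(len(loads))  # 2
```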

Finding Changes between 2 HUGE zone (text) files

I have access to the .com zone files. A zone file is a text file with a list of domain names and their nameservers. It follows a format such as:
mydomain NS ns.mynameserver.com.
mydomain NS ns2.mynameserver.com.
anotherdomain NS nameservers.com.
notinalphadomain NS ns.example.com.
notinalphadomain NS ns1.example.com.
notinalphadomain NS ns2.example.com.
As you can see, there can be multiple lines for each domain (when there are multiple nameservers), and the file is NOT in alpha order.
These files are about 7GB in size.
I'm trying to take the previous file and the new file, and compare them to find:
What domains have been Added
What domains have been Removed
What domains have had nameservers changed
Since 7GB is too much to load into memory at once, I obviously need to read the files as streams. The best method I've come up with so far is to make several passes over both files: one pass for each letter of the alphabet, loading all the domains that start with 'a' in the first pass, for example.
Once I've got all the 'a' domains from the old and new file, I can do a pretty simple comparison in memory to find the changes.
The problem is, even reading char by char, and optimizing as much as I've been able to think of, each pass over the file takes about 200-300 seconds, with collecting all the domains for the current pass's letter. So, I figure in its current state I'm looking at about an hour to process the files, without even storing the changes in the database (which will take some more time). This is on a dual quad core xeon server, so throwing more horsepower at it isn't much of an option for me.
This timing may not be a dealbreaker, but I'm hoping someone has some bright ideas for how to speed things up... Admittedly I have not tried async IO yet, that's my next step.
Thanks in advance for any ideas!

Preparing your data may help, both in terms of the best kind of code: the unwritten kind, and in terms of execution speed.
cat yesterday-com-zone | tr A-Z a-z | sort > prepared-yesterday
cat today-com-zone | tr A-Z a-z | sort > prepared-today
Now, your program does a very simple differences algorithm, and you might even be able to use diff:
diff prepared-today prepared-yesterday
Edit:
And an alternative solution that removes some extra processing, at the possible cost of diff execution time. This also assumes the use of GnuWin32 CoreUtils:
sort -f <today-com-zone >prepared-today
sort -f <yesterday-com-zone >prepared-yesterday
diff -i prepared-today prepared-yesterday
The output from that will be a list of additions, removals, and changes. Not necessarily 1 change record per zone (consider what happens when two domains alphabetically in order are removed). You might need to play with the options to diff to force it to not check for as many lines of context, to avoid great swaths of false-positive changes.
You may need to write your program after all to take the two sorted input files and just run them in lock-step, per-zone. When a new zone is found in TODAY file, that's a new zone. When a "new" zone is found in YESTERDAY file (but missing in today), that's a removal. When the "same" zone is found in both files, then compare the NS records. That's either no-change, or a change in nameservers.
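The lock-step walk described above can be sketched as follows. This is a minimal Python sketch, assuming both inputs are already sorted by domain in plain byte order (for example with LC_ALL=C sort) and that each line has the form "domain NS nameserver"; the function names zones and compare are hypothetical.

```python
from itertools import groupby


def zones(lines):
    """Yield (domain, set of nameservers) from lines sorted by domain."""
    records = (line.split() for line in lines if line.strip())
    for domain, group in groupby(records, key=lambda r: r[0]):
        yield domain, {r[2] for r in group}


def compare(old_lines, new_lines):
    """Walk two domain-sorted inputs in lock-step and yield (change, domain)."""
    old_iter, new_iter = zones(old_lines), zones(new_lines)
    old, new = next(old_iter, None), next(new_iter, None)
    while old is not None or new is not None:
        if new is None or (old is not None and old[0] < new[0]):
            yield "removed", old[0]
            old = next(old_iter, None)
        elif old is None or new[0] < old[0]:
            yield "added", new[0]
            new = next(new_iter, None)
        else:
            # Same domain in both files: compare the nameserver sets.
            if old[1] != new[1]:
                yield "changed", old[0]
            old, new = next(old_iter, None), next(new_iter, None)


old = ["a NS ns1.x.", "b NS ns1.x.", "c NS ns1.x."]
new = ["b NS ns2.x.", "c NS ns1.x.", "d NS ns1.x."]
print(list(compare(old, new)))
# [('removed', 'a'), ('changed', 'b'), ('added', 'd')]
```

Passing open file objects instead of lists keeps memory use to one domain per file at a time, since both inputs are consumed as streams.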

The question has already been answered, but I'll provide a more detailed answer, with facts that are good for everyone to understand. I'll try to cover the existing solutions, and even how to distribute the work, with explanations of why things turned out as they did.
You have a 7 GB text file. Your disk lets us stream data at, let's be pessimistic, 20 MB/second. This can stream the whole thing in 350 seconds. That is under 6 minutes.
If we suppose that an average line is 70 characters, we have 100 million rows. If our disk spins at 6000 rpm, the average rotation takes 0.01 seconds, so grabbing a random piece of data off of disk can take anywhere from 0 to 0.01 seconds, and on average will take 0.005 seconds. This is called our seek time. If you know exactly where every record is, and seek to each line, it will take you 0.005 sec * 100,000,000 = 500,000 sec which is close to 6 days.
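The two estimates above can be reproduced directly from the stated assumptions. This is a minimal Python sketch of the arithmetic only; the constant names are hypothetical and the figures (7 GB, 20 MB/s, 70-byte lines, 6000 rpm) are the ones given in the text.

```python
FILE_BYTES = 7 * 10**9              # 7 GB file
STREAM_BYTES_PER_SEC = 20 * 10**6   # pessimistic 20 MB/s
LINE_BYTES = 70
RPM = 6000

stream_seconds = FILE_BYTES / STREAM_BYTES_PER_SEC
rows = FILE_BYTES // LINE_BYTES
avg_seek_seconds = (60 / RPM) / 2   # half a rotation on average
seek_seconds = rows * avg_seek_seconds

print(f"streaming: {stream_seconds:.0f} s")         # streaming: 350 s
print(f"rows: {rows:,}")                            # rows: 100,000,000
print(f"seeking: {seek_seconds / 86400:.1f} days")  # seeking: 5.8 days
```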
Lessons?
When working with data on disk you really want to avoid seeking. You want to stream data.
When possible, you don't want your data to be on disk.
Now the standard way to address this issue is to sort data. A standard mergesort works by taking a block, sorting it, taking another block, sorting it, and then merging them together to get a larger block. The merge operation streams data in, and writes a stream out, which is exactly the kind of access pattern that disks like. Now in theory with 100 million rows you'll need 27 passes with a mergesort. But in fact most of those passes easily fit in memory. Furthermore a clever implementation - which nsort seems to be - can compress intermediate data files to keep more passes in memory. This dataset should be highly structured and compressible, in which case all of the intermediate data files should be able to fit in RAM. Therefore you entirely avoid disk except for reading and writing data.
This is the solution you wound up with.
OK, so that tells us how to solve this problem. What more can be said?
Quite a bit. Let's analyze what happened with the database suggestions. The standard database has a table and some indexes. An index is just a structured data set that tells you where your data is in your table. So you walk the index (potentially doing multiple seeks, though in practice all but the last tend to be in RAM), which then tells you where your data is in the table, which you then have to seek to again to get the data. So grabbing a piece of data out of a large table potentially means 2 disk seeks. Furthermore writing a piece of data to a table means writing the data to the table, and updating the index. Which means writing in several places. That means more disk seeks.
As I explained at the beginning, disk seeks are bad. You don't want to do this. It is a disaster.
But, you ask, don't database people know this stuff? Well of course they do. They design databases to do what users ask them to do, and they don't control users. But they also design them to do the right thing when they can figure out what that is. If you're working with a decent database (eg Oracle or PostgreSQL, but not MySQL), the database will have a pretty good idea when it is going to be worse to use an index than it is to do a mergesort, and will choose to do the right thing. But it can only do that if it has all of the context, which is why it is so important to push work into the database rather than coding up a simple loop.
Furthermore the database is good about not writing all over the place until it needs to. In particular the database writes to something called a WAL log (write-ahead log - yeah, I know that the second log is redundant) and updates data in memory. When it gets around to it, it writes changes in memory to disk. This batches up writes and causes it to need to seek less. However there is a limit to how much can be batched. Thus maintaining indexes is an inherently expensive operation. That is why standard advice for large data loads in databases is to drop all indexes, load the table, then recreate indexes.
But all this said, databases have limits. If you know the right way to solve a problem inside of a database, then I guarantee that using that solution without the overhead of the database is always going to be faster. The trick is that very few developers have the necessary knowledge to figure out the right solution. And even for those who do, it is much easier to have the database figure out how to do it reasonably well than it is to code up the perfect solution from scratch.
And the final bit. What if we have a cluster of machines available? The standard solution for that case (popularized by Google, which uses this heavily internally) is called MapReduce. What it is based on is the observation that merge sort, which is good for disk, is also really good for distributing work across multiple machines. Thus we really, really want to push work to a sort.
The trick that is used to do this is to do the work in 3 basic stages:
Take large body of data and emit a stream of key/value facts.
Sort facts, partition them into key/values, and send off for further processing.
Have a reducer that takes a key/values set and does something with them.
If need be the reducer can send the data into another MapReduce, and you can string along any set of these operations.
From the point of view of a user, the nice thing about this paradigm is that all you have to do is write a simple mapper (takes a piece of data - eg a line, and emits 0 or more key/value pairs) and a reducer (takes a key/values set, does something with it) and the gory details can be pushed off to your MapReduce framework. You don't have to be aware of the fact that it is using a sort under the hood. And it can even take care of such things as what to do if one of your worker machines dies in the middle of your job. If you're interested in playing with this, http://hadoop.apache.org/mapreduce/ is a widely available framework that will work with many other languages. (Yes, it is written in Java, but it doesn't care what language the mapper and reducer are written in.)
In your case your mapper could start with a piece of data in the form (filename, block_start), open that file, start at that block, and emit for each line a key/value pair of the form domain: (filename, registrar). The reducer would then get for a single domain the 1 or 2 files it came from with full details. It then only emits the facts of interest. Adds are that it is in the new but not the old. Drops are that it is in the old but not the new. Registrar changes are that it is in both but the registrar changed.
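The mapper and reducer described above can be sketched without any framework. This is a minimal single-process Python sketch, not Hadoop code: the run function is a hypothetical stand-in for the framework's sort and grouping step, the file labels "old" and "new" are assumptions, and the nameserver column from the question's file format is used as the value being compared.

```python
from collections import defaultdict


def mapper(filename, line):
    """Emit (domain, (filename, nameserver)) for one zone-file line."""
    domain, _record_type, nameserver = line.split()
    yield domain, (filename, nameserver)


def reducer(domain, values):
    """Classify one domain from all the (filename, nameserver) facts about it."""
    old = {ns for name, ns in values if name == "old"}
    new = {ns for name, ns in values if name == "new"}
    if not old:
        yield "added", domain
    elif not new:
        yield "removed", domain
    elif old != new:
        yield "changed", domain


def run(inputs):
    """Stand-in for the framework: group mapper output by key, then reduce."""
    grouped = defaultdict(list)
    for filename, lines in inputs.items():
        for line in lines:
            for key, value in mapper(filename, line):
                grouped[key].append(value)
    for key in sorted(grouped):
        yield from reducer(key, grouped[key])


result = list(run({
    "old": ["a NS ns1.x.", "b NS ns1.x."],
    "new": ["b NS ns2.x.", "c NS ns1.x."],
}))
print(result)  # [('removed', 'a'), ('changed', 'b'), ('added', 'c')]
```

Because each reducer call sees only one domain, the reduce step can be spread across machines without any coordination between keys.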
Assuming that your file is readily available in compressed form (so it can easily be copied to multiple clients) this can let you process your dataset much more quickly than any single machine could do it.

This is very similar to a Google interview question that goes something like "say you have a list of one million 32-bit integers that you want to print in ascending order, and the machine you are working on only has 2 MB of RAM, how would you approach the problem?".
The answer (or rather, one valid answer) is to break the list up into manageable chunks, sort each chunk, and then apply a merge operation to generate the final sorted list.
So I wonder if a similar approach could work here. As in, starting with the first list, read as much data as you can efficiently work with in memory at once. Sort it, and then write the sorted chunk out to disk. Repeat this until you have processed the entire file, and then merge the chunks to construct a single sorted dataset (this step is optional...you could just do the final comparison using all the sorted chunks from file 1 and all the sorted chunks from file 2).
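The chunk-sort-then-merge approach can be sketched as follows. This is a minimal Python sketch, assuming input lines carry no trailing newline and that a chunk of chunk_size lines fits in memory; the function names sort_in_chunks and merge_runs are hypothetical, and heapq.merge from the standard library does the streaming merge.

```python
import heapq
import os
import tempfile
from itertools import islice


def sort_in_chunks(lines, chunk_size, tmp_dir):
    """Sort chunk_size lines at a time and write each sorted run to its own file."""
    paths = []
    lines = iter(lines)
    while True:
        chunk = sorted(islice(lines, chunk_size))
        if not chunk:
            return paths
        path = os.path.join(tmp_dir, f"run_{len(paths)}.txt")
        with open(path, "w") as f:
            f.writelines(line + "\n" for line in chunk)
        paths.append(path)


def merge_runs(paths):
    """Stream the sorted runs back as one sorted sequence."""
    files = [open(p) for p in paths]
    try:
        # Strip newlines before comparing so the order matches the chunk sort.
        runs = [(line.rstrip("\n") for line in f) for f in files]
        yield from heapq.merge(*runs)
    finally:
        for f in files:
            f.close()


with tempfile.TemporaryDirectory() as tmp:
    run_paths = sort_in_chunks(["d", "b", "e", "a", "c"], chunk_size=2, tmp_dir=tmp)
    print(list(merge_runs(run_paths)))  # ['a', 'b', 'c', 'd', 'e']
```

The merge holds only one line per run in memory at a time, which is what makes the final pass a pure streaming operation.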
Repeat the above steps for the second file, and then open your two sorted datasets and read through them one line at a time. If the lines match then advance both to the next line. Otherwise record the difference in your result-set (or output file) and then advance whichever file has the lexicographically "smaller" value to the next line, and repeat.
Not sure how fast it would be, but it's almost certainly faster than doing 26 passes through each file (you've got 1 pass to build the chunks, 1 pass to merge the chunks, and 1 pass to compare the sorted datasets).
That, or use a database.

You should read each file once and save them into a database. Then you can perform whatever analysis you need using database queries. Databases are designed to quickly handle and process large amounts of data like this.
It will still be fairly slow to read all of the data into the database the first time, but you won't have to read the files more than once.
