How do I get the difference between 2 large sets in .net - c#

I need to get the set of GUIDs in a remote database which do not exist in an IEnumerable (for context, this is coming from a Lucene index). There are potentially many millions of these Guids.
I currently think that inserting the contents of the IEnumerable into the database and doing the difference there will be too expensive (the inserts will hammer the database), but I am prepared to be proven wrong!
Reading both sets into memory is also infeasible due to the amount of data - our existing solution does this and fails with very large sets.
I would like a solution which can operate on a small subset of the data at a time so that we have a constant memory footprint. We have an idea of how to roll our own implementation of this, but it is non-trivial, so we would obviously rather use an existing one if it exists.
If anybody has any recommendations for an existing solution, I'd be grateful to hear them!

You could use SqlBulkCopy to load the GUIDs into the database very quickly (if it is SQL Server).
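For illustration, a minimal sketch of that approach, assuming SQL Server; RemoteTable and its Id column are placeholder names for your own schema. The GUIDs are bulk-copied into a temp table in fixed-size batches so the client's memory footprint stays bounded, and the server computes the difference with EXCEPT:

    using System;
    using System.Collections.Generic;
    using System.Data;
    using System.Data.SqlClient; // Microsoft.Data.SqlClient on newer stacks

    static class GuidDiff
    {
        // Returns the GUIDs present in the remote table but absent from the index.
        // "RemoteTable" and its "Id" column are placeholders for your schema.
        public static IEnumerable<Guid> MissingFromIndex(
            string connectionString, IEnumerable<Guid> indexGuids)
        {
            using (var conn = new SqlConnection(connectionString))
            {
                conn.Open();

                using (var create = new SqlCommand(
                    "CREATE TABLE #IndexGuids (Id UNIQUEIDENTIFIER PRIMARY KEY);", conn))
                    create.ExecuteNonQuery();

                // Stream the GUIDs up in fixed-size batches: bounded client memory.
                var batch = new DataTable();
                batch.Columns.Add("Id", typeof(Guid));
                using (var bulk = new SqlBulkCopy(conn) { DestinationTableName = "#IndexGuids" })
                {
                    foreach (var g in indexGuids)
                    {
                        batch.Rows.Add(g);
                        if (batch.Rows.Count == 10000)
                        {
                            bulk.WriteToServer(batch);
                            batch.Clear();
                        }
                    }
                    if (batch.Rows.Count > 0)
                        bulk.WriteToServer(batch);
                }

                // EXCEPT computes the set difference on the server.
                using (var diff = new SqlCommand(
                    "SELECT Id FROM RemoteTable EXCEPT SELECT Id FROM #IndexGuids;", conn))
                using (var reader = diff.ExecuteReader())
                    while (reader.Read())
                        yield return reader.GetGuid(0);
            }
        }
    }

The temp table is session-scoped, so nothing is written to your real tables, and SqlBulkCopy avoids the per-row INSERT overhead that would otherwise hammer the database.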

Best way to manage a large amount of data in memory?

I'm trying to find the best way to manage a large amount of data in memory using C# without accessing a database (the DB will be used to store just part of this information once it becomes final).
When I say "large amount of data" I'm talking about hundreds of megabytes, and I would like to manage a complex structure, not just something like a table with millions of records.
I need to search inside it as fast as possible, and I need to be able to remove parts of it when they become obsolete.
Luckily I can split this data into groups that don't need to be related to each other, so I don't need to find or update a row among millions; instead I need to find one group among, say, 50,000 others, then search, add and remove data within that group, and delete the whole group when it becomes obsolete.
I have some projects that already manage data in memory, but nothing this large, so I don't know whether these methods also apply in this situation:
- I used the .NET cache object, but I never worked with more than 10 or 20 megabytes
- a private static List<string> data = new List<string>(); in which I stored groups of data in XML format, but I never worked with more than a couple of megabytes
- DataTable objects, one per group; in this last case too I never worked with more than 10 megabytes, and I had problems managing access because DataTables aren't thread-safe
What would be the best way to manage this kind of situation? Is there any limit in Windows or the .NET Framework that could cause me problems?
You should be fine using memory to store the data in a data structure and working with it there. The trade-off is that the memory will not be available to other applications running on the same server. Also, when you commit the data from memory to disk/DB, the time taken for large data sets is longer.
Depending on your data, the structure has to be defined. If the data is not interrelated, you should create different objects for each entity.
You will also need to devise a strategy to update/refresh your cache; it can be hourly, daily or weekly, depending on your needs.
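As a rough sketch of the one-object-per-group idea (the class and method names here are made up, not an existing API), a ConcurrentDictionary keyed by group gives the thread-safe access that plain DataTables lack, and makes deleting an obsolete group a single operation:

    using System;
    using System.Collections.Concurrent;

    // One entry per group of data; deleting an obsolete group is a single
    // removal, leaving the other ~50,000 groups untouched.
    class GroupCache<TKey, TGroup>
    {
        private readonly ConcurrentDictionary<TKey, TGroup> _groups =
            new ConcurrentDictionary<TKey, TGroup>();

        public TGroup GetOrAdd(TKey key, Func<TKey, TGroup> factory)
            => _groups.GetOrAdd(key, factory);

        public bool TryGet(TKey key, out TGroup group)
            => _groups.TryGetValue(key, out group);

        // Delete the whole group when it becomes obsolete.
        public bool Expire(TKey key) => _groups.TryRemove(key, out _);
    }

Searches within a group then operate on whatever structure TGroup is, e.g. a Dictionary keyed on whatever you search by inside the group.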

In-memory Cache - SQLite vs System.Data.DataTable

I am at the very beginning stage of designing an in-memory cache in C# (which will run as a Windows service). Once in production, this is expected to hold close to a million objects (of various types) on average. Some cache items can be up to 10MB (or more) in size.
I considered a variety of data storage solutions and I have now decided to go with either a DataTable or an in-memory SQLite instance as the cache store. At this point my questions are:
How do you think a DataTable will perform with this many records?
Do you think going with an SQLite solution is overkill? (Since SQLite is designed as a 'database', I may not really want all that database-related plumbing.)
Performance is the highest priority for me.
EDIT
Adding some more specifics.
- These cache items are not just key-value pairs; they have two more (as of now) properties (pinned and locked items), which can affect their availability. Every lookup is going to involve all three properties.
- Memcached has been considered, but at this point it is not an option, mainly due to our SLA constraints (that's all I can say about it).
- Not all items are 10MB in size.
- I am pretty sure that many of these items are going to be mere numerical and small string values.
- I believe availability of RAM is not an issue.
Thanks in advance,
James
1: TERRIBLE. DataTables are slow and memory hogs; that won't magically change for large items.
2: You tell us.
Have you considered using a simple dictionary? Key/Value pairs, you know.
The answers really depend on what you plan to do with the cache.
If every item is 1 MB, that is 1 TB of memory for a million items.
Do you have 1 TB of memory to dedicate to this?
A database on a solid-state disk may be a better design.
DataTable is large and slow.
How are you going to look the items up?
Are you going to have a complete key?
Are you going to have enough memory?
If so, use a dictionary.
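Building on the dictionary suggestion, a minimal sketch of a lookup that involves all three properties from the question; CacheItem, Pinned and Locked are illustrative names, not an existing API:

    using System.Collections.Concurrent;

    // Value plus the two extra flags mentioned in the question.
    sealed class CacheItem
    {
        public object Value;
        public bool Pinned;  // e.g. exempt from eviction
        public bool Locked;  // temporarily unavailable
    }

    class FlagAwareCache
    {
        private readonly ConcurrentDictionary<string, CacheItem> _items =
            new ConcurrentDictionary<string, CacheItem>();

        public void Put(string key, object value, bool pinned, bool locked)
            => _items[key] = new CacheItem { Value = value, Pinned = pinned, Locked = locked };

        // A lookup succeeds only if the item exists and is not locked.
        public bool TryGet(string key, out object value)
        {
            if (_items.TryGetValue(key, out var item) && !item.Locked)
            {
                value = item.Value;
                return true;
            }
            value = null;
            return false;
        }
    }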

Speed up UniVerse access times using UniObjects

I am accessing a UniVerse database and reading out all the records in it for the purpose of synchronizing it to a MySQL database, which is used for compatibility with some other applications that use the data. Some of the tables are >250,000 records long with >100 columns, and the server is rather old and still used by many simultaneous users, so it sometimes takes a very ... long ... time to read the records.
Example: I execute SSELECT <file> TO 0 and begin reading through the select list, parsing each record into our data abstraction type and putting it in a .NET List. Depending on the moment, fetching each record can take between 250ms and 3/4 of a second, depending on database usage. Removing the extraction methods speeds it up only marginally, since I think it still downloads all of the record information anyway when I call UniFile.read, even if I don't use it.
Reading 250,000 records at this speed is prohibitively slow, so does anyone know a way I can speed this up? Is there some option I should be setting somewhere?
Do you really need to use SSELECT (sorted select)? Sorting on the record key creates additional performance overhead. If you do not need to synchronise in sorted order, just use a plain SELECT and this should improve performance.
If this doesn't help then try to automate the synchronisation to run at a time of low system usage, when either few or no users are logged onto the UniVerse system, if at all possible.
Other than that it could be that some of the tables you are exporting are in need of a resize. If they are not dynamic files (automatic-resizing - type 30), they may have gone into overflow space on disk.
To find out the size of your biggest tables and to see if they have gone into overflow you can use commands such as FILE.STAT and HASH.HELP at the command line to retrieve more information. Use HELP FILE.STAT or HELP HASH.HELP to look at the documentation for these commands, in order to extract the information that you need.
If these commands show that your files are of type 30, then they are automatically resized by the database engine. If, however, the file types are anything from type 2 to 18, the HASH.HELP command may recommend changes you can make to the table size to increase its performance.
If none of this helps then you could check for useful indexes on the tables using LIST.INDEX TABLENAME ALL, which you could maybe use to speed up the selection.
Ensure your files are sized correctly using ANALYZE-FILE fileName. If they are not dynamic, ensure there is not too much overflow.
Using SELECT instead of SSELECT will mean you are reading data from the database sequentially rather than randomly, and will be significantly faster.
You should also investigate how you are extracting the data from each record and putting it into a list. Usually the Pick data separators, chars 254, 253 and 252, will not be compatible with the external database and need to be converted. How this is done can make an enormous difference to performance (see the sketch below).
It is not clear from the initial post, but a WRITESEQ would probably be the most efficient way to output the file data.
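On the separator conversion mentioned above: UniVerse dynamic arrays delimit fields, values and sub-values with chars 254, 253 and 252. A minimal sketch of splitting a record into nested arrays with one Split per level (rather than scanning character by character) before mapping it to MySQL:

    const char FieldMark    = '\u00FE'; // char 254
    const char ValueMark    = '\u00FD'; // char 253
    const char SubValueMark = '\u00FC'; // char 252

    // result[field][value][subvalue]
    string[][][] ParseRecord(string record)
    {
        string[] fields = record.Split(FieldMark);
        var result = new string[fields.Length][][];
        for (int f = 0; f < fields.Length; f++)
        {
            string[] values = fields[f].Split(ValueMark);
            result[f] = new string[values.Length][];
            for (int v = 0; v < values.Length; v++)
                result[f][v] = values[v].Split(SubValueMark);
        }
        return result;
    }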

Database with a table containing 700 million records [duplicate]

Possible Duplicate:
What are the performance characteristics of sqlite with very large database files?
I want to create a .NET application that uses a database that will contain around 700 million records in one of its tables. I wonder whether the performance of SQLite would satisfy this scenario or whether I should use SQL Server. I like the portability that SQLite gives me.
Go for SQL Server for sure. 700 million records in SQLite is too much.
With SQLite you have the following limitations:
Single-process writes.
No mirroring.
No replication.
Check out this thread: What are the performance characteristics of sqlite with very large database files?
700m is a lot.
To give you an idea: let's say your record size is 4 bytes (essentially storing a single value); then your DB is going to be over 2 GB. If your record size is closer to 100 bytes, then it's closer to 65 GB (and that's not including space used by indexes, transaction log files, etc.).
We do a lot of work with large databases and I'd never consider SQLite for anything of that size. Quite frankly, "portability" is the least of your concerns here. In order to query a DB of that size with any sort of responsiveness you will need an appropriately sized database server. I'd start with 32 GB of RAM and fast drives.
If it's 90%+ write-heavy, you might get away with less RAM. If it's read-heavy, then you will want to build it out so that the machine can load as much of the DB (or at least the indexes) into RAM as possible. Otherwise you'll be dependent on disk spindle speeds.
SQLite SHOULD be able to handle this much data. However, you may have to configure it to allow it to grow to this size, and you shouldn't have this much data in an "in-memory" instance of SQLite, just on general principles.
For more detail, see this page, which explains the practical limits of the SQLite engine. The relevant config settings are the page size (up to 64KB) and page count (up to a 32-bit signed int's max value of approx 2.1 billion). Do the math, and the entire database can take up more than 140TB. A database consisting of a single table with 700m rows would be on the order of tens of gigs; easily manageable.
However, just because SQLite CAN store that much data doesn't mean you SHOULD. The biggest drawback of SQLite for large datastores is that the SQLite code runs as part of your process, using the thread on which it's called and taking up memory in your sandbox. You don't get the tools that are available in server-oriented DBMSes to "divide and conquer" large queries or datastores, like replication/clustering. In dealing with a large table like this, insertion/deletion will take a very long time to put it in the right place and update all the indexes. Selection MAY be livable, but only in indexed queries; a page or table scan will absolutely kill you.
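As a sketch of the configuration step mentioned above, assuming the Microsoft.Data.Sqlite provider (System.Data.SQLite is similar) and a placeholder file name; the limits are set via PRAGMAs, and page_size only takes effect on a fresh database before anything is written:

    using Microsoft.Data.Sqlite;

    using var conn = new SqliteConnection("Data Source=big.db");
    conn.Open();

    using var cmd = conn.CreateCommand();

    // Must run before any data is written; 64 KB is SQLite's maximum page size.
    cmd.CommandText = "PRAGMA page_size = 65536;";
    cmd.ExecuteNonQuery();

    // Raise the page-count ceiling toward its ~2^31 maximum.
    cmd.CommandText = "PRAGMA max_page_count = 2147483646;";
    cmd.ExecuteNonQuery();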
I've had tables with similar record counts and no problems retrieval-wise.
For starters, look at the hardware and the memory allocated to the server. See this for examples: http://www.sqlservercentral.com/blogs/glennberry/2009/10/29/suggested-max-memory-settings-for-sql-server-2005_2F00_2008/
Regardless of size or number of records as long as you:
create indexes on foreign key(s),
store common queries in Views (http://en.wikipedia.org/wiki/View_%28database%29),
and maintain the database and tables regularly
you should be fine. Also, setting the proper column type/size for each column will help.
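For example, a minimal illustration of the first two points; every table, column and view name here is made up:

    using System.Data.SqlClient;

    string connectionString = "..."; // your own connection string

    using (var conn = new SqlConnection(connectionString))
    {
        conn.Open();

        using (var cmd = conn.CreateCommand())
        {
            // Index the foreign key used by your most common joins.
            cmd.CommandText =
                "CREATE INDEX IX_Orders_CustomerId ON dbo.Orders (CustomerId);";
            cmd.ExecuteNonQuery();

            // Capture a common query as a view (CREATE VIEW needs its own batch).
            cmd.CommandText = @"
                CREATE VIEW dbo.RecentOrders AS
                SELECT OrderId, CustomerId, PlacedOn
                FROM dbo.Orders
                WHERE PlacedOn >= DATEADD(day, -30, GETDATE());";
            cmd.ExecuteNonQuery();
        }
    }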

Optimal storage of data structure for fast lookup and persistence

Scenario
I have the following methods:
public void AddItemSecurity(int itemId, int[] userIds)
public int[] GetValidItemIds(int userId)
Initially I'm thinking of storage of the form:
itemId -> userId, userId, userId
and
userId -> itemId, itemId, itemId
AddItemSecurity is based on how I get data from a third-party API; GetValidItemIds is how I want to use it at runtime.
There are potentially 2000 users and 10 million items.
Item ids are of the form 2007123456, 2010001234 (10 digits, where the first four represent the year).
AddItemSecurity does not have to perform super fast, but GetValidItemIds needs to be sub-second. Also, if there is an update on an existing itemId, I need to remove that itemId for users no longer in the list.
I'm trying to think about how I should store this in an optimal fashion. Preferably on disk (with caching), but I want the code maintainable and clean.
If the item ids had started at 0, I thought about creating a byte array of length MaxItemId / 8 for each user, and setting a bit to indicate whether the item is present or not. That would limit the array length to a little over 1 MB per user and give fast lookups as well as an easy way to update the list per user. By persisting this as memory mapped files with the .NET 4 framework, I think I would get decent caching as well (if the machine has enough RAM) without implementing the caching logic myself. Parsing the id, stripping out the year, and storing an array per year could be a solution.
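A minimal sketch of that bit-array idea (the year-stripping is omitted; maxItemId stands for whatever per-year ceiling you choose):

    // One bit per item id: maxItemId / 8 bytes per user
    // (~1.25 MB for 10 million ids).
    sealed class UserItemBits
    {
        private readonly byte[] _bits;

        public UserItemBits(int maxItemId) => _bits = new byte[(maxItemId + 7) / 8];

        public void Set(int itemId)   => _bits[itemId >> 3] |= (byte)(1 << (itemId & 7));
        public void Clear(int itemId) => _bits[itemId >> 3] &= (byte)~(1 << (itemId & 7));
        public bool IsSet(int itemId) => (_bits[itemId >> 3] & (1 << (itemId & 7))) != 0;
    }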
The ItemId -> UserId[] list can be serialized directly to disk and read/written with a normal FileStream in order to persist the list and diff it when there are changes.
Each time a new user is added, all the lists have to be updated as well, but this can be done nightly.
Question
Should I continue to try out this approach, or are there other paths which should be explored as well? I'm thinking SQL Server will not perform fast enough, and it would add overhead (at least if it's hosted on a different server), but my assumptions might be wrong. Any thoughts or insights on the matter are appreciated. And I want to try to solve it without adding too much hardware :)
[Update 2010-03-31]
I have now tested with SQL Server 2008 under the following conditions.
Table with two columns (userid, itemid), both Int
Clustered index on the two columns
Added ~800,000 items for 180 users - a total of 144 million rows
Allocated 4 GB RAM for SQL Server
Dual-core 2.66 GHz laptop
SSD disk
Use a SqlDataReader to read all itemids into a List
Loop over all users
If I run one thread it averages 0.2 seconds. When I add a second thread it goes up to 0.4 seconds, which is still OK. From there on the results degrade. Adding a third thread brings a lot of the queries up to 2 seconds. A fourth thread, up to 4 seconds; a fifth spikes some of the queries up to 50 seconds.
The CPU is maxed out while this is going on, even on one thread. My test app takes some of it due to the tight loop, and SQL the rest.
Which leads me to the conclusion that it won't scale very well. At least not on my tested hardware. Are there ways to optimize the database, say by storing an array of ints per user instead of one record per item? But that makes it harder to remove items.
[Update 2010-03-31 #2]
I did a quick test with the same data, putting it as bits in memory mapped files. It performs much better. Six threads yield access times between 0.02s and 0.06s. Purely memory bound. The mapped files were mapped by one process and accessed by six others simultaneously. And where the SQL-based solution took 4 GB, the files on disk took 23 MB.
After much testing I ended up using Memory Mapped Files, marking them with the sparse bit (NTFS), using code from NTFS Sparse Files with C#.
Wikipedia has an explanation of what a sparse file is.
The benefit of using a sparse file is that I don't have to care about what range my ids are in. If I only write ids between 2006000000 and 2010999999, the file will only allocate 625,000 bytes from offset 250,750,000 in the file. All space up to that offset is unallocated in the file system. Each id is stored as a set bit in the file; it is effectively treated as a bit array. And if the id sequence suddenly changes, the file will allocate space in another part of the file.
In order to retrieve which ids are set, I can perform an OS call to get the allocated parts of the sparse file, and then check each bit in those sequences. Also, checking whether a particular id is set is very fast: if it falls outside the allocated blocks, it's not there; if it falls within, it's merely one byte read and a bit-mask check to see if the correct bit is set.
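A minimal sketch of that read path over the mapped file; the file name is a placeholder, and creating the file with the NTFS sparse attribute needs the P/Invoke code from the article referenced above:

    using System.IO;
    using System.IO.MemoryMappedFiles;

    using var mmf = MemoryMappedFile.CreateFromFile("user-items.bin", FileMode.Open);
    using var accessor = mmf.CreateViewAccessor();

    // One byte read plus a bit-mask check, exactly as described above.
    bool IsIdSet(long id)
    {
        byte b = accessor.ReadByte(id / 8);
        return (b & (1 << (int)(id % 8))) != 0;
    }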
So for the particular scenario where you have many ids which you want to check as quickly as possible, this is the best way I've found so far.
And the good part is that the memory mapped files can be shared with Java as well (which turned out to be something we needed). Java also has support for memory-mapped files on Windows, and implementing the read/write logic is fairly trivial.
I really think you should try a nice database before you make your decision. Something like this will be a challenge to maintain in the long run. Your user-base is actually quite small. SQL Server should be able to handle what you need without any problems.
2000 users isn't too bad, but with 10 million related items you really should consider putting this into a database. DBs do all the storage, persistence, indexing, caching etc. that you need, and they perform very well.
They also allow for better scalability in the future. If you suddenly need to deal with two million users and billions of settings, having a good DB in place will make scaling a non-issue.
