High volume blacklist contains operation - performance in C# - c#

I am working on desktop application which need perform web site access check. I have a huge black lists on PC where application is running, and faced with task:
How to perform fastest check over those black lists?
I'm using C#/.NET development stack, currently my idea load all those lists into hashset and invoke Contains method, but I not sure that this is good idea to load it all into memory, maybe you can suggest another way which save memory from one side and will work as fast as it can from another?
The files are in form of plain text, and in the region of megabytes but this size is expected to grow.
UPDATE:
I found black lists of web site here after download and unzip it the size of data about 80 megabytes. So I not sure that keep all data in memory good idea.
UPDATE 2
I've created perfomance test, downloaded blacklist with 2339643
items.
Loaded it into HashSet and perform 1000 iterations to check
speed.
Results:
The maximum amount of time which Contains method take: 0.2
milliseconds (this is first call)
Second call take about '0.0164' milliseconds
milliseconds and other even less. The perfomance is good.
But application where I run test take about 250MB of system memory which
is not so good as HashSet perfomance.

You can use a HashSet to store your black list, this data structure allows O(1) amortised time complexity for inserts and checks if the item is present in the set.
If you need something more scalable, you can consider brining in redis or memcached.
Reading through comments, I would consider creating a web service that performs a check. A user can query web service, which in turn would query redis or memchached or slq server if you don't need it all in memory. Alternatively, I would suggest looking at whitelisting, if your black lists grow too much this could indicate a problem with the current approach.

Related

What is the fastest way to persistently increment a list of numbers from multiple threads?

My application has different tasks each one posting an XML Document through each HTTP POST on a different endpoint. For every thread I need to keep count of the message I sent, which is identified by a unique incremental number.
I need a mechanism that, after a message has been received by the endpoint will save the last message id sent, so that if there is a problem and the application needs to restart it won't send the same message again, and will restart from where it currently was.
If I don't persist the counters, on my laptop I can manage to obtain a throughput of about 100 messages processed per second for every queue with 5 tasks running. My goal is to achieve no more than a 10/15% reduction in throughput by persisting the counters.
Using SQL Server for saving the counters, with a row for every tasks gives me a 50% decrease in throughput. Saving the counter value on a text file for every task is a bit faster but still far from my goal. I am looking for a way to persist such information so that I can be as close as possible to my goal. I thought that maybe appending the last processed Id rather than updating it could help me in avoiding possible write locks, but the bottom line is that I don't care if for the sake of performance I will have to waste disk space or have a higher startup time for reading the last counter.
In your experience what might be a fast way to avoid contentions and safely persist data from multiple tasks even at the cost of more disk space?
You can get pretty good performance with an ESENT storage, via the ManagedEsent - PersistentDictionary wrapper.
The PersistentDictionary class is concurrent and provides real concurrent access to the ESENT backend. You would represent everything in key-value pair format.
Give it a try, it is not much code to write.
ESENT is an in-process database engine, disk based + in-memory caching, used throughout several Windows components (Search, Exchange, etc). It does provide transactional support, which is what you're after.
It has been included in all versions of Windows since 2000 so you don't need to install any dependencies other than ManagedEsent.
You would probably want to define something like this:
var dictionary = new PersistentDictionary<Guid, int>("ThreadStorage");
The key, I assume, should be something unique (maybe even the service endpoint) so that you are able to re-map it after a restart. The value is the last message identifier.
I am pasting below, shamelessly, their performance benchmarks:
Sequential inserts 32,000 entries/second
Random inserts 17,000 entries/second
Random Updates 36,000 entries/second
Random lookups (database cached in memory) 137,000 entries/second
Linq queries (range of records) 14,000 queries/second
You fit in the Random Updates case, which as you can see offers a really good throughput.
I faced the same issue as OP asked.
I used SQL server Sequence Numbers (with CREATE SEQUENCE).
However, the accepted answer is a good solution to avoid using SQL server.

Application so slow to start from DVD

I have an application with a large database (about 3.5 GB) that I need to run it from a read only file system like DVD. So my program works well from a hard disk but it's so slow to start from a DVD.
My question is how do I optimize my program to run fast on a DVD?
You'll have to profile your application, there is no silver bullet that makes your app load three times as fast. Analyze, profile, see what data is causing the latency
This depends entirely on what kind of database you mean. I'll assume it's row based.
If you wish to make a database fast to read from, the first step is probably to sort the database. This is critical because it makes it possible to hunt down specific rows very quickly using a binary search.
Loading 3.5 Gb into ram to search through from a DVD is going to take as nearly as long as ripping a DVD, so that's why your program's going to be slow to start. Consider making an index that points to the locations of certain rows, like page numbers for the start of each letter in a dictionary. Then, you only need to load small portions of your database to find the rows you need. Then, slowly build up the dictionary in RAM by loading portions in the order of requirements (i.e, if you search for something, load that portion first).
Specifically to DVD, there's not much you can do to make it load faster. Consider a streaming compression type (GZip maybe, C# supports this natively) to allow you to pull data faster.
Again, it depends entirely on what you're doing, these are just general suggestions.

Finding Changes between 2 HUGE zone (text) files

I have access to the .com zone files. A zone file is a text file with a list of domain names and their nameservers. It follows a format such as:
mydomain NS ns.mynameserver.com.
mydomain NS ns2.mynameserver.com.
anotherdomain NS nameservers.com.
notinalphadomain NS ns.example.com.
notinalphadomain NS ns1.example.com.
notinalphadomain NS ns2.example.com.
As you can see, there can be multiple lines for each domain (when there are multiple nameservers), and the file is NOT in alpha order.
These files are about 7GB in size.
I'm trying to take the previous file and the new file, and compare them to find:
What domains have been Added
What domains have been Removed
What domains have had nameservers changed
Since 7GB is too much to load the entire file into memory, Obviously I need to read in a stream. The method I've currently thought up as the best way to do it is to make several passes over both files. One pass for each letter of the alphabet, loading all the domains in the first pass that start with 'a' for example.
Once I've got all the 'a' domains from the old and new file, I can do a pretty simple comparison in memory to find the changes.
The problem is, even reading char by char, and optimizing as much as I've been able to think of, each pass over the file takes about 200-300 seconds, with collecting all the domains for the current pass's letter. So, I figure in its current state I'm looking at about an hour to process the files, without even storing the changes in the database (which will take some more time). This is on a dual quad core xeon server, so throwing more horsepower at it isn't much of an option for me.
This timing may not be a dealbreaker, but I'm hoping someone has some bright ideas for how to speed things up... Admittedly I have not tried async IO yet, that's my next step.
Thanks in advance for any ideas!
Preparing your data may help, both in terms of the best kind of code: the unwritten kind, and in terms of execution speed.
cat yesterday-com-zone | tr A-Z a-z | sort > prepared-yesterday
cat today-com-zone | tr A-Z a-z | sort > prepared-today
Now, your program does a very simple differences algorithm, and you might even be able to use diff:
diff prepared-today prepared-yesterday
Edit:
And an alternative solution that removes some extra processing, at the possible cost of diff execution time. This also assumes the use of GnuWin32 CoreUtils:
sort -f <today-com-zone >prepared-today
sort -f <yesterday-com-zone >prepared-yesterday
diff -i prepared-today prepared-yesterday
The output from that will be a list of additions, removals, and changes. Not necessarily 1 change record per zone (consider what happens when two domains alphabetically in order are removed). You might need to play with the options to diff to force it to not check for as many lines of context, to avoid great swaths of false-positive changes.
You may need to write your program after all to take the two sorted input files and just run them in lock-step, per-zone. When a new zone is found in TODAY file, that's a new zone. When a "new" zone is found in YESTERDAY file (but missing in today), that's a removal. When the "same" zone is found in both files, then compare the NS records. That's either no-change, or a change in nameservers.
The question has been already answered, but I'll provide a more detailed answer, with facts that are good for everyone to understand. I'll try to cover the existing solutions, and even how to distribute , with explanations of why things turned out as they did.
You have a 7 GB text file. Your disk lets us stream data at, let's be pessimistic, 20 MB/second. This can stream the whole thing in 350 seconds. That is under 6 minutes.
If we suppose that an average line is 70 characters, we have 100 million rows. If our disk spins at 6000 rpm, the average rotation takes 0.01 seconds, so grabbing a random piece of data off of disk can take anywhere from 0 to 0.01 seconds, and on average will take 0.005 seconds. This is called our seek time. If you know exactly where every record is, and seek to each line, it will take you 0.005 sec * 100,000,000 = 500,000 sec which is close to 6 days.
Lessons?
When working with data on disk you really want to avoid seeking. You want to stream data.
When possible, you don't want your data to be on disk.
Now the standard way to address this issue is to sort data. A standard mergesort works by taking a block, sorting it, taking another block, sorting it, and then merging them together to get a larger block. The merge operation streams data in, and writes a stream out, which is exactly the kind of access pattern that disks like. Now in theory with 100 million rows you'll need 27 passes with a mergesort. But in fact most of those passes easily fit in memory. Furthermore a clever implementation - which nsort seems to be - can compress intermediate data files to keep more passes in memory. This dataset should be highly structured and compressible, in which all of the intermediate data files should be able to fit in RAM. Therefore you entirely avoid disk except for reading and writing data.
This is the solution you wound up with.
OK, so that tells us how to solve this problem. What more can be said?
Quite a bit. Let's analyze what happened with the database suggestions. The standard database has a table and some indexes. An index is just a structured data set that tells you where your data is in your table. So you walk the index (potentially doing multiple seeks, though in practice all but the last tend to be in RAM), which then tells you where your data is in the table, which you then have to seek to again to get the data. So grabbing a piece of data out of a large table potentially means 2 disk seeks. Furthermore writing a piece of data to a table means writing the data to the table, and updating the index. Which means writing in several places. That means more disk seeks.
As I explained at the beginning, disk seeks are bad. You don't want to do this. It is a disaster.
But, you ask, don't database people know this stuff? Well of course they do. They design databases to do what users ask them to do, and they don't control users. But they also design them to do the right thing when they can figure out what that is. If you're working with a decent database (eg Oracle or PostgreSQL, but not MySQL), the database will have a pretty good idea when it is going to be worse to use an index than it is to do a mergesort, and will choose to do the right thing. But it can only do that if it has all of the context, which is why it is so important to push work into the database rather than coding up a simple loop.
Furthermore the database is good about not writing all over the place until it needs to. In particular the database writes to something called a WAL log (write access log - yeah, I know that the second log is redundant) and updates data in memory. When it gets around to it it writes changes in memory to disk. This batches up writes and causes it to need to seek less. However there is a limit to how much can be batched. Thus maintaining indexes is an inherently expensive operation. That is why standard advice for large data loads in databases is to drop all indexes, load the table, then recreate indexes.
But all this said, databases have limits. If you know the right way to solve a problem inside of a database, then I guarantee that using that solution without the overhead of the database is always going to be faster. The trick is that very few developers have the necessary knowledge to figure out the right solution. And even for those who do, it is much easier to have the database figure out how to do it reasonably well than it is to code up the perfect solution from scratch.
And the final bit. What if we have a cluster of machines available? The standard solution for that case (popularized by Google, which uses this heavily internally) is called MapReduce. What it is based on is the observation that merge sort, which is good for disk, is also really good for distributing work across multiple machines. Thus we really, really want to push work to a sort.
The trick that is used to do this is to do the work in 3 basic stages:
Take large body of data and emit a stream of key/value facts.
Sort facts, partition them them into key/values, and send off for further processing.
Have a reducer that takes a key/values set and does something with them.
If need be the reducer can send the data into another MapReduce, and you can string along any set of these operations.
From the point of view of a user, the nice thing about this paradigm is that all you have to do is write a simple mapper (takes a piece of data - eg a line, and emits 0 or more key/value pairs) and a reducer (takes a key/values set, does something with it) and the gory details can be pushed off to your MapReduce framework. You don't have to be aware of the fact that it is using a sort under the hood. And it can even take care of such things as what to do if one of your worker machines dies in the middle of your job. If you're interested in playing with this, http://hadoop.apache.org/mapreduce/ is a widely available framework that will work with many other languages. (Yes, it is written in Java, but it doesn't care what language the mapper and reducer are written in.)
In your case your mapper could start with a piece of data in the form (filename, block_start), open that file, start at that block, and emit for each line a key/value pair of the form domain: (filename, registrar). The reducer would then get for a single domain the 1 or 2 files it came from with full details. It then only emits the facts of interest. Adds are that it is in the new but not the old. Drops are that it is in the old but not the new. Registrar changes are that it is in both but the registrar changed.
Assuming that your file is readily available in compressed form (so it can easily be copied to multiple clients) this can let you process your dataset much more quickly than any single machine could do it.
This is very similar to a Google interview question that goes something like "say you have a list on one-million 32-bit integers that you want to print in ascending order, and the machine you are working on only has 2 MB of RAM, how would you approach the problem?".
The answer (or rather, one valid answer) is to break the list up into manageable chunks, sort each chunk, and then apply a merge operation to generate the final sorted list.
So I wonder if a similar approach could work here. As in, starting with the first list, read as much data as you can efficiently work with in memory at once. Sort it, and then write the sorted chunk out to disk. Repeat this until you have processed the entire file, and then merge the chunks to construct a single sorted dataset (this step is optional...you could just do the final comparison using all the sorted chunks from file 1 and all the sorted chunks from file 2).
Repeat the above steps for the second file, and then open your two sorted datasets and read through them one line at a time. If the lines match then advance both to the next line. Otherwise record the difference in your result-set (or output file) and then advance whichever file has the lexicographically "smaller" value to the next line, and repeat.
Not sure how fast it would be, but it's almost certainly faster than doing 26 passes through each file (you've got 1 pass to build the chunks, 1 pass to merge the chunks, and 1 pass to compare the sorted datasets).
That, or use a database.
You should read each file once and save them into a database. Then you can perform whatever analysis you need using database queries. Databases are designed to quickly handle and process large amounts of data like this.
It will still be fairly slow to read all of the data into the database the first time, but you won't have to read the files more than once.

Optimal storage of data structure for fast lookup and persistence

Scenario
I have the following methods:
public void AddItemSecurity(int itemId, int[] userIds)
public int[] GetValidItemIds(int userId)
Initially I'm thinking storage on the form:
itemId -> userId, userId, userId
and
userId -> itemId, itemId, itemId
AddItemSecurity is based on how I get data from a third party API, GetValidItemIds is how I want to use it at runtime.
There are potentially 2000 users and 10 million items.
Item id's are on the form: 2007123456, 2010001234 (10 digits where first four represent the year).
AddItemSecurity does not have to perform super fast, but GetValidIds needs to be subsecond. Also, if there is an update on an existing itemId I need to remove that itemId for users no longer in the list.
I'm trying to think about how I should store this in an optimal fashion. Preferably on disk (with caching), but I want the code maintainable and clean.
If the item id's had started at 0, I thought about creating a byte array the length of MaxItemId / 8 for each user, and set a true/false bit if the item was present or not. That would limit the array length to little over 1mb per user and give fast lookups as well as an easy way to update the list per user. By persisting this as Memory Mapped Files with the .Net 4 framework I think I would get decent caching as well (if the machine has enough RAM) without implementing caching logic myself. Parsing the id, stripping out the year, and store an array per year could be a solution.
The ItemId -> UserId[] list can be serialized directly to disk and read/write with a normal FileStream in order to persist the list and diff it when there are changes.
Each time a new user is added all the lists have to updated as well, but this can be done nightly.
Question
Should I continue to try out this approach, or are there other paths which should be explored as well? I'm thinking SQL server will not perform fast enough, and it would give an overhead (at least if it's hosted on a different server), but my assumptions might be wrong. Any thought or insights on the matter is appreciated. And I want to try to solve it without adding too much hardware :)
[Update 2010-03-31]
I have now tested with SQL server 2008 under the following conditions.
Table with two columns (userid,itemid) both are Int
Clustered index on the two columns
Added ~800.000 items for 180 users - Total of 144 million rows
Allocated 4gb ram for SQL server
Dual Core 2.66ghz laptop
SSD disk
Use a SqlDataReader to read all itemid's into a List
Loop over all users
If I run one thread it averages on 0.2 seconds. When I add a second thread it goes up to 0.4 seconds, which is still ok. From there on the results are decreasing. Adding a third thread brings alot of the queries up to 2 seonds. A forth thread, up to 4 seconds, a fifth spikes some of the queries up to 50 seconds.
The CPU is roofing while this is going on, even on one thread. My test app takes some due to the speedy loop, and sql the rest.
Which leads me to the conclusion that it won't scale very well. At least not on my tested hardware. Are there ways to optimize the database, say storing an array of int's per user instead of one record per item. But this makes it harder to remove items.
[Update 2010-03-31 #2]
I did a quick test with the same data putting it as bits in memory mapped files. It performs much better. Six threads yields access times between 0.02s and 0.06s. Purely memory bound. The mapped files were mapped by one process, and accessed by six others simultaneously. And as the sql base took 4gb, the files on disk took 23mb.
After much testing I ended up using Memory Mapped Files, marking them with the sparse bit (NTFS), using code from NTFS Sparse Files with C#.
Wikipedia has an explanation of what a sparse file is.
The benefits of using a sparse file is that I don't have to care about what range my id's are in. If I only write id's between 2006000000 and 2010999999, the file will only allocate 625,000 bytes from offset 250,750,000 in the file. All space up to that offset is unallocated in the file system. Each id is stored as a set bit in the file. Sort of treated as an bit array. And if the id sequence suddenly changes, then it will allocate in another part of the file.
In order to retrieve which id's are set, I can perform a OS call to get the allocated parts of the sparse file, and then I check each bit in those sequences. Also checking if a particular id is set is very fast. If it falls outside the allocated blocks, then it's not there, if it falls within, it's merely one byte read and a bit mask check to see if the correct bit is set.
So for the particular scenario where you have many id's which you want to check on with as much speed as possible, this is the most optimal way I've found so far.
And the good part is that the memory mapped files can be shared with Java as well (which turned out to be something needed). Java also has support for memory mapped files on Windows, and implementing the read/write logic is fairly trivial.
I really think you should try a nice database before you make your decision. Something like this will be a challenge to maintain in the long run. Your user-base is actually quite small. SQL Server should be able to handle what you need without any problems.
2000 users isn't too bad but with 10 mil related items you really should consider putting this into a database. DBs do all the storage, persistence, indexing, caching etc. that you need and they perform very well.
They also allow for better scalability into the future. If you suddenly need to deal with two million users and billions of settings having a good db in place will make scaling a non-issue.

Reading from SerialPort & Memory Management - C#

I am working on a program that reads in from a serial port, then parses, formats, and displays the information appropriately. This program, however, needs to be run for upwards of 12+ hours - constantly handling a stream of incoming data. I am finding that as I let my app run for a while, the memory usage increases at a linear rate - not good for a 12 hour usage.
I have implemented a logger that writes the raw incoming binary data to a file - is there a way I can utilize this idea to clear my memory cache at regular intervals? I.e. how can I, every so often, write to the log file in such a way that the data doesn't need to be stored in memory?
Also - are there other aspects of a Windows Form Application that would contribute to this? E.g. I print the formatted strings to a textbox, which ends up displaying the entire string. Because this is running for so long, it easily displays hundreds of thousands of lines of text. Should I be writing this to a file and clearing the text? Or something else?
Obviously, if the string grows over time, your app's memory usage will also grow over time. Also, WinForms textboxes can have trouble dealing with very large strings. How large does the string get?
Unless you really want to display the entire string onscreen, you should definitely clear it periodically (depending on your users' expectations); this will save memory and probably improve performance.
Normally, memory management in .NET is completely automatic. You should be careful about extrapolating a short observations (minutes) to a 12 hour period. And please note that TaskManager is not a very good tool to measure memory usage.
Writing the incoming data should not increase memory usage significantly. But there are a few thing you should avoid doing, and concatenating to a string over and over is one of them. Your TextBox is probably costing a lot more than you seem to think. Using a ListBox would be more efficient. And easier.
I have several serial applications which run either as an application or as a windows service. These are required to be up 24/7-365. The best mechanism I have found to avoid this same problem is two-fold.
1) Write the information out to a log file. For a service, this is the only way of getting the info out. The log file does not increase your memory usage.
2) For the application, write the information out to a log file as well as put it into a listbox. I generally limit the listbox to the last 500 or 1000 entries. With the newer .net controls, the listboxes are virtualized which helps but you also don't run into other memory issues such as the textbox concatenation.
You can take a system down with a textbox by constantly appending the string over a number of hours as it is not intended for that kind of abuse out of the box.

Categories

Resources