Hey everyone.
This place is like a goldmine of knowledge and it's helping me so much! My next query is:
I have byte data being sent to my C# socket server. I am converting it to an ASCII string, then splitting the data on a common character (like the bar | character) and using the pieces. Typically the first piece of data is a command as a 4-digit number. I can imagine this not being very efficient! What would be the best way to process the data I am receiving, efficiently?
Related: how should I be trapping and processing commands? Multiple if statements, or a large case/switch statement? I really need speed and efficiency.
Typically the first piece of data is a command as a 4-digit number. I can imagine this not being very efficient! What would be the best way to process the data I am receiving, efficiently?
No, converting a number to/from a string is not efficient. But the question is: does it really matter? It sounds to me like you are trying to do premature optimization. Do not do that. Your goal should be to write code that is easy to read and maintain. Do not optimize until someone actually complains about the performance.
Related: how should I be trapping and processing commands? Multiple if statements, or a large case/switch statement? I really need speed and efficiency.
Again: first determine that the command processing really is the bottleneck in your application.
The whole processing really depends on what you do with the incoming messages. You provide way too little information to give a proper answer. Create a new question (since two questions in one is not really allowed), add code which shows your current handling, and describe what you do not like about it.
If you really need the performance, I guess you shouldn't use a string representation for your command but work directly on the bytes. Four digits in string format are 32 or 64 bits in size (depending on which encoding you are using), whilst two bytes are sufficient to store any four-digit number. Using a lot of branches (which if statements are) also affects your performance.
My suggestion is that you reserve a fixed-size prefix in your message for the command. You then use those bytes to look the command up in O(1) in a table; this table can be filled with objects that have an Execute method, so you can do something like table[command].Execute().
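A minimal sketch of that table-dispatch idea, assuming a single-byte command prefix; the ICommand interface and the names below are illustrative, not taken from the original post:

using System;

// Hypothetical command abstraction; names are illustrative only.
public interface ICommand
{
    void Execute(byte[] payload);
}

public class DispatchTable
{
    // One slot per possible command byte gives O(1) lookup with no branching.
    private readonly ICommand[] table = new ICommand[256];

    public void Register(byte commandId, ICommand command)
    {
        table[commandId] = command;
    }

    public void Dispatch(byte[] message)
    {
        // Assumes the first byte of the message is the command id
        // and the rest is the payload.
        ICommand command = table[message[0]];
        if (command == null)
            return; // unknown command; log or ignore as appropriate

        byte[] payload = new byte[message.Length - 1];
        Array.Copy(message, 1, payload, 0, payload.Length);
        command.Execute(payload);
    }
}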
That being said, I don't think the performance gain would be that large, and you are better off (maintenance-wise) using one of the serialization libraries out there.
I find myself faced with a conundrum whose answer probably falls outside my expertise. I'm hoping someone can help.
I have an optimised and efficient query for fetching table (and linked) data, the actual contents of which are unimportant. However, upon each read that data then needs to be processed to present it in JSON format. As we're talking typical examples where a few hundred rows could have a few hundred thousand associated rows, this takes time. With multi-threading and a powerful CPU (i7 3960X) this processing is around 400ms - 800ms at 100% CPU. It's not a lot, I know, but why process it each time in the first place?
In this particular example, although everything I've ever read points to not doing so (as I understood it), I'm considering storing the computed JSON in a VARCHAR(MAX) column for fast reading.
Why? Well, the data is read 100 times or more for every single write (change). Given those numbers, it seems to me it would be far better to store the JSON for optimised retrieval, and re-compute and update it on the odd occasion the associations change - adding perhaps 10 to 20 ms to the time taken to write changes, but improving the reads by some large factor.
Your opinions on this would be much appreciated.
Yes, storing redundant information for performance reasons is pretty common. The first step is to measure the overhead - and it sounds like you've done that already (although I would also ask: what json serializer are you using? have you tried others?)
But fundamentally, yes that's ok, when the situation warrants it. To give an example: stackoverflow has a similar scenario - the markdown you type is relatively expensive to process into html. We could do that on every read, but we have insanely more reads than writes, so we cook the markdown at write, and store the html as well as the source markdown - then it is just a simple "data in, data out" exercise for most of the "show" code.
It would be unusual for this to be a common problem with json, though, since json serialization is a bit simpler and lots of meta-programming optimization is performed by most serializers. Hence my suggestion to try a different serializer before going this route.
Note also that the rendered json may need more network bandwidth than the original source data in TDS - so your data transfer between the db server and the application server may increase; another thing to consider.
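As a rough sketch of that cook-at-write approach: re-serialize and store the JSON on the (rare) write path, so the (frequent) read path is a plain SELECT. The table, column, Order type and SerializeToJson helper below are all assumptions for illustration, not taken from the question:

using System.Data.SqlClient;

// Order and SerializeToJson are placeholders for your own entity and serializer.
public void SaveOrder(Order order, SqlConnection connection)
{
    // Re-compute the JSON only when the data changes...
    string json = SerializeToJson(order);   // whichever serializer you settle on

    using (var cmd = new SqlCommand(
        "UPDATE Orders SET CachedJson = @json WHERE Id = @id", connection))
    {
        cmd.Parameters.AddWithValue("@json", json);
        cmd.Parameters.AddWithValue("@id", order.Id);
        cmd.ExecuteNonQuery();
    }
}

// ...so every read is just "data in, data out":
public string GetOrderJson(int id, SqlConnection connection)
{
    using (var cmd = new SqlCommand(
        "SELECT CachedJson FROM Orders WHERE Id = @id", connection))
    {
        cmd.Parameters.AddWithValue("@id", id);
        return (string)cmd.ExecuteScalar();
    }
}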
So coming off of this question:
Which is the faster comparison: Convert.ToInt32(stringValue)==intValue or stringValue==intValue.ToString()
I am looking for a base type for my networked application to store in packets.
The Idea:
Packet class stores a list of (type)
Add objects to the packet class
Serialize and send it between machines
Deserialize into (type)
Convert (type) into the type of object you added originally.
Originally, I was using strings as (type). However, I am a bit dubious, as every time I want to convert an int to a string it seems like a taxing process. When I am sending packets containing lots of uints (as strings) at 30 FPS, I would like to make this process as fast as possible.
Therefore, I was wondering if byte[] would be a more suitable type. How fast is converting back and forth between a byte[] and ints/strings vs just strings to ints? BTW, I will not be sending a lot of strings on the network very often. Almost all of what I will be sending will be uints.
If you are using the same program on both ends, use binary serialization if possible. You are worried about speed; but unless this is just going between two processes on localhost, actual wire time, let alone latency, will be orders of magnitude slower than any real conversion process.
Of course, don't concatenate strings; you will make a liar out of me.
The thing you need to save here is your coding time, plus the possibility of errors from rolling your own serialization. If you properly encapsulate the data-transfer parts of your program, upgrading them later would be easy. Trying to spend extra time up front making something fast is called premature optimization (google it - it's a valid argument - most of the time). If it does turn out to be a bottleneck, leverage your encapsulated design and change it. You won't spend much more time then than if you'd done it first - and you likely won't end up spending that time at all.
A warning about binary serialization: the types you are sending must be the same version and type name on both ends. If you can put the same version into production on both ends easily, it's no worry. If you need more than this, or binary serialization is too slow, look into FastJson, which makes big promises and is free, or something similar.
byte[] is the "natural" data type for socket operations, so this seems a good fit. ints/uints will be very fast to convert as well. Strings are a bit different, but if you choose the natural encoding of the platform, this will be fast also.
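For the uint-heavy case, a minimal sketch of packing values straight into a byte[] with BinaryWriter/BinaryReader, with no string conversion involved (the length-prefixed layout is just an assumption for illustration):

using System.IO;

// Pack a batch of uints into a byte[] for sending, and unpack them on the other side.
public static byte[] Pack(uint[] values)
{
    using (var ms = new MemoryStream())
    using (var writer = new BinaryWriter(ms))
    {
        writer.Write(values.Length);          // prefix with the count
        foreach (uint v in values)
            writer.Write(v);                  // 4 bytes each, no parsing
        return ms.ToArray();
    }
}

public static uint[] Unpack(byte[] buffer)
{
    using (var ms = new MemoryStream(buffer))
    using (var reader = new BinaryReader(ms))
    {
        int count = reader.ReadInt32();
        var values = new uint[count];
        for (int i = 0; i < count; i++)
            values[i] = reader.ReadUInt32();
        return values;
    }
}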
Convert.ToInt32 is decently fast provided it does not fail. If it fails then you incur the overhead of a thrown/caught exception which is massive.
The byte[] vs. some other type dichotomy is false. The network transports all information as -- essentially -- an array of bytes. So whether a StreamReader wrapped around a NetworkStream is turning the byte[] into a string, or you are doing it yourself, it's still getting done.
I have access to the .com zone files. A zone file is a text file with a list of domain names and their nameservers. It follows a format such as:
mydomain NS ns.mynameserver.com.
mydomain NS ns2.mynameserver.com.
anotherdomain NS nameservers.com.
notinalphadomain NS ns.example.com.
notinalphadomain NS ns1.example.com.
notinalphadomain NS ns2.example.com.
As you can see, there can be multiple lines for each domain (when there are multiple nameservers), and the file is NOT in alpha order.
These files are about 7GB in size.
I'm trying to take the previous file and the new file, and compare them to find:
What domains have been Added
What domains have been Removed
What domains have had nameservers changed
Since 7GB is too much to load into memory in its entirety, obviously I need to read it as a stream. The best method I've come up with so far is to make several passes over both files: one pass for each letter of the alphabet, loading, for example, all the domains that start with 'a' in the first pass.
Once I've got all the 'a' domains from the old and new file, I can do a pretty simple comparison in memory to find the changes.
The problem is, even reading char by char and optimizing as much as I've been able to think of, each pass over the file takes about 200-300 seconds just to collect all the domains for the current pass's letter. So, I figure in its current state I'm looking at about an hour to process the files, without even storing the changes in the database (which will take some more time). This is on a dual quad-core Xeon server, so throwing more horsepower at it isn't much of an option for me.
This timing may not be a dealbreaker, but I'm hoping someone has some bright ideas for how to speed things up... Admittedly I have not tried async IO yet, that's my next step.
Thanks in advance for any ideas!
Preparing your data may help, both in terms of the best kind of code: the unwritten kind, and in terms of execution speed.
cat yesterday-com-zone | tr A-Z a-z | sort > prepared-yesterday
cat today-com-zone | tr A-Z a-z | sort > prepared-today
Now, your program does a very simple differences algorithm, and you might even be able to use diff:
diff prepared-today prepared-yesterday
Edit:
And an alternative solution that removes some extra processing, at the possible cost of diff execution time. This also assumes the use of GnuWin32 CoreUtils:
sort -f <today-com-zone >prepared-today
sort -f <yesterday-com-zone >prepared-yesterday
diff -i prepared-today prepared-yesterday
The output from that will be a list of additions, removals, and changes - not necessarily one change record per zone (consider what happens when two alphabetically adjacent domains are removed). You might need to play with the options to diff to force it not to check for as many lines of context, to avoid great swaths of false-positive changes.
You may need to write your program after all to take the two sorted input files and just run them in lock-step, per-zone. When a new zone is found in TODAY file, that's a new zone. When a "new" zone is found in YESTERDAY file (but missing in today), that's a removal. When the "same" zone is found in both files, then compare the NS records. That's either no-change, or a change in nameservers.
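A rough sketch of that lock-step walk, assuming both files have already been sorted (and lower-cased) as above; the parsing of the NS lines is simplified and illustrative:

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

// Reads a sorted zone file and yields one (domain, sorted nameserver list) pair at a time.
static IEnumerable<KeyValuePair<string, List<string>>> ReadZones(string path)
{
    using (var reader = new StreamReader(path))
    {
        string currentDomain = null;
        var nameservers = new List<string>();
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            string[] parts = line.Split(new[] { ' ', '\t' }, StringSplitOptions.RemoveEmptyEntries);
            if (parts.Length < 3) continue;                 // skip malformed lines
            if (currentDomain != null && parts[0] != currentDomain)
            {
                nameservers.Sort();
                yield return new KeyValuePair<string, List<string>>(currentDomain, nameservers);
                nameservers = new List<string>();
            }
            currentDomain = parts[0];
            nameservers.Add(parts[2]);
        }
        if (currentDomain != null)
        {
            nameservers.Sort();
            yield return new KeyValuePair<string, List<string>>(currentDomain, nameservers);
        }
    }
}

static void Compare(string yesterdayPath, string todayPath)
{
    using (var oldZones = ReadZones(yesterdayPath).GetEnumerator())
    using (var newZones = ReadZones(todayPath).GetEnumerator())
    {
        bool hasOld = oldZones.MoveNext(), hasNew = newZones.MoveNext();
        while (hasOld || hasNew)
        {
            int cmp = !hasOld ? 1 : !hasNew ? -1
                    : string.CompareOrdinal(oldZones.Current.Key, newZones.Current.Key);

            if (cmp < 0)      { Console.WriteLine("REMOVED " + oldZones.Current.Key); hasOld = oldZones.MoveNext(); }
            else if (cmp > 0) { Console.WriteLine("ADDED "   + newZones.Current.Key); hasNew = newZones.MoveNext(); }
            else
            {
                if (!oldZones.Current.Value.SequenceEqual(newZones.Current.Value))
                    Console.WriteLine("CHANGED " + newZones.Current.Key);
                hasOld = oldZones.MoveNext();
                hasNew = newZones.MoveNext();
            }
        }
    }
}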
The question has already been answered, but I'll provide a more detailed answer, with facts that are good for everyone to understand. I'll try to cover the existing solutions, and even how to distribute the work across multiple machines, with explanations of why things turned out as they did.
You have a 7 GB text file. Your disk lets us stream data at, let's be pessimistic, 20 MB/second. This can stream the whole thing in 350 seconds. That is under 6 minutes.
If we suppose that an average line is 70 characters, we have 100 million rows. If our disk spins at 6000 rpm, the average rotation takes 0.01 seconds, so grabbing a random piece of data off of disk can take anywhere from 0 to 0.01 seconds, and on average will take 0.005 seconds. This is called our seek time. If you know exactly where every record is, and seek to each line, it will take you 0.005 sec * 100,000,000 = 500,000 sec which is close to 6 days.
Lessons?
When working with data on disk you really want to avoid seeking. You want to stream data.
When possible, you don't want your data to be on disk.
Now the standard way to address this issue is to sort the data. A standard mergesort works by taking a block, sorting it, taking another block, sorting it, and then merging them together to get a larger block. The merge operation streams data in and writes a stream out, which is exactly the kind of access pattern that disks like. Now in theory with 100 million rows you'll need 27 passes with a mergesort. But in fact most of those passes easily fit in memory. Furthermore a clever implementation - which nsort seems to be - can compress intermediate data files to keep more passes in memory. This dataset should be highly structured and compressible, in which case all of the intermediate data files should be able to fit in RAM. Thus you entirely avoid disk except for reading and writing the data.
This is the solution you wound up with.
OK, so that tells us how to solve this problem. What more can be said?
Quite a bit. Let's analyze what happened with the database suggestions. The standard database has a table and some indexes. An index is just a structured data set that tells you where your data is in your table. So you walk the index (potentially doing multiple seeks, though in practice all but the last tend to be in RAM), which then tells you where your data is in the table, which you then have to seek to again to get the data. So grabbing a piece of data out of a large table potentially means 2 disk seeks. Furthermore writing a piece of data to a table means writing the data to the table, and updating the index. Which means writing in several places. That means more disk seeks.
As I explained at the beginning, disk seeks are bad. You don't want to do this. It is a disaster.
But, you ask, don't database people know this stuff? Well of course they do. They design databases to do what users ask them to do, and they don't control users. But they also design them to do the right thing when they can figure out what that is. If you're working with a decent database (eg Oracle or PostgreSQL, but not MySQL), the database will have a pretty good idea when it is going to be worse to use an index than it is to do a mergesort, and will choose to do the right thing. But it can only do that if it has all of the context, which is why it is so important to push work into the database rather than coding up a simple loop.
Furthermore the database is good about not writing all over the place until it needs to. In particular the database writes to something called a WAL log (write-ahead log - yes, I know the second "log" is redundant) and updates data in memory. When it gets around to it, it writes the changes in memory to disk. This batches up writes and means it needs to seek less. However there is a limit to how much can be batched. Thus maintaining indexes is an inherently expensive operation. That is why the standard advice for large data loads in databases is to drop all indexes, load the table, then recreate the indexes.
But all this said, databases have limits. If you know the right way to solve a problem inside of a database, then I guarantee that using that solution without the overhead of the database is always going to be faster. The trick is that very few developers have the necessary knowledge to figure out the right solution. And even for those who do, it is much easier to have the database figure out how to do it reasonably well than it is to code up the perfect solution from scratch.
And the final bit. What if we have a cluster of machines available? The standard solution for that case (popularized by Google, which uses this heavily internally) is called MapReduce. What it is based on is the observation that merge sort, which is good for disk, is also really good for distributing work across multiple machines. Thus we really, really want to push work to a sort.
The trick that is used to do this is to do the work in 3 basic stages:
Take large body of data and emit a stream of key/value facts.
Sort the facts, partition them into key/value groups, and send them off for further processing.
Have a reducer that takes a key/values set and does something with them.
If need be the reducer can send the data into another MapReduce, and you can string along any set of these operations.
From the point of view of a user, the nice thing about this paradigm is that all you have to do is write a simple mapper (takes a piece of data - eg a line, and emits 0 or more key/value pairs) and a reducer (takes a key/values set, does something with it) and the gory details can be pushed off to your MapReduce framework. You don't have to be aware of the fact that it is using a sort under the hood. And it can even take care of such things as what to do if one of your worker machines dies in the middle of your job. If you're interested in playing with this, http://hadoop.apache.org/mapreduce/ is a widely available framework that will work with many other languages. (Yes, it is written in Java, but it doesn't care what language the mapper and reducer are written in.)
In your case your mapper could start with a piece of data in the form (filename, block_start), open that file, start at that block, and emit for each line a key/value pair of the form domain: (filename, registrar). The reducer would then get, for a single domain, the 1 or 2 files it came from with full details, and emit only the facts of interest: an add means the domain is in the new file but not the old, a drop means it is in the old but not the new, and a registrar change means it is in both but the registrar details changed.
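A minimal sketch of such a mapper in the Hadoop-streaming style (read zone-file lines on stdin, emit tab-separated key/value pairs on stdout); the emit format and the file tag passed as a command-line argument are assumptions for illustration:

using System;

// Streaming-style mapper: reads "domain NS nameserver" lines from stdin and
// emits "domain <TAB> filetag,nameserver" pairs. The file tag (old/new) is
// passed as the first command-line argument.
class ZoneMapper
{
    static void Main(string[] args)
    {
        string fileTag = args.Length > 0 ? args[0] : "unknown";
        string line;
        while ((line = Console.ReadLine()) != null)
        {
            string[] parts = line.Split(new[] { ' ', '\t' },
                                        StringSplitOptions.RemoveEmptyEntries);
            if (parts.Length < 3) continue;             // skip malformed lines
            string domain = parts[0].ToLowerInvariant();
            string ns = parts[2].ToLowerInvariant();
            Console.WriteLine(domain + "\t" + fileTag + "," + ns);
        }
    }
}

The reducer then sees all values for one domain grouped together and only has to decide whether that domain is an add, a drop, or a change.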
Assuming that your file is readily available in compressed form (so it can easily be copied to multiple clients) this can let you process your dataset much more quickly than any single machine could do it.
This is very similar to a Google interview question that goes something like "say you have a list of one million 32-bit integers that you want to print in ascending order, and the machine you are working on only has 2 MB of RAM; how would you approach the problem?".
The answer (or rather, one valid answer) is to break the list up into manageable chunks, sort each chunk, and then apply a merge operation to generate the final sorted list.
So I wonder if a similar approach could work here. As in, starting with the first list, read as much data as you can efficiently work with in memory at once. Sort it, and then write the sorted chunk out to disk. Repeat this until you have processed the entire file, and then merge the chunks to construct a single sorted dataset (this step is optional...you could just do the final comparison using all the sorted chunks from file 1 and all the sorted chunks from file 2).
Repeat the above steps for the second file, and then open your two sorted datasets and read through them one line at a time. If the lines match then advance both to the next line. Otherwise record the difference in your result-set (or output file) and then advance whichever file has the lexicographically "smaller" value to the next line, and repeat.
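The chunking step might look something like the sketch below (the chunk size and the ordinal sort are assumptions; the merge/compare step can then walk the sorted chunks in lock-step as described above):

using System;
using System.Collections.Generic;
using System.IO;

// Split a huge unsorted file into sorted chunk files that each fit comfortably in memory.
static List<string> WriteSortedChunks(string inputPath, int linesPerChunk)
{
    var chunkFiles = new List<string>();
    var buffer = new List<string>(linesPerChunk);

    using (var reader = new StreamReader(inputPath))
    {
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            buffer.Add(line);
            if (buffer.Count >= linesPerChunk)
                FlushChunk(buffer, chunkFiles);
        }
    }
    if (buffer.Count > 0)
        FlushChunk(buffer, chunkFiles);

    return chunkFiles;
}

static void FlushChunk(List<string> buffer, List<string> chunkFiles)
{
    buffer.Sort(StringComparer.Ordinal);        // sort this chunk in memory
    string path = Path.GetTempFileName();       // write it out as a sorted run
    File.WriteAllLines(path, buffer.ToArray());
    chunkFiles.Add(path);
    buffer.Clear();
}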
Not sure how fast it would be, but it's almost certainly faster than doing 26 passes through each file (you've got 1 pass to build the chunks, 1 pass to merge the chunks, and 1 pass to compare the sorted datasets).
That, or use a database.
You should read each file once and save them into a database. Then you can perform whatever analysis you need using database queries. Databases are designed to quickly handle and process large amounts of data like this.
It will still be fairly slow to read all of the data into the database the first time, but you won't have to read the files more than once.
I need to store a lot of strings in RAM. They do not contain special Unicode characters; they all contain only characters from "ISO 8859-1", which is one byte per character.
Now I could convert every string, store it in memory and convert it back to use it with .Contains() and methods like this, but this would be overhead (in my opinion) and slow.
Is there a string class that is fast and reliable and offers some methods of the original string class like .Contains()?
I need this to store more strings in memory with less RAM used. Or is there another way to do it?
Update:
Thank you for your comments and your answer.
I have a class that stores strings. With one method call I need to figure out whether I already have a given string in memory. I need to check about 1000 strings per second against the list, and there are hundreds of millions of strings in total.
The average size of a string is about 20 chars. It is really the RAM usage that concerns me.
I even thought about compressing batches of a few million strings and storing those packages in memory. But then I would need to decompress a package every time I need to access the values.
I also tried to use a HashSet, but the amount of memory needed was even higher.
I don't need the actual value, just to know whether the value is in the list. So if there is a hash value that can do it, even better. But everything I found needs more memory than the plain strings.
Currently there is no plan for further internationalization. So it is something I would deal with when it is time to :-)
I don't know if using a database would solve it. I don't need to fetch anything, just to know if the value was stored in the class. And I need to do this fast.
It is very unlikely that you will win any significant performance from this. However, if you need to save memory, this strategy may be appropriate.
To convert a string to a byte[] for this purpose, use Encoding.Default.GetBytes()[1].
To convert a byte[] back to a string for display or other string-based processing, use Encoding.Default.GetString().
You can make your code look nicer if you use extension methods defined on string and byte[]. Alternatively, you can wrap the byte[] in a wrapper type and put the methods there. Make this wrapper type a struct, not a class, otherwise it will incur extra heap allocations, which is what you’re trying to avoid.
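A bare-bones sketch of such a wrapper struct; only a naive Contains is shown, and Encoding.Default is assumed to be the single-byte codepage you want (see the note below):

using System.Text;

// Minimal wrapper around a byte[] holding single-byte-encoded text.
// Only a naive Contains is sketched here; extend with the members you need.
public struct ByteString
{
    private readonly byte[] data;

    public ByteString(string s)
    {
        data = Encoding.Default.GetBytes(s);
    }

    public override string ToString()
    {
        return Encoding.Default.GetString(data);
    }

    public bool Contains(ByteString other)
    {
        byte[] needle = other.data;
        if (data == null || needle == null) return false;
        if (needle.Length == 0) return true;

        // Naive byte-by-byte search; fine as a sketch, replace if it becomes hot.
        for (int i = 0; i <= data.Length - needle.Length; i++)
        {
            int j = 0;
            while (j < needle.Length && data[i + j] == needle[j]) j++;
            if (j == needle.Length) return true;
        }
        return false;
    }
}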
I want to warn you, though — you are throwing away the ability to have Unicode in your application. You should normally have all alarm bells go off every time you think you need to do this. It is best if you structure your code in such a way that you can easily go back to using string when memory sizes will have gone up and memory consumption stops being an issue.
[1] Encoding.Default returns the current 8-bit codepage of the running operating system. The default for this on English-language Windows is Windows-1252, which is what you want. For Russian Windows it will be Windows-1251 (Cyrillic) etc.
As per comments, a basically bad idea. If you have to do it, byte[] is your friend. There is no byte-oriented string class in .NET.
Check out the string.Intern method; that could help you out:
http://www.yoda.arachsys.com/csharp/strings.html
http://en.csharp-online.net/CSharp_String_Theory%E2%80%94String_intern_pool
However, looking at your requirements, I think you are over-engineering it. You have 1000 strings at 20 chars = 1000 * 20 * 2 = 40,000 bytes; that's not much memory.
If you really have a large amount, store it in a DB with an index. That would be much faster than anything the average programmer can come up with.
My project, when it is running, will collect a large number of string text blocks (about 20K of them, and the largest run I have seen is about 200K) in a short span of time and store them in a relational database. Each string is relatively small, averaging about 15 short lines (about 300 characters). The current implementation is C# (VS2008), .NET 3.5, and the backend DBMS is MS SQL Server 2005.
Performance and storage are both important concerns of the project, but the priority is performance first, then storage. I am looking for answers to these:
Should I compress the text before storing them in DB? or let SQL Server worry about compacting the storage?
Do you know what would be the best compression algorithm/library to use in this context for the best performance? Currently I just use the standard GZip in the .NET framework.
Do you know any best practices for dealing with this? I welcome outside-the-box suggestions as long as they are implementable in the .NET framework (it is a big project and this requirement is only a small part of it).
EDITED: I will keep adding to this to clarify points raised
I don't need text indexing or searching on these text. I just need to be able to retrieve them in later stage for display as a text block using its primary key.
I have a working solution implemented as above and SQL Server has no issue at all handling it. This program will run quite often and needs to work with a large data context, so you can imagine the size will grow very rapidly; hence every optimization I can do will help.
The strings are, on average, 300 characters each. That's either 300 or 600 bytes, depending on Unicode settings. Let's say you use a varchar(4000) column and use (on average) 300 bytes each.
Then you have up to 200,000 of these to store in a database.
That's less than 60 MB of storage. In the land of databases, that is, quite frankly, peanuts. 60 GB of storage is what I'd call a "medium" database.
At this point in time, even thinking about compression is premature optimization. SQL Server can handle this amount of text without breaking a sweat. Barring any system constraints that you haven't mentioned, I would not concern myself with any of this until and unless you actually start to see performance problems - and even then it will likely be the result of something else, like a poor indexing strategy.
And compressing certain kinds of data, especially very small amounts of data (and 300 bytes is definitely small), can actually sometimes yield worse results. You could end up with "compressed" data that is actually larger than the original data. I'm guessing that most of the time, the compressed size will probably be very close to the original size.
SQL Server 2008 can perform page-level compression, which would be a somewhat more useful optimization, but you're on SQL Server 2005. So no, definitely don't bother trying to compress individual values or rows, it's not going to be worth the effort and may actually make things worse.
If you can upgrade to SQL Server 2008, I would recommend just turning on page compression, as detailed here: http://msdn.microsoft.com/en-us/library/cc280449.aspx
As an example, you can create a compressed table like this:
CREATE TABLE T1
(c1 int, c2 nvarchar(50) )
WITH (DATA_COMPRESSION = PAGE);
If you can't use compression in the database, unfortunately your strings (no more than 300 chars) are not going to be worthwhile to compress using something like System.IO.Compression. I suppose you could try it, though.
Compression will consume resources and typically will hurt performance where significant time is just local communication and processing.
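If you do want to measure it on your own data, here is a quick sketch with GZipStream; for strings of around 300 characters, don't be surprised if the "compressed" output comes out close to - or larger than - the input:

using System;
using System.IO;
using System.IO.Compression;
using System.Text;

// Compress a string with GZip and compare sizes, to see whether it is worth it
// for your typical ~300-character text blocks.
static void CompareSizes(string text)
{
    byte[] raw = Encoding.UTF8.GetBytes(text);

    byte[] compressed;
    using (var ms = new MemoryStream())
    {
        using (var gzip = new GZipStream(ms, CompressionMode.Compress))
        {
            gzip.Write(raw, 0, raw.Length);
        }
        // ToArray still works after the streams are closed.
        compressed = ms.ToArray();
    }

    Console.WriteLine("raw: {0} bytes, gzip: {1} bytes", raw.Length, compressed.Length);
}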
Not entirely clear on what you are asking.
In regard to performance - if you are compressing the strings in memory before storing them in the database, your program is going to be slower than if you just stuff the data straight into the table and let SQL worry about it later. The trade-off is that the SQL database will be larger, but 1 TB hard drives are cheap, so is storage really that big a deal?
Based on your numbers (200K rows by 300 bytes) you are only talking about roughly 60 MB. That is not a very large dataset. Have you considered using the Bulk Copy feature in ADO.NET (http://msdn.microsoft.com/en-us/library/7ek5da1a.aspx)? If all of your data goes into one table this should be fine.
This would be an alternative to having something like EF generating essentially 200K insert statements.
UPDATE
Here is another example: http://weblogs.sqlteam.com/mladenp/archive/2006/08/26/11368.aspx
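A minimal SqlBulkCopy sketch along those lines; the table and column names, and loading from an in-memory DataTable, are assumptions for illustration:

using System.Collections.Generic;
using System.Data;
using System.Data.SqlClient;

// Bulk-load the collected text blocks in one round trip instead of
// issuing 200K individual INSERT statements.
static void BulkInsert(string connectionString, IEnumerable<string> blocks)
{
    var table = new DataTable();
    table.Columns.Add("TextBlock", typeof(string));
    foreach (string block in blocks)
        table.Rows.Add(block);

    using (var bulk = new SqlBulkCopy(connectionString))
    {
        bulk.DestinationTableName = "dbo.TextBlocks";
        bulk.ColumnMappings.Add("TextBlock", "TextBlock");
        bulk.WriteToServer(table);
    }
}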
I wouldn't worry about compressing them. For strings this size (300 characters or so), it's going to be more of a headache than it's worth. Compressing strings takes time (no matter how small they are), and SQL Server 2005 does not have a native way of doing it, which means you are going to have to write something yourself. If you do this in the application it is going to hurt your performance. You could write a CLR routine to do it in the database, but it is still going to be an extra step to actually use the compressed string in your application (or any other application that uses it, for that matter).
Space in a database is cheap, so you aren't really saving much by compressing all the strings. Your biggest problem is going to be keeping a large number of strings in your application's memory. If you are routinely going back to the database to load some of them and not trying to cache all of them at the same time, I wouldn't worry about it unless you are actually seeing problems.
Sounds like you would benefit from using Large-Value Data Types
These data types will store up to 2^31-1 bytes of data
If all of your strings are smallish, there is a diminishing return to be gained by compressing them. Without native SQL compression, they will not be searchable anyway if you compress them.
It sounds like you are trying to solve a definitely non-relational problem with a relational database. Why exactly are you using a database? It can be done, of course, but some problems just don't fit well. TFS shows that you can brute-force a problem into an RDBMS once you throw enough hardware at it, but that doesn't make it a good idea.