Application
I am working on an MMO and have run into an issue. The server I've built is fast, sending messages to my game client locally every ~50 milliseconds over UDP sockets. Here is an example of a message from server to client through my current message system:
count={2}body={[t={agt}id={42231}pos={[50.40142117456183,146.3123192153775]}rot={200.0},t={agt}id={4946}pos={[65.83051652925558,495.25839757504866]}rot={187.0}}
count={2}, 2 = number of objects
[,] = array of objects
I built a simple text parser, code: http://tinypaste.com/af3fb928
I use the message like this:
int objects = int.Parse(UTL.Parser.DecodeMessage("count", message));
string body = UTL.Parser.DecodeMessage("body", message);
for (int i = 0; i < objects; i++)
{
string objectStr = UTL.Parser.DecodeMessage("[" + i + "]", body);
// parse objectStr with UTL.Parser.DecodeMessage to extract pos & rot and apply to objects
}
Issue
When I have more objects (roughly 60+), performance decreases dramatically.
Question
What is the standard method for packaging and reading messages between clients and server in MMOs or real-time online games?
A binary protocol would be a lot faster here. Take the coordinates that you pass in for example. When you transport those as bytes, they will take up 8 bytes per axis, whereas the string representation uses 2 bytes per character (unless you transport as ASCII but even then a binary double would be smaller).
Then actually turning that string into a number is a lot more work: first a substring has to be created (incurring garbage collection overhead later on), then the number has to be parsed, which isn't fast (though to be fair, you can still parse hundreds of thousands of doubles a second) because .NET's double.Parse is very general and has to accommodate a lot of different formats. You would gain noticeable speed by writing your own double parser if you stick with text, but like I said, you ought to go with binary if message parsing is a bottleneck. Turning bytes into doubles (with a little bit of unsafe magic, which should be fine since you're running a server and should have Full Trust) is a matter of copying 8 bytes.
If your messages are mostly just coordinates and such, I think you could gain a lot by creating your own binary format.
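For illustration, here is a minimal sketch of what such a binary format could look like using BinaryWriter/BinaryReader; the Agent type and field layout are assumptions mirroring the text message above, not an existing API:

using System.Collections.Generic;
using System.IO;

// Illustrative only: a hypothetical Agent matching the fields in the text format.
public class Agent
{
    public int Id;
    public double X, Y;
    public double Rotation;
}

public static class BinaryCodec
{
    public static byte[] Encode(IList<Agent> agents)
    {
        using (var ms = new MemoryStream())
        using (var writer = new BinaryWriter(ms))
        {
            writer.Write(agents.Count);      // 4 bytes
            foreach (var a in agents)
            {
                writer.Write(a.Id);          // 4 bytes
                writer.Write(a.X);           // 8 bytes instead of ~18 characters
                writer.Write(a.Y);           // 8 bytes
                writer.Write(a.Rotation);    // 8 bytes (could be a float)
            }
            return ms.ToArray();
        }
    }

    public static List<Agent> Decode(byte[] data)
    {
        using (var reader = new BinaryReader(new MemoryStream(data)))
        {
            int count = reader.ReadInt32();
            var agents = new List<Agent>(count);
            for (int i = 0; i < count; i++)
            {
                agents.Add(new Agent {
                    Id = reader.ReadInt32(),
                    X = reader.ReadDouble(),
                    Y = reader.ReadDouble(),
                    Rotation = reader.ReadDouble()
                });
            }
            return agents;
        }
    }
}

Each object is then a fixed 28 bytes, and decoding is a straight sequence of reads with no substring allocations or number parsing at all.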
One of your problems might be unnecessary string instantiation. Any time you do a concatenation, split, or get a substring, you are creating new string instances on the heap, which are copied from the original string and need to be collected afterwards.
Try to change your code to iterate through the string character by character, and parse the data using the original string only. You should only use indexing and IndexOf, and possibly even write your own int and float parsers that accept a string plus offset + length, to avoid creating substrings at all (see the sketch below). I am not sure if this is overkill, but it's not "premature" optimization if you have hard evidence that it runs slow.
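For example, a substring-free integer parser could look like this (a hypothetical helper, not part of the linked parser):

// Hypothetical helper: parses a non-negative integer directly out of
// `s` at [offset, offset + length) without allocating a substring.
public static int ParseInt(string s, int offset, int length)
{
    int value = 0;
    for (int i = offset; i < offset + length; i++)
    {
        char c = s[i];
        if (c < '0' || c > '9')
            throw new FormatException("Unexpected character: " + c);
        value = value * 10 + (c - '0');
    }
    return value;
}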
Also, did you try Protocol buffers? I believe their performance should be pretty good (just write a small console app for benchmark). Or with JSON, that's a standard concise format (but I have no clue about how optimized Json.NET is). Nothing should usually beat a hard coded specialized parser in terms of performance, but with future maintenance in mind, I would try one of these protocols before anything else.
I'd look into using an RPC framework like Thrift to do the lifting for you. It'll do the packing and parsing for you, and send stuff over the wire in binary so it's more efficient.
There are a bunch of other options, too. Here is a comparison of some.
Related
A WPF .NET 4.5 app that I have been developing, initially to work on small data volumes, now works on much larger data volumes, in the region of 1 million rows and more, and of course I started running out of memory. The data comes from an MS SQL database, and for processing it needs to be loaded into a local data structure. Because this data is then transformed / processed / referenced by CLR code, continuous and uninterrupted data access is required; however, not all data has to be loaded into memory straight away, only when it is actually accessed. As a small example, an Inverse Distance Interpolator uses this data to produce interpolated maps, and all data needs to be passed to it for continuous grid generation.
I have re-written some parts of the app for processing data, such as loading only x rows at any given time and implementing a sliding-window approach to data processing, which works. However, doing this for the rest of the app will require some time investment, and I wonder if there is a more robust and standard way of approaching this design problem (there has to be; I am not the first one).
tldr; Does C# provide any data structures or techniques for accessing large amounts of data in this way, so that it behaves like an IEnumerable but the data is not in memory until it is actually accessed or required, or is it completely up to me to manage memory usage? My ideal would be a structure that automatically implements a buffer-like mechanism, loading more data as that data is accessed and freeing memory from data that has already been accessed and is no longer of interest. Like some DataTable with an internal buffer, maybe?
As far as iterating through a very large data set that is too large to fit in memory goes, you can use a producer-consumer model. I used something like this when I was working with a custom data set that contained billions of records--about 2 terabytes of data total.
The idea is to have a single class that contains both producer and consumer. When you create a new instance of the class, it spins up a producer thread that fills a constrained concurrent queue. And that thread keeps the queue full. The consumer part is the API that lets you get the next record.
You start with a shared concurrent queue. I like the .NET BlockingCollection for this.
Here's an example that reads a text file and maintains a queue of 10,000 text lines.
public class TextFileLineBuffer
{
    private const int QueueSize = 10000;
    private BlockingCollection<string> _buffer = new BlockingCollection<string>(QueueSize);
    private CancellationTokenSource _cancelToken;
    private StreamReader _reader;

    public TextFileLineBuffer(string filename)
    {
        // File is opened here so that any exception is thrown on the calling thread.
        _reader = new StreamReader(filename);
        _cancelToken = new CancellationTokenSource();
        // start task that reads the file
        Task.Factory.StartNew(ProcessFile, TaskCreationOptions.LongRunning);
    }

    public string GetNextLine()
    {
        if (_buffer.IsCompleted)
        {
            // The buffer is empty because the file has been read
            // and all lines returned.
            // You can either call this an error and throw an exception,
            // or you can return null.
            return null;
        }

        // If there is a record in the buffer, it is returned immediately.
        // Otherwise, Take does a non-busy wait.
        // You might want to catch the OperationCanceledException here and return null
        // rather than letting the exception escape.
        return _buffer.Take(_cancelToken.Token);
    }

    private void ProcessFile()
    {
        while (!_reader.EndOfStream && !_cancelToken.Token.IsCancellationRequested)
        {
            var line = _reader.ReadLine();
            try
            {
                // This will block if the buffer already contains QueueSize records.
                // As soon as a space becomes available, this will add the record
                // to the buffer.
                _buffer.Add(line, _cancelToken.Token);
            }
            catch (OperationCanceledException)
            {
                // Cancelled: the loop condition will see the cancellation and exit.
            }
        }
        _buffer.CompleteAdding();
    }

    public void Cancel()
    {
        _cancelToken.Cancel();
    }
}
That's the bare bones of it. You'll want to add a Dispose method that will make sure that the thread is terminated and that the file is closed.
I've used this basic approach to good effect in many different programs. You'll have to do some analysis and testing to determine the optimum buffer size for your application. You want something large enough to keep up with the normal data flow and also handle bursts of activity, but not so large that it exceeds your memory budget.
IEnumerable modifications
If you want to support IEnumerable<T>, you have to make some minor modifications. I'll extend my example to support IEnumerable<String>.
First, you have to change the class declaration:
public class TextFileLineBuffer: IEnumerable<string>
Then, you have to implement GetEnumerator:
public IEnumerator<String> GetEnumerator()
{
    foreach (var s in _buffer.GetConsumingEnumerable())
    {
        yield return s;
    }
}

IEnumerator IEnumerable.GetEnumerator()
{
    return GetEnumerator();
}
With that, you can initialize the thing and then pass it to any code that expects an IEnumerable<string>. So it becomes:
var items = new TextFileLineBuffer(filename);
DoSomething(items);
void DoSomething(IEnumerable<string> list)
{
    foreach (var s in list)
        Console.WriteLine(s);
}
@Sergey The producer-consumer model (proposed by Jim Mischel) is probably your safest solution for complete scalability.
However, if you were to increase the room for the elephant (using your visual metaphor, which fits very well), then compression on the fly is a viable option: decompress when used and discard after use, leaving the core data structure compressed in memory. Obviously it depends on the data and how much it lends itself to compression, but there is a hell of a lot of room in most data structures. If you have ON and OFF flags for some metadata, these can be buried in the unused bits of 16/32-bit numbers, or at least held in bits rather than bytes; use 16-bit integers for lat / longs with a constant scaling factor to convert each to real numbers before use; strings can be compressed using WinZip-type libraries, or indexed so that only ONE copy is held and no duplicates exist in memory, etc. The flag-packing idea is sketched below.
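A minimal sketch of the flag-burying idea, assuming the payload value fits in the low 28 bits of a 32-bit number (the layout and names are made up for illustration):

const uint FlagA = 1u << 31;           // two metadata flags in the top bits
const uint FlagB = 1u << 30;
const uint PayloadMask = 0x0FFFFFFF;   // low 28 bits carry the actual value

static uint Pack(uint payload, bool a, bool b)
{
    uint packed = payload & PayloadMask;
    if (a) packed |= FlagA;
    if (b) packed |= FlagB;
    return packed;
}

static uint Payload(uint packed) { return packed & PayloadMask; }
static bool HasFlagA(uint packed) { return (packed & FlagA) != 0; }
static bool HasFlagB(uint packed) { return (packed & FlagB) != 0; }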
Decompression (albeit custom made) on the fly can be lightning fast.
This whole process can be very laborious, I admit, but it can definitely keep the room large enough as the elephant grows, in some instances. (Of course, it may never be good enough if the data is simply growing indefinitely.)
EDIT: Re any sources...
Hi @Sergey, I wish I could!! Truly! I have used this technique for data compression, and really the whole thing was custom designed on a whiteboard with one or two coders involved.
It's certainly not (all) rocket science, but it's good to fully scope out the nature of all the data; then you know (for example) that a certain figure will never exceed 9999, so you can choose how to store it in minimum bits and then allocate the leftover bits (assuming 32-bit storage) to other values. (A real-world example is the number of fingers a person has... loosely speaking you could set an upper limit at 8 or 10, although 12 is possible and even 20 is remotely feasible if they have extra fingers. You can see what I mean.) Lat / longs are the PERFECT example of numbers that will never cross logical boundaries (unless you use wrap-around values...). That is, they are always between -90 and +90 (just guessing which type of lat / longs), which is very easy to reduce / convert, as the range of values is so neat.
So we did not rely 'directly' on any third party literature. Only upon algorithms designed for specific types of data.
In other projects, for fast real-time DSP, the smarter coders (experienced game programmers) would convert floats to 16-bit ints and have a global scaling factor calculated to give maximum precision for the particular data stream (accelerometers, LVDTs, pressure gauges, etc.) being collected.
This reduced the transmitted AND stored data without losing ANY information. Similarly, for real time wave / signal data you could use (Fast) Fourier Transform to turn your noisy wave into its Amplitude, Phase and Spectrum components - literally half of the data values, without actually losing any (significant) data. (Within these algorithms, the data 'loss' is completely measurable - so you can decide if you are in fact losing data)
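A sketch of that float-to-16-bit scaling, using latitude as the example since its range is fixed at [-90, +90] (the scale constant is one reasonable choice, not from any particular project):

// Latitude is always in [-90, +90], so a constant scale factor maps it
// onto the full Int16 range; precision is roughly 0.003 degrees.
const double LatScale = 32767.0 / 90.0;

static short CompressLat(double lat)
{
    return (short)Math.Round(lat * LatScale);
}

static double DecompressLat(short packed)
{
    return packed / LatScale;
}

That stores each value in 2 bytes instead of 8; for sensor streams, the scale factor would instead be computed from the stream's observed range.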
Similarly there are algorithms like rainflow analysis (nothing to do with rain; more about cycles and frequency) which reduce your data a lot. Peak detection and vector analysis can be enough for some other signals, which basically throws out about 99% of the data... The list is endless, but the technique MUST be intimately suited to your data. And you may have many different types of data, each lending itself to a different 'reduction' technique. I'm sure you can google 'lossless data reduction' (although I think the term 'lossless' was coined by music processing and is a little misleading, since digital music has already lost the upper and lower frequency ranges... I digress)... Please post what you find (if of course you have the time / inclination to research this further).
I would be interested to discuss your meta data, perhaps a large chunk can be 'reduced' quite elegantly...
As stated in the title, I am evaluating the cost of implementing a BitArray over byte[] (I have understood that the native BitArray is pretty slow) instead of using a string representation of bits (e.g. "001001001"), but I am open to any suggestions that are more effective.
The length of the array is not known at design time, but I suppose it may be between 200 and 500 bits per array.
Memory is not a concern, so using a lot of memory to represent the array is not an issue; what matters is speed when the array is created and manipulated (they will be manipulated a lot).
Thanks in advance for your consideration and suggestions on the topic.
A few suggestions:
1) Computers don't process individual bits, so an int or long will work at the same speed.
2) For raw speed you can consider writing it with unsafe code.
3) new is expensive. If objects are created a lot, you can do the following: create a pool of 10K objects at a time and serve them from a method when required. Once the cache runs out, recreate it. Have another method so that once an object's processing completes, you clean it up and return it to the pool.
4) Make sure your manipulation is optimal; see the word-at-a-time sketch below.
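To illustrate points 1 and 4 without going all the way to unsafe code, here is a rough sketch of a bit array backed by ulong[], so that reads and writes touch whole 64-bit words (names are illustrative, not an existing API):

// Sketch of a simple bit set backed by ulong[]; set/get/clear address
// whole 64-bit words instead of individual bits.
public sealed class FastBitArray
{
    private readonly ulong[] _words;

    public FastBitArray(int bitCount)
    {
        _words = new ulong[(bitCount + 63) / 64];
    }

    public void Set(int index)
    {
        _words[index >> 6] |= 1UL << (index & 63);
    }

    public void Clear(int index)
    {
        _words[index >> 6] &= ~(1UL << (index & 63));
    }

    public bool Get(int index)
    {
        return (_words[index >> 6] & (1UL << (index & 63))) != 0;
    }

    // Bulk operations are where whole words pay off: 64 bits per iteration.
    // Assumes both arrays were created with the same bit count.
    public void AndWith(FastBitArray other)
    {
        for (int i = 0; i < _words.Length; i++)
            _words[i] &= other._words[i];
    }
}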
I know this question has been asked before, but I have a slightly different twist to it. Several have pointed out that this is premature optimization, which is entirely true if I were asking for practicality's sake and practicality's sake only. My problem is rooted in a practical problem, but I'm still curious nonetheless.
I'm creating a bunch of SQL statements to create a script (as in it will be saved to disk) to recreate a database schema (easily many many hundreds of tables, views, etc.). This means my string concatenation is append-only. StringBuilder, according to MSDN, works by keeping an internal buffer (surely a char[]) and copying string characters into it and reallocating the array as necessary.
However, my code has a lot of repeat strings ("CREATE TABLE [", "GO\n", etc.) which means I can take advantage of them being interned but not if I use StringBuilder since they would be copied each time. The only variables are essentially table names and such that already exist as strings in other objects that are already in memory.
So, as far as I can tell, after my data is read in and the objects holding the schema information are created, all my string information can be reused via interning, yes?
Assuming that, then wouldn't a List or LinkedList of strings be faster because they retain pointers to interned strings? Then it's only one call to String.Concat() for a single memory allocation of the whole string that is exactly the correct length.
A List would have to reallocate its string[] of interned pointers, and a linked list would have to create nodes and modify pointers, so they aren't "free" either, but if I'm concatenating many thousands of interned strings they would seem to be more efficient.
Now, I suppose I could come up with some heuristic on character counts for each SQL statement, count each type to get a rough idea, and pre-set my StringBuilder capacity to avoid reallocating its char[], but I would have to overshoot by a fair margin to reduce the probability of reallocating.
So for this case, which would be fastest to get a single concatenated string:
StringBuilder
List<string> of interned strings
LinkedList<string> of interned strings
StringBuilder with a capacity heuristic
Something else?
As a separate question to the above (I may not always go to disk): would a single StreamWriter to an output file be faster still? Alternatively, use a List or LinkedList and then write the strings to the file from the list instead of first concatenating in memory.
EDIT:
As requested, the reference (.NET 3.5) to MSDN. It says: "New data is appended to the end of the buffer if room is available; otherwise, a new, larger buffer is allocated, data from the original buffer is copied to the new buffer, then the new data is appended to the new buffer." That to me means a char[] that is reallocated to make it larger (which requires copying the old data to the resized array) before appending.
For your separate question, Win32 has a WriteFileGather function, which could efficiently write a list of (interned) strings to disk - but it would make a notable difference only when being called asynchronously, as the disk write will overshadow all but extremely large concatenations.
For your main question: unless you are reaching megabytes of script, or tens of thousands of scripts, don't worry.
You can expect StringBuilder to double the allocation size on each reallocation. That would mean growing a buffer from 256 bytes to 1MB is just 12 reallocations - quite good, given that your initial estimate was 3 orders of magnitude off the target.
Purely as an exercise, some estimates: building a buffer of 1MB will sweep roughly 3MB of memory (1MB source, 1MB target, 1MB due to copying during reallocation). A linked-list implementation will sweep about 2MB (and that's ignoring the 8-byte overhead per string reference). So you are saving 1MB of memory reads/writes, compared to a typical memory bandwidth of 10Gbit/s and a 1MB L2 cache.
Yes, a list implementation is potentially faster, and the difference would matter if your buffers are an order of magnitude larger.
For the much more common case of small strings, the algorithmic gain is negligible, and easily offset by other factors: the StringBuilder code is likely in the code cache already, and a viable target for microoptimizations. Also, using a string internally means no copy at all if the final string fits the initial buffer.
Using a linked list will also bring down the reallocation problem from O(number of characters) to O(number of segments) - your list of string references faces the same problem as a string of characters!
So, IMO the implementation of StringBuilder is the right choice, optimized for the common case, and degrades mostly for unexpectedly large target buffers. I'd expect a list implementation to degrade for very many small segments first, which is actually the extreme kind of scenario StringBuilder is trying to optimize for.
Still, it would be interesting to see a comparison of the two ideas, and when the list starts to be faster.
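A rough skeleton for such a comparison (the segment data here is made up; real SQL-script-shaped input would be more representative):

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Text;

class ConcatBenchmark
{
    static string WithStringBuilder(List<string> segments)
    {
        var sb = new StringBuilder();
        foreach (var s in segments)
            sb.Append(s);
        return sb.ToString();
    }

    static string WithListConcat(List<string> segments)
    {
        // The string[] overload of Concat sums the lengths first, so the
        // final string is allocated exactly once at the right size.
        return string.Concat(segments.ToArray());
    }

    static void Main()
    {
        var segments = new List<string>();
        for (int i = 0; i < 100000; i++)
            segments.Add("CREATE TABLE [T" + i + "]\nGO\n");

        var sw = Stopwatch.StartNew();
        WithStringBuilder(segments);
        Console.WriteLine("StringBuilder: " + sw.Elapsed);

        sw.Restart();
        WithListConcat(segments);
        Console.WriteLine("List + Concat: " + sw.Elapsed);
    }
}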
If I were implementing something like this, I would never build a StringBuilder (or any other in-memory buffer of your script).
I would just stream it out to your file instead, and make all strings inline.
Here's a minimal sketch (using a StreamWriter, since FileStream itself has no string-writing methods):
using (var f = new StreamWriter("yourscript.sql"))
{
    foreach (Table t in myTables)
    {
        f.Write("CREATE TABLE [");
        f.Write(t.ToString());
        f.Write("]");
        // ...
    }
}
Then you'll never need an in-memory representation of your script, with all the copying of strings.
Opinions?
In my experience, a properly allocated StringBuilder outperforms most everything else for large amounts of string data. It's worth wasting some memory, even, by overshooting your estimate by 20% or 30% in order to prevent reallocation. I don't currently have hard numbers to back it up using my own data, but take a look at this page for more.
However, as Jeff is fond of pointing out, don't prematurely optimize!
EDIT: As @Colin Burnett pointed out, the tests that Jeff conducted don't agree with Brian's tests, but the point of linking Jeff's post was about premature optimization in general. Several commenters on Jeff's page noted issues with his tests.
Actually StringBuilder uses an instance of String internally. String is in fact mutable within the System assembly, which is why StringBuilder can be built on top of it. You can make StringBuilder a wee bit more effective by assigning a reasonable length when you create the instance. That way you will eliminate/reduce the number of resize operations.
String interning works for strings that can be identified at compile time. Thus if you generate a lot of strings during the execution they will not be interned unless you do so yourself by calling the interning method on string.
Interning will only benefit you if your strings are identical. Nearly identical strings don't benefit from interning, so "SOMESTRINGA" and "SOMESTRINGB" will be two different strings even if they are interned.
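A quick demonstration of both points:

// Literals are interned at compile time; runtime-built strings are not,
// unless you intern them yourself.
string a = new string(new[] { 'G', 'O' });  // built at runtime
string b = "GO";                            // compile-time literal, interned
Console.WriteLine(ReferenceEquals(a, b));   // False: two copies in memory
string c = string.Intern(a);                // returns the interned "GO"
Console.WriteLine(ReferenceEquals(c, b));   // True: now a single shared copy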
If all (or most) of the strings being concatenated are interned, then your scheme MIGHT give you a performance boost, as it could potentially use less memory and could save a few large string copies.
However, whether or not it actually improves perf depends on the volume of data you are processing, because the improvement is in constant factors, not in the order of magnitude of the algorithm.
The only way to really tell is to run your app using both ways and measure the results. However, unless you are under significant memory pressure, and need a way to save bytes, I wouldn't bother and would just use string builder.
A StringBuilder doesn't use a char[] to store the data, it uses an internal mutable string. That means that there is no extra step to create the final string as it is when you concatenate a list of strings, the StringBuilder just returns the internal string buffer as a regular string.
The reallocations that the StringBuilder does to increase the capacity mean that the data is on average copied an extra 1.33 times. If you can provide a good estimate of the size when you create the StringBuilder, you can reduce that even further.
However, to get a bit of perspective, you should look at what it is that you are trying to optimise. What will take most of the time in your program is to actually write the data to disk, so even if you can optimise your string handling to be twice as fast as using a StringBuilder (which is very unlikely), the overall difference will still only be a few percent.
Have you considered C++ for this? Is there a library class that already builds T/SQL expressions, preferably one written in C++?
Slowest thing about strings is malloc. It takes 4KB per string on 32-bit platforms. Consider optimizing number of string objects created.
If you must use C#, I'd recommend something like this:
string varString1 = tableName;
string varString2 = tableName;
StringBuilder sb1 = new StringBuilder("const expression");
sb1.Append(varString1);
StringBuilder sb2 = new StringBuilder("const expression");
sb2.Append(varString2);
string resultingString = sb1.ToString() + sb2.ToString();
I would even go as far as letting the computer evaluate the best path for object instantiation with dependency injection frameworks, if perf is THAT important.
One of our staff members has lost his mailbox but luckily has a dump of his email in mbox format. I need to somehow get all the messages inside the mbox file and squirt them into our tech support database (as it's a custom tool, there are no import tools available).
I've found SharpMimeTools, which breaks down a message, but it does not let you iterate through a bunch of messages in an mbox file.
Does anyone know of a decent parser that's open source, so I don't have to learn the RFC and write one myself?
I'm working on a MIME & mbox parser in C# called MimeKit.
It's based on earlier MIME & mbox parsers I've written (such as GMime), which were insanely fast (able to parse every message in a 1.2GB mbox file in about 1 second).
I haven't tested MimeKit for performance yet, but I am using many of the same techniques in C# that I used in C. I suspect it'll be slower than my C implementation, but since the bottleneck is I/O and MimeKit is written to do optimal (4k) reads like GMime is, they should be pretty close.
The reasons you are finding your current approach (StreamReader.ReadLine(), combining the text, then passing it off to SharpMimeTools) to be slow are the following:
StreamReader.ReadLine() is not a very optimal way of reading data from a file. While I'm sure StreamReader() does internal buffering, it needs to do the following steps:
A) Convert the block of bytes read from the file into unicode (this requires iterating over the bytes in the byte[] read from disk to convert the bytes read from the stream into a unicode char[]).
B) Then it needs to iterate over its internal char[], copying each char into a StringBuilder until it finds a '\n'.
So right there, with just reading lines, you have at least 2 passes over your mbox input stream. Not to mention all of the memory allocations going on...
Then you combine all of the lines you've read into a single mega-string. This requires another pass over your input (copying every char from each string read from ReadLine() into a StringBuilder, presumably?).
We are now up to 3 iterations over the input text and no parsing has even happened yet.
Now you hand off your mega-string to SharpMimeTools which uses a SharpMimeMessageStream which... (/facepalm) is a ReadLine()-based parser that sits on top of another StreamReader that does charset conversion. That makes 5 iterations before anything at all is even parsed. SharpMimeMessageStream also has a way to "undo" a ReadLine() if it discovers it has read too far. So it is reasonable to assume that he is scanning over some of those lines at least twice. Not to mention all of the string allocations going on... ugh.
For each header, once SharpMimeTools has its line buffer, it splits into field & value. That's another pass. We are up to 6 passes so far.
SharpMimeTools then uses string.Split() (which is a pretty good indication that this mime parser is not standards compliant) to tokenize address headers by splitting on ',' and parameterized headers (such as Content-Type and Content-Disposition) by splitting on ';'. That's another pass. (We are now up to 7 passes.)
Once it splits those it runs a regex match on each string returned from the string.Split() and then more regex passes per rfc2047 encoded-word token before finally making another pass over the encoded-word charset and payload components. We're talking at least 9 or 10 passes over much of the input by this point.
I give up going any farther with my examination because it's already more than 2x as many passes as GMime and MimeKit need and I know my parsers could be optimized to make at least 1 less pass than they do.
Also, as a side-note, any MIME parser that parses strings instead of byte[] (or sbyte[]) is never going to be very good. The problem with email is that so many mail clients/scripts/etc. in the wild will send undeclared 8-bit text in headers and message bodies. How can a unicode string parser possibly handle that? Hint: it can't. Parsing an mbox with MimeKit looks like this:
using (var stream = File.OpenRead ("Inbox.mbox")) {
    var parser = new MimeParser (stream, MimeFormat.Mbox);

    while (!parser.IsEndOfStream) {
        var message = parser.ParseMessage ();

        // At this point, you can do whatever you want with the message.
        // As an example, you could save it to a separate file based on
        // the message subject:
        message.WriteTo (message.Subject + ".eml");

        // You also have the ability to get access to the mbox marker:
        var marker = parser.MboxMarker;

        // You can also get the exact byte offset in the stream where the
        // mbox marker was found:
        var offset = parser.MboxMarkerOffset;
    }
}
2013-09-18 Update: I've gotten MimeKit to the point where it is now usable for parsing mbox files and have successfully managed to work out the kinks, but it's not nearly as fast as my C library. This was tested on an iMac so I/O performance is not as good as it would be on my old Linux machine (which is where GMime is able to parse similar sized mbox files in ~1s):
[fejj@localhost MimeKit]$ mono ./mbox-parser.exe larger.mbox
Parsed 14896 messages in 6.16 seconds.
[fejj@localhost MimeKit]$ ./gmime-mbox-parser larger.mbox
Parsed 14896 messages in 3.78 seconds.
[fejj@localhost MimeKit]$ ls -l larger.mbox
-rw-r--r-- 1 fejj staff 1032555628 Sep 18 12:43 larger.mbox
As you can see, GMime is still quite a bit faster, but I have some ideas on how to improve the performance of MimeKit's parser. It turns out that C#'s fixed statements are quite expensive, so I need to rework my usage of them. For example, a simple optimization I did yesterday shaved about 2-3s from the overall time (if I remember correctly).
Optimization Update: Just improved performance by another 20% by replacing:
while (*inptr != (byte) '\n')
    inptr++;
with:
do {
    // XOR each 4-byte word with 0x0A0A0A0A so that any byte equal to '\n'
    // (0x0A) becomes zero, then apply the classic has-zero-byte trick:
    // (x - 0x01010101) & ~x & 0x80808080 is non-zero iff any byte of x is zero.
    mask = *dword++ ^ 0x0A0A0A0A;
    mask = ((mask - 0x01010101) & (~mask & 0x80808080));
} while (mask == 0);

// Back up to the word that matched, then scan byte-by-byte
// for the exact position of the '\n'.
inptr = (byte*) (dword - 1);
while (*inptr != (byte) '\n')
    inptr++;
Optimization Update: I was able to finally make MimeKit as fast as GMime by switching away from my use of Enum.HasFlag() and using direct bit masking instead.
MimeKit can now parse the same mbox stream in 3.78s.
For comparison, SharpMimeTools takes more than 20 minutes (to test this, I had to split the emails apart into separate files because SharpMimeTools can't parse mbox files).
Another Update: I've gotten it down to 3.00s flat via various other tweaks throughout the code.
I don't know of any parser, but mbox is really a very simple format. A new email begins on lines starting with "From " (From + space), and an empty line is attached to the end of each mail. Should there be any occurrence of "From " at the beginning of a line in the email itself, it is quoted out (by prepending a '>'). A naive splitter sketch follows below.
Also see Wikipedia's entry on the topic.
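A naive splitter along those lines might look like this (it does not undo the ">From " quoting, so it is only a starting point, not a compliant parser):

using System.Collections.Generic;
using System.IO;
using System.Text;

static IEnumerable<string> SplitMbox(string path)
{
    var current = new StringBuilder();
    foreach (var line in File.ReadLines(path))
    {
        // "From " at the start of a line marks the beginning of a new message.
        if (line.StartsWith("From ") && current.Length > 0)
        {
            yield return current.ToString();
            current.Clear();
        }
        current.AppendLine(line);
    }
    if (current.Length > 0)
        yield return current.ToString();
}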
If you can stretch to using Python, there is one in the standard library. I'm unable to find any for .NET sadly.
To read .mbox files, you can use a third-party library, Aspose.Email.
This library is a complete set of email-processing APIs for building cross-platform applications that can create, manipulate, convert, and transmit emails without using Microsoft Outlook.
Please take a look at the example I have provided below.
using (FileStream stream = new FileStream("ExampleMbox.mbox", FileMode.Open, FileAccess.Read))
{
    using (MboxrdStorageReader reader = new MboxrdStorageReader(stream, false))
    {
        // Start reading messages
        MailMessage message = reader.ReadNextMessage();

        // Read all messages in a loop
        while (message != null)
        {
            // Manipulate message - show contents
            Console.WriteLine("Subject: " + message.Subject);

            // Save this message in EML or MSG format
            message.Save(message.Subject + ".eml", SaveOptions.DefaultEml);
            message.Save(message.Subject + ".msg", SaveOptions.DefaultMsgUnicode);

            // Get the next message
            message = reader.ReadNextMessage();
        }
    }
}
It is easy to use. I hope this approach will satisfy you and other searchers.
I am working as a Developer Evangelist at Aspose.
What is it about the way strings are implemented that makes them so expensive to manipulate?
Is it impossible to make a "cheap" string implementation?
or am I completely wrong in my understanding?
Thanks
Which language?
Strings are typically immutable, meaning that any change to the data results in a new copy of the string being created. This can have a performance impact with large strings.
This is an important feature, however, because it allows for optimizations such as interning. Interning reduces the size of text data by pointing identical strings to the same copy of data.
If you are concerned about performance with strings, use a StringBuilder (available in C# and Java) or another construct that works with mutable text data.
If you are working with a large amount of text data and need a powerful string solution while still saving space, look into using ropes.
The problem with strings is that they are not primitive types. They are arrays.
Therefore, they suffer the same speed and memory problems as arrays (with a few optimizations, perhaps).
Now, a "cheap" implementation would still have to support lots of operations: concatenation, indexOf, etc.
There are many ways to do this. You can improve the implementation, but there are some limits. Because strings are not "natural" for computers, they need more memory and are slower to manipulate... ALWAYS. You'll never get a string concatenation algorithm faster than any decent integer sum algorithm.
Since Java creates a new copy of the String object every time, it is advisable to use StringBuffer.
Syntax
StringBuffer strBuff = new StringBuffer();
strBuff.append("StringBuffer");
strBuff.append("is");
strBuff.append("more");
strBuff.append("economical");
strBuff.append("than");
strBuff.append("String");
String string = strBuff.toString();
Many of the points here are well taken. In isolated cases you may be able to cheat and do things like using a 64-bit int to compare 8 bytes at a time in a string, but there are not a lot of generalized cases where you can optimize operations. If you have "pascal style" strings with a numeric length field, compares can be short-circuited: you only need to check the rest of the string if the lengths are the same. Other operations typically require you to handle the characters a byte at a time or completely copy them when you use them.
i.e. concatenation => get length of string 1, get length of string 2, allocate memory, copy string 1, copy string 2. It would be possible to do operations like this using a DMA controller in a string library, but the overhead of setting it up for small strings would outweigh the benefits.
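In C#, those concatenation steps written out over raw char arrays look like this (System.String does the equivalent internally, just hidden):

static char[] Concat(char[] s1, char[] s2)
{
    // allocate memory for the result, exactly once
    var result = new char[s1.Length + s2.Length];
    // copy string 1, then string 2 -- every character is touched
    Array.Copy(s1, 0, result, 0, s1.Length);
    Array.Copy(s2, 0, result, s1.Length, s2.Length);
    return result;
}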
Pete
It depends entirely on what you're trying to do with it. Mostly it's that it usually requires at least 1 new array allocation unless it's replacing a single character in a direct seek. At the simplest level a string is an array of chars. So just about anything you want to do involves iterating, removing, or inserting new things into an array.
Look into mutable strings, immutable strings, and ropes, and think about how you would implement common operations in a low-level language (say, C). Consider:
Concatenation.
Slicing.
Getting a character at an index.
Changing a character at an index.
Locating the index of a character.
Traversing the string.
Coming up with algorithms for these situations will give you a feel for when each type of storage would be appropriate.
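As a concrete starting point, here is a minimal rope sketch in C#: concatenation allocates a single node instead of copying characters, while indexing has to walk the tree (names are illustrative, not a library API):

public abstract class Rope
{
    public abstract int Length { get; }
    public abstract char CharAt(int index);

    // O(1) concatenation: no characters are copied, just one node allocated.
    public static Rope Concat(Rope left, Rope right)
    {
        return new ConcatNode(left, right);
    }
}

public sealed class LeafNode : Rope
{
    private readonly string _text;
    public LeafNode(string text) { _text = text; }
    public override int Length { get { return _text.Length; } }
    public override char CharAt(int index) { return _text[index]; }
}

public sealed class ConcatNode : Rope
{
    private readonly Rope _left, _right;
    public ConcatNode(Rope left, Rope right) { _left = left; _right = right; }
    public override int Length { get { return _left.Length + _right.Length; } }

    // Indexing walks the tree: O(depth) instead of O(1).
    public override char CharAt(int index)
    {
        return index < _left.Length
            ? _left.CharAt(index)
            : _right.CharAt(index - _left.Length);
    }
}

Slicing and traversal fall out similarly; the tradeoff versus a flat array shows up exactly in the random-access and in-place-change cases.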
If you want a universal string that works in every condition, you have to sacrifice efficiency in some cases. This is a classic tradeoff between making one thing fast and another. So... either you use a "standard" string that works properly (but not optimally), or a string implementation which is very fast in some cases and cumbersome in others.
Sometimes you need immutability, sometimes random access, sometimes quick insertions/deletions...
Changes and copying of strings tends to involve memory management.
Memory management is not good for performance since it tends to require some kind of global mutex that makes your code scale poorly to multiple cores.
You want to read this Joel Spolsky article:
http://www.joelonsoftware.com/articles/fog0000000319.html
Me, I'm disappointed .NET doesn't have a native type called F***edString.