basic C# hex editor performance issue [duplicate] - c#

This question already has answers here:
Performance issues with nested loops and string concatenations
(8 answers)
Closed 8 years ago.
I'm having an issue with the speed of a simple hex editor I've been working on.
I'm using a background worker, a simple for/foreach loop, and a couple of simple statements, but it's still far slower than modern hex editors.
This is the main loop that is taking too long to finish:
for (int i = 0; i < buffer.Count() - 1; i++)
{
string hex = Convert.ToString(buffer[i], 16);
hexstring += ((hex.Length == 1 ? hex = "0" + hex : hex = hex)) + " ";
double x = ((double)i/(double)buffer.Count());
bw.ReportProgress((int)(x * 100));
}
I know this could be written a million times better, but I'm curious what's causing the delay.
A 1 MB .exe takes 5 minutes at 50%+ CPU usage, which is far from acceptable. Any thoughts?
Edit 1: buffer is just a byte[]; here is its only other usage:
buffer = File.ReadAllBytes(((string[]) e.Data.GetData(DataFormats.FileDrop, false))[0]);

I hate to be "that guy" in this case, but you're reinventing a built-in wheel. There is a function in .NET which enables converting a byte array to a hex string. All you need is love, err this:
string hex = BitConverter.ToString(buffer);
I suppose this doesn't answer your question of why your solution is slow. Your solution is primarily slow because of string immutability. Strings are immutable (read-only) and when you concatenate them (AKA combine them with + or += operators) you create a new object. You're creating 3, sometimes 4 strings per loop, which is not cheap since they take up memory and the garbage collector has to eventually collect them. You can avoid this by using a StringBuilder which floats a buffer under the hood when appending strings (vs creating new ones). Also, if the buffer is large, it's going to take a while - sort of the nature of the beast (more operations take longer). Hope this helps!
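For reference, BitConverter.ToString separates the bytes with hyphens and uses uppercase digits, so a small adjustment is needed to get the space-separated layout from the question. A minimal sketch (the HexText class and the ToSpacedHex name are made up for illustration):

```csharp
using System;

static class HexText
{
    // BitConverter.ToString yields "4D-5A-90"; swap the hyphens for spaces.
    public static string ToSpacedHex(byte[] buffer)
    {
        return BitConverter.ToString(buffer).Replace('-', ' ');
    }
}
```

Calling HexText.ToSpacedHex(new byte[] { 0x4D, 0x5A, 0x90 }) gives "4D 5A 90". Unlike the loop in the question, the digits are uppercase and there is no trailing space.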

The reason is your use of the += operator to concatenate strings.
Each time you do that, it will copy all the previous content of the string and the added content into a new string. Each time there will be more and more data to move. At the end of the loop it will move 6 MB of data each iteration.
When you are done with creating the string for the 1 MB of data, you will have copied 3 TB of data. That is a little more than there is available RAM, so a whole bunch of garbage collections also had to be done to clean up old strings and make room for new ones.
If you use a StringBuilder instead, you will see a dramatic change in performance.
Next thing to improve would be to report the progress a little less often. You could for example do that for every kilobyte processed instead of every byte.
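Both suggestions can be sketched together as follows. This is only an illustration under a few assumptions: HexDump and Build are hypothetical names, reportProgress stands in for bw.ReportProgress from the question, and the loop covers every byte (the original stopped one byte short).

```csharp
using System;
using System.Text;

static class HexDump
{
    // Builds "4d 5a 90 " style output. reportProgress is a stand-in for
    // bw.ReportProgress and is called roughly once per kilobyte.
    public static string Build(byte[] buffer, Action<int> reportProgress)
    {
        // Each byte becomes two hex digits plus a space, so size the
        // builder up front to avoid repeated growth.
        var sb = new StringBuilder(buffer.Length * 3);
        for (int i = 0; i < buffer.Length; i++)
        {
            sb.Append(buffer[i].ToString("x2")).Append(' ');
            if (i % 1024 == 0)
            {
                reportProgress((int)((long)i * 100 / buffer.Length));
            }
        }
        return sb.ToString();
    }
}
```

With a BackgroundWorker this could be called as HexDump.Build(buffer, bw.ReportProgress).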

Related

Time complexity for C# StringBuilder initialized with string [duplicate]

How does StringBuilder work?
What does it do internally? Does it use unsafe code?
And why is it so fast (compared to the + operator)?
When you use the + operator to build up a string:
string s = "01";
s += "02";
s += "03";
s += "04";
then on the first concatenation we make a new string of length four and copy "01" and "02" into it -- four characters are copied. On the second concatenation we make a new string of length six and copy "0102" and "03" into it -- six characters are copied. On the third concat, we make a string of length eight and copy "010203" and "04" into it -- eight characters are copied. So far a total of 4 + 6 + 8 = 18 characters have been copied for this eight-character string. Keep going.
...
s += "99";
On the 98th concat we make a string of length 198 and copy "010203...98" and "99" into it. That gives us a total of 4 + 6 + 8 + ... + 198 = a lot, in order to make this 198 character string.
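To put a number on "a lot", the copying can be tallied with a few lines of code. This is a sketch that only does the bookkeeping described above; ConcatCost and CharsCopied are hypothetical names.

```csharp
static class ConcatCost
{
    // Counts the characters copied when "01" is extended by `concats`
    // two-character pieces using +=, one new string per step.
    public static long CharsCopied(int concats)
    {
        long total = 0;
        int length = 2;            // length of "01"
        for (int k = 0; k < concats; k++)
        {
            length += 2;           // the new string holds the old text plus the new piece
            total += length;       // every character of the new string is copied
        }
        return total;
    }
}
```

CharsCopied(3) returns 18, matching the 4 + 6 + 8 above, and CharsCopied(98) returns 9898 for the 198-character result.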
A string builder doesn't do all that copying. Rather, it maintains a mutable array that is hoped to be larger than the final string, and stuffs new things into the array as necessary.
What happens when the guess is wrong and the array gets full? There are two strategies. In the previous version of the framework, the string builder reallocated and copied the array when it got full, and doubled its size. In the new implementation, the string builder maintains a linked list of relatively small arrays, and appends a new array onto the end of the list when the old one gets full.
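The second strategy can be illustrated with a toy class. This is not the framework's code: ChunkedBuilder is a made-up name, the chunk size of 16 is arbitrary, and a List of arrays stands in for the linked list.

```csharp
using System;
using System.Collections.Generic;

// Toy illustration only; not how System.Text.StringBuilder is written.
sealed class ChunkedBuilder
{
    private const int ChunkSize = 16;   // arbitrary size chosen for the sketch
    private readonly List<char[]> _chunks = new List<char[]> { new char[ChunkSize] };
    private int _usedInLast;            // characters used in the last chunk

    public int Length => (_chunks.Count - 1) * ChunkSize + _usedInLast;

    public ChunkedBuilder Append(string value)
    {
        foreach (char c in value)
        {
            if (_usedInLast == ChunkSize)
            {
                // Full chunks stay where they are; only a new one is added.
                _chunks.Add(new char[ChunkSize]);
                _usedInLast = 0;
            }
            _chunks[_chunks.Count - 1][_usedInLast++] = c;
        }
        return this;
    }

    public override string ToString()
    {
        // The only full copy of the text happens here, once.
        var result = new char[Length];
        int offset = 0;
        for (int i = 0; i < _chunks.Count; i++)
        {
            int count = (i == _chunks.Count - 1) ? _usedInLast : ChunkSize;
            Array.Copy(_chunks[i], 0, result, offset, count);
            offset += count;
        }
        return new string(result);
    }
}
```

The point of the design is that appending never moves text that has already been written, whereas the doubling strategy copies everything each time the array fills up.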
Also, as you have conjectured, the string builder can do tricks with "unsafe" code to improve its performance. For example, the code which writes the new data into the array can already have checked that the array write is going to be within bounds. By turning off the safety system it can avoid the per-write check that the jitter might otherwise insert to verify that every write to the array is safe. The string builder does a number of these sorts of tricks to do things like ensuring that buffers are reused rather than reallocated, ensuring that unnecessary safety checks are avoided, and so on. I recommend against these sorts of shenanigans unless you are really good at writing unsafe code correctly, and really do need to eke out every last bit of performance.
StringBuilder's implementation has changed between versions, I believe. Fundamentally though, it maintains a mutable structure of some form. I believe it used to use a string which was still being mutated (using internal methods) and would just make sure it would never be mutated after it was returned.
The reason StringBuilder is faster than using string concatenation in a loop is precisely because of the mutability - it doesn't require a new string to be constructed after each mutation, which would mean copying all the data within the string etc.
For just a single concatenation, it's actually slightly more efficient to use + than to use StringBuilder. It's only when you're performing multiple operations and you don't really need the intermediate results that StringBuilder shines.
See my article on StringBuilder for more information.
The Microsoft CLR does do some operations with internal call (not quite the same as unsafe code). The biggest performance benefit over a bunch of + concatenated strings is that it writes to a char[] and doesn't create as many intermediate strings. When you call ToString(), it builds a completed, immutable string from your contents.
The StringBuilder uses a string buffer that can be altered, compared to a regular String that can't be. When you call the ToString method of the StringBuilder it will just freeze the string buffer and convert it into a regular string, so it doesn't have to copy all the data one extra time.
As the StringBuilder can alter the string buffer, it doesn't have to create a new string value for each and every change to the string data. When you use the + operator, the compiler turns that into a String.Concat call that creates a new string object. This seemingly innocent piece of code:
str += ",";
compiles into this:
str = String.Concat(str, ",");

Can you speed up this algorithm? C# / C++ [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 7 years ago.
Hey I've been working on something from time to time and it has become relatively large now (and slow). However I managed to pinpoint the bottleneck after close up measuring of performance in function of time.
Say I want to "permute" the string "ABC". What I mean by "permute" is not quite a permutation but rather a continuous substring set following this pattern:
A
AB
ABC
B
BC
C
I have to check for every substring if it is contained within another string S2 so I've done some quick'n dirty literal implementation as follows:
for (int i = 0; i < strlen1; i++)
{
    // j is the substring length; it cannot run past the end of str1
    for (int j = 1; j <= strlen1 - i; j++)
    {
        sub = str1.Substring(i, j);
        if (str2.Contains(sub)) { /* do stuff */ }
        else break;
    }
}
This was very slow initially, but then I realised that if the first part doesn't exist, there is no need to check the subsequent ones: if sub isn't contained within str2, I can break out of the inner loop.
OK, this gave blazing fast results, but calculating my algorithm's complexity I realised that in the worst case this will be N^4? I forgot that str.Contains() and str.Substring() both have their own complexities (N or N^2, I forget which).
The fact that I have a huge number of calls to those inside a second for loop makes it perform rather... well, N^4, enough said.
However, I calculated the average run time mathematically, using probability theory to evaluate the probability of the substring growing in a pool of randomly generated strings (this was my baseline), measuring when the probability became > 0.5 (50%).
This showed an exponential relationship between the number of different characters and the string length (roughly), which means that in the scenarios where I use my algorithm the length of string1 will most probably never exceed 7.
Thus the average complexity would be ~O(N * M), where N is the length of string 1 and M is the length of string 2. Because I tested N as a function of constant M, I got linear growth, ~O(N) (not bad compared to N^4, eh?).
I did time testing and plotted a graph which showed nearly perfect linear growth, so my actual results match my mathematical predictions (yay!).
However, this did NOT take into account the cost of string.Contains() and string.Substring(), which made me wonder whether this could be optimized even further.
I've also been thinking of writing this in C++ because I need rather low-level stuff. What do you guys think? I have put a lot of time into analysing this; I hope I've explained everything clearly enough :)
Your question is tagged both C++ and C#.
In C++ the optimal solution will be to use iterators and std::search. The original strings remain unmodified, and no intermediate objects get created. There won't be an equivalent of your Substring() taking place at all, so this eliminates that part of the overhead.
This should achieve the theoretically-best performance: brute force search, testing all permutations, with no intermediate object construction or destruction, other than the iterators themselves, which simply replace your two int index variables. I can't think of any faster way of implementing this basic algorithm.
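For readers staying in C#, a roughly comparable approach is to work with spans, which are views over the original strings rather than copies. The following is a sketch, assuming a runtime where ReadOnlySpan<char> and the MemoryExtensions methods are available (.NET Core 2.1 or later, or the System.Memory package); SubstringScan and CountMatches are hypothetical names, and the counter stands in for the question's "do stuff".

```csharp
using System;

static class SubstringScan
{
    // Counts the contiguous pieces of str1 (A, AB, ABC, B, BC, C ...) that
    // occur in str2, ending each run at the first miss as in the question.
    public static int CountMatches(string str1, string str2)
    {
        ReadOnlySpan<char> haystack = str2.AsSpan();
        int matches = 0;
        for (int i = 0; i < str1.Length; i++)
        {
            for (int len = 1; len <= str1.Length - i; len++)
            {
                // A slice is a view over str1; no new string is allocated.
                ReadOnlySpan<char> piece = str1.AsSpan(i, len);
                if (haystack.IndexOf(piece) < 0) break;
                matches++;   // placeholder for "do stuff"
            }
        }
        return matches;
    }
}
```

The search itself is still brute force; only the per-iteration Substring allocation is removed. The comparison is ordinal, which matches string.Contains(string).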
Are you testing one string against one string? If you test a bunch of strings against another bunch of strings, it is a whole different story. Even if you have the best algorithm for comparing one string against another, O(X), it does not mean that repeating it M*N times gives you the best algorithm for processing M strings against N.
When I made something similar, I built a dictionary of all substrings of all N strings:
Dictionary<string, List<int>>
The string key is a substring and each int is the index of a string that contains that substring. Then I tested all substrings of all M strings against it. The running time was suddenly not O(M*N*X) but O(max(M,N)*S), where S is the number of substrings of one string. Depending on M, N, X and S, that may be faster. I'm not saying the dictionary of substrings is the best approach; I just want to point out that you should always try to see the whole picture.
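Building such a dictionary might look like the sketch below. The SubstringIndex class and Build method are hypothetical names, and the answer does not say how repeated substrings were handled; here each string index is recorded at most once per substring.

```csharp
using System.Collections.Generic;

static class SubstringIndex
{
    // Maps every contiguous substring to the indexes of the strings containing it.
    public static Dictionary<string, List<int>> Build(IList<string> strings)
    {
        var index = new Dictionary<string, List<int>>();
        for (int s = 0; s < strings.Count; s++)
        {
            string text = strings[s];
            for (int i = 0; i < text.Length; i++)
            {
                for (int len = 1; len <= text.Length - i; len++)
                {
                    string sub = text.Substring(i, len);
                    if (!index.TryGetValue(sub, out var owners))
                    {
                        owners = new List<int>();
                        index[sub] = owners;
                    }
                    // Strings are processed in order, so a repeat of the same
                    // index can only be the last entry in the list.
                    if (owners.Count == 0 || owners[owners.Count - 1] != s)
                    {
                        owners.Add(s);
                    }
                }
            }
        }
        return index;
    }
}
```

A lookup such as index.TryGetValue("BC", out var owners) then answers "which strings contain BC?" without scanning them again. The trade-off is memory: the number of substrings grows quadratically with the length of each string.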

Use the += operator to concat strings [duplicate]

This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
String vs StringBuilder
Why should I not use += to concat strings?
What is the quickest alternative?
Strings are immutable in .NET, which means that once they exist, they cannot be changed.
The StringBuilder is designed to mitigate this issue by allowing you to append to a pre-allocated character array of size n (the default is 16, I think). However, once the StringBuilder exceeds that capacity, it needs to allocate a bigger buffer and copy the content into it, thus creating a possibly bigger problem.
What this boils down to is premature optimization. Unless you're noticing issues with string concatenation using too much memory, worrying about it is useless.
+= and String1 = String1 + String2 do the same thing: they copy both strings in full into a new one.
If you do this in a loop, lots of memory allocations are generated, leading to poor performance.
If you want to build long strings, you should look into the StringBuilder class, which is optimized for such operations.
In short: concatenating a few strings shouldn't hurt performance much, but building a large string by adding small bits in a loop will slow you down a lot and/or use lots of memory.
Another interesting article on String performance: http://www.codeproject.com/Articles/3377/Strings-UNDOCUMENTED

Why does appending to TextBox.Text during a loop take up more memory with each iteration?

Short Question
I have a loop that runs 180,000 times. At the end of each iteration it is supposed to append the results to a TextBox, which is updated real-time.
Using MyTextBox.Text += someValue is causing the application to eat huge amounts of memory, and it runs out of available memory after a few thousand records.
Is there a more efficient way of appending text to a TextBox.Text 180,000 times?
Edit I really don't care about the result of this specific case, however I want to know why this seems to be a memory hog, and if there is a more efficient way to append text to a TextBox.
Long (Original) Question
I have a small app which reads a list of ID numbers in a CSV file and generates a PDF report for each one. After each pdf file is generated, the ResultsTextBox.Text gets appended with the ID Number of the report that got processed and that it was successfully processed. The process runs on a background thread, so the ResultsTextBox gets updated real-time as items get processed
I am currently running the app against 180,000 ID numbers, however the memory the application is taking up is growing exponentially as time goes by. It starts by around 90K, but by about 3000 records it is taking up roughly 250MB and by 4000 records the application is taking up about 500 MB of memory.
If I comment out the update to the Results TextBox, the memory stays relatively stationary at roughly 90K, so I can assume that writing ResultsText.Text += someValue is what is causing it to eat memory.
My question is, why is this? What is a better way of appending data to a TextBox.Text that doesn't eat memory?
My code looks like this:
try
{
report.SetParameterValue("Id", id);
report.ExportToDisk(ExportFormatType.PortableDocFormat,
string.Format(@"{0}\{1}.pdf", new object[] { outputLocation, id}));
// ResultsText.Text += string.Format("Exported {0}\r\n", id);
}
catch (Exception ex)
{
ErrorsText.Text += string.Format("Failed to export {0}: {1}\r\n",
new object[] { id, ex.Message });
}
It should also be worth mentioning that the app is a one-time thing and it doesn't matter that it is going to take a few hours (or days :)) to generate all the reports. My main concern is that if it hits the system memory limit, it will stop running.
I'm fine with leaving the line updating the Results TextBox commented out to run this thing, but I would like to know if there is a more memory efficient way of appending data to a TextBox.Text for future projects.
I suspect the reason the memory usage is so large is because textboxes maintain a stack so that the user can undo/redo text. That feature doesn't seem to be required in your case, so try setting IsUndoEnabled to false.
Use TextBox.AppendText(someValue) instead of TextBox.Text += someValue. It's easy to miss since it's on TextBox, not TextBox.Text. Like StringBuilder, this will avoid creating copies of the entire text each time you add something.
It would be interesting to see how this compares to the IsUndoEnabled flag from keyboardP's answer.
Don't append directly to the text property. Use a StringBuilder for the appending, then when done, set the .text to the finished string from the stringbuilder
Instead of using a text box I would do the following:
Open up a text file and stream the errors to a log file just in case.
Use a list box control to represent the errors to avoid copying potentially massive strings.
Personally, I always use string.Concat*. I remember reading a question here on Stack Overflow years ago that had profiling statistics comparing the commonly used methods, and I seem to recall that string.Concat won out.
Nonetheless, the best I can find is this reference question and this specific String.Format vs. StringBuilder question, which mentions that String.Format uses a StringBuilder internally. This makes me wonder if your memory hog lies elsewhere.
*Based on James' comment, I should mention that I never do heavy string formatting, as I focus on web-based development.
Maybe reconsider the TextBox? A ListBox holding string Items will probably perform better.
But the main problem seems to be the requirements. Showing 180,000 items cannot be aimed at a (human) user, and neither can changing it in "real time".
The preferable way would be to show a sample of the data or a progress indicator.
If you do want to dump it on the poor user, batch the string updates. No user can discern more than 2 or 3 changes per second, so if you produce 100 per second, make groups of 50.
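Batching can be kept separate from the UI code. In the sketch below, BatchedLog is a hypothetical helper and the flush callback is a stand-in for whatever actually updates the control (for example a marshalled call to AppendText); the group size of 50 follows the figure above.

```csharp
using System;
using System.Text;

// Collects lines and hands them to `flush` in groups; `flush` is a
// stand-in for the code that updates the UI.
sealed class BatchedLog
{
    private readonly StringBuilder _pending = new StringBuilder();
    private readonly Action<string> _flush;
    private readonly int _batchSize;
    private int _count;

    public BatchedLog(Action<string> flush, int batchSize = 50)
    {
        _flush = flush;
        _batchSize = batchSize;
    }

    public void Add(string line)
    {
        _pending.AppendLine(line);
        if (++_count >= _batchSize) Flush();
    }

    // Call once more at the end so the last partial batch is not lost.
    public void Flush()
    {
        if (_count == 0) return;
        _flush(_pending.ToString());
        _pending.Clear();
        _count = 0;
    }
}
```

Adding 120 lines with the default size triggers two updates; a final Flush() delivers the remaining 20.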
Some responses have alluded to it, but nobody has outright stated it which is surprising.
Strings are immutable which means a String cannot be modified after it is created. Therefore, every time you concatenate to an existing String, a new String Object needs to be created. The memory associated with that String Object also obviously needs to be created, which can get expensive as your Strings become larger and larger. In college, I once made the amateur mistake of concatenating Strings in a Java program that did Huffman coding compression. When you're concatenating extremely large amounts of text, String concatenation can really hurt you when you could have simply used StringBuilder, as some in here have mentioned.
Use the StringBuilder as suggested.
Try to estimate the final string size then use that number when instantiating the StringBuilder. StringBuilder sb = new StringBuilder(estSize);
When updating the TextBox just use assignment eg: textbox.text = sb.ToString();
Watch for cross-thread operations as above. However, use BeginInvoke; there is no need to block the background thread while the UI updates.
A) Intro: already mentioned, use StringBuilder
B) Point: don't update too frequently, i.e.
DateTime dtLastUpdate = DateTime.MinValue;
while (condition)
{
DoSomeWork();
if (DateTime.Now - dtLastUpdate > TimeSpan.FromSeconds(2))
{
_form.Invoke((Action)(() => { textBox.Text = myStringBuilder.ToString(); }));
dtLastUpdate = DateTime.Now;
}
}
C) If that's a one-time job, use the x64 architecture to avoid hitting the 2 GB limit.
Keeping a StringBuilder in the ViewModel and binding its output to MyTextBox.Text will avoid the string rebinding mess. This should improve performance many times over and reduce memory usage.
Something that has not been mentioned is that even if you're performing the operation in the background thread, the update of the UI element itself HAS to happen on the main thread itself (in WinForms anyway).
When updating your textbox, do you have any code that looks like
if (textbox.Dispatcher.CheckAccess()) {
    textbox.Text += "whatever";
} else {
    textbox.Dispatcher.Invoke(...);
}
If so, then your background op is definitely being bottlenecked by the UI Update.
I would suggest that your background op use StringBuilder as noted above, but instead of updating the textbox every cycle, try updating it at regular intervals to see if it increases performance for you.
EDIT NOTE: I have not used WPF.
You say memory grows exponentially. No, it is a quadratic growth, i.e. a polynomial growth, which is not as dramatic as an exponential growth.
You are creating strings holding the following number of items:
1 + 2 + 3 + 4 + 5 ... + n = (n^2 + n) /2.
With n = 180,000 you get total memory allocation for 16,200,090,000 items, i.e. 16.2 billion items! This memory will not be allocated at once, but it is a lot of cleanup work for the GC (garbage collector)!
Also, bear in mind, that the previous string (which is growing) must be copied into the new string 179,999 times. The total number of copied bytes goes with n^2 as well!
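The figure can be checked with a one-line formula. Note that the result does not fit in a 32-bit int, so the sketch uses long throughout (GrowthEstimate and TotalItems are hypothetical names):

```csharp
static class GrowthEstimate
{
    // 1 + 2 + ... + n, computed in 64-bit arithmetic to avoid overflow.
    public static long TotalItems(long n)
    {
        return n * (n + 1) / 2;
    }
}
```

GrowthEstimate.TotalItems(180000) returns 16200090000, the 16.2 billion quoted above.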
As others have suggested, use a ListBox instead. There you can append new strings without creating a huge string. A StringBuilder does not help, since you want to display the intermediate results as well.

Strings joining and complexity?

When I need to join two strings I use String.Format (or StringBuilder if it happens in several places in the code).
I see that some good programmers don't pay attention to the complexity of joining strings and just use the '+' operator.
I know that using the '+' operator makes the application use more memory, but what about complexity?
This is an excellent article about the different string join methods by our own Jeff Atwood on Coding Horror:
(source: codinghorror.com)
The Sad Tragedy of Micro-Optimization Theater
Here is the gist of the post.
[several string join methods shown]
Take your itchy little trigger finger off that compile key and think about this for a minute. Which one of these methods will be faster?
Got an answer? Great!
And.. drumroll please.. the correct answer:
It. Just. Doesn't. Matter!
This answer assumes you are talking about the runtime complexity.
Using + creates a new string object, which means the contents of both the old string objects must be copied into the new one. With a large amount of concatenation, such as in a tight loop, this can turn into an O(n^2) operation.
As an informal proof, say you had the following code:
string foo = "a";
for(int i = 0; i < 1000; i++)
{
foo += "a";
}
In the first iteration of the loop, the contents of foo ("a") are copied into a new string object, followed by the contents of the literal "a". That's two copies. The second iteration has three copies: two from the new foo, and one from the literal "a". The 1000th iteration will have 1001 copy operations. The total number of copies is 2 + 3 + ... + 1001. In general, if you concatenate one character per iteration (starting from a one-character string) and the number of iterations is n, there will be 2 + 3 + ... + (n + 1) copies. That sum equals (n^2 + 3n)/2, which differs from 1 + 2 + 3 + ... + n = n(n+1)/2 only by n, so it is O(n^2).
Depends on the situation. The + can sometimes reduce the complexity of the code. Consider the following code:
output = "<p>" + intro + "</p>";
That is a good, clear line. No String.Format required.
If you use + only once, you have no disadvantage from it and it increases readability (as Colin Pickard already stated).
As far as I know + means: take left operand and right operand and copy them into a new buffer (as strings are immutable).
So using + two times (as in Colin Pickard's example) already creates 2 temporary strings: first when adding "<p>" to intro, and then when adding "</p>" to the newly created string.
You have to consider for yourself when to use which method. Even for a small example like seen above performance drop can be serious if intro is a large enough string.
Unless your application is very string-intensive (profile, profile, profile!), this doesn't really matter. Good programmers put readability above performance for mundane operations.
I think in terms of complexity you trade reiteration of newly created strings for parsing format string.
For example, "A" + "B" + "C" + "D" means that you would have to copy "A", "AB", and at last "ABC" in order to form "ABCD". Copying is reiteration, right? So if, for example, you have a 1000-character string to which you add a thousand one-character strings, you will copy a (1000+N)-character string 1000 times. That leads to O(n^2) complexity in the worst case.
String.Format, even considering parsing, and StringBuilder should be around O(n).
Because strings are immutable in languages like Java and C#, everytime two strings are concatenated a new string has to be created, in which the contents of the two old strings are copied.
Assume strings which are on average c characters long.
Now the first concatenation only has to copy 2*c characters, but the last one has to copy the concatenation of the first n-1 strings, which is (n-1)*c characters long, and the last one itself, which is c characters long, for a total of n*c characters. For n concatenations this makes n^2*c/2 character copies, which means an algorithmic complexity of O(n^2).
In most cases in practice however this quadratic complexity will not be noticeable (as Jeff Atwood shows in the blog entry linked to by Robert C. Cartaino) and I'd advise to just write the code as readable as possible.
There are cases however when it does matter, and using O(n^2) in such cases may be deadly.
In practice I've seen this for example for generating big Word XML files in memory, including base64 encoded pictures. This generation used to take over 10 minutes due to using O(n^2) string concatenation. After I replaced concatenation using + with StringBuilder the running time for the same document reduced below 10 seconds.
Similarly I've seen a piece of software that generated an epically big piece of SQL code as a string using + for concatenation. I haven't even waited till this finished (had been waiting for over an hour already), but just rewrote it using StringBuilder. This faster version finished within a minute.
In short, just do whatever is most readable / easiest to write and only think about this when you'll be creating a freaking huge string :-)
StringBuilder should be used if you are building a large string in several steps. It also helps if you know roughly how large the string will eventually be: you can then initialize the builder with the size you need and prevent costly re-allocations. For small operations the performance loss from using the + operator is negligible, and it results in clearer code (that is also faster to write).
Plenty of input already, but I've always felt that the best way to approach the issue of performance is to understand the performance differences of all viable solutions and for those that meet performance requirements, pick the one that is the most reliable and the most supportable.
There are many who use Big O notation to understand complexity, but I've found that in most cases (including understanding which string concatenation methods work best), a simple time trial will suffice. Just compare strA + strB to a StringBuilder's Append(strB) in a loop of 100,000 iterations to see which works faster.
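A time trial of that kind could be set up as in the sketch below. ConcatTrial and its methods are hypothetical names, and a single Stopwatch run like this is a rough measurement only: JIT warm-up and garbage collection will skew individual runs, so repeat it a few times before drawing conclusions.

```csharp
using System;
using System.Diagnostics;
using System.Text;

static class ConcatTrial
{
    // Builds the string with the + operator, one new string per iteration.
    public static string WithPlus(int iterations)
    {
        string s = "";
        for (int i = 0; i < iterations; i++) s += "a";
        return s;
    }

    // Builds the same string by appending to a StringBuilder.
    public static string WithBuilder(int iterations)
    {
        var sb = new StringBuilder();
        for (int i = 0; i < iterations; i++) sb.Append("a");
        return sb.ToString();
    }

    // Returns the elapsed milliseconds for one run of the given method.
    public static long Time(Func<int, string> build, int iterations)
    {
        var watch = Stopwatch.StartNew();
        build(iterations);
        watch.Stop();
        return watch.ElapsedMilliseconds;
    }
}
```

For example, compare ConcatTrial.Time(ConcatTrial.WithPlus, 100000) with ConcatTrial.Time(ConcatTrial.WithBuilder, 100000) on your own machine.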
The compiler optimizes string literal concatenation into one string literal. For example:
string s = "a" + "b" + "c";
is optimized to the following at compile time:
string s = "abc";
See this question and this MSDN article for more information.
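One way to see the folding without inspecting the compiled output is that the expression is accepted as a constant. A small sketch (LiteralFolding is a hypothetical name):

```csharp
static class LiteralFolding
{
    // A const must be computable at compile time, so this line only compiles
    // because the compiler folds the three literals into "abc".
    public const string Folded = "a" + "b" + "c";
}
```

The same declaration with a non-constant operand, such as a method parameter, is rejected by the compiler, which shows that the folding applies to constants only.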
When the operands are not all constants, the compiler replaces a chain such as a + b + c with a single call to the String.Concat method (not String.Format, as the comments corrected me).
I benchmarked this forever ago, and it hasn't really made a difference since .NET 1.0 or 1.1.
Back then, if you had some process that was going to hit a line of code that was concatenating strings a few million times, you could get a huge speed increase by using String.Concat, String.Format, or StringBuilder.
Now it doesn't matter at all. At least it hasn't mattered since .NET 2.0 came out, anyhow. Put it out of your mind and code in whatever manner makes it easiest for you to read.
