large string data parsing causing high-cpu usage - c#

My application needs to parse some large string data, which means I am heavily using the Split, IndexOf and Substring methods of the string class. I am trying to use the StringBuilder class wherever I have to do any concatenation. However, while the application is doing this parsing, its CPU usage goes high (60-70%). I am guessing that calling these string APIs is what's causing the CPU usage to go high, especially since the data is big (a typical string Length is 400K). Any ideas on how I can verify what is causing the CPU usage to go that high, and any suggestions on how to bring it down?

One thing to check is that you are passing the StringBuilder around as much as possible, rather than creating a new one and then needlessly returning its ToString().
A much bigger gain though can be made if you process the data as smaller strings, read from a stream. Of course, this depends on just what sort of manipulation you are doing, but if at all possible, read your data from a StreamReader (or similar depending on the source) in small chunks, and then write it to a StreamWriter.
Often changes are only applicable within a given line of text, which makes the following pattern immediately useful:
using(StreamReader sr = new StreamReader(sourceInfo))
using(StreamWriter sw = new StreamWriter(destInfo))
    for(string line = sr.ReadLine(); line != null; line = sr.ReadLine())
        sw.WriteLine(ManipulateString(line));
In other cases where this doesn't apply, there are still ways to chunk the string to be processed up.
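When the data is not line-oriented, one possible way to chunk it is to read fixed-size blocks of characters. The following is a minimal sketch under that assumption; the names ChunkedProcessing and ProcessInChunks and the handler delegate are hypothetical and not from the answer above.

```csharp
using System;
using System.IO;

static class ChunkedProcessing
{
    // Reads 'reader' in blocks of up to 'chunkSize' characters and hands each block
    // to 'handle' together with the number of valid characters in it.
    // Returns the total number of characters read.
    public static long ProcessInChunks(TextReader reader, int chunkSize, Action<char[], int> handle)
    {
        var buffer = new char[chunkSize];   // reused for every block
        long total = 0;
        int read;
        // Read returns 0 at end of input and may return fewer characters than requested.
        while ((read = reader.Read(buffer, 0, buffer.Length)) > 0)
        {
            handle(buffer, read);
            total += read;
        }
        return total;
    }
}
```

A token that straddles two blocks is not handled here; a real parser would have to carry the unfinished tail over into the next block.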

To find out where the CPU usage is coming from: see What Are Some Good .NET Profilers?
To reduce CPU usage: it depends, of course, on what's actually taking the time. You might, for instance, consider working not with actual substrings but with little objects encoding where they are within the big strings they came from. (There is no guarantee that this will actually be an improvement.) Very likely, when you profile your code there will be a few things that jump out at you as problems; they may well be things you'd never have guessed, and they may be very easy to fix as soon as you know they need fixing.
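The "little objects" idea could look roughly like this. It is only a sketch: TextSlice, EqualsOrdinal and NextField are hypothetical names, and as noted above there is no guarantee that it beats plain Substring for a given workload.

```csharp
using System;

// Hypothetical helper: records where a piece of text lives inside a larger string
// instead of copying it out with Substring.
struct TextSlice
{
    public readonly string Source;
    public readonly int Start;
    public readonly int Length;

    public TextSlice(string source, int start, int length)
    {
        Source = source;
        Start = start;
        Length = length;
    }

    // Compares the slice with 'value' without allocating a new string.
    public bool EqualsOrdinal(string value)
    {
        return value.Length == Length &&
               string.CompareOrdinal(Source, Start, value, 0, Length) == 0;
    }

    // Materialises a real string only when one is actually needed.
    public override string ToString()
    {
        return Source.Substring(Start, Length);
    }
}

static class SliceDemo
{
    // Returns the field that starts at 'from' and runs up to the next 'separator'
    // (or to the end of 'data' if there is no further separator).
    public static TextSlice NextField(string data, int from, char separator)
    {
        int end = data.IndexOf(separator, from);
        if (end < 0) end = data.Length;
        return new TextSlice(data, from, end - from);
    }
}
```

On newer .NET versions, ReadOnlySpan<char> (for example via AsSpan) covers much the same ground without a custom type.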

Further to Jon's answer: if your parser does not need to backtrack (i.e. it always reads through the string in a forward direction), and the source of the string is not a file or network stream that you can use a StreamReader with, just wrap your string in a StringReader instead, e.g.
//Create a StringReader using the String variable data which has your String in it
//A StringReader is just a TextReader implementation for Strings
StringReader reader = new StringReader(data);
//Now do whatever manipulation on the string you want...

In your case you are typically using very large strings (a Length of 400K). For operations on large strings you can use a "rope" data structure, which can be very efficient for your case.
Please refer to the links below for more information:
https://iq.opengenus.org/rope-data-structure/
https://www.geeksforgeeks.org/ropes-data-structure-fast-string-concatenation/
STL ropes in c++ : https://www.geeksforgeeks.org/stl-ropes-in-c/

Related

String Builder and string size

Why StringBuilder Size is greater than string(~250MB).
Please read the question. I want to know the reason of size constraint in the string, but not in stringbuilder. I have fixed the problem of reading file.
Yes, I know there are operations we can perform on a StringBuilder, like append, replace, remove, etc. But what is the use of it when we can't get ToString() from it and we can't write it directly to a file? We have to call ToString() to actually use it, but because its size is outside the string range, it throws an exception.
So, in particular, is there any use for a StringBuilder holding more than a string can? I read a file of around 1 GB into a StringBuilder but can't get it into a string. I have read all the pros and cons of StringBuilder over String, but I can't find anything explaining this.
Update:
I want to load an XmlDocument from the file. If I read in chunks, the data cannot be loaded, because the root-level node needs its closing tag, which will be in another chunk.
Update:
I know it is not a correct approach. I am now using a different process, but I still want to know the reason for the size constraint on string but not on StringBuilder.
Update:
I have fixed my problem and want to know why there is no memory constraint on StringBuilder.
Why StringBuilder Size is greater than string(~250MB).
The reason depends on the version of .net.
There are two implementations Eric Lippert mentions here: https://stackoverflow.com/a/6524401/360211
Internally a string builder maintains a char[]. When you append, it may have to resize this array. To avoid resizing on every append, it grows to a larger size in anticipation of future appends (it actually doubles in size). So the StringBuilder often ends up larger than its content, by as much as double the size.
A newer implementation maintains a linked list of char[]. If you do many small appends, the overhead of the linked list may account for the extra 250MB.
In normal use, an extra 100% size on a string temporarily doesn't make one bit of difference given the performance benefits, but when you are dealing with a GB, it becomes significant and that is not its intended usage.
Why you get OutOfMemoryException
The linked list implementation can fit more in memory than a string because it does not need one contiguous block of 1GB. When you call ToString, you force it to try to find another GB, which must also be contiguous, and that is the problem.
Why is there no constraint preventing this?
Well there is. The constraint is if there is not enough memory to create a string during ToString, throw an OutOfMemoryException.
You may want this to happen during Append operations, but that would be impossible to determine. StringBuilder could look at the free memory, but that might change before you call ToString. So the author of StringBuilder could have set an arbitrary limit, but that can't suit all systems equally, as some will have more memory than others.
You also might want to do operations that reduce the size of the StringBuilder before calling ToString, or not call ToString at all! So just because StringBuilder is too large to ToString at any point is not a reason to throw an exception.
You can use StringBuilder.ToString(int, int) to get smaller-sized chunks of your huge content out of the StringBuilder.
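As a rough illustration of that suggestion, the sketch below copies a StringBuilder to a TextWriter piece by piece. The class and method names and the chunk size are assumptions chosen for the example.

```csharp
using System;
using System.IO;
using System.Text;

static class StringBuilderChunks
{
    // Writes the content of 'sb' to 'writer' in pieces of at most 'chunkSize' characters,
    // so no single string as large as the whole builder is ever created.
    public static void WriteInChunks(StringBuilder sb, TextWriter writer, int chunkSize)
    {
        for (int start = 0; start < sb.Length; start += chunkSize)
        {
            int length = Math.Min(chunkSize, sb.Length - start);
            writer.Write(sb.ToString(start, length));
        }
    }
}
```

With a StreamWriter as the target, this gets the content into a file without ever calling the parameterless ToString().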
In addition, you might want to consider whether you are really using the right tool for the job. StringBuilder's purpose is to build and modify strings, not to load huge files to memory.
You can try the following to handle large XML files.
CodeProject
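For the XML case specifically, one commonly used alternative to XmlDocument is the forward-only XmlReader, which never needs the whole document (or its closing root tag) in memory at once. The sketch below only counts elements; LargeXml and CountElements are hypothetical names, and it is not necessarily what the linked article does.

```csharp
using System;
using System.IO;
using System.Xml;

static class LargeXml
{
    // Counts elements named 'elementName' while streaming through the XML,
    // without building an XmlDocument in memory.
    public static int CountElements(TextReader input, string elementName)
    {
        int count = 0;
        using (XmlReader reader = XmlReader.Create(input))
        {
            while (reader.Read())
            {
                if (reader.NodeType == XmlNodeType.Element && reader.Name == elementName)
                    count++;
            }
        }
        return count;
    }
}
```

For a file on disk, passing a StreamReader opened on the file keeps memory use roughly constant regardless of file size.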

Does string.Replace(string, string) create additional strings?

We have a requirement to transform a string containing a date in dd/mm/yyyy format to ddmmyyyy format (In case you want to know why I am storing dates in a string, my software processes bulk transactions files, which is a line based textual file format used by a bank).
And I am currently doing this:
string oldFormat = "01/01/2014";
string newFormat = oldFormat.Replace("/", "");
Sure enough, this converts "01/01/2014" to "01012014". But my question is, does the replace happen in one step, or does it create an intermediate string (e.g.: "0101/2014" or "01/012014")?
Here's the reason why I am asking this:
I am processing transaction files ranging in size from few kilobytes to hundreds of megabytes. So far I have not had a performance/memory problem, because I am still testing with very small files. But when it comes to megabytes I am not sure if I will have problems with these additional strings. I suspect that would be the case because strings are immutable. With millions of records this additional memory consumption will build up considerably.
I am already using StringBuilders for output file creation. And I also know that the discarded strings will be garbage collected (at some point before the end of the time). I was wondering if there is a better, more efficient way of replacing all occurrences of a specific character/substring in a string, that does not additionally create an string.
Sure enough, this converts "01/01/2014" to "01012014". But my question
is, does the replace happen in one step, or does it create an
intermediate string (e.g.: "0101/2014" or "01/012014")?
No, it doesn't create intermediate strings for each replacement. But it does create a new string, because, as you already know, strings are immutable.
Why?
There is no reason to create a new string on each replacement: it's very simple to avoid, and avoiding it gives a huge performance boost.
If you are very interested, referencesource.microsoft.com and the SSCLI 2.0 source code demonstrate this (see "How to see code of method which marked as MethodImplOptions.InternalCall"):
FCIMPL3(Object*, COMString::ReplaceString, StringObject* thisRefUNSAFE,
        StringObject* oldValueUNSAFE, StringObject* newValueUNSAFE)
{
    // unnecessary code omitted
    while (((index=COMStringBuffer::LocalIndexOfString(thisBuffer,oldBuffer,
            thisLength,oldLength,index))>-1) && (index<=endIndex-oldLength))
    {
        replaceIndex[replaceCount++] = index;
        index+=oldLength;
    }
    if (replaceCount != 0)
    {
        // Calculate the new length of the string and ensure that we have
        // sufficient room.
        INT64 retValBuffLength = thisLength -
            ((oldLength - newLength) * (INT64)replaceCount);
        gc.retValString = COMString::NewString((INT32)retValBuffLength);
        // unnecessary code omitted
    }
}
As you can see, retValBuffLength is calculated up front from replaceCount, the number of matches found, so the result string is allocated only once. The real implementation may be a bit different in .NET 4.0 (SSCLI 4.0 has not been released), but I assure you it's not doing anything silly :-).
I was wondering if there is a better, more efficient way of replacing
all occurrences of a specific character/substring in a string, that
does not additionally create an string.
Yes: a reusable StringBuilder with a capacity of ~2000 characters, so that you avoid any memory allocation. This is only true if the replacement lengths are equal, and it can get you a nice performance gain if you're in a tight loop.
Before writing anything, run benchmarks with big files, and see if the performance is enough for you. If performance is enough - don't do anything.
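One possible reading of the "reusable StringBuilder" suggestion is sketched below. DateRewriter and WriteWithoutSlashes are hypothetical names, the class is not thread-safe, and whether it actually beats string.Replace is exactly what the benchmark should tell you.

```csharp
using System;
using System.IO;
using System.Text;

static class DateRewriter
{
    // One builder and one char buffer reused for every record;
    // Clear() resets the length so the same builder can be filled again.
    private static readonly StringBuilder Scratch = new StringBuilder(2000);
    private static char[] _buffer = new char[2000];

    // Copies 'line' to 'output' without its '/' characters. The result goes from the
    // builder to the writer through a char buffer, so no result string is allocated.
    public static void WriteWithoutSlashes(string line, TextWriter output)
    {
        Scratch.Clear();
        foreach (char c in line)
        {
            if (c != '/') Scratch.Append(c);
        }

        // Grow the buffer only in the rare case of an unusually long line.
        if (_buffer.Length < Scratch.Length) _buffer = new char[Scratch.Length];
        Scratch.CopyTo(0, _buffer, 0, Scratch.Length);
        output.Write(_buffer, 0, Scratch.Length);
    }
}
```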
Well, I'm not a .NET development team member (unfortunately), but I'll try to answer your question.
Microsoft has a great site of .NET Reference Source code, and according to it, String.Replace calls an external method that does the job. I wouldn't argue about how it is implemented, but there's a small comment to this method that may answer your question:
// This method contains the same functionality as StringBuilder Replace. The only difference is that
// a new String has to be allocated since Strings are immutable
Now, if we follow the StringBuilder.Replace implementation, we'll see what it actually does inside.
A little more on string objects:
Although String is immutable in .NET, this is not some kind of limitation, it's a contract. String is actually a reference type, and what it contains is the length of the actual string plus the buffer of characters. You can actually get an unsafe pointer to this buffer and change it "on the fly", but I wouldn't recommend doing this.
Now, the StringBuilder class also holds a character array, and when you pass a string to its constructor it actually copies the string's buffer to its own (see Reference Source). What it doesn't have, though, is the contract of immutability, so when you modify a string using StringBuilder you are actually working with the char array. Note that when you call ToString() on a StringBuilder, it creates a new "immutable" string and copies its buffer there.
So, if you need a fast and memory-efficient way to make changes in a string, StringBuilder is definitely your choice, especially given that Microsoft explicitly recommends using StringBuilder if you "perform repeated modifications to a string".
I haven't found any sources, but I strongly doubt that the implementation creates a new string for every single replacement. I would also implement it with a StringBuilder internally. So String.Replace is absolutely fine if you want to do one replace operation on a huge string. But if you have to call it many times, you should consider using StringBuilder.Replace, because every call of String.Replace creates a new string.
So you can use StringBuilder.Replace since you're already using a StringBuilder.
Is StringBuilder.Replace() more efficient than String.Replace?
String.Replace() vs. StringBuilder.Replace()
There is no string method for that. You are on your own. But you can try something like this:
string oldFormat = "dd/mm/yyyy";
string[] dt = oldFormat.Split('/');
string newFormat = string.Format("{0}{1}{2}", dt[0], dt[1], dt[2]);
or
StringBuilder sb = new StringBuilder(dt[0]);
sb.AppendFormat("{0}{1}", dt[1], dt[2]);

StringBuilder OutOfMemoryException

I have a StringBuilder to which I append every pixel in an image, so the amount of text is extremely large. Every time I run my program, everything goes well, but once I change a pixel color (ARGB) I get an OutOfMemoryException at the spot where I clear the StringBuilder. The problem is that I need to create an instance of StreamWriter, then add my text to it, THEN set the file path. My current code is:
StringBuilder PixelFile = new StringBuilder("", 5000);
private void Render()
{
    // On the second run, I get an OutOfMemoryException
    PixelFile.Clear();
    // This is in a for loop, but I cut it out for reference.
    PixelFile.Append(ArGBFormat);
}
I do not know what is causing this. I have tried PixelFile.Length = 0; and PixelFile.Capacity = 0;
OutOfMemory probably means you're building the string too big for StringBuilder, which is designed to handle a very different type of operation.
While I'm at a loss for how to make StringBuilder work, let me point you at a more intuitive implementation that will be less likely to fail.
You can read and write from a file using direct binary through the BinaryReader and BinaryWriter classes. This can also save you a lot of effort since you can make sure you're serializing bytes instead of character strings or entire words.
If you absolutely must use plaintext, consider the StreamReader and StreamWriter classes directly, as they won't throw exceptions for size. Remember, streams are intended for this sort of operation, StringBuilder is not, so Streams are far more likely to work with far less effort on your part.
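A minimal sketch of the plain-text route follows, assuming the pixels are available as an int[] of ARGB values and that one hexadecimal value per line is an acceptable format. Both are assumptions, since the question does not show its ArGBFormat, and PixelExport and WritePixels are hypothetical names.

```csharp
using System;
using System.IO;

static class PixelExport
{
    // Writes one line per pixel straight to 'writer'. Only the writer's own
    // buffer is held in memory, not the text of the whole image.
    public static void WritePixels(int[] argbPixels, TextWriter writer)
    {
        foreach (int argb in argbPixels)
        {
            // "X8" formats the value as eight hexadecimal digits (AARRGGBB).
            writer.WriteLine(argb.ToString("X8"));
        }
    }
}
```

Passing a StreamWriter opened on the target path (inside a using block) sends the text to disk as it is produced, so there is no large StringBuilder to clear between renders.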
EDIT:
When the maximum capacity is reached, no further memory can be allocated for the StringBuilder object, and trying to add characters or expand it beyond its maximum capacity throws either an ArgumentOutOfRangeException or an OutOfMemoryException exception.
Therefore, this is a limitation of the StringBuilder class and cannot be overcome with your current implementation.
EDIT: Additional implementation
In addition to StreamWriters which can write directly to files, you can also use the MemoryStream class to pipe information to memory instead of disk. Be aware this could lead to slow performance of the program, and I recommend instead trying to refactor the process to only need to perform a stream once.
That being said, it is still possible.
var mem = new MemoryStream();
var memWriter = new StreamWriter(mem);
// TODO: use memWriter.Write as per StreamWriter
memWriter.Flush(); // Push any buffered text into the MemoryStream
mem.Position = 0; // This ensures you are copying your stream from the beginning
// TODO: Show your file save dialog
using (var fileStream = File.Create(fileNameFromDialog))
{
    mem.CopyTo(fileStream); // Perform the copy
}

Is the StreamReader class in C# the best choice for reading my data?

I need to read in a text file that can range from 8k to 5MB. This file is made up of a single line of text, with no carriage returns or end-of-line characters. I then need to break it down into its individual pieces. Those pieces are delimited by size. For example, the first chunk of information is made up of 240 characters. In those 240 characters the first 30 are the Name field, the next 35 are the Address, and so on. Parsing aside, is the StreamReader class the best choice for reading it into memory?
Look at the TextFieldParser class. Although it is in the Microsoft.VisualBasic.FileIO namespace, it can easily be used with C#.
The class description on MSDN is:
Provides methods and properties for parsing structured text files.
An example usage would be:
using(var tfp = new TextFieldParser("path to text file"))
{
    tfp.TextFieldType = FieldType.FixedWidth;
    tfp.FieldWidths = new int[] {5, 10, 11, -1};
}
I'd very much recommend using a StreamReader as opposed to reading all the text into a string, for reasons of heap efficiency. I have run into lots of trouble with strings over 2 MB without too much effort (on 32-bit .NET).
Do you need further guidance? It seems to me you might be looking for help in treating the stream. It is common for programmers to have more experience in handling strings, and therefore preferring stringy solutions.
If you paste some more specifics about the structure of the data, I could help you out a bit. For now, just a single general pointer:
All general-purpose parsers and lexers employ a streaming input model. e.g. Look at Coco/C# for a simple to use parser generator.
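For the fixed-width layout described in the question, a streaming reader might be sketched as follows. The record length and field widths are supplied by the caller, FixedWidthRecords is a hypothetical name, and trimming the padding with TrimEnd is an assumption about how fields are padded.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

static class FixedWidthRecords
{
    // Reads consecutive records of 'recordLength' characters and splits each one into
    // fields of the given widths (their sum must not exceed 'recordLength').
    // A trailing partial record is ignored.
    public static IEnumerable<string[]> Read(TextReader reader, int recordLength, int[] fieldWidths)
    {
        var buffer = new char[recordLength];   // reused for every record
        // ReadBlock keeps reading until the buffer is full or the input ends.
        while (reader.ReadBlock(buffer, 0, recordLength) == recordLength)
        {
            var fields = new string[fieldWidths.Length];
            int offset = 0;
            for (int i = 0; i < fieldWidths.Length; i++)
            {
                fields[i] = new string(buffer, offset, fieldWidths[i]).TrimEnd();
                offset += fieldWidths[i];
            }
            yield return fields;
        }
    }
}
```

With a StreamReader as input, only one record is in memory at a time, however large the file is.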

non contiguous String object C#.net

As far as I understand, String and StringBuilder objects both allocate contiguous memory underneath.
My program runs for days, buffering output in a String object. This sometimes causes an OutOfMemoryException, which I think is because contiguous memory is not available. My string size can go up to 100 MB, and I am concatenating new strings frequently, which causes new string objects to be allocated. I can reduce new string object creation by using StringBuilder, but that would not solve my problem entirely.
Is there an alternative to a contiguous string object?
A rope data structure may be an option but I don't know of any ready-to-use implementations for .NET.
Have you tried using a LinkedList of strings instead? Or perhaps you can modify your architecture to read and write a file on disk instead of keeping everything in memory.
DO NOT USE STRINGS.
Strings will copy and allocate a new string for every operation. That is, if you have a 50 MB string and add one character, then until garbage collection happens you will have two (approx.) 50 MB strings around.
Then, when you add another char, you'll have 3... and so on.
On the other hand, proper use of StringBuilder, that is, using Append, should not have any problem with 100 MB.
Another optimization is creating the StringBuilder with your estimated size:
StringBuilder SB;
SB = new StringBuilder(capacity); // capacity being the suggested starting size
Use a StringBuilder to hold your big string, and then use Append.
HTH
By going so large your strings are moved to the Large Object Heap (LOH) and you run a greater risk of fragmentation.
A few options:
Use a StringBuilder. You will be re-allocating less frequently. And try to pre-allocate, like new StringBuilder(100*1000*1000);
re-design your solution. There must be alternatives to keeping such large strings around. A List<string> for instance, that is only converted to 1 single string when (really) necessary.
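The List<string> option could be sketched like this. ChunkedTextBuffer is a hypothetical class written for illustration; it keeps the output as many small strings, so no single 100 MB contiguous allocation is needed while buffering.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical buffer that keeps output as many small strings rather than one huge one.
sealed class ChunkedTextBuffer
{
    private readonly List<string> _chunks = new List<string>();

    // Total number of characters buffered so far.
    public long Length { get; private set; }

    public void Append(string text)
    {
        if (string.IsNullOrEmpty(text)) return;
        _chunks.Add(text);          // stores a reference; nothing is copied or concatenated
        Length += text.Length;
    }

    // Streams the content out piece by piece, so one contiguous string is never required.
    public void WriteTo(TextWriter writer)
    {
        foreach (string chunk in _chunks) writer.Write(chunk);
    }

    // Builds a single string only when that is really necessary.
    public override string ToString()
    {
        return string.Concat(_chunks);
    }
}
```

ToString() still needs one contiguous block for the result, so for very large buffers WriteTo is the safer way out.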
I don't believe there's any solution for this using either String or StringBuilder. Both will require contiguous memory. Is it possible to change your architecture such that you can save the ongoing data to a List, a file, a database, or some other structure designed for such purposes?
First you should examine why you are doing that and see if there are other things you can do that give you the same value.
Then you have lots of options (depending on what you need) ranging from using logging to writing a simple class that collects strings into a List.
You can try saving the string to persistent storage such as a text file or a database (SQL Server Express, MySQL, MS Access, etc.). This way, if your server gets shut down for any reason (power outage, someone bumped the UPS, thunderstorm, etc.), you would not lose your data. It is a little slower than RAM, but I think the trade-off is worth it.
If this is not an option, most definitely use the StringBuilder for adding strings.
