I have a StringBuilder of length 1,539,121,968. Calling StringBuilder.ToString() on it fails with an OutOfMemoryException. I tried creating a char array instead, but was not allowed to create such a big array.
I need to store the contents as a byte array in UTF-8 format. Is that possible?
I'd suggest looking at the documentation for streams, as this might help.
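For example, here is a minimal sketch (the chunk size, method name, and file path are my own assumptions, not from the question) that copies the builder out in fixed-size chunks and lets a UTF-8 StreamWriter encode each chunk, so the full string is never materialized:

static void WriteUtf8(StringBuilder sb, string path)
{
    char[] buffer = new char[64 * 1024]; // reusable chunk buffer
    using (var writer = new StreamWriter(path, false, Encoding.UTF8))
    {
        for (int pos = 0; pos < sb.Length; pos += buffer.Length)
        {
            int count = Math.Min(buffer.Length, sb.Length - pos);
            sb.CopyTo(pos, buffer, 0, count); // copy one chunk out of the builder
            writer.Write(buffer, 0, count);   // StreamWriter encodes it as UTF-8
        }
    }
}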
Another way to approach it would be to split it up. As for your last comment stating that you wish to store it as a byte array in UTF-8: you'd need to keep the data as char[] (or string) pieces until you encode it, or you'd lose your encoding. I'd recommend splitting it into many smaller strings (or char[]s) stored in separate objects that can easily be reconstructed. Something like this might suffice; create many StringSlices:
public class StringSlice
{
    public string Str { get; }
    public int Index { get; }

    public StringSlice(string str, int index)
    {
        this.Str = str;
        this.Index = index;
    }

    public static List<string> ReconstructString(IEnumerable<StringSlice> parts)
    {
        // Sort the input by Index and return a list with the strings in order.
        // You'd probably have to use a buffer on disk so as not to breach the 2GB object limit.
        throw new NotImplementedException();
    }
}
In essence, what you would be doing here is similar to the way internet packets are split and reconstructed. I'm not entirely sure I've answered your question, but hopefully this goes some way toward helping.
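For illustration, a hypothetical helper (not part of the original answer) that carves a huge StringBuilder into StringSlice parts of a fixed chunk size:

static List<StringSlice> Slice(StringBuilder sb, int chunkSize)
{
    var parts = new List<StringSlice>();
    for (int pos = 0, index = 0; pos < sb.Length; pos += chunkSize, index++)
    {
        int count = Math.Min(chunkSize, sb.Length - pos);
        parts.Add(new StringSlice(sb.ToString(pos, count), index)); // ToString(start, length) copies one chunk only
    }
    return parts;
}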
This is a common question, but I hope this does not get tagged as a duplicate, since the nature of the question is different (please read the whole question, not only the title).
Unaware of the existence of String.Replace, I wrote the following:
int theIndex = 0;
while ((theIndex = message.IndexOf(separationChar, theIndex)) != -1) //we found the character
{
theIndex++;
if (theIndex < message.Length)//not in the last position
{
message = message.Insert(theIndex, theTime);
}
else
{
// I don't think this is really necessary
break;
}
} //while finding characters
As you can see, I am inserting a string called theTime after each occurrence of separationChar in the message string.
Now, this works OK for small strings, but I have been given a really huge string (on the order of several hundred kilobytes - by the way, is there a limit for String or StringBuilder?) and it takes a lot of time...
So my questions are:
1) Is it more efficient if I just do
oldString = separationChar.ToString();
newString = oldString.Insert(1, theTime); // i.e. the separator followed by theTime
message = message.Replace(oldString, newString);
2) Is there any other way I can process very long strings, inserting a string (theTime) after each occurrence of some char, in a fast and efficient way?
Thanks a lot.
As Danny already mentioned, string.Insert() actually creates a new instance each time you use it, and these also have to be garbage collected at some point.
You could instead start with an empty StringBuilder to construct the result string:
public static string Replace(this string str, char find, string replacement)
{
StringBuilder result = new StringBuilder(str.Length); // initial capacity
int pointer = 0;
int index;
while ((index = str.IndexOf(find, pointer)) >= 0)
{
// Append the unprocessed data up to the character
result.Append(str, pointer, index - pointer);
// Append the replacement string
result.Append(replacement);
// Next unprocessed data starts after the character
pointer = index + 1;
}
// Append the remainder of the unprocessed data
result.Append(str, pointer, str.Length - pointer);
return result.ToString();
}
This will not cause a new string to be created (and garbage collected) for each occurrence of the character. Instead, when the internal buffer of the StringBuilder is full, it will create a new buffer chunk "of sufficient capacity". Quoting the reference source, on what happens when its buffer is full:
"Compute the length of the new block we need. We make the new chunk at least big enough for the current need (minBlockCharCount), but also as big as the current length (thus doubling capacity), up to a maximum (so we stay in the small object heap, and never allocate really big chunks even if the string gets really big)."
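For illustration, hypothetical usage of the extension method above (assuming it is declared in a static class that is in scope); this mirrors the question's goal of inserting theTime after each separator:

string message = "a|b|c";
string theTime = "12:00";
string result = message.Replace('|', "|" + theTime); // the (char, string) extension is picked
// result == "a|12:00b|12:00c"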
Thank you for answering my question.
I am writing an answer because I have to report that I tried the solution from my question 1), and it is indeed more efficient according to the results of my program. String.Replace can replace a string (built from a char) with another string very fast.
oldString = separationChar.ToString();
newString = oldString.Insert(1, theTime); // i.e. the separator followed by theTime
message = message.Replace(oldString, newString);
I want to know how returning strings works in C#. In one of my functions, I generate HTML code, and the string is really huge; I then return it from the function and insert it into the page. But I want to know: should I pass a huge string as a return value, or just insert it into the page from the same function?
When C# returns a string, does it create a new string from the old one and return that?
Thanks.
Strings (or any other reference type) are not copied when returned from a function; only value types are.
System.String is a reference type (class), so passing it as a parameter or returning it only involves copying a reference (32 or 64 bits).
The size of the string is not relevant.
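A quick check that illustrates this (the helper name is mine): the caller receives the very same object, not a copy:

static string Pass(string s) => s;

string big = new string('x', 10_000_000);
Console.WriteLine(ReferenceEquals(big, Pass(big))); // True: only a reference was returned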
Returning a string is a cheap operation - as mentioned it's purely a matter of returning 32 or 64 bits (4 or 8 bytes).
However, as Sten Petrov points out string + operations involve the creation of a new string, and can be a little expensive. If you wanted to save performance & memory I'd suggest doing something like this:
static int i = 0;
static void Main(string[] args)
{
while (Console.ReadLine() == "")
{
var pageSB = new StringBuilder();
foreach (var section in new[] { AddHeader(), AddContent(), AddFooter() })
for (int j = 0; j < section.Length; j++) // renamed to j: avoids shadowing the static counter i
pageSB.Append(section[j]);               // copy chars without calling ToString() on each section
Console.Write(pageSB.ToString());
}
}
static StringBuilder AddHeader()
{
return new StringBuilder().Append("Hi ").AppendLine("World");
}
static StringBuilder AddContent()
{
return new StringBuilder()
.AppendFormat("This page has been viewed: {0} times\n", ++i);
}
static StringBuilder AddFooter()
{
return new StringBuilder().Append("Bye ").AppendLine("World");
}
Here we use the StringBuilders to hold references to all the strings we want to concatenate, and wait until the very end before joining them together. This saves many unnecessary concatenations (which are memory- and CPU-heavy in comparison).
Of course, I doubt you'll actually see any need for this in practice - and if you do, I'd spend some time learning about pooling etc. to help reduce the garbage created by all the string builders - and maybe consider creating a custom 'string holder' that suits your purposes better.
I'm in a little bit of a bind. I'm working with a legacy system that contains a bunch of delimited strings which I need to parse. Unfortunately, the strings need to be ordered based on the first part of the string. The array looks something like:
array[0] = "10|JohnSmith|82";
array[1] = "1|MaryJane|62";
array[2] = "3|TomJones|77";
So I'd like the ordered array to look like:
array[0] = "1|MaryJane|62";
array[1] = "3|TomJones|77";
array[2] = "10|JohnSmith|82";
I thought about using a two-dimensional array to hold the first part separately and leave the full string in the second part, but can I mix types in a two-dimensional array like that?
I'm not sure how to handle this situation - can anyone help? Thanks!
Call Array.Sort, but passing in a custom implementation of IComparer<string>:
// Give it a proper name really :)
public class IndexComparer : IComparer<string>
{
public int Compare(string first, string second)
{
// I'll leave you to decide what to do if the format is wrong
int firstIndex = GetIndex(first);
int secondIndex = GetIndex(second);
return firstIndex.CompareTo(secondIndex);
}
private static int GetIndex(string text)
{
int pipeIndex = text.IndexOf('|');
return int.Parse(text.Substring(0, pipeIndex));
}
}
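Usage is then a single call that sorts the array in place:

Array.Sort(array, new IndexComparer());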
Alternatively, convert from a string array into an array of custom types by splitting the string up appropriately. This will make life easier if you're going to do further work on the array, but if you only need to sort the values, then you might as well use the code above.
You did say that you need to parse the strings - so is there any particular reason why you'd want to parse them before sorting them?
string[] sorted = new[] {
    "10|JohnSmith|82",
    "1|MaryJane|62",
    "3|TomJones|77",
}.OrderBy(x => int.Parse(x.Split('|')[0])).ToArray(); // ToArray materializes the lazy OrderBy
Use an ArrayList (http://msdn.microsoft.com/en-us/library/system.collections.arraylist_methods(VS.80).aspx) so you can sort it.
If the array is large, you will want to extract the initial integers all in one pass, so you are not parsing strings at every comparison. IMO, you really want to encapsulate the information encoded in the strings into a class first. Then sort the array of those objects.
Something like:
class Person {
    public int Index { get; set; }
    public string Name { get; set; }
    public int Age { get; set; } // just guessing the semantic meaning
}
So then:
Map your encoded string into an ArrayList of Person objects.
Then use ArrayList.Sort(IComparer) where your comparer only looks at the Index.
This will likely perform better than using parse in every comparison.
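A minimal sketch of that mapping, using List<T> in place of ArrayList (the property names follow the Person class above; the "index|name|age" layout is assumed from the question):

List<Person> people = array
    .Select(s => s.Split('|'))
    .Select(p => new Person { Index = int.Parse(p[0]), Name = p[1], Age = int.Parse(p[2]) })
    .ToList();
people.Sort((a, b) => a.Index.CompareTo(b.Index)); // each integer is parsed once, not once per comparison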
for lolz
Array.Sort(array, (x, y) =>
    int.Parse(x.Split('|')[0]).CompareTo(int.Parse(y.Split('|')[0])));
I'm developing a log parser, and I'm reading files of more than 150MB. This is my approach - is there any way to optimize what is in the while statement? The problem is that it consumes a lot of memory. I also tried a StringBuilder and faced the same memory consumption.
private void ReadLogInThread()
{
string lineOfLog = string.Empty;
try
{
StreamReader logFile = new StreamReader(myLog.logFileLocation);
InformationUnit infoUnit = new InformationUnit();
infoUnit.LogCompleteSize = myLog.logFileSize;
while ((lineOfLog = logFile.ReadLine()) != null)
{
myLog.transformedLog.Add(lineOfLog); //list<string>
myLog.logNumberLines++;
infoUnit.CurrentNumberOfLine = myLog.logNumberLines;
infoUnit.CurrentLine = lineOfLog;
infoUnit.CurrentSizeRead += lineOfLog.Length;
if (onLineRead != null)
onLineRead(infoUnit);
}
}
catch { throw; }
}
Thanks in advance!
EXTRA:
I'm saving each line because after reading the log I will need to check for some information on every stored line. The language is C#.
Memory savings can be achieved if your log lines can actually be parsed into a data-row representation.
Here is a typical log line I can think of:
Event at: 2019/01/05:0:24:32.435, Reason: Operation, Kind: DataStoreOperation, Operation Status: Success
This line takes about 200 bytes in memory.
At the same time, the following representation takes about 16 bytes:
enum LogReason { Operation, Error, Warning }
enum EventKind : short { DataStoreOperation, DataReadOperation }
enum OperationStatus : short { Success, Failed }

struct LogRow
{
    public DateTime EventTime;     // 8 bytes
    public LogReason Reason;       // 4 bytes (default int backing)
    public EventKind Kind;         // 2 bytes
    public OperationStatus Status; // 2 bytes
}
Another optimization possibility is parsing each line into an array of string tokens; this way you could make use of string interning.
For example, if the word "DataStoreOperation" takes 36 bytes and has 1,000,000 entries in the file, the saving is (18*2 - 4) * 1,000,000 = 32,000,000 bytes.
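A small sketch of the interning idea (the comma delimiter matches the sample line above; Trim and the method name are my own assumptions):

static string[] TokenizeInterned(string line)
{
    string[] tokens = line.Split(',');
    for (int i = 0; i < tokens.Length; i++)
        tokens[i] = string.Intern(tokens[i].Trim()); // all equal tokens share one string instance
    return tokens;
}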
Try to make your algorithm sequential.
Using an IEnumerable instead of a List plays nicely with memory, while keeping the same semantics as working with a list, if you don't need random access to lines by index.
IEnumerable<string> ReadLines()
{
    using (StreamReader logFile = new StreamReader(myLog.logFileLocation))
    {
        string lineOfLog;
        while ((lineOfLog = logFile.ReadLine()) != null)
        {
            yield return lineOfLog;
        }
    }
}
//...
foreach (var line in ReadLines())
{
ProcessLine(line);
}
I am not sure if it will fit your project, but you can store the result in a StringBuilder instead of a list of strings.
For example, this process on my machine takes 250MB of memory after loading (the file is 50MB):
static void Main(string[] args)
{
    using (StreamReader streamReader = File.OpenText("file.txt"))
    {
        var list = new List<string>();
        string line;
        while ((line = streamReader.ReadLine()) != null)
        {
            list.Add(line);
        }
    }
}
On the other hand, this process will take only 100MB:
static void Main(string[] args)
{
    var stringBuilder = new StringBuilder();
    using (StreamReader streamReader = File.OpenText("file.txt"))
    {
        string line;
        while ((line = streamReader.ReadLine()) != null)
        {
            stringBuilder.AppendLine(line);
        }
    }
}
Memory usage keeps going up because you're simply adding the lines to a List<string> that grows constantly. If you want to use less memory, one thing you can do is write the data to disk rather than keeping it in scope. Of course, this will greatly degrade speed.
Another option is to compress the string data as you store it in your list and decompress it coming out, but I don't think this is a good method.
Side Note:
You need to add a using block around your StreamReader:
using (StreamReader logFile = new StreamReader(myLog.logFileLocation))
Consider this implementation (I'm speaking C/C++; substitute C# as needed):
1) Use fseek/ftell to find the size of the file.
2) Use malloc to allocate a chunk of memory the size of the file + 1; set that last byte to '\0' to terminate the string.
3) Use fread to read the entire file into the memory buffer. You now have a char * which holds the contents of the file as a string.
4) Create a vector of const char * to hold pointers to the positions in memory where each line can be found. Initialize the first element of the vector to the first byte of the memory buffer.
5) Find the carriage-control characters (probably \r\n). Replace the \r by \0 to make the line a string, increment past the \n, and push this new pointer location back onto the vector.
6) Repeat the above until all of the lines in the file have been NUL-terminated and are pointed to by elements in the vector.
7) Iterate through the vector as needed to investigate the contents of each line, in your business-specific way.
8) When you are done, close the file, free the memory, and continue happily along your way.
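A hedged C# analog of this approach (the names are mine): read the file into one big string, then store only slices over it rather than a separate string object per line:

static List<ReadOnlyMemory<char>> IndexLines(string path)
{
    string all = File.ReadAllText(path);              // one big buffer, like the malloc'd block above
    var lines = new List<ReadOnlyMemory<char>>();
    int start = 0;
    for (int i = 0; i < all.Length; i++)
    {
        if (all[i] == '\n')
        {
            int len = i - start;
            if (len > 0 && all[i - 1] == '\r') len--; // drop the \r of a \r\n pair
            lines.Add(all.AsMemory(start, len));      // a slice, not a copy
            start = i + 1;
        }
    }
    if (start < all.Length)
        lines.Add(all.AsMemory(start));               // final line without a trailing newline
    return lines;
}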
1) Compress the strings before you store them (see System.IO.Compression and GZipStream; a sketch follows after point 2). This would probably kill the performance of your program, though, since you'd have to decompress each line to read it.
2) Remove any extra whitespace characters or common words you can do without. I.e., if you can understand what the log is saying without the words "the, a, of...", remove them. Also, shorten any common words (e.g. change "error" to "err" and "warning" to "wrn"). This would slow down this step in the process but shouldn't affect the performance of the rest.
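A minimal sketch of suggestion 1), assuming per-line compression is acceptable (the method name is mine; note that gzip overhead can outweigh the savings on very short lines):

static byte[] Compress(string line)
{
    var output = new MemoryStream();
    using (var gzip = new GZipStream(output, CompressionMode.Compress))
    {
        byte[] bytes = Encoding.UTF8.GetBytes(line);
        gzip.Write(bytes, 0, bytes.Length); // compressed bytes accumulate in `output`
    }
    return output.ToArray(); // MemoryStream.ToArray still works after the stream is closed
}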
What encoding is your original file? If it is ASCII, then just the strings alone are going to take over 2x the size of the file just to load into your array. A C# character is 2 bytes, and a C# string adds an extra 20 bytes per string on top of the characters.
In your case, since it is a log file, you can probably exploit the fact that there is a lot of repetition in the messages. You most likely can parse the incoming line into a data structure that reduces the memory overhead. For example, if you have a timestamp in the log file, you can convert that to a DateTime value, which is 8 bytes. Even a short timestamp of 1/1/10 would add 12 bytes to the size of a string, and a timestamp with time information would be even longer. Other tokens in your log stream might be turned into a code or an enum in a similar manner.
Even if you have to leave the value as a string, if you can break it down into pieces that are used a lot, or remove boilerplate that is not needed at all, you can probably cut down on your memory usage. If there are a lot of common strings, you can intern them and only pay for one instance no matter how many you have.
If you must store the raw data, and assuming that your logs are mostly ASCII, then you can save some memory by storing UTF8 bytes internally. Strings are UTF16 internally, so you're storing an extra byte for each character. So by switching to UTF8 you're cutting memory use by half (not counting class overhead, which is still significant). Then you can convert back to normal strings as needed.
static void Main(string[] args)
{
    List<byte[]> strings = new List<byte[]>();
    using (TextReader tr = new StreamReader(@"C:\test.log"))
    {
        string s = tr.ReadLine();
        while (s != null)
        {
            // Encoding.UTF8.GetBytes(s) would be the simpler equivalent
            strings.Add(Encoding.Convert(Encoding.Unicode, Encoding.UTF8, Encoding.Unicode.GetBytes(s)));
            s = tr.ReadLine();
        }
    }

    // Get the strings back
    foreach (var str in strings)
    {
        Console.WriteLine(Encoding.UTF8.GetString(str));
    }
}
From a brief look using Reflector, it looks like String.Substring() allocates memory for each substring. Am I correct that this is the case? I thought that wouldn't be necessary since strings are immutable.
My underlying goal was to create a IEnumerable<string> Split(this String, Char) extension method that allocates no additional memory.
One reason why most languages with immutable strings create new substrings rather than referring into existing strings is that it would interfere with garbage-collecting those strings later.
What happens if a string is used for its substring, but then the larger string becomes unreachable (except through the substring)? The larger string would be uncollectable, because collecting it would invalidate the substring. What seemed like a good way to save memory in the short term becomes a memory leak in the long term.
Not possible without poking around inside .NET's String class internals. You would have to pass around references to an array which was mutable, and make sure no one screwed it up.
.NET will create a new string every time you ask it to. The only exception to this is interned strings, which are created by the compiler (and can be created by you); these are placed into memory once, and pointers are then established to the string for memory and performance reasons.
Each string has to have its own string data, with the way that the String class is implemented.
You can make your own SubString structure that uses part of a string:
public struct SubString {
private string _str;
private int _offset, _len;
public SubString(string str, int offset, int len) {
_str = str;
_offset = offset;
_len = len;
}
public int Length { get { return _len; } }
public char this[int index] {
get {
if (index < 0 || index > len) throw new IndexOutOfRangeException();
return _str[_offset + index];
}
}
public void WriteToStringBuilder(StringBuilder s) {
s.Write(_str, _offset, _len);
}
public override string ToString() {
return _str.Substring(_offset, _len);
}
}
You can flesh it out with other methods, like comparison, which are also possible to implement without extracting the string.
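For illustration, hypothetical usage of the struct above:

var sub = new SubString("Hello, World!", 7, 5);
Console.WriteLine(sub[0]);         // 'W'
Console.WriteLine(sub.ToString()); // "World"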
Because strings are immutable in .NET, every string operation that results in a new string object will allocate a new block of memory for the string contents.
In theory, it could be possible to reuse the memory when extracting a substring, but that would make garbage collection very complicated: what if the original string is garbage-collected? What would happen to the substring that shares a piece of it?
Of course, nothing prevents the .NET BCL team from changing this behavior in future versions of .NET. It wouldn't have any impact on existing code.
Adding to the point that strings are immutable, you should be aware that the following snippet will generate multiple string instances in memory.
String s1 = "Hello", s2 = ", ", s3 = "World!";
String res = s1 + s2 + s3;
s1 + s2 => new string instance (temp1)
temp1 + s3 => new string instance (temp2)
res is a reference to temp2.
(Strictly speaking, the C# compiler turns a single expression like s1 + s2 + s3 into one String.Concat(s1, s2, s3) call, so the intermediate instances above actually arise when the concatenations happen in separate statements or in a loop.)