Why is the string pointer position different? - C#

Why is the string pointer position different each time I run the application when the string comes from a StringBuilder, but the same when I declare it as a literal?
void Main()
{
    string str_01 = "my string";
    string str_02 = GetString();
    unsafe
    {
        fixed (char* pointerToStr_01 = str_01)
        {
            fixed (char* pointerToStr_02 = str_02)
            {
                Console.WriteLine((Int64)pointerToStr_01);
                Console.WriteLine((Int64)pointerToStr_02);
            }
        }
    }
}

private string GetString()
{
    StringBuilder sb = new StringBuilder();
    sb.Append("my string");
    return sb.ToString();
}
Output:
40907812
178488268
next time:
40907812
179023248
next time:
40907812
178448964

str_01 holds a reference to a constant string. StringBuilder, however, builds string instances dynamically, so the returned string is not the same instance as the constant string with the same content. System.Object.ReferenceEquals() will return false.
Since str_01 is a reference to a constant string, its data is probably stored in a data section of the executable, which always gets the same address in the application's virtual address space.
Edit:
You can see the "my string" text in UTF-8 encoding when you open the compiled .exe file using PE.Explorer or similar software. It is present in the .data section of the file, including a preferred Virtual Address where the section should be loaded in process virtual memory.
However, I have not been able to reproduce str_01 having the same address across multiple runs of the application, probably because my x64 Windows 8.1 performs address space layout randomization (ASLR). Because of that, all pointers will differ across runs of the application, even those that point directly into loaded PE sections.

Just because two strings are equal doesn't mean they are the same reference (which, I guess, is what having the same pointer would mean). C# does not intern all strings automatically, for performance reasons among others. If you want the pointers to be the same for both strings, you can intern str_02 using string.Intern.
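A minimal sketch of that suggestion, assuming an ordinary JIT-compiled console program. The class name InternDemo and the helper BuildString are made up for illustration; only string.Intern and object.ReferenceEquals come from the framework.

```csharp
using System;
using System.Text;

public static class InternDemo
{
    // Hypothetical helper: builds "my string" at run time,
    // so the result is a separate heap object from the literal.
    public static string BuildString()
    {
        var sb = new StringBuilder();
        sb.Append("my string");
        return sb.ToString();
    }

    public static void Main()
    {
        string literal = "my string";
        string built = BuildString();

        // Same contents, so value equality holds.
        Console.WriteLine(literal == built);
        // Two distinct objects, so reference equality does not hold.
        Console.WriteLine(object.ReferenceEquals(literal, built));
        // Intern hands back the pooled instance for this content,
        // which is normally the one the literal already refers to.
        Console.WriteLine(object.ReferenceEquals(literal, string.Intern(built)));
    }
}
```

Pinning the interned reference with fixed would then yield the same address as the literal within one run; it says nothing about addresses across runs.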

The fixed statement pins the string and gives you a pointer to its characters.
Because str_01 is a constant string, its memory is set up when the program starts, and it ends up at the same location every time:
fixed (char* pointerToStr_01 = str_01)
But in the case of
fixed (char* pointerToStr_02 = str_02)
the memory is allocated dynamically, so the location it points to varies from run to run.
Hence the difference in the string pointer each time the application runs.

I do not agree that the output of
Console.WriteLine((Int64)pointerToStr_01);
is always the same for you; I tested it myself to make my point clearer.
Let's look at both cases:
In the case of string str_01 = "my string", the printed pointer value will not be the same as on the previous run, because a new String object is created every time (string is immutable) and "my string" is assigned to it. The pointer printed inside the fixed statement goes out of scope when the program ends, so the previous value is not remembered on the next run.
By now you can probably explain the behavior of StringBuilder yourself.
Also check with:
string str_01 = GetString();

private static string GetString()
{
    var sb = new String(new char[] {'m','y',' ','s','t','r','i','n','g'});
    return sb;
}

Related

Checking work of C# String.Intern method through ProcessHacker

I'm playing around with the C# String.Intern method and have one question. Suppose I have a program that reads a text file line by line and adds these lines to a list of strings. Let's assume that this file consists of thousands of lines of the same string. If the text file is big enough, I can see that my program consumes a decent amount of RAM. If I then use String.Intern when I add lines to my list, memory consumption drops significantly, which means that string interning works fine. Next I want to check how many strings my dotnet process holds, using ProcessHacker. But whether I use String.Intern or not, ProcessHacker shows the same huge number of duplicate strings. I expected it to show only one instance of the string, since I use String.Intern.
What do I miss?
static void Main(string[] args)
{
    List<string> list = new List<string>();
    string filePath = @"C:\Users\User\Desktop\1.txt";
    using (var fileStream = File.OpenRead(filePath))
    {
        using (var streamReader = new StreamReader(fileStream, Encoding.UTF8))
        {
            String line;
            while ((line = streamReader.ReadLine()) != null)
            {
                list.Add(line);
                //list.Add(String.Intern(line));
            }
        }
    }
}
Every call to streamReader.ReadLine() creates a new string. That string will eventually be garbage collected, but until the GC runs it still exists in memory. Your memory consumption can drop because String.Intern returns the system's reference to the string if it is already interned, and otherwise a new reference to a string with the same value. Your list then consists of references to the one interned instance, which makes the strings created by streamReader.ReadLine() available for GC.
var str = "Test"; // compile time constant will be interned by default
var str1 = new string(str.ToCharArray()); // simulate reading a string
Console.WriteLine(object.ReferenceEquals(str1, string.Intern(str1))); // prints false

Why is there still a reference onto this string?

I was toying around with WeakReference and WeakReference<T>. They only work with classes (obviously, since they hold references), so I tried an example with a string (string is a class in .NET).
When I ran the following code snippet, it didn't give the result I expected: the WeakReference still contained the string.
string please = "wat";
WeakReference<string> test = new WeakReference<string>(please);
string testresult;
please = null;
GC.Collect();
bool worked = test.TryGetTarget(out testresult);
Console.WriteLine("it is " + worked);
result: "it is true"
Now when I created a simple wrapper class around the string:
class TestWeakStuff
{
public string Test { get; set; }
}
and used it instead of the string, it did return my expected result:
TestWeakStuff testclass = new TestWeakStuff() { Test = "wat" };
WeakReference<TestWeakStuff> test2 = new WeakReference<TestWeakStuff>(testclass);
TestWeakStuff testresult2;
testclass = null;
GC.Collect();
bool worked2 = test2.TryGetTarget(out testresult2);
Console.WriteLine("2nd time is " + worked2);
Result: "2nd time is false"
I tried the same with the non generic WeakReference class, and the result is the same.
Why is the String not being claimed by the garbage collector?
(GC.Collect() does claim all generations, external GC call is with -1 (all generations))
String literals are not a good candidate to test GC behavior. String literals are added to the intern pool on the CLR. This causes only one object for each distinct string literal to live in memory. This is an optimization. Strings in the intern pool are referenced forever and never collected.
Strings are not an ordinary class. They are intrinsics to the runtime.
You should be able to test it with new string('x', 10), which creates a new object each time. This is guaranteed. That guarantee is sometimes relied on by unsafe code that writes to a string before publishing it to other code. It can be used with native code as well.
It's probably best to drop testing strings entirely. The results you obtain are not particularly interesting or guaranteed to remain stable across runtime changes.
You could test with new object() which would be the simplest way to test it.
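The suggestion can be sketched as follows. The class and method names are made up for illustration, and the outcome for the fresh object is deliberately not stated: it depends on the runtime, the build configuration and the JIT, which is exactly why the answer calls such tests unstable. The targets are created in separate non-inlined methods so that no local in Main keeps them alive.

```csharp
using System;
using System.Runtime.CompilerServices;

public static class WeakReferenceDemo
{
    // Hypothetical helper: the only strong reference to the new object
    // lives inside this method and is gone once it returns.
    [MethodImpl(MethodImplOptions.NoInlining)]
    public static WeakReference<object> TrackFreshObject()
    {
        return new WeakReference<object>(new object());
    }

    // Hypothetical helper: tracks a string literal, which the runtime
    // keeps reachable through the intern pool.
    [MethodImpl(MethodImplOptions.NoInlining)]
    public static WeakReference<string> TrackLiteral()
    {
        return new WeakReference<string>("wat");
    }

    public static void Main()
    {
        WeakReference<object> fresh = TrackFreshObject();
        WeakReference<string> literal = TrackLiteral();

        GC.Collect();
        GC.WaitForPendingFinalizers();
        GC.Collect();

        // Usually reports that the fresh object is gone, but this is not guaranteed.
        Console.WriteLine("fresh object alive: " + fresh.TryGetTarget(out _));
        // The literal stays alive because the intern pool references it.
        Console.WriteLine("literal alive: " + literal.TryGetTarget(out _));
    }
}
```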

Performance considerations for String and StringBuilder - C#

All,
For the string s = "abcd", does string w = s.Substring(2) return a newly allocated String object (i.e. string w = new String("cd") internally) or a string literal?
For StringBuilder, when appending string values and if the size of the StringBuilder needs to be increased, are all the contents copied over to a new memory location or simply the pointers to each of the earlier String value are reassigned to the new location?
String is immutable, so any operation that "changes" the string will in effect return a new string. This includes Substring and all other operations on String, including those that do not change the length (such as ToLower() or similar).
StringBuilder contains internally a linked list of chunks of characters. When it needs to grow, a new chunk is allocated and inserted at the end of the list, and data is copied here. In other words, the whole StringBuilder buffer will not be copied on an append, only the data you are appending. I double-checked this against the Framework 4 reference sources.
For the string s = "abcd", does string w = s.Substring(2) return a newly allocated String object? Yes.
For StringBuilder, when appending string values and if the size of the StringBuilder needs to be increased, are all the contents copied over to a new memory location? Yes.
Any change to a String, small or large, results in a new String.
If you are going to make a large number of edits to a string, it is better to do this via StringBuilder.
From MSDN:
You can use the StringBuilder class instead of the String class for operations that make multiple changes to the value of a string. Unlike instances of the String class, StringBuilder objects are mutable; when you concatenate, append, or delete substrings from a string, the operations are performed on a single string.
Strings are immutable objects so every time you had to make changes you create a new instance of that string. The substring method does not change the value of the original string.
Regards.
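A small sketch of the point made in the answers above, using only String.Substring and object.ReferenceEquals. The class name is made up for illustration.

```csharp
using System;

public static class SubstringDemo
{
    public static void Main()
    {
        string s = "abcd";
        string w = s.Substring(2);

        Console.WriteLine(w); // cd
        Console.WriteLine(s); // abcd (the original string is untouched)

        // Two calls with the same arguments give equal contents...
        Console.WriteLine(s.Substring(2) == w);
        // ...but each call allocates its own String object.
        Console.WriteLine(object.ReferenceEquals(s.Substring(2), w));
    }
}
```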
The difference between String and StringBuilder is an important concept, and it matters when an application has to edit a large number of strings.
String
The String object is a sequential collection of UTF-16 code units, each represented by a System.Char object; both types belong to the System namespace. Since the value of these objects is read-only, the String object is defined as immutable. The maximum size of a String object in memory is 2 GB, or about 1 billion characters.
Immutable
Being immutable means that every time a System.String method appears to modify a string, a new string object is created in memory, which requires a new allocation of space for that object.
Example:
When you use the string concatenation operator +=, it appears that the value of the string variable named test changes. In fact, the operator creates a new String object, which has a different value and address from the original, and assigns it to the test variable.
string test = string.Empty;
test += "red"; // a new string object is created
test += "coding"; // a new string object is created
test += "planet"; // a new string object is created
StringBuilder
StringBuilder is a dynamic object that belongs to the System.Text namespace and lets you modify the number of characters in the string it encapsulates; this characteristic is called mutability.
Mutability
To be able to append, remove, replace or insert characters, a StringBuilder maintains a buffer to accommodate expansions to the string. New data is appended to the buffer if room is available; otherwise, a new, larger buffer is allocated, data from the original buffer is copied to the new buffer, and the new data is then appended to the new buffer.
StringBuilder sb = new StringBuilder("");
sb.Append("red");
sb.Append("blue");
sb.Append("green ");
string colors = sb.ToString();
Performance
In order to help you better understand the performance difference between String and StringBuilder, I created the following example:
Stopwatch timer = new Stopwatch();
string str = string.Empty;
timer.Start();
for (int i = 0; i < 10000; i++) {
    str += i.ToString();
}
timer.Stop();
Console.WriteLine("String : {0}", timer.Elapsed);
timer.Restart();
StringBuilder sbr = new StringBuilder(string.Empty);
for (int i = 0; i < 10000; i++) {
    sbr.Append(i.ToString());
}
timer.Stop();
Console.WriteLine("StringBuilder : {0}", timer.Elapsed);
Output:
String : 00:00:00.0706661
StringBuilder : 00:00:00.0012373

How to optimize memory usage in this algorithm?

I'm developing a log parser, and I'm reading text files of more than 150MB. This is my approach. Is there any way to optimize what is in the while loop? The problem is that it consumes a lot of memory. I also tried a StringBuilder and faced the same memory consumption.
private void ReadLogInThread()
{
    string lineOfLog = string.Empty;
    try
    {
        StreamReader logFile = new StreamReader(myLog.logFileLocation);
        InformationUnit infoUnit = new InformationUnit();
        infoUnit.LogCompleteSize = myLog.logFileSize;
        while ((lineOfLog = logFile.ReadLine()) != null)
        {
            myLog.transformedLog.Add(lineOfLog); //list<string>
            myLog.logNumberLines++;
            infoUnit.CurrentNumberOfLine = myLog.logNumberLines;
            infoUnit.CurrentLine = lineOfLog;
            infoUnit.CurrentSizeRead += lineOfLog.Length;
            if (onLineRead != null)
                onLineRead(infoUnit);
        }
    }
    catch { throw; }
}
Thanks in advance!
EXTRA:
I'm saving each line because, after reading the log, I will need to check every stored line for some information. The language is C#.
Memory savings can be achieved if your log lines can actually be parsed into a data row representation.
Here is a typical log line I can think of:
Event at: 2019/01/05:0:24:32.435, Reason: Operation, Kind: DataStoreOperation, Operation Status: Success
This line takes about 200 bytes in memory.
By contrast, the following representation takes only about 16 bytes:
enum LogReason { Operation, Error, Warning }
enum EventKind : short { DataStoreOperation, DataReadOperation }
enum OperationStatus : short { Success, Failed }
struct LogRow
{
    public DateTime EventTime;
    public LogReason Reason;
    public EventKind Kind;
    public OperationStatus Status;
}
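To make the row representation concrete, here is a hedged sketch of parsing the sample line into such a row. The parser class, the ValueOf helper and the timestamp format string are assumptions made for this example; a real log would need error handling and possibly a different format.

```csharp
using System;
using System.Globalization;

public enum LogReason { Operation, Error, Warning }
public enum EventKind : short { DataStoreOperation, DataReadOperation }
public enum OperationStatus : short { Success, Failed }

public struct LogRow
{
    public DateTime EventTime;
    public LogReason Reason;
    public EventKind Kind;
    public OperationStatus Status;
}

public static class LogRowParser
{
    // Hypothetical helper: returns the text after the first ": " of a "Name: value" field.
    private static string ValueOf(string field)
    {
        return field.Substring(field.IndexOf(": ", StringComparison.Ordinal) + 2);
    }

    // Assumes exactly the four ", "-separated fields of the sample line, in that order.
    public static LogRow Parse(string line)
    {
        string[] fields = line.Split(new[] { ", " }, StringSplitOptions.None);
        return new LogRow
        {
            // Assumed timestamp layout, taken from the sample line only.
            EventTime = DateTime.ParseExact(ValueOf(fields[0]), "yyyy/MM/dd:H:mm:ss.fff", CultureInfo.InvariantCulture),
            Reason = (LogReason)Enum.Parse(typeof(LogReason), ValueOf(fields[1])),
            Kind = (EventKind)Enum.Parse(typeof(EventKind), ValueOf(fields[2])),
            Status = (OperationStatus)Enum.Parse(typeof(OperationStatus), ValueOf(fields[3])),
        };
    }

    public static void Main()
    {
        LogRow row = Parse("Event at: 2019/01/05:0:24:32.435, Reason: Operation, Kind: DataStoreOperation, Operation Status: Success");
        Console.WriteLine(row.EventTime.ToString("o", CultureInfo.InvariantCulture));
        Console.WriteLine(row.Reason + " / " + row.Kind + " / " + row.Status);
    }
}
```

Storing a List<LogRow> instead of a List<string> keeps the rows inline in the list's backing array, so there is no per-line object header or character data.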
Another possible optimization is to parse each line into an array of string tokens; this way you can make use of string interning.
For example, if the word "DataStoreOperation" takes 36 bytes (18 characters at 2 bytes each) and occurs 1000000 times in the file, then replacing each copy with a 4-byte reference to one interned string saves (18*2 - 4) * 1000000 = 32 000 000 bytes.
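A rough sketch of the token-interning idea, assuming tokens are separated by single spaces. The class and method names are made up for illustration.

```csharp
using System;

public static class TokenInterning
{
    // Splits a line on spaces and replaces each token with its interned
    // instance, so repeated words across lines share one String object.
    public static string[] SplitAndIntern(string line)
    {
        string[] tokens = line.Split(' ');
        for (int i = 0; i < tokens.Length; i++)
        {
            tokens[i] = string.Intern(tokens[i]);
        }
        return tokens;
    }

    public static void Main()
    {
        string[] first = SplitAndIntern("Kind: DataStoreOperation");
        string[] second = SplitAndIntern("Kind: DataStoreOperation");

        // True: both arrays point at the same interned instance.
        Console.WriteLine(object.ReferenceEquals(first[1], second[1]));
    }
}
```

Keep in mind that interned strings stay in the pool for the lifetime of the process, so this fits a small vocabulary of repeated tokens better than unique values such as timestamps.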
Try to make your algorithm sequential.
Using an IEnumerable instead of a List helps playing nice with memory, while keeping same semantic as working with a list, if you don't need random access to lines by index in the list.
IEnumerable<string> ReadLines()
{
    // ...
    while ((lineOfLog = logFile.ReadLine()) != null)
    {
        yield return lineOfLog;
    }
}
//...
foreach (var line in ReadLines())
{
    ProcessLine(line);
}
I am not sure if it will fit your project but you can store the result in StringBuilder instead of strings list.
For example, this process on my machine takes 250MB memory after loading (file is 50MB):
static void Main(string[] args)
{
using (StreamReader streamReader = File.OpenText("file.txt"))
{
var list = new List<string>();
string line;
while (( line=streamReader.ReadLine())!=null)
{
list.Add(line);
}
}
}
On the other hand, this code process will take only 100MB:
static void Main(string[] args)
{
var stringBuilder = new StringBuilder();
using (StreamReader streamReader = File.OpenText("file.txt"))
{
string line;
while (( line=streamReader.ReadLine())!=null)
{
stringBuilder.AppendLine(line);
}
}
}
Memory usage keeps going up because you're simply adding the lines to a List<string> that grows constantly. If you want to use less memory, one thing you can do is write the data to disk rather than keeping it in memory. Of course, this will slow things down considerably.
Another option is to compress the string data as you're storing it to your list, and decompress it coming out but I don't think this is a good method.
Side Note:
You need to add a using block around your streamreader.
using (StreamReader logFile = new StreamReader(myLog.logFileLocation))
Consider this implementation: (I'm speaking C/C++, substitute C# as needed)
Use fseek/ftell to find the size of the file.
Use malloc to allocate a chunk of memory the size of the file + 1.
Set that last byte to '\0' to terminate the string.
Use fread to read the entire file into the memory buffer.
You now have a char * which holds the contents of the file as a string.
Create a vector of const char * to hold pointers to the positions in memory where each line can be found. Initialize the first element of the vector to the first byte of the memory buffer.
Find the carriage control characters (probably \r\n). Replace the \r by \0 to make the line a string. Increment past the \n. This new pointer location is pushed back onto the vector.
Repeat the above until all of the lines in the file have been NUL terminated and are pointed to by elements in the vector.
Iterate through the vector as needed to investigate the contents of each line, in your business-specific way.
When you are done, close the file, free the memory, and continue happily along your way.
1) Compress the strings before you store them (ie see System.IO.Compression and GZipStream). This would probably kill the performance of your program though since you'd have to uncompress to read each line.
2) Remove any extra white space characters or common words you can do without. ie if you can understand what the log is saying with the words "the, a, of...", remove them. Also, shorten any common words (ie change "error" to "err" and "warning" to "wrn"). This would slow down this step in the process but shouldn't affect performance of the rest.
What encoding is your original file? If it is ascii then just the strings alone are going to take over 2x the size of the file just to load up into your array. A C# character is 2 bytes and a C# string adds an extra 20 bytes per string in addition to the characters.
In your case, since it is a log file, you can probably exploit the fact that there is a lot of repetition in the messages. You can most likely parse the incoming line into a data structure that reduces the memory overhead. For example, if you have a timestamp in the log file, you can convert it to a DateTime value, which is 8 bytes. Even a short timestamp of 1/1/10 would add 12 bytes to the size of a string, and a timestamp with time information would be even longer. Other tokens in your log stream might be turned into a code or an enum in a similar manner.
Even if you have to leave the value as a string, if you can break it down into pieces that are used a lot, or remove boilerplate that is not needed at all, you can probably cut down on your memory usage. If there are a lot of common strings, you can intern them and pay for only one string no matter how many occurrences you have.
If you must store the raw data, and assuming that your logs are mostly ASCII, then you can save some memory by storing UTF8 bytes internally. Strings are UTF16 internally, so you're storing an extra byte for each character. So by switching to UTF8 you're cutting memory use by half (not counting class overhead, which is still significant). Then you can convert back to normal strings as needed.
static void Main(string[] args)
{
    List<byte[]> strings = new List<byte[]>();
    using (TextReader tr = new StreamReader(@"C:\test.log"))
    {
        string s = tr.ReadLine();
        while (s != null)
        {
            // Store each line as UTF-8 bytes instead of a UTF-16 string
            strings.Add(Encoding.UTF8.GetBytes(s));
            s = tr.ReadLine();
        }
    }
    // Get strings back
    foreach (var str in strings)
    {
        Console.WriteLine(Encoding.UTF8.GetString(str));
    }
}

Why does .NET create new substrings instead of pointing into existing strings?

From a brief look using Reflector, it looks like String.Substring() allocates memory for each substring. Am I correct that this is the case? I thought that wouldn't be necessary since strings are immutable.
My underlying goal was to create a IEnumerable<string> Split(this String, Char) extension method that allocates no additional memory.
One reason why most languages with immutable strings create new substrings rather than refer into existing strings is because this will interfere with garbage collecting those strings later.
What happens if a string is used for its substring, but then the larger string becomes unreachable (except through the substring). The larger string will be uncollectable, because that would invalidate the substring. What seemed like a good way to save memory in the short term becomes a memory leak in the long term.
Not possible without poking around inside .net using String classes. You would have to pass around references to an array which was mutable and make sure no one screwed up.
.Net will create a new string every time you ask it to. Only exception to this is interned strings which are created by the compiler (and can be done by you) which are placed into memory once and then pointers are established to the string for memory and performance reasons.
Each string has to have its own string data, given the way the String class is implemented.
You can make your own SubString structure that uses part of a string:
public struct SubString {
    private string _str;
    private int _offset, _len;

    public SubString(string str, int offset, int len) {
        _str = str;
        _offset = offset;
        _len = len;
    }

    public int Length { get { return _len; } }

    public char this[int index] {
        get {
            // Valid indexes are 0 .. _len - 1
            if (index < 0 || index >= _len) throw new IndexOutOfRangeException();
            return _str[_offset + index];
        }
    }

    public void WriteToStringBuilder(StringBuilder s) {
        // Append(string, startIndex, count) copies only the slice
        s.Append(_str, _offset, _len);
    }

    public override string ToString() {
        return _str.Substring(_offset, _len);
    }
}
You can flesh it out with other methods, such as comparison, that can also be implemented without extracting the string.
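For the Split goal from the question, a similar effect can be sketched with ReadOnlyMemory<char>, which is the framework's own "slice of a string" type on runtimes that ship System.Memory. This swaps in a different technique from the hand-written struct above; the extension method name SplitSlices is made up for illustration. The iterator object itself is still allocated, but no substring is copied.

```csharp
using System;
using System.Collections.Generic;

public static class StringSliceExtensions
{
    // Yields one slice per separator-delimited segment; each slice
    // refers to the characters of the original string.
    public static IEnumerable<ReadOnlyMemory<char>> SplitSlices(this string source, char separator)
    {
        int start = 0;
        for (int i = 0; i <= source.Length; i++)
        {
            if (i == source.Length || source[i] == separator)
            {
                yield return source.AsMemory(start, i - start);
                start = i + 1;
            }
        }
    }
}

public static class Program
{
    public static void Main()
    {
        // Prints the length of each segment: 1, 2, 0, 1
        foreach (ReadOnlyMemory<char> part in "a,bc,,d".SplitSlices(','))
        {
            Console.WriteLine(part.Length);
        }
    }
}
```

As the garbage collection answers in this thread point out, every slice keeps the whole source string reachable, so this trades allocations for potentially longer-lived large strings.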
Because strings are immutable in .NET, every string operation that results in a new string object will allocate a new block of memory for the string contents.
In theory, it could be possible to reuse the memory when extracting a substring, but that would make garbage collection very complicated: what if the original string is garbage-collected? What would happen to the substring that shares a piece of it?
Of course, nothing prevents the .NET BCL team from changing this behavior in future versions of .NET. It wouldn't have any impact on existing code.
Adding to the point that strings are immutable, you should be aware that the following snippet will generate multiple String instances in memory.
String s1 = "Hello", s2 = ", ", s3 = "World!";
String res = s1 + s2 + s3;
s1+s2 => new string instance (temp1)
temp1 + s3 => new string instance (temp2)
res is a reference to temp2.
