I'm playing around with the C# String.Intern method and have one question. Suppose I have a program that reads a text file line by line and adds these lines to a list of strings. Let's assume the file consists of thousands of lines containing the same string. If the text file is big enough, I can see that my program consumes a considerable amount of RAM. If I then use String.Intern when adding lines to the list, memory consumption drops significantly, which suggests that string interning works. Next I want to check how many strings my dotnet process holds using ProcessHacker. But whether I use String.Intern or not, ProcessHacker shows the same huge number of duplicate strings. I expected it to show only one instance of the string, since I use String.Intern.
What am I missing?
static void Main(string[] args)
{
List<string> list = new List<string>();
string filePath = @"C:\Users\User\Desktop\1.txt";
using (var fileStream = File.OpenRead(filePath))
{
using (var streamReader = new StreamReader(fileStream, Encoding.UTF8))
{
String line;
while ((line = streamReader.ReadLine()) != null)
{
list.Add(line);
//list.Add(String.Intern(line));
}
}
}
}
Every call to streamReader.ReadLine() creates a new string. That string will eventually be garbage collected, but until the GC runs it still exists in memory, which is most likely why ProcessHacker still finds the duplicates. Your memory consumption drops because String.Intern returns the system's reference to the string if it is already interned; otherwise it returns a new reference to a string with the same value. Your list then consists of references to the same interned instance, which makes the strings created by streamReader.ReadLine() available for GC.
var str = "Test"; // compile time constant will be interned by default
var str1 = new string(str.ToArray()); // simulate reading string
Console.WriteLine(object.ReferenceEquals(str1, string.Intern(str1))); // prints false
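To see the effect on a list like the one in the question, here is a minimal sketch. It reads from any TextReader so that an in-memory StringReader can stand in for the file; InternDemo, LoadLines and CountSameInstanceAsFirst are illustrative names, not part of the original code.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

static class InternDemo
{
    // Reads every line from the reader into a list, optionally interning each one.
    public static List<string> LoadLines(TextReader reader, bool intern)
    {
        var list = new List<string>();
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            list.Add(intern ? string.Intern(line) : line);
        }
        return list;
    }

    // Counts the entries that are the very same object as the first entry.
    public static int CountSameInstanceAsFirst(List<string> list)
    {
        int count = 0;
        foreach (string s in list)
        {
            if (object.ReferenceEquals(s, list[0])) count++;
        }
        return count;
    }
}
```

Calling LoadLines on a reader over several identical lines with intern set to false should leave every entry as its own object, while with intern set to true all entries should refer to a single instance. The temporary strings produced by ReadLine still exist until the GC collects them.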
Why is the string pointer different each time I run the application when I use StringBuilder, but the same when I declare a variable?
void Main()
{
string str_01 = "my string";
string str_02 = GetString();
unsafe
{
fixed (char* pointerToStr_01 = str_01)
{
fixed (char* pointerToStr_02 = str_02)
{
Console.WriteLine((Int64)pointerToStr_01);
Console.WriteLine((Int64)pointerToStr_02);
}
}
}
}
private string GetString()
{
StringBuilder sb = new StringBuilder();
sb.Append("my string");
return sb.ToString();
}
Output:
40907812
178488268
next time:
40907812
179023248
next time:
40907812
178448964
str_01 holds a reference to a constant string. StringBuilder, however, builds string instances dynamically, so the returned string is not the same instance as the constant string with the same content. System.Object.ReferenceEquals() will return false.
Since the str_01 is a reference to a constant string, its data is probably stored in a data section of the executable, which always gets the same address in the application virtual address space.
Edit:
You can see the "my string" text in UTF-8 encoding when you open the compiled .exe file using PE.Explorer or similar software. It is present in the .data section of the file, including a preferred Virtual Address where the section should be loaded in process virtual memory.
However, I have not been able to reproduce str_01 having the same address on multiple runs of the application, probably because my x64 Windows 8.1 performs address space layout randomization (ASLR). Because of that, all pointers will be different across multiple runs of the application, even those that point directly into loaded PE sections.
Just because two strings are equal doesn't mean they are the same reference (and therefore have the same pointer). C# does not intern all strings automatically, mainly for performance reasons. If you want both strings to have the same pointer, you can intern str_02 using string.Intern.
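A small sketch of that suggestion, assuming the default compiler behavior in which string literals are interned. InternReferenceDemo and its members are hypothetical names used only for illustration.

```csharp
using System;
using System.Text;

static class InternReferenceDemo
{
    // Builds "my string" at run time, so the result is a fresh object
    // rather than the interned literal.
    public static string BuildString()
    {
        var sb = new StringBuilder();
        sb.Append("my string");
        return sb.ToString();
    }

    // True only when both variables refer to the same object.
    public static bool SameInstance(string a, string b)
    {
        return object.ReferenceEquals(a, b);
    }
}
```

Comparing the literal "my string" with BuildString() should report different instances, while comparing it with string.Intern(BuildString()) should report the same instance, so a fixed pointer taken from either would then point to the same memory.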
The fixed statement gives you a pointer to the string's memory.
Because str_01 is a constant string, its memory is set up when the program starts, and the pointer refers to the same location every time:
fixed (char* pointerToStr_01 = str_01)
But in the case of
fixed (char* pointerToStr_02 = str_02)
the memory is allocated dynamically, so the location varies from run to run.
Hence the string pointer differs each time we run the application.
I do not agree that the output of
Console.WriteLine((Int64)pointerToStr_01);
is always the same for you; I tested it myself to make sure.
Let's look at both cases:
In the case of string str_01 = "my string", the printed pointer value will not be the same as in the previous run, because a new String object is created on every run and "my string" is assigned to it. The pointer printed inside the fixed statement is only valid within that statement, and its value is not remembered when you execute the program again.
By now you can probably explain the behavior of StringBuilder yourself.
Also check with:
string str_01 = GetString();
private static string GetString()
{
var sb = new String(new char[] {'m','y',' ','s','t','r','i','n','g'});
return sb;
}
I have a text file into which I am trying to insert a line. Using my linked list, I believe I can avoid having to take all the data out, sort it, and then write it to a new text file.
What I did was come up with the code below. I set my bools, but it is still not working. I went through it in the debugger, and what seems to be going on is that it goes through the entire list (about 10,000 lines) and never finds the condition to be true, so it does not insert my record.
What is wrong with this code?
List<string> lines = new List<string>(File.ReadAllLines("Students.txt"));
using (StreamReader inFile = new StreamReader("Students.txt", true))
{
string newLastName = "'Constant";
string newRecord = "(LIST (LIST 'Constant 'Malachi 'D ) '1234567890 'mdcant#mail.usi.edu 4.000000 )";
string line;
string lastName;
bool insertionPointFound = false;
for (int i = 0; i < lines.Count && !insertionPointFound; i++)
{
line = lines[i];
if (line.StartsWith("(LIST (LIST "))
{
values = line.Split(" ".ToCharArray());
lastName = values[2];
if (newLastName.CompareTo(lastName) < 0)
{
lines.Insert(i, newRecord);
insertionPointFound = true;
}
}
}
if (!insertionPointFound)
{
lines.Add(newRecord);
}
You're just reading the file into memory and not committing it anywhere.
I'm afraid that you're going to have to load and completely re-write the entire file. Files support appending, but they don't support insertions.
You can write to a file the same way that you read from it:
string[] lines;
// instantiate and build `lines`
File.WriteAllLines("path", lines);
WriteAllLines also takes an IEnumerable<string>, so you can pass a List<string> to it if you want.
One more issue: it appears that you're reading your file twice, once with ReadAllLines and again with your StreamReader.
There are at least four possible errors.
Opening the StreamReader is not required; you have already read all the lines. (Well, not really an error, but...)
The check for StartsWith can be fooled if your lines start with blank space, and you will miss the insertion point. (Adding a Trim removes the problem.)
In the CompareTo line you check for < 0, but you should check for == 0. CompareTo returns 0 if the strings are equivalent, however...
To check whether two strings are equal, you should avoid CompareTo, as explained in the MSDN link above, and use string.Equals instead.
List<string> lines = new List<string>(File.ReadAllLines("Students.txt"));
string newLastName = "'Constant";
string newRecord = "(LIST (LIST 'Constant 'Malachi 'D ) '1234567890 'mdcant#mail.usi.edu 4.000000 )";
string line;
string lastName;
bool insertionPointFound = false;
for (int i = 0; i < lines.Count && !insertionPointFound; i++)
{
line = lines[i].Trim();
if (line.StartsWith("(LIST (LIST "))
{
string[] values = line.Split(' '); // declare the array; it was never declared in the original snippet
lastName = values[2];
if (newLastName.Equals(lastName))
{
lines.Insert(i, newRecord);
insertionPointFound = true;
}
}
}
if (!insertionPointFound)
lines.Add(newRecord);
I don't list as an error the missing write back to the file. Hope that you have just omitted that part of the code. Otherwise it is a very simple problem.
(However I think that the way in which CompareTo is used is probably the main reason of your problem)
EDIT: Looking at your comment below, it seems that the answer from Sam I Am is the right one for you. Of course you need to write back the modified list of lines. All the changes are made to an in-memory list of lines, and nothing is written back to the file unless you have code that writes it. However, you don't need a new file:
File.WriteAllLines("Students.txt", lines);
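Putting the pieces from both answers together, here is one possible sketch. It assumes the goal is to keep the records sorted by last name, as the question's CompareTo(...) < 0 check suggests, and it uses ordinal comparison, which is an assumption about the desired sort order. RecordInserter, InsertSorted and InsertIntoFile are illustrative names.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

static class RecordInserter
{
    const string Prefix = "(LIST (LIST ";

    // Inserts newRecord before the first record whose last name sorts after
    // newLastName; appends it when no such record exists. Returns the index used.
    public static int InsertSorted(List<string> lines, string newRecord, string newLastName)
    {
        for (int i = 0; i < lines.Count; i++)
        {
            string line = lines[i].Trim();
            if (!line.StartsWith(Prefix, StringComparison.Ordinal)) continue;

            string[] values = line.Split(' ');
            string lastName = values[2];
            if (string.CompareOrdinal(newLastName, lastName) < 0)
            {
                lines.Insert(i, newRecord);
                return i;
            }
        }
        lines.Add(newRecord);
        return lines.Count - 1;
    }

    // Loads the file, inserts the record in memory, then rewrites the whole file.
    public static void InsertIntoFile(string path, string newRecord, string newLastName)
    {
        var lines = new List<string>(File.ReadAllLines(path));
        InsertSorted(lines, newRecord, newLastName);
        File.WriteAllLines(path, lines);
    }
}
```

The in-memory step is kept separate from the file I/O so that the insertion logic can be checked on a plain list before any file is overwritten.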
I want to know how returning strings from functions works in C#. In one of my functions I generate HTML code, and the string is really huge; I then return it from the function and insert it into the page. Should I pass a huge string as a return value, or just insert it into the page from the same function?
When C# returns a string, does it create a new string from the old one, and return that?
Thanks.
Strings (or any other reference type) are not copied when returning from a function, only value types are.
System.String is a reference type (class) and so passing as parameter and returning only involve the copying of a reference (32 or 64 bits).
The size of the string is not relevant.
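A quick way to check this yourself is sketched below. ReturnDemo and its members are hypothetical names; the method keeps a reference to the string it returns so the caller's copy can be compared against it.

```csharp
using System;

static class ReturnDemo
{
    static string cachedPage;

    // Builds a large string once and returns a reference to it.
    public static string BuildPage()
    {
        cachedPage = new string('x', 1000000);
        return cachedPage;
    }

    // True when the caller received the very same object that was built.
    public static bool ReturnedSameObject(string returned)
    {
        return object.ReferenceEquals(returned, cachedPage);
    }
}
```

Passing the result of BuildPage() to ReturnedSameObject should report true: the caller holds a reference to the same object, and no second copy of the million characters is made.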
Returning a string is a cheap operation - as mentioned it's purely a matter of returning 32 or 64 bits (4 or 8 bytes).
However, as Sten Petrov points out string + operations involve the creation of a new string, and can be a little expensive. If you wanted to save performance & memory I'd suggest doing something like this:
static int i = 0;
static void Main(string[] args)
{
while (Console.ReadLine() == "")
{
var pageSB = new StringBuilder();
foreach (var section in new[] { AddHeader(), AddContent(), AddFooter() })
for (int i = 0; i < section.Length; i++)
pageSB.Append(section[i]);
Console.Write(pageSB.ToString());
}
}
static StringBuilder AddHeader()
{
return new StringBuilder().Append("Hi ").AppendLine("World");
}
static StringBuilder AddContent()
{
return new StringBuilder()
.AppendFormat("This page has been viewed: {0} times\n", ++i);
}
static StringBuilder AddFooter()
{
return new StringBuilder().Append("Bye ").AppendLine("World");
}
Here we use the StringBuilders to hold a reference to all the strings we want to concat, and wait until the very end before joining them together. This'll save many unnecessary additions (which are memory and CPU heavy in comparison).
Of course, I doubt you'll actually see any need for this in practise - and if you do I'd spend some time learning about pooling etc. to help reduce the garbage created by all the string builders - and maybe consider creating a custom 'string holder' that suits your purposes better.
All,
For the string string s = "abcd", does string w = s.Substring(2) return a newly allocated String object (i.e. string w = new String("cd") internally) or a String literal?
For StringBuilder, when appending string values and the size of the StringBuilder needs to be increased, are all the contents copied over to a new memory location, or are the pointers to each of the earlier String values simply reassigned to the new location?
String is immutable, so any operation that "changes" the string will in effect return a new string. This includes Substring and all other such operations on String, including those that do not change the length (such as ToLower() or similar).
StringBuilder contains internally a linked list of chunks of characters. When it needs to grow, a new chunk is allocated and inserted at the end of the list, and data is copied here. In other words, the whole StringBuilder buffer will not be copied on an append, only the data you are appending. I double-checked this against the Framework 4 reference sources.
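The Substring part can be checked with a few lines. This is only a sketch; SubstringDemo and its members are illustrative names.

```csharp
using System;

static class SubstringDemo
{
    // Returns the part of s that starts at the given index.
    public static string Tail(string s, int start)
    {
        return s.Substring(start);
    }

    // True only when both variables refer to the same object.
    public static bool IsSameInstance(string a, string b)
    {
        return object.ReferenceEquals(a, b);
    }
}
```

For s = "abcd", Tail(s, 2) should equal "cd" by value but not be the same instance as the literal "cd", and s itself remains "abcd".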
For the string string s = "abcd", does string w = s.Substring(2) return a newly allocated String object? Yes
For StringBuilder, when appending string values and if the size of the StringBuilder needs to be increased, are all the contents copied over to a new memory location? Yes
Any change to a String, small or large, results in a new String.
If you are going to make a large number of edits to a string, it is better to do this via StringBuilder.
From MSDN:
You can use the StringBuilder class instead of the String class for operations that make multiple changes to the value of a string. Unlike instances of the String class, StringBuilder objects are mutable; when you concatenate, append, or delete substrings from a string, the operations are performed on a single string.
Strings are immutable objects, so every time you make a change you create a new instance of the string. The Substring method does not change the value of the original string.
Regards.
The difference between String and StringBuilder is an important concept that matters when an application has to edit a large number of strings.
String
A String object is a collection of UTF-16 code units, each represented by a System.Char object; both types belong to the System namespace. Since the value of these objects is read-only, the String object as a whole is defined as immutable. The maximum size of a String object in memory is 2 GB, or about 1 billion characters.
Immutable
Being immutable means that every time a System.String method appears to modify a string, a new string object is created in memory, which causes a new allocation of space for the new object.
Example:
When you use the string concatenation operator +=, it appears that the value of the string variable named test changes. In fact, it creates a new String object, which has a different value and address from the original, and assigns it to the test variable.
string test = ""; // must be initialized before += can be used
test += "red"; // a new string object is created
test += "coding"; // a new string object is created
test += "planet"; // a new string object is created
StringBuilder
StringBuilder is a dynamic object that belongs to the System.Text namespace and allows you to modify the number of characters in the string it encapsulates; this characteristic is called mutability.
Mutability
To be able to append, remove, replace, or insert characters, a StringBuilder maintains a buffer to accommodate expansions to the string. New data is appended to the buffer if room is available; otherwise, a new, larger buffer is allocated, data from the original buffer is copied to the new buffer, and the new data is then appended to the new buffer.
StringBuilder sb = new StringBuilder("");
sb.Append("red");
sb.Append("blue");
sb.Append("green ");
string colors = sb.ToString();
Performances
In order to help you better understand the performance difference between String and StringBuilder, I created the following example:
Stopwatch timer = new Stopwatch();
string str = string.Empty;
timer.Start();
for (int i = 0; i < 10000; i++) {
str += i.ToString();
}
timer.Stop();
Console.WriteLine("String : {0}", timer.Elapsed);
timer.Restart();
StringBuilder sbr = new StringBuilder(string.Empty);
for (int i = 0; i < 10000; i++) {
sbr.Append(i.ToString());
}
timer.Stop();
Console.WriteLine("StringBuilder : {0}", timer.Elapsed);
The output is
String : 00:00:00.0706661
StringBuilder : 00:00:00.0012373
I'm developing a log parser, and I'm reading text files of more than 150MB. This is my approach. Is there any way to optimize what is in the while statement? The problem is that it consumes a lot of memory. I also tried a StringBuilder and ran into the same memory consumption.
private void ReadLogInThread()
{
string lineOfLog = string.Empty;
try
{
StreamReader logFile = new StreamReader(myLog.logFileLocation);
InformationUnit infoUnit = new InformationUnit();
infoUnit.LogCompleteSize = myLog.logFileSize;
while ((lineOfLog = logFile.ReadLine()) != null)
{
myLog.transformedLog.Add(lineOfLog); //list<string>
myLog.logNumberLines++;
infoUnit.CurrentNumberOfLine = myLog.logNumberLines;
infoUnit.CurrentLine = lineOfLog;
infoUnit.CurrentSizeRead += lineOfLog.Length;
if (onLineRead != null)
onLineRead(infoUnit);
}
}
catch { throw; }
}
Thanks in advance!
EXTRA:
I'm saving each line because, after reading the log, I will need to check some information on every stored line. The language is C#.
Memory can be saved if your log lines can be parsed into a data row representation.
Here is a typical log line I can think of:
Event at: 2019/01/05:0:24:32.435, Reason: Operation, Kind: DataStoreOperation, Operation Status: Success
This line takes about 200 bytes in memory.
At the same time, the following representation takes only about 16 bytes:
enum LogReason { Operation, Error, Warning }
enum EventKind : short { DataStoreOperation, DataReadOperation }
enum OperationStatus : short { Success, Failed }
struct LogRow
{
    public DateTime EventTime;     // 8 bytes
    public LogReason Reason;       // 4 bytes (int-backed enum)
    public EventKind Kind;         // 2 bytes
    public OperationStatus Status; // 2 bytes
}
Another optimization possibility is to parse each line into an array of string tokens; this way you can make use of string interning.
For example, if the word "DataStoreOperation" takes 36 bytes and occurs 1,000,000 times in the file, the saving (assuming 4-byte references) is (18*2 - 4) * 1000000 = 32 000 000 bytes.
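As a rough sketch of the data row idea, the sample line above could be parsed as follows. The field layout ("Name: value" pairs separated by ", ") and the timestamp format are assumptions read off that single sample line, and LogParser, ValueOf and Parse are illustrative names.

```csharp
using System;
using System.Globalization;

enum LogReason { Operation, Error, Warning }
enum EventKind : short { DataStoreOperation, DataReadOperation }
enum OperationStatus : short { Success, Failed }

struct LogRow
{
    public DateTime EventTime;
    public LogReason Reason;
    public EventKind Kind;
    public OperationStatus Status;
}

static class LogParser
{
    // Separators are quoted so they are matched literally (assumed format).
    const string TimeFormat = "yyyy'/'MM'/'dd':'H':'mm':'ss'.'fff";

    // Returns the text after the first ": " in a "Name: value" field.
    static string ValueOf(string field)
    {
        int idx = field.IndexOf(": ", StringComparison.Ordinal);
        return field.Substring(idx + 2);
    }

    // Converts one log line into a compact LogRow; throws on malformed input.
    public static LogRow Parse(string line)
    {
        string[] fields = line.Split(new[] { ", " }, StringSplitOptions.None);
        var row = new LogRow();
        row.EventTime = DateTime.ParseExact(ValueOf(fields[0]), TimeFormat, CultureInfo.InvariantCulture);
        row.Reason = (LogReason)Enum.Parse(typeof(LogReason), ValueOf(fields[1]));
        row.Kind = (EventKind)Enum.Parse(typeof(EventKind), ValueOf(fields[2]));
        row.Status = (OperationStatus)Enum.Parse(typeof(OperationStatus), ValueOf(fields[3]));
        return row;
    }
}
```

Storing a List<LogRow> instead of a List<string> keeps only the parsed values; real logs would need error handling for lines that do not match the assumed layout.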
Try to make your algorithm sequential.
Using an IEnumerable instead of a List is easier on memory while keeping the same semantics as working with a list, provided you don't need random access to lines by index.
IEnumerable<string> ReadLines()
{
// ...
while ((lineOfLog = logFile.ReadLine()) != null)
{
yield return lineOfLog;
}
}
//...
foreach( var line in ReadLines() )
{
ProcessLine(line);
}
I am not sure if it will fit your project, but you can store the result in a StringBuilder instead of a list of strings.
For example, this process takes 250MB of memory on my machine after loading (the file is 50MB):
static void Main(string[] args)
{
using (StreamReader streamReader = File.OpenText("file.txt"))
{
var list = new List<string>();
string line;
while (( line=streamReader.ReadLine())!=null)
{
list.Add(line);
}
}
}
On the other hand, this code takes only 100MB:
static void Main(string[] args)
{
var stringBuilder = new StringBuilder();
using (StreamReader streamReader = File.OpenText("file.txt"))
{
string line;
while (( line=streamReader.ReadLine())!=null)
{
stringBuilder.AppendLine(line);
}
}
}
Memory usage keeps going up because you're simply adding the lines to a List<string>, which grows constantly. If you want to use less memory, one thing you can do is write the data to disk rather than keeping it in memory. Of course, this will degrade speed considerably.
Another option is to compress the string data as you're storing it to your list, and decompress it coming out but I don't think this is a good method.
Side Note:
You need to add a using block around your streamreader.
using (StreamReader logFile = new StreamReader(myLog.logFileLocation))
Consider this implementation (I'm speaking C/C++; substitute C# as needed):
Use fseek/ftell to find the size of the file.
Use malloc to allocate a chunk of memory the size of the file + 1.
Set that last byte to '\0' to terminate the string.
Use fread to read the entire file into the memory buffer. You now have a char * which holds the contents of the file as a string.
Create a vector of const char * to hold pointers to the positions in memory where each line can be found. Initialize the first element of the vector to the first byte of the memory buffer.
Find the carriage control characters (probably \r\n). Replace the \r by \0 to make the line a string. Increment past the \n. This new pointer location is pushed back onto the vector.
Repeat the above until all of the lines in the file have been NUL terminated and are pointed to by elements in the vector.
Iterate through the vector as needed to investigate the contents of each line, in your business-specific way.
When you are done, close the file, free the memory, and continue happily along your way.
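A loose C# adaptation of the steps above is sketched here. It is not a literal translation: instead of NUL-terminating lines inside a malloc'd buffer, it keeps the whole file as one string and records the start and length of each line. LineIndex is a hypothetical helper name.

```csharp
using System;
using System.Collections.Generic;

sealed class LineIndex
{
    readonly string text;
    readonly List<int> starts = new List<int>();
    readonly List<int> lengths = new List<int>();

    // Scans the text once and records where each line begins and how long it is.
    public LineIndex(string text)
    {
        this.text = text;
        int start = 0;
        while (start < text.Length)
        {
            int end = text.IndexOf('\n', start);
            if (end < 0) end = text.Length;
            int length = end - start;
            // Drop the '\r' of a "\r\n" line ending.
            if (length > 0 && text[start + length - 1] == '\r') length--;
            starts.Add(start);
            lengths.Add(length);
            start = end + 1;
        }
    }

    public int Count { get { return starts.Count; } }

    // Materializes one line as a string only when it is asked for.
    public string GetLine(int index)
    {
        return text.Substring(starts[index], lengths[index]);
    }
}
```

It could be used as new LineIndex(File.ReadAllText(path)). This avoids one object per line, but the file is still held in memory as a single UTF-16 string, so the saving is in per-string overhead rather than in character data.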
1) Compress the strings before you store them (ie see System.IO.Compression and GZipStream). This would probably kill the performance of your program though since you'd have to uncompress to read each line.
2) Remove any extra white space characters or common words you can do without. ie if you can understand what the log is saying with the words "the, a, of...", remove them. Also, shorten any common words (ie change "error" to "err" and "warning" to "wrn"). This would slow down this step in the process but shouldn't affect performance of the rest.
What encoding is your original file? If it is ASCII, then the strings alone are going to take over 2x the size of the file just to load into your array. A C# character is 2 bytes, and a C# string adds an extra 20 bytes per string in addition to the characters.
In your case, since it is a log file, you can probably exploit the fact that there is a lot of repetition in the messages. You can most likely parse the incoming line into a data structure that reduces the memory overhead. For example, if you have a timestamp in the log file, you can convert it to a DateTime value, which is 8 bytes. Even a short timestamp of 1/1/10 would add 12 bytes to the size of a string, and a timestamp with time information would be even longer. Other tokens in your log stream might be turned into a code or an enum in a similar manner.
Even if you have to leave the value as a string, you can probably cut down on memory usage if you can break it into pieces that are used a lot, or remove boilerplate that is not needed at all. If there are a lot of common strings, you can Intern them and only pay for one string no matter how many occurrences you have.
If you must store the raw data, and assuming that your logs are mostly ASCII, then you can save some memory by storing UTF8 bytes internally. Strings are UTF16 internally, so you're storing an extra byte for each character. So by switching to UTF8 you're cutting memory use by half (not counting class overhead, which is still significant). Then you can convert back to normal strings as needed.
static void Main(string[] args)
{
List<Byte[]> strings = new List<byte[]>();
using (TextReader tr = new StreamReader(@"C:\test.log"))
{
string s = tr.ReadLine();
while (s != null)
{
strings.Add(Encoding.UTF8.GetBytes(s)); // encode the UTF-16 string directly as UTF-8 bytes
s = tr.ReadLine();
}
}
// Get strings back
foreach( var str in strings)
{
Console.WriteLine(Encoding.UTF8.GetString(str));
}
}