I'm developing a log parser, and I'm reading files of strings of more than 150MB.- This is my approach, Is there any way to optimize what is in the While statement? The problem is that is consuming a lot of memory.- I also tried with a stringbuilder facing the same memory comsuption.-
private void ReadLogInThread()
{
string lineOfLog = string.Empty;
try
{
StreamReader logFile = new StreamReader(myLog.logFileLocation);
InformationUnit infoUnit = new InformationUnit();
infoUnit.LogCompleteSize = myLog.logFileSize;
while ((lineOfLog = logFile.ReadLine()) != null)
{
myLog.transformedLog.Add(lineOfLog); //list<string>
myLog.logNumberLines++;
infoUnit.CurrentNumberOfLine = myLog.logNumberLines;
infoUnit.CurrentLine = lineOfLog;
infoUnit.CurrentSizeRead += lineOfLog.Length;
if (onLineRead != null)
onLineRead(infoUnit);
}
}
catch { throw; }
}
Thanks in advance!
EXTRA:
Im saving each line because after reading the log I will need to check for some information on every stored line.- The language is C#
Memory economy can be achieved if your log lines are actually can be parsed to a data row representation.
Here is a typical log line i can think of:
Event at: 2019/01/05:0:24:32.435, Reason: Operation, Kind: DataStoreOperation, Operation Status: Success
This line takes 200 bytes in memory.
At the same time, following representation just takes belo 16 bytes:
Enum LogReason { Operation, Error, Warning };
Enum EventKind short { DataStoreOperation, DataReadOperation };
Enum OperationStatus short { Success, Failed };
LogRow
{
DateTime EventTime;
LogReason Reason;
EventKind Kind;
OperationStatus Status;
}
Another optimization possibility is just parsing a line to array of string tokens,
this way you could make use of string interning.
For example, if a word "DataStoreOperation" takes 36 bytes, and if it has 1000000 entiries in the file, the economy is (18*2 - 4) * 1000000 = 32 000 000 bytes.
Try to make your algorithm sequential.
Using an IEnumerable instead of a List helps playing nice with memory, while keeping same semantic as working with a list, if you don't need random access to lines by index in the list.
IEnumerable<string> ReadLines()
{
// ...
while ((lineOfLog = logFile.ReadLine()) != null)
{
yield return lineOfLog;
}
}
//...
foreach( var line in ReadLines() )
{
ProcessLine(line);
}
I am not sure if it will fit your project but you can store the result in StringBuilder instead of strings list.
For example, this process on my machine takes 250MB memory after loading (file is 50MB):
static void Main(string[] args)
{
using (StreamReader streamReader = File.OpenText("file.txt"))
{
var list = new List<string>();
string line;
while (( line=streamReader.ReadLine())!=null)
{
list.Add(line);
}
}
}
On the other hand, this code process will take only 100MB:
static void Main(string[] args)
{
var stringBuilder = new StringBuilder();
using (StreamReader streamReader = File.OpenText("file.txt"))
{
string line;
while (( line=streamReader.ReadLine())!=null)
{
stringBuilder.AppendLine(line);
}
}
}
Memory usage keeps going up because you're simply adding them to a List<string>, constantly growing. If you want to use less memory one thing you can do is to write the data to disk, rather than keeping it in scope. Of course, this will greatly cause speed to degrade.
Another option is to compress the string data as you're storing it to your list, and decompress it coming out but I don't think this is a good method.
Side Note:
You need to add a using block around your streamreader.
using (StreamReader logFile = new StreamReader(myLog.logFileLocation))
Consider this implementation: (I'm speaking c/c++, substitute c# as needed)
Use fseek/ftell to find the size of the file.
Use malloc to allocate a chunk of memory the size of the file + 1;
Set that last byte to '\0' to terminate the string.
Use fread to read the entire file into the memory buffer.
You now have char * which holds the contents of the file as a
string.
Create a vector of const char * to hold pointers to the positions
in memory where each line can be found. Initialize the first element
of the vector to the first byte of the memory buffer.
Find the carriage control characters (probably \r\n) Replace the
\r by \0 to make the line a string. Increment past the \n.
This new pointer location is pushed back onto the vector.
Repeat the above until all of the lines in the file have been NUL
terminated, and are pointed to by elements in the vector.
Iterate though the vector as needed to investigate the contents of
each line, in your business specific way.
When you are done, close the file, free the memory, and continue
happily along your way.
1) Compress the strings before you store them (ie see System.IO.Compression and GZipStream). This would probably kill the performance of your program though since you'd have to uncompress to read each line.
2) Remove any extra white space characters or common words you can do without. ie if you can understand what the log is saying with the words "the, a, of...", remove them. Also, shorten any common words (ie change "error" to "err" and "warning" to "wrn"). This would slow down this step in the process but shouldn't affect performance of the rest.
What encoding is your original file? If it is ascii then just the strings alone are going to take over 2x the size of the file just to load up into your array. A C# character is 2 bytes and a C# string adds an extra 20 bytes per string in addition to the characters.
In your case, since it is a log file, you can probably exploit the fact that there is a lot of repetition in the the messages. You most likely can parse the incoming line into a data structure which reduces the memory overhead. For example, if you have a timestamp in the log file you can convert that to a DateTime value which is 8 bytes. Even a short timestamp of 1/1/10 would add 12 bytes to the size of a string, and a timestamp with time information would be even longer. Other tokens in your log stream might be able to be turned into a code or an enum in a similar manner.
Even if you have the leave the value as a string, if you can break it down into pieces that are used a lot, or remove boilerplate that is not needed at all you can probably cut down on your memory usage. If there are a lot of common strings you can Intern them and only pay for 1 string no matter how many you have.
If you must store the raw data, and assuming that your logs are mostly ASCII, then you can save some memory by storing UTF8 bytes internally. Strings are UTF16 internally, so you're storing an extra byte for each character. So by switching to UTF8 you're cutting memory use by half (not counting class overhead, which is still significant). Then you can convert back to normal strings as needed.
static void Main(string[] args)
{
List<Byte[]> strings = new List<byte[]>();
using (TextReader tr = new StreamReader(#"C:\test.log"))
{
string s = tr.ReadLine();
while (s != null)
{
strings.Add(Encoding.Convert(Encoding.Unicode, Encoding.UTF8, Encoding.Unicode.GetBytes(s)));
s = tr.ReadLine();
}
}
// Get strings back
foreach( var str in strings)
{
Console.WriteLine(Encoding.UTF8.GetString(str));
}
}
Related
This is a common question but I hope this does not get tagged as a duplicate since the nature of the question is different (please read the whole not only the title)
Unaware of the existence of String.Replace I wrote the following:
int theIndex = 0;
while ((theIndex = message.IndexOf(separationChar, theIndex)) != -1) //we found the character
{
theIndex++;
if (theIndex < message.Length)//not in the last position
{
message = message.Insert(theIndex, theTime);
}
else
{
// I dont' think this is really neccessary
break;
}
} //while finding characters
As you can see I am replacing occurrences of separationChar in the message String with a String called "theTime".
Now, this works ok for small strings but I have been given a really huge String (in the order of several hundred Kbytes- by the way is there a limit for String or StringBuilder??) and it takes a lot of time...
So my questions are:
1) Is it more efficient if I just do
oldString=separationChar.ToString();
newString=oldString.Insert(theTime);
message= message.Replace(oldString,newString);
2) Is there any other way I can process very long Strings to insert a String (theTime) when finding some char in a very fast and efficient way??
Thanks a lot
As Danny already mentioned, string.Insert() actually creates a new instance each time you use it, and these also have to be garbage collected at some point.
You could instead start with an empty StringBuilder to construct the result string:
public static string Replace(this string str, char find, string replacement)
{
StringBuilder result = new StringBuilder(str.Length); // initial capacity
int pointer = 0;
int index;
while ((index = str.IndexOf(find, pointer)) >= 0)
{
// Append the unprocessed data up to the character
result.Append(str, pointer, index - pointer);
// Append the replacement string
result.Append(replacement);
// Next unprocessed data starts after the character
pointer = index + 1;
}
// Append the remainder of the unprocessed data
result.Append(str, pointer, str.Length - pointer);
return result.ToString();
}
This will not cause a new string to be created (and garbage collected) for each occurrence of the character. Instead, when the internal buffer of the StringBuilder is full, it will create a new buffer chunk "of sufficient capacity". Quote from reference source, when its buffer is full:
Compute the length of the new block we need
We make the new chunk at least big enough for the current need (minBlockCharCount), but also as big as the current length (thus doubling capacity), up to a maximum
(so we stay in the small object heap, and never allocate really big chunks even if
the string gets really big).
Thank you for answering my question.
I am writing an answer because I have to report that I tried the solution in my question 1) and it is indeed more efficient according to the results of my program. String.Replace can replace a string(from a char) with another string very fast.
oldString=separationChar.ToString();
newString=oldString.Insert(theTime);
message= message.Replace(oldString,newString);
I know in C# we can always get the sub-array of a given array by using Array.Copy() method. However, this will consume more memory and processing time which is unnecessary in read-only situation. For example, I'm writing a heavy load network program which exchanges messages with other nodes in the cluster very frequently. The first 20 bytes of every message is the message header while the rest bytes make up the message body. Therefore, I will divide the received raw message into header byte array and body byte array in order to process them separately. However, this will obviously consume double memory and extra time. In C, we can easily use a pointer and assign offset to it to access different parts of the array.
For instance, in C language, if we have a char a[] = "ABCDEFGHIJKLMN", we can declare a char* ptr = a + 3 to represent the array DEFGHIJKLMN.
Is there a way to accomplish this in C#?
You might be interested in ArraySegments or unsafe.
ArraySegments delimits a section of a one-dimensional array.
Check ArraySegments in action
ArraySegments usage example:
int[] array = { 10, 20, 30 };
ArraySegment<int> segment = new ArraySegment<int>(array, 1, 2);
// The segment contains offset = 1, count = 2 and range = { 20, 30 }
Unsafe define an unsafe context in which pointers can be used.
Unsafe usage example:
int[] a = { 4, 5, 6, 7, 8 };
unsafe
{
fixed (int* c = a)
{
// use the pointer
}
}
First of all you must consider this as a premature optimization.
But you may use several ways to reduce memory consumption, if you sure you really need it:
1) You may use Flyweight pattern https://en.wikipedia.org/wiki/Flyweight_pattern to pool duplicated resources.
2) You may try to use unsafe directive and manual pointer management.
3) You may just switch to C for this functionality and just call native code from your C# program.
From my experience memory consumption for short-lived objects is not a big problem and I'd just write code with flyweight pattern and profile application afterwards.
Assuming you have a Message wrapper class in C#? Why not just add a property on it called header that returns the top 20 bytes.
You can easily accomplish this using skip and take suggested by Jonathon Reinhart above if you have the entire initial array in a memory array, but it sounds like you may have it in a network stream, which means the property might be a little more involved by doing a read of the initial 20 bytes from the the stream.
Something along the lines of:
class Message
{
private readonly Stream _stream;
private byte[] _inMemoryBytes;
public Message(Stream stream)
{
_stream = stream;
}
public IEnumerable<byte> Header
{
get
{
if (_inMemoryBytes.Length >= 20)
return _inMemoryBytes.Take(20);
_stream.Read(_inMemoryBytes, 0, 20);
return _inMemoryBytes.Take(20);
}
}
public IEnumerable<byte> FullMessage
{
get
{
// Read and return the whole message. You might want amend to data already read.
}
}
}
I am downloading data from a site and the site gives the data to me in very large blocks. Within the very large block, there are "chunks" that I need to parse individually. These "chunks" begin with "(ClinicalData)" and end with "(/ClinicalData)". Therefore, an example string would look something like:
(ClinicalData)(ID="1")(/ClinicalData)(ClinicalData)(ID="2")(/ClinicalData)(ClinicalData)(ID="3")(/ClinicalData)(ClinicalData)(ID="4")(/ClinicalData)(ClinicalData)(ID="5")(/ClinicalData)
Under "ideal" circumstances, the block is meant to be one-single line of data, however sometimes there are erroneous newline characters. Since I want to parse the (ClinicalData) chunks within the block, I want to make my data parse-able line-by-line. Therefore, I take the text file, read it all into a StringBuilder, remove new-lines (just in case), and then insert my own newlines, that way I can read line-by-line.
StringBuilder dataToWrite = new StringBuilder(File.ReadAllText(filepath), Int32.MaxValue);
// Need to clear newline characters just in case they exist.
dataToWrite.Replace("\n", "");
// set my own newline characters so the data becomes parse-able by line
dataToWrite.Replace("<ClinicalData", "\n<ClinicalData");
// set the data back into a file, which is then used in a StreamReader to parse by lines.
File.WriteAllText(filepath, dataToWrite.ToString());
This has been working out great (albeit maybe not efficient, but at least it is friendly to me :)), until I have not encountered a chunk of data that is being given to me as a 280MB large file.
Now I am getting a System.OutOfMemoryException with this block and I just cannot figure out a way around it. I believe the issue is that StringBuilder cannot handle 280MB of straight text? Well, I have tried string splits, regex.match splits, and various other ways to break it into guaranteed "(ClinicalData) chunks, but I continue to get the memory exception. I have also had no luck in attempting to read pre-defined chunks (e.g.: using .ReadBytes).
Any suggestions on how to handle a 280MB large, potentially-but-might-not-actually-be single line of text would be great!
That's an extremely inefficient way to read a text file, let alone a large one. If you only need one pass, replacing or adding individual characters, you should use a StreamReader. If you only need one character of lookahead you only need to maintain a single intermediate state, something like:
enum ReadState
{
Start,
SawOpen
}
using (var sr = new StreamReader(#"path\to\clinic.txt"))
using (var sw = new StreamWriter(#"path\to\output.txt"))
{
var rs = ReadState.Start;
while (true)
{
var r = sr.Read();
if (r < 0)
{
if (rs == ReadState.SawOpen)
sw.Write('<');
break;
}
char c = (char) r;
if ((c == '\r') || (c == '\n'))
continue;
if (rs == ReadState.SawOpen)
{
if (c == 'C')
sw.WriteLine();
sw.Write('<');
rs = ReadState.Start;
}
if (c == '<')
{
rs = ReadState.SawOpen;
continue;
}
sw.Write(c);
}
}
First off, I don't think you need to put all the text in a StringBuilder, since you aren't even concatenating parts to it. You could just try the following:
File.ReadAllText(filepath).Replace("\n", "").Replace("<ClinicalData", "\n<ClinicalData");
Why not try a StreamReader for this task? You can pick a "chunk" size that you want to read by and then split up those chunks into the (ClinicalData)data(/ClinicalData) parts. Here is some detailed code on how to do this:
char[] buffer = new char[1024];
string remainder = string.Empty;
List<ClientData> list = new List<ClientData>();
using (StreamReader reader = File.OpenText(#"source.txt"))
{
while (reader.Read(buffer, 0, 1024) > 0)
{
remainder = Parse(remainder + new string(buffer), list);
}
}
with the following method:
string Parse(string value, List<ClientData> list)
{
string[] parts = value.Split(new string[1] { "</ClientData>" }, StringSplitOptions.None);
for (int i = 0; i < parts.Length - 1; i++)
list.Add(new ClientData(parts[i]));
return parts[parts.Length - 1];
}
and the ClientData class however you have it implemented:
class ClientData
{
public ClientData(string value)
{
// fill in however you are already parsing out ID, and other info
}
}
There are many ways to implement something like this, but hopefully this can help get you started.
StreamReader's ReadLine() method is only one of the many ways you can read the text from the file. You can read into a buffer with a specified length, and then parse out the ClinicalData tags. I can provide an example if you'd like.
http://msdn.microsoft.com/en-us/library/9kstw824%28v=vs.110%29.aspx
Alternately, if you are reading an XML file, XmlReader is another option.
http://msdn.microsoft.com/en-us/library/system.xml.xmlreader%28v=vs.110%29.aspx
I want to know how return values for strings works for strings in C#. In one of my functions, I generate html code and the string is really huge, I then return it from the function, and then insert it into the page. But I want to know should I pass a huge string as a return value, or just insert it into the page from the same function?
When C# returns a string, does it create a new string from the old one, and return that?
Thanks.
Strings (or any other reference type) are not copied when returning from a function, only value types are.
System.String is a reference type (class) and so passing as parameter and returning only involve the copying of a reference (32 or 64 bits).
The size of the string is not relevant.
Returning a string is a cheap operation - as mentioned it's purely a matter of returning 32 or 64 bits (4 or 8 bytes).
However, as Sten Petrov points out string + operations involve the creation of a new string, and can be a little expensive. If you wanted to save performance & memory I'd suggest doing something like this:
static int i = 0;
static void Main(string[] args)
{
while (Console.ReadLine() == "")
{
var pageSB = new StringBuilder();
foreach (var section in new[] { AddHeader(), AddContent(), AddFooter() })
for (int i = 0; i < section.Length; i++)
pageSB.Append(section[i]);
Console.Write(pageSB.ToString());
}
}
static StringBuilder AddHeader()
{
return new StringBuilder().Append("Hi ").AppendLine("World");
}
static StringBuilder AddContent()
{
return new StringBuilder()
.AppendFormat("This page has been viewed: {0} times\n", ++i);
}
static StringBuilder AddFooter()
{
return new StringBuilder().Append("Bye ").AppendLine("World");
}
Here we use the StringBuilders to hold a reference to all the strings we want to concat, and wait until the very end before joining them together. This'll save many unnecessary additions (which are memory and CPU heavy in comparison).
Of course, I doubt you'll actually see any need for this in practise - and if you do I'd spend some time learning about pooling etc. to help reduce the garbage created by all the string builders - and maybe consider creating a custom 'string holder' that suits your purposes better.
Basically I use Entity Framework to query a huge database. I want to return a string list then log it to a text file.
List<string> logFilePathFileName = new List<string>();
var query = from c in DBContext.MyTable where condition = something select c;
foreach (var result in query)
{
filePath = result.FilePath;
fileName = result.FileName;
string temp = filePath + "." + fileName;
logFilePathFileName.Add(temp);
if(logFilePathFileName.Count %1000 ==0)
Console.WriteLine(temp+"."+logFilePathFileName.Count);
}
However I got an exception when logFilePathFileName.Count=397000.
The exception is:
Exception of type 'System.OutOfMemoryException' was thrown.
A first chance exception of type 'System.OutOfMemoryException'
occurred in System.Data.Entity.dll
UPDATE:
What I want to use a different query say: select top 1000 then add to the list, but I don't know after 1000 then what?
Most probabbly it's not about a RAM as is, so increasing your RAM or even compiling and running your code in 64 bit machine will not have a positive effect, in this case.
I think it's related to a fact that .NET collections are limited to maximum 2GB RAM space (no difference either 32 or 64 bit).
To resolve this, split your list to much smaller chunks and most probabbly your problem will gone.
Just one possible solution:
foreach (var result in query)
{
....
if(logFilePathFileName.Count %1000 ==0) {
Console.WriteLine(temp+"."+logFilePathFileName.Count);
//WRITE SOMEWHERE YOU NEED
logFilePathFileName = new List<string>(); //RESET LIST !|
}
}
EDIT
If you want fragment a query, you can use Skip(...) and Take(...)
Just an explanatory example:
var fisrt1000 = query.Skip(0).Take(1000);
var second1000 = query.Skip(1000).Take(1000);
...
and so on..
Naturally put it in your iteration and parametrize it based on bounds of data you know or need.
Why are you collecting the data in a List<string> if all you need to do is write it to a text file?
You might as well just:
Open the text file;
Iterate over the records, appending each string to the text file (without storing the strings in memory);
Flush and close the text file.
You will need far less memory than now, because you won't be keeping all those strings unnecessarily in memory.
You probably need to set some vmargs for memory!
Also... look into writing it straight to your file and not holding it in a List
What Roy Dictus says sounds the best way.
Also you can try to add a limit to your query. So your database result won't be so large.
For info on:
Limiting query size with entity framework
You shouldn't read all records from database to list. It required a lot of memory. You an combine reading records and writing them to file. For example read 1000 records from db to list and save(append) them to text file, clear used memory (list.Clear()) and continue with new records.
From several other topics on StackOverflow I read that the Entity Framework is not designed to handle bulk data like that. The EF will cache/track all data in the context and will cause the exception in cases of huge bulks of data. Options are to use SQL directly or split up your records in smaller sets.
I used to use the gc arraylist in VS c++ similar to the gc List that you used, to works fin with small and intermediate data sets, but when using Big Dat, same problem 'System.OutOfMemoryException' was thrown.
As the size of these gcs cannot exceed 2 GB and therefore become inefficient with Big data, I built my own linked list, which gives the same functionality, dynamic increase and get by index, basically, it is a normal linked list class, with a dynamic array inside to provide getting data by index, it duplicates the space, but you may delete the linked list after updating the array is you do not need it keeping only the dynamic array, this would solve the problem. see the code:
struct LinkedNode
{
long data;
LinkedNode* next;
};
class LinkedList
{
public:
LinkedList();
~LinkedList();
LinkedNode* head;
long Count;
long * Data;
void add(long data);
void update();
//long get(long index);
};
LinkedList::LinkedList(){
this->Count = 0;
this->head = NULL;
}
LinkedList::~LinkedList(){
LinkedNode * temp;
while(head){
temp= this->head ;
head = head->next;
delete temp;
}
if (Data)
delete [] Data; Data=NULL;
}
void LinkedList::add (long data){
LinkedNode * node = new LinkedNode();
node->data = data;
node->next = this->head;
this->head = node;
this->Count++;}
void LinkedList::update(){
this->Data= new long[this->Count];
long i = 0;
LinkedNode * node =this->head;
while(node){
this->Data[i]=node->data;
node = node->next;
i++;
}
}
If you use this, please refer to my work https://www.liebertpub.com/doi/10.1089/big.2018.0064