Read in a file using a regular expression? - c#

This is tangentially related to an earlier question of mine.
Essentially, the solution in that question worked great, but now I need to adapt it to work in a much larger analysis application. Simply using StreamReader.ReadToEnd() is not acceptable, since some of the files I will be reading in are very, very large. If there's been a mistake and someone forgot to clean up, they can theoretically be gigabytes big. Obviously I can't just read to the end of that.
Unfortunately, the normal read lines is also not acceptable, because some of the rows of data I am reading in contain stack traces - they obviously use /r/n in their formatting. Ideally, I would like to tell the program to read forward until it hits a match for a regex, which it then returns. Is there any functionality to do this in .net? If not, can I get some suggestions for how I'd go about writing it?
Edit: To make it a bit easier to follow my question, here's a paste of some of the important parts of the adapted code:
foreach (var fileString in logpath.Select(log => new StreamReader(log)).Select(fileStream => fileStream.ReadToEnd()))
{
const string junkPattern = #"\[(?<junk>[0-9]*)\] \((?<userid>.{0,32})\)";
const string severityPattern = #"INFO|ERROR|FATAL";
const string datePattern = "^(?=[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2},[0-9]{3})";
var records = Regex.Split(fileString, datePattern, RegexOptions.Multiline);
foreach (var record in records.Where(x => string.IsNullOrEmpty(x) == false))
......
The problem lies in the Foreach. .Select(fileStream => fileStream.ReadToEnd()) is gonna blow up memory badly, I just know it.

First off all, you should move your const definition to class declaration - the compiler will do that for you, but this should be done by yourself, just for better code readability.
As #Blam mentioned, you should use StringBuilder and StreamReader.ReadLine in pair, something like this:
foreach(var filePath in logpath)
{
var sbRecord = new StringBuilder();
using(var reader = new StreamReader(filePath))
{
do
{
var line = reader.ReadLine();
// check start of the new record lines
if (Regex.Match(line, datePattern) && sbRecord.Length > 0)
{
// your method for log record
HandleRecord(sbRecord.ToString());
sbRecord.Clear();
sbRecord.AppendLine(line);
}
// if no lines were added or datePattern didn't hit
// append info about current record
else
{
sbRecord.AppendLine(line);
}
} while (!reader.EndOfStream)
}
}
If I didn't understand something about your problem, please clarify this in comment.
Also, you can use ThreadPool for schedule the tasks for your lines, just for speed of your application.

Related

C# How to parse through an inconsistently formatted text file, ignoring unneeded information

A little background. I am new to using C# in a professional setting. My experience is mainly in SQL. I have a file that I need to parse through to pull out certain pieces of information. I can figure out how to parse through each line, but have gotten stuck on searching for specific pieces of information. I am not interested in someone finishing this code for me. Instead, I am interested in pointers on where I can go from here.
Here is an example of the code I have written.
class Program
{
private static Dictionary<string, List<string>> _arrayLists = new Dictionary<string, List<string>>();
static void Main(string[] args)
{
string filePath = "c:\\test.txt";
StreamReader reader = new StreamReader(filePath);
string line;
while (null !=(line = reader.ReadLine()))
{
if (line.ToLower().Contains("disconnected"))
{
// needs to continue on search for Disconnected or Subscribed
}
else
{
if (line.ToLower().Contains("subscribed"))
{
// program needs to continue reading file
// looking for and assigning values to
// dvd, cls, jhd, dxv, hft
// records start at Subscribed and end at ;
}
}
}
}
}
A little bit of explanation of the file. I basically need to pull data existing between the word Subscribed and the first ; i come to. Specifically I need to take the values such as dvd = 234 and assign them to their same variables in the code. Not every record will have the same variables.
Here is an example of the text file that I need to parse through.
test information
annoying information
Subscribed more annoying info
more annoying info
dvd = 234,
cls = 453,
jhd = 567,
more annoying info
more annoying info
dxv = 456,
hft = 876;
more annoying info
test information
annoying information
Subscribed more annoying info
more annoying info
dvd = 234,
cls = 455,
more annoying info
more annoying info
dxv = 456,
hft = 876,
jjd = 768;
more annoying info
test information
annoying information
Disconnected more annoying info
more annoying info
more annoying info
Edit
My apologies on the vague question. I have to learn how to ask better questions.
My thought process was to make sure the program associated all the details between subscribed and the ; as one record. I think the part that I am confused on is in reading the lines. In my head I see the loop reading the line Subscribed, and then going into a method and reading the next line and assigning the value, and so on until it hits the ;. Once that was done I am trying to figure out how to tell the program to exit that method, but to continue reading from the line right after the semi-colon. Perhaps I am over thinking this.
I will take the advice I have been give and see what I can come up with to solve this. Thank you.
From you question as it is now it is not clear what specific problem you are struggling with. I'd suggest you edit your question providing specific challenges you'd like to overcome. currently you problem statement is "have gotten stuck on searching for specific pieces of information". This is as unspecific as it can get.
Having said that I'll try to help you.
First, you will never get into an if like that:
line.ToLower().Contains("Disconnected")
Here you convert all the characters to lower case, and then you are trying to find a substring with capital "D" in it. The expression above will (almost) always evaluate to false.
Secondly, in order for your application to do what you want to do it needs to track the current parsing state. I'm going to ignore the "Disconnected" bit now, as you have not shown what significance it has.
I'll be assuming that you are trying to find everything between Subscribed and first semicolon in the file. I'll also make a couple of other assumption regarding to what can constitute a string, which I won't list here. These can be wrong, but this is my best guess given the information you've provided.
You program will start in a state "looking for subscription". You already set up the read loop, which is good. In this loop you read lines of the file, and you find one that contains word Subscription.
Once you found such line your parser need to move to "parsing subscription" state. In this state, when you read lines you look for lines like jjd = 768, perhaps with a semicolon in the end. You can check if the line match a pattern by using Regular Expressions.
Regular Expressions also can divide match to capturing groups, so that you can extract the name (jjd) and the value (768) separately. Presences or absence of the semicolon could be another RegEx group.
Note that RegEx is not the only way to handle this, but this is the first that comes to mind.
You then keeping matching the lines to your regex and extracting names and values until you come across the semicolon, at which point you switch back to "looking for subscription" state.
You use the current state, to decide how to process the next read line.
You continue until the end of the file.
Generally you want to read up on parsing.
Hope this helps.
As with all code solutions to problems there are many possible ways to achieve what you are looking for. Some will work better then others. Below is one way that could help point you in the right direction.
You can check if the string starts with a keyword or value such as "dvd" (see MSDN String.StartsWith).
If it does then you can split the string into an array of parts (see MSDN String.Split).
You can then get the values of each part from the string array using the index of the value you want.
Do what you need to with the value retrieved.
Continue checking each line for your key business rules (ie. The semicolon that will end the section). Maybe you could check the last character of the string. (see String.EndsWith)
When processing text files containing semi-structured data, state variables can simplify the algorithm. In the code below, a boolean state variable isInRecord is used to track when a line is in a record.
using System;
using System.Collections.Generic;
using System.IO;
namespace ConsoleApplication19
{
public class Program
{
private readonly static String _testData = #"
test information
annoying information
Subscribed more annoying info
more annoying info
dvd = 234,
cls = 453,
jhd = 567,
more annoying info
more annoying info
dxv = 456,
hft = 876;
more annoying info
test information
annoying information
Subscribed more annoying info
more annoying info
dvd = 234,
cls = 455,
more annoying info
more annoying info
dxv = 456,
hft = 876,
jjd = 768;
more annoying info
test information
annoying information
Disconnected more annoying info
more annoying info
more annoying info";
public static void Main(String[] args)
{
/* Create a temporary file containing the test data. */
var testFile = Path.Combine(Environment.GetFolderPath(Environment.SpecialFolder.ApplicationData), Path.GetRandomFileName());
File.WriteAllText(testFile, _testData);
try
{
var p = new Program();
var records = p.GetRecords(testFile);
foreach (var kvp in records)
{
Console.WriteLine("Record #" + kvp.Key);
foreach (var entry in kvp.Value)
{
Console.WriteLine(" " + entry);
}
}
}
finally
{
File.Delete(testFile);
}
}
private Dictionary<String, List<String>> GetRecords(String path)
{
var results = new Dictionary<String, List<String>>();
var recordNumber = 0;
var isInRecord = false;
using (var reader = new StreamReader(path))
{
String line;
while ((line = reader.ReadLine()) != null)
{
line = line.Trim();
if (line.StartsWith("Disconnected"))
{
// needs to continue on search for Disconnected or Subscribed
isInRecord = false;
}
else if (line.StartsWith("Subscribed"))
{
// program needs to continue reading file
// looking for and assigning values to
// dvd, cls, jhd, dxv, hft
// records start at Subscribed and end at ;
isInRecord = true;
recordNumber++;
}
else if (isInRecord)
{
// Check if the line has a general format of "something = something".
var parts = line.Split("=".ToCharArray(), StringSplitOptions.RemoveEmptyEntries);
if (parts.Length != 2)
continue;
// Update the relevant dictionary key, or add a new key.
List<String> entries;
if (results.TryGetValue(recordNumber.ToString(), out entries))
entries.Add(line);
else
results.Add(recordNumber.ToString(), new List<String>() { line });
// Determine if the isInRecord state variable should be toggled.
var lastCharacter = line[line.Length - 1];
if (lastCharacter == ';')
isInRecord = false;
}
}
}
return results;
}
}
}

Getting Certain Text From URL

I've been trying to do a couple things to this URL: "https://www.fiverr.com/categories/writing-translation/SEO-keyword-optimization-services"
First I need to parse out: writing-translation (Subject to change depending on the category). Then taking the '-' out of it so you would end up with: writing translation.
I've been trying myself with Regex, I am God AWFUL with it though, believe me I have trying though. If someone could give me an answer, and explain the Regex to me that they use, it would be awesome. Thank you so much.
i.e - my awful attempt (Just for the sake of it)
string MainCategory_link = firefoxDriver.FindElementByXPath("//a[#class='gig- sub-cat js-gtm-event-auto']").GetAttribute("href");
var Reg = new Regex("\".*?\"");
var matches = Reg.Matches(MainCategory_link);
foreach (var item in matches)
{
MessageBox.Show(item.ToString());
}
Updated code with segments attempt
string MainCategory_link = firefoxDriver.FindElementByXPath("//a[#class='gig-sub-cat js-gtm-event-auto']").GetAttribute("href");
var uri = new Uri(MainCategory_link);
foreach (var segment in uri.Segments)
{
MessageBox.Show(segment[1].ToString());
}
There is a Uri class that allows you to access different parts of the Uri via segments.
var uri = new Uri("https://www.fiverr.com/categories/writing-translation/SEO-keyword-optimization-services");
foreach(var segment in uri.Segments)
{
MessageBox.Show(segment);
}
/* Output
categories
writing-translation
SEO-keyword-optimization-services
*/
Therefore, to retrieve writing-translation you'd do:
var uri = new Uri("https://www.fiverr.com/categories/writing-translation/SEO-keyword-optimization-services");
MessageBox.Show(uri[1]);
And of course, you should perform bounds checks anytime you're accessing something via index to make sure it exists and not get an OutOfBoundsException.
Never ever use Regex unless you are absolutely positive a better option doesn't already exist. Regex should always be a last resort. In fact, it's probably better if you don't know Regex at all, because you'll just keep trying to use it at all the wrong times.

Extremely Large Single-Line File Parse

I am downloading data from a site and the site gives the data to me in very large blocks. Within the very large block, there are "chunks" that I need to parse individually. These "chunks" begin with "(ClinicalData)" and end with "(/ClinicalData)". Therefore, an example string would look something like:
(ClinicalData)(ID="1")(/ClinicalData)(ClinicalData)(ID="2")(/ClinicalData)(ClinicalData)(ID="3")(/ClinicalData)(ClinicalData)(ID="4")(/ClinicalData)(ClinicalData)(ID="5")(/ClinicalData)
Under "ideal" circumstances, the block is meant to be one-single line of data, however sometimes there are erroneous newline characters. Since I want to parse the (ClinicalData) chunks within the block, I want to make my data parse-able line-by-line. Therefore, I take the text file, read it all into a StringBuilder, remove new-lines (just in case), and then insert my own newlines, that way I can read line-by-line.
StringBuilder dataToWrite = new StringBuilder(File.ReadAllText(filepath), Int32.MaxValue);
// Need to clear newline characters just in case they exist.
dataToWrite.Replace("\n", "");
// set my own newline characters so the data becomes parse-able by line
dataToWrite.Replace("<ClinicalData", "\n<ClinicalData");
// set the data back into a file, which is then used in a StreamReader to parse by lines.
File.WriteAllText(filepath, dataToWrite.ToString());
This has been working out great (albeit maybe not efficient, but at least it is friendly to me :)), until I have not encountered a chunk of data that is being given to me as a 280MB large file.
Now I am getting a System.OutOfMemoryException with this block and I just cannot figure out a way around it. I believe the issue is that StringBuilder cannot handle 280MB of straight text? Well, I have tried string splits, regex.match splits, and various other ways to break it into guaranteed "(ClinicalData) chunks, but I continue to get the memory exception. I have also had no luck in attempting to read pre-defined chunks (e.g.: using .ReadBytes).
Any suggestions on how to handle a 280MB large, potentially-but-might-not-actually-be single line of text would be great!
That's an extremely inefficient way to read a text file, let alone a large one. If you only need one pass, replacing or adding individual characters, you should use a StreamReader. If you only need one character of lookahead you only need to maintain a single intermediate state, something like:
enum ReadState
{
Start,
SawOpen
}
using (var sr = new StreamReader(#"path\to\clinic.txt"))
using (var sw = new StreamWriter(#"path\to\output.txt"))
{
var rs = ReadState.Start;
while (true)
{
var r = sr.Read();
if (r < 0)
{
if (rs == ReadState.SawOpen)
sw.Write('<');
break;
}
char c = (char) r;
if ((c == '\r') || (c == '\n'))
continue;
if (rs == ReadState.SawOpen)
{
if (c == 'C')
sw.WriteLine();
sw.Write('<');
rs = ReadState.Start;
}
if (c == '<')
{
rs = ReadState.SawOpen;
continue;
}
sw.Write(c);
}
}
First off, I don't think you need to put all the text in a StringBuilder, since you aren't even concatenating parts to it. You could just try the following:
File.ReadAllText(filepath).Replace("\n", "").Replace("<ClinicalData", "\n<ClinicalData");
Why not try a StreamReader for this task? You can pick a "chunk" size that you want to read by and then split up those chunks into the (ClinicalData)data(/ClinicalData) parts. Here is some detailed code on how to do this:
char[] buffer = new char[1024];
string remainder = string.Empty;
List<ClientData> list = new List<ClientData>();
using (StreamReader reader = File.OpenText(#"source.txt"))
{
while (reader.Read(buffer, 0, 1024) > 0)
{
remainder = Parse(remainder + new string(buffer), list);
}
}
with the following method:
string Parse(string value, List<ClientData> list)
{
string[] parts = value.Split(new string[1] { "</ClientData>" }, StringSplitOptions.None);
for (int i = 0; i < parts.Length - 1; i++)
list.Add(new ClientData(parts[i]));
return parts[parts.Length - 1];
}
and the ClientData class however you have it implemented:
class ClientData
{
public ClientData(string value)
{
// fill in however you are already parsing out ID, and other info
}
}
There are many ways to implement something like this, but hopefully this can help get you started.
StreamReader's ReadLine() method is only one of the many ways you can read the text from the file. You can read into a buffer with a specified length, and then parse out the ClinicalData tags. I can provide an example if you'd like.
http://msdn.microsoft.com/en-us/library/9kstw824%28v=vs.110%29.aspx
Alternately, if you are reading an XML file, XmlReader is another option.
http://msdn.microsoft.com/en-us/library/system.xml.xmlreader%28v=vs.110%29.aspx

Importing data files using generic class definitions

I am trying to import a file with multiple record definition in it. Each one can also have a header record so I thought I would define a definition interface like so.
public interface IRecordDefinition<T>
{
bool Matches(string row);
T MapRow(string row);
bool AreRecordsNested { get; }
GenericLoadClass ToGenericLoad(T input);
}
I then created a concrete implementation for a class.
public class TestDefinition : IRecordDefinition<Test>
{
public bool Matches(string row)
{
return row.Split('\t')[0] == "1";
}
public Test MapColumns(string[] columns)
{
return new Test {val = columns[0].parseDate("ddmmYYYY")};
}
public bool AreRecordsNested
{
get { return true; }
}
public GenericLoadClass ToGenericLoad(Test input)
{
return new GenericLoadClass {Value = input.val};
}
}
However for each File Definition I need to store a list of the record definitions so I can then loop through each line in the file and process it accordingly.
Firstly am I on the right track
or is there a better way to do it?
I would split this process into two pieces.
First, a specific process to split the file with multiple types into multiple files. If the files are fixed width, I have had a lot of luck with regular expressions. For example, assume the following is a text file with three different record types.
TE20110223 A 1
RE20110223 BB 2
CE20110223 CCC 3
You can see there is a pattern here, hopefully the person who decided to put all the record types in the same file gave you a way to identify those types. In the case above you would define three regular expressions.
string pattern1 = #"^TE(?<DATE>[0-9]{8})(?<NEXT1>.{2})(?<NEXT2>.{2})";
string pattern2 = #"^RE(?<DATE>[0-9]{8})(?<NEXT1>.{3})(?<NEXT2>.{2})";
string pattern3 = #"^CE(?<DATE>[0-9]{8})(?<NEXT1>.{4})(?<NEXT2>.{2})";
Regex Regex1 = new Regex(pattern1);
Regex Regex2 = new Regex(pattern2);
Regex Regex3 = new Regex(pattern3);
StringBuilder FirstStringBuilder = new StringBuilder();
StringBuilder SecondStringBuilder = new StringBuilder();
StringBuilder ThirdStringBuilder = new StringBuilder();
string Line = "";
Match LineMatch;
FileInfo myFile = new FileInfo("yourFile.txt");
using (StreamReader s = new StreamReader(f.FullName))
{
while (s.Peek() != -1)
{
Line = s.ReadLine();
LineMatch = Regex1.Match(Line);
if (LineMatch.Success)
{
//Write this line to a new file
}
LineMatch = Regex2.Match(Line);
if (LineMatch.Success)
{
//Write this line to a new file
}
LineMatch = Regex3.Match(Line);
if (LineMatch.Success)
{
//Write this line to a new file
}
}
}
Next, take the split files and run them through a generic process, that you most likely already have, to import them. This works well because when the process inevitably fails, you can narrow it to the single record type that is failing and not impact all the record types. Archive the main text file along with the split files and your life will be much easier as well.
Dealing with these kinds of transmitted files is hard, because someone else controls them and you never know when they are going to change. Logging the original file as well as a receipt of the import is very import and shouldn't be overlooked either. You can make that as simple or as complex as you want, but I tend to write a receipt to a db and copy the primary key from that table into a foreign key in the table I have imported the data into, then never change that data. I like to keep a unmolested copy of the import on the file system as well as on the DB server because there are inevitable conversion / transformation issues that you will need to track down.
Hope this helps, because this is not a trivial task. I think you are on the right track, but instead of processing/importing each line separately...write them to a separate file. I am assuming this is financial data, which is one of the reasons I think provability at every step is important.
I think the FileHelpers library solves a number of your problems:
Strong types
Delimited
Fixed-width
Record-by-Record operations
I'm sure you could consolidate this into a type hierarchy that could tie in custom binary formats as well.
Have you looked at something using Linq? This is a quick example of Linq to Text and Linq to Csv.
I think it would be much simpler to use "yield return" and IEnumerable to get what you want working. This way you could probably get away with only having 1 method on your interface.

How to optimize memory usage in this algorithm?

I'm developing a log parser, and I'm reading files of strings of more than 150MB.- This is my approach, Is there any way to optimize what is in the While statement? The problem is that is consuming a lot of memory.- I also tried with a stringbuilder facing the same memory comsuption.-
private void ReadLogInThread()
{
string lineOfLog = string.Empty;
try
{
StreamReader logFile = new StreamReader(myLog.logFileLocation);
InformationUnit infoUnit = new InformationUnit();
infoUnit.LogCompleteSize = myLog.logFileSize;
while ((lineOfLog = logFile.ReadLine()) != null)
{
myLog.transformedLog.Add(lineOfLog); //list<string>
myLog.logNumberLines++;
infoUnit.CurrentNumberOfLine = myLog.logNumberLines;
infoUnit.CurrentLine = lineOfLog;
infoUnit.CurrentSizeRead += lineOfLog.Length;
if (onLineRead != null)
onLineRead(infoUnit);
}
}
catch { throw; }
}
Thanks in advance!
EXTRA:
Im saving each line because after reading the log I will need to check for some information on every stored line.- The language is C#
Memory economy can be achieved if your log lines are actually can be parsed to a data row representation.
Here is a typical log line i can think of:
Event at: 2019/01/05:0:24:32.435, Reason: Operation, Kind: DataStoreOperation, Operation Status: Success
This line takes 200 bytes in memory.
At the same time, following representation just takes belo 16 bytes:
Enum LogReason { Operation, Error, Warning };
Enum EventKind short { DataStoreOperation, DataReadOperation };
Enum OperationStatus short { Success, Failed };
LogRow
{
DateTime EventTime;
LogReason Reason;
EventKind Kind;
OperationStatus Status;
}
Another optimization possibility is just parsing a line to array of string tokens,
this way you could make use of string interning.
For example, if a word "DataStoreOperation" takes 36 bytes, and if it has 1000000 entiries in the file, the economy is (18*2 - 4) * 1000000 = 32 000 000 bytes.
Try to make your algorithm sequential.
Using an IEnumerable instead of a List helps playing nice with memory, while keeping same semantic as working with a list, if you don't need random access to lines by index in the list.
IEnumerable<string> ReadLines()
{
// ...
while ((lineOfLog = logFile.ReadLine()) != null)
{
yield return lineOfLog;
}
}
//...
foreach( var line in ReadLines() )
{
ProcessLine(line);
}
I am not sure if it will fit your project but you can store the result in StringBuilder instead of strings list.
For example, this process on my machine takes 250MB memory after loading (file is 50MB):
static void Main(string[] args)
{
using (StreamReader streamReader = File.OpenText("file.txt"))
{
var list = new List<string>();
string line;
while (( line=streamReader.ReadLine())!=null)
{
list.Add(line);
}
}
}
On the other hand, this code process will take only 100MB:
static void Main(string[] args)
{
var stringBuilder = new StringBuilder();
using (StreamReader streamReader = File.OpenText("file.txt"))
{
string line;
while (( line=streamReader.ReadLine())!=null)
{
stringBuilder.AppendLine(line);
}
}
}
Memory usage keeps going up because you're simply adding them to a List<string>, constantly growing. If you want to use less memory one thing you can do is to write the data to disk, rather than keeping it in scope. Of course, this will greatly cause speed to degrade.
Another option is to compress the string data as you're storing it to your list, and decompress it coming out but I don't think this is a good method.
Side Note:
You need to add a using block around your streamreader.
using (StreamReader logFile = new StreamReader(myLog.logFileLocation))
Consider this implementation: (I'm speaking c/c++, substitute c# as needed)
Use fseek/ftell to find the size of the file.
Use malloc to allocate a chunk of memory the size of the file + 1;
Set that last byte to '\0' to terminate the string.
Use fread to read the entire file into the memory buffer.
You now have char * which holds the contents of the file as a
string.
Create a vector of const char * to hold pointers to the positions
in memory where each line can be found. Initialize the first element
of the vector to the first byte of the memory buffer.
Find the carriage control characters (probably \r\n) Replace the
\r by \0 to make the line a string. Increment past the \n.
This new pointer location is pushed back onto the vector.
Repeat the above until all of the lines in the file have been NUL
terminated, and are pointed to by elements in the vector.
Iterate though the vector as needed to investigate the contents of
each line, in your business specific way.
When you are done, close the file, free the memory, and continue
happily along your way.
1) Compress the strings before you store them (ie see System.IO.Compression and GZipStream). This would probably kill the performance of your program though since you'd have to uncompress to read each line.
2) Remove any extra white space characters or common words you can do without. ie if you can understand what the log is saying with the words "the, a, of...", remove them. Also, shorten any common words (ie change "error" to "err" and "warning" to "wrn"). This would slow down this step in the process but shouldn't affect performance of the rest.
What encoding is your original file? If it is ascii then just the strings alone are going to take over 2x the size of the file just to load up into your array. A C# character is 2 bytes and a C# string adds an extra 20 bytes per string in addition to the characters.
In your case, since it is a log file, you can probably exploit the fact that there is a lot of repetition in the the messages. You most likely can parse the incoming line into a data structure which reduces the memory overhead. For example, if you have a timestamp in the log file you can convert that to a DateTime value which is 8 bytes. Even a short timestamp of 1/1/10 would add 12 bytes to the size of a string, and a timestamp with time information would be even longer. Other tokens in your log stream might be able to be turned into a code or an enum in a similar manner.
Even if you have the leave the value as a string, if you can break it down into pieces that are used a lot, or remove boilerplate that is not needed at all you can probably cut down on your memory usage. If there are a lot of common strings you can Intern them and only pay for 1 string no matter how many you have.
If you must store the raw data, and assuming that your logs are mostly ASCII, then you can save some memory by storing UTF8 bytes internally. Strings are UTF16 internally, so you're storing an extra byte for each character. So by switching to UTF8 you're cutting memory use by half (not counting class overhead, which is still significant). Then you can convert back to normal strings as needed.
static void Main(string[] args)
{
List<Byte[]> strings = new List<byte[]>();
using (TextReader tr = new StreamReader(#"C:\test.log"))
{
string s = tr.ReadLine();
while (s != null)
{
strings.Add(Encoding.Convert(Encoding.Unicode, Encoding.UTF8, Encoding.Unicode.GetBytes(s)));
s = tr.ReadLine();
}
}
// Get strings back
foreach( var str in strings)
{
Console.WriteLine(Encoding.UTF8.GetString(str));
}
}

Categories

Resources