Extremely Large Single-Line File Parse

Extremely Large Single-Line File Parse - c#

I am downloading data from a site and the site gives the data to me in very large blocks. Within the very large block, there are "chunks" that I need to parse individually. These "chunks" begin with "(ClinicalData)" and end with "(/ClinicalData)". Therefore, an example string would look something like:
(ClinicalData)(ID="1")(/ClinicalData)(ClinicalData)(ID="2")(/ClinicalData)(ClinicalData)(ID="3")(/ClinicalData)(ClinicalData)(ID="4")(/ClinicalData)(ClinicalData)(ID="5")(/ClinicalData)
Under "ideal" circumstances, the block is meant to be one-single line of data, however sometimes there are erroneous newline characters. Since I want to parse the (ClinicalData) chunks within the block, I want to make my data parse-able line-by-line. Therefore, I take the text file, read it all into a StringBuilder, remove new-lines (just in case), and then insert my own newlines, that way I can read line-by-line.
StringBuilder dataToWrite = new StringBuilder(File.ReadAllText(filepath), Int32.MaxValue);
// Need to clear newline characters just in case they exist.
dataToWrite.Replace("\n", "");
// set my own newline characters so the data becomes parse-able by line
dataToWrite.Replace("<ClinicalData", "\n<ClinicalData");
// set the data back into a file, which is then used in a StreamReader to parse by lines.
File.WriteAllText(filepath, dataToWrite.ToString());
This has been working out great (albeit maybe not efficient, but at least it is friendly to me :)), until I have not encountered a chunk of data that is being given to me as a 280MB large file.
Now I am getting a System.OutOfMemoryException with this block and I just cannot figure out a way around it. I believe the issue is that StringBuilder cannot handle 280MB of straight text? Well, I have tried string splits, regex.match splits, and various other ways to break it into guaranteed "(ClinicalData) chunks, but I continue to get the memory exception. I have also had no luck in attempting to read pre-defined chunks (e.g.: using .ReadBytes).
Any suggestions on how to handle a 280MB large, potentially-but-might-not-actually-be single line of text would be great!

That's an extremely inefficient way to read a text file, let alone a large one. If you only need one pass, replacing or adding individual characters, you should use a StreamReader. If you only need one character of lookahead you only need to maintain a single intermediate state, something like:
enum ReadState
{
Start,
SawOpen
}
using (var sr = new StreamReader(#"path\to\clinic.txt"))
using (var sw = new StreamWriter(#"path\to\output.txt"))
{
var rs = ReadState.Start;
while (true)
{
var r = sr.Read();
if (r < 0)
{
if (rs == ReadState.SawOpen)
sw.Write('<');
break;
}
char c = (char) r;
if ((c == '\r') || (c == '\n'))
continue;
if (rs == ReadState.SawOpen)
{
if (c == 'C')
sw.WriteLine();
sw.Write('<');
rs = ReadState.Start;
}
if (c == '<')
{
rs = ReadState.SawOpen;
continue;
}
sw.Write(c);
}
}

First off, I don't think you need to put all the text in a StringBuilder, since you aren't even concatenating parts to it. You could just try the following:
File.ReadAllText(filepath).Replace("\n", "").Replace("<ClinicalData", "\n<ClinicalData");
Why not try a StreamReader for this task? You can pick a "chunk" size that you want to read by and then split up those chunks into the (ClinicalData)data(/ClinicalData) parts. Here is some detailed code on how to do this:
char[] buffer = new char[1024];
string remainder = string.Empty;
List<ClientData> list = new List<ClientData>();
using (StreamReader reader = File.OpenText(#"source.txt"))
{
while (reader.Read(buffer, 0, 1024) > 0)
{
remainder = Parse(remainder + new string(buffer), list);
}
}
with the following method:
string Parse(string value, List<ClientData> list)
{
string[] parts = value.Split(new string[1] { "</ClientData>" }, StringSplitOptions.None);
for (int i = 0; i < parts.Length - 1; i++)
list.Add(new ClientData(parts[i]));
return parts[parts.Length - 1];
}
and the ClientData class however you have it implemented:
class ClientData
{
public ClientData(string value)
{
// fill in however you are already parsing out ID, and other info
}
}
There are many ways to implement something like this, but hopefully this can help get you started.
StreamReader's ReadLine() method is only one of the many ways you can read the text from the file. You can read into a buffer with a specified length, and then parse out the ClinicalData tags. I can provide an example if you'd like.
http://msdn.microsoft.com/en-us/library/9kstw824%28v=vs.110%29.aspx
Alternately, if you are reading an XML file, XmlReader is another option.
http://msdn.microsoft.com/en-us/library/system.xml.xmlreader%28v=vs.110%29.aspx

Related

read a text file and search for string in memory efficient way (and abort when found)

I'm searching for a string in a text file (also includes XML). This is what I thought first:
using (StreamReader sr = File.OpenText(fileName))
{
string s = String.Empty;
while ((s = sr.ReadLine()) != null)
{
if (s.Contains("mySpecialString"))
return true;
}
}
return false;
I want to read line by line to minimize the amount of RAM used. When the string has been found it should abort the operation. The reason why I don't process it as XML is because it has to be parsed and would also consume more memory as necessary.
Another easy implementation would be
bool found = File.ReadAllText(path).Contains("mySpecialString") ? true : false;
but that would read the complete file into memory, which isn't what I want. On the other side it could have a performance increase.
Another one would be this
foreach (string line in File.ReadLines(path))
{
if (line.Contains("mySpecialString"))
{
return true;
}
}
return false;
But which one of them (or another one from you?) is more memory efficient?

You can use a query with File.ReadLines, so it only reads as many lines as it needs to, in order to satisfy your query. The Any() method will stop when it hits a line containing your string.
return File.ReadLines(fileName).Any(line => line.Contains("mySpecialString"));

I also prefer the accepted answer. Maybe i'm micro opimizing things here but you have asked for a memory efficient approach. Also consider that the text you are searching could also contain new-line characters like '\r', '\n' or "\r\n" and a large file could theoretically contain a single line which negates the benefit of ReadLines.
So you could use this method:
public static bool FileContainsString(string path, string str, bool caseSensitive = true)
{
if(String.IsNullOrEmpty(str))
return false;
using (var stream = new StreamReader(path))
while (!stream.EndOfStream)
{
bool stringFound = true;
for (int i = 0; i < str.Length; i++)
{
char strChar = caseSensitive ? str[i] : Char.ToUpperInvariant(str[i]);
char fileChar = caseSensitive ? (char)stream.Read() : Char.ToUpperInvariant((char)stream.Read());
if (strChar != fileChar)
{
stringFound = false;
break; // break for-loop, start again with first character at next position
}
}
if (stringFound)
return true;
}
return false;
}
bool containsString = FileContainsString(path, "mySpecialString", false); // ignore case if desired
Note that this might be the most efficient approach and hidden in a method also readable. But it has one drawback, it's not feasible to implement a culture-sensitive comparison because it looks at single characters and not at substrings.
So you have to keep some edge cases in mind where you can run into issues, like the famous turkish i example or surrogate pairs.

I think both of your solutions are the same. Read at the MSDN: https://msdn.microsoft.com/en-us/library/dd383503%28v=vs.110%29.aspx
There it says: "The ReadLines and ReadAllLines methods differ as follows: When you use ReadLines, you can start enumerating the collection of strings before the whole collection is returned"
The same article also suggests that ReadLines should be used in conjunction with LINQ to Objects.

Can't find string in input file

I have a text file, which I am trying to insert a line of code into. Using my linked-lists I believe I can avoid having to take all the data out, sort it, and then make it into a new text file.
What I did was come up with the code below. I set my bools, but still it is not working. I went through debugger and what it seems to be going on is that it is going through the entire list (which is about 10,000 lines) and it is not finding anything to be true, so it does not insert my code.
Why or what is wrong with this code?
List<string> lines = new List<string>(File.ReadAllLines("Students.txt"));
using (StreamReader inFile = new StreamReader("Students.txt", true))
{
string newLastName = "'Constant";
string newRecord = "(LIST (LIST 'Constant 'Malachi 'D ) '1234567890 'mdcant#mail.usi.edu 4.000000 )";
string line;
string lastName;
bool insertionPointFound = false;
for (int i = 0; i < lines.Count && !insertionPointFound; i++)
{
line = lines[i];
if (line.StartsWith("(LIST (LIST "))
{
values = line.Split(" ".ToCharArray());
lastName = values[2];
if (newLastName.CompareTo(lastName) < 0)
{
lines.Insert(i, newRecord);
insertionPointFound = true;
}
}
}
if (!insertionPointFound)
{
lines.Add(newRecord);
}

You're just reading the file into memory and not committing it anywhere.
I'm afraid that you're going to have to load and completely re-write the entire file. Files support appending, but they don't support insertions.
you can write to a file the same way that you read from it
string[] lines;
/// instanciate and build `lines`
File.WriteAllLines("path", lines);
WriteAllLines also takes an IEnumerable, so you can past a List of string into there if you want.
one more issue: it appears as though you're reading your file twice. one with ReadAllLines and another with your StreamReader.

There are at least four possible errors.
The opening of the streamreader is not required, you have already read
all the lines. (Well not really an error, but...)
The check for StartsWith can be fooled if you lines starts with blank
space and you will miss the insertionPoint. (Adding a Trim will remove any problem here)
In the CompareTo line you check for < 0 but you should check for == 0. CompareTo returns 0 if the strings are equivalent, however.....
To check if two string are equals you should avoid using CompareTo as
explained in MSDN link above but use string.Equals
List<string> lines = new List<string>(File.ReadAllLines("Students.txt"));
string newLastName = "'Constant";
string newRecord = "(LIST (LIST 'Constant 'Malachi 'D ) '1234567890 'mdcant#mail.usi.edu 4.000000 )";
string line;
string lastName;
bool insertionPointFound = false;
for (int i = 0; i < lines.Count && !insertionPointFound; i++)
{
line = lines[i].Trim();
if (line.StartsWith("(LIST (LIST "))
{
values = line.Split(" ".ToCharArray());
lastName = values[2];
if (newLastName.Equals(lastName))
{
lines.Insert(i, newRecord);
insertionPointFound = true;
}
}
}
if (!insertionPointFound)
lines.Add(newRecord);
I don't list as an error the missing write back to the file. Hope that you have just omitted that part of the code. Otherwise it is a very simple problem.
(However I think that the way in which CompareTo is used is probably the main reason of your problem)
EDIT Looking at your comment below it seems that the answer from Sam I Am is the right one for you. Of course you need to write back the modified array of lines. All the changes are made to an in memory array of lines and nothing is written back to a file if you don't have code that writes a file. However you don't need new file
File.WriteAllLines("Students.txt", lines);

Reading a line in c# without trimming the line delimiter character

I've got a string that I want to read line-by-line, but I also need to have the line delimiter character, which StringReader.ReadLine unfortunately trims (unlike in ruby where it is kept). What is the fastest and most robust way to accomplish this?
Alternatives I've been thinking about:
Reading the input character-by-character and checking for the line delimiter each time
Using RegExp.Split with a positive lookahead
Alternatively I only care about the line delimiter because I need to know the actual position in the string, and the delimiter can be either one or tho character long. Therefore if I could get back the actual position of the cursor within the string would be also good, but StringReader doesn't have this feature.
EDIT: here is my current implementation. End-of-file is designated by returning an empty string.
StringBuilder line = new StringBuilder();
int r = _input.Read();
while (r >= 0)
{
char c = Convert.ToChar(r);
line.Append(c);
if (c == '\n') break;
if (c == '\r')
{
int peek = _input.Peek();
if (peek == -1) break;
if (Convert.ToChar(peek) != '\n') break;
}
r = _input.Read();
}
return line.ToString();

Are you concerned about inconsistencies between files (i.e. coming from Unix/Mac vs. Windows), or within files?
One very easy optimization if you know that individual files are consistent with themselves would be to only read the first line character-by-character and figure out what the delimiter is. Then determining the exact position of any other line would be simple math.
Failing that, I think I would go the character-by-character route. A regex seems too "clever." This sounds like a complex function and I think the most important thing would be to make it easy to write, read, understand, and most importantly debug.
There's another way to do this, which would be more efficient if your data source was a stream. Unfortunately it's not, as referenced in your comment, so you would have to create one first; however, I'll include the solution anyway, it might give you some inspiration:
public IEnumerable<int> GetLineStartIndices(string s)
{
yield return 0;
byte[] chars = Encoding.UTF8.GetBytes(s);
using (MemoryStream stream = new MemoryStream(chars))
{
using (StreamReader reader = new StreamReader(stream, Encoding.UTF8))
{
while (reader.ReadLine() != null)
{
yield return stream.Position;
}
}
}
}
This will give you back the start position of each new line. Obviously you can tweak this to do whatever else you need, i.e. do something else with the actual lines you read.
Just note that this has to make a copy of the string to create the byte array, so it's really not suitable for very large strings. It's a bit nicer than the char-by-char approach though, less bug-prone, so perhaps worth considering if the strings are not megabytes-long.

If you only care about the position: ReadLine() moves you to the next line. If you store the .Position of the stream underneath you can compare it to the .Position after the following ReadLine(). That's the length of the string you just read plus the delimiter.
Length of the delimiter is currentPosition - previousPosition - line.Length.
That way you could easily find out if it was 1 or 2 bytes (without knowing the details, but you said you care only about the positions anyway).

File.ReadAllText will get you all of the file contents. Yup. All. So you better check that file size before using it.
EDIT:
read it all in then create an enumerator that yields line by line.
foreach(string line in Read("some.file"))
{ ... }
private IEnumerator Read(string file)
{
string buffer = File.ReadAllText()
for (int index=0;index<buffer.length;index++)
{
string line = ... logic to build a "line" here
yield return line;
}
yield break;
}

FileStream fs = new FileStream("E:\\hh.txt", FileMode.Open, FileAccess.Read);
BinaryReader read = new BinaryReader(fs);
byte[] ch = read.ReadBytes((int)fs.Length);
byte[] che=new byte[(int)fs.Length];
int size = (int)fs.Length,j=0;
for ( int i =0; i <= (size-1); i++)
{
if (ch[i] != '|')
{
che[j] = ch[i];
j++;
}
}
richTextBox1.Text = Encoding.ASCII.GetString(che);
read.Close();
fs.Close();

How to optimize memory usage in this algorithm?

I'm developing a log parser, and I'm reading files of strings of more than 150MB.- This is my approach, Is there any way to optimize what is in the While statement? The problem is that is consuming a lot of memory.- I also tried with a stringbuilder facing the same memory comsuption.-
private void ReadLogInThread()
{
string lineOfLog = string.Empty;
try
{
StreamReader logFile = new StreamReader(myLog.logFileLocation);
InformationUnit infoUnit = new InformationUnit();
infoUnit.LogCompleteSize = myLog.logFileSize;
while ((lineOfLog = logFile.ReadLine()) != null)
{
myLog.transformedLog.Add(lineOfLog); //list<string>
myLog.logNumberLines++;
infoUnit.CurrentNumberOfLine = myLog.logNumberLines;
infoUnit.CurrentLine = lineOfLog;
infoUnit.CurrentSizeRead += lineOfLog.Length;
if (onLineRead != null)
onLineRead(infoUnit);
}
}
catch { throw; }
}
Thanks in advance!
EXTRA:
Im saving each line because after reading the log I will need to check for some information on every stored line.- The language is C#

Memory economy can be achieved if your log lines are actually can be parsed to a data row representation.
Here is a typical log line i can think of:
Event at: 2019/01/05:0:24:32.435, Reason: Operation, Kind: DataStoreOperation, Operation Status: Success
This line takes 200 bytes in memory.
At the same time, following representation just takes belo 16 bytes:
Enum LogReason { Operation, Error, Warning };
Enum EventKind short { DataStoreOperation, DataReadOperation };
Enum OperationStatus short { Success, Failed };
LogRow
{
DateTime EventTime;
LogReason Reason;
EventKind Kind;
OperationStatus Status;
}
Another optimization possibility is just parsing a line to array of string tokens,
this way you could make use of string interning.
For example, if a word "DataStoreOperation" takes 36 bytes, and if it has 1000000 entiries in the file, the economy is (18*2 - 4) * 1000000 = 32 000 000 bytes.

Try to make your algorithm sequential.
Using an IEnumerable instead of a List helps playing nice with memory, while keeping same semantic as working with a list, if you don't need random access to lines by index in the list.
IEnumerable<string> ReadLines()
{
// ...
while ((lineOfLog = logFile.ReadLine()) != null)
{
yield return lineOfLog;
}
}
//...
foreach( var line in ReadLines() )
{
ProcessLine(line);
}

I am not sure if it will fit your project but you can store the result in StringBuilder instead of strings list.
For example, this process on my machine takes 250MB memory after loading (file is 50MB):
static void Main(string[] args)
{
using (StreamReader streamReader = File.OpenText("file.txt"))
{
var list = new List<string>();
string line;
while (( line=streamReader.ReadLine())!=null)
{
list.Add(line);
}
}
}
On the other hand, this code process will take only 100MB:
static void Main(string[] args)
{
var stringBuilder = new StringBuilder();
using (StreamReader streamReader = File.OpenText("file.txt"))
{
string line;
while (( line=streamReader.ReadLine())!=null)
{
stringBuilder.AppendLine(line);
}
}
}

Memory usage keeps going up because you're simply adding them to a List<string>, constantly growing. If you want to use less memory one thing you can do is to write the data to disk, rather than keeping it in scope. Of course, this will greatly cause speed to degrade.
Another option is to compress the string data as you're storing it to your list, and decompress it coming out but I don't think this is a good method.
Side Note:
You need to add a using block around your streamreader.
using (StreamReader logFile = new StreamReader(myLog.logFileLocation))

Consider this implementation: (I'm speaking c/c++, substitute c# as needed)
Use fseek/ftell to find the size of the file.
Use malloc to allocate a chunk of memory the size of the file + 1;
Set that last byte to '\0' to terminate the string.
Use fread to read the entire file into the memory buffer.
You now have char * which holds the contents of the file as a
string.
Create a vector of const char * to hold pointers to the positions
in memory where each line can be found. Initialize the first element
of the vector to the first byte of the memory buffer.
Find the carriage control characters (probably \r\n) Replace the
\r by \0 to make the line a string. Increment past the \n.
This new pointer location is pushed back onto the vector.
Repeat the above until all of the lines in the file have been NUL
terminated, and are pointed to by elements in the vector.
Iterate though the vector as needed to investigate the contents of
each line, in your business specific way.
When you are done, close the file, free the memory, and continue
happily along your way.

1) Compress the strings before you store them (ie see System.IO.Compression and GZipStream). This would probably kill the performance of your program though since you'd have to uncompress to read each line.
2) Remove any extra white space characters or common words you can do without. ie if you can understand what the log is saying with the words "the, a, of...", remove them. Also, shorten any common words (ie change "error" to "err" and "warning" to "wrn"). This would slow down this step in the process but shouldn't affect performance of the rest.

What encoding is your original file? If it is ascii then just the strings alone are going to take over 2x the size of the file just to load up into your array. A C# character is 2 bytes and a C# string adds an extra 20 bytes per string in addition to the characters.
In your case, since it is a log file, you can probably exploit the fact that there is a lot of repetition in the the messages. You most likely can parse the incoming line into a data structure which reduces the memory overhead. For example, if you have a timestamp in the log file you can convert that to a DateTime value which is 8 bytes. Even a short timestamp of 1/1/10 would add 12 bytes to the size of a string, and a timestamp with time information would be even longer. Other tokens in your log stream might be able to be turned into a code or an enum in a similar manner.
Even if you have the leave the value as a string, if you can break it down into pieces that are used a lot, or remove boilerplate that is not needed at all you can probably cut down on your memory usage. If there are a lot of common strings you can Intern them and only pay for 1 string no matter how many you have.

If you must store the raw data, and assuming that your logs are mostly ASCII, then you can save some memory by storing UTF8 bytes internally. Strings are UTF16 internally, so you're storing an extra byte for each character. So by switching to UTF8 you're cutting memory use by half (not counting class overhead, which is still significant). Then you can convert back to normal strings as needed.
static void Main(string[] args)
{
List<Byte[]> strings = new List<byte[]>();
using (TextReader tr = new StreamReader(#"C:\test.log"))
{
string s = tr.ReadLine();
while (s != null)
{
strings.Add(Encoding.Convert(Encoding.Unicode, Encoding.UTF8, Encoding.Unicode.GetBytes(s)));
s = tr.ReadLine();
}
}
// Get strings back
foreach( var str in strings)
{
Console.WriteLine(Encoding.UTF8.GetString(str));
}
}

c# how do I count lines in a textfile

any problems with doing this?
int i = new StreamReader("file.txt").ReadToEnd().Split(new char[] {'\n'}).Length

The method you posted isn't particularly good. Lets break this apart:
// new StreamReader("file.txt").ReadToEnd().Split(new char[] {'\n'}).Length
// becomes this:
var file = new StreamReader("file.txt").ReadToEnd(); // big string
var lines = file.Split(new char[] {'\n'}); // big array
var count = lines.Count;
You're actually holding this file in memory twice: once to read all the lines, once to split it into an array. The garbage collector hates that.
If you like one liners, you can write System.IO.File.ReadAllLines(filePath).Length, but that still retrieves the entire file in an array. There's no point doing that if you aren't going to hold onto the array.
A faster solution would be:
int TotalLines(string filePath)
{
using (StreamReader r = new StreamReader(filePath))
{
int i = 0;
while (r.ReadLine() != null) { i++; }
return i;
}
}
The code above holds (at most) one line of text in memory at any given time. Its going to be efficient as long as the lines are relatively short.

Well, the problem with doing this is that you allocate a lot of memory when doing this on large files.
I would rather read the file line by line and manually increment a counter. This may not be a one-liner but it's much more memory-efficient.
Alternatively, you may load the data in even-sized chunks and count the line breaks in these. This is probably the fastest way.

If you're looking for a short solution, I can give you a one-liner that at least saves you from having to split the result:
int i = File.ReadAllLines("file.txt").Count;
But that has the same problems of reading a large file into memory as your original. You should really use a streamreader and count the line breaks as you read them until you reach the end of the file.

Sure - it reads the entire stream into memory. It's terse, but I can create a file today that will fail this hard.
Read a character at a time and increment your count on newline.
EDIT - after some quick research
If you want terse and want that shiny new generic feel, consider this:
public class StreamEnumerator : IEnumerable<char>
{
StreamReader _reader;
public StreamEnumerator(Stream stm)
{
if (stm == null)
throw new ArgumentNullException("stm");
if (!stm.CanSeek)
throw new ArgumentException("stream must be seekable", "stm");
if (!stm.CanRead)
throw new ArgumentException("stream must be readable", "stm");
_reader = new StreamReader(stm);
}
public IEnumerator<char> GetEnumerator()
{
int c = 0;
while ((c = _reader.Read()) >= 0)
{
yield return (char)c;
}
}
IEnumerator IEnumerable.GetEnumerator()
{
return GetEnumerator();
}
}
which defines a new class which allows you to enumerate over streams, then your counting code can look like this:
StreamEnumerator chars = new StreamEnumerator(stm);
int lines = chars.Count(c => c == '\n');
which gives you a nice terse lambda expression to do (more or less) what you want.
I still prefer the Old Skool:
public static int CountLines(Stream stm)
{
StreamReader _reader = new StreamReader(stm);
int c = 0, count = 0;
while ((c = _reader.Read()) != -1)
{
if (c == '\n')
{
count++;
}
}
return count;
}
NB: Environment.NewLine version left as an exercise for the reader

That should do the trick:
using System.Linq;
....
int i = File.ReadLines(file).Count();

Mayby this?
string file = new StreamReader("YourFile.txt").ReadToEnd();
string[] lines = file.Split('\n');
int countOfLines = lines.GetLength(0));

Assuming the file exists and you can open it, that will work.
It's not very readable or safe...

Develop Reference

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.

Extremely Large Single-Line File Parse - c#

Related

read a text file and search for string in memory efficient way (and abort when found)

Can't find string in input file

Reading a line in c# without trimming the line delimiter character

How to optimize memory usage in this algorithm?

c# how do I count lines in a textfile

Categories

Resources