Parse StreamReader using regex efficiently

I have the variable
StreamReader DebugInfo = GetDebugInfo();
var text = DebugInfo.ReadToEnd(); // takes 10 seconds!!! because there are a lot of students
text equals:
<student>
<firstName>Antonio</firstName>
<lastName>Namnum</lastName>
</student>
<student>
<firstName>Alicia</firstName>
<lastName>Garcia</lastName>
</student>
<student>
<firstName>Christina</firstName>
<lastName>SomeLattName</lastName>
</student>
... etc
.... many more students
What I am doing now is:
StreamReader DebugInfo = GetDebugInfo();
var text = DebugInfo.ReadToEnd(); // takes 10 seconds!!!
var mtch = Regex.Match(text, @"(?s)<student>.+?</student>");
// keep parsing the file while there are more students
while (mtch.Success)
{
AddStudent(mtch.Value); // parse text node into object and add it to corresponding node
mtch = mtch.NextMatch();
}
The whole process takes about 25 seconds. Converting the StreamReader to text (var text = DebugInfo.ReadToEnd();) takes 10 seconds, and the parsing takes about 15 seconds. I was hoping I could do the two parts at the same time...
EDIT
I would like to have something like:
const int bufferSize = 1024;
var sb = new StringBuilder();
Task.Factory.StartNew(() =>
{
Char[] buffer = new Char[bufferSize];
int count = bufferSize;
using (StreamReader sr = GetUnparsedDebugInfo())
{
while (count > 0)
{
count = sr.Read(buffer, 0, bufferSize);
sb.Append(buffer, 0, count);
}
}
var m = sb.ToString();
});
Thread.Sleep(100);
// meanwhile string is being build start adding items
var mtch = Regex.Match(sb.ToString(), @"(?s)<student>.+?</student>");
// keep parsing the file while there are more nodes
while (mtch.Success)
{
AddStudent(mtch.Value);
mtch = mtch.NextMatch();
}
Edit 2
Summary
Sorry, I forgot to mention that the text is very similar to XML but it is not XML. That's why I have to use regular expressions... In short, I think I could save time because what I am doing is converting the stream to a string and then parsing the string. Why not just parse the stream with a regex? Or, if that is not possible, why not get a chunk of the stream and parse that chunk in a separate thread?

UPDATED:
This basic code reads a (roughly) 20 megabyte file in .75 seconds. At that rate my machine should process roughly 53.33 megabytes in the 2 seconds that you reference. Further, 20,000,000 / 2,048 = 9765.625 reads, and .75 / 9765.625 = .0000768 seconds per read. That means that you are reading 2048 characters roughly every 7.68 hundred-thousandths of a second (about 77 microseconds). You need to understand the cost of context switching in relation to the timing of your iterations to determine whether the added complexity of multi-threading is appropriate. At 7.68 x 10^-5 seconds per read, I see your reader thread sitting idle most of the time. It doesn't make sense to me. Just use a single loop with a single thread.
char[] buffer = new char[2048];
StreamReader sr = new StreamReader(@"C:\20meg.bin");
while(sr.Read(buffer, 0, 2048) != 0)
{
; // do nothing
}
For large operations like this, you want to use a forward-only, non-cached reader. It looks like your data is XML, so an XmlTextReader is perfect for this. Here is some sample code. Hope this helps.
string firstName;
string lastName;
using (XmlTextReader reader = new XmlTextReader(GetDebugInfo())) // wrap the StreamReader returned by GetDebugInfo()
{
while (reader.Read())
{
if (reader.IsStartElement() && reader.Name == "student")
{
reader.ReadToDescendant("firstName");
reader.Read();
firstName = reader.Value;
reader.ReadToFollowing("lastName");
reader.Read();
lastName = reader.Value;
AddStudent(firstName, lastName);
}
}
}
I used the following XML:
<students>
<student>
<firstName>Antonio</firstName>
<lastName>Namnum</lastName>
</student>
<student>
<firstName>Alicia</firstName>
<lastName>Garcia</lastName>
</student>
<student>
<firstName>Christina</firstName>
<lastName>SomeLattName</lastName>
</student>
</students>
You may need to tweak. This should run much, much faster.

You can read line by line, but if just reading the data takes 15 seconds there is not much you can do to speed things up.
Before making any significant changes, try simply reading all the lines of the file without doing any processing. If that still takes longer than your goal, adjust the goal or change the file format. Otherwise, see how much you can expect to gain from optimizing the parsing: regular expressions are quite fast for uncomplicated patterns.

Regex isn't the fastest way to parse a string. You need a tailored parser similar to XmlReader (to match your data structure). It will allow you to read the file partially and parse it much faster than Regex does.
Since you have a limited set of tags and limited nesting, a finite-state machine (FSM) approach (http://en.wikipedia.org/wiki/Finite-state_machine) will work for you.
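As a rough illustration of parsing while reading, here is a minimal sketch of a two-state scanner (looking for the opening tag, then looking for the closing tag) that yields each record as soon as it is complete, so ReadToEnd is never needed. The class and method names (StudentScanner, ReadRecords) are hypothetical, and it assumes records are not nested and are small compared to available memory. It is not a full FSM over every tag.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class StudentScanner
{
    // Yields the text of each <student>...</student> block as soon as it has been read.
    public static IEnumerable<string> ReadRecords(TextReader reader,
        string open = "<student>", string close = "</student>", int bufferSize = 4096)
    {
        var pending = new StringBuilder(); // unconsumed text carried between reads
        var buffer = new char[bufferSize];
        int read;
        while ((read = reader.Read(buffer, 0, buffer.Length)) > 0)
        {
            pending.Append(buffer, 0, read);
            string text = pending.ToString();
            int pos = 0;                 // everything before pos has been consumed
            int keepFrom = text.Length;  // start of the tail to carry into the next read
            while (true)
            {
                int start = text.IndexOf(open, pos, StringComparison.Ordinal);
                if (start < 0)
                {
                    // No opening tag yet; keep a short tail in case a tag is split across reads.
                    keepFrom = Math.Max(pos, text.Length - (open.Length - 1));
                    break;
                }
                int end = text.IndexOf(close, start + open.Length, StringComparison.Ordinal);
                if (end < 0)
                {
                    keepFrom = start; // record is incomplete; wait for more data
                    break;
                }
                end += close.Length;
                yield return text.Substring(start, end - start);
                pos = end;
            }
            pending.Clear();
            pending.Append(text, keepFrom, text.Length - keepFrom);
        }
    }
}
```

With the question's names, usage would look like foreach (var s in StudentScanner.ReadRecords(GetDebugInfo())) AddStudent(s);. Whether this beats the regex depends on the data, so it is worth measuring before committing to it.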

Here is what turned out to be the fastest (maybe I missed other things to try).
I created an array of arrays, char[][] listToProcess = new char[200000][];, where I place chunks of the stream. On a separate task I start processing each chunk. The code looks like:
StreamReader sr = GetUnparsedDebugInfo(); // get streamReader
var task1 = Task.Factory.StartNew(() =>
{
Thread.Sleep(500); // wait a little so there are items on list (listToProcess) to work with
StartProcesingList();
});
int counter = 0;
while (true)
{
char[] buffer = new char[2048]; // create a new buffer each time because we add it to the list to process
var charsRead = sr.Read(buffer, 0, buffer.Length);
if (charsRead < 1) // if we reach the end then stop
{
break;
}
listToProcess[counter] = buffer;
counter++;
}
task1.Wait();
and the method StartProcesingList() basically starts going through the list until it reaches a null object.
void StartProcesingList()
{
int indexOnList = 0;
while (true)
{
if (listToProcess[indexOnList] == null)
{
Thread.Sleep(100); // wait a little in case other thread is adding more items to the list
if (listToProcess[indexOnList] == null)
break;
}
// add chunk to dictionary; recall that listToProcess[indexOnList] is a
// char array, so this basically converts it to a string and splits it where appropriate
// there is more logic, as in the case where the end of one chunk has to be
// joined with the start of the next chunk on the list
ProcessChunk(listToProcess[indexOnList]);
indexOnList++;
}
}

@kakridge was right. I could be dealing with a race condition where one task is writing listToProcess[30], for example, while another thread is parsing listToProcess[30]. To fix that problem, and also to remove the inefficient Thread.Sleep calls, I ended up using semaphores. Here is my new code:
StreamReader unparsedDebugInfo = GetUnparsedDebugInfo(); // get streamReader
listToProcess = new char[200000][];
lastPart = null;
matchLength = 0;
// Used to signal events between thread that is reading text
// from readelf.exe and the thread that is parsing chunks
Semaphore semaphore = new Semaphore(0, 1);
// If task1 run out of chunks to process it will be waiting for semaphore to post a message
bool task1IsWaiting = false;
// Used to note that there are no more chunks to add to listToProcess.
bool mainTaskIsDone = false;
int counter = 0; // keep track of which chunk we have added to the list
// This task will be executed on a separate thread. Meanwhile the other thread adds nodes to
// "listToProcess" array this task will add those chunks to the dictionary.
var task1 = Task.Factory.StartNew(() =>
{
semaphore.WaitOne(); // wait until there are at least 1024 nodes to be processed
int indexOnList = 0; // counter to identify the index of chunk[] we are adding to dictionary
while (true)
{
if (indexOnList>=counter) // if equal it might be dangerous!
{ // chunk could be being written to and at the same time being parsed.
if (mainTaskIsDone)// if the main task is done executing stop
break;
task1IsWaiting = true; // otherwise wait until there are more chunks to be processed
semaphore.WaitOne();
}
ProcessChunk(listToProcess[indexOnList]); // add chunk to dictionary
indexOnList++;
}
});
// this block being executed on main thread is responsible for placing the streamreader
// into chunks of char[] so that task1 can start processing those chunks
{
int waitCounter = 1024; // every time task1 is waiting we use this counter to place at least 256 new chunks before continue to parse them
while (true) // more chunks on listToProcess before task1 continues executing
{
char[] buffer = new char[2048]; // buffer where we will place data read from stream
var charsRead = unparsedDebugInfo.Read(buffer, 0, buffer.Length);
if (charsRead < 1){
listToProcess[counter] = pattern;
break;
}
listToProcess[counter] = buffer;
counter++; // add chunk to list to be processed by task1.
if (task1IsWaiting)
{ // if task1 is waiting for more nodes process 256
waitCounter = counter + 256; // more nodes then continue execution of task2
task1IsWaiting = false;
}
else if (counter == waitCounter)
semaphore.Release();
}
}
mainTaskIsDone = true; // let other thread know that this task is done
semaphore.Release(); // release all threads that might be waiting on this thread
task1.Wait(); // wait for all nodes to finish processing
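For comparison, the same reader/parser hand-off can be sketched with BlockingCollection<T>, which takes care of the signalling and the end-of-input case, so no shared flags or hand-written semaphore logic are needed. This is a different technique from the semaphore code above, not a drop-in copy of it. ChunkPipeline, Run, and the processChunk callback are hypothetical names, and error handling is left out.

```csharp
using System;
using System.Collections.Concurrent;
using System.IO;
using System.Threading.Tasks;

public static class ChunkPipeline
{
    // Reads chunks on the calling thread and hands them to processChunk on a second thread.
    public static void Run(TextReader reader, Action<char[], int> processChunk,
                           int chunkSize = 2048, int maxQueuedChunks = 256)
    {
        // The bounded capacity makes the reader wait if the parser falls far behind.
        using (var queue = new BlockingCollection<ArraySegment<char>>(maxQueuedChunks))
        {
            var consumer = Task.Run(() =>
            {
                // Blocks while the queue is empty; ends once CompleteAdding has been
                // called and every queued chunk has been taken.
                foreach (ArraySegment<char> chunk in queue.GetConsumingEnumerable())
                {
                    processChunk(chunk.Array, chunk.Count);
                }
            });

            try
            {
                while (true)
                {
                    var buffer = new char[chunkSize]; // new buffer per chunk; the queue owns it
                    int count = reader.Read(buffer, 0, buffer.Length);
                    if (count < 1) break; // end of input
                    queue.Add(new ArraySegment<char>(buffer, 0, count));
                }
            }
            finally
            {
                queue.CompleteAdding(); // tell the consumer that no more chunks are coming
            }

            consumer.Wait(); // wait for the remaining chunks to be processed
        }
    }
}
```

Because there is a single consumer, chunks are processed in the order they were read, which matters when a record spans two chunks.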

Related

C# .net core 3.1 read a file twice CsvHelper

I am trying to read a csv file twice with C# NET CORE 3.1. I have a limit of 100k records, if more than this I should return an error, otherwise, I should process the file. I want to get the count of the records before the actual processing (second read) so that the user gets a quick response in case the records are more than the limit.
The following code doesn't work as we can read the stream only once. Is there any way in which we can read it twice so the given code works fine?
private IList<TObject> TryReadCsv(IFormFile file)
{
int maxRecordsLimit = 100000;
var result = new List<TObject>();
try
{
using StreamReader reader = new StreamReader(file.OpenReadStream(), Encoding.UTF8);
var config = new CsvConfiguration(CultureInfo.InvariantCulture) { Delimiter = ",", Encoding = Encoding.UTF8 };
using var csv = new CsvReader(reader, config);
// First read; light read to get the number of records without any heavy processing
int recordsCount = 0;
while (csv.Read())
{
if (recordsCount > maxRecordsLimit)
{
// Return the error from here that the number of records are more than the maxRecordsLimit
}
recordsCount++;
}
// Second read for processing; heavy read
while (csv.Read())
{
// Do some heavy processing which takes some more time
}
return result;
}
catch (CsvHelperException ex)
{
// catch exception
}
}

As @jdweng suggested, I needed to reset the stream position to zero:
reader.BaseStream.Position = 0;
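One detail worth checking when rewinding this way: StreamReader keeps its own internal buffer, so after changing the position of the underlying stream it is usual to call DiscardBufferedData() as well. The sketch below shows only the StreamReader part on a seekable stream. ReaderRewind and CountThenReread are hypothetical names. CsvHelper is deliberately left out, and it is an assumption that creating a fresh CsvReader for the second pass is the safest way to reset its state.

```csharp
using System;
using System.IO;
using System.Text;

public static class ReaderRewind
{
    // Counts the lines, rewinds, and returns the first line of the second pass.
    public static int CountThenReread(Stream stream, out string firstLineOfSecondPass)
    {
        if (!stream.CanSeek)
            throw new NotSupportedException("The stream must be seekable to be read twice.");

        using (var reader = new StreamReader(stream, Encoding.UTF8, true, 1024, leaveOpen: true))
        {
            // First pass: cheap count.
            int count = 0;
            while (reader.ReadLine() != null) count++;

            // Rewind the stream and drop whatever the reader still has buffered.
            stream.Position = 0;
            reader.DiscardBufferedData();

            // Second pass starts from the beginning again.
            firstLineOfSecondPass = reader.ReadLine();
            return count;
        }
    }
}
```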

Retrieving entire line from a socket in C#?

I have a simple client-server system sending plain text - though only commands that have been approved. The server is a Python system - and I've confirmed proper connections.
However, the client is C# - in Unity. Searching for examples, I stumbled across this bit of code. It does seem to do what I want, however, only partially:
public String readSocket()
{
if (!socketReady)
return "";
if (theStream.DataAvailable)
return theReader.ReadLine();
return "";
}
The strings I am sending end with \n, but I'm only getting half the message like this:
Message A:
claim_2
Message B:
_20_case
claim_1
I know this probably has to do with how I'm directly reading the line but I cannot find any better examples - strangely enough, everyone seems to point back at this snippet even when multiple people point out the problems.
Can anything be done to fix this bit of code properly?
In case it helps, I'm sending the information (from my Python server) out like this:
action = str(command) + "_" + str(x) + "_" + str(userid) + "_" + str(user)
cfg.GameSendConnection.sendall((action + "\n").encode("utf-8"))

When you do sockets programming, it is important to note that data might not be
available in one piece. In fact, this is exactly what you are seeing. Your
messages are being broken up.
So why does ReadLine not wait until there's a full line to read?
Here's some simple sample code:
var stream = new MemoryStream();
var reader = new StreamReader(stream);
var writer = new StreamWriter(stream) { AutoFlush = true };
writer.Write("foo");
stream.Seek(0, SeekOrigin.Begin);
Console.WriteLine(reader.ReadLine());
Note that there is no newline at the end. Still, the output of this little
snippet is foo.
ReadLine returns the string up to the first line break or until there is no
more data to read. The exception is a stream that is already at its end before
any characters are read; in that case it returns null.
When a NetworkStream has its DataAvailable property return true, it has
data. But as mentioned before, there is no guarantee whatsoever about what that
data is. It might be a single byte. Or a part of a message. Or a full message
plus part of the next message. Note that depending on the encoding, it could
even be possible to receive only part of a character. Not all character
encodings have all characters be at most a single byte. This includes UTF-8, which cfg.GameSendConnection.sendall((action + "\n").encode("utf-8")) sends.
How to solve this? Read bytes, not lines. Put them in some buffer. After every
read, check if the buffer contains a newline. If it does, you now have a full
message to handle. Remove the message up to and including the newline from the
buffer and keep appending new data to it until the next newline is received. And
so on.
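The buffering described above can be sketched as follows. It uses a stateful Decoder so that a UTF-8 character split across two reads is still decoded correctly. LineAssembler and Feed are hypothetical names, and the sketch assumes messages are terminated by "\n" as in the question.

```csharp
using System.Collections.Generic;
using System.Text;

public sealed class LineAssembler
{
    // The decoder remembers incomplete multi-byte sequences between calls.
    private readonly Decoder _decoder = Encoding.UTF8.GetDecoder();
    // Characters received so far that do not yet end in a newline.
    private readonly StringBuilder _pending = new StringBuilder();

    // Feed the bytes from one socket read; returns every line completed by them.
    public List<string> Feed(byte[] data, int count)
    {
        var chars = new char[_decoder.GetCharCount(data, 0, count)];
        int charCount = _decoder.GetChars(data, 0, count, chars, 0);

        var lines = new List<string>();
        for (int i = 0; i < charCount; i++)
        {
            if (chars[i] == '\n')
            {
                lines.Add(_pending.ToString()); // a full message, without the newline
                _pending.Clear();
            }
            else
            {
                _pending.Append(chars[i]);
            }
        }
        return lines;
    }
}
```

Each time the socket returns data, pass the received bytes and their count to Feed and handle whatever lines come back. A partial message simply stays in the assembler until its newline arrives.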

This is how I process entire lines in a similar application. It is very simple code, and yours may differ, but you can get the idea.
private string incompleteRecord = "";
public void ReadSocket()
{
if (_networkStream.DataAvailable)
{
var buffer = new byte[8192];
var receivedString = new StringBuilder();
do
{
int numberOfBytesRead = _networkStream.Read(buffer, 0, buffer.Length);
receivedString.AppendFormat("{0}", Encoding.UTF8.GetString(buffer, 0, numberOfBytesRead));
} while (_networkStream.DataAvailable);
var bulkMsg = receivedString.ToString();
// When you receive data from the socket, you can receive any number of messages at a time
// with no guarantee that the last message you receive will be complete.
// You can receive only part of a complete message, with next part coming
// with the next call. So, we need to save any partial messages and add
// them to the beginning of the data next time.
bulkMsg = incompleteRecord + bulkMsg;
// clear incomplete record so it doesn't get processed next time too.
incompleteRecord = "";
// loop though the data breaking it apart into lines by delimiter ("\n")
while (bulkMsg.Length > 0)
{
var newLinePos = bulkMsg.IndexOf("\n");
if (newLinePos >= 0) // IndexOf returns -1 when not found; 0 means an empty line
{
var line = bulkMsg.Substring(0, newLinePos);
// Do whatever you want with your line here ...
// ProcessYourLine(line)
// Move to the next message.
bulkMsg = bulkMsg.Substring(line.Length + 1);
}
else
{
// there are no more newline delimiters
// so we save the rest of the message (if any) for processing with the next batch of received data.
incompleteRecord = bulkMsg;
bulkMsg = "";
}
}
}
}

Memory Issue in string C#

I have a little test program:
public class Test
{
public string Response { get; set; }
}
My console program simply uses the Test class:
class Program
{
static void Main(string[] args)
{
Test t = new Test();
using (StreamReader reader = new StreamReader("C:\\Test.txt"))
{
t.Response = reader.ReadToEnd();
}
t.Response = t.Response.Substring(0, 5);
Console.WriteLine(t.Response);
Console.Read();
}
}
I have approximately 60 MB of data in my Test.txt file. When the program executes, it takes a lot of memory because strings are immutable. What is a better way to handle this kind of scenario using string?
I know that I can use StringBuilder, but I created this program to replicate a scenario in one of my production applications, which uses string.
When I tried GC.Collect(), the memory was released immediately. I am not sure whether I should call the GC in code.
Please help. Thanks.
UPDATE:
I think I did not explain it clearly. Sorry for the confusion.
I am just reading data from a file to get a large amount of data, as I don't want to create 60 MB of data in code.
My pain point is the line of code below, where I have huge data in the Response field.
t.Response = t.Response.Substring(0, 5);

You could limit your reads to a block of characters (a buffer). Loop through, read the next block into your buffer, and write that buffer out. This prevents the whole file from being held in memory at once.
using (StreamReader reader = new StreamReader(@"C:\Test.txt", true))
{
char[] buffer = new char[1024];
int charsRead;
// The second argument is the offset into the buffer, not into the file, so it stays 0.
while ((charsRead = reader.ReadBlock(buffer, 0, buffer.Length)) > 0)
{
Console.Write(buffer, 0, charsRead); // write only the characters actually read
}
}

Can you read your file line by line? If so, I would recommend calling:
IEnumerable<string> lines = File.ReadLines(path)
When you iterate this collection using
foreach(string line in lines)
{
// do something with line
}
the collection will be iterated using lazy evaluation. That means the entire contents of the file won't need to be kept in memory while you do something with each line.

StreamReader provides just the version of Read that you are looking for, Read(Char[], Int32, Int32), which lets you pick out the first characters of the stream. Alternatively, you can read char-by-char with the regular StreamReader.Read until you decide that you have enough.
var textBuffer = new char[5];
int charsRead = reader.Read(textBuffer, 0, 5); // may return fewer than 5 characters
t.Response = new string(textBuffer, 0, charsRead);
Note that if you know the encoding of the stream, you can do lower-level reading into a byte array and use the System.Text.Encoding classes to construct the strings yourself instead of relying on StreamReader.

tcp server failing after first loop

As the title says, I'm trying to create a TCP server on Linux in C that accepts data, performs some processing, and then sends it back.
The code I'm trying to use on the client side to send the data and then read it back:
TcpClient tcpclnt = new TcpClient();
tcpclnt.Connect("192.168.0.14", 1235);
NetworkStream stm = tcpclnt.GetStream();
_signal.WaitOne();
Image<Bgr, Byte> frame = null;
while (_queue.TryDequeue(out frame))
{
if (frame != null)
{
resizedBMPFrame = frame.Resize(0.5, Emgu.CV.CvEnum.INTER.CV_INTER_LINEAR).ToBitmap();
using (MemoryStream ms = new MemoryStream())
{
resizedBMPFrame.Save(ms, ImageFormat.Bmp);
byte[] byteFrame = ms.ToArray();
l = byteFrame.Length;
byte[] buf = Encoding.UTF8.GetBytes(l.ToString());
stm.Write(buf, 0, buf.Length);
stm.Write(byteFrame, 0, byteFrame.Length);
}
}
else
{
Reading = false;
}
int i;
Bitmap receivedBMPFrame;
byte[] receivedFramesize = new byte[4];
int j = stm.Read(receivedFramesize, 0, receivedFramesize.Length);
int receivedFramesizeint = BitConverter.ToInt32(receivedFramesize, 0);
byte[] receivedFrame = new byte[receivedFramesizeint];
j = stm.Read(receivedFrame, 0, receivedFrame.Length);
using (MemoryStream ms = new MemoryStream(receivedFrame))
{
receivedBMPFrame = new Bitmap(ms);
if (receivedBMPFrame != null)
{
outputVideoPlayer.Image = receivedBMPFrame;
}
else
{
Reading = false;
}
}
}
}
stm.Close();
tcpclnt.Close();
So the idea is it waits for the display thread to send the current frame it's displaying using a concurrentqueue, it then takes it, and makes it a quarter of the size, converts it to a byte array and then sends its length and then it itself over the tcp socket.
In theory the server gets it, performs some processing then sends it back, so it reads the length of it then the new frame itself.
The server code is below:
while (1)
{
int incomingframesize;
int n;
n = read(conn_desc, framesizebuff, 6);
if ( n > 0)
{
printf("Length of incoming frame is %s\n", framesizebuff);
}
else
{
printf("Failed receiving length\n");
return -1;
}
char framebuff[atoi(framesizebuff)];
n = read(conn_desc, framebuff, atoi(framesizebuff));
if ( n > 0)
{
printf("Received frame\n");
}
else
{
printf("Failed receiving frame\n");
return -1;
}
printf("Ready to write\n");
int k = sizeof(framebuff);
n = write(conn_desc, &k, sizeof(int));
if (n <0)
{
printf("ERROR writing to socket\n");
}
else
{
printf("Return frame size is %d\n", k);
}
n = write(conn_desc, &framebuff, sizeof(char)*k);
if (n <0)
{
printf("ERROR writing to socket\n");
}
frameno++;
printf("Frames sent: %d\n", frameno);
}
So it reads the length, then the actual frame, which seems to work, and at the moment then just sends it straight back without doing any processing. However it only works for one loop seemingly, if I step through the client code line by line the server code runs through once, but on the 2nd read by the client, receiving the frame from the server, the server then runs the two reads of the loop straight away, without waiting for another write. Failing on the 2nd having seemingly read in nothing as it outputs:
Length of incoming frame is
Failed receiving frame
With no number, which to me makes sense as I haven't sent another write with the length of the next frame. I'm just wondering what I'm missing/why it's acting like this? As on the first loop it waits until the write commands from the client. I'm wondering if it means there is left over data in the write stream, so when it goes back to the top it immediately reads it again? Although it then doesn't print any form of number which to me implies there's nothing there...
Any help would be greatly appreciated.
EDIT/UPDATE:
Changed the read/write sections on the server to do a single byte at a time like this:
while (ntotal != incomingframesize)
{
n = read(conn_desc, &framebuff[ntotal], sizeof(char));
ntotal = ntotal + n;
while (i < k)
{
m = write(conn_desc, &framebuff[i], sizeof(char));
Which seems to have solved the problems I was having and now the correct data is being transferred :)

When the client writes the frame size it uses the length of some object, but when the server reads it always tries to read 6 characters. You need to use a fixed length for the frame size!
When reading, you cannot assume that you get as many bytes as you ask for. The return value, if >0, is the number of bytes actually read. If you get less than you asked for, you need to keep reading until you have received the number of bytes you expect.
First read until you've got 6 bytes (frame size).
Next read until you've got the number of bytes indicated by the frame size.
Make sure you use the same number of bytes for the frame size in all places.
Edit:
I also noted a bug in the call to write in the server:
n = write(conn_desc, &framebuff, sizeof(char)*k);
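framebuff is an array, so &framebuff is a pointer to the whole array rather than to its first element. It has the same address, so the call happens to work, but you probably mean: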
n = write(conn_desc, &framebuff[0], k);
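The same two rules (a fixed-size length prefix, and looping until the expected number of bytes has arrived) apply to the C# client as well, since NetworkStream.Read can also return fewer bytes than requested. Here is a minimal sketch. Framing, ReadFully, WriteFrame, and ReadFrame are hypothetical names, and the 4-byte prefix in the machine's byte order is an assumption that both ends would have to agree on.

```csharp
using System;
using System.IO;

public static class Framing
{
    // Keeps reading until exactly count bytes have arrived, or throws if the stream ends first.
    public static void ReadFully(Stream stream, byte[] buffer, int count)
    {
        int total = 0;
        while (total < count)
        {
            int n = stream.Read(buffer, total, count - total);
            if (n == 0)
                throw new EndOfStreamException("Connection closed before the full frame arrived.");
            total += n;
        }
    }

    // Writes a fixed 4-byte length prefix followed by the payload.
    public static void WriteFrame(Stream stream, byte[] payload)
    {
        byte[] prefix = BitConverter.GetBytes(payload.Length); // always 4 bytes
        stream.Write(prefix, 0, prefix.Length);
        stream.Write(payload, 0, payload.Length);
    }

    // Reads the 4-byte prefix, then exactly that many payload bytes.
    public static byte[] ReadFrame(Stream stream)
    {
        var prefix = new byte[4];
        ReadFully(stream, prefix, prefix.Length);
        int length = BitConverter.ToInt32(prefix, 0);
        if (length < 0)
            throw new InvalidDataException("Negative frame length.");
        var payload = new byte[length];
        ReadFully(stream, payload, length);
        return payload;
    }
}
```

On the C side, the equivalent is to read exactly 4 bytes in a loop, interpret them as an int, and then loop on read until that many bytes have been received.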

How to read text file from memorystream without missing bytes

I am writing some code to learn the new C# async design patterns, so I thought of writing a small Windows Forms program that counts the lines and words of text files and displays the reading progress.
To avoid disk swapping, I read files into a MemoryStream and then build a StreamReader to read the text line by line and count.
The issue is that I can't update the progress bar correctly.
I read a file but there are always bytes missing, so the progress bar doesn't fill entirely.
I need a hand or an idea to achieve this. Thanks.
public async Task Processfile(string fname)
{
MemoryStream m;
fname.File2MemoryStream(out m); // custom extension which read file into
// MemoryStream
int flen = (int)m.Length; // store File length
string line = string.Empty; // used later to read lines from streamreader
int linelen = 0; // store current line bytes
int readed = 0; // total bytes read
progressBar1.Minimum = 0; // progressbar bound to winforms ui
progressBar1.Maximum = flen;
using (StreamReader sr = new StreamReader(m)) // build streamreader from ms
{
while ( ! sr.EndOfStream ) // tried ( line = await sr.ReadLineAsync() ) != null
{
line = await sr.ReadLineAsync();
await Task.Run(() =>
{
linelen = Encoding.UTF8.GetBytes(line).Length; // get & update
readed += linelen; // bytes read
// custom function
Report(new Tuple<int, int>(flen, readed)); // implements Iprogress
// to feed progress bar
});
}
}
m.Close(); // releases MemoryStream
m = null;
}

The total length assigned to flen includes the line terminators (carriage return and/or line feed) at the end of each line. ReadLineAsync() returns a string that does not include them. My guess is that the number of missing bytes in your progress bar is directly proportional to the number of lines in the file being read.
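To check this guess on a given file, a small sketch like the following compares the file's byte length with the bytes counted line by line. ProgressBytes and MissingBytes are hypothetical names. The difference should equal the bytes used by line terminators, plus a byte order mark if the file starts with one.

```csharp
using System.IO;
using System.Text;

public static class ProgressBytes
{
    // Returns how many bytes are NOT accounted for when only the text of each line is counted.
    public static long MissingBytes(byte[] fileBytes)
    {
        long counted = 0;
        using (var reader = new StreamReader(new MemoryStream(fileBytes), Encoding.UTF8))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                counted += Encoding.UTF8.GetByteCount(line); // excludes "\r" and "\n"
            }
        }
        return fileBytes.Length - counted;
    }
}
```

If the gap is confirmed, one simple option is to set the progress bar to its maximum once the loop finishes, or to report progress by line count instead of byte count.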
