I am writing some code to learn new C# async design patterns, so I thought I'd write a small Windows Forms program that counts the lines and words of text files and displays the reading progress.
To avoid repeated disk access, I read each file into a MemoryStream and then build a StreamReader over it to read the text line by line and count.
The issue is that I can't update the progress bar correctly.
Whenever I read a file there are always bytes missing, so the progress bar never fills entirely.
I need a hand or an idea to achieve this. Thanks.
public async Task Processfile(string fname)
{
    MemoryStream m;
    fname.File2MemoryStream(out m);     // custom extension which reads the file
                                        // into a MemoryStream
    int flen = (int)m.Length;           // store file length
    string line = string.Empty;         // used later to read lines from the StreamReader
    int linelen = 0;                    // stores current line bytes
    int readed = 0;                     // total bytes read
    progressBar1.Minimum = 0;           // progress bar bound to the WinForms UI
    progressBar1.Maximum = flen;
    using (StreamReader sr = new StreamReader(m))   // build StreamReader from the MemoryStream
    {
        while (!sr.EndOfStream)         // tried ( line = await sr.ReadLineAsync() ) != null
        {
            line = await sr.ReadLineAsync();
            await Task.Run(() =>
            {
                linelen = Encoding.UTF8.GetBytes(line).Length;  // get & update
                readed += linelen;                              // bytes read
                // custom function implementing IProgress
                // to feed the progress bar
                Report(new Tuple<int, int>(flen, readed));
            });
        }
    }
    m.Close();  // releases the MemoryStream
    m = null;
}
The total length being assigned to flen includes the line-break characters of each line, but ReadLineAsync() returns a string that does not include them. My guess is that the number of missing bytes in your progress bar is directly proportional to the number of line breaks in the file being read.
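A minimal sketch of one way to compensate, assuming UTF-8 text whose lines end with Environment.NewLine (the last line may lack a terminator, so the count is clamped to flen):
// Sketch: add the stripped line terminator back into the byte count.
// Assumes every line ends with Environment.NewLine; adjust if the file uses "\n".
int newlineBytes = Encoding.UTF8.GetByteCount(Environment.NewLine);

while (!sr.EndOfStream)
{
    line = await sr.ReadLineAsync();
    readed += Encoding.UTF8.GetByteCount(line) + newlineBytes;
    Report(new Tuple<int, int>(flen, Math.Min(readed, flen)));  // clamp in case the last line has no terminator
}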
I've got a stream of data incoming as JSON and I'm trying to save it to a text file. I've got it working below; however, when I check the file it only has the last JSON message received saved. I am trying to get it so that once it saves a line it moves onto a new line and prints the latest JSON message below it. At the moment it will print, say, 1000 lines, but they are all the same and they all match the latest JSON received.
Any help would be much appreciated.
void ReceiveData() // This function is used to listen for messages from the flight simulator
{
    while (true)
    {
        NetworkStream stream = client.GetStream();  // sets the network stream to the client's stream
        byte[] buffer = new byte[256];              // defines the max amount of bytes that can be sent
        int bytesRead = stream.Read(buffer, 0, buffer.Length);
        if (bytesRead > 0)
        {
            string jsonreceived = Encoding.ASCII.GetString(buffer, 0, bytesRead); // converts the received data into ASCII for the json variable
            JavaScriptSerializer serializer = new JavaScriptSerializer();
            TelemetryUpdate telemetry = serializer.Deserialize<TelemetryUpdate>(jsonreceived);
            this.Invoke(new Action(() =>
            {
                TelemetryReceivedLabel.Text = jsonreceived;
            }));
            Updatelabels(telemetry);                // runs the update labels function with the telemetry data as an argument
            File.Delete(@"c:\temp\BLACKBOX.txt");   // this deletes the original file
            string path = @"c:\temp\BLACKBOX.txt";  // this stores the path of the file in a string
            using (StreamWriter sw = File.CreateText(path)) // create a file to write to
            {
                for (int i = 0; i < 10000; i++)
                {
                    sw.Write(jsonreceived.ToString()); // writes the json data to the file
                }
            }
        }
    }
}
As per the .NET documentation for File.CreateText:
Creates or opens a file for writing UTF-8 encoded text. If the file already exists, its contents are overwritten.
So, every time you call File.CreateText you're creating a new StreamWriter that's going to overwrite the contents of your file. Try using File.AppendText instead to pick up where you left off.
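A minimal sketch of that change, reusing the path and variable names from the question (WriteLine is used so each message lands on its own line):
string path = @"c:\temp\BLACKBOX.txt";
using (StreamWriter sw = File.AppendText(path))  // opens the file (creating it if needed) and seeks to the end
{
    sw.WriteLine(jsonreceived);                  // WriteLine appends a newline after each JSON message
}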
Below is a simplified example of a larger piece of code. Basically I'm calling one or more API endpoints and downloading a CSV file that gets written to an Azure Blob Container. If there are multiple files, the blob is appended to for every new CSV file loaded.
The issue is that when I append to the target blob I end up with multiple header rows scattered throughout the file, depending on how many CSVs I consumed. All the CSVs have the same header row, and I know the first row will always end with a line feed. Is there a way to read the stream, skip the content until after the first line feed, and then copy the rest of the stream to the blob?
It seemed simple in my head, but I'm having trouble finding my way there code-wise. I don't want to wait for the whole file to download and then delete the header row in memory, since some of these files can be several gigabytes.
I am using .NET 6, if that helps.
using Stream blobStream = await blockBlobClient.OpenWriteAsync(true);
{
    for (int i = 0; i < 3; i++)
    {
        using HttpResponseMessage response = await client.GetAsync(downloadUrls[i], HttpCompletionOption.ResponseHeadersRead);
        Stream sourceStream = response.Content.ReadAsStream();
        sourceStream.CopyTo(blobStream);
    }
}
.CopyTo copies from the current position in the stream, so all you need to do is throw away the bytes up to and including the first line feed ('\n').
using Stream blobStream = await blockBlobClient.OpenWriteAsync(true);
{
    for (int i = 0; i < 3; i++)
    {
        using HttpResponseMessage response = await client.GetAsync(downloadUrls[i], HttpCompletionOption.ResponseHeadersRead);
        Stream sourceStream = response.Content.ReadAsStream();
        if (i != 0)
        {
            // Skip everything up to and including the first '\n'.
            // ReadByte() returns -1 at end of stream, so also stop there.
            int b;
            do { b = sourceStream.ReadByte(); } while (b != -1 && b != '\n');
        }
        sourceStream.CopyTo(blobStream);
    }
}
If all the files always have the same size header row, you can come up with a constant for its length. That way you could just skip the stream to the exact correct location like this:
using Stream blobStream = await blockBlobClient.OpenWriteAsync(true);
{
    for (int i = 0; i < 3; i++)
    {
        using HttpResponseMessage response = await client.GetAsync(downloadUrls[i], HttpCompletionOption.ResponseHeadersRead);
        Stream sourceStream = response.Content.ReadAsStream();
        if (i != 0)
        {
            // The HTTP response stream is not seekable, so read and discard exactly
            // HeaderSizeInBytes bytes instead of calling Seek.
            byte[] discard = new byte[HeaderSizeInBytes];
            int skipped = 0, n;
            while (skipped < HeaderSizeInBytes &&
                   (n = sourceStream.Read(discard, skipped, HeaderSizeInBytes - skipped)) > 0)
            {
                skipped += n;
            }
        }
        sourceStream.CopyTo(blobStream);
    }
}
This will be slightly quicker but does have the downside that the files can't change format easily in the future.
P.S. You probably want to Dispose sourceStream, either directly or by wrapping its creation in a using statement.
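For example, the copy step above could use a using declaration (a small sketch with the same names as before):
// Sketch: the using declaration disposes sourceStream at the end of each loop iteration.
using Stream sourceStream = response.Content.ReadAsStream();
sourceStream.CopyTo(blobStream);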
If we can assume the stream contains UTF-8 encoded text, then you can do the following:
Create a StreamReader over sourceStream:
var reader = new StreamReader(sourceStream);
Read the first line (assuming lines end with \n):
var header = reader.ReadLine();
Convert the first line plus a newline to a byte array:
var headerInBytes = Encoding.UTF8.GetBytes(header + Environment.NewLine);
Set the position to just after the first line:
sourceStream.Position = headerInBytes.Length;
Copy the source stream from the desired position:
sourceStream.CopyTo(blobStream);
This proposed solution is just an example; depending on the actual stream content you may need to adjust it further and make it more robust.
I need to create a function that allows a user to click a button in the browser and send a request to the server to get information from the database. The information will be output to a .csv file, which the client can then download.
I intend to use one of these two methods of System.Web.Mvc.Controller:
protected internal virtual FileStreamResult File(Stream fileStream, string contentType, string fileDownloadName)
or
protected internal virtual FileContentResult File(byte[] fileContents, string contentType, string fileDownloadName);
But the output file size can be up to 4 gigabytes, so I worry that buffering the file in a Stream or byte[] will cause server memory problems.
I split my process into two steps.
Step 1: output the csv file with memory usage in mind
public class CsvOutputHelper
{
    private const int MaxCountOfOutputLine = 100;   // flush the buffer to disk every 100 lines
    private readonly List<string> _csvHeader;

    public CsvOutputHelper(List<string> csvHeader)
    {
        this._csvHeader = csvHeader;
    }

    public void OutputFile(List<List<string>> data, string filePath)
    {
        // Create the new file and write the header text.
        // ConvertToCsvRecord() is a custom extension method (not shown).
        File.WriteAllText(filePath, this._csvHeader.ConvertToCsvRecord());
        // 'StringBuilder' used to buffer output before writing to the file
        var sb = new StringBuilder();
        sb.AppendLine();
        // Line counter
        var lineCounterValue = 1;
        for (var i = 0; i < data.Count; i++)
        {
            // Create the line content of the csv file and append it to the buffer
            sb.AppendLine(data[i].ConvertToCsvRecord());
            // Increase the line counter
            lineCounterValue++;
            // If the buffer reaches 100 lines, or the loop reaches the end of the data list,
            // write the buffered text to the file and reset the buffer and the line counter
            if (lineCounterValue == MaxCountOfOutputLine || i == data.Count - 1)
            {
                // Output the buffered string
                File.AppendAllText(filePath, sb.ToString());
                sb = new StringBuilder();   // re-create a new instance of 'StringBuilder'
                lineCounterValue = 1;       // reset the line counter to 1
            }
        }
    }
}
Step 2: return the output file path on the server (a relative path) to the browser, and have the browser request that path to download the file.
Is my solution a good way to implement this case? Will using Stream or byte[] cause server memory problems or not? Please explain.
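For reference, a minimal, hypothetical sketch of the File(Stream, ...) overload mentioned above (the action name and the outputPath parameter are made up for illustration); handing MVC a FileStream lets it stream the response rather than buffering the whole file in memory:
// Hypothetical controller action illustrating the FileStreamResult overload from the question.
public ActionResult DownloadCsv(string outputPath)
{
    var stream = new FileStream(outputPath, FileMode.Open, FileAccess.Read, FileShare.Read);
    // FileStreamResult writes the stream to the response and disposes it afterwards.
    return File(stream, "text/csv", "export.csv");
}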
In Windows Phone 8.1 (WinRT) I'm grabbing a file from the user's documents folder and trying to read through it twice: once to read each line and get a count of total lines for progress-tracking purposes, and a second time to actually parse the data. However, on the second pass I get a "File is not readable" type error. So I have a small understanding of what's going on, but not entirely. Am I getting this error because the stream of the file is already at the end of the file? Can't I just open a new stream from the same file, or do I have to close the first stream?
Here's my code:
public async Task UploadBerData(StorageFile file)
{
    _csvParser = new CsvParser();
    var stream = await file.OpenAsync(FileAccessMode.Read);
    using (var readStream = stream.AsStreamForRead())
    {
        dataCount = _csvParser.GetDataCount(stream.AsStreamForRead());
        // Set the progressBar total to 2x dataCount.
        // Once for reading, twice for uploading data
        TotalProgress = dataCount * 2;
        CurrentProgress = 0;
    }
    var csvData = _csvParser.GetFileData(stream.AsStreamForRead());
    ...
}
After you have read through the Stream, its position is at the end of the stream.
You can set it back to the beginning to read the stream again.
Add the following line before your parse-data call:
stream.Position = 0;
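Applied to the method from the question, a minimal sketch might look like this (assuming the same wrapped .NET stream is reused for both passes):
using (var readStream = stream.AsStreamForRead())
{
    dataCount = _csvParser.GetDataCount(readStream);
    TotalProgress = dataCount * 2;   // once for reading, twice for uploading
    CurrentProgress = 0;

    readStream.Position = 0;         // rewind so the second pass starts at the beginning
    var csvData = _csvParser.GetFileData(readStream);
}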
I have the variable
StreamReader DebugInfo = GetDebugInfo();
var text = DebugInfo.ReadToEnd(); // takes 10 seconds!!! because there are a lot of students
text equals:
<student>
    <firstName>Antonio</firstName>
    <lastName>Namnum</lastName>
</student>
<student>
    <firstName>Alicia</firstName>
    <lastName>Garcia</lastName>
</student>
<student>
    <firstName>Christina</firstName>
    <lastName>SomeLattName</lastName>
</student>
... etc
.... many more students
What I am doing now is:
StreamReader DebugInfo = GetDebugInfo();
var text = DebugInfo.ReadToEnd(); // takes 10 seconds!!!
var mtch = Regex.Match(text, @"(?s)<student>.+?</student>");
// keep parsing the file while there are more students
while (mtch.Success)
{
    AddStudent(mtch.Value); // parse text node into object and add it to corresponding node
    mtch = mtch.NextMatch();
}
The whole process takes about 25 seconds. Converting the StreamReader to text (var text = DebugInfo.ReadToEnd();) takes 10 seconds; the other part takes about 15 seconds. I was hoping I could do the two parts at the same time...
EDIT
I would like to have something like:
const int bufferSize = 1024;
var sb = new StringBuilder();
Task.Factory.StartNew(() =>
{
    Char[] buffer = new Char[bufferSize];
    int count = bufferSize;
    using (StreamReader sr = GetUnparsedDebugInfo())
    {
        while (count > 0)
        {
            count = sr.Read(buffer, 0, bufferSize);
            sb.Append(buffer, 0, count);
        }
    }
    var m = sb.ToString();
});
Thread.Sleep(100);
// while the string is being built, start adding items
var mtch = Regex.Match(sb.ToString(), @"(?s)<student>.+?</student>");
// keep parsing the file while there are more nodes
while (mtch.Success)
{
    AddStudent(mtch.Value);
    mtch = mtch.NextMatch();
}
Edit 2
Summary
Sorry, I forgot to mention that the text is very similar to XML but it is not; that's why I have to use regular expressions... In short, I think I could save time because what I am doing now is converting the stream to a string and then parsing the string. Why not just parse the stream with a regex? Or, if that is not possible, why not take a chunk of the stream and parse that chunk on a separate thread?
UPDATED:
This basic code reads a (roughly) 20 megabyte file in 0.75 seconds, so my machine should process roughly 53.33 megabytes in the 2 seconds that you reference. Further, 20,000,000 / 2,048 = 9765.625 reads, and 0.75 / 9765.625 = 0.0000768, which means you are reading 2048 characters roughly every 76.8 microseconds (7.68x10^-5 seconds). You need to weigh the cost of context switching against the timing of your iterations to decide whether the added complexity of multi-threading is appropriate. At 7.68x10^-5 seconds per read, I see your reader thread sitting idle most of the time. It doesn't make sense to me. Just use a single loop with a single thread.
char[] buffer = new char[2048];
using (StreamReader sr = new StreamReader(@"C:\20meg.bin"))
{
    while (sr.Read(buffer, 0, 2048) != 0)
    {
        ; // do nothing
    }
}
For large operations like this, you want to use a forward-only, non-cached reader. It looks like your data is XML, so an XmlTextReader is perfect for this. Here is some sample code. Hope this helps.
string firstName;
string lastName;
using (XmlTextReader reader = GetDebugInfo())
{
    while (reader.Read())
    {
        if (reader.IsStartElement() && reader.Name == "student")
        {
            reader.ReadToDescendant("firstName");
            reader.Read();
            firstName = reader.Value;
            reader.ReadToFollowing("lastName");
            reader.Read();
            lastName = reader.Value;
            AddStudent(firstName, lastName);
        }
    }
}
I used the following XML:
<students>
    <student>
        <firstName>Antonio</firstName>
        <lastName>Namnum</lastName>
    </student>
    <student>
        <firstName>Alicia</firstName>
        <lastName>Garcia</lastName>
    </student>
    <student>
        <firstName>Christina</firstName>
        <lastName>SomeLattName</lastName>
    </student>
</students>
You may need to tweak it. This should run much, much faster.
You can read line by line, but if reading the data takes 15 seconds there is not much you can do to speed things up.
Before making any significant changes, try simply reading all the lines of the file and doing no processing. If that still takes longer than your goal, adjust your goals or change the file format. Otherwise, see how much you can expect to gain from optimizing the parsing - RegEx is quite fast for non-complicated regular expressions.
RegEx isn't the fastest way to parse a string. You need a tailored parser, similar to XmlReader (to match your data structure). It will allow you to read the file partially and parse it much faster than RegEx does.
Since you have a limited set of tags and nesting, an FSM approach (http://en.wikipedia.org/wiki/Finite-state_machine) will work for you.
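A rough sketch of what such a forward-only parser could look like for the question's data, assuming the only tags that matter are student, firstName, and lastName, that each value fits on one line, and borrowing AddStudent(firstName, lastName) from the answer above:
// Sketch of a single-pass, line-based parser for the question's pseudo-XML.
// The "state" is simply which fields have been seen since the last <student>.
void ParseStudents(TextReader reader)
{
    string firstName = null, lastName = null;
    string line;
    while ((line = reader.ReadLine()) != null)
    {
        line = line.Trim();
        if (line.StartsWith("<firstName>"))
            firstName = ExtractValue(line, "firstName");
        else if (line.StartsWith("<lastName>"))
            lastName = ExtractValue(line, "lastName");
        else if (line == "</student>")
            AddStudent(firstName, lastName);   // emit the completed record
    }
}

// Returns the text between <tag> and </tag>; assumes a well-formed, single-line element.
static string ExtractValue(string line, string tag)
{
    int start = tag.Length + 2;                                  // length of "<tag>"
    int end = line.IndexOf("</" + tag + ">", StringComparison.Ordinal);
    return line.Substring(start, end - start);
}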
Here is what turned out to be the fastest (maybe I missed more things to try).
I created an array of arrays, char[][] listToProcess = new char[200000][], where I place chunks of the stream. On a separate task I start processing each chunk. The code looks like:
StreamReader sr = GetUnparsedDebugInfo(); // get the StreamReader
var task1 = Task.Factory.StartNew(() =>
{
    Thread.Sleep(500); // wait a little so there are items on the list (listToProcess) to work with
    StartProcesingList();
});
int counter = 0;
while (true)
{
    char[] buffer = new char[2048]; // create a new buffer each time; we will add it to the list to process
    var charsRead = sr.Read(buffer, 0, buffer.Length);
    if (charsRead < 1) // if we reach the end then stop
    {
        break;
    }
    listToProcess[counter] = buffer;
    counter++;
}
task1.Wait();
The method StartProcesingList() basically walks through the list until it reaches a null entry:
void StartProcesingList()
{
    int indexOnList = 0;
    while (true)
    {
        if (listToProcess[indexOnList] == null)
        {
            Thread.Sleep(100); // wait a little in case the other thread is adding more items to the list
            if (listToProcess[indexOnList] == null)
                break;
        }
        // Add the chunk to the dictionary. Recall that listToProcess[indexOnList] is a
        // char array, so this basically converts it to a string and splits it where appropriate.
        // There is more logic, e.g. the last part of a chunk has to be processed
        // together with the first part of the next item on the list.
        ProcessChunk(listToProcess[indexOnList]);
        indexOnList++;
    }
}
@kakridge was right. I could be dealing with a race condition where one task is writing listToProcess[30], for example, while another thread is parsing listToProcess[30]. To fix that problem, and also to remove the inefficient Thread.Sleep calls, I ended up using semaphores. Here is my new code:
StreamReader unparsedDebugInfo = GetUnparsedDebugInfo(); // get the StreamReader
listToProcess = new char[200000][];
lastPart = null;
matchLength = 0;
// Used to signal events between the thread that is reading text
// from readelf.exe and the thread that is parsing chunks
Semaphore semaphore = new Semaphore(0, 1);
// If task1 runs out of chunks to process it will wait for the semaphore to post a message
bool task1IsWaiting = false;
// Used to note that there are no more chunks to add to listToProcess
bool mainTaskIsDone = false;
int counter = 0; // keeps track of which chunk we have added to the list
// This task is executed on a separate thread. While the other thread adds nodes to the
// "listToProcess" array, this task adds those chunks to the dictionary.
var task1 = Task.Factory.StartNew(() =>
{
    semaphore.WaitOne(); // wait until there are at least 1024 nodes to be processed
    int indexOnList = 0; // identifies the index of the chunk we are adding to the dictionary
    while (true)
    {
        if (indexOnList >= counter) // if equal it might be dangerous!
        {                           // a chunk could be written to and parsed at the same time
            if (mainTaskIsDone)     // if the main task is done executing, stop
                break;
            task1IsWaiting = true;  // otherwise wait until there are more chunks to be processed
            semaphore.WaitOne();
        }
        ProcessChunk(listToProcess[indexOnList]); // add chunk to dictionary
        indexOnList++;
    }
});
// This block, executed on the main thread, is responsible for splitting the StreamReader
// into chunks of char[] so that task1 can start processing those chunks.
{
    // Every time task1 is waiting, this counter makes the main thread place at least
    // 256 more chunks on listToProcess before task1 continues parsing.
    int waitCounter = 1024;
    while (true)
    {
        char[] buffer = new char[2048]; // buffer where we place data read from the stream
        var charsRead = unparsedDebugInfo.Read(buffer, 0, buffer.Length);
        if (charsRead < 1)
        {
            listToProcess[counter] = pattern;
            break;
        }
        listToProcess[counter] = buffer;
        counter++; // add chunk to the list to be processed by task1
        if (task1IsWaiting)
        {   // if task1 is waiting for more nodes, queue 256 more
            waitCounter = counter + 256; // before letting task1 continue
            task1IsWaiting = false;
        }
        else if (counter == waitCounter)
            semaphore.Release();
    }
}
mainTaskIsDone = true; // let the other thread know that this task is done
semaphore.Release();   // release any thread that might be waiting on this one
task1.Wait();          // wait for all nodes to finish processing