Skip First Row (CSV Header Row) of HttpResponseMessage Content.ReadAsStream - c#

Below is a simplified example of a larger piece of code. Basically I'm calling one or more API endpoints and downloading a CSV file that gets written to an Azure Blob Container. If there are multiple files, each new CSV file is appended to the blob.
The issue is that when I append the target blob I end up with multiple header rows scattered throughout the file, depending on how many CSVs I consumed. All the CSVs have the same header row, and I know the first row will always end with a line feed. Is there a way to read the stream, skip the content until after the first line feed, and then copy the stream to the blob?
It seemed simple in my head, but I'm having trouble finding my way there code-wise. I don't want to wait for the whole file to download and then delete the header row in memory, since some of these files can be several gigabytes.
I am using .NET 6 if that helps.
using (Stream blobStream = await blockBlobClient.OpenWriteAsync(true))
{
    for (int i = 0; i < 3; i++)
    {
        using HttpResponseMessage response = await client.GetAsync(downloadUrls[i], HttpCompletionOption.ResponseHeadersRead);
        Stream sourceStream = response.Content.ReadAsStream();
        sourceStream.CopyTo(blobStream);
    }
}

.CopyTo copies from the current position in the stream, so all you need to do is throw away bytes until you have consumed the first line feed ('\n'):
using (Stream blobStream = await blockBlobClient.OpenWriteAsync(true))
{
    for (int i = 0; i < 3; i++)
    {
        using HttpResponseMessage response = await client.GetAsync(downloadUrls[i], HttpCompletionOption.ResponseHeadersRead);
        Stream sourceStream = response.Content.ReadAsStream();
        if (i != 0)
        {
            // Discard bytes up to and including the first '\n';
            // ReadByte returns -1 at end of stream, so bail out then too.
            int b;
            do { b = sourceStream.ReadByte(); } while (b != -1 && b != '\n');
        }
        sourceStream.CopyTo(blobStream);
    }
}
If all the files always have the same size header row, you can come up with a constant for its length. That way you can skip exactly that many bytes, like this:
using (Stream blobStream = await blockBlobClient.OpenWriteAsync(true))
{
    for (int i = 0; i < 3; i++)
    {
        using HttpResponseMessage response = await client.GetAsync(downloadUrls[i], HttpCompletionOption.ResponseHeadersRead);
        Stream sourceStream = response.Content.ReadAsStream();
        if (i != 0)
        {
            // HTTP response streams don't support Seek, so read and
            // discard exactly HeaderSizeInBytes bytes instead.
            var discard = new byte[HeaderSizeInBytes];
            int remaining = HeaderSizeInBytes;
            while (remaining > 0)
            {
                int read = sourceStream.Read(discard, 0, remaining);
                if (read <= 0) break;
                remaining -= read;
            }
        }
        sourceStream.CopyTo(blobStream);
    }
}
This will be slightly quicker, but has the downside that the file format can't easily change in the future.
P.S. You probably want to dispose sourceStream, either directly or by wrapping its creation in a using statement.
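For example, with a using declaration (a minimal sketch; ReadAsStreamAsync is the awaitable counterpart of ReadAsStream):
using Stream sourceStream = await response.Content.ReadAsStreamAsync();
await sourceStream.CopyToAsync(blobStream);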

If we can assume that the stream contains UTF-8 encoded text, then you can do the following:
Create a StreamReader against sourceStream:
var reader = new StreamReader(sourceStream);
Read the first line (assuming lines end with \n):
var header = reader.ReadLine();
Convert the first line plus a \n to a byte array:
var headerInBytes = Encoding.UTF8.GetBytes(header + "\n");
Set the position just after the first line:
sourceStream.Position = headerInBytes.Length;
Copy the source stream from the desired position:
sourceStream.CopyTo(blobStream);
This proposed solution is just an example; depending on the actual stream content you might need to adjust it further and make it more robust. Note also that setting Position requires a seekable stream, which a raw HTTP response stream is not.
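For a variant that avoids seeking altogether (a sketch, assuming UTF-8 text with \n line endings; it also works on non-seekable HTTP streams):
static async Task CopySkippingFirstLineAsync(Stream source, Stream destination)
{
    var buffer = new byte[8192];
    bool headerSkipped = false;
    int read;
    while ((read = await source.ReadAsync(buffer, 0, buffer.Length)) > 0)
    {
        int offset = 0;
        if (!headerSkipped)
        {
            // Find the first '\n' in this chunk; everything up to and
            // including it belongs to the header row.
            int nl = Array.IndexOf(buffer, (byte)'\n', 0, read);
            if (nl == -1) continue; // the header spans this whole chunk
            offset = nl + 1;
            headerSkipped = true;
        }
        await destination.WriteAsync(buffer, offset, read - offset);
    }
}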

Related

Stream.CopyToAsync is empty after first iteration

Background: I need to relay the content of the request to multiple other servers (via client.SendAsync(request)).
Problem: After the first request, the content stream is empty.
[HttpPost]
public async Task<IActionResult> PostAsync() {
    for (var n = 0; n <= 1; n++) {
        using (var stream = new MemoryStream()) {
            await Request.Body.CopyToAsync(stream);
            // why is stream.Length == 0 in the second iteration?
        }
    }
    return StatusCode((int)HttpStatusCode.OK);
}
Streams have a pointer indicating the position the stream is at; after copying, the pointer is at the end. You need to rewind the stream by setting its position to 0.
This is, however, only supported by streams that support seeking. You can read the request stream only once, because it's read "from the wire" and therefore doesn't support seeking.
When you want to copy the request stream to multiple output streams, you have two options:
Forward while you read
Read once into memory, then forward at will
The first option means all forwards happen at the same speed; the entire transfer goes as slow as the input, or as slow as the slowest reader. You read a chunk from the caller, and forward that chunk to all forward addresses.
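A minimal sketch of that first option (outputs is a hypothetical collection of target streams):
var buffer = new byte[81920];
int read;
while ((read = await Request.Body.ReadAsync(buffer, 0, buffer.Length)) > 0)
{
    // Forward each chunk to every target as soon as it arrives.
    foreach (var output in outputs)
        await output.WriteAsync(buffer, 0, read);
}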
For the second approach, you'll want to evaluate whether you can hold the entire request body plus the body for each forward address in memory. If that's not expected to be a problem and is properly configured with sensible limits, then simply copy the request stream into a single MemoryStream once, then rewind and copy that one for every call:
using (var bodyStream = new MemoryStream())
{
    await Request.Body.CopyToAsync(bodyStream);
    for (...)
    {
        using (var stream = new MemoryStream())
        {
            // Rewind before each copy; after CopyToAsync the
            // position is at the end of the stream.
            bodyStream.Position = 0;
            await bodyStream.CopyToAsync(stream);
        }
    }
}
I found out that CopyToAsync leaves the source stream's position at the last read position, so the next CopyToAsync starts reading from there and finds no more content. However, I could not use Request.Body.Position = 0, since that is not supported. I ended up copying the stream once more and resetting the position before each copy.
If someone knows a cleaner solution, you are welcome to point it out.
using (var contentStream = new MemoryStream()) {
    await Request.Body.CopyToAsync(contentStream);
    for (var n = 0; n <= 1; n++) {
        using (var stream = new MemoryStream()) {
            contentStream.Position = 0;
            await contentStream.CopyToAsync(stream);
            // works
        }
    }
}
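For what it's worth, a cleaner route on newer ASP.NET Core versions may be request buffering, which makes Request.Body itself rewindable (a sketch; EnableBuffering is available from ASP.NET Core 2.1 onwards, and firstTarget stands in for any destination stream):
// Call before the body is read for the first time.
Request.EnableBuffering();
await Request.Body.CopyToAsync(firstTarget);
// Rewind is now supported on the buffered body.
Request.Body.Position = 0;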

C# - Downloading from Google Drive in byte chunks

I'm currently developing for an environment that has poor network connectivity. My application helps to automatically download required Google Drive files for users. It works reasonably well for small files (ranging from 40KB to 2MB), but fails far too often for larger files (9MB). I know these file sizes might seem small, but in terms of my client's network environment, the Google Drive API constantly fails with the 9MB file.
I've concluded that I need to download files in smaller byte chunks, but I don't see how I can do that with the Google Drive API. I've read this over and over again, and I've tried the following code:
// with the Drive file ID and the appropriate export MIME type, I create the export request
var request = DriveService.Files.Export(fileId, exportMimeType);
// take the message so I can modify it by hand
var message = request.CreateRequest();
var client = request.Service.HttpClient;
// I change the Range headers of both the client and the message
client.DefaultRequestHeaders.Range =
    message.Headers.Range =
        new System.Net.Http.Headers.RangeHeaderValue(100, 200);
var response = await request.Service.HttpClient.SendAsync(message);
// if status code = 200, copy to local file
if (response.IsSuccessStatusCode)
{
    using (var fileStream = new FileStream(downloadFileName, FileMode.CreateNew, FileAccess.ReadWrite))
    {
        await response.Content.CopyToAsync(fileStream);
    }
}
The resultant local file (from fileStream), however, is still full-length (i.e. a 40KB file for the 40KB Drive file, and a 500 Internal Server Error for the 9MB file). On a side note, I've also experimented with ExportRequest.MediaDownloader.ChunkSize, but from what I observe it only changes the frequency at which the ExportRequest.MediaDownloader.ProgressChanged callback is called (i.e. the callback will trigger every 256KB if ChunkSize is set to 256 * 1024).
How can I proceed?
You seem to be heading in the right direction. From your last comment, the request will update progress based on the chunk size, so your observation was accurate.
Looking into the source code for MediaDownloader in the SDK, the following was found:
The core download logic. We download the media and write it to an output stream ChunkSize bytes at a time, raising the ProgressChanged event after each chunk. The chunking behavior is largely a historical artifact: a previous implementation issued multiple web requests, each for ChunkSize bytes. Now we do everything in one request, but the API and client-visible behavior are retained for compatibility.
Your example code will only download one chunk (bytes 100 to 200). Using that approach you would have to keep track of an index and download each chunk manually, copying each partial download to the file stream:
const int KB = 0x400;
int ChunkSize = 256 * KB; // 256KB

public async Task ExportFileAsync(string downloadFileName, string fileId, string exportMimeType) {
    var exportRequest = driveService.Files.Export(fileId, exportMimeType);
    var client = exportRequest.Service.HttpClient;
    // you would need to know the file size
    var size = await GetFileSize(fileId);
    using (var file = new FileStream(downloadFileName, FileMode.CreateNew, FileAccess.ReadWrite)) {
        file.SetLength(size);
        var chunks = (size + ChunkSize - 1) / ChunkSize; // round up
        for (long index = 0; index < chunks; index++) {
            var request = exportRequest.CreateRequest();
            var from = index * ChunkSize;
            var to = from + ChunkSize - 1;
            request.Headers.Range = new RangeHeaderValue(from, to);
            var response = await client.SendAsync(request);
            if (response.StatusCode == HttpStatusCode.PartialContent || response.IsSuccessStatusCode) {
                using (var stream = await response.Content.ReadAsStreamAsync()) {
                    file.Seek(from, SeekOrigin.Begin);
                    await stream.CopyToAsync(file);
                }
            }
        }
    }
}

private async Task<long> GetFileSize(string fileId) {
    var request = driveService.Files.Get(fileId);
    request.Fields = "size"; // size is not returned unless requested
    var file = await request.ExecuteAsync();
    return file.Size ?? 0; // v3 client: Size is a nullable long
}
This code makes some assumptions about the Drive API/server:
That the server will allow the multiple requests needed to download the file in chunks. I don't know if requests are throttled.
That the server still accepts the Range header, as stated in the developer documentation. A quick check like the sketch below can guard against a server that ignores it.
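A defensive check along those lines might look like this (a sketch; the variable name is illustrative):
// A plain 200 OK (instead of 206 Partial Content) means the server
// ignored the Range header and sent the whole file.
bool honoredRange = response.StatusCode == HttpStatusCode.PartialContent
    && response.Content.Headers.ContentRange != null;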

How to properly open and read from a StorageFile multiple times?

In Windows Phone 8.1 (WinRT) I'm grabbing a file from the user's documents folder and trying to read through it twice: once to count the total lines for progress-tracking purposes, and a second time to actually parse the data. However, on the second pass I get a "File is not readable" type error. So I have a partial understanding of what's going on, but not entirely. Am I getting this error because the stream is already at the end of the file? Can't I just open a new stream from the same file, or do I have to close the first stream?
Here's my code:
public async Task UploadBerData(StorageFile file)
{
    _csvParser = new CsvParser();
    var stream = await file.OpenAsync(FileAccessMode.Read);
    using (var readStream = stream.AsStreamForRead())
    {
        dataCount = _csvParser.GetDataCount(stream.AsStreamForRead());
        // Set the progressBar total to 2x dataCount.
        // Once for reading, twice for uploading data
        TotalProgress = dataCount * 2;
        CurrentProgress = 0;
    }
    var csvData = _csvParser.GetFileData(stream.AsStreamForRead());
    ...
}
After reading, the stream's position is at the end of the stream.
You can set it back to the beginning to read the stream again.
Add the following line before your parse-data call (IRandomAccessStream exposes a Seek method rather than a settable Position):
stream.Seek(0);
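Alternatively, as you suspected, you can open a fresh stream from the same file for each pass (a sketch using the question's variables):
using (var countStream = (await file.OpenAsync(FileAccessMode.Read)).AsStreamForRead())
{
    dataCount = _csvParser.GetDataCount(countStream);
}
using (var parseStream = (await file.OpenAsync(FileAccessMode.Read)).AsStreamForRead())
{
    var csvData = _csvParser.GetFileData(parseStream);
}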

How to read text file from memorystream without missing bytes

I am writing some code to learn new C# async design patterns. So I thought I'd write a small Windows Forms program that counts lines and words of text files and displays the reading progress.
To avoid disk swapping, I read files into a MemoryStream and then build a StreamReader to read the text by lines and count.
The issue is I can't update the progress bar correctly.
I read a file but there are always bytes missing, so the progress bar doesn't fill entirely.
I need a hand or an idea to achieve this. Thanks.
public async Task Processfile(string fname)
{
    MemoryStream m;
    fname.File2MemoryStream(out m); // custom extension which reads a file into a MemoryStream
    int flen = (int)m.Length;       // store file length
    string line = string.Empty;     // used later to read lines from the StreamReader
    int linelen = 0;                // store current line bytes
    int readed = 0;                 // total bytes read
    progressBar1.Minimum = 0;       // progress bar bound to WinForms UI
    progressBar1.Maximum = flen;
    using (StreamReader sr = new StreamReader(m)) // build StreamReader from the MemoryStream
    {
        while (!sr.EndOfStream) // tried ( line = await sr.ReadLineAsync() ) != null
        {
            line = await sr.ReadLineAsync();
            await Task.Run(() =>
            {
                linelen = Encoding.UTF8.GetBytes(line).Length; // get & update bytes read
                readed += linelen;
                // custom function implementing IProgress to feed the progress bar
                Report(new Tuple<int, int>(flen, readed));
            });
        }
    }
    m.Close(); // releases the MemoryStream
    m = null;
}
The total length assigned to flen includes the line terminators (carriage return/line feed) of each line, but ReadLineAsync() returns the line without its terminator. My guess is that the number of missing bytes in your progress bar is directly proportional to the number of line terminators in the file being read.
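If the file is known to use \r\n line endings, a minimal fix (a sketch; use +1 instead for bare \n endings) is to count the terminator along with each line:
linelen = Encoding.UTF8.GetBytes(line).Length + 2; // +2 for the trailing "\r\n"
readed += linelen;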

How to tell if a file is text-readable in C#

Part of a list of projects I'm doing is a little text editor.
At one point, you can load all the subdirectories and files in a given directory. The program will add each as a node in a TreeView.
What I want is for it to only add files that are readable by a normal text reader.
This code currently adds it to the tree:
TreeNode navNode = new TreeNode();
navNode.Text = file.Name;
navNode.Tag = file.FullName;
directoryNode.Nodes.Add(navNode);
I know I could easily create an if statement like:
if (file.Extension.Equals(".txt"))
but I would have to expand that statement to contain every single extension it could possibly be.
Is there an easier way to do this? I'm thinking it may have something to do with MIME types or file encoding.
There is no general way of figuring out the type of information stored in a file.
Even if you know in advance that it is some sort of text, if you don't know what encoding was used to create the file you may not be able to load it properly.
Note that HTTP gives you a hint about the type of a file via the Content-Type header, but there is no such information on the file system.
There are a few methods you could use to "best guess" whether or not a file is a text file. Of course, the more encodings you support, the harder this becomes, especially if you plan to support CJK (Chinese, Japanese, Korean) scripts. Let's just start with Encoding.ASCII and Encoding.UTF8 for now.
Fortunately, most non-text files (executables, images, and the like) have a lot of non-parsable characters in their first couple of kilobytes.
What you could do is take a file and scan the first 1-4KB (up to you) and see if any "non-printable" characters come up. This operation shouldn't take much time and will at least give you some certainty of the contents of the file.
public static async Task<bool> IsValidTextFileAsync(string path,
    int scanLength = 4096)
{
    using (var stream = File.Open(path, FileMode.Open, FileAccess.Read, FileShare.Read))
    using (var reader = new StreamReader(stream, Encoding.UTF8))
    {
        // Note: the number of chars read can be fewer than the number of
        // bytes in the stream for multi-byte UTF-8 sequences, so don't
        // insist on an exact match with the buffer length.
        var bufferLength = (int)Math.Min(scanLength, stream.Length);
        var buffer = new char[bufferLength];
        var charsRead = await reader.ReadBlockAsync(buffer, 0, bufferLength);
        for (int i = 0; i < charsRead; i++)
        {
            var c = buffer[i];
            // Control characters other than whitespace (CR, LF, tab)
            // suggest a binary file.
            if (char.IsControl(c) && !char.IsWhiteSpace(c))
                return false;
        }
        return true;
    }
}
My approach, based on #Rubenisme's comment and #Erik's answer:
public static bool IsValidTextFile(string path)
{
    using (var stream = System.IO.File.Open(path, System.IO.FileMode.Open, System.IO.FileAccess.Read, System.IO.FileShare.Read))
    using (var reader = new System.IO.StreamReader(stream, System.Text.Encoding.UTF8))
    {
        var content = reader.ReadToEnd();
        return content.All(c =>   // Are all the characters either a:
            c == (char)10         // Line feed
            || c == (char)13      // Carriage return
            || c == (char)9       // Tab ((char)11 is the vertical tab)
            || !char.IsControl(c) // Non-control (regular) character
        );
    }
}
A hacky way to do it would be to see if the file contains any of the low control characters (0-31) that aren't forms of whitespace (carriage return, tab, vertical tab, line feed, and, just to be safe, null and end-of-text). If it does, it is probably binary; if it does not, it probably isn't. I haven't done any testing to see what happens when applying this rule to non-ASCII encodings, so you'd have to investigate further yourself :)
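That rule might be sketched as follows (the byte whitelist mirrors the characters named above; requires System.Linq):
public static bool LooksLikeBinary(byte[] sample)
{
    // Low control characters that still commonly appear in text.
    var allowed = new byte[] { 0, 3, 9, 10, 11, 13 }; // NUL, ETX, tab, LF, VT, CR
    return sample.Any(b => b < 32 && !allowed.Contains(b));
}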
