I have created a text file using TextWriter in C#. On final creation, the text file often has several rows of whitespace at the end of the file. The whitespace is not included in any of the string objects that make up the file, and I don't know what is causing it. The larger the file, the more whitespace there is.
I've tried various tests to see if the whitespace occurs based upon the content of the string, but this is not the case. For example, I identified the row where the whitespace starts and changed the string to something completely different, but the whitespace still occurred.
//To start:
MemoryStream memoryStream = new MemoryStream();
TextWriter tw = new StreamWriter(memoryStream);
//Loop through records & create a concatenated string object
string strUTL1 = string.Format("{0}{1}{2}{3}{4}{5}{6}{7}", strUTL1_1, strUTL1_2, strUTL1_3, strUTL1_4, strUTL1_5, strUTL1_6, strUTL1_7, strUTL1_8);
//Add the line to the text file
tw.WriteLine(strUTL1);
//Once all rows are added I complete the file
tw.Flush();
tw.Close();
//Then return the file
return File(memoryStream.GetBuffer(), "text/plain", txtFileName);
I don't want to manipulate the file after completion (e.g. replace blank spaces), as this could lead to other problems. The file will be exchanged with a third party and needs to be formatted exactly.
Thank you for your assistance.
As the doc for MemoryStream.GetBuffer explains:
Note that the buffer contains allocated bytes which might be unused. For example, if the string "test" is written into the MemoryStream object, the length of the buffer returned from GetBuffer is 256, not 4, with 252 bytes unused. To obtain only the data in the buffer, use the ToArray method; however, ToArray creates a copy of the data in memory.
Use .ToArray() (which will allocate a new array of the right size), or you can use the buffer returned from .GetBuffer(), but then you'll need to check the stream's .Length to see how many valid bytes are in it.
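For example, a minimal sketch of both options, reusing the names from the question (and assuming File is the ASP.NET MVC helper the question appears to be using):

// Option 1: ToArray() allocates a new array containing exactly
// memoryStream.Length bytes -- no trailing unused capacity.
return File(memoryStream.ToArray(), "text/plain", txtFileName);

// Option 2: avoid the copy with GetBuffer(), but respect Length,
// since the buffer's capacity is usually larger than the data written.
byte[] raw = memoryStream.GetBuffer();
var trimmed = new MemoryStream(raw, 0, (int)memoryStream.Length);
return File(trimmed, "text/plain", txtFileName);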
GetBuffer() returns all the memory that was allocated, which is almost always more bytes than what you actually wrote into it.
Might I suggest using Encoding.UTF8.GetBytes(...) instead:
string strUTL1 = string.Format("{0}{1}{2}{3}{4}{5}{6}{7}", strUTL1_1, strUTL1_2, strUTL1_3, strUTL1_4, strUTL1_5, strUTL1_6, strUTL1_7, strUTL1_8);
var bytes = Encoding.UTF8.GetBytes(strUTL1);
return File(bytes, "text/plain", txtFileName);
Use ToArray() instead of GetBuffer(), since the buffer is larger than needed.
That's often the case. Classes or functions that work with buffers usually reserve a certain amount of memory to hold the data. The function then returns how many bytes were actually written to the buffer, and you should use only the first n bytes of it.
Quoting MSDN:
For example, if the string "test" is written into the MemoryStream object, the length of the buffer returned from GetBuffer() is 256, not 4, with 252 bytes unused. To obtain only the data in the buffer, use the ToArray() method; however, ToArray() creates a copy of the data in memory.
My code has to consume data from a NetworkStream, and the data read from the stream will contain three parts: metadata, a well-known delimiter, and data.
I'm trying to determine the most efficient way of reading from the NetworkStream, up to the end of the delimiter. The metadata portion is generally measured in hundreds of bytes (but could be as small as 32 bytes), the delimiter is a specific 2-byte sequence, and the data could range from zero bytes to several gigabytes in size (the metadata provides information on the data length). I should only read up to the delimiter, because the rest of the stream (containing payload data) needs to be used elsewhere, and NetworkStream doesn't support seek and the data may be so large that I can't dump it all into a MemoryStream.
I've been using the following, and it works, but it seems there could be a more efficient way of reading up to the delimiter. Since the minimum metadata size is 32 bytes, I start with a 34-byte buffer (32 bytes of metadata + 2 bytes of delimiter), read from the stream, and check for the delimiter. If the delimiter is found (the smallest possible metadata), the code breaks out, and the balance of the stream contains the data. If the delimiter is not found, the code loops, reading a single byte at a time and checking the last two characters of the StringBuilder used to hold what has been read from the stream, until the delimiter is found at the end.
(code reduced for brevity, removed checking of negative cases, etc)
string delim = "__";
StringBuilder sb = new StringBuilder();
byte[] buffer = new byte[1];
byte[] initialBuffer = new byte[34];
int bytesRead = stream.Read(initialBuffer, 0, 34); // yes I check bytesRead in the actual code
sb.Append(Encoding.UTF8.GetString(initialBuffer));
while (true)
{
    string delimCheck = sb.ToString(sb.Length - 2, 2);
    if (delimCheck.Equals(delim)) break;
    else
    {
        buffer = new byte[1];
        bytesRead = stream.Read(buffer, 0, 1); // yes I check bytesRead in the actual code
        sb.Append(Encoding.UTF8.GetString(buffer));
    }
}
The code works, but it seems really inefficient and slow to read one byte at a time to reach the end of the delimiter. Is anything readily apparent that might better optimize this code?
Thanks!
Do you see those Read(array, offset, count) return values you are putting into a variable bytesRead and then happily ignoring?
Those (along with setting the socket in non-blocking mode) are the solution to your problem. Then you can access "everything received so far" without getting stuck waiting for enough extra data to arrive to fill your array.
Even in blocking mode, ignoring that return value is a bug: when the socket is gracefully shut down, you will get a partial read where bytesRead < bytesRequested.
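To make that concrete, here is a hedged sketch of a read loop that only ever trusts the bytes Read() actually returned while scanning for the two-byte delimiter (names and buffer sizes are illustrative; it relies on the question's guarantee of at least 32 bytes of metadata, so the delimiter can only ever sit at the tail of what has been read so far):

var metadata = new List<byte>();
byte[] initial = new byte[34];
byte[] one = new byte[1];

// Only the first 'read' bytes of the buffer are valid data.
int read = stream.Read(initial, 0, initial.Length);
for (int i = 0; i < read; i++)
    metadata.Add(initial[i]);

bool EndsWithDelimiter() =>
    metadata.Count >= 2 &&
    metadata[metadata.Count - 2] == (byte)'_' &&
    metadata[metadata.Count - 1] == (byte)'_';

while (!EndsWithDelimiter())
{
    read = stream.Read(one, 0, 1);
    if (read == 0) throw new EndOfStreamException(); // graceful shutdown
    metadata.Add(one[0]);
}

string metadataText = Encoding.UTF8.GetString(metadata.ToArray());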
Regarding your concerns about how to save the extra data for later, Microsoft provided a class for that. See System.IO.BufferedStream and the example:
The following code examples show how to use the BufferedStream class over the NetworkStream class to increase the performance of certain I/O operations. Start the server on a remote computer before starting the client. Specify the remote computer name as a command-line argument when starting the client. Vary the dataArraySize and streamBufferSize constants to view their effect on performance.
Source: https://learn.microsoft.com/en-us/dotnet/api/system.io.bufferedstream
Not shown in the example is that you still need to put the socket into non-blocking mode to avoid having the BufferedStream block until an entire buffer chunk is received. The Socket class provides the Blocking property to make that easy.
https://learn.microsoft.com/en-us/dotnet/api/system.net.sockets.socket.blocking
I have a file that I'm opening into a stream and passing to another method. However, I'd like to replace a string in the file before passing the stream to the other method. So:
string path = "C:/...";
Stream s = new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.Read);
//need to replace all occurrences of "John" in the file to "Jack" here.
CallMethod(s);
The original file should not be modified, only the stream. What would be the easiest way to do this?
Thanks...
It's a lot easier if you just read the file in as lines and then deal with those, instead of forcing yourself to stick with a Stream. A stream has to handle both text and binary files, and needs to be able to read one character at a time, which makes this kind of replacement very hard. If you read in a whole line at a time (as long as you don't need multi-line replacement), it's quite easy.
var lines = File.ReadLines(path)
.Select(line => line.Replace("John", "Jack"));
Note that ReadLines still does stream the data, and Select doesn't need to materialize the whole thing, so you're still not reading the whole file into memory at one time when doing this.
If you don't actually need to stream the data you can easily just load it all as one big string, do the replace, and then create a stream based on that one string:
string data = File.ReadAllText(path)
.Replace("John", "Jack");
byte[] bytes = Encoding.ASCII.GetBytes(data);
Stream s = new MemoryStream(bytes);
This question probably has many good answers. I'll try one I've used and that has always worked for me and my peers.
I suggest you create a separate stream, say a MemoryStream. Read from your FileStream and write into the memory one. You can then extract strings from either one and replace things, and you would pass the memory stream onward. That makes doubly sure that you are not messing with the original stream, and you can always read the original values from it whenever you need, though you are basically using twice as much memory with this method.
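A rough sketch of that approach, assuming the file is text (UTF-8 assumed here) and small enough to buffer in memory:

using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.Read))
using (var reader = new StreamReader(fs, Encoding.UTF8))
{
    // Copy the contents out, replace, and hand a fresh in-memory
    // stream onward; the original file is never modified.
    string contents = reader.ReadToEnd().Replace("John", "Jack");
    var ms = new MemoryStream(Encoding.UTF8.GetBytes(contents));
    CallMethod(ms);
}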
If the file has extremely long lines, if the replaced string may contain a newline, or if other constraints prevent the use of File.ReadLines() while streaming is still required, there is an alternative solution using streams only, even though it is not trivial.
Implement your own stream decorator (wrapper) that performs the replacement. I.e. a class based on Stream that takes another stream in its constructor, reads data from the stream in its Read(byte[], int, int) override and performs the replacement in the buffer. See notes to Stream implementers for further requirements and suggestions.
Let's call the string being replaced "needle", the source stream "haystack" and the replacement string "replacement".
Needle and replacement need to be encoded using the encoding of the haystack contents (typically Encoding.UTF8.GetBytes()). Inside streams, the data is not converted to string, unlike in StreamReader.ReadLine(). Thus unnecessary memory allocation is prevented.
Simple cases: If both needle and replacement are just a single byte, the implementation is just a simple loop over the buffer, replacing all occurrences. If needle is a single byte and replacement is empty (i.e. deleting the byte, e.g. deleting carriage return for line ending normalization), it is a simple loop maintaining from and to indexes to the buffer, rewriting the buffer byte by byte.
In more complex cases, implement the Knuth-Morris-Pratt (KMP) string-matching algorithm to perform the replacement.
Read the data from the underlying stream (haystack) to an internal buffer that is at least as long as needle and perform the replacement while rewriting the data to the output buffer. The internal buffer is needed so that data from a partial match are not published before a complete match is detected -- then, it would be too late to go back and delete the match completely.
Process the internal buffer byte by byte, feeding each byte into the KMP automaton. With each automaton update, write the bytes it releases to the appropriate position in output buffer.
When a match is detected by KMP, replace it: reset the automaton keeping the position in the internal buffer (which deletes the match) and write the replacement in the output buffer.
When end of either buffer is reached, keep the unwritten output and unprocessed part of the internal buffer including current partial match as a starting point for next call to the method and return the current output buffer. Next call to the method writes the remaining output and starts processing the rest of haystack where the current one stopped.
When end of haystack is reached, release the current partial match and write it to the output buffer.
Just be careful not to return an empty output buffer before processing all the data of haystack -- that would signal end of stream to the caller and therefore truncate the data.
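As an illustration of the decorator idea, here is a minimal sketch of the simple single-byte case described above (the class name and constructor are hypothetical; the full needle/replacement version keeps the same shape but threads the KMP automaton state through Read()):

using System;
using System.IO;

// Replaces every occurrence of a single byte with another single byte
// as data flows through Read(); the underlying stream is never modified.
public sealed class ByteReplacingStream : Stream
{
    private readonly Stream _inner;
    private readonly byte _needle;
    private readonly byte _replacement;

    public ByteReplacingStream(Stream inner, byte needle, byte replacement)
    {
        _inner = inner;
        _needle = needle;
        _replacement = replacement;
    }

    public override int Read(byte[] buffer, int offset, int count)
    {
        int read = _inner.Read(buffer, offset, count);
        for (int i = offset; i < offset + read; i++)
            if (buffer[i] == _needle)
                buffer[i] = _replacement;
        return read;
    }

    public override bool CanRead => _inner.CanRead;
    public override bool CanSeek => false;
    public override bool CanWrite => false;
    public override long Length => throw new NotSupportedException();
    public override long Position
    {
        get => throw new NotSupportedException();
        set => throw new NotSupportedException();
    }

    public override void Flush() { }
    public override long Seek(long offset, SeekOrigin origin) => throw new NotSupportedException();
    public override void SetLength(long value) => throw new NotSupportedException();
    public override void Write(byte[] buffer, int offset, int count) => throw new NotSupportedException();
}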
I have a JSON file that is 50 GB or larger.
Following is what I have written to read a very small chunk of the Json. I now need to modify this to read the large file.
internal static IEnumerable<T> ReadJson<T>(string filePath)
{
    DataContractJsonSerializer ser = new DataContractJsonSerializer(typeof(T));
    using (StreamReader sr = new StreamReader(filePath))
    {
        string line;
        // Read and deserialize lines from the file until the end of
        // the file is reached.
        while ((line = sr.ReadLine()) != null)
        {
            byte[] jsonBytes = Encoding.UTF8.GetBytes(line);
            XmlDictionaryReader jsonReader = JsonReaderWriterFactory.CreateJsonReader(jsonBytes, XmlDictionaryReaderQuotas.Max);
            var myPerson = ser.ReadObject(jsonReader);
            jsonReader.Close();
            yield return (T)myPerson;
        }
    }
}
Would it suffice if I specify the buffer size while constructing the StreamReader in the current code?
Please correct me if I am wrong here: the buffer size basically specifies how much data is read from disk into memory at a time. So if the file is 100 MB in size with a buffer size of 5 MB, it reads 5 MB at a time into memory, until the entire file is read.
Assuming my understanding of the previous point is right, what would be the ideal buffer size for such a large text file? Would int.MaxValue be a bad idea? On a 64-bit PC, int.MaxValue is 2147483647. I presume the buffer size is in bytes, which evaluates to about 2 GB. This itself could consume time. I was looking at something like 100 MB - 300 MB as the buffer size.
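(For reference, the StreamReader overload being referred to would presumably be something like:)

// detectEncodingFromByteOrderMarks: true; bufferSize is in bytes.
var sr = new StreamReader(filePath, Encoding.UTF8, true, 100 * 1024 * 1024);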
It is going to read a line at a time (of the input file), which could be 10 bytes, or could be all 50 GB. So it comes down to: how is the input file structured? And if the input JSON has newlines other than cleanly at the breaks between objects, this could get really ugly.
The buffer size might impact how much it reads while looking for the end of each line, but ultimately: it needs to find a new-line each time (at least, how it is written currently).
I think you should first compare different parsers before worrying about details such as the buffer size.
The differences between DataContractJsonSerializer, Raven JSON or Newtonsoft JSON will be quite significant.
So your main issue with this is: where are your boundaries? Given that your doc is a JSON doc, it seems likely that your boundaries are classes. I assume (or hope) that you don't have one honking great class that is 50 GB large. I also assume that you don't really need all those classes in memory, but that you may need to search the whole thing for your subset... does that sound roughly right? If so, I think your pseudo-code is something like:
using a Json parser that accepts a streamreader (newtonsoft?)
read and parse until eof
yield return your parsed class that matches criteria
read and parse next class
end
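For example, with Json.NET (Newtonsoft), something along these lines; this is a sketch that assumes the file is one top-level JSON array of objects, which may not match the actual layout:

using System.Collections.Generic;
using System.IO;
using Newtonsoft.Json;

internal static IEnumerable<T> ReadLargeJson<T>(string filePath)
{
    var serializer = new JsonSerializer();
    using (var sr = new StreamReader(filePath))
    using (var reader = new JsonTextReader(sr))
    {
        while (reader.Read())
        {
            // Deserialize one object at a time as the reader streams
            // through the array; the whole 50 GB is never held in memory.
            if (reader.TokenType == JsonToken.StartObject)
                yield return serializer.Deserialize<T>(reader);
        }
    }
}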
I want to convert an image file to a string. The following works:
MemoryStream ms = new MemoryStream();
Image1.Save(ms, ImageFormat.Jpeg);
byte[] picture = ms.ToArray();
string formattedPic = Convert.ToBase64String(picture);
However, when saving this to an XmlWriter, it takes ages before it's saved (20 seconds for a 26 KB image file). Is there a way to speed this action up?
Thanks,
Raks
There are three points where you are doing large operations needlessly:
Getting the stream's bytes
Converting it to Base64
Writing it to the XmlWriter.
Instead: first call Length and GetBuffer. This lets you operate upon the stream's buffer directly. (Do flush it first, though.)
Then, implement base-64 yourself. It's relatively simple: you take groups of 3 bytes, do some bit-twiddling to get the indices of the characters they'll be converted to, and then output those characters. At the very end you add some = symbols according to how many bytes were in the last block sent (== for one remainder byte, = for two remainder bytes, and none if there were no partial blocks).
Do this writing into a char buffer (a char[]). The most efficient size is a matter for experimentation, but I'd start with 2048 characters. When you've filled the buffer, call XmlWriter.WriteRaw on it, and then start writing back at index 0 again.
This way, you're doing fewer allocations, and you get started on the output from the moment you've got your image loaded into the memory stream. Generally, this should result in better throughput.
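For comparison, note that XmlWriter already has a chunk-friendly base64 API built in. A minimal sketch of the same idea using it (the writer variable and element name are assumed):

var ms = new MemoryStream();
Image1.Save(ms, ImageFormat.Jpeg);

// WriteBase64 encodes straight from the byte buffer into the writer, so
// no intermediate Base64 string is ever allocated; using GetBuffer() with
// the stream's Length also avoids the ToArray() copy.
xmlWriter.WriteStartElement("Picture");
xmlWriter.WriteBase64(ms.GetBuffer(), 0, (int)ms.Length);
xmlWriter.WriteEndElement();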
I've run into what I believe is an issue with the BinaryReader.ReadChars() method. When I wrap a BinaryReader around a raw socket NetworkStream occasionally I get a stream corruption where the stream being read gets out of sync. The stream in question contains messages in a binary serialisation protocol.
I've tracked this down to the following
It only happens when reading a Unicode string (encoded using Encoding.BigEndianUnicode)
It only happens when the string in question is split across two TCP packets (confirmed using Wireshark)
I think what is happening is the following (in the context of the example below)
BinaryReader.ReadChars() is called asking it to read 3 characters (string lengths are encoded before the string itself)
First loop internally requests a read of 6 bytes (3 remaining characters * 2 bytes/char) off the network stream
Network stream only has 3 bytes available
3 bytes read into local buffer
Buffer handed to Decoder
Decoder decodes 1 char, and keeps the other byte in its own internal buffer
Second loop internally requests a read of 4 bytes! (2 remaining characters * 2 bytes/char)
Network stream has all 4 bytes available
4 bytes read into local buffer
Buffer handed to Decoder
Decoder decodes 2 chars, and keeps the remaining 4th byte internally
String decode is complete
Serialisation code attempts to unmarshal the next item and croaks because of stream corruption.
char[] buffer = new char[3];
int charIndex = 0;
Decoder decoder = Encoding.BigEndianUnicode.GetDecoder();
// pretend 3 of the 6 bytes arrives in one packet
byte[] b1 = new byte[] { 0, 83, 0 };
int charsRead = decoder.GetChars(b1, 0, 3, buffer, charIndex);
charIndex += charsRead;
// pretend the remaining 3 bytes plus a final byte, for something unrelated,
// arrive next
byte[] b2 = new byte[] { 71, 0, 114, 3 };
charsRead = decoder.GetChars(b2, 0, 4, buffer, charIndex);
charIndex += charsRead;
I think the root is a bug in the .NET code which uses charsRemaining * bytes/char each loop to calculate the remaining bytes required. Because of the extra byte hidden in the Decoder this calculation can be off by one causing an extra byte to be consumed off the input stream.
Here's the .NET framework code in question
while (charsRemaining > 0) {
    // We really want to know what the minimum number of bytes per char
    // is for our encoding. Otherwise for UnicodeEncoding we'd have to
    // do ~1+log(n) reads to read n characters.
    numBytes = charsRemaining;
    if (m_2BytesPerChar)
        numBytes <<= 1;

    numBytes = m_stream.Read(m_charBytes, 0, numBytes);
    if (numBytes == 0) {
        return (count - charsRemaining);
    }

    charsRead = m_decoder.GetChars(m_charBytes, 0, numBytes, buffer, index);

    charsRemaining -= charsRead;
    index += charsRead;
}
I'm not entirely sure if this is a bug or just a misuse of the API. To work around this issue, I'm just calculating the bytes required myself, reading them, and then running the byte[] through the relevant Encoding.GetString(). However, this wouldn't work for something like UTF-8.
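For what it's worth, a hedged sketch of that workaround for the fixed-width UTF-16 case (the method name is illustrative):

// Read exactly charCount UTF-16 code units by computing the byte count up
// front; this works because UTF-16 is a fixed 2 bytes per char, unlike
// UTF-8, where the byte count can't be derived from the char count.
static string ReadUtf16BEString(BinaryReader reader, int charCount)
{
    byte[] raw = reader.ReadBytes(charCount * 2);
    if (raw.Length != charCount * 2)
        throw new EndOfStreamException();
    return Encoding.BigEndianUnicode.GetString(raw);
}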
Be interested to hear people's thoughts on this and whether I'm doing something wrong or not. And maybe it will save the next person a few hours/days of tedious debugging.
EDIT: posted to Microsoft Connect (Connect tracking item)
I have reproduced the problem you mentioned with BinaryReader.ReadChars.
Although the developer always needs to account for lookahead when composing things like streams and decoders, this seems like a fairly significant bug in BinaryReader because that class is intended for reading data structures composed of various types of data. In this case, I agree that ReadChars should have been more conservative in what it read to avoid losing that byte.
There is nothing wrong with your workaround of using the Decoder directly, after all that is what ReadChars does behind the scenes.
Unicode is a simple case. If you think about an arbitrary encoding, there really is no general purpose way to ensure that the correct number of bytes are consumed when you pass in a character count instead of a byte count (think about varying length characters and cases involving malformed input). For this reason, avoiding BinaryReader.ReadChars in favor of reading the specific number of bytes provides a more robust, general solution.
I would suggest that you bring this to Microsoft's attention via http://connect.microsoft.com/visualstudio.
Interesting; you could report this on "connect". As a stop-gap, you could also try wrapping with BufferedStream, but I expect this is papering over a crack (it may still happen, but less frequently).
The other approach, of course, is to pre-buffer an entire message (but not the entire stream); then read from something like MemoryStream - assuming your network protocol has logical (and ideally length-prefixed, and not too big) messages. Then when it is decoding all the data is available.
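A sketch of that pre-buffering approach, assuming a hypothetical 4-byte little-endian length prefix per message:

// Read one complete length-prefixed message into memory, then decode it
// from a MemoryStream where every byte is already available, so the
// Decoder can never run ahead of the data.
static BinaryReader ReadMessage(Stream network)
{
    byte[] prefix = ReadExactly(network, 4);
    int length = BitConverter.ToInt32(prefix, 0);
    byte[] body = ReadExactly(network, length);
    return new BinaryReader(new MemoryStream(body), Encoding.BigEndianUnicode);
}

// Loop until 'count' bytes have arrived; Stream.Read may return fewer.
static byte[] ReadExactly(Stream stream, int count)
{
    byte[] buffer = new byte[count];
    int offset = 0;
    while (offset < count)
    {
        int read = stream.Read(buffer, offset, count - offset);
        if (read == 0)
            throw new EndOfStreamException();
        offset += read;
    }
    return buffer;
}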
This reminds me of one of my own questions (Reading from a HttpResponseStream fails), where I had an issue that, when reading from an HTTP response stream, the StreamReader would think it had hit the end of the stream prematurely, so my parsers would bomb out unexpectedly.
Like Marc suggested for your problem, I first tried pre-buffering in a MemoryStream, which works well but means you may have to wait a long time if you have a large file to read (especially from the network/web) before you can do anything useful with it. I eventually settled on creating my own extension of TextReader which overrides the Read methods and defines them using the ReadBlock method (which does a blocking read, i.e. it waits until it can get exactly the number of characters you ask for).
Your problem is probably due, like mine, to the fact that Read methods aren't guaranteed to return the number of characters you ask for. For example, if you look at the documentation for the BinaryReader.Read (http://msdn.microsoft.com/en-us/library/ms143295.aspx) method, you'll see that it states:
Return Value
Type: System.Int32
The number of characters read into buffer. This might be less than the number of bytes requested if that many bytes are not available, or it might be zero if the end of the stream is reached.
Since BinaryReader has no ReadBlock method like a TextReader, all you can do is take your own approach of monitoring the position yourself, or Marc's approach of pre-caching.
I'm working with Unity3D/Mono at the moment, and the ReadChars method might contain even more errors. I made a string like this:
mat.name = new string(binaryReader.ReadChars(64));
mat.name even contained the correct string, but I could only prepend strings to it; everything appended after the string just disappeared, even with String.Format. My solution so far is not to use the ReadChars method, but to read the data as a byte array and convert it to a string:
byte[] str = binaryReader.ReadBytes(64);
int lengthOfStr = Array.IndexOf(str, (byte)0); // e.g. 4 for "clip\0"
mat.name = System.Text.ASCIIEncoding.Default.GetString(str, 0, lengthOfStr);