Why do we use the flush parameter with the Encoder.GetBytes method? - C#

This link explains the Encoder.GetBytes method, including a bool parameter called flush. The explanation of flush is:
true if this encoder can flush its
state at the end of the conversion;
otherwise, false. To ensure correct
termination of a sequence of blocks of
encoded bytes, the last call to
GetBytes can specify a value of true
for flush.
But I didn't understand what flush does; maybe I am drunk or something :). Can you explain it in more detail, please?

Suppose you receive data over a socket connection. You will receive a long text as several byte[] blocks.
It is possible that one Unicode character occupies two or more bytes in a UTF-8 stream and that it is split across two byte blocks. Decoding the two byte blocks separately (and concatenating the strings) would then corrupt that character.
So you can only specify flush=true on the last block. And of course, if you only have one block, then that is also the last.
Tip: Use a TextReader (for example a StreamReader) and let it handle these problems for you.
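A minimal sketch of the decoding case described above, assuming a two-byte UTF-8 character ("é", bytes C3 A9) that happens to be split across two blocks. The block contents and variable names are made up for illustration:

```csharp
using System;
using System.Text;

// "aé" in UTF-8 is 61 C3 A9; pretend the socket delivered it split inside the 'é'.
byte[] block1 = { 0x61, 0xC3 };
byte[] block2 = { 0xA9 };

// Naive: each block is decoded on its own, so the split character is lost.
string naive = Encoding.UTF8.GetString(block1) + Encoding.UTF8.GetString(block2);

// Stateful: one Decoder instance carries the incomplete sequence between calls.
Decoder decoder = Encoding.UTF8.GetDecoder();
char[] chars = new char[8];
int n = decoder.GetChars(block1, 0, block1.Length, chars, 0, flush: false);
n += decoder.GetChars(block2, 0, block2.Length, chars, n, flush: true); // last block
string stateful = new string(chars, 0, n);

Console.WriteLine(naive == "a\u00E9");    // False
Console.WriteLine(stateful == "a\u00E9"); // True
```

Only the final call passes flush: true, because only then is it safe to treat leftover bytes as a broken sequence.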
Edit
The mirror problem (that was actually asked: GetBytes) is slightly harder to explain.
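Using flush=true has much the same effect as calling Encoder.Reset() after GetBytes(...): it clears the 'state' of the encoder,
including trailing characters held over from the end of the previous data block, such as an unmatched high surrogate. The difference is that with flush=true such a pending character is written out (via the encoder's fallback) before the state is cleared, rather than silently discarded.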
The basic idea is the same: when converting from string to blocks of bytes, or vice versa, the blocks are not independent.
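To make the encoding direction concrete, here is a small sketch assuming a surrogate pair (U+1F600, UTF-16 D83D DE00) that straddles two char blocks. EncodeBlocks is a hypothetical helper written for this example, not part of the framework:

```csharp
using System;
using System.Linq;
using System.Text;

// The surrogate pair D83D DE00 is split across the two blocks.
char[] block1 = { 'a', '\uD83D' };
char[] block2 = { '\uDE00', 'b' };
byte[] expected = Encoding.UTF8.GetBytes("a\uD83D\uDE00b");

// Hypothetical helper: encodes both blocks with one Encoder instance.
byte[] EncodeBlocks(bool flushFirstBlock)
{
    Encoder encoder = Encoding.UTF8.GetEncoder();
    byte[] output = new byte[16];
    int n = encoder.GetBytes(block1, 0, block1.Length, output, 0, flushFirstBlock);
    n += encoder.GetBytes(block2, 0, block2.Length, output, n, true); // last block
    return output.Take(n).ToArray();
}

byte[] correct = EncodeBlocks(flushFirstBlock: false); // high surrogate is held over
byte[] broken = EncodeBlocks(flushFirstBlock: true);   // high surrogate is terminated early

Console.WriteLine(correct.SequenceEqual(expected)); // True
Console.WriteLine(broken.SequenceEqual(expected));  // False
```

With flush=false on the first block, the encoder remembers the high surrogate and combines it with the low surrogate that starts the next block. Flushing too early forces it to treat each half as an invalid lone surrogate.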

Internally the Encoder would be implemented with a buffer - this buffer may need to be flushed (cleared) in order to end the read correctly or prepare the Encoder for the next read.
Here is one explanation of buffer flushing.
The exact usage of the flush parameter is described here:
true to clear the internal state of the encoder after the conversion; otherwise, false.

Flushing will reset the internal state of the encoder instance used to encode the text into bytes. Why does it need internal state, you ask? Well, to quote MSDN:
The flush parameter is useful for flushing a high-surrogate at the end of a stream
that does not have a low-surrogate. For example, the Encoder created by
UTF8Encoding.GetEncoder uses this parameter to determine whether to write out a
dangling high-surrogate at the end of a character block.
Hence, if you're making multiple GetBytes() calls, you want to flush the internal state on the last call to terminate any character sequences that need terminating, but only on the last call, since terminating sequences might otherwise be introduced in the middle of words.
Note that this may be a purely theoretical problem these days. And, you'd be better off using higher-level wrappers anyway. If you do, being drunk will not be a problem.

Related

Reading binary data in c# / ReadAsync not reading everything specified in the count

I read this blog post, https://jonskeet.uk/csharp/readbinary.html, which says:
FileStream could be reading just the first 10 bytes of the file into the buffer. The Read method is only guaranteed to block until some data is available (or the end of the stream is reached)
Do you know if the same consideration applies to ReadAsync?
Also, do you know in which cases this condition will be hit, where Read / ReadAsync does not read everything specified in the count?
When there is less data available than specified.
The most trivial case is when you hit the end of the file.
As MSDN puts it:
The result value can be less than the number of bytes requested if the number of bytes currently available is less than the requested number, or it can be 0 (zero) if the end of the stream has been reached.
It is also common when reading from TCP/IP buffers, although that usually wouldn't be a FileStream. Various other types derive from Stream and share the same methods. See: System.IO.Stream
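Since a single ReadAsync call may return fewer bytes than requested, the usual pattern is to loop until the count is satisfied or the stream ends. A minimal sketch follows; ReadFullyAsync is a hypothetical helper name, and the MemoryStream merely stands in for a real file or network stream:

```csharp
using System;
using System.IO;
using System.Threading.Tasks;

// Hypothetical helper: keeps calling ReadAsync until `count` bytes have arrived
// or the stream ends. Returns the number of bytes actually read.
static async Task<int> ReadFullyAsync(Stream stream, byte[] buffer, int offset, int count)
{
    int total = 0;
    while (total < count)
    {
        int read = await stream.ReadAsync(buffer, offset + total, count - total);
        if (read == 0) break; // 0 means end of stream, not "try again"
        total += read;
    }
    return total;
}

var source = new MemoryStream(new byte[] { 1, 2, 3, 4, 5 });
var buffer = new byte[8];
int got = await ReadFullyAsync(source, buffer, 0, buffer.Length);
Console.WriteLine(got); // 5: the stream ended before 8 bytes were available
```

On .NET 7 and later, Stream.ReadExactlyAsync and Stream.ReadAtLeastAsync cover the same need without a hand-written loop.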

Replacing a string within a stream in C# (without overwriting the original file)

I have a file that I'm opening into a stream and passing to another method. However, I'd like to replace a string in the file before passing the stream to the other method. So:
string path = "C:/...";
Stream s = new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.Read);
//need to replace all occurrences of "John" in the file to "Jack" here.
CallMethod(s);
The original file should not be modified, only the stream. What would be the easiest way to do this?
Thanks...
It's a lot easier if you just read the file in as lines and deal with those, instead of forcing yourself to stick with a Stream. A stream deals in raw bytes for both text and binary files and may hand you the data in arbitrarily small pieces, which makes such a replacement very hard. If you read in a whole line at a time (so long as you don't have multi-line replacement) it's quite easy.
// requires: using System.IO; using System.Linq;
var lines = File.ReadLines(path)
    .Select(line => line.Replace("John", "Jack"));
Note that ReadLines still does stream the data, and Select doesn't need to materialize the whole thing, so you're still not reading the whole file into memory at one time when doing this.
If you don't actually need to stream the data you can easily just load it all as one big string, do the replace, and then create a stream based on that one string:
string data = File.ReadAllText(path)
    .Replace("John", "Jack");
// UTF8 rather than ASCII, so that non-ASCII characters in the file survive the round trip.
byte[] bytes = Encoding.UTF8.GetBytes(data);
Stream s = new MemoryStream(bytes);
This question probably has many good answers. I'll offer one that I've used and that has always worked for me and my peers.
I suggest you create a separate stream, say a MemoryStream. Read from your FileStream and write into the memory one. You can then extract strings from either, replace what you need, and pass the memory stream along. That makes doubly sure that you are not interfering with the original stream, and you can still read the original values from it whenever you need to, though this method uses roughly twice as much memory.
If the file has extremely long lines, if the string being replaced may contain a newline, or if other constraints prevent the use of File.ReadLines() while still requiring streaming, there is an alternative solution using streams only, though it is not trivial.
Implement your own stream decorator (wrapper) that performs the replacement. I.e. a class based on Stream that takes another stream in its constructor, reads data from the stream in its Read(byte[], int, int) override and performs the replacement in the buffer. See notes to Stream implementers for further requirements and suggestions.
Let's call the string being replaced "needle", the source stream "haystack" and the replacement string "replacement".
Needle and replacement need to be encoded using the encoding of the haystack contents (typically Encoding.UTF8.GetBytes()). Inside streams, the data is not converted to string, unlike in StreamReader.ReadLine(). Thus unnecessary memory allocation is prevented.
Simple cases: If both needle and replacement are just a single byte, the implementation is just a simple loop over the buffer, replacing all occurrences. If needle is a single byte and replacement is empty (i.e. deleting the byte, e.g. deleting carriage return for line ending normalization), it is a simple loop maintaining from and to indexes to the buffer, rewriting the buffer byte by byte.
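The single-byte deletion case can be sketched as a standalone loop over a buffer. RemoveByte is a hypothetical helper; in a real stream decorator this logic would sit inside the Read override and run on the bytes just read from the underlying stream:

```csharp
using System;

// Hypothetical helper: deletes every occurrence of `needle` from the first
// `count` bytes of `buffer`, compacting in place. Returns the new length.
static int RemoveByte(byte[] buffer, int count, byte needle)
{
    int to = 0;
    for (int from = 0; from < count; from++)
    {
        if (buffer[from] != needle)
            buffer[to++] = buffer[from]; // keep this byte, shifted left if needed
    }
    return to;
}

// "a\r\nb\r\n" with the carriage returns removed becomes "a\nb\n".
byte[] data = { (byte)'a', (byte)'\r', (byte)'\n', (byte)'b', (byte)'\r', (byte)'\n' };
int kept = RemoveByte(data, data.Length, (byte)'\r');
Console.WriteLine(kept); // 4
```

Because the write index never overtakes the read index, the rewrite is safe to do in the same buffer.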
In more complex cases, implement the KMP algorithm to perform the replacement.
Read the data from the underlying stream (haystack) to an internal buffer that is at least as long as needle and perform the replacement while rewriting the data to the output buffer. The internal buffer is needed so that data from a partial match are not published before a complete match is detected -- then, it would be too late to go back and delete the match completely.
Process the internal buffer byte by byte, feeding each byte into the KMP automaton. With each automaton update, write the bytes it releases to the appropriate position in output buffer.
When a match is detected by KMP, replace it: reset the automaton keeping the position in the internal buffer (which deletes the match) and write the replacement in the output buffer.
When the end of either buffer is reached, keep the unwritten output and the unprocessed part of the internal buffer, including the current partial match, as the starting point for the next call to the method, and return the current output buffer. The next call to the method writes the remaining output and resumes processing haystack where the current call stopped.
When the end of haystack is reached, release the current partial match and write it to the output buffer.
Just be careful not to return zero bytes from Read before all the data of haystack has been processed: a return value of 0 signals end of stream to the caller and would therefore truncate the data.

How to read bytes from SerialPort.BaseStream without Length

I want to use the Stream class to read/write data to/from a serial port. I use BaseStream to get the stream (link below), but the Length property doesn't work. Does anyone know how I can read the full buffer without knowing how many bytes there are?
http://msdn.microsoft.com/en-us/library/system.io.ports.serialport.basestream.aspx
You can't. That is, you can't guarantee that you've received everything if all you have is the BaseStream.
There are two ways you can know if you've received everything:
Send a length word as the first 2 or 4 bytes of the packet. That says how many bytes will follow. Your reader then reads that length word, reads that many bytes, and knows it's done.
Agree on a record separator. That works great for text. For example, you might decide that a null byte or an end-of-line character signals the end of the data. This is somewhat more difficult to do with arbitrary binary data, but possible. See comment.
Or, depending on your application, you can do some kind of timing. That is, if you haven't received anything new for X number of seconds (or milliseconds?), you assume that you've received everything. That has the obvious drawback of not working well if the sender is especially slow.
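The length-word approach can be sketched as follows, assuming a 2-byte big-endian length prefix (the byte order and width are choices made for this example; both ends simply have to agree). ReadExact and ReadPacket are hypothetical helpers, and a MemoryStream stands in for SerialPort.BaseStream:

```csharp
using System;
using System.IO;

// Hypothetical helper: reads exactly `count` bytes, looping over short reads,
// and throws if the stream ends first.
static byte[] ReadExact(Stream stream, int count)
{
    byte[] buffer = new byte[count];
    int total = 0;
    while (total < count)
    {
        int read = stream.Read(buffer, total, count - total);
        if (read == 0) throw new EndOfStreamException();
        total += read;
    }
    return buffer;
}

// Assumed framing: 2-byte big-endian length word, followed by that many payload bytes.
static byte[] ReadPacket(Stream stream)
{
    byte[] header = ReadExact(stream, 2);
    int length = (header[0] << 8) | header[1];
    return ReadExact(stream, length);
}

// Two packets back to back: a 3-byte payload and a 1-byte payload.
var wire = new MemoryStream(new byte[] { 0x00, 0x03, 0x10, 0x20, 0x30, 0x00, 0x01, 0x7F });
byte[] first = ReadPacket(wire);
byte[] second = ReadPacket(wire);
Console.WriteLine(first.Length);  // 3
Console.WriteLine(second.Length); // 1
```

The reader never needs Length on the stream: the header tells it how much to wait for, and the loop in ReadExact absorbs short reads.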
Maybe you can try the SerialPort.BytesToRead property, which reports how many bytes are currently waiting in the receive buffer.

Weird behavior with a BinaryReader

I have a socket-based application that exposes received data with a BinaryReader object on the client side. I've been trying to debug an issue where the data contained in the reader is not clean... i.e. the buffer that I'm reading contains old data past the size of the new data.
In the code below:
System.Diagnostics.Debug.WriteLine("Stream length: {0}", _binaryReader.BaseStream.Length);
byte[] buffer = _binaryReader.ReadBytes((int)_binaryReader.BaseStream.Length);
When I comment out the first line, the data doesn't end up being dirty (or, doesn't end up being dirty as regularly) as when I have that print line statement. As far as I can tell, from the server side the data is coming in cleanly, so it's possible that my socket implementation has some issues. But does anyone have any idea why adding that print line would cause the data to be dirty more often?
Your binary reader looks like it is a private member variable (if the leading underscore is a telltale sign).
Is your application multithreaded? You could be experiencing a race condition if another thread is also attempting to use your _binaryReader while you are reading from it. The fact that you experience issues even without that line seems quite suspect to me.
Are you sure that your reading logic is correct? Stream.Length indicates the length of the entire stream, not of the remaining data to be read.
Suppose that, initially, 100 bytes were available. Length is 100, and BinaryReader correctly reads 100 bytes and advances the stream position by 100. Then, another 20 bytes arrive. Length is now 120; however, your BinaryReader should only be reading 20 bytes (Length minus Position), not 120. The ‘extra’ 100 bytes requested in the second read would either cause it to block or (if the stream is not implemented correctly) break.
The problem was silly and unrelated. I believe my reading logic above is correct, however. The issue was that the _binaryReader I was using was a reference that was not owned by my class and hence the underlying stream was being rewritten with bad data.

How many bits does BinaryReader.PeekChar() read?

I am working on improving a stream reader class that uses a BinaryReader. It consists of a while loop that uses .PeekChar() to check if more data exists to continue processing.
The very first operation is a .ReadInt32() which reads 4 bytes. What if PeekChar only "saw" one byte (or one bit)? This doesn't seem like a reliable way of checking for EOF.
The BinaryReader is constructed using its default parameters, which as I understand it, uses UTF8 as the default encoding. I assume that .PeekChar() checks for 8 bits but I really am not sure.
How many bits does .PeekChar() look for? (And what are some alternative methods of checking for EOF?)
Here BinaryReader.PeekChar
I read:
ArgumentException: The current character cannot be decoded into the
internal character buffer by using the Encoding selected for the
stream.
This makes clear that the number of bytes read depends on the Encoding applied to that stream.
EDIT
Actually definition according to MSDN is:
Returns the next available character and does not advance the
byte or character position.
In fact, whether this is one byte or more depends on the encoding.
Hope this helps.
Making your Read*() calls blindly and handling any exceptions that are thrown is the normal method. I don't believe that the stream position is moved if anything goes wrong.
The PeekChar() method of BinaryReader is very buggy. Even when trying to read from a memory stream with UTF8 encoded data, PeekChar() throws an exception after reading a particular length of the stream. The BCL team has acknowledged the issue, but they have not committed to resolving it. Their only response is to avoid using PeekChar() if you can.
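One alternative to PeekChar() for detecting the end of the data, assuming the underlying stream is seekable (a file or MemoryStream, not a network stream), is to compare Position with Length. A minimal sketch with made-up data:

```csharp
using System;
using System.IO;
using System.Text;

// Build a small seekable stream holding three 4-byte integers.
var stream = new MemoryStream();
using (var writer = new BinaryWriter(stream, Encoding.UTF8, leaveOpen: true))
{
    writer.Write(10);
    writer.Write(20);
    writer.Write(30);
}
stream.Position = 0;

int sum = 0;
using (var reader = new BinaryReader(stream))
{
    // No character decoding is involved, so the encoding cannot interfere.
    while (reader.BaseStream.Position < reader.BaseStream.Length)
        sum += reader.ReadInt32();
}
Console.WriteLine(sum); // 60
```

For non-seekable streams this check is not available, and catching EndOfStreamException from the Read*() call remains the fallback.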
