Code Contracts Static Analysis: Prover Limitations? - c#

I've been playing with Code Contracts and I really like what I've seen so far. They encourage me to evaluate and explicitly declare my assumptions, which has already helped me to identify a few corner cases I hadn't considered in the code to which I'm adding contracts. Right now I'm playing with trying to enforce more sophisticated invariants. I have one case that currently fails proving and I'm curious if there is a way I can fix this besides simply adding Contract.Assume calls. Here is the class in question, stripped down for ease of reading:
public abstract class MemoryEncoder
{
private const int CapacityDelta = 16;
private int _currentByte;
/// <summary>
/// The current byte index in the encoding stream.
/// This should not need to be modified, under typical usage,
/// but can be used to randomly access the encoding region.
/// </summary>
public int CurrentByte
{
get
{
Contract.Ensures(Contract.Result<int>() >= 0);
Contract.Ensures(Contract.Result<int>() <= Length);
return _currentByte;
}
set
{
Contract.Requires(value >= 0);
Contract.Requires(value <= Length);
_currentByte = value;
}
}
/// <summary>
/// Current number of bytes encoded in the buffer.
/// This may be less than the size of the buffer (capacity).
/// </summary>
public int Length { get; private set; }
/// <summary>
/// The raw buffer encapsulated by the encoder.
/// </summary>
protected internal Byte[] Buffer { get; private set; }
/// <summary>
/// Reserve space in the encoder buffer for the specified number of new bytes
/// </summary>
/// <param name="bytesRequired">The number of bytes required</param>
protected void ReserveSpace(int bytesRequired)
{
Contract.Requires(bytesRequired > 0);
Contract.Ensures((Length - CurrentByte) >= bytesRequired);
//Check if these bytes would overflow the current buffer
if ((CurrentByte + bytesRequired) > Buffer.Length)
{
//Create a new buffer with at least enough space for the additional bytes required
var newBuffer = new Byte[Buffer.Length + Math.Max(bytesRequired, CapacityDelta)];
//Copy the contents of the previous buffer and replace the original buffer reference
Buffer.CopyTo(newBuffer, 0);
Buffer = newBuffer;
}
//Check if the total length of written bytes has increased
if ((CurrentByte + bytesRequired) > Length)
{
Length = CurrentByte + bytesRequired;
}
}
[ContractInvariantMethod]
private void GlobalRules()
{
Contract.Invariant(Buffer != null);
Contract.Invariant(Length <= Buffer.Length);
Contract.Invariant(CurrentByte >= 0);
Contract.Invariant(CurrentByte <= Length);
}
}
I'm interested in how I can structure the Contract calls in ReserveSpace so that the class invariants are provable. In particular, it complains about (Length <= Buffer.Length) and (CurrentByte <= Length). It's reasonable to me that it can't see that (Length <= Buffer.Length) is satisfied, since it's creating a new buffer and reassigning the reference. Is my only option to add an Assume that the invariants are satisfied?

After fighting with this for a while, I came up with this provable solution (constructor is a dummy to allow for isolated testing):
public abstract class MemoryEncoder
{
private const int CapacityDelta = 16;
private byte[] _buffer;
private int _currentByte;
private int _length;
protected MemoryEncoder()
{
Buffer = new byte[500];
Length = 0;
CurrentByte = 0;
}
/// <summary>
/// The current byte index in the encoding stream.
/// This should not need to be modified, under typical usage,
/// but can be used to randomly access the encoding region.
/// </summary>
public int CurrentByte
{
get
{
return _currentByte;
}
set
{
Contract.Requires(value >= 0);
Contract.Requires(value <= Length);
_currentByte = value;
}
}
/// <summary>
/// Current number of bytes encoded in the buffer.
/// This may be less than the size of the buffer (capacity).
/// </summary>
public int Length
{
get { return _length; }
private set
{
Contract.Requires(value >= 0);
Contract.Requires(value <= _buffer.Length);
Contract.Requires(value >= CurrentByte);
Contract.Ensures(_length <= _buffer.Length);
_length = value;
}
}
/// <summary>
/// The raw buffer encapsulated by the encoder.
/// </summary>
protected internal Byte[] Buffer
{
get { return _buffer; }
private set
{
Contract.Requires(value != null);
Contract.Requires(value.Length >= _length);
_buffer = value;
}
}
/// <summary>
/// Reserve space in the encoder buffer for the specified number of new bytes
/// </summary>
/// <param name="bytesRequired">The number of bytes required</param>
protected void ReserveSpace(int bytesRequired)
{
Contract.Requires(bytesRequired > 0);
Contract.Ensures((Length - CurrentByte) >= bytesRequired);
//Check if these bytes would overflow the current buffer
if ((CurrentByte + bytesRequired) > Buffer.Length)
{
//Create a new buffer with at least enough space for the additional bytes required
var newBuffer = new Byte[Buffer.Length + Math.Max(bytesRequired, CapacityDelta)];
//Copy the contents of the previous buffer and replace the original buffer reference
Buffer.CopyTo(newBuffer, 0);
Buffer = newBuffer;
}
//Check if the total length of written bytes has increased
if ((CurrentByte + bytesRequired) > Length)
{
Contract.Assume(CurrentByte + bytesRequired <= _buffer.Length);
Length = CurrentByte + bytesRequired;
}
}
[ContractInvariantMethod]
private void GlobalRules()
{
Contract.Invariant(_buffer != null);
Contract.Invariant(_length <= _buffer.Length);
Contract.Invariant(_currentByte >= 0);
Contract.Invariant(_currentByte <= _length);
}
}
The main thing I noticed is that invariants stated on properties get messy, while invariants stated on the backing fields seem much easier for the prover to discharge. It was also important to place appropriate contractual obligations in the property accessors. I'll have to keep experimenting to see what works and what doesn't. It's an interesting system, but I'd definitely like to know more if anybody has a good 'cheat sheet' on how the prover works.

Related

MultipartFormData File Uploading out of memory exception

I am using this code for uploading a file :
https://gist.github.com/bgrins/1789787
But when I try to use this code to upload a 2 GB file, I get an OutOfMemoryException. The cause is this line:
https://gist.github.com/bgrins/1789787#file-gistfile1-cs-L75
How can I fix this?
Read the giant file piece by piece and upload the pieces one at a time. You could also provide a progress bar this way.
To read and upload the file piece by piece, see: How to read a big file piece by piece in C#
On the server side, append each new piece to a file: C# Append byte array to existing file
You can flesh out the code from this idea. I did it once last year, but cannot share the code.
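A minimal sketch of the piece-by-piece read described above. The class name ChunkedFileReader, the method ReadChunks, and the 64 KB default chunk size are illustrative assumptions, not part of the linked gist; the upload of each piece is left to the caller.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper: names and chunk size are assumptions for illustration.
public static class ChunkedFileReader
{
    // Lazily yields the file's contents in pieces of at most chunkSize bytes,
    // so only one piece has to be held in memory at a time.
    public static IEnumerable<byte[]> ReadChunks(string path, int chunkSize = 64 * 1024)
    {
        if (chunkSize <= 0)
            throw new ArgumentOutOfRangeException(nameof(chunkSize));

        using (var stream = new FileStream(path, FileMode.Open, FileAccess.Read))
        {
            var buffer = new byte[chunkSize];
            int bytesRead;
            while ((bytesRead = stream.Read(buffer, 0, buffer.Length)) > 0)
            {
                // Copy out only the bytes actually read; a read may return
                // fewer bytes than requested.
                var chunk = new byte[bytesRead];
                Buffer.BlockCopy(buffer, 0, chunk, 0, bytesRead);
                yield return chunk;
            }
        }
    }
}
```

A caller would iterate with foreach, send each piece to the server, and update a progress indicator from the running byte total.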
There is more than one solution:
1- Writing to the RequestStream directly instead of writing to a MemoryStream:
https://blogs.msdn.microsoft.com/johan/2006/11/15/are-you-getting-outofmemoryexceptions-when-uploading-large-files/
public static string MyUploader(string strFileToUpload, string strUrl)
{
string strFileFormName = "file";
Uri oUri = new Uri(strUrl);
string strBoundary = "----------" + DateTime.Now.Ticks.ToString("x");
// The trailing (closing) boundary string; a multipart body ends with "--boundary--"
byte[] boundaryBytes = Encoding.ASCII.GetBytes("\r\n--" + strBoundary + "--\r\n");
// The post message header
StringBuilder sb = new StringBuilder();
sb.Append("--");
sb.Append(strBoundary);
sb.Append("\r\n");
sb.Append("Content-Disposition: form-data; name=\"");
sb.Append(strFileFormName);
sb.Append("\"; filename=\"");
sb.Append(Path.GetFileName(strFileToUpload));
sb.Append("\"");
sb.Append("\r\n");
sb.Append("Content-Type: ");
sb.Append("application/octet-stream");
sb.Append("\r\n");
sb.Append("\r\n");
string strPostHeader = sb.ToString();
byte[] postHeaderBytes = Encoding.UTF8.GetBytes(strPostHeader);
// The WebRequest
HttpWebRequest oWebrequest = (HttpWebRequest)WebRequest.Create(oUri);
oWebrequest.ContentType = "multipart/form-data; boundary=" + strBoundary;
oWebrequest.Method = "POST";
// This is important, otherwise the whole file will be read to memory anyway...
oWebrequest.AllowWriteStreamBuffering = false;
// Get a FileStream and set the final properties of the WebRequest
FileStream oFileStream = new FileStream(strFileToUpload, FileMode.Open, FileAccess.Read);
long length = postHeaderBytes.Length + oFileStream.Length + boundaryBytes.Length;
oWebrequest.ContentLength = length;
Stream oRequestStream = oWebrequest.GetRequestStream();
// Write the post header
oRequestStream.Write(postHeaderBytes, 0, postHeaderBytes.Length);
// Stream the file contents in small pieces (4096 bytes, max).
// Take the minimum as a long: casting a length above int.MaxValue to int inside checked() would throw.
byte[] buffer = new byte[(int)Math.Min(4096L, oFileStream.Length)];
int bytesRead = 0;
while ((bytesRead = oFileStream.Read(buffer, 0, buffer.Length)) != 0)
oRequestStream.Write(buffer, 0, bytesRead);
oFileStream.Close();
// Add the trailing boundary
oRequestStream.Write(boundaryBytes, 0, boundaryBytes.Length);
WebResponse oWResponse = oWebrequest.GetResponse();
Stream s = oWResponse.GetResponseStream();
StreamReader sr = new StreamReader(s);
String sReturnString = sr.ReadToEnd();
// Clean up
oFileStream.Close();
oRequestStream.Close();
s.Close();
sr.Close();
return sReturnString;
}
2- Using RecyclableMemoryStream instead of MemoryStream
You can read more about RecyclableMemoryStream here :
http://www.philosophicalgeek.com/2015/02/06/announcing-microsoft-io-recycablememorystream/
https://github.com/Microsoft/Microsoft.IO.RecyclableMemoryStream
3- Using MemoryTributary instead of MemoryStream
You can read more about MemoryTributary here :
https://www.codeproject.com/Articles/348590/A-replacement-for-MemoryStream?msg=5257615#xx5257615xx
using System;
using System.Collections.Generic;
using System.IO;
using System.Runtime.InteropServices;
namespace LiquidEngine.Tools
{
/// <summary>
/// MemoryTributary is a re-implementation of MemoryStream that uses a dynamic list of byte arrays as a backing store, instead of a single byte array, the allocation
/// of which will fail for relatively small streams as it requires contiguous memory.
/// </summary>
public class MemoryTributary : Stream /* http://msdn.microsoft.com/en-us/library/system.io.stream.aspx */
{
#region Constructors
public MemoryTributary()
{
Position = 0;
}
public MemoryTributary(byte[] source)
{
this.Write(source, 0, source.Length);
Position = 0;
}
/* length is ignored because capacity has no meaning unless we implement an artificial limit */
public MemoryTributary(int length)
{
SetLength(length);
Position = length;
byte[] d = block; //access block to prompt the allocation of memory
Position = 0;
}
#endregion
#region Status Properties
public override bool CanRead
{
get { return true; }
}
public override bool CanSeek
{
get { return true; }
}
public override bool CanWrite
{
get { return true; }
}
#endregion
#region Public Properties
public override long Length
{
get { return length; }
}
public override long Position { get; set; }
#endregion
#region Members
protected long length = 0;
protected long blockSize = 65536;
protected List<byte[]> blocks = new List<byte[]>();
#endregion
#region Internal Properties
/* Use these properties to gain access to the appropriate block of memory for the current Position */
/// <summary>
/// The block of memory currently addressed by Position
/// </summary>
protected byte[] block
{
get
{
while (blocks.Count <= blockId)
blocks.Add(new byte[blockSize]);
return blocks[(int)blockId];
}
}
/// <summary>
/// The id of the block currently addressed by Position
/// </summary>
protected long blockId
{
get { return Position / blockSize; }
}
/// <summary>
/// The offset of the byte currently addressed by Position, into the block that contains it
/// </summary>
protected long blockOffset
{
get { return Position % blockSize; }
}
#endregion
#region Public Stream Methods
public override void Flush()
{
}
public override int Read(byte[] buffer, int offset, int count)
{
long lcount = (long)count;
if (lcount < 0)
{
throw new ArgumentOutOfRangeException("count", lcount, "Number of bytes to copy cannot be negative.");
}
long remaining = (length - Position);
if (lcount > remaining)
lcount = remaining;
if (buffer == null)
{
throw new ArgumentNullException("buffer", "Buffer cannot be null.");
}
if (offset < 0)
{
throw new ArgumentOutOfRangeException("offset",offset,"Destination offset cannot be negative.");
}
int read = 0;
long copysize = 0;
do
{
copysize = Math.Min(lcount, (blockSize - blockOffset));
Buffer.BlockCopy(block, (int)blockOffset, buffer, offset, (int)copysize);
lcount -= copysize;
offset += (int)copysize;
read += (int)copysize;
Position += copysize;
} while (lcount > 0);
return read;
}
public override long Seek(long offset, SeekOrigin origin)
{
switch (origin)
{
case SeekOrigin.Begin:
Position = offset;
break;
case SeekOrigin.Current:
Position += offset;
break;
case SeekOrigin.End:
Position = Length + offset; // offset is relative to the end of the stream (normally zero or negative)
break;
}
return Position;
}
public override void SetLength(long value)
{
length = value;
}
public override void Write(byte[] buffer, int offset, int count)
{
long initialPosition = Position;
int copysize;
try
{
do
{
copysize = Math.Min(count, (int)(blockSize - blockOffset));
EnsureCapacity(Position + copysize);
Buffer.BlockCopy(buffer, (int)offset, block, (int)blockOffset, copysize);
count -= copysize;
offset += copysize;
Position += copysize;
} while (count > 0);
}
catch (Exception)
{
Position = initialPosition;
throw; // rethrow without resetting the stack trace
}
}
public override int ReadByte()
{
if (Position >= length)
return -1;
byte b = block[blockOffset];
Position++;
return b;
}
public override void WriteByte(byte value)
{
EnsureCapacity(Position + 1);
block[blockOffset] = value;
Position++;
}
protected void EnsureCapacity(long intended_length)
{
if (intended_length > length)
length = (intended_length);
}
#endregion
#region IDispose
/* http://msdn.microsoft.com/en-us/library/fs2xkftw.aspx */
protected override void Dispose(bool disposing)
{
/* We do not currently use unmanaged resources */
base.Dispose(disposing);
}
#endregion
#region Public Additional Helper Methods
/// <summary>
/// Returns the entire content of the stream as a byte array. This is not safe because the call to new byte[] may
/// fail if the stream is large enough. Where possible use methods which operate on streams directly instead.
/// </summary>
/// <returns>A byte[] containing the current data in the stream</returns>
public byte[] ToArray()
{
long firstposition = Position;
Position = 0;
byte[] destination = new byte[Length];
Read(destination, 0, (int)Length);
Position = firstposition;
return destination;
}
/// <summary>
/// Reads length bytes from source into the this instance at the current position.
/// </summary>
/// <param name="source">The stream containing the data to copy</param>
/// <param name="length">The number of bytes to copy</param>
public void ReadFrom(Stream source, long length)
{
byte[] buffer = new byte[4096];
int read;
do
{
read = source.Read(buffer, 0, (int)Math.Min(4096, length));
length -= read;
this.Write(buffer, 0, read);
} while (length > 0 && read > 0); // stop if the source ends before length bytes were read
}
/// <summary>
/// Writes the entire stream into destination, regardless of Position, which remains unchanged.
/// </summary>
/// <param name="destination">The stream to write the content of this stream to</param>
public void WriteTo(Stream destination)
{
long initialpos = Position;
Position = 0;
this.CopyTo(destination);
Position = initialpos;
}
#endregion
}
}

C# Text File Reading ObjectDisposedException: The object was used after being disposed

I want to read the last line of a text file. I'm using a solution that's suggested here:
How to efficiently read only last line of the text file
Using that library, I'm getting an error saying the stream is disposed. But I'm confused as I'm declaring the stream during every frame.
FileStream fileStream = new FileStream("C:\\Users\\LukasRoper\\Desktop\\Test.log", FileMode.Open, FileAccess.Read, FileShare.ReadWrite);
ReverseLineReader reverseLineReader = new ReverseLineReader(() => fileStream, Encoding.UTF8);
List<string> stringParts = new List<string>();
do
{
IEnumerable<string> line = reverseLineReader.Take(1);
string data = line.First();
stringParts = data.Split(',').ToList();
} while (stringParts.Count != 9);
I should explain that I'm trying to read from a file that another program is writing to at the same time, and I can't amend that program because it's third-party software. Can anybody explain why my FileStream becomes disposed?
The Reverse File Reader is here:
using System;
using System.Collections;
using System.Collections.Generic;
using System.IO;
using System.Text;
namespace MiscUtil.IO
{
/// <summary>
/// Takes an encoding (defaulting to UTF-8) and a function which produces a seekable stream
/// (or a filename for convenience) and yields lines from the end of the stream backwards.
/// Only single byte encodings, and UTF-8 and Unicode, are supported. The stream
/// returned by the function must be seekable.
/// </summary>
public sealed class ReverseLineReader : IEnumerable<string>
{
/// <summary>
/// Buffer size to use by default. Classes with internal access can specify
/// a different buffer size - this is useful for testing.
/// </summary>
private const int DefaultBufferSize = 4096;
/// <summary>
/// Means of creating a Stream to read from.
/// </summary>
private readonly Func<Stream> streamSource;
/// <summary>
/// Encoding to use when converting bytes to text
/// </summary>
private readonly Encoding encoding;
/// <summary>
/// Size of buffer (in bytes) to read each time we read from the
/// stream. This must be at least as big as the maximum number of
/// bytes for a single character.
/// </summary>
private readonly int bufferSize;
/// <summary>
/// Function which, when given a position within a file and a byte, states whether
/// or not the byte represents the start of a character.
/// </summary>
private Func<long,byte,bool> characterStartDetector;
/// <summary>
/// Creates a LineReader from a stream source. The delegate is only
/// called when the enumerator is fetched. UTF-8 is used to decode
/// the stream into text.
/// </summary>
/// <param name="streamSource">Data source</param>
public ReverseLineReader(Func<Stream> streamSource)
: this(streamSource, Encoding.UTF8)
{
}
/// <summary>
/// Creates a LineReader from a filename. The file is only opened
/// (or even checked for existence) when the enumerator is fetched.
/// UTF8 is used to decode the file into text.
/// </summary>
/// <param name="filename">File to read from</param>
public ReverseLineReader(string filename)
: this(filename, Encoding.UTF8)
{
}
/// <summary>
/// Creates a LineReader from a filename. The file is only opened
/// (or even checked for existence) when the enumerator is fetched.
/// </summary>
/// <param name="filename">File to read from</param>
/// <param name="encoding">Encoding to use to decode the file into text</param>
public ReverseLineReader(string filename, Encoding encoding)
: this(() => File.OpenRead(filename), encoding)
{
}
/// <summary>
/// Creates a LineReader from a stream source. The delegate is only
/// called when the enumerator is fetched.
/// </summary>
/// <param name="streamSource">Data source</param>
/// <param name="encoding">Encoding to use to decode the stream into text</param>
public ReverseLineReader(Func<Stream> streamSource, Encoding encoding)
: this(streamSource, encoding, DefaultBufferSize)
{
}
internal ReverseLineReader(Func<Stream> streamSource, Encoding encoding, int bufferSize)
{
this.streamSource = streamSource;
this.encoding = encoding;
this.bufferSize = bufferSize;
if (encoding.IsSingleByte)
{
// For a single byte encoding, every byte is the start (and end) of a character
characterStartDetector = (pos, data) => true;
}
else if (encoding is UnicodeEncoding)
{
// For UTF-16, even-numbered positions are the start of a character
characterStartDetector = (pos, data) => (pos & 1) == 0;
}
else if (encoding is UTF8Encoding)
{
// For UTF-8, bytes with the top bit clear or the second bit set are the start of a character
// See http://www.cl.cam.ac.uk/~mgk25/unicode.html
characterStartDetector = (pos, data) => (data & 0x80) == 0 || (data & 0x40) != 0;
}
else
{
throw new ArgumentException("Only single byte, UTF-8 and Unicode encodings are permitted");
}
}
/// <summary>
/// Returns the enumerator reading strings backwards. If this method discovers that
/// the returned stream is either unreadable or unseekable, a NotSupportedException is thrown.
/// </summary>
public IEnumerator<string> GetEnumerator()
{
Stream stream = streamSource();
if (!stream.CanSeek)
{
stream.Dispose();
throw new NotSupportedException("Unable to seek within stream");
}
if (!stream.CanRead)
{
stream.Dispose();
throw new NotSupportedException("Unable to read within stream");
}
return GetEnumeratorImpl(stream);
}
private IEnumerator<string> GetEnumeratorImpl(Stream stream)
{
try
{
long position = stream.Length;
if (encoding is UnicodeEncoding && (position & 1) != 0)
{
throw new InvalidDataException("UTF-16 encoding provided, but stream has odd length.");
}
// Allow up to two bytes for data from the start of the previous
// read which didn't quite make it as full characters
byte[] buffer = new byte[bufferSize + 2];
char[] charBuffer = new char[encoding.GetMaxCharCount(buffer.Length)];
int leftOverData = 0;
String previousEnd = null;
// TextReader doesn't return an empty string if there's line break at the end
// of the data. Therefore we don't return an empty string if it's our *first*
// return.
bool firstYield = true;
// A line-feed at the start of the previous buffer means we need to swallow
// the carriage-return at the end of this buffer - hence this needs declaring
// way up here!
bool swallowCarriageReturn = false;
while (position > 0)
{
int bytesToRead = Math.Min(position > int.MaxValue ? bufferSize : (int)position, bufferSize);
position -= bytesToRead;
stream.Position = position;
StreamUtil.ReadExactly(stream, buffer, bytesToRead);
// If we haven't read a full buffer, but we had bytes left
// over from before, copy them to the end of the buffer
if (leftOverData > 0 && bytesToRead != bufferSize)
{
// Buffer.BlockCopy doesn't document its behaviour with respect
// to overlapping data: we *might* just have read 7 bytes instead of
// 8, and have two bytes to copy...
Array.Copy(buffer, bufferSize, buffer, bytesToRead, leftOverData);
}
// We've now *effectively* read this much data.
bytesToRead += leftOverData;
int firstCharPosition = 0;
while (!characterStartDetector(position + firstCharPosition, buffer[firstCharPosition]))
{
firstCharPosition++;
// Bad UTF-8 sequences could trigger this. For UTF-8 we should always
// see a valid character start in every 3 bytes, and if this is the start of the file
// so we've done a short read, we should have the character start
// somewhere in the usable buffer.
if (firstCharPosition == 3 || firstCharPosition == bytesToRead)
{
throw new InvalidDataException("Invalid UTF-8 data");
}
}
leftOverData = firstCharPosition;
int charsRead = encoding.GetChars(buffer, firstCharPosition, bytesToRead - firstCharPosition, charBuffer, 0);
int endExclusive = charsRead;
for (int i = charsRead - 1; i >= 0; i--)
{
char lookingAt = charBuffer[i];
if (swallowCarriageReturn)
{
swallowCarriageReturn = false;
if (lookingAt == '\r')
{
endExclusive--;
continue;
}
}
// Anything non-line-breaking, just keep looking backwards
if (lookingAt != '\n' && lookingAt != '\r')
{
continue;
}
// End of CRLF? Swallow the preceding CR
if (lookingAt == '\n')
{
swallowCarriageReturn = true;
}
int start = i + 1;
string bufferContents = new string(charBuffer, start, endExclusive - start);
endExclusive = i;
string stringToYield = previousEnd == null ? bufferContents : bufferContents + previousEnd;
if (!firstYield || stringToYield.Length != 0)
{
yield return stringToYield;
}
firstYield = false;
previousEnd = null;
}
previousEnd = endExclusive == 0 ? null : (new string(charBuffer, 0, endExclusive) + previousEnd);
// If we didn't decode the start of the array, put it at the end for next time
if (leftOverData != 0)
{
Buffer.BlockCopy(buffer, 0, buffer, bufferSize, leftOverData);
}
}
if (leftOverData != 0)
{
// At the start of the final buffer, we had the end of another character.
throw new InvalidDataException("Invalid UTF-8 data at start of stream");
}
if (firstYield && string.IsNullOrEmpty(previousEnd))
{
yield break;
}
yield return previousEnd ?? "";
}
finally
{
stream.Dispose();
}
}
IEnumerator IEnumerable.GetEnumerator()
{
return GetEnumerator();
}
}
}
I do see that dispose is called on the stream, but doesn't redeclaring it fix that? That class was copied from here: How to read a text file reversely with iterator in C#
Thanks,
The finally clause in private IEnumerator<string> GetEnumeratorImpl(Stream stream) disposes your stream as soon as an enumeration ends. Because your lambda returns the same fileStream instance every time, any later enumeration gets a stream that has already been disposed. Generally, you should dispose an object in the same scope or class that creates it.
In this case, remove all the Dispose calls in ReverseLineReader and wrap your original code in a using:
using (FileStream fileStream = new FileStream(...))
{
...
do
{
...
} while(...);
}
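An alternative that leaves the reader's Dispose calls in place is to hand it a factory that opens a fresh stream on every call, which matches the reader's own documentation ("The delegate is only called when the enumerator is fetched"). The sketch below illustrates the difference with a hypothetical stand-in class, DisposingByteReader, rather than ReverseLineReader itself, and uses MemoryStream so it runs without a file on disk.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical stand-in for a reader that, like ReverseLineReader, calls the
// stream factory once per enumeration and disposes the stream when finished.
public sealed class DisposingByteReader
{
    private readonly Func<Stream> streamSource;

    public DisposingByteReader(Func<Stream> streamSource)
    {
        this.streamSource = streamSource;
    }

    public IEnumerable<int> ReadAll()
    {
        Stream stream = streamSource();
        try
        {
            long length = stream.Length; // throws ObjectDisposedException on a disposed MemoryStream
            for (long i = 0; i < length; i++)
                yield return stream.ReadByte();
        }
        finally
        {
            stream.Dispose();
        }
    }
}

public static class StreamFactoryDemo
{
    // Enumerates twice and reports whether the second pass hit a disposed stream.
    public static bool SecondPassThrows(Func<Stream> source)
    {
        var reader = new DisposingByteReader(source);
        foreach (int unused in reader.ReadAll()) { } // first pass disposes the stream
        try
        {
            foreach (int unused in reader.ReadAll()) { }
            return false;
        }
        catch (ObjectDisposedException)
        {
            return true;
        }
    }
}
```

With a captured instance (() => shared) the second pass fails; with a factory (() => new MemoryStream(data)) it succeeds. For the original code, the equivalent would be passing () => new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.ReadWrite) instead of () => fileStream.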

Do Networkstreams have racing conditions? [duplicate]

So, it would seem that a blocking Read() can return before it is done receiving all of the data being sent to it. In turn we wrap the Read() with a loop that is controlled by the DataAvailable value from the stream in question. The problem is that you can receive more data while in this while loop, but there is no behind the scenes processing going on to let the system know this. Most of the solutions I have found to this on the net have not been applicable in one way or another to me.
What I have ended up doing is adding a simple Thread.Sleep(1) as the last step in my loop, after reading each block from the stream. This appears to give the system time to update, and I am now getting accurate results, but it seems hacky and rather 'circumstantial' for a solution.
Here is a list of the circumstances I am dealing with: Single TCP Connection between an IIS Application and a standalone application, both written in C# for send/receive communication. It sends a request and then waits for a response. This request is initiated by an HTTP request, but I am not having this issue reading data from the HTTP Request, it is after the fact.
Here is the basic code for handling an incoming connection
protected void OnClientCommunication(TcpClient oClient)
{
NetworkStream stream = oClient.GetStream();
MemoryStream msIn = new MemoryStream();
byte[] aMessage = new byte[4096];
int iBytesRead = 0;
while ( stream.DataAvailable )
{
int iRead = stream.Read(aMessage, 0, aMessage.Length);
iBytesRead += iRead;
msIn.Write(aMessage, 0, iRead);
Thread.Sleep(1);
}
MemoryStream msOut = new MemoryStream();
// .. Do some processing adding data to the msOut stream
msOut.WriteTo(stream);
stream.Flush();
oClient.Close();
}
All feedback welcome for a better solution or just a thumbs up on needing to give that Sleep(1) a go to allow things to update properly before we check the DataAvailable value.
Guess I am hoping after 2 years that the answer to this question isn't how things still are :)
You have to know how much data you need to read; you cannot simply loop reading data until there is no more data, because you can never be sure that no more is going to come.
This is why HTTP GET results have a byte count in the HTTP headers: so the client side will know when it has received all the data.
Here are two solutions for you depending on whether you have control over what the other side is sending:
Use "framing" characters: (SB)data(EB), where SB and EB are start-block and end-block characters (of your choosing) but which CANNOT occur inside the data. When you "see" EB, you know you are done.
Implement a length field in front of each message to indicate how much data follows: (len)data. Read (len), then read (len) bytes; repeat as necessary.
This isn't like reading from a file where a zero-length read means end-of-data (that DOES mean the other side has disconnected, but that's another story).
A third (not recommended) solution is that you can implement a timer. Once you start getting data, set the timer. If the receive loop is idle for some period of time (say a few seconds, if data doesn't come often), you can probably assume no more data is coming. This last method is a last resort... it's not very reliable, hard to tune, and it's fragile.
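The length-field approach can be sketched roughly as follows. The 4-byte big-endian header and the names LengthPrefixedFraming, WriteMessage, ReadMessage, and ReadExactly are assumptions chosen for illustration; both sides of the connection would have to agree on the same header format.

```csharp
using System;
using System.IO;

// Hypothetical helper: a (len)data framing sketch over any Stream.
public static class LengthPrefixedFraming
{
    // Writes a 4-byte big-endian length followed by the payload.
    public static void WriteMessage(Stream stream, byte[] payload)
    {
        int n = payload.Length;
        var header = new byte[]
        {
            (byte)((n >> 24) & 0xFF),
            (byte)((n >> 16) & 0xFF),
            (byte)((n >> 8) & 0xFF),
            (byte)(n & 0xFF)
        };
        stream.Write(header, 0, header.Length);
        stream.Write(payload, 0, payload.Length);
    }

    // Reads one message: first the 4-byte length, then exactly that many bytes.
    public static byte[] ReadMessage(Stream stream)
    {
        byte[] header = ReadExactly(stream, 4);
        int n = (header[0] << 24) | (header[1] << 16) | (header[2] << 8) | header[3];
        if (n < 0)
            throw new InvalidDataException("Negative message length.");
        return ReadExactly(stream, n);
    }

    // Loops until count bytes have arrived; a single Read may return fewer.
    private static byte[] ReadExactly(Stream stream, int count)
    {
        var buffer = new byte[count];
        int offset = 0;
        while (offset < count)
        {
            int read = stream.Read(buffer, offset, count - offset);
            if (read == 0)
                throw new EndOfStreamException("Stream ended before the full message arrived.");
            offset += read;
        }
        return buffer;
    }
}
```

Because the reader always knows how many bytes are still outstanding, it never has to consult DataAvailable or sleep; a blocking Read simply waits for the rest of the message.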
I'm seeing a problem with this.
You're expecting that the communication will be faster than the while() loop, which is very unlikely.
The while() loop will finish as soon as there is no more data, which may not be the case a few milliseconds just after it exits.
Are you expecting a certain amount of bytes?
How often is OnClientCommunication() fired? Who triggers it?
What do you do with the data after the while() loop? Do you keep appending to previous data?
DataAvailable WILL return false because you're reading faster than the communication, so that's fine only if you keep coming back to this code block to process more data coming in.
I was trying to check DataAvailable before reading data from a network stream and it would return false, although after reading a single byte it would return true. So I checked the MSDN documentation and they also read before checking. I would re-arrange the while loop to a do while loop to follow this pattern.
http://msdn.microsoft.com/en-us/library/system.net.sockets.networkstream.dataavailable.aspx
// Check to see if this NetworkStream is readable.
if(myNetworkStream.CanRead){
byte[] myReadBuffer = new byte[1024];
StringBuilder myCompleteMessage = new StringBuilder();
int numberOfBytesRead = 0;
// Incoming message may be larger than the buffer size.
do{
numberOfBytesRead = myNetworkStream.Read(myReadBuffer, 0, myReadBuffer.Length);
myCompleteMessage.AppendFormat("{0}", Encoding.ASCII.GetString(myReadBuffer, 0, numberOfBytesRead));
}
while(myNetworkStream.DataAvailable);
// Print out the received message to the console.
Console.WriteLine("You received the following message : " +
myCompleteMessage);
}
else{
Console.WriteLine("Sorry. You cannot read from this NetworkStream.");
}
When I have this code:
var readBuffer = new byte[1024];
using (var memoryStream = new MemoryStream())
{
do
{
int numberOfBytesRead = networkStream.Read(readBuffer, 0, readBuffer.Length);
memoryStream.Write(readBuffer, 0, numberOfBytesRead);
}
while (networkStream.DataAvailable);
}
From what I can observe:
When the sender sends 1000 bytes and the reader wants to read them, I suspect that NetworkStream somehow "knows" that it should receive 1000 bytes.
When I call .Read before any data has arrived, .Read should block until it gets more than 0 bytes (or more, if .NoDelay is false on the networkStream).
Then, when I read the first batch of data, I suspect that .Read somehow updates the NetworkStream's counter of those 1000 bytes from its result. Before that happens, I suspect .DataAvailable is set to false, and only after the counter is updated is .DataAvailable set to the correct value, if fewer than 1000 bytes have been counted. It makes sense when you think about it: otherwise the loop would go on to the next cycle before checking that all 1000 bytes had arrived, and .Read would block indefinitely, because the reader could already have read all 1000 bytes and no more data would arrive.
I think this is the point of failure here, as James already said:
Yes, this is just the way these libraries work. They need to be given time to run to fully validate the data incoming. – James Apr 20 '16 at 5:24
I suspect that the update of the internal counter between the end of .Read and the access to .DataAvailable is not an atomic operation (transaction), so the TcpClient needs more time to set DataAvailable properly.
When I have this code:
var readBuffer = new byte[1024];
using (var memoryStream = new MemoryStream())
{
do
{
int numberOfBytesRead = networkStream.Read(readBuffer, 0, readBuffer.Length);
memoryStream.Write(readBuffer, 0, numberOfBytesRead);
if (!networkStream.DataAvailable)
System.Threading.Thread.Sleep(1); //Or 50 for non-believers ;)
}
while (networkStream.DataAvailable);
}
Then the NetworkStream has enough time to set .DataAvailable properly, and this method should function correctly.
Fun fact: this seems to be somehow OS-version dependent. The first function, without the sleep, worked for me on Win XP and Win 10 but failed to receive the whole 1000 bytes on Win 7. Don't ask me why, but I tested it quite thoroughly and it was easily reproducible.
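Whatever the cause of the OS-dependent behaviour described above, a loop that does not consult .DataAvailable at all sidesteps the timing question. The following is only a sketch, assuming the reader knows in advance how many bytes to expect (for instance the 1000 bytes in the scenario above). StreamHelpers and ReadExactly are hypothetical names introduced for illustration; they are not part of the original code. The sketch relies only on the documented behaviour that Stream.Read returns 0 once the end of the stream is reached (for a NetworkStream, once the remote side has closed the connection).

```csharp
using System;
using System.IO;

// Hypothetical helper, introduced for illustration only.
public static class StreamHelpers
{
    // Reads exactly 'count' bytes from 'stream', looping because a single
    // Read call may return fewer bytes than requested.
    public static byte[] ReadExactly(Stream stream, int count)
    {
        if (stream == null) throw new ArgumentNullException(nameof(stream));
        if (count < 0) throw new ArgumentOutOfRangeException(nameof(count));

        var result = new byte[count];
        int offset = 0;
        while (offset < count)
        {
            int read = stream.Read(result, offset, count - offset);
            if (read == 0)
            {
                // Read returns 0 only at the end of the stream.
                throw new EndOfStreamException(
                    $"Expected {count} bytes but the stream ended after {offset}.");
            }
            offset += read;
        }
        return result;
    }
}
```

With a NetworkStream this would be called as StreamHelpers.ReadExactly(networkStream, 1000); the call blocks until all 1000 bytes have arrived or the connection is closed.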
Using TcpClient.Available will allow this code to read exactly what is available each time. TcpClient.Available is automatically set to TcpClient.ReceiveBufferSize when the amount of data remaining to be read is greater than or equal to TcpClient.ReceiveBufferSize. Otherwise it is set to the size of the remaining data.
Hence, you can indicate the maximum amount of data that is available for each read by setting TcpClient.ReceiveBufferSize (e.g., oClient.ReceiveBufferSize = 4096;).
protected void OnClientCommunication(TcpClient oClient)
{
NetworkStream stream = oClient.GetStream();
MemoryStream msIn = new MemoryStream();
byte[] aMessage;
oClient.ReceiveBufferSize = 4096;
int iBytesRead = 0;
while (stream.DataAvailable)
{
// Always allocate at least one byte so that Read has room to receive data.
int myBufferSize = (oClient.Available < 1) ? 1 : oClient.Available;
aMessage = new byte[myBufferSize];
int iRead = stream.Read(aMessage, 0, aMessage.Length);
iBytesRead += iRead;
msIn.Write(aMessage, 0, iRead);
}
MemoryStream msOut = new MemoryStream();
// .. Do some processing adding data to the msOut stream
msOut.WriteTo(stream);
stream.Flush();
oClient.Close();
}
public class NetworkStream
{
private readonly Socket m_Socket;
public NetworkStream(Socket socket)
{
m_Socket = socket ?? throw new ArgumentNullException(nameof(socket));
}
public void Send(string message)
{
if (message is null)
{
throw new ArgumentNullException(nameof(message));
}
byte[] data = Encoding.UTF8.GetBytes(message);
SendInternal(data);
}
public string Receive()
{
byte[] buffer = ReceiveInternal();
string message = Encoding.UTF8.GetString(buffer);
return message;
}
private void SendInternal(byte[] message)
{
int size = message.Length;
if (size == 0)
{
m_Socket.Send(BitConverter.GetBytes(size), 0, sizeof(int), SocketFlags.None);
}
else
{
m_Socket.Send(BitConverter.GetBytes(size), 0, sizeof(int), SocketFlags.None);
m_Socket.Send(message, 0, size, SocketFlags.None);
}
}
private byte[] ReceiveInternal()
{
byte[] sizeData = CommonReceiveMessage(sizeof(int));
int size = BitConverter.ToInt32(sizeData);
if (size == 0)
{
return Array.Empty<byte>();
}
return CommonReceiveMessage(size);
}
private byte[] CommonReceiveMessage(int messageLength)
{
if (messageLength < 0)
{
throw new ArgumentOutOfRangeException(nameof(messageLength), messageLength, "Message size cannot be less than zero.");
}
if (messageLength == 0)
{
return Array.Empty<byte>();
}
byte[] buffer = new byte[m_Socket.ReceiveBufferSize];
int currentLength = 0;
int receivedDataLength;
using (MemoryStream memoryStream = new())
{
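do
{
// Never request more than the bytes still missing, so the next message is left untouched.
int bytesToRead = Math.Min(buffer.Length, messageLength - currentLength);
receivedDataLength = m_Socket.Receive(buffer, 0, bytesToRead, SocketFlags.None);
if (receivedDataLength == 0)
{
// Receive returns 0 when the remote side has closed the connection.
throw new SocketException((int)SocketError.ConnectionReset);
}
currentLength += receivedDataLength;
memoryStream.Write(buffer, 0, receivedDataLength);
}
while (currentLength < messageLength);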
return memoryStream.ToArray();
}
}
}
This example presents an algorithm for sending and receiving data, namely text messages. You can also send files.
using System;
using System.IO;
using System.Net.Sockets;
using System.Text;
namespace Network
{
/// <summary>
/// Represents a network stream for transferring data.
/// </summary>
public class NetworkStream
{
#region Fields
private static readonly byte[] EmptyArray = Array.Empty<byte>();
private readonly Socket m_Socket;
#endregion
#region Constructors
/// <summary>
/// Initializes a new instance of the class <seealso cref="NetworkStream"/>.
/// </summary>
/// <param name="socket">
/// Berkeley socket interface.
/// </param>
public NetworkStream(Socket socket)
{
m_Socket = socket ?? throw new ArgumentNullException(nameof(socket));
}
#endregion
#region Properties
#endregion
#region Methods
/// <summary>
/// Sends a message.
/// </summary>
/// <param name="message">
/// Message text.
/// </param>
/// <exception cref="ArgumentNullException"/>
public void Send(string message)
{
if (message is null)
{
throw new ArgumentNullException(nameof(message));
}
byte[] data = Encoding.UTF8.GetBytes(message);
Write(data);
}
/// <summary>
/// Receives the sent message.
/// </summary>
/// <returns>
/// Sent message.
/// </returns>
public string Receive()
{
byte[] data = Read();
return Encoding.UTF8.GetString(data);
}
/// <summary>
/// Receives the specified number of bytes from a bound <seealso cref="Socket"/>.
/// </summary>
/// <param name="size">
/// The size of the received data.
/// </param>
/// <returns>
/// Returns an array of received data.
/// </returns>
private byte[] Read(int size)
{
if (size < 0)
{
// You can throw an exception.
return null;
}
if (size == 0)
{
// Don't throw an exception here, just return an empty data array.
return EmptyArray;
}
// There are many examples on the Internet where the
// Socket.Available property is used, this is WRONG!
// Important! The Socket.Available property is not working as expected.
// Data packages may be in transit, but the Socket.Available property may indicate otherwise.
// Therefore, we use a counter that will allow us to receive all data packets, no more and no less.
// The cycle will continue until we receive all the data packets or the timeout is triggered.
// Note. This algorithm is not designed to work with big data.
SimpleCounter counter = new(size, m_Socket.ReceiveBufferSize);
byte[] buffer = new byte[counter.BufferSize];
int received;
using MemoryStream storage = new();
// The cycle will run until we get all the data.
while (counter.IsExpected)
{
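received = m_Socket.Receive(buffer, 0, counter.Available, SocketFlags.None);
if (received == 0)
{
// Receive returns 0 when the remote side has closed the connection; stop instead of looping forever.
throw new SocketException((int)SocketError.ConnectionReset);
}
// Pass the size of the received data to the counter.
counter.Count(received);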
// Write data to memory.
storage.Write(buffer, 0, received);
}
return storage.ToArray();
}
/// <summary>
/// Receives the specified number of bytes from a bound <seealso cref="Socket"/>.
/// </summary>
/// <returns>
/// Returns an array of received data.
/// </returns>
private byte[] Read()
{
byte[] sizeData;
// First, we get the size of the master data.
sizeData = Read(sizeof(int));
// We convert the received data into a number.
int size = BitConverter.ToInt32(sizeData);
// If the data size is less than 0 then throws an exception.
// We inform the recipient that an error occurred while reading the data.
if (size < 0)
{
// Or return the value null.
throw new SocketException();
}
// If the data size is 0, then we will return an empty array.
// Do not allow an exception here.
if (size == 0)
{
return EmptyArray;
}
// Here we read the master data.
byte[] data = Read(size);
return data;
}
/// <summary>
/// Writes data to the stream.
/// </summary>
/// <param name="data"></param>
private void Write(byte[] data)
{
if (data is null)
{
// Throw an exception.
// Or send a negative number that will represent the value null.
throw new ArgumentNullException(nameof(data));
}
byte[] sizeData = BitConverter.GetBytes(data.Length);
// In any case, we inform the recipient about the size of the data.
m_Socket.Send(sizeData, 0, sizeof(int), SocketFlags.None);
if (data.Length != 0)
{
// We send data whose size is greater than zero.
m_Socket.Send(data, 0, data.Length, SocketFlags.None);
}
}
#endregion
#region Classes
/// <summary>
/// Represents a simple counter of received data over the network.
/// </summary>
private class SimpleCounter
{
#region Fields
private int m_Received;
private int m_Available;
private bool m_IsExpected;
#endregion
#region Constructors
/// <summary>
/// Initializes a new instance of the class <seealso cref="SimpleCounter"/>.
/// </summary>
/// <param name="dataSize">
/// Data size.
/// </param>
/// <param name="bufferSize">
/// Buffer size.
/// </param>
/// <exception cref="ArgumentOutOfRangeException"/>
public SimpleCounter(int dataSize, int bufferSize)
{
if (dataSize < 0)
{
throw new ArgumentOutOfRangeException(nameof(dataSize), dataSize, "Data size cannot be less than 0");
}
if (bufferSize < 0)
{
throw new ArgumentOutOfRangeException(nameof(bufferSize), bufferSize, "Buffer size cannot be less than 0");
}
DataSize = dataSize;
BufferSize = bufferSize;
// Update the counter data.
UpdateCounter();
}
#endregion
#region Properties
/// <summary>
/// Returns the size of the expected data.
/// </summary>
/// <value>
/// Size of expected data.
/// </value>
public int DataSize { get; }
/// <summary>
/// Returns the size of the buffer.
/// </summary>
/// <value>
/// Buffer size.
/// </value>
public int BufferSize { get; }
/// <summary>
/// Returns the available buffer size for receiving data.
/// </summary>
/// <value>
/// Available buffer size.
/// </value>
public int Available
{
get
{
return m_Available;
}
}
/// <summary>
/// Returns a value indicating whether the thread should wait for data.
/// </summary>
/// <value>
/// <see langword="true"/> if the stream is waiting for data; otherwise, <see langword="false"/>.
/// </value>
public bool IsExpected
{
get
{
return m_IsExpected;
}
}
#endregion
#region Methods
// Updates the counter.
private void UpdateCounter()
{
int unreadDataSize = DataSize - m_Received;
m_Available = unreadDataSize < BufferSize ? unreadDataSize : BufferSize;
m_IsExpected = m_Available > 0;
}
/// <summary>
/// Specifies the size of the received data.
/// </summary>
/// <param name="bytes">
/// The size of the received data.
/// </param>
public void Count(int bytes)
{
// NOTE: Counter cannot decrease.
if (bytes > 0)
{
int received = m_Received += bytes;
// NOTE: The value of the received data cannot exceed the size of the expected data.
m_Received = (received < DataSize) ? received : DataSize;
// Update the counter data.
UpdateCounter();
}
}
/// <summary>
/// Resets counter data.
/// </summary>
public void Reset()
{
m_Received = 0;
UpdateCounter();
}
#endregion
}
#endregion
}
}
Use a do-while loop. This makes sure at least one read has happened, and the stream position has moved, before .DataAvailable is checked. The first Read or ReadAsync moves the stream position, and from then on the .DataAvailable property continues to return true until we hit the end of the stream.
An example from the Microsoft docs:
// Check to see if this NetworkStream is readable.
if(myNetworkStream.CanRead){
byte[] myReadBuffer = new byte[1024];
StringBuilder myCompleteMessage = new StringBuilder();
int numberOfBytesRead = 0;
// Incoming message may be larger than the buffer size.
do{
numberOfBytesRead = myNetworkStream.Read(myReadBuffer, 0, myReadBuffer.Length);
myCompleteMessage.AppendFormat("{0}", Encoding.ASCII.GetString(myReadBuffer, 0, numberOfBytesRead));
}
while(myNetworkStream.DataAvailable);
// Print out the received message to the console.
Console.WriteLine("You received the following message : " +
myCompleteMessage);
}
else{
Console.WriteLine("Sorry. You cannot read from this NetworkStream.");
}
Original Microsoft doc

Reading text files line by line, with exact offset/position reporting

My simple requirement: reading a huge (more than a million lines) text file (for this example, assume it's a CSV of some sort) and keeping a reference to the beginning of each line for faster lookup in the future (read a line, starting at X).
I tried the naive and easy way first, using a StreamReader and accessing the underlying BaseStream.Position. Unfortunately that doesn't work as I intended:
Given a file containing the following
Foo
Bar
Baz
Bla
Fasel
and this very simple code
using (var sr = new StreamReader(@"C:\Temp\LineTest.txt")) {
string line;
long pos = sr.BaseStream.Position;
while ((line = sr.ReadLine()) != null) {
Console.Write("{0:d3} ", pos);
Console.WriteLine(line);
pos = sr.BaseStream.Position;
}
}
the output is:
000 Foo
025 Bar
025 Baz
025 Bla
025 Fasel
I can imagine that the stream is trying to be helpful/efficient and probably reads in (big) chunks whenever new data is necessary. For me this is bad..
The question, finally: Any way to get the (byte, char) offset while reading a file line by line without using a basic Stream and messing with \r \n \r\n and string encoding etc. manually? Not a big deal, really, I just don't like to build things that might exist already..
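The buffering guess above can be checked without touching the file system. This is a small sketch, assuming the same five-line content with \r\n line endings (25 bytes in ASCII) held in a MemoryStream; BufferingDemo is a hypothetical name used only for this illustration. Because the whole input fits into the reader's internal buffer, the first ReadLine call already consumes every byte of the base stream, so the reported position stays at 25 for every line.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

// Hypothetical demo class, introduced for illustration only.
public static class BufferingDemo
{
    // Returns BaseStream.Position as observed after each ReadLine call.
    public static long[] PositionsAfterEachLine(string text)
    {
        byte[] bytes = Encoding.ASCII.GetBytes(text);
        var positions = new List<long>();
        using (var sr = new StreamReader(new MemoryStream(bytes)))
        {
            while (sr.ReadLine() != null)
            {
                // The reader has already pulled a whole buffer from the stream,
                // so this is the end of the buffered chunk, not the end of the line.
                positions.Add(sr.BaseStream.Position);
            }
        }
        return positions.ToArray();
    }
}
```

Calling BufferingDemo.PositionsAfterEachLine("Foo\r\nBar\r\nBaz\r\nBla\r\nFasel") yields five entries, all equal to 25, which matches the 025 seen in the output above.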
You could create a TextReader wrapper, which would track the current position in the base TextReader :
public class TrackingTextReader : TextReader
{
private TextReader _baseReader;
private int _position;
public TrackingTextReader(TextReader baseReader)
{
_baseReader = baseReader;
}
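public override int Read()
{
int c = _baseReader.Read();
// Only count characters that were actually read (Read returns -1 at the end).
if (c != -1)
_position++;
return c;
}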
public override int Peek()
{
return _baseReader.Peek();
}
public int Position
{
get { return _position; }
}
}
You could then use it as follows :
string text = @"Foo
Bar
Baz
Bla
Fasel";
using (var reader = new StringReader(text))
using (var trackingReader = new TrackingTextReader(reader))
{
string line;
while ((line = trackingReader.ReadLine()) != null)
{
Console.WriteLine("{0:d3} {1}", trackingReader.Position, line);
}
}
After searching, testing, and trying some crazy things, here is my code to solve this (I'm currently using it in my product).
public sealed class TextFileReader : IDisposable
{
FileStream _fileStream = null;
BinaryReader _binReader = null;
StreamReader _streamReader = null;
List<string> _lines = null;
long _length = -1;
/// <summary>
/// Initializes a new instance of the <see cref="TextFileReader"/> class with default encoding (UTF8).
/// </summary>
/// <param name="filePath">The path to text file.</param>
public TextFileReader(string filePath) : this(filePath, Encoding.UTF8) { }
/// <summary>
/// Initializes a new instance of the <see cref="TextFileReader"/> class.
/// </summary>
/// <param name="filePath">The path to text file.</param>
/// <param name="encoding">The encoding of text file.</param>
public TextFileReader(string filePath, Encoding encoding)
{
if (!File.Exists(filePath))
throw new FileNotFoundException("File (" + filePath + ") is not found.");
_fileStream = new FileStream(filePath, FileMode.Open, FileAccess.Read, FileShare.Read);
_length = _fileStream.Length;
_binReader = new BinaryReader(_fileStream, encoding);
}
/// <summary>
/// Reads a line of characters from the current stream at the current position and returns the data as a string.
/// </summary>
/// <returns>The next line from the input stream, or null if the end of the input stream is reached</returns>
public string ReadLine()
{
if (_binReader.PeekChar() == -1)
return null;
string line = "";
int nextChar = _binReader.Read();
while (nextChar != -1)
{
char current = (char)nextChar;
if (current.Equals('\n'))
break;
else if (current.Equals('\r'))
{
int pickChar = _binReader.PeekChar();
if (pickChar != -1 && ((char)pickChar).Equals('\n'))
nextChar = _binReader.Read();
break;
}
else
line += current;
nextChar = _binReader.Read();
}
return line;
}
/// <summary>
/// Reads some lines of characters from the current stream at the current position and returns the data as a collection of string.
/// </summary>
/// <param name="totalLines">The total number of lines to read (set as 0 to read from current position to end of file).</param>
/// <returns>The next lines from the input stream, or an empty collection if the end of the input stream is reached</returns>
public List<string> ReadLines(int totalLines)
{
if (totalLines < 1 && this.Position == 0)
return this.ReadAllLines();
_lines = new List<string>();
int counter = 0;
string line = this.ReadLine();
while (line != null)
{
_lines.Add(line);
counter++;
if (totalLines > 0 && counter >= totalLines)
break;
line = this.ReadLine();
}
return _lines;
}
/// <summary>
/// Reads all lines of characters from the current stream (from the begin to end) and returns the data as a collection of string.
/// </summary>
/// <returns>The next lines from the input stream, or an empty collection if the end of the input stream is reached</returns>
public List<string> ReadAllLines()
{
if (_streamReader == null)
_streamReader = new StreamReader(_fileStream);
_streamReader.BaseStream.Seek(0, SeekOrigin.Begin);
_lines = new List<string>();
string line = _streamReader.ReadLine();
while (line != null)
{
_lines.Add(line);
line = _streamReader.ReadLine();
}
return _lines;
}
/// <summary>
/// Gets the length of text file (in bytes).
/// </summary>
public long Length
{
get { return _length; }
}
/// <summary>
/// Gets or sets the current reading position.
/// </summary>
public long Position
{
get
{
if (_binReader == null)
return -1;
else
return _binReader.BaseStream.Position;
}
set
{
if (_binReader == null)
return;
else if (value >= this.Length)
this.SetPosition(this.Length);
else
this.SetPosition(value);
}
}
void SetPosition(long position)
{
_binReader.BaseStream.Seek(position, SeekOrigin.Begin);
}
/// <summary>
/// Gets the lines after reading.
/// </summary>
public List<string> Lines
{
get
{
return _lines;
}
}
/// <summary>
/// Performs application-defined tasks associated with freeing, releasing, or resetting unmanaged resources.
/// </summary>
public void Dispose()
{
if (_binReader != null)
_binReader.Close();
if (_streamReader != null)
{
_streamReader.Close();
_streamReader.Dispose();
}
if (_fileStream != null)
{
_fileStream.Close();
_fileStream.Dispose();
}
}
~TextFileReader()
{
this.Dispose();
}
}
This is a really tough issue.
After a very long and exhausting survey of different solutions on the internet (including solutions from this thread, thank you!), I had to reinvent the wheel and write my own.
I had the following requirements:
Performance - reading must be very fast, so reading one char at a time or using reflection is not acceptable; buffering is required
Streaming - the file can be huge, so reading it into memory entirely is not acceptable
Tailing - file tailing should be available
Long lines - lines can be very long, so the buffer can't be limited
Stable - a single-byte error was immediately visible during usage. Unfortunately for me, several implementations I found had stability problems
public class OffsetStreamReader
{
private const int InitialBufferSize = 4096;
private readonly char _bom;
private readonly byte _end;
private readonly Encoding _encoding;
private readonly Stream _stream;
private readonly bool _tail;
private byte[] _buffer;
private int _processedInBuffer;
private int _informationInBuffer;
public OffsetStreamReader(Stream stream, bool tail)
{
_buffer = new byte[InitialBufferSize];
_processedInBuffer = InitialBufferSize;
if (stream == null || !stream.CanRead)
throw new ArgumentException("stream");
_stream = stream;
_tail = tail;
_encoding = Encoding.UTF8;
_bom = '\uFEFF';
_end = _encoding.GetBytes(new [] {'\n'})[0];
}
public long Offset { get; private set; }
public string ReadLine()
{
// Underlying stream closed
if (!_stream.CanRead)
return null;
// EOF
if (_processedInBuffer == _informationInBuffer)
{
if (_tail)
{
_processedInBuffer = _buffer.Length;
_informationInBuffer = 0;
ReadBuffer();
}
return null;
}
var lineEnd = Search(_buffer, _end, _processedInBuffer);
var haveEnd = true;
// File ended but no finalizing newline character
if (lineEnd.HasValue == false && _informationInBuffer + _processedInBuffer < _buffer.Length)
{
if (_tail)
return null;
else
{
lineEnd = _informationInBuffer;
haveEnd = false;
}
}
// No end in current buffer
if (!lineEnd.HasValue)
{
ReadBuffer();
if (_informationInBuffer != 0)
return ReadLine();
return null;
}
var arr = new byte[lineEnd.Value - _processedInBuffer];
Array.Copy(_buffer, _processedInBuffer, arr, 0, arr.Length);
Offset = Offset + lineEnd.Value - _processedInBuffer + (haveEnd ? 1 : 0);
_processedInBuffer = lineEnd.Value + (haveEnd ? 1 : 0);
return _encoding.GetString(arr).TrimStart(_bom).TrimEnd('\r', '\n');
}
private void ReadBuffer()
{
var notProcessedPartLength = _buffer.Length - _processedInBuffer;
// Extend buffer to be able to fit whole line to the buffer
// Was [NOT_PROCESSED]
// Become [NOT_PROCESSED ]
if (notProcessedPartLength == _buffer.Length)
{
var extendedBuffer = new byte[_buffer.Length + _buffer.Length/2];
Array.Copy(_buffer, extendedBuffer, _buffer.Length);
_buffer = extendedBuffer;
}
// Copy not processed information to the beginning
// Was [PROCESSED NOT_PROCESSED]
// Become [NOT_PROCESSED ]
Array.Copy(_buffer, (long) _processedInBuffer, _buffer, 0, notProcessedPartLength);
// Read more information to the empty part of buffer
// Was [ NOT_PROCESSED ]
// Become [ NOT_PROCESSED NEW_NOT_PROCESSED ]
_informationInBuffer = notProcessedPartLength + _stream.Read(_buffer, notProcessedPartLength, _buffer.Length - notProcessedPartLength);
_processedInBuffer = 0;
}
private int? Search(byte[] buffer, byte byteToSearch, int bufferOffset)
{
for (int i = bufferOffset; i < buffer.Length - 1; i++)
{
if (buffer[i] == byteToSearch)
return i;
}
return null;
}
}
Though Thomas Levesque's solution works well, here's mine. It uses reflection so it will be slower, but it's encoding-independent. Plus I added Seek extension too.
/// <summary>Useful <see cref="StreamReader"/> extensions.</summary>
public static class StreamReaderExtentions
{
/// <summary>Gets the position within the <see cref="StreamReader.BaseStream"/> of the <see cref="StreamReader"/>.</summary>
/// <remarks><para>This method is quite slow. It uses reflection to access private <see cref="StreamReader"/> fields. Don't use it too often.</para></remarks>
/// <param name="streamReader">Source <see cref="StreamReader"/>.</param>
/// <exception cref="ArgumentNullException">Occurs when passed <see cref="StreamReader"/> is null.</exception>
/// <returns>The current position of this stream.</returns>
public static long GetPosition(this StreamReader streamReader)
{
if (streamReader == null)
throw new ArgumentNullException("streamReader");
var charBuffer = (char[])streamReader.GetType().InvokeMember("charBuffer", BindingFlags.DeclaredOnly | BindingFlags.NonPublic | BindingFlags.Instance | BindingFlags.GetField, null, streamReader, null);
var charPos = (int)streamReader.GetType().InvokeMember("charPos", BindingFlags.DeclaredOnly | BindingFlags.NonPublic | BindingFlags.Instance | BindingFlags.GetField, null, streamReader, null);
var charLen = (int)streamReader.GetType().InvokeMember("charLen", BindingFlags.DeclaredOnly | BindingFlags.NonPublic | BindingFlags.Instance | BindingFlags.GetField, null, streamReader, null);
var offsetLength = streamReader.CurrentEncoding.GetByteCount(charBuffer, charPos, charLen - charPos);
return streamReader.BaseStream.Position - offsetLength;
}
/// <summary>Sets the position within the <see cref="StreamReader.BaseStream"/> of the <see cref="StreamReader"/>.</summary>
/// <remarks>
/// <para><see cref="StreamReader.BaseStream"/> should be seekable.</para>
/// <para>This method is quite slow. It uses reflection and flushes the charBuffer of the <see cref="StreamReader.BaseStream"/>. Don't use it too often.</para>
/// </remarks>
/// <param name="streamReader">Source <see cref="StreamReader"/>.</param>
/// <param name="position">The point relative to origin from which to begin seeking.</param>
/// <param name="origin">Specifies the beginning, the end, or the current position as a reference point for origin, using a value of type <see cref="SeekOrigin"/>. </param>
/// <exception cref="ArgumentNullException">Occurs when passed <see cref="StreamReader"/> is null.</exception>
/// <exception cref="ArgumentException">Occurs when <see cref="StreamReader.BaseStream"/> is not seekable.</exception>
/// <returns>The new position in the stream. This position can be different to the <see cref="position"/> because of the preamble.</returns>
public static long Seek(this StreamReader streamReader, long position, SeekOrigin origin)
{
if (streamReader == null)
throw new ArgumentNullException("streamReader");
if (!streamReader.BaseStream.CanSeek)
throw new ArgumentException("Underlying stream should be seekable.", "streamReader");
var preamble = (byte[])streamReader.GetType().InvokeMember("_preamble", BindingFlags.DeclaredOnly | BindingFlags.NonPublic | BindingFlags.Instance | BindingFlags.GetField, null, streamReader, null);
if (preamble.Length > 0 && position < preamble.Length) // preamble or BOM must be skipped
position += preamble.Length;
var newPosition = streamReader.BaseStream.Seek(position, origin); // seek
streamReader.DiscardBufferedData(); // this updates the buffer
return newPosition;
}
}
Would this work:
using (var sr = new StreamReader(@"C:\Temp\LineTest.txt")) {
string line;
long pos = 0;
while ((line = sr.ReadLine()) != null) {
Console.Write("{0:d3} ", pos);
Console.WriteLine(line);
pos += line.Length;
}
}
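One caveat with accumulating line.Length: it counts characters rather than bytes, and it leaves out the line terminator that ReadLine strips, so the running total drifts from the real file offset. A possible alternative is to scan the raw bytes once and record where each line starts. The sketch below assumes an encoding in which the byte 0x0A always means '\n' (ASCII, UTF-8, Latin-1) and lines ending in \n or \r\n; LineIndexer and BuildLineOffsets are hypothetical names, not something from the answers above.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper, introduced for illustration only.
public static class LineIndexer
{
    // Returns the byte offset at which each line starts, counted from the
    // position where reading began. Assumes 0x0A only ever encodes '\n'.
    public static List<long> BuildLineOffsets(Stream stream)
    {
        var offsets = new List<long>();
        var buffer = new byte[64 * 1024];
        long chunkStart = 0;      // offset of buffer[0] relative to the start
        bool atLineStart = true;  // true when the next byte begins a new line
        int read;
        while ((read = stream.Read(buffer, 0, buffer.Length)) > 0)
        {
            for (int i = 0; i < read; i++)
            {
                if (atLineStart)
                {
                    offsets.Add(chunkStart + i);
                    atLineStart = false;
                }
                if (buffer[i] == (byte)'\n')
                    atLineStart = true;
            }
            chunkStart += read;
        }
        return offsets;
    }
}
```

To read line n later, seek the underlying stream to offsets[n] and start a fresh StreamReader there. If the file begins with a byte order mark, the first offset (0) points at the mark rather than at the first character.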

Best way to find position in the Stream where given byte sequence starts

What do you think is the best way to find the position in a System.Stream where a given byte sequence starts (first occurrence)?
public static long FindPosition(Stream stream, byte[] byteSequence)
{
long position = -1;
/// ???
return position;
}
P.S. The simplest yet fastest solution is preferred. :)
I arrived at this solution.
I did some benchmarks with an ASCII file that was 3.050 KB and 38803 lines.
With a search byte array of 22 bytes located in the last line of the file, I got the result in about 2.28 seconds (on a slow, old machine).
public static long FindPosition(Stream stream, byte[] byteSequence)
{
if (byteSequence.Length > stream.Length)
return -1;
byte[] buffer = new byte[byteSequence.Length];
using (BufferedStream bufStream = new BufferedStream(stream, byteSequence.Length))
{
int i;
while ((i = bufStream.Read(buffer, 0, byteSequence.Length)) == byteSequence.Length)
{
if (byteSequence.SequenceEqual(buffer))
return bufStream.Position - byteSequence.Length;
else
bufStream.Position -= byteSequence.Length - PadLeftSequence(buffer, byteSequence);
}
}
return -1;
}
private static int PadLeftSequence(byte[] bytes, byte[] seqBytes)
{
int i = 1;
while (i < bytes.Length)
{
int n = bytes.Length - i;
byte[] aux1 = new byte[n];
byte[] aux2 = new byte[n];
Array.Copy(bytes, i, aux1, 0, n);
Array.Copy(seqBytes, aux2, n);
if (aux1.SequenceEqual(aux2))
return i;
i++;
}
return i;
}
If you treat the stream like another sequence of bytes, you can just search it like you were doing a string search. Wikipedia has a great article on that. Boyer-Moore is a good and simple algorithm for this.
Here's a quick hack I put together in Java. It works, and it's pretty close to Boyer-Moore, if not exactly that. Hope it helps ;)
public static final int BUFFER_SIZE = 32;
public static int [] buildShiftArray(byte [] byteSequence){
int [] shifts = new int[byteSequence.length];
int [] ret;
int shiftCount = 0;
byte end = byteSequence[byteSequence.length-1];
int index = byteSequence.length-1;
int shift = 1;
while(--index >= 0){
if(byteSequence[index] == end){
shifts[shiftCount++] = shift;
shift = 1;
} else {
shift++;
}
}
ret = new int[shiftCount];
for(int i = 0;i < shiftCount;i++){
ret[i] = shifts[i];
}
return ret;
}
public static byte [] flushBuffer(byte [] buffer, int keepSize){
byte [] newBuffer = new byte[buffer.length];
for(int i = 0;i < keepSize;i++){
newBuffer[i] = buffer[buffer.length - keepSize + i];
}
return newBuffer;
}
public static int findBytes(byte [] haystack, int haystackSize, byte [] needle, int [] shiftArray){
int index = needle.length;
int searchIndex, needleIndex, currentShiftIndex = 0, shift;
boolean shiftFlag = false;
index = needle.length;
while(true){
needleIndex = needle.length-1;
while(true){
if(index >= haystackSize)
return -1;
if(haystack[index] == needle[needleIndex])
break;
index++;
}
searchIndex = index;
needleIndex = needle.length-1;
while(needleIndex >= 0 && haystack[searchIndex] == needle[needleIndex]){
searchIndex--;
needleIndex--;
}
if(needleIndex < 0)
return index-needle.length+1;
if(shiftFlag){
shiftFlag = false;
index += shiftArray[0];
currentShiftIndex = 1;
} else if(currentShiftIndex >= shiftArray.length){
shiftFlag = true;
index++;
} else{
index += shiftArray[currentShiftIndex++];
}
}
}
public static int findBytes(InputStream stream, byte [] needle){
byte [] buffer = new byte[BUFFER_SIZE];
int [] shiftArray = buildShiftArray(needle);
int bufferSize, initBufferSize;
int offset = 0, init = needle.length;
int val;
try{
while(true){
bufferSize = stream.read(buffer, needle.length-init, buffer.length-needle.length+init);
if(bufferSize == -1)
return -1;
if((val = findBytes(buffer, bufferSize+needle.length-init, needle, shiftArray)) != -1)
return val+offset;
buffer = flushBuffer(buffer, needle.length);
offset += bufferSize-init;
init = 0;
}
} catch (IOException e){
e.printStackTrace();
}
return -1;
}
You'll basically need to keep a buffer the same size as byteSequence so that once you've found that the "next byte" in the stream matches, you can check the rest but then still go back to the "next but one" byte if it's not an actual match.
It's likely to be a bit fiddly whatever you do, to be honest :(
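The "go back" step described above can be avoided altogether with a failure table in the style of Knuth-Morris-Pratt, which records how much of the sequence is still matched after a mismatch. This is a different technique from the Boyer-Moore variant mentioned earlier, shown here only as a sketch; StreamSearch and IndexOf are hypothetical names. The stream is read strictly forward, one byte at a time, so it works on non-seekable streams as well.

```csharp
using System;
using System.IO;

// Hypothetical helper, introduced for illustration only.
public static class StreamSearch
{
    // Returns the offset (counted from where reading started) of the first
    // occurrence of 'needle', or -1 if the stream ends before a match.
    public static long IndexOf(Stream stream, byte[] needle)
    {
        if (stream == null) throw new ArgumentNullException(nameof(stream));
        if (needle == null) throw new ArgumentNullException(nameof(needle));
        if (needle.Length == 0) return 0;

        // failure[i] = length of the longest proper prefix of needle[0..i]
        // that is also a suffix of needle[0..i].
        var failure = new int[needle.Length];
        for (int i = 1, k = 0; i < needle.Length; i++)
        {
            while (k > 0 && needle[i] != needle[k]) k = failure[k - 1];
            if (needle[i] == needle[k]) k++;
            failure[i] = k;
        }

        long consumed = 0; // bytes read from the stream so far
        int matched = 0;   // number of needle bytes currently matched
        int b;
        while ((b = stream.ReadByte()) != -1)
        {
            consumed++;
            // On a mismatch, fall back to the longest prefix that still fits.
            while (matched > 0 && b != needle[matched]) matched = failure[matched - 1];
            if (b == needle[matched]) matched++;
            if (matched == needle.Length) return consumed - needle.Length;
        }
        return -1;
    }
}
```

Since ReadByte is called once per byte, wrapping a file or network stream in a BufferedStream before searching is probably worthwhile.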
I needed to do this myself, had already started, and didn't like the solutions above. I specifically needed to find where the search-byte-sequence ends. In my situation, I need to fast-forward the stream until after that byte sequence. But you can use my solution for this question too:
var afterSequence = stream.ScanUntilFound(byteSequence);
var beforeSequence = afterSequence - byteSequence.Length;
Here is StreamExtensions.cs
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
namespace System
{
static class StreamExtensions
{
/// <summary>
/// Advances the supplied stream until the given searchBytes are found, without advancing too far (consuming any bytes from the stream after the searchBytes are found).
/// Regarding efficiency, if the stream is network or file, then MEMORY/CPU optimisations will be of little consequence here.
/// </summary>
/// <param name="stream">The stream to search in</param>
/// <param name="searchBytes">The byte sequence to search for</param>
/// <returns></returns>
public static int ScanUntilFound(this Stream stream, byte[] searchBytes)
{
// For this class code comments, a common example is assumed:
// searchBytes are {1,2,3,4} or 1234 for short
// # means value that is outside of search byte sequence
byte[] streamBuffer = new byte[searchBytes.Length];
int nextRead = searchBytes.Length;
int totalScannedBytes = 0;
while (true)
{
FillBuffer(stream, streamBuffer, nextRead);
totalScannedBytes += nextRead; //this is only used for final reporting of where it was found in the stream
if (ArraysMatch(searchBytes, streamBuffer, 0))
return totalScannedBytes; //found it
nextRead = FindPartialMatch(searchBytes, streamBuffer);
}
}
/// <summary>
/// Check all offsets, for partial match.
/// </summary>
/// <param name="searchBytes"></param>
/// <param name="streamBuffer"></param>
/// <returns>The amount of bytes which need to be read in, next round</returns>
static int FindPartialMatch(byte[] searchBytes, byte[] streamBuffer)
{
// 1234 = 0 - found it. this special case is already catered directly in ScanUntilFound
// #123 = 1 - partially matched, only missing 1 value
// ##12 = 2 - partially matched, only missing 2 values
// ###1 = 3 - partially matched, only missing 3 values
// #### = 4 - not matched at all
for (int i = 1; i < searchBytes.Length; i++)
{
if (ArraysMatch(searchBytes, streamBuffer, i))
{
// EG. Searching for 1234, have #123 in the streamBuffer, and [i] is 1
// Output: 123#, where # will be read using FillBuffer next.
Array.Copy(streamBuffer, i, streamBuffer, 0, searchBytes.Length - i);
return i; //if an offset of [i], makes a match then only [i] bytes need to be read from the stream to check if there's a match
}
}
// No partial match: the whole buffer must be refilled (4 in the 1234 example).
return searchBytes.Length;
}
/// <summary>
/// Reads bytes from the stream, making sure the requested amount of bytes are read (streams don't always fulfill the full request first time)
/// </summary>
/// <param name="stream">The stream to read from</param>
/// <param name="streamBuffer">The buffer to read into</param>
/// <param name="bytesNeeded">How many bytes are needed. If less than the full size of the buffer, it fills the tail end of the streamBuffer</param>
static void FillBuffer(Stream stream, byte[] streamBuffer, int bytesNeeded)
{
// EG1. [123#] - bytesNeeded is 1, when the streamBuffer contains first three matching values, but now we need to read in the next value at the end
// EG2. [####] - bytesNeeded is 4
var bytesAlreadyRead = streamBuffer.Length - bytesNeeded; //invert
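while (bytesAlreadyRead < streamBuffer.Length)
{
int read = stream.Read(streamBuffer, bytesAlreadyRead, streamBuffer.Length - bytesAlreadyRead);
if (read == 0)
throw new EndOfStreamException(); //Stream.Read returns 0 at end of stream; without this check the loop would never terminate
bytesAlreadyRead += read;
}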
}
/// <summary>
/// Checks if arrays match exactly, or with offset.
/// </summary>
/// <param name="searchBytes">Bytes to search for. Eg. [1234]</param>
/// <param name="streamBuffer">Buffer to match in. Eg. [#123] </param>
/// <param name="startAt">When this is zero, all bytes are checked. Eg. If this value 1, and it matches, this means the next byte in the stream to read may mean a match</param>
/// <returns></returns>
static bool ArraysMatch(byte[] searchBytes, byte[] streamBuffer, int startAt)
{
for (int i = 0; i < searchBytes.Length - startAt; i++)
{
if (searchBytes[i] != streamBuffer[i + startAt])
return false;
}
return true;
}
}
}
This is a bit of an old question, but here's my answer. I've found that reading blocks and then searching within them is extremely inefficient compared to just reading one byte at a time and going from there.
Also, IIRC, the accepted answer would fail if part of the sequence fell in one block read and the rest in the next. For example, given 12345 and searching for 23, it would read 12, not match, then read 34, not match, and so on. I haven't tried it, though, since it requires .NET 4.0. At any rate, this is way simpler, and likely much faster.
static long ReadOneSrch(Stream haystack, byte[] needle)
{
int b;
long i = 0;
while ((b = haystack.ReadByte()) != -1)
{
if (b == needle[i++])
{
if (i == needle.Length)
return haystack.Position - needle.Length;
}
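else
{
//Rewind to one byte past the start of the failed candidate, so that overlapping candidates (e.g. aab in aaab) are not missed
haystack.Position -= i - 1;
i = 0;
}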
}
return -1;
}
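When the needle has a repeated prefix (for example aab searched for in aaab), a byte-at-a-time scan has to fall back to the longest needle prefix that still matches after a mismatch. One way to do that without seeking is a Knuth-Morris-Pratt failure table. The following is only a minimal sketch of that idea, not something taken from the answers here; the names KmpStreamSearch, BuildFailureTable and IndexOf are made up for illustration, and the returned offset is counted from wherever the stream was positioned when the call started.

```csharp
using System;
using System.IO;

static class KmpStreamSearch // hypothetical helper, not part of the original answers
{
    // failure[k] = length of the longest proper prefix of needle[0..k]
    // that is also a suffix of needle[0..k].
    static int[] BuildFailureTable(byte[] needle)
    {
        var failure = new int[needle.Length];
        int matched = 0;
        for (int k = 1; k < needle.Length; k++)
        {
            while (matched > 0 && needle[k] != needle[matched])
                matched = failure[matched - 1];
            if (needle[k] == needle[matched])
                matched++;
            failure[k] = matched;
        }
        return failure;
    }

    // Returns the offset of the first match, counted in bytes read since the
    // call started, or -1 if the stream ends without a match.
    public static long IndexOf(Stream haystack, byte[] needle)
    {
        if (needle == null || needle.Length == 0)
            throw new ArgumentException("needle must not be empty", "needle");

        int[] failure = BuildFailureTable(needle);
        int matched = 0;     // number of needle bytes currently matched
        long bytesRead = 0;  // tracked manually, so Stream.Position is not needed
        int b;
        while ((b = haystack.ReadByte()) != -1)
        {
            bytesRead++;
            // On a mismatch, fall back to the longest prefix that still matches.
            while (matched > 0 && b != needle[matched])
                matched = failure[matched - 1];
            if (b == needle[matched])
                matched++;
            if (matched == needle.Length)
                return bytesRead - needle.Length;
        }
        return -1;
    }
}
```

For example, KmpStreamSearch.IndexOf(new MemoryStream(new byte[] { 1, 1, 1, 2 }), new byte[] { 1, 1, 2 }) should return 1. Because the sketch never touches Position or Seek, it should also work on non-seekable streams such as network streams.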
static long Search(Stream stream, byte[] pattern)
{
long start = -1;
stream.Seek(0, SeekOrigin.Begin);
while(stream.Position < stream.Length)
{
if (stream.ReadByte() != pattern[0])
continue;
start = stream.Position - 1;
for (int idx = 1; idx < pattern.Length; idx++)
{
if (stream.ReadByte() != pattern[idx])
{
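stream.Position = start + 1; //rewind so that a match overlapping the failed candidate is not skipped
start = -1;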
break;
}
}
if (start > -1)
{
return start;
}
}
return start;
}
