Let's assume we have a simple internet socket, and it's going to send 10 megabytes (because I want to ignore memory issues) of random data through.
Is there any performance difference or a best practice method that one should use for receiving data? The final output data should be represented by a byte[]. Yes I know writing an arbitrary amount of data to memory is bad, and if I was downloading a large file I wouldn't be doing it like this. But for argument's sake let's ignore that and assume it's a smallish amount of data. I also realise that the bottleneck here is probably not the memory management but rather the socket receiving. I just want to know what would be the most efficient method of receiving data.
A few dodgy ways I can think of are:
Have a List and a buffer; each time the buffer is full, add it to the list, and at the end call list.ToArray() to get the byte[].
Write the buffer to a memory stream; once it's complete, construct a byte[] of stream.Length and read the whole stream into it to get the byte[] output.
Is there a more efficient/better way of doing this?
Just write to a MemoryStream and then call ToArray - that does the business of constructing an appropriately-sized byte array for you. That's effectively what a List<byte> would be like anyway, but using a MemoryStream will be a lot simpler.
Well, Jon Skeet's answer is great (as usual), but there's no code, so here's my interpretation. (Worked fine for me.)
using (var mem = new MemoryStream())
{
    using (var tcp = new TcpClient())
    {
        tcp.Connect(new IPEndPoint(IPAddress.Parse("192.0.0.192"), 8880));
        tcp.GetStream().CopyTo(mem);
    }
    var bytes = mem.ToArray();
}
(Why not combine the two usings? Well, if you want to debug, you might want to release the tcp connection before taking your time looking at the bytes received.)
This code will receive multiple packets and aggregate their data, FYI. So it's a great way to simply receive all tcp data sent during a connection.
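For anyone curious what CopyTo is doing, here is a minimal sketch of roughly the same loop written by hand. ReadAllBytes is a hypothetical helper name, the 8192-byte chunk size is an arbitrary choice, and a MemoryStream stands in for tcp.GetStream() so the snippet runs without a server. It is written as a top-level program (C# 9 or later).

```csharp
using System;
using System.IO;

// Hypothetical helper: drains any readable stream into a byte[].
static byte[] ReadAllBytes(Stream source)
{
    using var mem = new MemoryStream();
    var buffer = new byte[8192];   // chunk size is an arbitrary choice
    int read;
    // Read returns 0 only once the stream has ended (for a socket, when
    // the other side has closed the connection).
    while ((read = source.Read(buffer, 0, buffer.Length)) > 0)
    {
        mem.Write(buffer, 0, read);
    }
    return mem.ToArray();          // allocates an exactly-sized copy
}

// Usage, with an in-memory stand-in for the network stream:
var payload = new byte[100_000];
new Random(1).NextBytes(payload);
byte[] received = ReadAllBytes(new MemoryStream(payload));
Console.WriteLine(received.Length); // 100000
```

Note that ToArray copies the data once more; for a few megabytes that is unlikely to matter next to the cost of the network.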
What is the encoding of your data? Is it plain ASCII, or is it something else, like UTF-8/Unicode?
If it is plain ASCII, you could just allocate a StringBuilder of the required size (get the size from the ContentLength header of the response) and keep appending your data to the builder, after converting each chunk into a string using Encoding.ASCII.
If it is Unicode/UTF-8 then you have an issue: you cannot just call GetString(buffer, 0, bytesRead) on the encoding for each chunk, because the bytes read so far might end in the middle of a multi-byte character. For this case you will need to buffer the entire entity body into memory (or a file), then read it back and decode it using the encoding.
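As a possible alternative to buffering everything first, System.Text.Decoder (obtained from Encoding.GetDecoder()) remembers an incomplete trailing byte sequence between calls. The sketch below is only an illustration under assumptions: the two hand-made chunks simulate two socket reads that split a two-byte UTF-8 character, and the 16-char output buffer is sized for this tiny input.

```csharp
using System;
using System.Text;

// "\u00E9" (e with acute accent) is two bytes in UTF-8: 0xC3 0xA9.
byte[] all = Encoding.UTF8.GetBytes("caf\u00E9");
byte[] firstChunk = all[..4];   // 'c', 'a', 'f', 0xC3  (character split here)
byte[] secondChunk = all[4..];  // 0xA9

Decoder decoder = Encoding.UTF8.GetDecoder();
var text = new StringBuilder();
var chars = new char[16];       // large enough for this example only

foreach (byte[] chunk in new[] { firstChunk, secondChunk })
{
    // The decoder holds back the dangling 0xC3 until the next call.
    int n = decoder.GetChars(chunk, 0, chunk.Length, chars, 0);
    text.Append(chars, 0, n);
}

Console.WriteLine(text.Length); // 4
```

Calling Encoding.UTF8.GetString on each chunk separately would instead produce replacement characters at the split.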
You could write to a memory stream, then use a StreamReader or something like that to get the data. What are you doing with the data? I ask because it would be more efficient from a memory standpoint to write the incoming data to a file or database table as it is received, rather than storing the entire contents in memory.
Currently I have this code
SessionStream(Request.Content.ReadAsStreamAsync(), new { });
I need to somehow "mirror" the incoming stream so that I have two instances of it.
Something like the following pseudo code:
Task<Stream> stream = Request.Content.ReadAsStreamAsync();
SessionStream(stream, new { });
Stream theOtherStream;
stream.Result.CopyToAsync(theOtherStream)
TheOtherStream(theOtherStream, new { });
A technique that always works is to copy the stream to a MemoryStream and then use that.
Often, it is more efficient to just seek the original stream back to the beginning using the Seek method. That only works if this stream supports seeking.
If you do not want to buffer and cannot seek, you need to push the stream contents blockwise to the two consumers. Read a block, write it twice.
If in addition you need a pull model (i.e. hand a readable stream to some component) it gets really hard and threading is involved. You'd need to write a push-to-pull adapter which is always hard.
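The push variant ("read a block, write it twice") can be sketched as below. CopyToBoth is a hypothetical helper name, the 4096-byte block size is arbitrary, and the two consumers are assumed to be writable streams; MemoryStreams stand in for the real input and consumers.

```csharp
using System;
using System.IO;

// Hypothetical helper: pushes every block read from source to both targets.
static void CopyToBoth(Stream source, Stream first, Stream second)
{
    var buffer = new byte[4096]; // block size is an arbitrary choice
    int read;
    while ((read = source.Read(buffer, 0, buffer.Length)) > 0)
    {
        // Only one block is held in memory at a time; the slower
        // consumer sets the pace for both.
        first.Write(buffer, 0, read);
        second.Write(buffer, 0, read);
    }
}

// Usage with in-memory stand-ins:
var input = new MemoryStream(new byte[] { 1, 2, 3, 4, 5 });
var a = new MemoryStream();
var b = new MemoryStream();
CopyToBoth(input, a, b);
Console.WriteLine($"{a.Length} {b.Length}"); // 5 5
```

This only works when both consumers accept pushed data. If one of them insists on being handed a readable stream, you are back to the push-to-pull adapter described above.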
The answer of usr is still correct in 2020, but for those wondering about why it is not trivial here is a short explanation.
The idea behind streams is that writing to the stream and reading from it are independent. Usually, reading is much faster than writing (think about receiving data over a network: you can read the data as soon as it arrives), so the reader typically waits for a new portion of data, processes it as soon as it arrives, drops it to free the memory, and then waits for the next portion.
This allows processing potentially infinite data stream (for example, application log stream) without using much RAM.
Suppose now we have 2 readers (as required by the question). A data portion arrives, and we have to wait for both readers to read it before we can drop it, which means it must be stored in memory until both readers are done with it. The problem is that the readers can process the data at very different speeds. E.g. one can write it to a file while the other just counts the symbols in memory. In this case either the fast one has to wait for the slow one before reading further, or we need to save the data to a buffer in memory and let the readers read from it. In the worst case we end up with a full copy of the input stream in memory, basically creating an instance of a memory stream.
To implement the first option, you would have to implement a stream reader that is aware of which of your consumers is faster and distributes and drops the data accordingly.
If you are sure you have enough memory, and processing speed is not critical, just use the memory stream:
using var memStream = new MemoryStream();
await incomingStream.CopyToAsync(memStream);
UseTheStreamForTheFirstTime(memStream);
memStream.Seek(0, SeekOrigin.Begin);
UseTheStreamAnotherTime(memStream);
Let's say I want to do non-blocking reads from a network socket.
I can async await for the socket to read x bytes and all is fine.
But how do I combine this with deserialization via protobuf?
Reading objects from a stream must be blocking? that is, if the stream contains too little data for the parser, then there has to be some blocking going on behind the scenes so that the reader can fetch all the bytes it needs.
I guess I can use lengthprefix delimiters and read the first bytes and then figure out how many bytes I have to fetch minimum before I parse, is this the right way to go about it?
E.g. if my buffer is 500 bytes, then await those 500 bytes, parse the length prefix, and if the length is more than 500 then wait again until all of it is read.
What is the idiomatic way to combine non blocking IO and protobuf parsing?
(I'm using Jon Skeet's implementation right now http://code.google.com/p/protobuf-csharp-port/)
As a general rule, serializers don't often contain a DeserializeAsync method, because that is really really hard to do (at least, efficiently). If the data is of moderate size, then I would advise to buffer the required amount of data using asynchronous code - and then deserialize when all of the required data is available. If the data is very large and you don't want to have to buffer everything in memory, then consider using a regular synchronous deserialize on a worker thread.
(Note that none of this is implementation specific, but if the serializer you are using does support an async deserialize, then sure, use that.)
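A rough sketch of the "buffer asynchronously, then deserialize synchronously" approach follows. The framing is an assumption made for illustration: a 4-byte little-endian length prefix, which is not protobuf's own varint prefix. FillAsync and ReadFrameAsync are hypothetical helper names, and the actual parser call is left out.

```csharp
using System;
using System.Buffers.Binary;
using System.IO;
using System.Threading.Tasks;

// Keeps awaiting reads until buf is completely filled.
static async Task FillAsync(Stream s, byte[] buf)
{
    int offset = 0;
    while (offset < buf.Length)
    {
        int n = await s.ReadAsync(buf, offset, buf.Length - offset);
        if (n == 0) throw new EndOfStreamException("stream ended mid-frame");
        offset += n;
    }
}

// Reads one frame: assumed 4-byte little-endian length, then the body.
static async Task<byte[]> ReadFrameAsync(Stream s)
{
    var header = new byte[4];
    await FillAsync(s, header);
    int length = BinaryPrimitives.ReadInt32LittleEndian(header);
    if (length < 0) throw new InvalidDataException("negative frame length");
    var body = new byte[length];
    await FillAsync(s, body);
    return body; // complete message: hand it to the synchronous parser
}

// Usage with an in-memory stand-in for the socket stream:
var wire = new MemoryStream();
var message = new byte[] { 10, 20, 30 };
var prefix = new byte[4];
BinaryPrimitives.WriteInt32LittleEndian(prefix, message.Length);
wire.Write(prefix, 0, prefix.Length);
wire.Write(message, 0, message.Length);
wire.Position = 0;

byte[] frame = await ReadFrameAsync(wire);
Console.WriteLine(frame.Length); // 3
```

A real receiver would also cap the accepted length, so that a corrupt or hostile prefix cannot trigger a huge allocation.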
Use the BeginReceive/EndReceive() methods to receive your data into a byte buffer (typically 1024 or 2048 bytes). In the AsyncCallback, after ensuring that you didn't read -1/0 bytes (end of stream/disconnect/IO error), attempt to deserialize the packet with Protocol Buffers.
Your receive callback will be asynchronous, and it makes sense to parse the packet in the same thread as the reading, IMHO. It's the handling that will likely cause the biggest performance hit.
I have an IInputStream that I want to read data from until I encounter a certain byte, at which point I will pass the IInputStream to some other object to consume the rest of the stream.
This is what I've come up with:
public async Task HandleInputStream(IInputStream instream)
{
    using (var dataReader = new DataReader(instream))
    {
        byte b;
        do
        {
            await dataReader.LoadAsync(1);
            b = dataReader.ReadByte();
            // Do something with the byte
        } while (b != <some condition>);
        dataReader.DetachStream();
    }
}
It seems like running LoadAsync for one byte at a time will be horribly slow. My dilemma is that if I pick a buffer size (like 1024) and load that, and my value shows up 10 bytes in, then this method will hold the remaining bytes of that buffer and will have to pass them to the next method for processing.
Is there a better way to approach this, or is this an acceptable solution?
If the value you're looking for is not too far from the beginning of the stream, this kind of reading shouldn't be that slow. How far into the stream are you expecting it? Have you tested the performance?
Depending on the type of stream you are using, you might be able to use other approaches:
If it supports seeking backwards (e.g. you're reading from a file), you could read larger chunks at once, as long as you know at what offset you found your value. You can then seek the stream to that position before you hand it off.
If that's not possible you could create another intermediate memory stream into which you would copy the remaining part of the buffer you have already read, followed by the rest of the stream. This works even if you can't seek backwards. The only problem might be memory consumption if the stream is too large.
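The intermediate-stream idea could look roughly like this. The sketch uses System.IO.Stream rather than IInputStream/DataReader to stay self-contained, which is an assumption; SplitAtMarker is a hypothetical helper name and the 1024-byte chunk size is arbitrary.

```csharp
using System;
using System.IO;

// Reads in chunks until the marker byte is found. Bytes before the marker
// go to 'head'; the returned stream holds everything after the marker.
static Stream SplitAtMarker(Stream source, byte marker, Stream head)
{
    var buffer = new byte[1024]; // chunk size is an arbitrary choice
    int read;
    while ((read = source.Read(buffer, 0, buffer.Length)) > 0)
    {
        int index = Array.IndexOf(buffer, marker, 0, read);
        if (index < 0)
        {
            head.Write(buffer, 0, read); // no marker in this chunk
            continue;
        }
        head.Write(buffer, 0, index);    // bytes before the marker
        var rest = new MemoryStream();
        rest.Write(buffer, index + 1, read - index - 1); // already over-read
        source.CopyTo(rest);             // followed by the unread remainder
        rest.Position = 0;
        return rest;
    }
    return new MemoryStream();           // marker never found: nothing left
}

// Usage with an in-memory stand-in:
var source = new MemoryStream(new byte[] { 1, 2, 0xFF, 3, 4, 5 });
var head = new MemoryStream();
Stream tail = SplitAtMarker(source, 0xFF, head);
Console.WriteLine($"{head.Length} {tail.Length}"); // 2 3
```

As noted above, copying the remainder into memory is the weak point: for a large stream the whole tail ends up buffered.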
I have a socket-based application that exposes received data with a BinaryReader object on the client side. I've been trying to debug an issue where the data contained in the reader is not clean... i.e. the buffer that I'm reading contains old data past the size of the new data.
In the code below:
System.Diagnostics.Debug.WriteLine("Stream length: {0}", _binaryReader.BaseStream.Length);
byte[] buffer = _binaryReader.ReadBytes((int)_binaryReader.BaseStream.Length);
When I comment out the first line, the data doesn't end up being dirty (or, doesn't end up being dirty as regularly) as when I have that print line statement. As far as I can tell, from the server side the data is coming in cleanly, so it's possible that my socket implementation has some issues. But does anyone have any idea why adding that print line would cause the data to be dirty more often?
Your binary reader looks like it is a private member variable (if the leading underscore is a telltale sign).
Is your application multithreaded? You could be experiencing a race condition if another thread is attempting to use your binaryReader while you are reading from it. The fact that you experience issues even without that line seems quite suspect to me.
Are you sure that your reading logic is correct? Stream.Length indicates the length of the entire stream, not of the remaining data to be read.
Suppose that, initially, 100 bytes were available. Length is 100, and BinaryReader correctly reads 100 bytes and advances the stream position by 100. Then, another 20 bytes arrive. Length is now 120; however, your BinaryReader should only be reading 20 bytes, not 120. The "extra" 100 bytes requested in the second read would either cause it to block or (if the stream is not implemented correctly) break.
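To make the 100/120 scenario concrete, here is a small sketch that requests Length minus Position instead of Length. It assumes a seekable stream that exposes both properties; a MemoryStream is used as a stand-in, and Append is a hypothetical helper that simulates data arriving.

```csharp
using System;
using System.IO;

var stream = new MemoryStream();
var reader = new BinaryReader(stream);

// Hypothetical helper: appends bytes without moving the read position.
void Append(byte[] data)
{
    long position = stream.Position;
    stream.Seek(0, SeekOrigin.End);
    stream.Write(data, 0, data.Length);
    stream.Position = position;
}

Append(new byte[100]);
// Remaining data = Length - Position, not Length.
byte[] firstBatch = reader.ReadBytes((int)(stream.Length - stream.Position));

Append(new byte[20]);
// Length is now 120, but only 20 bytes are unread.
byte[] secondBatch = reader.ReadBytes((int)(stream.Length - stream.Position));

Console.WriteLine($"{firstBatch.Length} {secondBatch.Length}"); // 100 20
```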
The problem was silly and unrelated. I believe my reading logic above is correct, however. The issue was that the _binaryReader I was using was a reference that was not owned by my class and hence the underlying stream was being rewritten with bad data.
I am using TcpClient to communicate with a server that sends information in form of "\n" delimited strings. The data flow is pretty high and once the channel is set, the stream would always have information to read from. The messages can be of variable sizes.
My question now is, would it be better to use ReadLine() method to read the messages from the stream as they are already "\n" delimited, or will it be advisable to read byteArray of some fixed size and pick up message strings from them using Split("\n") or such? (Yes, I do understand that there may be cases when the byte array gets only a part of the message, and we would have to implement logic for that too.)
Points that need to be considered here are:
Performance.
Data Loss. Will some data be lost if the client isn't reading as fast as the data is coming in?
Multi-Threaded setup. What if this setup has to be implemented in a multi-threaded environment, where each thread would have a separate communication channel, however would share the same resources on the client.
If performance is your main concern then I would prefer the Read method over ReadLine. I/O is one of the slower things a program can do, so you want to minimize the time spent in I/O routines by reading as much data as possible up front.
Data loss is not really a concern here if you are using TCP. The TCP protocol guarantees delivery and will deal with congestion issues that result in lost packets.
For the threading portion of the question we're going to need a bit more information. What resources are shared, are they sharing TcpClient's, etc ...
I would say to go with a pool of buffers and doing reads manually (Read() on the socket), if you need a lot of performance. Pooling the buffers would avoid generating garbage, as I believe ReadLine() will generate some.
Since you're using TCP, data loss should not be a problem.
In the multi-threaded setup, you will have to be specific, but in general, resource sharing is troublesome as it might generate a data race.
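A minimal sketch of the manual-read approach with a pooled buffer, including the partial-message handling mentioned in the question. Assumptions: messages are UTF-8 text, ReadMessages is a hypothetical helper name, the 4096-byte rental size is arbitrary, and a MemoryStream stands in for the TcpClient stream.

```csharp
using System;
using System.Buffers;
using System.Collections.Generic;
using System.IO;
using System.Text;

// Yields each complete '\n'-terminated message read from the stream.
static IEnumerable<string> ReadMessages(Stream source)
{
    byte[] buffer = ArrayPool<byte>.Shared.Rent(4096); // pooled: no new garbage per read
    var pending = new MemoryStream(); // bytes of a message not yet terminated
    try
    {
        int read;
        while ((read = source.Read(buffer, 0, buffer.Length)) > 0)
        {
            int start = 0;
            for (int i = 0; i < read; i++)
            {
                if (buffer[i] != (byte)'\n') continue;
                pending.Write(buffer, start, i - start);
                yield return Encoding.UTF8.GetString(
                    pending.GetBuffer(), 0, (int)pending.Length);
                pending.SetLength(0);
                start = i + 1;
            }
            // Keep the unterminated tail until the next read completes it.
            pending.Write(buffer, start, read - start);
        }
        // Anything still in 'pending' here never got its '\n' and is dropped.
    }
    finally
    {
        ArrayPool<byte>.Shared.Return(buffer);
    }
}

// Usage with an in-memory stand-in:
var data = Encoding.UTF8.GetBytes("alpha\nbeta\ngam");
var messages = new List<string>(ReadMessages(new MemoryStream(data)));
Console.WriteLine(messages.Count); // 2
```

The strings themselves are still allocated per message, so pooling only removes the buffer garbage. Whether that is measurable depends on the message rate.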
Why not use a BufferedStream to ensure you are reading optimally out of the stream:
var req = (HttpWebRequest)WebRequest.Create("http://www.stackoverflow.com");
using (var resp = (HttpWebResponse)req.GetResponse())
using (var stream = resp.GetResponseStream())
using (var bufferedStream = new BufferedStream(stream))
using (var streamReader = new StreamReader(bufferedStream))
{
    while (!streamReader.EndOfStream)
    {
        string currentLine = streamReader.ReadLine();
        Console.WriteLine(currentLine);
    }
}
Of course, if you're looking to scale, going async would be a necessity. As such, ReadLine is out of the question and you're back to byte array manipulations.
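On current .NET versions there is a middle ground worth knowing about: StreamReader.ReadLineAsync (available since .NET Framework 4.5) reads lines without blocking a thread. A hedged sketch, where CountLinesAsync is a hypothetical helper, UTF-8 is an assumed encoding, and a MemoryStream stands in for the network stream:

```csharp
using System;
using System.IO;
using System.Text;
using System.Threading.Tasks;

// Counts lines without blocking a thread while waiting for data.
static async Task<int> CountLinesAsync(Stream source)
{
    using var reader = new StreamReader(source, Encoding.UTF8);
    int count = 0;
    // ReadLineAsync returns null once the end of the stream is reached.
    while (await reader.ReadLineAsync() is string line)
    {
        count++; // a real handler would process 'line' here
    }
    return count;
}

// Usage with an in-memory stand-in:
var bytes = Encoding.UTF8.GetBytes("a\nb\nc\n");
int lines = await CountLinesAsync(new MemoryStream(bytes));
Console.WriteLine(lines); // 3
```

Whether this is fast enough for a high-volume feed would need measuring; byte-level parsing still avoids the per-line string allocations.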
I would read into a byte array... only downside: limited size. You'd need to know a certain byte amount limit, or flush the byte array into a byte collection manually from time to time, then transform the collection back into an array of bytes, decode it to a string using an Encoding, and finally split it into the real messages :p
You will have a lot of overhead flushing the array into a collection... BUT flushing into a string would require more resources, as bytes have to be decoded AS you flush them into the string... so it's up to you. You can choose simplicity with strings or efficiency with bytes. Either way, neither would be a serious performance boost over the other, but I'd personally go with the byte collection to avoid the implicit byte conversion.
Important: This comes from personal experience with previous TCP socket usage (Quake RCON stuff), not any book or anything :) Correct me if I'm mistaken, please.