I'm writing a library to simplify my network programming in future projects. I'm wanting it to be robust and efficient because this will be in nearly all of my projects in the future. (BTW both the server and the client will be using my library so I'm not assuming a protocol in my question) I'm writing a function for receiving strings from a network stream where I use 31 bytes of buffer and one for sentinel. The sentinel value will indicate which byte if any is the EOF. Here's my code for your use or scrutiny...
public string getString()
{
string returnme = "";
while (true)
{
int[] buff = new int[32];
for (int i = 0; i < 32; i++)
{
buff[i] = ns.ReadByte();
}
if (buff[31] > 31) { /*throw some error*/}
for (int i = 0; i < buff[31]; i++)
{
returnme += (char)buff[i];
}
if (buff[31] != 31)
{
break;
}
}
return returnme;
}
Edit: Is this the best (efficient, practical, etc) to accomplish what I'm doing.
Is this the best (efficient, practical, etc) to accomplish what I'm doing.
No. Firstly, you are limiting yourself to characters in the 0-255 code-point range, and that isn't enough, and secondly: serializing strings is a solved problem. Just use an Encoding, typically UTF-8. As part of a network stream, this probably means "encoode the length, encode the data" and "read the length, buffer that much data, decode the data". As another note: you aren't correctly handling the EOF scenario if ReadByte() returns a negative value.
As a small corollary, note that appending to a string in a loop is never a good idea; if you did do it that way, use a StringBuilder. But don't do it that way. My code would be something more like (hey, whadya know, here's my actual string-reading code from protobuf-net, simplified a bit):
// read the length
int bytes = (int)ReadUInt32Variant(false);
if (bytes == 0) return "";
// buffer that much data
if (available < bytes) Ensure(bytes, true);
// read the string
string s = encoding.GetString(ioBuffer, ioIndex, bytes);
// update the internal buffer data
available -= bytes;
position += bytes;
ioIndex += bytes;
return s;
As a final note, I would say: if you are sending structured messages, give some serious consideration to using a pre-rolled serialization API that specialises in this stuff. For example, you could then just do something like:
var msg = new MyMessage { Name = "abc", Value = 123, IsMagic = true };
Serializer.SerializeWithLengthPrefix(networkStream, msg);
and at the other end:
var msg = Serializer.DeserializeWithLengthPrefix<MyMessage>(networkStream);
Console.WriteLine(msg.Name); // etc
Job done.
I think tou should use a StringBuilder object with fixed size for better performance.
Related
I'm looking for an efficient, allocation-free (!) implementation of
public static int IndexOf(this ReadOnlySpan<byte> utf8Bytes, char #char)
{
// Should return the index of the first byte of #char within utf8Bytes
// (not the character index of #char within the string)
}
I've not found a way to iterate through the span char by char yet. Utf8Parser does not have an overload supporting single characters.
And System.Text.Encoding seems to work mostly on the entire span, and does allocate internally while doing so.
Is there any builtin functionality I haven't spotted yet? If not, can anyone think of a reasonable custom implementation?
Rather than trying to iterate through the utf8Bytes character by character, it may be easier to convert the character to a short stackalloc'ed utf8 byte sequence, and search for that:
public static class StringExtensions
{
const int MaxBytes = 4;
public static int IndexOf(this ReadOnlySpan<byte> utf8Bytes, char #char)
{
Rune rune;
try
{
rune = new Rune(#char);
}
catch (ArgumentOutOfRangeException)
{
// Malformed unicode character, return -1 or throw?
return -1;
}
return utf8Bytes.IndexOf(rune);
}
public static int IndexOf(this ReadOnlySpan<byte> utf8Bytes, Rune #char)
{
Span<byte> charBytes = stackalloc byte[MaxBytes];
var n = #char.EncodeToUtf8(charBytes);
charBytes = charBytes.Slice(0, n);
for (int i = 0, thisLength = 1; i <= utf8Bytes.Length - charBytes.Length; i += thisLength)
{
thisLength = Utf8ByteSequenceLength(utf8Bytes[i]);
if (thisLength == charBytes.Length && charBytes.CommonPrefixLength(utf8Bytes.Slice(i)) == charBytes.Length)
return i;
}
return -1;
}
static int Utf8ByteSequenceLength(byte firstByte)
{
//https://en.wikipedia.org/wiki/UTF-8#Encoding
if ( (firstByte & 0b11111000) == 0b11110000) // 11110xxx
return 4;
else if ((firstByte & 0b11110000) == 0b11100000) // 1110xxxx
return 3;
else if ((firstByte & 0b11100000) == 0b11000000) // 110xxxxx
return 2;
return 1; // Either a 1-byte sequence (matching 0xxxxxxx) or an invalid start byte.
}
}
Notes:
Rune is a struct introduced in .NET Core 3.x that represents a Unicode scalar value. If you need to search your utf8Bytes for a Unicode codepoint that is not in the basic multilingual plane, you will need to use Rune.
Rune has the added advantage that its method Rune.TryEncodeToUtf8() is lightweight and allocation-free.
If char #char is an invalid Unicode character, the .NET encoding algorithms will throw an exception if you attempt to construct a Rune from it. The above code catches the exception and returns -1. You may wish to rethrow the exception.
As an alternative, Rune.DecodeFromUtf8(ReadOnlySpan<Byte>, Rune, Int32) can be used to iterate through a utf8 byte span Rune by Rune. You could use that to locate an incoming Rune by index. However, I suspect doing so would be less efficient than the method above.
Demo fiddle here.
You can negate allocations with stackalloc. First approximation can look like:
static (int Found, int Processed) IndexOf(ReadOnlySpan<byte> utf8Bytes, char #char)
{
Span<char> chars = stackalloc char[utf8Bytes.Length]; // "worst" case every byte is a separate char
var proc = Encoding.UTF8.GetChars(utf8Bytes, chars);
var indexOf = chars.IndexOf(#char);
if (indexOf > 0)
{
Span<byte> bytes = stackalloc byte[indexOf * 4];
var result = Encoding.UTF8.GetBytes(chars.Slice(0, indexOf), bytes);
return (result, proc);
}
return (indexOf, proc);
}
There are few notes here:
Big incoming spans can result in SO
Decoding the whole array is not optimal
Span can contain "partial" codepoints at start and end so Processed should be processed accordingly
First two points can be mitigated by processing the incoming span in slices of smaller size (for example reading 4 bytes into 4 chars spand).
Actually I believe that System.IO.Pipelines handles the same issues (via System.Buffers I believe) though it 1) it can be not completely allocation free I believe 2) I still have not investigated it that much so would not be able to provide a completely working example.
From .NET 5 onwards, there's a library method EncodingExtensions.GetChars to help you.
Specifically, you want the overload that gets the byte data from a ReadOnlySpan and writes to an IBufferWriter<char>, which you can then implement to receive your characters one by one and run whatever on them (your matching algorithm, for example). This solution is allocation-free of course, as long as you put your custom buffer writer in a static field and allocate it only once.
EDIT: I've come up with a solution, here it is for anyone else who may want it. It may be updated in the future if a bug is found or other improvements are added. Last updated on 7/18/2015.
/// <summary>
/// Decodes a string from the specified bytes in the specified encoding.
/// </summary>
/// <param name="Length">Specify -1 to read until null, otherwise, specify the amount of bytes that make up the string.</param>
public static string GetString(byte[] Source, int Offset, int Length, Encoding Encoding)
{
if (Length == 0) return string.Empty;
var sb = new StringBuilder();
if (Length <= -1)
{
using (var sr = new StreamReader(new MemoryStream(Source, Offset, Source.Length - Offset), Encoding, false))
{
int ch;
while (true)
{
ch = sr.Read();
if (ch <= 0) break;
sb.Append((char)ch);
}
if (ch == -1) throw new Exception("End of stream reached; null terminator not found.");
return sb.ToString();
}
}
else return Encoding.GetString(Source, Offset, Length);
}
I am upgrading my application's internal string/Encoding code and I've run into a little implementation issue.
Basically, I wanted to make an easy method, ReadNullTerminatedString. It wasn't too hard to make at first. I used Encoding.IsSingleByte to determine a single character's length, would read the byte(s), check for 0s, and stop reading/continue based on the result.
This is where it gets tricky. UTF8 has variable length encoding. Encoding.IsSingleByte returns false, but that is not always correct since it's a variable encoding and a character can be 1 byte, so my implementation based on Encoding.IsSingleByte wouldn't work for UTF8.
At that point I wasn't sure if that method could be corrected, so I had another idea. Just use the encoding's GetString method on the bytes, use the maximum length the string can be for the count param, and then trim the zeros off the returned string.
That too has a caveat. I have to consider cases where my managed applications will be interacting with byte arrays returned from unmanaged code, cases where there will be a null terminator, of course, but the possibility of having extra junk characters after it.
For example:
"blah\0\0\oldstring"
ReadNullTerminatedString would be the ideal solution in that case, but at the moment it can't be if I want it to support UTF8. The second solution also will not work - it will trim the 0s, but the junk will remain.
Any ideas for an elegant solution for C#?
Your best solution is to use an implementation of TextReader:
StreamReader if you're reading from a stream
StringReader if you're reading from a string
With this you can read your source stream of bytes, in whatever encoding you like, and each "character" will come back to you as an int:
int ch = reader.Read();
Internally the magic is done through the C# Decoder class (which comes from your Encoding):
var decoder = Encoding.UTF7.GetDecoder();
The Decoder class needs a short array buffer. Fortunately StreamReader knows how to keep the buffer filled and everything work.
Pseudocode
Untried, untested, and only happens to look like C#:
String ReadNullTerminatedString(Stream stm, Encoding encoding)
{
StringBuilder sb = new StringBuilder();
TextReader rdr = new StreamReader(stm, encoding);
int ch = rdr.Read();
while (ch > 0) //returns -1 when we've hit the end, and 0 is null
{
sb.AppendChar(Char(ch));
int ch = rdr.Read();
}
return sb.ToString();
}
Note: Any code released into public domain. No attribution required.
I am trying to read a byte[] in my Java application sent from a C# application.
However, when I encode the string to byte[] in C# and reads it in Java, using the code below, I get all the characters but the last. Why is that?
Java receiving code:
int data = streamFromClient.read();
while(data != -1){
char theChar = (char) data;
data = streamFromClient.read();
System.out.println("" + theChar);
}
C# sending code:
public void WriteMessage(string msg){
byte[] msgBuffer = Encoding.Default.GetBytes(msg);
sck.Send(msgBuffer, 0, msgBuffer.Length, 0);
}
While others already have provided a possible solution, i want to show how i would solve this.
int data;
while((data=streamFromClient.read()) != -1) {
char theChar = (char) data;
System.out.println("" + theChar);
}
I just feel like this approach would be a little more clear. Feel free to choose which you like better.
To clarify: There is an error in your receiving code. You read the last byte but you never process it. On every iteration you print out the value of the byte received in the previous iteration. So the last byte would be processed when data is -1 but you don't enter the loop so it isn't printed.
You need to read the data untill unless it is becomes -1.
so you need to move the reading statement inside infinite loop i.e. while(true) and comeout from it once if the result is -1.
Try This:
int data = -1;
while(true){
data=streamFromClient.read();
if(data==-1)
{
break;
}
char theChar = (char) data;
System.out.println("" + theChar);
}
I thought that your while loop might have been off by 1 iteration somehow but it seems to me that your C# program is simply not sending all the characters.
I'm writing an application that needs to verify HMAC-SHA256 checksums. The code I currently have looks something like this:
static bool VerifyIntegrity(string secret, string checksum, string data)
{
// Verify HMAC-SHA256 Checksum
byte[] key = System.Text.Encoding.UTF8.GetBytes(secret);
byte[] value = System.Text.Encoding.UTF8.GetBytes(data);
byte[] checksum_bytes = System.Text.Encoding.UTF8.GetBytes(checksum);
using (var hmac = new HMACSHA256(key))
{
byte[] expected_bytes = hmac.ComputeHash(value);
return checksum_bytes.SequenceEqual(expected_bytes);
}
}
I know that this is susceptible to timing attacks.
Is there a message digest comparison function in the standard library? I realize I could write my own time hardened comparison method, but I have to believe that this is already implemented elsewhere.
EDIT: Original answer is below - still worth reading IMO, but regarding the timing attack...
The page you referenced gives some interesting points about compiler optimizations. Given that you know the two byte arrays will be the same length (assuming the size of the checksum isn't particularly secret, you can immediately return if the lengths are different) you might try something like this:
public static bool CompareArraysExhaustively(byte[] first, byte[] second)
{
if (first.Length != second.Length)
{
return false;
}
bool ret = true;
for (int i = 0; i < first.Length; i++)
{
ret = ret & (first[i] == second[i]);
}
return ret;
}
Now that still won't take the same amount of time for all inputs - if the two arrays are both in L1 cache for example, it's likely to be faster than if it has to be fetched from main memory. However, I suspect that is unlikely to cause a significant issue from a security standpoint.
Is this okay? Who knows. Different processors and different versions of the CLR may take different amounts of time for an & operation depending on the two operands. Basically this is the same as the conclusion of the page you referenced - that it's probably as good as we'll get in a portable way, but that it would require validation on every platform you try to run on.
At least the above code only uses relatively simple operations. I would personally avoid using LINQ operations here as there could be sneaky optimizations going on in some cases. I don't think there would be in this case - or they'd be easy to defeat - but you'd at least have to think about them. With the above code, there's at least a reasonably close relationship between the source code and IL - leaving "only" the JIT compiler and processor optimizations to worry about :)
Original answer
There's one significant problem with this: in order to provide the checksum, you have to have a string whose UTF-8 encoded form is the same as the checksum. There are plenty of byte sequences which simply don't represent UTF-8-encoded text. Basically, trying to encode arbitrary binary data as text using UTF-8 is a bad idea.
Base64, on the other hand, is basically designed for this:
static bool VerifyIntegrity(string secret, string checksum, string data)
{
// Verify HMAC-SHA256 Checksum
byte[] key = Encoding.UTF8.GetBytes(secret);
byte[] value = Encoding.UTF8.GetBytes(data);
byte[] checksumBytes = Convert.FromBase64String(checksum);
using (var hmac = new HMACSHA256(key))
{
byte[] expectedBytes = hmac.ComputeHash(value);
return checksumBytes.SequenceEqual(expectedBytes);
}
}
On the other hand, instead of using SequenceEqual on the byte array, you could Base64 encode the actual hash, and see whether that matches:
static bool VerifyIntegrity(string secret, string checksum, string data)
{
// Verify HMAC-SHA256 Checksum
byte[] key = Encoding.UTF8.GetBytes(secret);
byte[] value = Encoding.UTF8.GetBytes(data);
using (var hmac = new HMACSHA256(key))
{
return checksum == Convert.ToBase64String(hmac.ComputeHash(value));
}
}
I don't know of anything better within the framework. It wouldn't be too hard to write a specialized SequenceEqual operator for arrays (or general ICollection<T> implementations) which checked for equal lengths first... but given that the hashes are short, I wouldn't worry about that.
If you're worried about the timing of the SequenceEqual, you could always replace it with something like this:
checksum_bytes.Zip( expected_bytes, (a,b) => a == b ).Aggregate( true, (a,r) => a && r );
This returns the same result as SequenceEquals but always check every element before given an answer this less chance of revealing anything through a timing attack.
How it is susceptible to timing attacks? Your code works the same amount of time in the case of valid or invalid digest. And calculate digest/check digest looks like the easiest way to check this.
I'm trying to do some parsing that will be easier using regular expressions.
The input is an array (or enumeration) of bytes.
I don't want to convert the bytes to chars for the following reasons:
Computation efficiency
Memory consumption efficiency
Some non-printable bytes might be complex to convert to chars. Not all the bytes are printable.
So I can't use Regex.
The only solution I know, is using Boost.Regex (which works on bytes - C chars), but this is a C++ library that wrapping using C++/CLI will take considerable work.
How can I use regular expressions on bytes in .NET directly, without working with .NET strings and chars?
Thank you.
There is a bit of impedance mismatch going on here. You want to work with Regular expressions in .Net which use strings (multi-byte characters), but you want to work with single byte characters. You can't have both at the same time using .Net as per usual.
However, to break this mismatch down, you could deal with a string in a byte oriented fashion and mutate it. The mutated string can then act as a re-usable buffer. In this way you will not have to convert bytes to chars, or convert your input buffer to a string (as per your question).
An example:
//BLING
byte[] inputBuffer = { 66, 76, 73, 78, 71 };
string stringBuffer = new string('\0', 1000);
Regex regex = new Regex("ING", RegexOptions.Compiled);
unsafe
{
fixed (char* charArray = stringBuffer)
{
byte* buffer = (byte*)(charArray);
//Hard-coded example of string mutation, in practice you would
//loop over your input buffers and regex\match so that the string
//buffer is re-used.
buffer[0] = inputBuffer[0];
buffer[2] = inputBuffer[1];
buffer[4] = inputBuffer[2];
buffer[6] = inputBuffer[3];
buffer[8] = inputBuffer[4];
Console.WriteLine("Mutated string:'{0}'.",
stringBuffer.Substring(0, inputBuffer.Length));
Match match = regex.Match(stringBuffer, 0, inputBuffer.Length);
Console.WriteLine("Position:{0} Length:{1}.", match.Index, match.Length);
}
}
Using this technique you can allocate a string "buffer" which can be re-used as the input to Regex, but you can mutate it with your bytes each time. This avoids the overhead of converting\encoding your byte array into a new .Net string each time you want to do a match. This could prove to be very significant as I have seen many an algorithm in .Net try to go at a million miles an hour only to be brought to its knees by string generation and the subsequent heap spamming and time spent in GC.
Obviously this is unsafe code, but it is .Net.
The results of the Regex will generate strings though, so you have an issue here. I'm not sure if there is a way of using Regex that will not generate new strings. You can certainly get at the match index and length information but the string generation violates your requirements for memory efficiency.
Update
Actually after disassembling Regex\Match\Group\Capture, it looks like it only generates the captured string when you access the Value property, so you may at least not be generating strings if you only access index and length properties. However, you will be generating all the supporting Regex objects.
Well, if I faced this problem, I would DO the C++/CLI wrapper, except I'd create specialized code for what I want to achieve. Eventually develop the wrapper with time to do general things, but this just an option.
The first step is to wrap the Boost::Regex input and output only. Create specialized functions in C++ that do all the stuff you want and use CLI just to pass the input data to the C++ code and then fetch the result back with the CLI. This doesn't look to me like too much work to do.
Update:
Let me try to clarify my point. Even though I may be wrong, I believe you wont be able to find any .NET Binary Regex implementation that you could use. That is why - whether you like it or not - you will be forced to choose between CLI wrapper and bytes-to-chars conversion to use .NET's Regex. In my opinion the wrapper is better choice, because it will be working faster. I did not do any benchmarking, this is just an assumption based on:
Using wrapper you just have to cast
the pointer type (bytes <-> chars).
Using .NET's Regex you have to
convert each byte of the input.
As an alternative to using unsafe, just consider writing a simple, recursive comparer like:
static bool Evaluate(byte[] data, byte[] sequence, int dataIndex=0, int sequenceIndex=0)
{
if (sequence[sequenceIndex] == data[dataIndex])
{
if (sequenceIndex == sequence.Length - 1)
return true;
else if (dataIndex == data.Length - 1)
return false;
else
return Evaluate(data, sequence, dataIndex + 1, sequenceIndex + 1);
}
else
{
if (dataIndex < data.Length - 1)
return Evaluate(data, sequence, dataIndex+1, 0);
else
return false;
}
}
You could improve efficiency in a number of ways (i.e. seeking the first byte match instead of iterating, etc.) but this could get you started... hope it helps.
I personally went a different approach and wrote a small state machine that can be extended. I believe if parsing protocol data this is much more readable than regex.
bool ParseUDSResponse(PassThruMsg rxMsg, UDScmd.Mode txMode, byte txSubFunction, out UDScmd.Response functionResponse, out byte[] payload)
{
payload = new byte[0];
functionResponse = UDScmd.Response.UNKNOWN;
bool positiveReponse = false;
var rxMsgBytes = rxMsg.GetBytes();
//Iterate the reply bytes to find the echod ECU index, response code, function response and payload data if there is any
//If we could use some kind of HEX regex this would be a bit neater
//Iterate until we get past any and all null padding
int stateMachine = 0;
for (int i = 0; i < rxMsgBytes.Length; i++)
{
switch (stateMachine)
{
case 0:
if (rxMsgBytes[i] == 0x07) stateMachine = 1;
break;
case 1:
if (rxMsgBytes[i] == 0xE8) stateMachine = 2;
else return false;
case 2:
if (rxMsgBytes[i] == (byte)txMode + (byte)OBDcmd.Reponse.SUCCESS)
{
//Positive response to the requested mode
positiveReponse = true;
}
else if(rxMsgBytes[i] != (byte)OBDcmd.Reponse.NEGATIVE_RESPONSE)
{
//This is an invalid response, give up now
return false;
}
stateMachine = 3;
break;
case 3:
functionResponse = (UDScmd.Response)rxMsgBytes[i];
if (positiveReponse && rxMsgBytes[i] == txSubFunction)
{
//We have a positive response and a positive subfunction code (subfunction is reflected)
int payloadLength = rxMsgBytes.Length - i;
if(payloadLength > 0)
{
payload = new byte[payloadLength];
Array.Copy(rxMsgBytes, i, payload, 0, payloadLength);
}
return true;
} else
{
//We had a positive response but a negative subfunction error
//we return the function error code so it can be relayed
return false;
}
default:
return false;
}
}
return false;
}