Encoding and null terminated strings - c#

EDIT: I've come up with a solution, here it is for anyone else who may want it. It may be updated in the future if a bug is found or other improvements are added. Last updated on 7/18/2015.
/// <summary>
/// Decodes a string from the specified bytes in the specified encoding.
/// </summary>
/// <param name="Length">Specify -1 to read until null; otherwise, specify the number of bytes that make up the string.</param>
public static string GetString(byte[] Source, int Offset, int Length, Encoding Encoding)
{
    if (Length == 0) return string.Empty;
    var sb = new StringBuilder();
    if (Length <= -1)
    {
        using (var sr = new StreamReader(new MemoryStream(Source, Offset, Source.Length - Offset), Encoding, false))
        {
            int ch;
            while (true)
            {
                ch = sr.Read();
                if (ch <= 0) break;
                sb.Append((char)ch);
            }
            if (ch == -1) throw new EndOfStreamException("End of stream reached; null terminator not found.");
            return sb.ToString();
        }
    }
    else return Encoding.GetString(Source, Offset, Length);
}
I am upgrading my application's internal string/Encoding code and I've run into a little implementation issue.
Basically, I wanted to make an easy method, ReadNullTerminatedString. It wasn't too hard to make at first. I used Encoding.IsSingleByte to determine a single character's length, would read the byte(s), check for 0s, and stop reading/continue based on the result.
This is where it gets tricky: UTF-8 is a variable-length encoding. Encoding.IsSingleByte returns false for it, but a character can still be a single byte, so my implementation based on Encoding.IsSingleByte wouldn't work for UTF-8.
At that point I wasn't sure if that method could be corrected, so I had another idea. Just use the encoding's GetString method on the bytes, use the maximum length the string can be for the count param, and then trim the zeros off the returned string.
That too has a caveat. I have to consider cases where my managed applications will be interacting with byte arrays returned from unmanaged code; there will be a null terminator, of course, but possibly also extra junk characters after it.
For example:
"blah\0\0\oldstring"
ReadNullTerminatedString would be the ideal solution in that case, but at the moment it can't be if I want it to support UTF8. The second solution also will not work - it will trim the 0s, but the junk will remain.
Any ideas for an elegant solution for C#?

Your best solution is to use an implementation of TextReader:
StreamReader if you're reading from a stream
StringReader if you're reading from a string
With this you can read your source stream of bytes, in whatever encoding you like, and each "character" will come back to you as an int:
int ch = reader.Read();
Internally the magic is done through the C# Decoder class (which comes from your Encoding):
var decoder = Encoding.UTF7.GetDecoder();
The Decoder class needs a short array buffer. Fortunately StreamReader knows how to keep the buffer filled and everything works.
Pseudocode
Untried, untested, and only happens to look like C#:
string ReadNullTerminatedString(Stream stm, Encoding encoding)
{
    StringBuilder sb = new StringBuilder();
    TextReader rdr = new StreamReader(stm, encoding);

    int ch = rdr.Read();
    while (ch > 0) // Read() returns -1 when we've hit the end, and 0 is the null terminator
    {
        sb.Append((char)ch);
        ch = rdr.Read();
    }
    return sb.ToString();
}
Note: Any code released into public domain. No attribution required.

Related

IndexOf char within an ReadOnlySpan<byte> of UTF8 bytes

I'm looking for an efficient, allocation-free (!) implementation of
public static int IndexOf(this ReadOnlySpan<byte> utf8Bytes, char @char)
{
    // Should return the index of the first byte of @char within utf8Bytes
    // (not the character index of @char within the string)
}
I've not found a way to iterate through the span char by char yet. Utf8Parser does not have an overload supporting single characters.
And System.Text.Encoding seems to work mostly on the entire span, and does allocate internally while doing so.
Is there any builtin functionality I haven't spotted yet? If not, can anyone think of a reasonable custom implementation?
Rather than trying to iterate through the utf8Bytes character by character, it may be easier to convert the character to a short stackalloc'ed utf8 byte sequence, and search for that:
public static class StringExtensions
{
    const int MaxBytes = 4;

    public static int IndexOf(this ReadOnlySpan<byte> utf8Bytes, char @char)
    {
        Rune rune;
        try
        {
            rune = new Rune(@char);
        }
        catch (ArgumentOutOfRangeException)
        {
            // Malformed unicode character, return -1 or throw?
            return -1;
        }
        return utf8Bytes.IndexOf(rune);
    }

    public static int IndexOf(this ReadOnlySpan<byte> utf8Bytes, Rune @char)
    {
        Span<byte> charBytes = stackalloc byte[MaxBytes];
        var n = @char.EncodeToUtf8(charBytes);
        charBytes = charBytes.Slice(0, n);

        for (int i = 0, thisLength = 1; i <= utf8Bytes.Length - charBytes.Length; i += thisLength)
        {
            thisLength = Utf8ByteSequenceLength(utf8Bytes[i]);
            if (thisLength == charBytes.Length && charBytes.CommonPrefixLength(utf8Bytes.Slice(i)) == charBytes.Length)
                return i;
        }
        return -1;
    }

    static int Utf8ByteSequenceLength(byte firstByte)
    {
        // https://en.wikipedia.org/wiki/UTF-8#Encoding
        if ((firstByte & 0b11111000) == 0b11110000)      // 11110xxx
            return 4;
        else if ((firstByte & 0b11110000) == 0b11100000) // 1110xxxx
            return 3;
        else if ((firstByte & 0b11100000) == 0b11000000) // 110xxxxx
            return 2;
        return 1; // Either a 1-byte sequence (matching 0xxxxxxx) or an invalid start byte.
    }
}
Notes:
Rune is a struct introduced in .NET Core 3.x that represents a Unicode scalar value. If you need to search your utf8Bytes for a Unicode codepoint that is not in the basic multilingual plane, you will need to use Rune.
Rune has the added advantage that its method Rune.TryEncodeToUtf8() is lightweight and allocation-free.
If char @char is an invalid Unicode character, the .NET encoding algorithms will throw an exception if you attempt to construct a Rune from it. The above code catches the exception and returns -1. You may wish to rethrow the exception.
As an alternative, Rune.DecodeFromUtf8(ReadOnlySpan<Byte>, Rune, Int32) can be used to iterate through a utf8 byte span Rune by Rune. You could use that to locate an incoming Rune by index (see the sketch after these notes). However, I suspect doing so would be less efficient than the method above.
Demo fiddle here.
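As a minimal sketch of that Rune-by-Rune alternative (the helper name IndexOfRune is mine, not from the answer above):
using System;
using System.Buffers;
using System.Text;

static class RuneScan
{
    // Walks the span with Rune.DecodeFromUtf8 and returns the byte index
    // of the first occurrence of target, or -1 if absent or malformed.
    public static int IndexOfRune(ReadOnlySpan<byte> utf8Bytes, Rune target)
    {
        int byteIndex = 0;
        while (byteIndex < utf8Bytes.Length)
        {
            OperationStatus status = Rune.DecodeFromUtf8(
                utf8Bytes.Slice(byteIndex), out Rune current, out int consumed);
            if (status != OperationStatus.Done)
                return -1; // malformed or truncated sequence
            if (current == target)
                return byteIndex;
            byteIndex += consumed;
        }
        return -1;
    }
}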
You can negate allocations with stackalloc. First approximation can look like:
static (int Found, int Processed) IndexOf(ReadOnlySpan<byte> utf8Bytes, char @char)
{
    Span<char> chars = stackalloc char[utf8Bytes.Length]; // "worst" case: every byte is a separate char
    var proc = Encoding.UTF8.GetChars(utf8Bytes, chars);
    var indexOf = chars.IndexOf(@char);
    if (indexOf > 0)
    {
        Span<byte> bytes = stackalloc byte[indexOf * 4];
        var result = Encoding.UTF8.GetBytes(chars.Slice(0, indexOf), bytes);
        return (result, proc);
    }
    return (indexOf, proc);
}
There are a few notes here:
Big incoming spans can result in a stack overflow (a large stackalloc)
Decoding the whole array is not optimal
The span can contain "partial" codepoints at its start and end, so Processed should be handled accordingly
The first two points can be mitigated by processing the incoming span in slices of smaller size (for example, reading 4 bytes into a 4-char span).
Actually I believe that System.IO.Pipelines handles the same issues (via System.Buffers, I believe), though 1) it may not be completely allocation-free, and 2) I still have not investigated it enough to provide a complete working example.
From .NET 5 onwards, there's a library method EncodingExtensions.GetChars to help you.
Specifically, you want the overload that gets the byte data from a ReadOnlySpan and writes to an IBufferWriter<char>, which you can then implement to receive your characters one by one and run whatever on them (your matching algorithm, for example). This solution is allocation-free of course, as long as you put your custom buffer writer in a static field and allocate it only once.
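A minimal sketch of that approach, assuming .NET 5+ (CharScanner and its members are my own illustrative names, not a library type): a reusable IBufferWriter<char> that records the character index of the first match as the decoder fills it. Note that it yields a char index; mapping back to a byte index would need extra bookkeeping.
using System;
using System.Buffers;
using System.Text;

sealed class CharScanner : IBufferWriter<char>
{
    char[] _scratch = new char[512];
    char _target;
    int _charsSeen;

    public int FoundAt { get; private set; } = -1; // char index of first match

    public void Reset(char target) { _target = target; _charsSeen = 0; FoundAt = -1; }

    public Memory<char> GetMemory(int sizeHint = 0)
    {
        if (sizeHint > _scratch.Length) _scratch = new char[sizeHint];
        return _scratch; // scratch is reused; decoded chars are not retained
    }

    public Span<char> GetSpan(int sizeHint = 0) => GetMemory(sizeHint).Span;

    public void Advance(int count)
    {
        // Scan only the freshly decoded chunk of chars.
        if (FoundAt < 0)
        {
            int i = Array.IndexOf(_scratch, _target, 0, count);
            if (i >= 0) FoundAt = _charsSeen + i;
        }
        _charsSeen += count;
    }
}

// Usage:
// var scanner = new CharScanner();
// scanner.Reset('x');
// Encoding.UTF8.GetChars(utf8Bytes, scanner); // EncodingExtensions overload
// int charIndex = scanner.FoundAt;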

String to byte array only converts first 16 bytes according to Intellisense

I'm trying to convert a string to a byte[] using the ASCIIEncoding class in the .NET library. The string will never contain non-ASCII characters, but it will usually have a length greater than 16. My code looks like the following:
public static byte[] Encode(string packet)
{
    ASCIIEncoding enc = new ASCIIEncoding();
    byte[] byteArray = enc.GetBytes(packet);
    return byteArray;
}
By the end of the method, the byte array should be full of packet.Length number of bytes, but Intellisense tells me that all bytes after byteArray[15] are literally question marks that cannot be observed. I used Wireshark to view byteArray after I sent it and it was received on the other side fine, but the end device did not follow the instructions encoded in byteArray. I'm wondering if this has anything to do with Intellisense not being able to display all elements in byteArray, or if my packet is completely wrong.
If your packet string basically contains characters in the range 0-255, then ASCIIEncoding is not what you should be using. ASCII only defines character codes 0-127; anything in the range 128-255 will get turned into question marks (as you have observed) because those characters are not defined in ASCII.
Consider using a method like this to convert the string to a byte array. (This assumes that the ordinal value of each character is in the range 0-255 and that the ordinal value is what you want.)
public static byte[] ToOrdinalByteArray(this string str)
{
    if (str == null) { throw new ArgumentNullException("str"); }

    var bytes = new byte[str.Length];
    for (int i = 0; i < str.Length; ++i)
    {
        // Wrapping the cast in checked() will trigger an OverflowException
        // if the character being converted is out of range for a byte.
        bytes[i] = checked((byte)str[i]);
    }
    return bytes;
}
The Encoding class hierarchy is specifically designed for handling text. What you have here doesn't seem to be text, so you should avoid using these classes.
The standard encoders use the replacement character fallback strategy. If a character doesn't exist in the target character set, they encode a replacement character ('?' by default).
To me, that's worse than a silent failure; It's data corruption. I prefer that libraries tell me when my assumptions are wrong.
You can derive an encoder that throws an exception:
Encoding.GetEncoding(
    "us-ascii",
    new EncoderExceptionFallback(),
    new DecoderExceptionFallback());
If you are truly using only characters in Unicode's ASCII range then you'll never see an exception.
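A quick illustration of the difference (the string literals are just for demonstration):
using System.Text;

var strictAscii = Encoding.GetEncoding(
    "us-ascii",
    new EncoderExceptionFallback(),
    new DecoderExceptionFallback());

byte[] ok = strictAscii.GetBytes("hello");   // fine: every char is < 128
byte[] boom = strictAscii.GetBytes("héllo"); // throws EncoderFallbackException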

C# Network Stream getString method

I'm writing a library to simplify my network programming in future projects. I want it to be robust and efficient because it will be in nearly all of my projects in the future. (BTW, both the server and the client will be using my library, so I'm not assuming a protocol in my question.) I'm writing a function for receiving strings from a network stream, where I use 31 bytes of buffer and one for a sentinel. The sentinel value indicates which byte, if any, is the EOF. Here's my code for your use or scrutiny...
public string getString()
{
    string returnme = "";
    while (true)
    {
        int[] buff = new int[32];
        for (int i = 0; i < 32; i++)
        {
            buff[i] = ns.ReadByte();
        }
        if (buff[31] > 31) { /*throw some error*/ }
        for (int i = 0; i < buff[31]; i++)
        {
            returnme += (char)buff[i];
        }
        if (buff[31] != 31)
        {
            break;
        }
    }
    return returnme;
}
Edit: Is this the best way (efficient, practical, etc.) to accomplish what I'm doing?
Is this the best way (efficient, practical, etc.) to accomplish what I'm doing?
No. Firstly, you are limiting yourself to characters in the 0-255 code-point range, and that isn't enough, and secondly: serializing strings is a solved problem. Just use an Encoding, typically UTF-8. As part of a network stream, this probably means "encode the length, encode the data" and "read the length, buffer that much data, decode the data". As another note: you aren't correctly handling the EOF scenario if ReadByte() returns a negative value.
As a small corollary, note that appending to a string in a loop is never a good idea; if you did do it that way, use a StringBuilder. But don't do it that way. My code would be something more like (hey, whadya know, here's my actual string-reading code from protobuf-net, simplified a bit):
// read the length
int bytes = (int)ReadUInt32Variant(false);
if (bytes == 0) return "";
// buffer that much data
if (available < bytes) Ensure(bytes, true);
// read the string
string s = encoding.GetString(ioBuffer, ioIndex, bytes);
// update the internal buffer data
available -= bytes;
position += bytes;
ioIndex += bytes;
return s;
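If you are rolling the framing yourself rather than borrowing from protobuf-net, here is a minimal sketch of both sides under an assumed convention of a 4-byte little-endian length prefix plus UTF-8 payload (protobuf-net uses a varint length instead):
using System;
using System.Buffers.Binary;
using System.IO;
using System.Text;

static class Framing
{
    public static void WriteString(Stream stream, string value)
    {
        byte[] payload = Encoding.UTF8.GetBytes(value);
        Span<byte> header = stackalloc byte[4];
        BinaryPrimitives.WriteInt32LittleEndian(header, payload.Length);
        stream.Write(header);                     // encode the length
        stream.Write(payload, 0, payload.Length); // encode the data
    }

    public static string ReadString(Stream stream)
    {
        Span<byte> header = stackalloc byte[4];
        stream.ReadExactly(header);               // .NET 7+; loop over Read() on older targets
        int length = BinaryPrimitives.ReadInt32LittleEndian(header);
        byte[] payload = new byte[length];
        stream.ReadExactly(payload);              // buffer that much data
        return Encoding.UTF8.GetString(payload);  // decode the data
    }
}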
As a final note, I would say: if you are sending structured messages, give some serious consideration to using a pre-rolled serialization API that specialises in this stuff. For example, you could then just do something like:
var msg = new MyMessage { Name = "abc", Value = 123, IsMagic = true };
Serializer.SerializeWithLengthPrefix(networkStream, msg);
and at the other end:
var msg = Serializer.DeserializeWithLengthPrefix<MyMessage>(networkStream);
Console.WriteLine(msg.Name); // etc
Job done.
I think you should use a StringBuilder object with a fixed size for better performance.
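A sketch of what that looks like against the question's buffer loop (names and sizes are illustrative):
using System.Text;

static string BuildMessage(int[] buff, int count)
{
    // Pre-sizing the builder avoids repeated internal regrowth during Append.
    var sb = new StringBuilder(capacity: count);
    for (int i = 0; i < count; i++)
        sb.Append((char)buff[i]);
    return sb.ToString();
}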

Reading a line in c# without trimming the line delimiter character

I've got a string that I want to read line by line, but I also need to have the line delimiter character, which StringReader.ReadLine unfortunately trims (unlike in Ruby, where it is kept). What is the fastest and most robust way to accomplish this?
Alternatives I've been thinking about:
Reading the input character-by-character and checking for the line delimiter each time
Using RegExp.Split with a positive lookahead
Alternatively: I only care about the line delimiter because I need to know the actual position in the string, and the delimiter can be either one or two characters long. Therefore, being able to get back the actual position of the cursor within the string would also be good, but StringReader doesn't have this feature.
EDIT: here is my current implementation. End-of-file is designated by returning an empty string.
StringBuilder line = new StringBuilder();
int r = _input.Read();
while (r >= 0)
{
    char c = Convert.ToChar(r);
    line.Append(c);
    if (c == '\n') break;
    if (c == '\r')
    {
        int peek = _input.Peek();
        if (peek == -1) break;
        if (Convert.ToChar(peek) != '\n') break;
    }
    r = _input.Read();
}
return line.ToString();
Are you concerned about inconsistencies between files (i.e. coming from Unix/Mac vs. Windows), or within files?
One very easy optimization if you know that individual files are consistent with themselves would be to only read the first line character-by-character and figure out what the delimiter is. Then determining the exact position of any other line would be simple math.
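For instance, a sketch of that sniffing step (the helper name is mine): it finds the first line break and reports whether the delimiter is "\r\n", "\n", or "\r".
using System;

static string DetectLineDelimiter(string s)
{
    int i = s.IndexOfAny(new[] { '\r', '\n' });
    if (i < 0) return null; // no line break at all
    if (s[i] == '\n') return "\n";
    return (i + 1 < s.Length && s[i + 1] == '\n') ? "\r\n" : "\r";
}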
Failing that, I think I would go the character-by-character route. A regex seems too "clever." This sounds like a complex function and I think the most important thing would be to make it easy to write, read, understand, and most importantly debug.
There's another way to do this, which would be more efficient if your data source was a stream. Unfortunately it's not, as referenced in your comment, so you would have to create one first; however, I'll include the solution anyway, it might give you some inspiration:
public IEnumerable<int> GetLineStartIndices(string s)
{
    yield return 0;
    byte[] chars = Encoding.UTF8.GetBytes(s);
    using (MemoryStream stream = new MemoryStream(chars))
    {
        using (StreamReader reader = new StreamReader(stream, Encoding.UTF8))
        {
            while (reader.ReadLine() != null)
            {
                // Caveat: StreamReader reads ahead into an internal buffer, so
                // stream.Position can run past the end of the line just returned.
                yield return (int)stream.Position;
            }
        }
    }
}
This will give you back the start position of each new line. Obviously you can tweak this to do whatever else you need, i.e. do something else with the actual lines you read.
Just note that this has to make a copy of the string to create the byte array, so it's really not suitable for very large strings. It's a bit nicer than the char-by-char approach though, less bug-prone, so perhaps worth considering if the strings are not megabytes-long.
If you only care about the position: ReadLine() moves you to the next line. If you store the .Position of the underlying stream, you can compare it to the .Position after the following ReadLine(); the difference is the length of the string you just read plus its delimiter.
Length of the delimiter is currentPosition - previousPosition - line.Length.
That way you could easily find out if it was 1 or 2 bytes (without knowing the details, but you said you care only about the positions anyway).
File.ReadAllText will get you all of the file contents. Yup. All. So you'd better check the file size before using it.
EDIT:
read it all in then create an enumerator that yields line by line.
foreach(string line in Read("some.file"))
{ ... }
private IEnumerable<string> Read(string file)
{
    string buffer = File.ReadAllText(file);
    for (int index = 0; index < buffer.Length; index++)
    {
        string line = ...; // logic to build a "line" here
        yield return line;
    }
}
FileStream fs = new FileStream("E:\\hh.txt", FileMode.Open, FileAccess.Read);
BinaryReader read = new BinaryReader(fs);
byte[] ch = read.ReadBytes((int)fs.Length);
byte[] che = new byte[(int)fs.Length];
int size = (int)fs.Length, j = 0;
for (int i = 0; i <= (size - 1); i++)
{
    if (ch[i] != '|')
    {
        che[j] = ch[i];
        j++;
    }
}
// Decode only the j bytes actually copied, so trailing zeros are excluded.
richTextBox1.Text = Encoding.ASCII.GetString(che, 0, j);
read.Close();
fs.Close();

How can I determine if a file is binary or text in c#? [duplicate]

This question already has answers here:
C# - Check if File is Text Based
(6 answers)
Closed 7 years ago.
I need to determine, with about 80% certainty, whether a file is binary or text. Is there any way to do it, even quick and dirty/ugly, in C#?
There's a technique based on Markov chains. Scan a few model files of both kinds and, for each byte value from 0 to 255, gather stats (basically probabilities) of the subsequent value. This will give you a 256x256 profile (64K entries) you can compare your runtime files against (within a % threshold).
Supposedly, this is how browsers' Auto-Detect Encoding feature works.
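A sketch of the profiling step (the names and row normalization are my own reading of the idea): count byte-pair transitions into a 256x256 table, normalize rows to probabilities, and compare profiles between model files and candidates.
static double[,] BuildTransitionProfile(byte[] data)
{
    var profile = new double[256, 256];
    for (int i = 1; i < data.Length; i++)
        profile[data[i - 1], data[i]]++; // count each byte-pair transition

    // Normalize each row to probabilities so files of different sizes compare fairly.
    for (int prev = 0; prev < 256; prev++)
    {
        double total = 0;
        for (int next = 0; next < 256; next++) total += profile[prev, next];
        if (total > 0)
            for (int next = 0; next < 256; next++) profile[prev, next] /= total;
    }
    return profile;
}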
I would probably look for an abundance of control characters, which would typically be present in a binary file but rarely in a text file. Binary files tend to use 0 enough that just testing for many 0x00 bytes would probably be sufficient to catch most files (a sketch follows below). If you care about localization you'd need to test multi-byte patterns as well.
As stated though, you can always be unlucky and get a binary file that looks like text or vice versa.
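A quick sketch of that zero-byte test (the sample size and threshold are arbitrary choices of mine):
using System.IO;

static bool LooksBinary(string path, int sampleSize = 8000, int nulThreshold = 1)
{
    byte[] sample = new byte[sampleSize];
    int read;
    using (var fs = File.OpenRead(path))
        read = fs.Read(sample, 0, sample.Length);

    int nuls = 0;
    for (int i = 0; i < read; i++)
        if (sample[i] == 0) nuls++;
    // Single-byte text encodings rarely produce 0x00 (UTF-16 text will, though).
    return nuls >= nulThreshold;
}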
Sharing my solution in the hope that it helps others, as these posts and forums helped me.
Background
I had been researching and exploring a solution for the same problem, expecting it to be simple or only slightly twisted. However, most attempts here and in other sources provide convoluted solutions, diving into Unicode, the UTF series, BOMs, encodings, and byte orders. In the process, I also went off-road into ASCII tables and code pages.
Anyway, I have come up with a solution based on a StreamReader and a custom control-character check.
It is built taking into consideration various hints and tips provided on the forum and elsewhere, such as:
Check for lot of control characters for example looking for multiple consecutive null characters.
Check for UTF, Unicode, Encodings, BOM, Byte Orders and similar aspects.
My goal is:
It should not rely on byte orders, encodings and other more involved esoteric work.
It should be relatively easy to implement and easy to understand.
It should work on all types of files.
The solution presented works for me on test data that includes mp3, eml, txt, info, flv, mp4, pdf, gif, png, jpg. It gives results as expected so far.
How the solution works
I am relying on the StreamReader default constructor, which uses UTF8Encoding by default, to do what it can do best with respect to determining a file's encoding-related characteristics.
I created my own version of the check for a custom control-character condition because Char.IsControl does not seem useful. The documentation says:
Control characters are formatting and other non-printing characters, such as ACK, BEL, CR, FF, LF, and VT. The Unicode standard assigns code points from \U0000 to \U001F, \U007F, and from \U0080 to \U009F to control characters. These values are to be interpreted as control characters unless their use is otherwise defined by an application.
In other words, Char.IsControl considers LF and CR to be control characters, among other things. That makes it not useful here, since text files include CR and LF at least.
Solution
static void testBinaryFile(string folderPath)
{
    List<string> output = new List<string>();
    foreach (string filePath in getFiles(folderPath, true))
    {
        output.Add(isBinary(filePath).ToString() + " ---- " + filePath);
    }
    // Clipboard and TextDataFormat come from System.Windows.Forms.
    Clipboard.SetText(string.Join("\n", output), TextDataFormat.Text);
}

public static List<string> getFiles(string path, bool recursive = false)
{
    return Directory.Exists(path) ?
        Directory.GetFiles(path, "*.*",
            recursive ? SearchOption.AllDirectories : SearchOption.TopDirectoryOnly).ToList() :
        new List<string>();
}

public static bool isBinary(string path)
{
    // getSize is a small helper: the file length in bytes (e.g. new FileInfo(path).Length).
    long length = getSize(path);
    if (length == 0) return false;

    using (StreamReader stream = new StreamReader(path))
    {
        int ch;
        while ((ch = stream.Read()) != -1)
        {
            if (isControlChar(ch))
            {
                return true;
            }
        }
    }
    return false;
}

public static bool isControlChar(int ch)
{
    return (ch > Chars.NUL && ch < Chars.BS)
        || (ch > Chars.CR && ch < Chars.SUB);
}

public static class Chars
{
    public static char NUL = (char)0;  // Null char
    public static char BS = (char)8;   // Back Space
    public static char CR = (char)13;  // Carriage Return
    public static char SUB = (char)26; // Substitute
}
If you try the above solution, let me know whether it works for you.
Other interesting and related links:
About UTF and BOM on Unicode.org
Unicode sample files
How to detect encoding of text file and
Detect file encoding in Csharp
While this isn't foolproof, this should check to see if it has any binary content.
// Requires System.Linq for the Any() extension method.
public bool HasBinaryContent(string content)
{
    return content.Any(ch => char.IsControl(ch) && ch != '\r' && ch != '\n');
}
Because if any control character exists (aside from the standard \r\n), then it is probably not a text file.
If the real question here is "Can this file be read and written using StreamReader/StreamWriter without modification?", then the answer is here:
/// <summary>
/// Detect if a file is text and detect the encoding.
/// </summary>
/// <param name="encoding">
/// The detected encoding.
/// </param>
/// <param name="fileName">
/// The file name.
/// </param>
/// <param name="windowSize">
/// The number of characters to use for testing.
/// </param>
/// <returns>
/// true if the file is text.
/// </returns>
public static bool IsText(out Encoding encoding, string fileName, int windowSize)
{
    using (var fileStream = File.OpenRead(fileName))
    {
        var rawData = new byte[windowSize];
        var text = new char[windowSize];
        var isText = true;

        // Read raw bytes
        var rawLength = fileStream.Read(rawData, 0, rawData.Length);
        fileStream.Seek(0, SeekOrigin.Begin);

        // Detect the encoding from the byte order mark (adapted from Rick Strahl's blog)
        // http://www.west-wind.com/weblog/posts/2007/Nov/28/Detecting-Text-Encoding-for-StreamReader
        if (rawData[0] == 0xef && rawData[1] == 0xbb && rawData[2] == 0xbf)
        {
            encoding = Encoding.UTF8;
        }
        else if (rawData[0] == 0xff && rawData[1] == 0xfe && rawData[2] == 0 && rawData[3] == 0)
        {
            encoding = Encoding.UTF32; // UTF-32 little-endian
        }
        else if (rawData[0] == 0xff && rawData[1] == 0xfe)
        {
            encoding = Encoding.Unicode; // UTF-16 little-endian
        }
        else if (rawData[0] == 0xfe && rawData[1] == 0xff)
        {
            encoding = Encoding.BigEndianUnicode; // UTF-16 big-endian
        }
        else if (rawData[0] == 0 && rawData[1] == 0 && rawData[2] == 0xfe && rawData[3] == 0xff)
        {
            encoding = new UTF32Encoding(true, true); // UTF-32 big-endian
        }
        else if (rawData[0] == 0x2b && rawData[1] == 0x2f && rawData[2] == 0x76)
        {
            encoding = Encoding.UTF7;
        }
        else
        {
            encoding = Encoding.Default;
        }

        // Read text and detect the encoding
        using (var streamReader = new StreamReader(fileStream))
        {
            streamReader.Read(text, 0, text.Length);
        }

        using (var memoryStream = new MemoryStream())
        {
            using (var streamWriter = new StreamWriter(memoryStream, encoding))
            {
                // Write the text to a buffer
                streamWriter.Write(text);
                streamWriter.Flush();

                // Get the buffer from the memory stream for comparison
                var memoryBuffer = memoryStream.GetBuffer();

                // Compare only bytes read
                for (var i = 0; i < rawLength && isText; i++)
                {
                    isText = rawData[i] == memoryBuffer[i];
                }
            }
        }
        return isText;
    }
}
Great question! I was surprised myself that .NET does not provide an easy solution for this.
The following code worked for me to distinguish between images (png, jpg etc) and text files.
I just checked for consecutive nulls (0x00) in the first 512 bytes, as per suggestions by Ron Warholic and Adam Bruss:
if (File.Exists(path))
{
    // Is it binary? Check for consecutive nulls...
    byte[] content = File.ReadAllBytes(path);
    for (int i = 1; i < 512 && i < content.Length; i++)
    {
        if (content[i] == 0x00 && content[i - 1] == 0x00)
        {
            return Convert.ToBase64String(content);
        }
    }
    // No? Return text
    return File.ReadAllText(path);
}
Obviously this is a quick-and-dirty approach; however, it can easily be expanded by breaking the file into 10 chunks of 512 bytes each and checking each one of them for consecutive nulls (personally, I would deduce it's a binary file if 2 or 3 of them match, since nulls are rare in text files).
That should provide a pretty good solution for what you are after.
Quick and dirty is to use the file extension and look for common text extensions such as .txt. For this, you can use the Path.GetExtension call. Anything else would not really be classed as "quick", though it may well be dirty.
A really really really dirty way would be to build a regex that takes only standard text, punctuation, symbols, and whitespace characters, load up a portion of the file in a text stream, then run it against the regex. Depending on what qualifies as a pure text file in your problem domain, no successful matches would indicate a binary file.
To account for unicode, make sure to mark the encoding on your stream as such.
This is really suboptimal, but you said quick and dirty.
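A sketch of that regex route (the pattern is my own guess at "standard text"; tighten or loosen it for your domain):
using System.IO;
using System.Text;
using System.Text.RegularExpressions;

static bool LooksLikeText(string path)
{
    char[] buf = new char[4096];
    int n;
    using (var reader = new StreamReader(path, Encoding.UTF8)) // mark the encoding for unicode
        n = reader.Read(buf, 0, buf.Length);

    string sample = new string(buf, 0, n);
    // Letters, digits, punctuation, symbols, and whitespace only.
    return Regex.IsMatch(sample, @"\A[\p{L}\p{N}\p{P}\p{S}\s]*\z");
}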
http://codesnipers.com/?q=node/68 describes how to detect UTF-16 vs. UTF-8 using a Byte Order Mark (which may appear in your file). It also suggests looping through some bytes to see if they conform to the UTF-8 multi-byte sequence pattern (below) to determine if your file is a text file.
0xxxxxxx                            ASCII, < 0x80 (128)
110xxxxx 10xxxxxx                   2-byte, >= 0x80
1110xxxx 10xxxxxx 10xxxxxx          3-byte, >= 0x800
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx 4-byte, >= 0x10000
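A sketch of that conformance loop (the helper name is mine): it returns true when every byte parses as one of the patterns above.
using System;

static bool LooksLikeUtf8(ReadOnlySpan<byte> data)
{
    int i = 0;
    while (i < data.Length)
    {
        byte b = data[i];
        int len =
            b < 0x80 ? 1 :               // 0xxxxxxx
            (b & 0xE0) == 0xC0 ? 2 :     // 110xxxxx
            (b & 0xF0) == 0xE0 ? 3 :     // 1110xxxx
            (b & 0xF8) == 0xF0 ? 4 : -1; // 11110xxx, else invalid lead byte
        if (len < 0 || i + len > data.Length) return false;
        for (int k = 1; k < len; k++)
            if ((data[i + k] & 0xC0) != 0x80) return false; // continuations must be 10xxxxxx
        i += len;
    }
    return true;
}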
How about another way: determine the length of the binary array representing the file's contents, and compare it with the length of the string you get after converting that binary array to text.
If the lengths are the same, there are no "non-readable" symbols in the file, so it's text (I'm about 80% sure).
Another way is to detect the file's charset using UDE. If a charset is detected successfully, you can be sure it's text; otherwise, it's binary, because binary data has no charset.
Of course you can use a charset-detection library other than UDE. If the library is good enough, this approach could get close to 100% correctness.
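For reference, a sketch using the Ude.CSharp port of Mozilla's universal charset detector (the API is as I recall it from the package README; treat the exact member names as an assumption to verify):
using System.IO;
using Ude; // Ude.CSharp NuGet package (assumed)

static bool IsProbablyText(string path)
{
    byte[] buffer = File.ReadAllBytes(path);
    var detector = new CharsetDetector();
    detector.Feed(buffer, 0, buffer.Length);
    detector.DataEnd();
    return detector.Charset != null; // null charset => treat as binary
}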
