I was searching for a way to check whether I've reached the end of a file with my binary reader, and one suggestion was to use PeekChar, as such:
while (inFile.PeekChar() > 0)
{
...
}
However, it looks like I've run into an issue:
Unhandled Exception: System.ArgumentException: The output char buffer is too small to contain the decoded characters, encoding 'Unicode (UTF-8)' fallback 'System.Text.DecoderReplacementFallback'.
Parameter name: chars
   at System.Text.Encoding.ThrowCharsOverflow()
   at System.Text.Encoding.ThrowCharsOverflow(DecoderNLS decoder, Boolean nothingDecoded)
   at System.Text.UTF8Encoding.GetChars(Byte* bytes, Int32 byteCount, Char* chars, Int32 charCount, DecoderNLS baseDecoder)
   at System.Text.DecoderNLS.GetChars(Byte* bytes, Int32 byteCount, Char* chars, Int32 charCount, Boolean flush)
   at System.Text.DecoderNLS.GetChars(Byte[] bytes, Int32 byteIndex, Int32 byteCount, Char[] chars, Int32 charIndex, Boolean flush)
   at System.Text.DecoderNLS.GetChars(Byte[] bytes, Int32 byteIndex, Int32 byteCount, Char[] chars, Int32 charIndex)
   at System.IO.BinaryReader.InternalReadOneChar()
   at System.IO.BinaryReader.PeekChar()
So maybe PeekChar isn't the best way to do it. I don't think it should even be used this way, since I want to check the reader's current position rather than what the next character is supposed to be.
There is a more accurate way to check for EOF when working with binary data. It avoids all of the encoding issues that come with the PeekChar approach and checks exactly what is needed: whether the reader's position is at the end of the file.
while (inFile.BaseStream.Position != inFile.BaseStream.Length)
{
...
}
You can wrap this in a custom extension method that extends the BinaryReader class with the missing EOF method:
public static class StreamEOF {
    public static bool EOF(this BinaryReader binaryReader) {
        var bs = binaryReader.BaseStream;
        return bs.Position == bs.Length;
    }
}
So now you can just write:
while (!infile.EOF()) {
// Read....
}
:)
... assuming you have created infile somewhere like this:
var infile = new BinaryReader(File.OpenRead("data.bin")); // BinaryReader needs a stream; the file name is just an example
Note: var is implicit typing.
Happy to have found it - it's another puzzle piece for well-styled code in C#. :D
I suggest something very similar to @MxLDevs' answer, but with a '<' operator rather than a '!=' operator. As it is possible to set Position to anything you want (within long's confines), this stops the loop from attempting to access an invalid file Position.
while (inFile.BaseStream.Position < inFile.BaseStream.Length)
{
...
}
This works for me:
using (BinaryReader br = new BinaryReader(File.Open(fileName, FileMode.Open))) {
    //int pos = 0;
    //int length = (int)br.BaseStream.Length;
    while (br.BaseStream.Position != br.BaseStream.Length) {
        string nume = br.ReadString();
        string prenume = br.ReadString();
        Persoana p = new Persoana(nume, prenume);
        myArrayList.Add(p);
        Console.WriteLine("ADAUGAT XXX: " + p.ToString());
        //pos++;
    }
}
I'll add my suggestion: if you don't need the "encoding" part of the BinaryReader (that is, you don't use the various ReadChar/ReadChars/ReadString methods), then you can use an encoding that never throws and is always one byte per char. Encoding.GetEncoding("iso-8859-1") is perfect for this; you pass it as a parameter of the BinaryReader constructor. The iso-8859-1 encoding is a one-byte-per-character encoding that maps the first 256 code points of Unicode 1:1 (so byte 254 is char 254, for example).
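A minimal sketch of this, assuming the data is then read byte-wise (the file name is just a placeholder):

using System.IO;
using System.Text;

// With a one-byte-per-char encoding, PeekChar() can never throw on partial
// multi-byte sequences, so the peek-based EOF loop becomes safe.
using (var inFile = new BinaryReader(
    File.OpenRead("data.bin"),              // placeholder file name
    Encoding.GetEncoding("iso-8859-1")))    // maps bytes 0-255 to chars 1:1
{
    while (inFile.PeekChar() >= 0)          // PeekChar() returns -1 at end of stream
    {
        byte b = inFile.ReadByte();
        // ... process b ...
    }
}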
I'm looking for an efficient, allocation-free (!) implementation of
public static int IndexOf(this ReadOnlySpan<byte> utf8Bytes, char @char)
{
    // Should return the index of the first byte of @char within utf8Bytes
    // (not the character index of @char within the string)
}
I've not found a way to iterate through the span char by char yet. Utf8Parser does not have an overload supporting single characters.
And System.Text.Encoding seems to work mostly on the entire span, and does allocate internally while doing so.
Is there any builtin functionality I haven't spotted yet? If not, can anyone think of a reasonable custom implementation?
Rather than trying to iterate through the utf8Bytes character by character, it may be easier to convert the character to a short stackalloc'ed utf8 byte sequence, and search for that:
public static class StringExtensions
{
    const int MaxBytes = 4;

    public static int IndexOf(this ReadOnlySpan<byte> utf8Bytes, char @char)
    {
        Rune rune;
        try
        {
            rune = new Rune(@char);
        }
        catch (ArgumentOutOfRangeException)
        {
            // Malformed unicode character, return -1 or throw?
            return -1;
        }
        return utf8Bytes.IndexOf(rune);
    }

    public static int IndexOf(this ReadOnlySpan<byte> utf8Bytes, Rune @char)
    {
        Span<byte> charBytes = stackalloc byte[MaxBytes];
        var n = @char.EncodeToUtf8(charBytes);
        charBytes = charBytes.Slice(0, n);
        for (int i = 0, thisLength = 1; i <= utf8Bytes.Length - charBytes.Length; i += thisLength)
        {
            thisLength = Utf8ByteSequenceLength(utf8Bytes[i]);
            if (thisLength == charBytes.Length && charBytes.CommonPrefixLength(utf8Bytes.Slice(i)) == charBytes.Length)
                return i;
        }
        return -1;
    }

    static int Utf8ByteSequenceLength(byte firstByte)
    {
        // https://en.wikipedia.org/wiki/UTF-8#Encoding
        if ((firstByte & 0b11111000) == 0b11110000)      // 11110xxx
            return 4;
        else if ((firstByte & 0b11110000) == 0b11100000) // 1110xxxx
            return 3;
        else if ((firstByte & 0b11100000) == 0b11000000) // 110xxxxx
            return 2;
        return 1; // Either a 1-byte sequence (matching 0xxxxxxx) or an invalid start byte.
    }
}
Notes:
Rune is a struct introduced in .NET Core 3.x that represents a Unicode scalar value. If you need to search your utf8Bytes for a Unicode codepoint that is not in the basic multilingual plane, you will need to use Rune.
Rune has the added advantage that its method Rune.TryEncodeToUtf8() is lightweight and allocation-free.
If char @char is an invalid Unicode character, the .NET encoding algorithms will throw an exception if you attempt to construct a Rune from it. The above code catches the exception and returns -1. You may wish to rethrow the exception instead.
As an alternative, Rune.DecodeFromUtf8(ReadOnlySpan<Byte>, Rune, Int32) can be used to iterate through a utf8 byte span Rune by Rune; you could use that to locate an incoming Rune by index, as sketched below. However, I suspect doing so would be less efficient than the method above.
Demo fiddle here.
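For reference, a minimal sketch of that Rune-by-Rune alternative (the method name is mine):

using System;
using System.Buffers;
using System.Text;

public static class RuneSearch
{
    // Walk the span Rune by Rune and return the byte index where the target starts.
    public static int IndexOfByDecoding(ReadOnlySpan<byte> utf8Bytes, Rune target)
    {
        int pos = 0;
        while (pos < utf8Bytes.Length)
        {
            OperationStatus status = Rune.DecodeFromUtf8(utf8Bytes.Slice(pos), out Rune rune, out int consumed);
            if (status != OperationStatus.Done)
                return -1; // invalid or truncated sequence; bail out here by choice
            if (rune == target)
                return pos; // byte index of the first byte of the match
            pos += consumed;
        }
        return -1;
    }
}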
You can avoid allocations with stackalloc. A first approximation can look like this:
static (int Found, int Processed) IndexOf(ReadOnlySpan<byte> utf8Bytes, char @char)
{
    Span<char> chars = stackalloc char[utf8Bytes.Length]; // "worst" case: every byte is a separate char
    var proc = Encoding.UTF8.GetChars(utf8Bytes, chars);
    var indexOf = chars.IndexOf(@char);
    if (indexOf > 0)
    {
        Span<byte> bytes = stackalloc byte[indexOf * 4];
        var result = Encoding.UTF8.GetBytes(chars.Slice(0, indexOf), bytes);
        return (result, proc);
    }
    return (indexOf, proc);
}
There are a few notes here:
Big incoming spans can result in a stack overflow
Decoding the whole array is not optimal
The span can contain "partial" codepoints at its start and end, so Processed should be interpreted accordingly
The first two points can be mitigated by processing the incoming span in smaller slices (for example, reading 4 bytes into a 4-char span).
I believe System.IO.Pipelines (via System.Buffers) handles these same issues, though 1) it may not be completely allocation-free and 2) I have not investigated it enough to provide a complete working example.
From .NET 5 onwards, there's a library method EncodingExtensions.GetChars to help you.
Specifically, you want the overload that gets the byte data from a ReadOnlySpan and writes to an IBufferWriter<char>, which you can then implement to receive your characters one by one and run whatever on them (your matching algorithm, for example). This solution is allocation-free of course, as long as you put your custom buffer writer in a static field and allocate it only once.
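For illustration, a minimal sketch of such a writer (the class name and scanning logic are mine, not part of the library):

using System;
using System.Buffers;

// A reusable IBufferWriter<char> that scans decoded chars for a target
// character instead of storing them. Reports the char index, not the byte index.
sealed class CharScanner : IBufferWriter<char>
{
    char[] _buffer = new char[256];
    int _written;
    public char Target;
    public int FoundAt = -1;

    public void Advance(int count)
    {
        // GetMemory always hands out the start of _buffer, so the freshly
        // decoded chars are _buffer[0..count].
        if (FoundAt < 0)
        {
            int i = Array.IndexOf(_buffer, Target, 0, count);
            if (i >= 0) FoundAt = _written + i;
        }
        _written += count;
    }

    public Memory<char> GetMemory(int sizeHint = 0)
    {
        if (sizeHint > _buffer.Length) _buffer = new char[sizeHint];
        return _buffer;
    }

    public Span<char> GetSpan(int sizeHint = 0) => GetMemory(sizeHint).Span;
}

Usage would then look something like Encoding.UTF8.GetChars(utf8Bytes, scanner) (the EncodingExtensions overload) followed by reading scanner.FoundAt. Note this yields a character index, so mapping back to the byte index the question asks for still needs extra bookkeeping.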
EDIT: I've come up with a solution; here it is for anyone else who may want it. It may be updated in the future if a bug is found or other improvements are added. Last updated on 7/18/2015.
/// <summary>
/// Decodes a string from the specified bytes in the specified encoding.
/// </summary>
/// <param name="Length">Specify -1 to read until null, otherwise, specify the amount of bytes that make up the string.</param>
public static string GetString(byte[] Source, int Offset, int Length, Encoding Encoding)
{
    if (Length == 0) return string.Empty;
    var sb = new StringBuilder();
    if (Length <= -1)
    {
        using (var sr = new StreamReader(new MemoryStream(Source, Offset, Source.Length - Offset), Encoding, false))
        {
            int ch;
            while (true)
            {
                ch = sr.Read();
                if (ch <= 0) break;
                sb.Append((char)ch);
            }
            if (ch == -1) throw new Exception("End of stream reached; null terminator not found.");
            return sb.ToString();
        }
    }
    else return Encoding.GetString(Source, Offset, Length);
}
I am upgrading my application's internal string/Encoding code and I've run into a little implementation issue.
Basically, I wanted to make an easy method, ReadNullTerminatedString. It wasn't too hard to make at first: I used Encoding.IsSingleByte to determine a single character's length, read the byte(s), checked for 0s, and stopped or continued reading based on the result.
This is where it gets tricky. UTF-8 is a variable-length encoding. Encoding.IsSingleByte returns false, but that doesn't tell the whole story: a character can still be a single byte, so my implementation based on Encoding.IsSingleByte wouldn't work for UTF-8.
At that point I wasn't sure the method could be corrected, so I had another idea: just use the encoding's GetString method on the bytes, pass the maximum length the string can be for the count parameter, and then trim the zeros off the returned string.
That too has a caveat. I have to consider cases where my managed applications interact with byte arrays returned from unmanaged code; there will of course be a null terminator, but possibly also extra junk characters after it.
For example:
"blah\0\0\oldstring"
ReadNullTerminatedString would be the ideal solution in that case, but at the moment it can't be, not if I want it to support UTF-8. The second solution won't work either - it will trim the 0s, but the junk will remain.
Any ideas for an elegant solution for C#?
Your best solution is to use an implementation of TextReader:
StreamReader if you're reading from a stream
StringReader if you're reading from a string
With this you can read your source stream of bytes, in whatever encoding you like, and each "character" will come back to you as an int:
int ch = reader.Read();
Internally the magic is done through the C# Decoder class (which comes from your Encoding):
var decoder = Encoding.UTF7.GetDecoder();
The Decoder class needs a small array buffer. Fortunately, StreamReader knows how to keep that buffer filled, and everything works.
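For illustration, a minimal sketch of driving a Decoder by hand; the key point is that the decoder is stateful and buffers partial multi-byte sequences between calls:

using System;
using System.Text;

class DecoderDemo
{
    static void Main()
    {
        Decoder decoder = Encoding.UTF8.GetDecoder();
        char[] output = new char[16];

        // Feed the first byte of "é" (UTF-8: 0xC3 0xA9) on its own.
        int n = decoder.GetChars(new byte[] { 0xC3 }, 0, 1, output, 0, flush: false);
        Console.WriteLine(n); // 0 - nothing decodable yet; the byte is buffered

        // Supplying the second byte completes the character.
        n = decoder.GetChars(new byte[] { 0xA9 }, 0, 1, output, 0, flush: false);
        Console.WriteLine(output[0]); // é
    }
}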
Pseudocode
Untried, untested, and only happens to look like C#:
string ReadNullTerminatedString(Stream stm, Encoding encoding)
{
    StringBuilder sb = new StringBuilder();
    TextReader rdr = new StreamReader(stm, encoding);
    int ch = rdr.Read();
    while (ch > 0) // returns -1 when we've hit the end, and 0 is null
    {
        sb.Append((char)ch);
        ch = rdr.Read();
    }
    return sb.ToString();
}
Note: Any code released into public domain. No attribution required.
I'm trying to convert a string to a byte[] using the ASCIIEncoding class in the .NET library. The string will never contain non-ASCII characters, but it will usually have a length greater than 16. My code looks like the following:
public static byte[] Encode(string packet)
{
    ASCIIEncoding enc = new ASCIIEncoding();
    byte[] byteArray = enc.GetBytes(packet);
    return byteArray;
}
By the end of the method, the byte array should hold packet.Length bytes, but IntelliSense tells me that all bytes after byteArray[15] are literally question marks that cannot be observed. I used Wireshark to view byteArray after I sent it, and it was received fine on the other side, but the end device did not follow the instructions encoded in byteArray. I'm wondering if this has anything to do with IntelliSense not being able to display all elements in byteArray, or if my packet is completely wrong.
If your packet string basically contains characters in the range 0-255, then ASCIIEncoding is not what you should be using. ASCII only defines character codes 0-127; anything in the range 128-255 will get turned into question marks (as you have observed) because those characters are not defined in ASCII.
Consider using a method like this to convert the string to a byte array. (This assumes that the ordinal value of each character is in the range 0-255 and that the ordinal value is what you want.)
public static byte[] ToOrdinalByteArray(this string str)
{
    if (str == null) { throw new ArgumentNullException("str"); }

    var bytes = new byte[str.Length];
    for (int i = 0; i < str.Length; ++i) {
        // Wrapping the cast in checked() will trigger an OverflowException
        // if the character being converted is out of range for a byte.
        bytes[i] = checked((byte)str[i]);
    }
    return bytes;
}
The Encoding class hierarchy is specifically designed for handling text. What you have here doesn't seem to be text, so you should avoid using these classes.
The standard encoders use the replacement character fallback strategy. If a character doesn't exist in the target character set, they encode a replacement character ('?' by default).
To me, that's worse than a silent failure: it's data corruption. I prefer that libraries tell me when my assumptions are wrong.
You can get an encoding that throws an exception instead:
Encoding.GetEncoding(
"us-ascii",
new EncoderExceptionFallback(),
new DecoderExceptionFallback());
If you are truly using only characters in Unicode's ASCII range then you'll never see an exception.
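For instance, a minimal sketch of the strict encoding surfacing bad data instead of corrupting it:

using System;
using System.Text;

class StrictAsciiDemo
{
    static void Main()
    {
        var strictAscii = Encoding.GetEncoding(
            "us-ascii",
            new EncoderExceptionFallback(),
            new DecoderExceptionFallback());

        byte[] ok = strictAscii.GetBytes("hello"); // pure ASCII: fine

        try
        {
            strictAscii.GetBytes("héllo"); // 'é' is outside the ASCII range
        }
        catch (EncoderFallbackException ex)
        {
            // The bad input is reported instead of silently becoming '?'.
            Console.WriteLine($"Unencodable char at index {ex.Index}");
        }
    }
}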
I'm a newbie in C#. I want to create a struct in C# which consists of string variables of fixed size, for example DistributorId of size [20]. What is the correct way of giving the string a fixed size?
public struct DistributorEmail
{
    public String DistributorId;
    public String EmailId;
}
If you need fixed, preallocated buffers, String is not the correct datatype.
This type of usage would only make sense in an interop context though, otherwise you should stick to Strings.
You will also need to compile your assembly with unsafe code allowed.
using System;
using System.Runtime.InteropServices;

unsafe public struct DistributorEmail
{
    public fixed char DistributorId[20];
    public fixed char EmailID[20];

    public DistributorEmail(string dId)
    {
        fixed (char* distId = DistributorId)
        {
            char[] chars = dId.ToCharArray();
            // Clamp to the buffer size so a long input cannot overrun the fixed buffer.
            Marshal.Copy(chars, 0, new IntPtr(distId), Math.Min(chars.Length, 20));
        }
    }
}
If for some reason you are in need of fixed size buffers, but not in an interop context, you can use the same struct but without unsafe and fixed. You will then need to allocate the buffers yourself.
Another important point to keep in mind is that in .NET, sizeof(char) != sizeof(byte). A char is always 2 bytes, even if the text it holds is ANSI.
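A quick illustration:

// char is a UTF-16 code unit in .NET, so it is 2 bytes regardless of content.
Console.WriteLine(sizeof(char)); // 2
Console.WriteLine(sizeof(byte)); // 1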
If you really need a fixed length, you can always use a char[] instead of a string. It's easy to convert to/from, if you also need string manipulation.
string s = "Hello, world";
char[] ca = s.ToCharArray();
string s1 = new string(ca);
Note that, aside from some special COM interop scenarios, you can always just use strings, and let the framework worry about sizes and storage.
You can create a new fixed length string by specifying the length when you create it.
string(char c, int count)
This code will create a new string of 40 characters in length, filled with the space character.
string newString = new string(' ', 40);
As a string extension, covering source strings both longer and shorter than the fixed length:
public static string ToFixedLength(this string inStr, int length)
{
    if (inStr.Length == length)
        return inStr;
    if (inStr.Length > length)
        return inStr.Substring(0, length);
    return inStr.PadRight(length); // pad with spaces up to the fixed length
}
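Usage, for illustration:

Console.WriteLine("Hello, world".ToFixedLength(10)); // "Hello, wor" (truncated)
Console.WriteLine("Hi".ToFixedLength(10) + "|");     // "Hi        |" (padded)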
I work with SQLite in C. I try to send UTF-8 chars to the .dll from a C# app, but every time it works differently. For example, sometimes it adds "değirmenci" and another time, with the same code, it adds "değirmencil", but I don't change the word. And sometimes it adds the same thing to the UNIQUE column (I think there is a char that is not visible, like 0x01 in ASCII).
Sorry about my English.
This is my c# code;
[DllImport("dllfile.dll", CharSet = CharSet.Unicode)]
static void Main()
{
byte[] bytes = System.Text.Encoding.UTF8.GetBytes("değirmenci");
int r;
//
IntPtr unmanagedPointer = Marshal.AllocHGlobal(bytes.Length);
Marshal.Copy(bytes, 0, unmanagedPointer, bytes.Length);
IntPtr ch = Tahmin_Baslat();
r = Sozcuk_Ekle(unmanagedPointer);
Console.WriteLine(r);
Console.Read();
//
}
and this is my C code
int Sozcuk_Ekle(const char* kok, int tip_1 = 1, int tip_2 = 0, int tip_3 = 0)
{
    sqlite3 *ch;
    int rc;
    char *HataMsj = 0;
    rc = sqlite3_open(veritabani, &ch); // Open the database
    if (rc)
    {
        return HATA_DEGERI;
    }
    char buff[strlen(kok) + 64];
    sprintf(buff, "INSERT INTO kokler (kok,tip_1,tip_2,tip_3) VALUES('%s',%d,%d,%d)", kok, tip_1, tip_2, tip_3); // Build the SQL command
    sqlite3_exec(ch, buff, GeriBildirim, 0, &HataMsj); // Execute the command
    sqlite3_close(ch); // Release the database resources
    return DOGRU_DEGERI; // Return success
}
(header files etc. included)
And this is how it goes:
http://i.stack.imgur.com/BJ4fE.png
Solution
Add a NULL terminator to the end of the bytes, like this:
byte[] bytes = System.Text.Encoding.UTF8.GetBytes("değirmenci\0");
Check the calling convention in the DllImport attribute (it should be Cdecl), and add a NULL terminator to your UTF-8 string:
byte[] bytes = System.Text.Encoding.UTF8.GetBytes("değirmenci" + '\0');
This will add a NULL terminator to the resulting UTF-8 string (native .NET strings do not need one).
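A minimal helper sketch along these lines (the helper name is mine, not a library API):

using System;
using System.Runtime.InteropServices;
using System.Text;

static class NativeStrings
{
    // Marshal a C# string to a NUL-terminated UTF-8 buffer for native code.
    // The caller must free the returned pointer with Marshal.FreeHGlobal.
    public static IntPtr ToUtf8Z(string s)
    {
        byte[] bytes = Encoding.UTF8.GetBytes(s);
        IntPtr p = Marshal.AllocHGlobal(bytes.Length + 1);
        Marshal.Copy(bytes, 0, p, bytes.Length);
        Marshal.WriteByte(p, bytes.Length, 0); // the terminator C expects
        return p;
    }
}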