I'm working on a solution to my other question which is reading the data in the 'zTXt' chunks of a PNG. I am as far as locating the chunks in the file, and reading the zTXt's keyword. I'm having trouble reading the compressed portion of zTXt. I've never worked with the DeflateStream object before, and am having some trouble with it. When reading, it appears to expect the length parameter to be in 'uncompressed' bytes. In my case however, I only know the length of the data in 'compressed' bytes. To hopefully get around this, I put all the data that needed to be decompressed into a MemoryStream, and then 'read to end' with a DeflateStream. Now that's just peachy, except it throws an InvalidDataException with the message "Block length does not match with its complement." Now I have no idea what this means. What could be going wrong?
The format of a chunk is 4 bytes for the ID ("zTXt"), a big-endian 32-bit int for the data length, the data, and finally a CRC32 checksum which I am ignoring for now.
The format of the zTXt chunk is first a null-terminated string (the keyword), then one byte for the compression method (always 0, the DEFLATE method), with the rest of the data being compressed text.
My method takes in a fresh FileStream, and returns a list of key/value pairs with the zTXt keywords and data.
Here is the monster now:
public static List<KeyValuePair<string, string>> GetZtxt(FileStream stream)
{
var ret = new List<KeyValuePair<string, string>>();
try {
stream.Position = 0;
var br = new BinaryReader(stream, Encoding.ASCII);
var head = br.ReadBytes(8); // The header is the same for all PNGs.
if (!head.SequenceEqual(new byte[] { 0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A })) return null; // Not a PNG.
while (stream.Position < stream.Length) {
int len; // Length of chunk data.
if (BitConverter.IsLittleEndian)
len = BitConverter.ToInt32(br.ReadBytes(4).Reverse().ToArray(), 0);
else
len = br.ReadInt32();
char[] cName = br.ReadChars(4); // The chunk type.
if (cName.SequenceEqual(new[] { 'z', 'T', 'X', 't' })) {
var sb = new StringBuilder(); // Builds the null-terminated keyword associated with the chunk.
char c = br.ReadChar();
do {
sb.Append(c);
c = br.ReadChar();
}
while (c != '\0');
byte method = br.ReadByte(); // The compression method. Should always be 0. (DEFLATE method.)
if (method != 0) {
stream.Seek(len - sb.Length + 2, SeekOrigin.Current); // If not 0, skip the rest of the chunk (remaining data plus the 4-byte CRC).
continue;
}
var data = br.ReadBytes(len - sb.Length - 2); // Rest of the chunk data (keyword, null and method byte are already consumed)...
var ms = new MemoryStream(data, 0, data.Length); // ...in a MemoryStream...
var ds = new DeflateStream(ms, CompressionMode.Decompress); // ...read by a DeflateStream...
var sr = new StreamReader(ds); // ... and a StreamReader. Yeesh.
var str = sr.ReadToEnd(); // !!! InvalidDataException !!!
ret.Add(new KeyValuePair<string, string>(sb.ToString(), str));
stream.Seek(4, SeekOrigin.Current); // Skip the CRC check.
}
else {
stream.Seek(len + 4, SeekOrigin.Current); // Skip the rest of the chunk.
}
}
}
catch (IOException) { }
catch (InvalidDataException) { }
catch (ArgumentOutOfRangeException) { }
return ret;
}
Once this is tackled, I'll need to write a function that ADDS these zTXt chunks to the file. So hopefully I'll understand how the DeflateStream works once this is solved.
Thanks, much!!
After all this time, I've finally found the problem. The data is in zlib format, which has a bit more data stored than just using DEFLATE alone. The file is read properly if I just read the 2 extra bytes in right before I get the compressed data.
See this feedback page. (I did not submit that one.)
I'm wondering now. The values of those two bytes are 0x78 and 0x9C respectively. If I find values other than those, should I assume the DEFLATE is going to fail?
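On the last question, here is a minimal sketch of how the two header bytes could be checked instead of compared against 0x78 and 0x9C literally. It follows the header rules in RFC 1950: the low four bits of the first byte must be 8 (DEFLATE), and the two bytes read as a big-endian 16-bit number must be a multiple of 31. The class and method names are hypothetical, and decoding the text as ISO-8859-1 is an assumption based on the PNG specification's definition of zTXt text.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Text;

// Hypothetical helper, not part of the original post.
public static class ZlibText
{
    // Returns true when the two bytes form a zlib header (RFC 1950) that this sketch can handle.
    public static bool LooksLikeZlibHeader(byte cmf, byte flg)
    {
        bool isDeflate = (cmf & 0x0F) == 8;           // CM = 8 means DEFLATE
        bool windowOk = (cmf >> 4) <= 7;              // CINFO above 7 is not allowed
        bool checkOk = ((cmf << 8) | flg) % 31 == 0;  // FCHECK makes the pair a multiple of 31
        bool noPresetDict = (flg & 0x20) == 0;        // FDICT would put a 4-byte dictionary id next
        return isDeflate && windowOk && checkOk && noPresetDict;
    }

    // Skips the 2-byte zlib header and hands the rest to DeflateStream.
    public static string InflateZlibText(byte[] zlibData)
    {
        if (zlibData.Length < 2 || !LooksLikeZlibHeader(zlibData[0], zlibData[1]))
            throw new InvalidDataException("Not a zlib stream this sketch can handle.");

        using (var ms = new MemoryStream(zlibData, 2, zlibData.Length - 2))
        using (var ds = new DeflateStream(ms, CompressionMode.Decompress))
        using (var sr = new StreamReader(ds, Encoding.GetEncoding("ISO-8859-1")))
        {
            return sr.ReadToEnd();
        }
    }
}
```

Header pairs such as 0x78 0x01 and 0x78 0xDA also pass this check (they only signal a different compression level), so a pair other than 0x78 0x9C does not by itself mean the DEFLATE data is bad.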
I have a byte[] array that is loaded from a file that I happen to know contains UTF-8.
In some debugging code, I need to convert it to a string. Is there a one-liner that will do this?
Under the covers it should be just an allocation and a memcopy, so even if it is not implemented, it should be possible.
string result = System.Text.Encoding.UTF8.GetString(byteArray);
There are at least four different ways of doing this conversion.
Encoding's GetString: decodes the bytes as text, but you won't be able to get the original bytes back if they are not valid text in that encoding (arbitrary binary data, for example).
BitConverter.ToString: the output is a "-" delimited hex string, but there is no built-in .NET method to convert that string back to a byte array.
Convert.ToBase64String: you can easily convert the output string back to a byte array by using Convert.FromBase64String. Note: the output string could contain '+', '/' and '='. If you want to use the string in a URL, you need to explicitly encode it.
HttpServerUtility.UrlTokenEncode: you can easily convert the output string back to a byte array by using HttpServerUtility.UrlTokenDecode. The output string is already URL friendly! The downside is that it needs the System.Web assembly if your project is not a web project.
A full example:
byte[] bytes = { 130, 200, 234, 23 }; // A byte array contains non-ASCII (or non-readable) characters
string s1 = Encoding.UTF8.GetString(bytes); // ���
byte[] decBytes1 = Encoding.UTF8.GetBytes(s1); // decBytes1.Length == 10 !!
// decBytes1 not same as bytes
// Using UTF-8 or other Encoding object will get similar results
string s2 = BitConverter.ToString(bytes); // 82-C8-EA-17
String[] tempAry = s2.Split('-');
byte[] decBytes2 = new byte[tempAry.Length];
for (int i = 0; i < tempAry.Length; i++)
decBytes2[i] = Convert.ToByte(tempAry[i], 16);
// decBytes2 same as bytes
string s3 = Convert.ToBase64String(bytes); // gsjqFw==
byte[] decByte3 = Convert.FromBase64String(s3);
// decByte3 same as bytes
string s4 = HttpServerUtility.UrlTokenEncode(bytes); // gsjqFw2
byte[] decBytes4 = HttpServerUtility.UrlTokenDecode(s4);
// decBytes4 same as bytes
A general solution to convert from byte array to string when you don't know the encoding (StreamReader detects a byte order mark if one is present and otherwise falls back to UTF-8):
static string BytesToStringConverted(byte[] bytes)
{
using (var stream = new MemoryStream(bytes))
{
using (var streamReader = new StreamReader(stream))
{
return streamReader.ReadToEnd();
}
}
}
Definition:
public static string ConvertByteToString(this byte[] source)
{
return source != null ? System.Text.Encoding.UTF8.GetString(source) : null;
}
Using:
string result = input.ConvertByteToString();
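Converting a byte[] to a string seems simple, but decoding with the wrong encoding will mess up the output string. This little function maps each byte to the char with the same value (effectively ISO-8859-1), so every byte value survives a round trip, although multi-byte UTF-8 text will not be decoded into the characters it represents: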
private string ToString(byte[] bytes)
{
string response = string.Empty;
foreach (byte b in bytes)
response += (Char)b;
return response;
}
The answers in this post cover the basics, and there are several approaches in C# to solve the same problem. The one thing that still needs to be considered is the difference between pure UTF-8 and UTF-8 with a BOM.
Last week, at my job, I needed to develop a feature that outputs some CSV files with a BOM and other CSV files in pure UTF-8 (without a BOM). Each CSV encoding type is consumed by a different non-standardized API: one API reads UTF-8 with a BOM and the other reads UTF-8 without a BOM. To build my approach I researched the concept by reading the "What's the difference between UTF-8 and UTF-8 without BOM?" Stack Overflow question and the Wikipedia article "Byte order mark".
In the end, my C# code for both UTF-8 variants (with a BOM and pure) looked similar to the example below:
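// UTF-8 encoding object that emits a BOM when writing; same call as shared by Zanoni (at top)
string resultWithBom = System.Text.Encoding.UTF8.GetString(byteArray);
// Pure UTF-8 encoding object (no BOM emitted when writing)
string resultPure = (new UTF8Encoding(false)).GetString(byteArray);
// Note: the BOM setting controls the preamble written to files; GetString itself does not strip a leading BOM.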
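As far as I can tell, the BOM flag on the encoding object only changes what GetPreamble returns (and therefore what writers such as StreamWriter emit); GetString does not remove a BOM from the input under either object. A minimal sketch of removing it by hand when decoding, assuming the input is UTF-8; the class and method names are hypothetical:

```csharp
using System;
using System.Text;

// Hypothetical helper, not part of the original answer.
public static class Utf8Bom
{
    // Decodes UTF-8 bytes, skipping a leading byte order mark (EF BB BF) if present.
    public static string DecodeStrippingBom(byte[] bytes)
    {
        byte[] bom = Encoding.UTF8.GetPreamble(); // The three BOM bytes.
        bool hasBom = bytes.Length >= bom.Length
            && bytes[0] == bom[0] && bytes[1] == bom[1] && bytes[2] == bom[2];
        int start = hasBom ? bom.Length : 0;
        return Encoding.UTF8.GetString(bytes, start, bytes.Length - start);
    }
}
```

For the writing side, passing new UTF8Encoding(false) to the writer omits the BOM, while Encoding.UTF8 includes it.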
Calling b.ToString("x2") on each byte outputs a lowercase hex string such as b4b5dfe475e58b67:
public static class Ext {
public static string ToHexString(this byte[] hex)
{
if (hex == null) return null;
if (hex.Length == 0) return string.Empty;
var s = new StringBuilder();
foreach (byte b in hex) {
s.Append(b.ToString("x2"));
}
return s.ToString();
}
public static byte[] ToHexBytes(this string hex)
{
if (hex == null) return null;
if (hex.Length == 0) return new byte[0];
int l = hex.Length / 2;
var b = new byte[l];
for (int i = 0; i < l; ++i) {
b[i] = Convert.ToByte(hex.Substring(i * 2, 2), 16);
}
return b;
}
public static bool EqualsTo(this byte[] bytes, byte[] bytesToCompare)
{
if (bytes == null && bytesToCompare == null) return true; // ?
if (bytes == null || bytesToCompare == null) return false;
if (object.ReferenceEquals(bytes, bytesToCompare)) return true;
if (bytes.Length != bytesToCompare.Length) return false;
for (int i = 0; i < bytes.Length; ++i) {
if (bytes[i] != bytesToCompare[i]) return false;
}
return true;
}
}
There is also the UnicodeEncoding class (UTF-16), which is quite simple to use:
var byteConverter = new UnicodeEncoding();
string stringDataForEncoding = "My Secret Data!";
byte[] dataEncoded = byteConverter.GetBytes(stringDataForEncoding);
Console.WriteLine("Data after decoding: {0}", byteConverter.GetString(dataEncoded));
In addition to the selected answer, if you're using .NET 3.5 or .NET 3.5 CE, you have to specify the index of the first byte to decode, and the number of bytes to decode:
string result = System.Text.Encoding.UTF8.GetString(byteArray, 0, byteArray.Length);
Alternatively:
var byteStr = Convert.ToBase64String(bytes);
The BitConverter class can be used to convert a byte[] to a string.
var convertedString = BitConverter.ToString(byteArray);
Documentation of the BitConverter class can be found on MSDN.
To my knowledge none of the given answers guarantee correct behavior with null termination. Until someone shows me differently I wrote my own static class for handling this with the following methods:
// Mimics the functionality of strlen() in c/c++
// Needed because neither StringBuilder nor Encoding.*.GetString() handles \0 well
static int StringLength(byte[] buffer, int startIndex = 0)
{
int strlen = 0;
while
(
(startIndex + strlen) < buffer.Length // Stay inside the buffer (also counts the last byte when there is no terminator)
&& buffer[startIndex + strlen] != 0 // The typical null termination check
)
{
++strlen;
}
return strlen;
}
// This is messy, but I haven't found a built-in way in C# that guarantees null termination
public static string ParseBytes(byte[] buffer, out int strlen, int startIndex = 0)
{
strlen = StringLength(buffer, startIndex);
byte[] c_str = new byte[strlen];
Array.Copy(buffer, startIndex, c_str, 0, strlen);
return Encoding.UTF8.GetString(c_str);
}
The reason for the startIndex parameter is that, in the example I was working on, I needed to parse a byte[] as an array of null-terminated strings. It can be safely ignored in the simple case.
A LINQ one-liner for converting a byte array byteArrFilename, read from a file, to a pure ASCII C-style zero-terminated string is shown below. It is handy for reading things like file index tables in old archive formats.
String filename = new String(byteArrFilename.TakeWhile(x => x != 0)
.Select(x => x < 128 ? (Char)x : '?').ToArray());
I use '?' as the default character for anything not pure ASCII here, but that can be changed, of course. If you want to be sure you can detect it, just use '\0' instead, since the TakeWhile at the start ensures that a string built this way cannot possibly contain '\0' values from the input source.
Try this console application:
static void Main(string[] args)
{
//Encoding _UTF8 = Encoding.UTF8;
string _mainString = "Hello, World!";
Console.WriteLine("Main String: " + _mainString);
// Convert a string to UTF-8 bytes.
byte[] _utf8Bytes = Encoding.UTF8.GetBytes(_mainString);
// Convert UTF-8 bytes to a string.
string _stringUnicode = Encoding.UTF8.GetString(_utf8Bytes);
Console.WriteLine("String Unicode: " + _stringUnicode);
}
Here is an approach where you don't have to bother with an encoding, because it copies the two bytes of each char directly. I used it in my network class to send binary objects as strings.
public static byte[] String2ByteArray(string str)
{
char[] chars = str.ToArray();
byte[] bytes = new byte[chars.Length * 2];
for (int i = 0; i < chars.Length; i++)
Array.Copy(BitConverter.GetBytes(chars[i]), 0, bytes, i * 2, 2);
return bytes;
}
public static string ByteArray2String(byte[] bytes)
{
char[] chars = new char[bytes.Length / 2];
for (int i = 0; i < chars.Length; i++)
chars[i] = BitConverter.ToChar(bytes, i * 2);
return new string(chars);
}
string result = ASCIIEncoding.UTF8.GetString(byteArray);
I have written the following simple test:
[Test]
public void TestUTF8()
{
var c = "abc☰def";
var b = Encoding.UTF8.GetBytes(c);
Assert.That(b.Length, Is.EqualTo(9));
//Assuming, you are reading a byte stream and got partial result with the first 5 bytes
var p = Encoding.UTF8.GetChars(b, 0, 5);
Trace.WriteLine(new string(p));
Assert.That(p.Length, Is.EqualTo(3));
}
The Trace outputs abc� and the last assert fails because p.Length is 4.
However, I want the Trace to output abc and the last assert to pass. In reality I know the stream will contain valid characters, and when the last few bytes do not form a complete character, they should just be left there, waiting for more data to come.
So how can I achieve this in C#?
Encoding.GetChars isn't really designed for bytes coming from a stream, where some state needs to be kept during the decoding process because a single character might span multiple buffer segments. To do that work you should use a Decoder obtained from Encoding.GetDecoder. However, Decoder.Convert is really low-level: it gives you control over both the input and output buffers, which makes it somewhat difficult to use. Decoder.GetChars is somewhat easier to use and does the important work of storing state between calls. We can easily expand on Peter Duniho's answer for an arbitrary buffer size:
public static void Main(string[] args)
{
var c = "abc☰def";
var b = Encoding.UTF8.GetBytes(c);
var result = DecodeFromStream(new MemoryStream(b), Encoding.UTF8, 3);
Console.WriteLine(result);
Console.WriteLine(c == result);
}
private static string DecodeFromStream(Stream dataStream, Encoding encoding, int bufferSize)
{
Decoder decoder = encoding.GetDecoder();
StringBuilder sb = new StringBuilder();
int inputByteCount;
byte[] inputBuffer = new byte[bufferSize];
char[] charBuffer = new char[encoding.GetMaxCharCount(inputBuffer.Length)];
while ((inputByteCount = dataStream.Read(inputBuffer, 0, inputBuffer.Length)) > 0)
{
int readChars = decoder.GetChars(inputBuffer, 0, inputByteCount, charBuffer, 0);
if (readChars > 0)
sb.Append(charBuffer, 0, readChars);
}
return sb.ToString();
}
I am doing some data chunking and I'm seeing an interesting issue when sending binary data in my response. I can confirm that the length of the byte array is below my data limit of 4 megabytes, but when I receive the message, its total size is over 4 megabytes.
For the example below, I used the largest chunk size I could so I could illustrate the issue while still receiving a usable chunk.
The size of the binary data is 3,040,870 on the service side and the client (once the message is deserialized). However, I can also confirm that the byte array is actually just under 4 megabytes (this was done by actually copying the binary data from the message and pasting it into a text file).
So, is WCF causing these issues and, if so, is there anything I can do to prevent it? If not, what might be causing this inflation on my side?
Thanks!
The usual way of sending byte[]s in SOAP messages is to base64-encode the data. This encoding takes 33% more space than binary encoding, which accounts for the size difference almost precisely.
You could adjust the max size or chunk size slightly so that the end result is within the right range, or use another encoding, e.g. MTOM, to eliminate this 33% overhead.
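To put a number on that overhead: base64 emits 4 output characters for every 3 input bytes, rounding up to a whole group. A minimal sketch of the arithmetic, applied to the 3,040,870-byte payload from the question (the class and method names are hypothetical):

```csharp
using System;

// Hypothetical helper illustrating the base64 size arithmetic.
public static class Base64Size
{
    // Number of characters produced by base64-encoding byteCount bytes (with padding).
    public static long EncodedLength(long byteCount)
    {
        return 4 * ((byteCount + 2) / 3); // Every 3 input bytes (rounded up) become 4 characters.
    }
}
```

For 3,040,870 bytes this gives 4,054,496 characters, before any SOAP envelope text is added on top.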
If you're stuck with SOAP, you can offset the overhead Tim S. talked about by using the System.IO.Compression library in .NET: you'd use the compress function first, before building and sending the SOAP message.
You'd compress with this:
public static byte[] Compress(byte[] data)
{
MemoryStream ms = new MemoryStream();
DeflateStream ds = new DeflateStream(ms, CompressionMode.Compress);
ds.Write(data, 0, data.Length);
ds.Flush();
ds.Close();
return ms.ToArray();
}
On the receiving end, you'd use this to decompress:
public static byte[] Decompress(byte[] data)
{
const int BUFFER_SIZE = 256;
byte[] tempArray = new byte[BUFFER_SIZE];
List<byte[]> tempList = new List<byte[]>();
int count = 0;
int length = 0;
MemoryStream ms = new MemoryStream(data);
DeflateStream ds = new DeflateStream(ms, CompressionMode.Decompress);
while ((count = ds.Read(tempArray, 0, BUFFER_SIZE)) > 0) { // Read returns 0 at the end of the stream.
if (count == BUFFER_SIZE) {
tempList.Add(tempArray);
tempArray = new byte[BUFFER_SIZE];
} else {
byte[] temp = new byte[count];
Array.Copy(tempArray, 0, temp, 0, count);
tempList.Add(temp);
}
length += count;
}
byte[] retVal = new byte[length];
count = 0;
foreach (byte[] temp in tempList) {
Array.Copy(temp, 0, retVal, count, temp.Length);
count += temp.Length;
}
return retVal;
}
I need to be able to insert audio data into existing ac3 files. AC3 files are pretty simple and can be appended to each other without stripping headers or anything. The problem I have is that if you want to add/overwrite/erase a chunk of an ac3 file, you have to do it in 32ms increments, and each 32ms is equal to 1536 bytes of data. So when I insert a data chunk (which must be 1536 bytes, as I just said), I need to find the nearest offset that is divisible by 1536 (like 0, 1536 (0x600), 3072 (0xC00), etc). Let's say I can figure that out. I've read about changing a particular character at a specific offset, but I need to INSERT (not overwrite) that entire 1536-byte data chunk. How would I do that in C#, given the starting offset and the 1536-byte data chunk?
Edit: The data chunk I want to insert is basically just 32ms of silence, and I have the hex, ASCII and ANSI text translations of it. Of course, I may want to insert this chunk multiple times to get 128ms of silence instead of just 32, for example.
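The "nearest offset divisible by 1536" step can be sketched as follows. It assumes the 1536-byte frame size stated in the question (other AC3 bitrates may use a different frame size), and the class and method names are hypothetical:

```csharp
using System;

// Hypothetical helper for snapping an offset to a frame boundary.
public static class Ac3Frames
{
    public const int FrameSize = 1536; // Bytes per 32 ms frame, as stated in the question.

    // Rounds offset to the nearest multiple of FrameSize (an exact tie rounds up).
    public static long NearestFrameOffset(long offset)
    {
        if (offset < 0) throw new ArgumentOutOfRangeException(nameof(offset));
        return (offset + FrameSize / 2) / FrameSize * FrameSize;
    }
}
```

For example, an offset of 1000 snaps to 1536, while 700 snaps to 0. The result may need clamping to the file length before it is used as an insert position.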
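byte[] fileBytes = File.ReadAllBytes(@"C:\abc.ac3");
byte[] toBeInserted = new byte[1536]; // Placeholder: fill this with your own chunk, using whatever encoding you need.
int pos = 0; // Placeholder: index of the 1536-byte frame to insert before; make pos your choice.
int offset = 1536 * pos;
byte[] total = new byte[fileBytes.Length + toBeInserted.Length];
Array.Copy(fileBytes, 0, total, 0, offset); // Bytes before the insert point.
Array.Copy(toBeInserted, 0, total, offset, toBeInserted.Length); // The inserted chunk.
Array.Copy(fileBytes, offset, total, offset + toBeInserted.Length, fileBytes.Length - offset); // The rest of the file.
File.WriteAllBytes(@"C:\abc.ac3", total);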
Here is the helper method that will do what you need:
public static void Insert(string filepath, int insertOffset, Stream dataToInsert)
{
var newFilePath = filepath + ".tmp";
using (var source = File.OpenRead(filepath))
using (var destination = File.OpenWrite(newFilePath))
{
CopyTo(source, destination, insertOffset);// first copy the data before insert
dataToInsert.CopyTo(destination);// write data that needs to be inserted:
CopyTo(source, destination, (int)(source.Length - insertOffset)); // copy remaining data
}
// delete old file and rename new one:
File.Delete(filepath);
File.Move(newFilePath, filepath);
}
private static void CopyTo(Stream source, Stream destination, int count)
{
const int bufferSize = 32 * 1024;
var buffer = new byte[bufferSize];
var remaining = count;
while (remaining > 0)
{
var toCopy = remaining > bufferSize ? bufferSize : remaining;
var actualRead = source.Read(buffer, 0, toCopy);
if (actualRead == 0) break; // End of stream reached early; avoid looping forever.
destination.Write(buffer, 0, actualRead);
remaining -= actualRead;
}
}
And here is an NUnit test with example usage:
[Test]
public void TestInsert()
{
var originalString = "some original text";
var insertString = "_ INSERTED TEXT _";
var insertOffset = 8;
var file = @"c:\someTextFile.txt";
if (File.Exists(file))
File.Delete(file);
using (var originalData = new MemoryStream(Encoding.ASCII.GetBytes(originalString)))
using (var f = File.OpenWrite(file))
originalData.CopyTo(f);
using (var dataToInsert = new MemoryStream(Encoding.ASCII.GetBytes(insertString)))
Insert(file, insertOffset, dataToInsert);
var expectedText = originalString.Insert(insertOffset, insertString);
var actualText = File.ReadAllText(file);
Assert.That(actualText, Is.EqualTo(expectedText));
}
Be aware that I have removed some checks for code clarity - do not forget to check for null, file access permissions and file size. For example insertOffset can be bigger than file length - this condition is not checked here.
Normally, to read characters from a byte stream you use a StreamReader. In this example I'm reading records delimited by '\r' from an infinite stream.
using(var reader = new StreamReader(stream, Encoding.UTF8))
{
var messageBuilder = new StringBuilder();
var nextChar = 'x';
while (reader.Peek() >= 0)
{
nextChar = (char)reader.Read();
messageBuilder.Append(nextChar);
if (nextChar == '\r')
{
ProcessBuffer(messageBuilder.ToString());
messageBuilder.Clear();
}
}
}
The problem is that the StreamReader has a small internal buffer, so if the code is waiting for an 'end of record' delimiter ('\r' in this case), it has to wait until the StreamReader's internal buffer is flushed (usually because more bytes have arrived).
This alternative implementation works for single byte UTF-8 characters, but will fail on multibyte characters.
int byteAsInt = 0;
var messageBuilder = new StringBuilder();
while ((byteAsInt = stream.ReadByte()) != -1)
{
var nextChar = Encoding.UTF8.GetChars(new[]{(byte) byteAsInt});
Console.Write(nextChar[0]);
messageBuilder.Append(nextChar);
if (nextChar[0] == '\r')
{
ProcessBuffer(messageBuilder.ToString());
messageBuilder.Clear();
}
}
How can I modify this code so that it works with multi-byte characters?
Rather than Encoding.UTF8.GetChars, which is designed to convert complete buffers, get an instance of Decoder and repeatedly call its GetChars method. This makes use of the Decoder's internal buffer to carry partial multi-byte sequences over from the end of one call to the next.
Thanks to Richard, I now have a working infinite stream reader. As he explained, the trick is to use a Decoder instance and call its GetChars method. I've tested it with multi-byte Japanese text and it works fine.
int byteAsInt = 0;
var messageBuilder = new StringBuilder();
var decoder = Encoding.UTF8.GetDecoder();
var nextChar = new char[1];
while ((byteAsInt = stream.ReadByte()) != -1)
{
var charCount = decoder.GetChars(new[] {(byte) byteAsInt}, 0, 1, nextChar, 0);
if(charCount == 0) continue;
Console.Write(nextChar[0]);
messageBuilder.Append(nextChar);
if (nextChar[0] == '\r')
{
ProcessBuffer(messageBuilder.ToString());
messageBuilder.Clear();
}
}
I don't understand why you're not using the stream reader's ReadLine method. If there's a good reason not to, however, it nonetheless seems to me that repeatedly calling GetChars on the decoder is inefficient. Why not make use of the fact that the byte representation of '\r' can't be part of a multi-byte sequence? (Bytes in a multi-byte sequence must be greater than 127; that is, they have the highest bit set.)
var messageBuilder = new List<byte>();
int byteAsInt;
while ((byteAsInt = stream.ReadByte()) != -1)
{
messageBuilder.Add((byte)byteAsInt);
if (byteAsInt == '\r')
{
var messageString = Encoding.UTF8.GetString(messageBuilder.ToArray());
Console.Write(messageString);
ProcessBuffer(messageString);
messageBuilder.Clear();
}
}
Mike,
I found your solution perfect for my situation as well. But I noticed that sometimes it takes four GetChars() calls to determine the characters to be returned, and the fourth call then returns two chars at once. This meant that charCount was 2, while my nextChar buffer size was 1. So I got the error "The output character buffer is too small to contain the decoded characters, encoding Unicode fallback System.Text.DecoderReplacementFallback."
I changed my code to:
// ...
var nextChar = new char[4]; // 2 might suffice
for (var i = startPos; i < bytesRead; i++)
{
int charCount;
//...
charCount = decoder.GetChars(buffer, i, 1, nextChar, 0);
if (charCount == 0)
{
bytesSkipped++;
continue;
}
for (int ic = 0; ic < charCount; ic++)
{
char c = nextChar[ic];
charPos++;
// Process character here...
}
}
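The two-char result described above is what I would expect for a 4-byte UTF-8 sequence: it encodes a code point above U+FFFF, which .NET represents as a surrogate pair of two chars. A minimal, self-contained sketch of the byte-at-a-time loop with an output buffer large enough for that case; the class and method names are hypothetical, and the buffer size of 4 simply follows the code above:

```csharp
using System;
using System.Text;

// Hypothetical helper illustrating byte-at-a-time decoding.
public static class SurrogateDemo
{
    // Feeds UTF-8 bytes to a Decoder one at a time and returns the decoded text.
    public static string DecodeByteByByte(byte[] utf8Bytes)
    {
        Decoder decoder = Encoding.UTF8.GetDecoder();
        var output = new StringBuilder();
        var oneByte = new byte[1];
        var chars = new char[4]; // Room for a surrogate pair, with some slack.

        foreach (byte b in utf8Bytes)
        {
            oneByte[0] = b;
            // Returns 0 while a multi-byte sequence is still incomplete.
            int charCount = decoder.GetChars(oneByte, 0, 1, chars, 0);
            output.Append(chars, 0, charCount);
        }
        return output.ToString();
    }
}
```

Appending only the first charCount chars, rather than the whole buffer, keeps stale chars from earlier calls out of the result.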