I collect some large log messages with a C# tool, so I searched for a way to compress those giant strings and found this snippet that seemed to do the trick:
public static string CompressString(string text)
{
    byte[] buffer = Encoding.UTF8.GetBytes(text);
    var memoryStream = new MemoryStream();
    using (var gZipStream = new GZipStream(memoryStream, CompressionMode.Compress, true))
    {
        gZipStream.Write(buffer, 0, buffer.Length);
    }

    memoryStream.Position = 0;

    var compressedData = new byte[memoryStream.Length];
    memoryStream.Read(compressedData, 0, compressedData.Length);

    // Prepends the uncompressed length as a 4-byte prefix before Base64-encoding.
    var gZipBuffer = new byte[compressedData.Length + 4];
    Buffer.BlockCopy(compressedData, 0, gZipBuffer, 4, compressedData.Length);
    Buffer.BlockCopy(BitConverter.GetBytes(buffer.Length), 0, gZipBuffer, 0, 4);
    return Convert.ToBase64String(gZipBuffer);
}
After my logging action, the C# tool sends this compressed string to a Node.js REST interface, which writes it into a database.
Now (in my naive understanding of compression) I thought that I could simply use something like the following code on the Node.js side to decompress it:
zlib.gunzip(Buffer.from(compressedLogMessage, 'base64'), function(err, uncompressedLogMessage) {
    if (err) {
        console.error(err);
    }
    else {
        console.log(uncompressedLogMessage.toString('utf-8'));
    }
});
But I get the error:
{ Error: incorrect header check
at Zlib._handle.onerror (zlib.js:370:17) errno: -3, code: 'Z_DATA_ERROR' }
It seems that the compression method does not match the decompression function. I expect that anyone with compression/decompression knowledge could spot the issue(s) immediately.
What could I change or improve to make the decompression work?
Thanks a lot!
========== UPDATE ===========
It seems that message receiving and base64 decoding work.
Using CompressString("Hello World") results in:
// before compression
"Hello World"
// after compression before base64 encoding
new byte[] { 11, 0, 0, 0, 31, 139, 8, 0, 0, 0, 0, 0, 0, 3, 243, 72, 205, 201, 201, 87, 8, 207, 47, 202, 73, 1, 0, 86, 177, 23, 74, 11, 0, 0, 0 }
// after base64 encoding
CwAAAB+LCAAAAAAAAAPzSM3JyVcIzy/KSQEAVrEXSgsAAAA=
And on node js side:
// after var buf = Buffer.from('CwAAAB+LCAAAAAAAAAPzSM3JyVcIzy/KSQEAVrEXSgsAAAA=', 'base64');
{"buf":{"type":"Buffer","data":[11,0,0,0,31,139,8,0,0,0,0,0,0,3,243,72,205,201,201,87,8,207,47,202,73,1,0,86,177,23,74,11,0,0,0]}}
// after zlib.gunzip(buf, function(err, dezipped) { ... }
{ Error: incorrect header check
at Zlib._handle.onerror (zlib.js:370:17) errno: -3, code: 'Z_DATA_ERROR' }
=============== Update 2 ==================
#01binary's answer was correct! That's the working solution:
function toArrayBuffer(buffer) {
    var arrayBuffer = new ArrayBuffer(buffer.length);
    var view = new Uint8Array(arrayBuffer);
    for (var i = 0; i < buffer.length; ++i) {
        view[i] = buffer[i];
    }
    return arrayBuffer;
}

// Hello World (compressed with C#) => CwAAAB+LCAAAAAAAAAPzSM3JyVcIzy/KSQEAVrEXSgsAAAA=
var arrayBuffer = toArrayBuffer(Buffer.from('CwAAAB+LCAAAAAAAAAPzSM3JyVcIzy/KSQEAVrEXSgsAAAA=', 'base64'));

var zlib = require('zlib');
zlib.gunzip(Buffer.from(arrayBuffer, 4), function(err, uncompressedMessage) {
    if (err) {
        console.log(err);
    }
    else {
        console.log(uncompressedMessage.toString()); // Hello World
    }
});
The snippet you found appears to write 4 extra bytes to the beginning of the output stream, containing the "uncompressed" size of the original data. The original author must have assumed that logic on the receiving end is going to read those 4 bytes, know that it needs to allocate a buffer of that size, and pass the rest of the stream (at +4 offset) to gunzip.
If you are using this signature on the Node side:
https://nodejs.org/api/buffer.html#buffer_class_method_buffer_from_arraybuffer_byteoffset_length
...then pass a byte offset of 4. The first two bytes of your gzip stream should be { 0x1F, 0x8b }, and you can see in your array that those two bytes start at offset 4. A simple example of the zlib header can be found here:
Zlib compression incompatibile C vs C# implementations
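For reference, a minimal sketch of a C# compressor that omits the length prefix entirely (same usings as the snippet above; the method name is illustrative), so the Base64 payload is a plain gzip stream:
public static string CompressStringGzipOnly(string text)
{
    byte[] buffer = Encoding.UTF8.GetBytes(text);
    using (var memoryStream = new MemoryStream())
    {
        using (var gZipStream = new GZipStream(memoryStream, CompressionMode.Compress, true))
        {
            gZipStream.Write(buffer, 0, buffer.Length);
        }
        // No length prefix: the payload starts with the gzip magic bytes 0x1F 0x8B.
        return Convert.ToBase64String(memoryStream.ToArray());
    }
}
With that, zlib.gunzip(Buffer.from(payload, 'base64'), ...) on the Node side should work without any byte offset.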
Related
I need to read bytes from a stream (they were converted from a string to a byte array and then sent to the stream) and stop reading as soon as I encounter a specific sequence; in my case it's [13, 10, 13, 10], or "\r\n\r\n" if converted to an ASCII string.
Currently I have two versions of the same process:
1) Read from the stream one byte at a time and check after EVERY byte whether the last 4 bytes read equal [13, 10, 13, 10]. (Note that I can't read and check 4 bytes at a time: the sequence can end at byte 7, for example, so the read would first consume 4 bytes and then get stuck because only 3 of the next 4 bytes are available.)
NetworkStream streamBrowser = tcpclientBrowser.GetStream();
byte[] data;
using (MemoryStream ms = new MemoryStream())
{
    byte[] check = new byte[4] { 13, 10, 13, 10 };
    byte[] buff = new byte[1];
    do
    {
        streamBrowser.Read(buff, 0, 1);
        ms.Write(buff, 0, 1);
        data = ms.ToArray();
    } while (!data.Skip(data.Length - 4).SequenceEqual(check));
}
2) Use StreamReader.ReadLine to read until "\r\n", then read again to see if the returned line is null, and then append "\r\n" to the first returned string; that way I'll get a string that ends with "\r\n\r\n".
My question is: which method is preferable in terms of performance? (If any; it may be that both are too slow and there is a better way, which I would really like to know.)
I am drunk, but maybe it will help:
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

namespace ConsoleApp5
{
    class Program
    {
        static void Main(string[] args)
        {
            var ms = new MemoryStream(new byte[] { 1, 2, 3, 4, 5, 6, 7, 8, 9 });
            var sequence = new byte[] { 6, 7, 8 };
            var buffer = new byte[1];
            var queue = new Queue<byte>();
            int position = 0;

            while (true)
            {
                int count = ms.Read(buffer, 0, 1);
                if (count == 0) return;

                queue.Enqueue(buffer[0]);
                position++;

                if (IsSequenceFound(queue, sequence))
                {
                    Console.WriteLine("Found sequence at position: " + (position - queue.Count));
                    return;
                }

                if (queue.Count == sequence.Length) queue.Dequeue();
            }
        }

        static bool IsSequenceFound(Queue<byte> queue, byte[] sequence)
        {
            return queue.SequenceEqual(sequence);
            // a normal for (int i ...) loop can be faster
        }
    }
}
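Riffing on that last comment, here is a hedged sketch of the same search without the Queue/LINQ overhead, using a small sliding window (names are illustrative; for a 4-byte terminator the per-byte shift is cheap, and in practice you would also wrap a NetworkStream in a BufferedStream so each ReadByte does not hit the socket):
using System;
using System.IO;

static class SequenceScanner
{
    // Returns the 0-based offset where the sequence starts, or -1 if the stream ends first.
    public static long FindSequence(Stream stream, byte[] sequence)
    {
        var window = new byte[sequence.Length]; // holds the last sequence.Length bytes read
        int filled = 0;
        long position = 0;
        int b;
        while ((b = stream.ReadByte()) != -1)
        {
            if (filled == window.Length)
            {
                // Shift the window left by one byte and append the new byte.
                Array.Copy(window, 1, window, 0, window.Length - 1);
                window[window.Length - 1] = (byte)b;
            }
            else
            {
                window[filled++] = (byte)b;
            }
            position++;
            if (filled == window.Length && Matches(window, sequence))
                return position - window.Length;
        }
        return -1; // sequence not found
    }

    private static bool Matches(byte[] window, byte[] sequence)
    {
        for (int i = 0; i < sequence.Length; i++)
            if (window[i] != sequence[i]) return false;
        return true;
    }
}
Run against the same { 1..9 } stream and sequence { 6, 7, 8 } as above, this also reports position 5.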
Is there a way in C# to convert a plain byte array to an object?
e.g. given this class:
class Data
{
    public int _int1;
    public int _int2;
    public short _short1;
    public long _long1;
}
I want to basically be able to do something like this:
var bytes = new byte[] { 1, 0, 0, 0, 2, 0, 0, 0, 3, 0, 4, 0, 0, 0, 0, 0, 0, 0 };
var obj = (Data)bytes;
You could try marshalling:
Declare the layout of your class as Sequential (and note that you will need to use Pack = 1):
[StructLayout(LayoutKind.Sequential, Pack = 1)]
class Data
{
    public int _int1;
    public int _int2;
    public short _short1;
    public long _long1;
}
Marshal the bytes into a new instance of the Data class:
var bytes = new byte[] { 1, 0, 0, 0, 2, 0, 0, 0, 3, 0, 4, 0, 0, 0, 0, 0, 0, 0 };
GCHandle gcHandle = GCHandle.Alloc(bytes, GCHandleType.Pinned);
var data = (Data)Marshal.PtrToStructure(gcHandle.AddrOfPinnedObject(), typeof(Data));
gcHandle.Free();
// Now data should contain the correct values.
Console.WriteLine(data._int1); // Prints 1
Console.WriteLine(data._int2); // Prints 2
Console.WriteLine(data._short1); // Prints 3
Console.WriteLine(data._long1); // Prints 4
For convenience you could write a static method on Data to do the conversion:
[StructLayout(LayoutKind.Sequential, Pack = 1)]
class Data
{
    public int _int1;
    public int _int2;
    public short _short1;
    public long _long1;

    public static Data FromBytes(byte[] bytes)
    {
        GCHandle gcHandle = GCHandle.Alloc(bytes, GCHandleType.Pinned);
        var data = (Data)Marshal.PtrToStructure(gcHandle.AddrOfPinnedObject(), typeof(Data));
        gcHandle.Free();
        return data;
    }
}
...
var data = Data.FromBytes(new byte[] {1, 0, 0, 0, 2, 0, 0, 0, 3, 0, 4, 0, 0, 0, 0, 0, 0, 0});
If you really wanted to, you could write an explicit operator to convert from an array of bytes, to get the syntax in your OP. I would suggest just using Data.FromBytes(), which is going to be a lot clearer IMO.
Still, just for completeness:
[StructLayout(LayoutKind.Sequential, Pack = 1)]
class Data
{
    public int _int1;
    public int _int2;
    public short _short1;
    public long _long1;

    public static explicit operator Data(byte[] bytes)
    {
        GCHandle gcHandle = GCHandle.Alloc(bytes, GCHandleType.Pinned);
        var data = (Data)Marshal.PtrToStructure(gcHandle.AddrOfPinnedObject(), typeof(Data));
        gcHandle.Free();
        return data;
    }
}
...
var data = (Data)new byte[] {1, 0, 0, 0, 2, 0, 0, 0, 3, 0, 4, 0, 0, 0, 0, 0, 0, 0};
Use the BitConverter.ToInt32/ToInt16/ToInt64 methods. You only have to specify the starting index, like:
Data data = new Data();
data._int1 = BitConverter.ToInt32(bytes, 0);
data._int2 = BitConverter.ToInt32(bytes, 4);
data._short1 = BitConverter.ToInt16(bytes, 8);
data._long1 = BitConverter.ToInt64(bytes, 10);
Just remember, from the BitConverter.ToInt32 documentation: "The order of bytes in the array must reflect the endianness of the computer system's architecture."
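If the bytes might come from a big-endian source, here is a hedged sketch of a normalizing read (the helper name is illustrative):
// Reads a little-endian Int32 regardless of the machine's architecture.
static int ReadInt32LittleEndian(byte[] bytes, int offset)
{
    if (BitConverter.IsLittleEndian)
        return BitConverter.ToInt32(bytes, offset);
    var tmp = new byte[4];
    Array.Copy(bytes, offset, tmp, 0, 4);
    Array.Reverse(tmp); // flip to match the big-endian machine
    return BitConverter.ToInt32(tmp, 0);
}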
Here is a way to convert a byte array into an object, provided the bytes were originally produced by BinaryFormatter serialization (it will not parse an arbitrary byte layout like the one in the question):
var binaryFormatter = new BinaryFormatter();
using (var ms = new MemoryStream(bytes))
{
    object obj = binaryFormatter.Deserialize(ms);
    return (Data)obj;
}
There is nothing that will do the conversion in one go.
But you can build on top of BitConverter:
var d = new Data();
var sI32 = sizeof(Int32);
d._int1 = BitConverter.ToInt32(bytes, 0);
d._int2 = BitConverter.ToInt32(bytes, sI32);
d._short1 = BitConverter.ToInt16(bytes, 2*sI32);
…
How to identify doc, docx, pdf, xls and xlsx based on file header in C#?
I don't want to rely on file extensions or on MimeMapping.GetMimeMapping for this, as either of the two can be manipulated.
I know how to read the header but don't know what combination of bytes says whether a file is a doc, docx, pdf, xls or xlsx.
Any thoughts?
This question contains an example of using the first bytes of a file to determine the file type: Using .NET, how can you find the mime type of a file based on the file signature not the extension
It is a very long post, so I am posting the relevant answer below:
public class MimeType
{
    private static readonly byte[] BMP = { 66, 77 };
    private static readonly byte[] DOC = { 208, 207, 17, 224, 161, 177, 26, 225 };
    private static readonly byte[] EXE_DLL = { 77, 90 };
    private static readonly byte[] GIF = { 71, 73, 70, 56 };
    private static readonly byte[] ICO = { 0, 0, 1, 0 };
    private static readonly byte[] JPG = { 255, 216, 255 };
    private static readonly byte[] MP3 = { 255, 251, 48 };
    private static readonly byte[] OGG = { 79, 103, 103, 83, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0 };
    private static readonly byte[] PDF = { 37, 80, 68, 70, 45, 49, 46 };
    private static readonly byte[] PNG = { 137, 80, 78, 71, 13, 10, 26, 10, 0, 0, 0, 13, 73, 72, 68, 82 };
    private static readonly byte[] RAR = { 82, 97, 114, 33, 26, 7, 0 };
    private static readonly byte[] SWF = { 70, 87, 83 };
    private static readonly byte[] TIFF = { 73, 73, 42, 0 };
    private static readonly byte[] TORRENT = { 100, 56, 58, 97, 110, 110, 111, 117, 110, 99, 101 };
    private static readonly byte[] TTF = { 0, 1, 0, 0, 0 };
    private static readonly byte[] WAV_AVI = { 82, 73, 70, 70 };
    private static readonly byte[] WMV_WMA = { 48, 38, 178, 117, 142, 102, 207, 17, 166, 217, 0, 170, 0, 98, 206, 108 };
    private static readonly byte[] ZIP_DOCX = { 80, 75, 3, 4 };

    public static string GetMimeType(byte[] file, string fileName)
    {
        string mime = "application/octet-stream"; // default unknown MIME type

        // Ensure that the filename isn't empty or null
        if (string.IsNullOrWhiteSpace(fileName))
        {
            return mime;
        }

        // Get the file extension
        string extension = Path.GetExtension(fileName) == null
            ? string.Empty
            : Path.GetExtension(fileName).ToUpper();

        // Get the MIME type
        if (file.Take(2).SequenceEqual(BMP))
        {
            mime = "image/bmp";
        }
        else if (file.Take(8).SequenceEqual(DOC))
        {
            mime = "application/msword";
        }
        else if (file.Take(2).SequenceEqual(EXE_DLL))
        {
            mime = "application/x-msdownload"; // both use the same MIME type
        }
        else if (file.Take(4).SequenceEqual(GIF))
        {
            mime = "image/gif";
        }
        else if (file.Take(4).SequenceEqual(ICO))
        {
            mime = "image/x-icon";
        }
        else if (file.Take(3).SequenceEqual(JPG))
        {
            mime = "image/jpeg";
        }
        else if (file.Take(3).SequenceEqual(MP3))
        {
            mime = "audio/mpeg";
        }
        else if (file.Take(14).SequenceEqual(OGG))
        {
            if (extension == ".OGX")
            {
                mime = "application/ogg";
            }
            else if (extension == ".OGA")
            {
                mime = "audio/ogg";
            }
            else
            {
                mime = "video/ogg";
            }
        }
        else if (file.Take(7).SequenceEqual(PDF))
        {
            mime = "application/pdf";
        }
        else if (file.Take(16).SequenceEqual(PNG))
        {
            mime = "image/png";
        }
        else if (file.Take(7).SequenceEqual(RAR))
        {
            mime = "application/x-rar-compressed";
        }
        else if (file.Take(3).SequenceEqual(SWF))
        {
            mime = "application/x-shockwave-flash";
        }
        else if (file.Take(4).SequenceEqual(TIFF))
        {
            mime = "image/tiff";
        }
        else if (file.Take(11).SequenceEqual(TORRENT))
        {
            mime = "application/x-bittorrent";
        }
        else if (file.Take(5).SequenceEqual(TTF))
        {
            mime = "application/x-font-ttf";
        }
        else if (file.Take(4).SequenceEqual(WAV_AVI))
        {
            mime = extension == ".AVI" ? "video/x-msvideo" : "audio/x-wav";
        }
        else if (file.Take(16).SequenceEqual(WMV_WMA))
        {
            mime = extension == ".WMA" ? "audio/x-ms-wma" : "video/x-ms-wmv";
        }
        else if (file.Take(4).SequenceEqual(ZIP_DOCX))
        {
            mime = extension == ".DOCX" ? "application/vnd.openxmlformats-officedocument.wordprocessingml.document" : "application/x-zip-compressed";
        }

        return mime;
    }
}
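A usage sketch (the path is illustrative; the first 16 bytes cover the longest signature in the table above):
byte[] header = new byte[16];
using (var fs = File.OpenRead(@"C:\temp\report.pdf"))
{
    fs.Read(header, 0, header.Length);
}
string mime = MimeType.GetMimeType(header, "report.pdf"); // "application/pdf"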
Using file signatures alone is not really feasible here (since the new Office formats are ZIP files and the old Office files are OLE CF / OLE SS containers), but you can use C# code to read them and figure out what they are.
For the newest Office formats, you can read the (DOCX/PPTX/XLSX/...) ZIP file using System.IO.Packaging: https://msdn.microsoft.com/en-us/library/ms568187(v=vs.110).aspx
Doing that, you can find the ContentType of the first document part and infer using that.
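A rough sketch of that approach (the helper name is illustrative; the content-type substrings are the well-known OOXML ones, and error handling is omitted):
using System.IO;
using System.IO.Packaging; // reference WindowsBase

static string GuessOoxmlKind(string path)
{
    using (Package package = Package.Open(path, FileMode.Open, FileAccess.Read))
    {
        foreach (PackagePart part in package.GetParts())
        {
            // The main document part's content type identifies the format, e.g.
            // "application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml" for DOCX.
            if (part.ContentType.Contains("wordprocessingml")) return "docx";
            if (part.ContentType.Contains("spreadsheetml")) return "xlsx";
            if (part.ContentType.Contains("presentationml")) return "pptx";
        }
    }
    return null; // a ZIP, but not a recognized Office package
}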
For older Office files (Office 2003) you can use this library to distinguish them based on their contents (note that MSI and MSG files also use this file format):
http://sourceforge.net/projects/openmcdf/
E.g., the contents of an XLS file (the original answer showed a screenshot of its compound-file streams here).
I hope this helps! :)
It would have certainly helped me, if I had found this answer earlier. ;)
The answer from user2173353 is the most correct one, given that the OP specifically mentioned Office file formats. However, I didn't like the idea of adding an entire library (OpenMCDF) just to identify legacy Office formats, so I wrote my own routine for doing just this.
public static CfbFileFormat GetCfbFileFormat(Stream fileData)
{
    if (!fileData.CanSeek)
        throw new ArgumentException("Data stream must be seekable.", nameof(fileData));

    try
    {
        // Values in a CFB file are always little-endian. Fortunately BinaryReader.ReadUInt16/ReadUInt32 read little-endian.
        // If using .NET < 4.5 this BinaryReader constructor is not available. Use a simpler one, but remember to also remove the 'using' statement.
        using (BinaryReader reader = new BinaryReader(fileData, Encoding.Unicode, true))
        {
            // Check that the data has the CFB file header
            var header = reader.ReadBytes(8);
            if (!header.SequenceEqual(new byte[] { 0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1 }))
                return CfbFileFormat.Unknown;

            // Get the sector size (2-byte uint) at offset 30 (0x1E) in the header.
            // The value specifies the sector size as a power of two. The only valid values are 9 or 12, giving a 512 or 4096 byte sector size.
            fileData.Position = 30;
            ushort readUInt16 = reader.ReadUInt16();
            int sectorSize = 1 << readUInt16;

            // Get the first directory sector index at offset 48 in the header
            fileData.Position = 48;
            var rootDirectoryIndex = reader.ReadUInt32();

            // The file header is one sector wide. After that we can address the sector directly using the sector index
            var rootDirectoryAddress = sectorSize + (rootDirectoryIndex * sectorSize);

            // The object type field is offset 80 bytes into the directory sector. It is a 128-bit GUID, encoded as "DWORD, WORD, WORD, BYTE[8]".
            fileData.Position = rootDirectoryAddress + 80;
            var bits127_96 = reader.ReadInt32();
            var bits95_80 = reader.ReadInt16();
            var bits79_64 = reader.ReadInt16();
            var bits63_0 = reader.ReadBytes(8);
            var guid = new Guid(bits127_96, bits95_80, bits79_64, bits63_0);

            // Compare to known file format GUIDs
            CfbFileFormat result;
            return Formats.TryGetValue(guid, out result) ? result : CfbFileFormat.Unknown;
        }
    }
    catch (IOException)
    {
        return CfbFileFormat.Unknown;
    }
    catch (OverflowException)
    {
        return CfbFileFormat.Unknown;
    }
}
public enum CfbFileFormat
{
    Doc,
    Xls,
    Msi,
    Ppt,
    Unknown
}

private static readonly Dictionary<Guid, CfbFileFormat> Formats = new Dictionary<Guid, CfbFileFormat>
{
    { Guid.Parse("{00020810-0000-0000-c000-000000000046}"), CfbFileFormat.Xls },
    { Guid.Parse("{00020820-0000-0000-c000-000000000046}"), CfbFileFormat.Xls },
    { Guid.Parse("{00020906-0000-0000-c000-000000000046}"), CfbFileFormat.Doc },
    { Guid.Parse("{000c1084-0000-0000-c000-000000000046}"), CfbFileFormat.Msi },
    { Guid.Parse("{64818d10-4f9b-11cf-86ea-00aa00b929e8}"), CfbFileFormat.Ppt }
};
Additional format identifiers can be added as needed.
I've tried this on .doc and .xls, and it has worked fine. I haven't tested it on CFB files using a 4096 byte sector size, as I don't even know where to find those.
The code is based on information from the following documents:
http://fileformats.archiveteam.org/wiki/Microsoft_Compound_File
https://msdn.microsoft.com/en-us/library/dd942138.aspx
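A usage sketch (the filename is illustrative):
using (var fs = File.OpenRead(@"C:\temp\legacy.doc"))
{
    CfbFileFormat format = GetCfbFileFormat(fs);
    Console.WriteLine(format); // e.g. Doc, or Unknown for non-CFB input
}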
user2173353 has what appears to be the correct solution for detecting the new Office .docx / .xlsx formats.
To add some details to this, the below check appears to identify these correctly:
/// <summary>
/// MS .docx, .xlsx and other OOXML extensions are (correctly) identified as ZIP files by signature lookup.
/// This tests whether System.IO.Packaging is able to open the stream; if the package has parts, it is an
/// Office package rather than a plain ZIP file.
/// </summary>
/// <param name="stream"></param>
/// <returns></returns>
private static bool IsPackage(this Stream stream)
{
    Package package = Package.Open(stream, FileMode.Open, FileAccess.Read);
    return package.GetParts().Any();
}
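Note that Package.Open throws FileFormatException for ZIPs that are not OPC packages, so callers probably want a guard along these lines (the helper name is illustrative; both methods would live in the same static class):
private static bool TryIsPackage(this Stream stream)
{
    try
    {
        return stream.IsPackage();
    }
    catch (FileFormatException)
    {
        return false; // a plain ZIP (or corrupt file), not an Office package
    }
}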
I need to run a stored procedure from code. One of the input parameters is the rowVersion of the table. rowVersion is a byte array ({ 0, 0, 0, 0, 0, 0, 13, 191 }, which is 0x0000000000000DBF in the db). So if I add rowVersion this way:
cmd.Parameters.AddWithValue("@myPKRowversion", 0x0000000000000DBF);
my sp is working. But when I'm adding it like here:
uint a = 0x0000000000000DBF;
cmd.Parameters.AddWithValue("@myPKRowversion", a);
or if I convert the byte array to a string like:
string a = "0x0000000000000DBF";
cmd.Parameters.AddWithValue("@myPKRowversion", a);
my sp is not working.
What should I do to make my sp work?
I suggest you add it as a byte array. For example:
byte[] bytes = new byte[] { 0, 0, 0, 0, 0, 0, 13, 191 };
cmd.Parameters.Add("@myPKRowVersion", SqlDbType.Binary).Value = bytes;
If you're trying to specify bytes, the most natural type is a byte array...
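If you only have the value as a number or hex string, here is a hedged sketch of producing the big-endian byte array that matches the rowversion (parameter name as in your example):
ulong value = 0x0000000000000DBF;
byte[] bytes = BitConverter.GetBytes(value);
if (BitConverter.IsLittleEndian)
    Array.Reverse(bytes); // rowversion compares as big-endian binary: { 0, 0, 0, 0, 0, 0, 13, 191 }
cmd.Parameters.Add("@myPKRowversion", SqlDbType.Binary).Value = bytes;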
I'm working on a solution to my other question, which is reading the data in the 'zTXt' chunks of a PNG. I am as far as locating the chunks in the file and reading the zTXt's keyword. I'm having trouble reading the compressed portion of zTXt.
I've never worked with the DeflateStream object before, and am having some trouble with it. When reading, it appears to expect the length parameter to be in 'uncompressed' bytes. In my case, however, I only know the length of the data in 'compressed' bytes. To hopefully get around this, I put all the data that needed to be decompressed into a MemoryStream, and then 'read to end' with a DeflateStream.
Now that's just peachy, except it throws an InvalidDataException with the message "Block length does not match with its complement." Now I have no idea what this means. What could be going wrong?
The format of a chunk is: 4 bytes for the ID ("zTXt"), a big-endian 32-bit int for the data length, the data itself, and finally a CRC32 checksum, which I am ignoring for now.
The format of the zTXt chunk data is: first a null-terminated string as a keyword, then one byte for the compression method (always 0, the DEFLATE method), with the rest of the data being compressed text.
My method takes in a fresh FileStream and returns a list of key/value pairs with the zTXt keywords and data.
Here is the monster now:
public static List<KeyValuePair<string, string>> GetZtxt(FileStream stream)
{
    var ret = new List<KeyValuePair<string, string>>();
    try {
        stream.Position = 0;
        var br = new BinaryReader(stream, Encoding.ASCII);
        var head = br.ReadBytes(8); // The header is the same for all PNGs.
        if (!head.SequenceEqual(new byte[] { 0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A })) return null; // Not a PNG.
        while (stream.Position < stream.Length) {
            int len; // Length of the chunk data.
            if (BitConverter.IsLittleEndian)
                len = BitConverter.ToInt32(br.ReadBytes(4).Reverse().ToArray(), 0);
            else
                len = br.ReadInt32();
            char[] cName = br.ReadChars(4); // The chunk type.
            if (cName.SequenceEqual(new[] { 'z', 'T', 'X', 't' })) {
                var sb = new StringBuilder(); // Builds the null-terminated keyword associated with the chunk.
                char c = br.ReadChar();
                do {
                    sb.Append(c);
                    c = br.ReadChar();
                }
                while (c != '\0');
                byte method = br.ReadByte(); // The compression method. Should always be 0. (DEFLATE method.)
                if (method != 0) {
                    // Keyword (sb.Length bytes), terminator (1) and method byte (1) are already consumed,
                    // so the remaining data plus the 4 CRC bytes is len - sb.Length + 2.
                    stream.Seek(len - sb.Length + 2, SeekOrigin.Current); // If not 0, skip the rest of the chunk.
                    continue;
                }
                var data = br.ReadBytes(len - sb.Length - 2); // Rest of the chunk data...
                var ms = new MemoryStream(data, 0, data.Length); // ...in a MemoryStream...
                var ds = new DeflateStream(ms, CompressionMode.Decompress); // ...read by a DeflateStream...
                var sr = new StreamReader(ds); // ...and a StreamReader. Yeesh.
                var str = sr.ReadToEnd(); // !!! InvalidDataException !!!
                ret.Add(new KeyValuePair<string, string>(sb.ToString(), str));
                stream.Seek(4, SeekOrigin.Current); // Skip the CRC check.
            }
            else {
                stream.Seek(len + 4, SeekOrigin.Current); // Skip the rest of the chunk.
            }
        }
    }
    catch (IOException) { }
    catch (InvalidDataException) { }
    catch (ArgumentOutOfRangeException) { }
    return ret;
}
Once this is tackled, I'll need to write a function that ADDS these zTXt chunks to the file. So hopefully I'll understand how DeflateStream works once this is solved.
Thanks much!!
After all this time, I've finally found the problem. The data is in zlib format, which stores a bit more than just the raw DEFLATE data. The file is read properly if I just skip the 2 extra bytes right before the compressed data.
See this feedback page. (I did not submit that one.)
I'm wondering now: the values of those two bytes are 0x78 and 0x9C, respectively. If I find values other than those, should I assume the DEFLATE is going to fail?
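For reference, a minimal sketch of that two-byte skip applied to the code above (only the MemoryStream line changes; the 4-byte Adler-32 trailer at the end of the zlib data is simply left unread by DeflateStream):
var data = br.ReadBytes(len - sb.Length - 2);        // keyword, terminator and method byte already consumed
var ms = new MemoryStream(data, 2, data.Length - 2); // skip the 2-byte zlib header (e.g. 0x78 0x9C)
using (var ds = new DeflateStream(ms, CompressionMode.Decompress))
using (var sr = new StreamReader(ds))
{
    var str = sr.ReadToEnd(); // raw DEFLATE now parses cleanly
}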