byte[] header = new byte[]{255, 216};
string ascii = Encoding.ASCII.GetString(header);
I expect ascii to be equal to "FFD8" (the JPEG SOI marker).
Instead I get "????"
In this case you'd be better off comparing the byte arrays rather than converting to string.
If you must convert to a string, I suggest using the Latin-1 encoding (aka ISO-8859-1, aka code page 28591), as it maps every byte value in the range 0-255 to the Unicode character with the same value - convenient for this scenario. Any of the following will get this encoding:
Encoding.GetEncoding(28591)
Encoding.GetEncoding("Latin1")
Encoding.GetEncoding("ISO-8859-1")
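For example, a round trip through this encoding preserves every byte value; a small sketch:

```csharp
using System;
using System.Text;

byte[] header = { 0xFF, 0xD8 };
Encoding latin1 = Encoding.GetEncoding(28591);

string s = latin1.GetString(header);       // each byte becomes the code point with the same value
byte[] roundTripped = latin1.GetBytes(s);  // recovers 0xFF, 0xD8 exactly

Console.WriteLine(roundTripped[0] == 0xFF && roundTripped[1] == 0xD8); // True
```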
Yes, that's because ASCII is only 7-bit - it doesn't define any values above 127. Encodings typically decode unknown binary values to '?' (although this can be changed using DecoderFallback).
If you're about to mention "extended ASCII" I suspect you actually want Encoding.Default which is "the default code page for the operating system"... code page 1252 on most Western systems, I believe.
What characters were you expecting?
EDIT: As per the accepted answer (I suspect the question was edited after I added my answer; I don't recall seeing anything about JPEG originally) you shouldn't convert binary data to text unless it's genuinely encoded text data. JPEG data is binary data - so you should be checking the actual bytes against the expected bytes.
Any time you convert arbitrary binary data (such as images, music or video) into text using a "plain" text encoding (such as ASCII, UTF-8 etc) you risk data loss. If you have to convert it to text, use Base64 which is nice and safe. If you just want to compare it with expected binary data, however, it's best not to convert it to text at all.
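For instance, Base64 round-trips arbitrary bytes losslessly:

```csharp
using System;

byte[] data = { 0xFF, 0xD8, 0x00, 0x9C };          // arbitrary binary data
string text = Convert.ToBase64String(data);        // pure-ASCII, safe to transport as text
byte[] restored = Convert.FromBase64String(text);  // identical to the original bytes
```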
EDIT: Okay, here's a class to help with image detection for a given byte array. I haven't made it HTTP-specific; I'm not entirely sure whether you should really fetch the InputStream, read just a bit of it, and then fetch the stream again. I've ducked the issue by sticking to byte arrays :)
using System;
using System.Collections.Generic;
using System.Collections.ObjectModel;
using System.Linq;
public sealed class SignatureDetector
{
public static readonly SignatureDetector Png =
new SignatureDetector(0x89, 0x50, 0x4e, 0x47);
public static readonly SignatureDetector Bmp =
new SignatureDetector(0x42, 0x4d);
public static readonly SignatureDetector Gif =
new SignatureDetector(0x47, 0x49, 0x46);
public static readonly SignatureDetector Jpeg =
new SignatureDetector(0xff, 0xd8);
public static readonly IEnumerable<SignatureDetector> Images =
new ReadOnlyCollection<SignatureDetector>(new[]{Png, Bmp, Gif, Jpeg});
private readonly byte[] bytes;
public SignatureDetector(params byte[] bytes)
{
if (bytes == null)
{
throw new ArgumentNullException("bytes");
}
this.bytes = (byte[]) bytes.Clone();
}
public bool Matches(byte[] data)
{
if (data == null)
{
throw new ArgumentNullException("data");
}
if (data.Length < bytes.Length)
{
return false;
}
for (int i=0; i < bytes.Length; i++)
{
if (data[i] != bytes[i])
{
return false;
}
}
return true;
}
// Convenience method
public static bool IsImage(byte[] data)
{
return Images.Any(detector => detector.Matches(data));
}
}
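Usage might then look like this (a small sketch, assuming the class above is in scope):

```csharp
byte[] header = new byte[] { 0xFF, 0xD8 };

Console.WriteLine(SignatureDetector.Jpeg.Matches(header)); // True
Console.WriteLine(SignatureDetector.Png.Matches(header));  // False
Console.WriteLine(SignatureDetector.IsImage(header));      // True
```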
If you then wrote:
Console.WriteLine(ascii);
And expected "FFD8" to print out - that's not how GetString works. For that, you would need:
string ascii = String.Format("{0:X02}{1:X02}", header[0], header[1]);
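Alternatively, for arrays of any length, BitConverter produces the same hex rendering without a manual format string:

```csharp
byte[] header = new byte[] { 0xFF, 0xD8 };
string hex = BitConverter.ToString(header).Replace("-", ""); // "FFD8"
```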
I once wrote a custom encoder/decoder that encoded bytes 0-255 to Unicode characters 0-255 and back again.
It was only really useful for using string functions on something that isn't actually a string.
Are you sure "????" is the result?
What is the result of:
(int)ascii[0]
(int)ascii[1]
On the other hand, pure ASCII is 0-127 only...
Related
From a call to an external API my method receives an image/png as an IRestResponse where the Content property is a string representation.
I need to convert this string representation of image/png into a byte array without saving it first and then going File.ReadAllBytes. How can I achieve this?
You can try a hex-string-to-byte conversion. Here is a method I've used before. Please note that you may have to pad the hex string depending on how it arrives; the method will throw an error to let you know. However, if whoever sent the image converted it into bytes and then into a hex string (which they should have, based on what you are saying), then you won't have to worry about padding.
using System.Globalization;

public static byte[] HexToByte(string HexString)
{
    if (HexString.Length % 2 != 0)
        throw new ArgumentException("Invalid hex string length", "HexString");

    byte[] retArray = new byte[HexString.Length / 2];
    for (int i = 0; i < retArray.Length; ++i)
    {
        retArray[i] = byte.Parse(HexString.Substring(i * 2, 2),
            NumberStyles.HexNumber, CultureInfo.InvariantCulture);
    }
    return retArray;
}
This might not be the fastest solution by the way, but its a good representation of what needs to happen so you can optimize later.
This also assumes the string being sent to you is the raw bytes converted to a hex string. If the sender did anything like a Base58 conversion or something else, you will need to decode that first and then use this method.
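A quick sanity check of that method, assuming it is in scope:

```csharp
byte[] bytes = HexToByte("89504E47");
// bytes is { 0x89, 0x50, 0x4E, 0x47 } -- the start of the PNG signature
```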
I have found that the IRestResponse from RestSharp actually contains a 'RawBytes' property which is of the response content. This meets my needs and no conversion is necessary!
I'm trying to convert a string to a byte[] using the ASCIIEncoding class in the .NET library. The string will never contain non-ASCII characters, but it will usually have a length greater than 16. My code looks like the following:
public static byte[] Encode(string packet)
{
ASCIIEncoding enc = new ASCIIEncoding();
byte[] byteArray = enc.GetBytes(packet);
return byteArray;
}
By the end of the method, the byte array should be full of packet.Length number of bytes, but IntelliSense tells me that all bytes after byteArray[15] are literally question marks that cannot be observed. I used Wireshark to view byteArray after I sent it, and it was received on the other side fine, but the end device did not follow the instructions encoded in byteArray. I'm wondering if this has anything to do with IntelliSense not being able to display all elements in byteArray, or if my packet is completely wrong.
If your packet string basically contains characters in the range 0-255, then ASCIIEncoding is not what you should be using. ASCII only defines character codes 0-127; anything in the range 128-255 will get turned into question marks (as you have observed) because those characters are not defined in ASCII.
Consider using a method like this to convert the string to a byte array. (This assumes that the ordinal value of each character is in the range 0-255 and that the ordinal value is what you want.)
public static byte[] ToOrdinalByteArray(this string str)
{
if (str == null) { throw new ArgumentNullException("str"); }
var bytes = new byte[str.Length];
for (int i = 0; i < str.Length; ++i) {
// Wrapping the cast in checked() will trigger an OverflowException
// if the character being converted is out of range for a byte.
bytes[i] = checked((byte)str[i]);
}
return bytes;
}
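A quick illustration, assuming the extension method above is in scope:

```csharp
byte[] packet = "A\u00FF".ToOrdinalByteArray();
// packet[0] == 0x41; packet[1] == 0xFF -- a value that ASCIIEncoding
// would have replaced with '?' (0x3F).
```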
The Encoding class hierarchy is specifically designed for handling text. What you have here doesn't seem to be text, so you should avoid using these classes.
The standard encoders use the replacement character fallback strategy. If a character doesn't exist in the target character set, they encode a replacement character ('?' by default).
To me, that's worse than a silent failure; it's data corruption. I prefer that libraries tell me when my assumptions are wrong.
You can derive an encoder that throws an exception:
Encoding.GetEncoding(
"us-ascii",
new EncoderExceptionFallback(),
new DecoderExceptionFallback());
If you are truly using only characters in Unicode's ASCII range then you'll never see an exception.
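With those fallbacks in place the failure is loud; a small sketch:

```csharp
using System;
using System.Text;

class StrictAsciiDemo
{
    static void Main()
    {
        Encoding strictAscii = Encoding.GetEncoding(
            "us-ascii",
            new EncoderExceptionFallback(),
            new DecoderExceptionFallback());

        byte[] ok = strictAscii.GetBytes("hello");  // fine: all characters < 128

        try
        {
            strictAscii.GetBytes("caf\u00E9");      // 'é' is outside ASCII
        }
        catch (EncoderFallbackException)
        {
            Console.WriteLine("non-ASCII input rejected"); // corruption surfaced, not hidden
        }
    }
}
```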
I'm reading a file into byte[] buffer. The file contains a lot of UTF-16 strings (millions) in the following format:
The first byte contains the string length in chars (range 0..255).
The following bytes contain the string's characters in UTF-16 encoding (each char is represented by 2 bytes, so byteCount = charCount * 2).
I need to perform standard string operations for all strings in the file, for example: IndexOf, EndsWith and StartsWith, with StringComparison.OrdinalIgnoreCase and StringComparison.Ordinal.
For now my code first converting each string from byte array to System.String type. I found the following code to be the most efficient to do so:
// position/length validation removed to minimize the code
string result;
byte charLength = _buffer[_bufferI++];
int byteLength = charLength * 2;
fixed (byte* pBuffer = &_buffer[_bufferI])
{
result = new string((char*)pBuffer, 0, charLength);
}
_bufferI += byteLength;
return result;
Still, new string(char*, int, int) is very slow because it performs an unnecessary copy for each string.
The profiler says System.String.wstrcpy(char*, char*, int32) is what's slow.
I need a way to perform string operations without copying bytes for each string.
Is there a way to perform string operations on byte array directly?
Is there a way to create new string without copying its bytes?
No, you can't create a string without copying the character data.
The String object stores the metadata for the string (Length, etc.) in the same memory area as the character data, so you can't keep the character data in the byte array and pretend it's a String object.
You could try other ways of constructing the string from the byte data and see if any of them has less overhead, like Encoding.Unicode.GetString (UTF-16 little-endian is Encoding.Unicode in .NET).
If you are using a pointer, you could try to get multiple strings at a time, so that you don't have to fix the buffer for each string.
You could read the file using a StreamReader with Encoding.Unicode (UTF-16) so you do not have the "byte overhead" in between:
using (StreamReader sr = new StreamReader(filename, Encoding.Unicode))
{
string line;
while ((line = sr.ReadLine()) != null)
{
//Your Code
}
}
You could create extension methods on byte arrays to handle most of those string operations directly on the byte array and avoid the cost of converting. Not sure what all string operations you perform, so not sure if all of them could be accomplished this way.
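As one possible sketch of that idea for the length-prefixed UTF-16 layout described in the question (names and API shape are mine, and this handles only ordinal, case-sensitive comparison):

```csharp
public static class ByteStringExtensions
{
    // Ordinal comparison of the length-prefixed UTF-16 string at 'offset'
    // against 'prefix', without allocating a System.String.
    public static bool StartsWithOrdinal(this byte[] buffer, int offset, string prefix)
    {
        byte charLength = buffer[offset];
        if (prefix.Length > charLength) return false;

        for (int i = 0; i < prefix.Length; i++)
        {
            // Little-endian UTF-16: low byte first.
            char c = (char)(buffer[offset + 1 + i * 2] |
                            (buffer[offset + 2 + i * 2] << 8));
            if (c != prefix[i]) return false;
        }
        return true;
    }
}
```

OrdinalIgnoreCase would need per-character case folding on top of this, which is where such extensions get harder to keep correct.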
I need to determine, with about 80% certainty, whether a file is binary or text. Is there any way to do it, even quick and dirty/ugly, in C#?
There's a method called Markov Chains. Scan a few model files of both kinds and for each byte value from 0 to 255 gather stats (basically probability) of a subsequent value. This will give you a 64Kb (256x256) profile you can compare your runtime files against (within a % threshold).
Supposedly, this is how browsers' Auto-Detect Encoding feature works.
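A minimal sketch of that profiling idea (the normalization and distance metric here are my own choices, not a tuned implementation):

```csharp
using System;

static class ByteProfile
{
    // Build a 256x256 profile of byte-pair frequencies for a sample file.
    public static double[,] BuildProfile(byte[] sample)
    {
        var counts = new double[256, 256];
        for (int i = 1; i < sample.Length; i++)
            counts[sample[i - 1], sample[i]]++;

        // Normalize to probabilities so files of different sizes are comparable.
        double total = Math.Max(1, sample.Length - 1);
        for (int a = 0; a < 256; a++)
            for (int b = 0; b < 256; b++)
                counts[a, b] /= total;
        return counts;
    }

    // Compare two profiles; a smaller distance means more similar byte statistics.
    public static double Distance(double[,] p, double[,] q)
    {
        double d = 0;
        for (int a = 0; a < 256; a++)
            for (int b = 0; b < 256; b++)
                d += Math.Abs(p[a, b] - q[a, b]);
        return d;
    }
}
```

You would build one profile from known text files, one from known binaries, and classify a new file by whichever profile it is closer to within your chosen threshold.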
I would probably look for an abundance of control characters, which would typically be present in a binary file but rarely in a text file. Binary files tend to use 0 enough that just testing for many 0 bytes would probably be sufficient to catch most files. If you care about localization you'd need to test multi-byte patterns as well.
As stated though, you can always be unlucky and get a binary file that looks like text or vice versa.
Sharing my solution in the hope it helps others as it helps me from these posts and forums.
Background
I had been researching and exploring a solution for the same problem, expecting it to be simple or only slightly twisted.
However, most of the attempts here and elsewhere offer convoluted solutions that dive into Unicode, the UTF series, BOMs, encodings and byte orders. In the process I also went off-road into ASCII tables and code pages.
Anyway, I came up with a solution based on a stream reader and a custom control-character check.
It is built taking into considerations various hints and tips provided on the forum and elsewhere such as:
Check for lot of control characters for example looking for multiple consecutive null characters.
Check for UTF, Unicode, Encodings, BOM, Byte Orders and similar aspects.
My goal is:
It should not rely on byte orders, encodings and other more involved esoteric work.
It should be relatively easy to implement and easy to understand.
It should work on all types of files.
The solution presented works for me on test data that includes mp3, eml, txt, info, flv, mp4, pdf, gif, png, jpg. It gives results as expected so far.
How the solution works
I am relying on the StreamReader default constructor to do what it can do best with respect to determining file encoding related characteristics which uses UTF8Encoding by default.
I created my own version of the control-character check because Char.IsControl does not seem useful here. Its documentation says:

Control characters are formatting and other non-printing characters, such as ACK, BEL, CR, FF, LF, and VT. The Unicode standard assigns code points from \U0000 to \U001F, \U007F, and from \U0080 to \U009F to control characters. These values are to be interpreted as control characters unless their use is otherwise defined by an application.

That makes it not useful here, since Char.IsControl counts CR and LF (among other things) as control characters, and text files contain CR and LF at the very least.
Solution
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Windows.Forms; // for Clipboard

static void testBinaryFile(string folderPath)
{
List<string> output = new List<string>();
foreach (string filePath in getFiles(folderPath, true))
{
output.Add(isBinary(filePath).ToString() + " ---- " + filePath);
}
Clipboard.SetText(string.Join("\n", output), TextDataFormat.Text);
}
public static List<string> getFiles(string path, bool recursive = false)
{
return Directory.Exists(path) ?
Directory.GetFiles(path, "*.*",
recursive ? SearchOption.AllDirectories : SearchOption.TopDirectoryOnly).ToList() :
new List<string>();
}
public static long getSize(string path)
{
// Helper: file size in bytes
return new FileInfo(path).Length;
}
public static bool isBinary(string path)
{
long length = getSize(path);
if (length == 0) return false;
using (StreamReader stream = new StreamReader(path))
{
int ch;
while ((ch = stream.Read()) != -1)
{
if (isControlChar(ch))
{
return true;
}
}
}
return false;
}
public static bool isControlChar(int ch)
{
return (ch > Chars.NUL && ch < Chars.BS)
|| (ch > Chars.CR && ch < Chars.SUB);
}
public static class Chars
{
public static char NUL = (char)0; // Null char
public static char BS = (char)8; // Back Space
public static char CR = (char)13; // Carriage Return
public static char SUB = (char)26; // Substitute
}
If you try the above solution, let me know whether or not it works for you.
Other interesting and related links:
About UTF and BOM on Unicode.org
Unicode sample files
How to detect encoding of text file and
Detect file encoding in Csharp
While this isn't foolproof, this should check to see if it has any binary content.
public bool HasBinaryContent(string content)
{
return content.Any(ch => char.IsControl(ch) && ch != '\r' && ch != '\n');
}
Because if any control character exists (aside from the standard \r\n), then it is probably not a text file.
If the real question here is "Can this file be read and written using StreamReader/StreamWriter without modification?", then the answer is here:
/// <summary>
/// Detect if a file is text and detect the encoding.
/// </summary>
/// <param name="encoding">
/// The detected encoding.
/// </param>
/// <param name="fileName">
/// The file name.
/// </param>
/// <param name="windowSize">
/// The number of characters to use for testing.
/// </param>
/// <returns>
/// true if the file is text.
/// </returns>
public static bool IsText(out Encoding encoding, string fileName, int windowSize)
{
using (var fileStream = File.OpenRead(fileName))
{
var rawData = new byte[windowSize];
var text = new char[windowSize];
var isText = true;
// Read raw bytes
var rawLength = fileStream.Read(rawData, 0, rawData.Length);
fileStream.Seek(0, SeekOrigin.Begin);
// Detect encoding correctly (from Rick Strahl's blog)
// http://www.west-wind.com/weblog/posts/2007/Nov/28/Detecting-Text-Encoding-for-StreamReader
if (rawData[0] == 0xef && rawData[1] == 0xbb && rawData[2] == 0xbf)
{
encoding = Encoding.UTF8;
}
else if (rawData[0] == 0xfe && rawData[1] == 0xff)
{
encoding = Encoding.Unicode;
}
else if (rawData[0] == 0 && rawData[1] == 0 && rawData[2] == 0xfe && rawData[3] == 0xff)
{
encoding = Encoding.UTF32;
}
else if (rawData[0] == 0x2b && rawData[1] == 0x2f && rawData[2] == 0x76)
{
encoding = Encoding.UTF7;
}
else
{
encoding = Encoding.Default;
}
// Read text and detect the encoding
using (var streamReader = new StreamReader(fileStream))
{
streamReader.Read(text, 0, text.Length);
}
using (var memoryStream = new MemoryStream())
{
using (var streamWriter = new StreamWriter(memoryStream, encoding))
{
// Write the text to a buffer
streamWriter.Write(text);
streamWriter.Flush();
// Get the buffer from the memory stream for comparison
var memoryBuffer = memoryStream.GetBuffer();
// Compare only bytes read
for (var i = 0; i < rawLength && isText; i++)
{
isText = rawData[i] == memoryBuffer[i];
}
}
}
return isText;
}
}
Great question! I was surprised myself that .NET does not provide an easy solution for this.
The following code worked for me to distinguish between images (png, jpg etc) and text files.
I just checked for consecutive nulls (0x00) in the first 512 bytes, as per suggestions by Ron Warholic and Adam Bruss:
if (File.Exists(path))
{
// Is it binary? Check for consecutive nulls..
byte[] content = File.ReadAllBytes(path);
for (int i = 1; i < 512 && i < content.Length; i++) {
if (content[i] == 0x00 && content[i-1] == 0x00) {
return Convert.ToBase64String(content);
}
}
// No? return text
return File.ReadAllText(path);
}
Obviously this is a quick-and-dirty approach; however, it can easily be extended by breaking the file into 10 chunks of 512 bytes each and checking each of them for consecutive nulls (personally, I would deduce it's a binary file if 2 or 3 of the chunks match - nulls are rare in text files).
That should provide a pretty good solution for what you are after.
Quick and dirty is to use the file extension and look for common, text extensions such as .txt. For this, you can use the Path.GetExtension call. Anything else would not really be classed as "quick", though it may well be dirty.
A really really really dirty way would be to build a regex that takes only standard text, punctuation, symbols, and whitespace characters, load up a portion of the file in a text stream, then run it against the regex. Depending on what qualifies as a pure text file in your problem domain, no successful matches would indicate a binary file.
To account for unicode, make sure to mark the encoding on your stream as such.
This is really suboptimal, but you said quick and dirty.
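As a sketch of that regex idea (the character class, window size, and method name are arbitrary choices of mine):

```csharp
using System.IO;
using System.Text.RegularExpressions;

static class TextSniffer
{
    // Reads a small window of the file and tests it against a "printable text"
    // pattern: letters, digits, punctuation, symbols and whitespace only.
    public static bool LooksLikeText(string path)
    {
        string sample;
        using (var reader = new StreamReader(path))
        {
            var buffer = new char[512];
            int read = reader.Read(buffer, 0, buffer.Length);
            sample = new string(buffer, 0, read);
        }
        return Regex.IsMatch(sample, @"\A[\p{L}\p{N}\p{P}\p{S}\s]*\z");
    }
}
```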
http://codesnipers.com/?q=node/68 describes how to detect UTF-16 vs. UTF-8 using a Byte Order Mark (which may appear in your file). It also suggests looping through some bytes to see if they conform to the UTF-8 multi-byte sequence pattern (below) to determine if your file is a text file.
0xxxxxxx                             ASCII, code point < 0x80 (128)
110xxxxx 10xxxxxx                    2-byte sequence, code point >= 0x80
1110xxxx 10xxxxxx 10xxxxxx           3-byte sequence, code point >= 0x800
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx  4-byte sequence, code point >= 0x10000
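A rough sketch of that loop, using the lead-byte patterns above (continuation-byte checks only; not a full validator, and the name is my own):

```csharp
static class Utf8Check
{
    // Returns true if the first 'length' bytes look like well-formed UTF-8.
    public static bool LooksLikeUtf8(byte[] data, int length)
    {
        int i = 0;
        while (i < length)
        {
            byte b = data[i];
            int extra =
                b < 0x80 ? 0 :              // 0xxxxxxx: ASCII
                (b & 0xE0) == 0xC0 ? 1 :    // 110xxxxx: 2-byte sequence
                (b & 0xF0) == 0xE0 ? 2 :    // 1110xxxx: 3-byte sequence
                (b & 0xF8) == 0xF0 ? 3 :    // 11110xxx: 4-byte sequence
                -1;                         // invalid lead byte
            if (extra < 0 || i + extra >= length) return false;

            // Every continuation byte must match 10xxxxxx.
            for (int j = 1; j <= extra; j++)
                if ((data[i + j] & 0xC0) != 0x80) return false;
            i += extra + 1;
        }
        return true;
    }
}
```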
How about another way: determine the length of the binary array representing the file's contents, and compare it with the length of the string you get after converting that binary array to text.
If the lengths are the same, there are no "non-readable" symbols in the file, so it's text (I'm about 80% sure).
Another way is to detect the file's charset using UDE. If a charset is detected successfully, you can be sure the file is text; otherwise it is binary, because binary data has no charset.
Of course you can use a charset-detection library other than UDE; if the library is good enough, this approach can get close to 100% correctness.