I would like to read a DICOM file in C#. I don't want to do anything fancy; for now I just want to know how to read in the data elements, but first I would like to know how to read the header to check whether it is a valid DICOM file.
It consists of Binary Data Elements. The first 128 bytes are unused (set to zero), followed by the string 'DICM'. This is followed by header information, which is organized into groups.
A sample DICOM header
First 128 bytes: unused by the DICOM format.
Followed by the characters 'D','I','C','M'
Followed by extra header information such as:
0002,0000, File Meta Elements Groups Len: 132
0002,0001, File Meta Info Version: 256
0002,0010, Transfer Syntax UID: 1.2.840.10008.1.2.1.
0008,0000, Identifying Group Length: 152
0008,0060, Modality: MR
0008,0070, Manufacturer: MRIcro
In the above example, the header is organized into groups. Group 0002 (hex) is the file meta information group, which contains three elements: one defines the group length, one stores the file version, and the third stores the transfer syntax.
Questions
How do I read the header and verify that it is a DICOM file by checking for the 'D','I','C','M' characters after the 128-byte preamble?
How do I continue to parse the file reading the other parts of the data?
Something like this should read the file; it's basic and doesn't handle all cases, but it would be a starting point:
public void ReadFile(string filename)
{
    using (FileStream fs = File.OpenRead(filename))
    {
        fs.Seek(128, SeekOrigin.Begin);
        if (fs.ReadByte() != (byte)'D' ||
            fs.ReadByte() != (byte)'I' ||
            fs.ReadByte() != (byte)'C' ||
            fs.ReadByte() != (byte)'M')
        {
            Console.WriteLine("Not a DCM");
            return;
        }

        BinaryReader reader = new BinaryReader(fs);
        ushort g;
        ushort e;
        do
        {
            g = reader.ReadUInt16();
            e = reader.ReadUInt16();
            string vr = new string(reader.ReadChars(2));
            long length;
            if (vr.Equals("AE") || vr.Equals("AS") || vr.Equals("AT")
                || vr.Equals("CS") || vr.Equals("DA") || vr.Equals("DS")
                || vr.Equals("DT") || vr.Equals("FL") || vr.Equals("FD")
                || vr.Equals("IS") || vr.Equals("LO") || vr.Equals("PN")
                || vr.Equals("SH") || vr.Equals("SL") || vr.Equals("SS")
                || vr.Equals("ST") || vr.Equals("TM") || vr.Equals("UI")
                || vr.Equals("UL") || vr.Equals("US"))
            {
                // Short-form VR: the length is a 16-bit value.
                length = reader.ReadUInt16();
            }
            else
            {
                // Long-form VR: skip the two reserved bytes, then read a 32-bit length.
                reader.ReadUInt16();
                length = reader.ReadUInt32();
            }
            byte[] val = reader.ReadBytes((int)length);
        } while (g == 2);
    }
}
The code does not take into account that the transfer syntax of the encoded data can change after the group 0002 elements, and it doesn't do anything with the actual values read in.
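As a rough sketch of how one might react to the transfer syntax once element (0002,0010) has been read, reusing the g, e, and val variables from the loop above. The UID strings are the standard DICOM transfer syntax values; the rest is an assumption, not part of the original answer:

if (g == 2 && e == 0x0010)
{
    // DICOM UI values are ASCII, padded to an even length with a NUL byte.
    string transferSyntax = Encoding.ASCII.GetString(val).TrimEnd('\0', ' ');
    bool explicitVr = transferSyntax != "1.2.840.10008.1.2";  // implicit VR little endian
    bool bigEndian = transferSyntax == "1.2.840.10008.1.2.2"; // explicit VR big endian
    // Once group 0002 ends, the reading loop would need to switch to the
    // matching VR handling and byte order before parsing further elements.
}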
Just some pseudologic
How do I read the header and verify that it is a DICOM file by checking for the 'D','I','C','M' characters after the 128-byte preamble?
Open it as a binary file, using File.OpenRead
Seek to position 128, read 4 bytes into an array, and compare it against the byte[] value for DICM. You can use ASCIIEncoding.GetBytes() for that (see the sketch after this list)
How do I continue to parse the file reading the other parts of the data?
Continue reading the file using Read or ReadByte on the FileStream handle that you opened earlier
Use the same method as above to do your comparison.
Don't forget to close and dispose of the file.
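A minimal sketch of those steps, assuming a well-formed file (the method name is mine):

using System;
using System.IO;
using System.Text;

static bool HasDicmMagic(string path)
{
    using (FileStream fs = File.OpenRead(path))
    {
        if (fs.Length < 132)
            return false;                   // too short for preamble + magic

        fs.Seek(128, SeekOrigin.Begin);     // skip the 128-byte preamble
        byte[] magic = new byte[4];
        fs.Read(magic, 0, 4);

        byte[] dicm = Encoding.ASCII.GetBytes("DICM");
        for (int i = 0; i < 4; i++)
            if (magic[i] != dicm[i])
                return false;
        return true;
    }
}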
You can also do it like this:
FileStream fs = File.OpenRead(path);
byte[] data = new byte[132];
fs.Read(data, 0, data.Length);
int b0 = data[0] & 255, b1 = data[1] & 255, b2 = data[2] & 255, b3 = data[3] & 255;
if (data[128] == 68 && data[129] == 73 && data[130] == 67 && data[131] == 77)
{
    // DICOM file with the standard preamble: bytes 128-131 are 'D','I','C','M'.
}
else if ((b0 == 8 || b0 == 2) && b1 == 0 && b3 == 0)
{
    // Likely a DICOM file without a preamble: the data starts directly with
    // a little-endian tag from group 0x0002 or 0x0008.
}
Taken from EvilDicom.Helper.DicomReader from the Evil Dicom library:
public static bool IsValidDicom(BinaryReader r)
{
    try
    {
        // 128 null bytes
        byte[] nullBytes = new byte[128];
        r.Read(nullBytes, 0, 128);
        foreach (byte b in nullBytes)
        {
            if (b != 0x00)
            {
                // Not valid
                Console.WriteLine("Missing 128 null byte preamble. Not a valid DICOM file!");
                return false;
            }
        }
    }
    catch (Exception)
    {
        Console.WriteLine("Could not read 128 null byte preamble. Perhaps the file is too short.");
        return false;
    }
    try
    {
        // 4 DICM characters
        char[] dicm = new char[4];
        r.Read(dicm, 0, 4);
        if (dicm[0] != 'D' || dicm[1] != 'I' || dicm[2] != 'C' || dicm[3] != 'M')
        {
            // Not valid
            Console.WriteLine("Missing characters D I C M in bytes 128-131. Not a valid DICOM file!");
            return false;
        }
        return true;
    }
    catch (Exception)
    {
        Console.WriteLine("Could not read DICM letters in bytes 128-131.");
        return false;
    }
}
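Example usage of the method above. Note, as an aside, that the DICOM standard does not actually require the preamble bytes to be zero, so this check is stricter than the file format demands:

using (var reader = new BinaryReader(File.OpenRead("image.dcm")))
{
    Console.WriteLine(IsValidDicom(reader) ? "Valid DICOM" : "Not a DICOM file");
}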
I have a Hex string that I am converting into characters. This is the function I am using.
public string GetAsciiString(bool replaceNewline = true)
{
    char[] chars = new char[data.Length + 1];
    byte[] bytes = new byte[data.Length + 1];
    bytes[0] = opcode;
    Array.Copy(data, 0, bytes, 1, data.Length);

    for (int i = 0; i < bytes.Length; ++i)
    {
        byte value = bytes[i];
        if ((value == '\n' || value == '\r') && !replaceNewline)
            chars[i] = (char)value;
        else if (value < 32 || value > 126)
            chars[i] = '.';
        else chars[i] = (char)value;
    }
    return new string(chars);
}
However, this only displays English characters and not Korean. Any idea how I can get it to display Korean?
Edit: I see the issue was that I was converting to ASCII.
Okay, there are multiple ways to go at this.
The operation to turn byte arrays into strings is known as decoding. In .NET, this is done with the Encoding class. In your case, you must first find the corresponding encoding. Looking at the documentation linked above, encoding ks_c_5601-1987 corresponds to code page 949, so:
var encoding = Encoding.GetEncoding(949);
Once you have that encoding, you can use it to decode the bytes:
var text = encoding.GetString(bytes);
This leaves the matter of those bytes. If you already have a byte array, you're good to go. If you have a hex string, I suggest you look into existing questions for this, for instance: https://stackoverflow.com/a/311165/.
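For completeness, a small sketch of the kind of hex-to-bytes conversion the linked answer covers (the helper name is mine):

static byte[] HexToBytes(string hex)
{
    // Assumes an even-length string of hex digits, e.g. "1A2B3C4D".
    var bytes = new byte[hex.Length / 2];
    for (int i = 0; i < bytes.Length; i++)
        bytes[i] = Convert.ToByte(hex.Substring(i * 2, 2), 16);
    return bytes;
}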
If your data is in a stream, I recommend you use a StreamReader for this instead:
using (var reader = new StreamReader(stream, encoding)) // use encoding to decode from stream
{
    var text = reader.ReadToEnd();
}
I have a set of markdown files to be passed to a Jekyll project, and I need to find their encoding format, i.e. UTF-8 with BOM, UTF-8 without BOM, or ANSI, using a program or an API.
If I pass the location of the files, the files have to be listed, read, and the encoding produced as the result.
Is there any code or API for it?
I have already tried sr.CurrentEncoding on a StreamReader, as mentioned in Effective way to find any file's Encoding, but the result varies from the result in Notepad++.
I also tried https://github.com/errepi/ude (Mozilla Universal Charset Detector), as suggested in https://social.msdn.microsoft.com/Forums/vstudio/en-US/862e3342-cc88-478f-bca2-e2de6f60d2fb/detect-encoding-of-the-file?forum=csharpgeneral, by including the ude.dll in the C# project, but the result is not as expected: Notepad++ shows the file encoding as UTF-8, while the program reports UTF-8 with BOM.
I should get the same result from both, so where is the problem?
Detecting encoding is always a tricky business, but detecting BOMs is dead simple. To get the BOM as a byte array, just use the GetPreamble() function of the encoding objects. This should allow you to detect a whole range of encodings by preamble.
Now, as for detecting UTF-8 without a preamble, that's not very hard either. UTF-8 has strict bitwise rules about what values are expected in a valid sequence, and you can initialize a UTF8Encoding object in a way that will fail by throwing an exception when these sequences are incorrect.
So if you first do the BOM check, then the strict decoding check, and finally fall back to Win-1252 encoding (what you call "ANSI"), your detection is done.
Byte[] bytes = File.ReadAllBytes(filename);
Encoding encoding = null;
String text = null;

// Test UTF8 with BOM. This check can easily be copied and adapted
// to detect many other encodings that use BOMs.
UTF8Encoding encUtf8Bom = new UTF8Encoding(true, true);
Boolean couldBeUtf8 = true;
Byte[] preamble = encUtf8Bom.GetPreamble();
Int32 prLen = preamble.Length;
if (bytes.Length >= prLen && preamble.SequenceEqual(bytes.Take(prLen)))
{
    // UTF8 BOM found; use encUtf8Bom to decode.
    try
    {
        // Seems that despite being an encoding with preamble,
        // it doesn't actually skip said preamble when decoding...
        text = encUtf8Bom.GetString(bytes, prLen, bytes.Length - prLen);
        encoding = encUtf8Bom;
    }
    catch (ArgumentException)
    {
        // Confirmed as not UTF-8!
        couldBeUtf8 = false;
    }
}

// Use the boolean to skip this if it's already confirmed as incorrect UTF-8 decoding.
if (couldBeUtf8 && encoding == null)
{
    // Test UTF-8 on strict encoding rules. Note that on pure ASCII this will
    // succeed as well, since valid ASCII is automatically valid UTF-8.
    UTF8Encoding encUtf8NoBom = new UTF8Encoding(false, true);
    try
    {
        text = encUtf8NoBom.GetString(bytes);
        encoding = encUtf8NoBom;
    }
    catch (ArgumentException)
    {
        // Confirmed as not UTF-8!
    }
}

// Fall back to default ANSI encoding.
if (encoding == null)
{
    encoding = Encoding.GetEncoding(1252);
    text = encoding.GetString(bytes);
}
Note that Windows-1252 (US / Western European ANSI) is a one-byte-per-character encoding, meaning everything in it produces a technically valid character, so unless you go for heuristic methods, no further detection can be done on it to distinguish it from other one-byte-per-character encodings.
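To illustrate that last point, a quick demonstration (the example bytes are mine): decoding as Windows-1252 never throws, so a successful decode proves nothing by itself.

// Any byte sequence decodes without an exception under Windows-1252,
// including bytes that are undefined in the code page.
var ansi = Encoding.GetEncoding(1252);
string decoded = ansi.GetString(new byte[] { 0x00, 0x81, 0xFF }); // never throws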
Necromancing.
First, you check the Byte-Order Mark.
If that doesn't work, you can try to infer the encoding from the text content with the Mozilla Universal Charset Detector C# port.
If that doesn't work, you just return the CurrentCulture/InstalledUiCulture/system encoding - or whatever.
If the system encoding doesn't work, we can return either ASCII or UTF8. Since entries 0-127 of UTF8 are identical to ASCII, we simply return UTF8.
Example (DetectOrGuessEncoding):
namespace SQLMerge
{

    class EncodingDetector
    {

        public static System.Text.Encoding BomInfo(string srcFile)
        {
            return BomInfo(srcFile, false);
        } // End Function BomInfo

        public static System.Text.Encoding BomInfo(string srcFile, bool thorough)
        {
            byte[] b = new byte[5];

            using (System.IO.FileStream file = new System.IO.FileStream(srcFile, System.IO.FileMode.Open, System.IO.FileAccess.Read, System.IO.FileShare.Read))
            {
                int numRead = file.Read(b, 0, 5);
                if (numRead < 5)
                    System.Array.Resize(ref b, numRead);

                file.Close();
            } // End Using file

            if (b.Length >= 4 && b[0] == 0x00 && b[1] == 0x00 && b[2] == 0xFE && b[3] == 0xFF) // UTF32-BE
                return System.Text.Encoding.GetEncoding("utf-32BE"); // UTF-32, big-endian
            else if (b.Length >= 4 && b[0] == 0xFF && b[1] == 0xFE && b[2] == 0x00 && b[3] == 0x00) // UTF32-LE
                return System.Text.Encoding.UTF32; // UTF-32, little-endian
            // https://en.wikipedia.org/wiki/Byte_order_mark#cite_note-14
            else if (b.Length >= 4 && b[0] == 0x2b && b[1] == 0x2f && b[2] == 0x76 && (b[3] == 0x38 || b[3] == 0x39 || b[3] == 0x2B || b[3] == 0x2F)) // UTF7
                return System.Text.Encoding.UTF7; // UTF-7
            else if (b.Length >= 3 && b[0] == 0xEF && b[1] == 0xBB && b[2] == 0xBF) // UTF-8
                return System.Text.Encoding.UTF8; // UTF-8
            else if (b.Length >= 2 && b[0] == 0xFE && b[1] == 0xFF) // UTF16-BE
                return System.Text.Encoding.BigEndianUnicode; // UTF-16, big-endian
            else if (b.Length >= 2 && b[0] == 0xFF && b[1] == 0xFE) // UTF16-LE
                return System.Text.Encoding.Unicode; // UTF-16, little-endian

            // Maybe there is a future encoding ...
            // PS: The above yields more than this - this doesn't find UTF7 ...
            if (thorough)
            {
                System.Collections.Generic.List<System.Collections.Generic.KeyValuePair<System.Text.Encoding, byte[]>> lsPreambles =
                    new System.Collections.Generic.List<System.Collections.Generic.KeyValuePair<System.Text.Encoding, byte[]>>();

                foreach (System.Text.EncodingInfo ei in System.Text.Encoding.GetEncodings())
                {
                    System.Text.Encoding enc = ei.GetEncoding();

                    byte[] preamble = enc.GetPreamble();

                    if (preamble == null)
                        continue;

                    if (preamble.Length == 0)
                        continue;

                    if (preamble.Length > b.Length)
                        continue;

                    System.Collections.Generic.KeyValuePair<System.Text.Encoding, byte[]> kvp =
                        new System.Collections.Generic.KeyValuePair<System.Text.Encoding, byte[]>(enc, preamble);

                    lsPreambles.Add(kvp);
                } // Next ei

                // li.Sort((a, b) => a.CompareTo(b)); // ascending sort
                // li.Sort((a, b) => b.CompareTo(a)); // descending sort
                lsPreambles.Sort(
                    delegate (
                        System.Collections.Generic.KeyValuePair<System.Text.Encoding, byte[]> kvp1,
                        System.Collections.Generic.KeyValuePair<System.Text.Encoding, byte[]> kvp2)
                    {
                        return kvp2.Value.Length.CompareTo(kvp1.Value.Length);
                    }
                );

                for (int j = 0; j < lsPreambles.Count; ++j)
                {
                    for (int i = 0; i < lsPreambles[j].Value.Length; ++i)
                    {
                        if (b[i] != lsPreambles[j].Value[i])
                        {
                            goto NEXT_J_AND_NOT_NEXT_I;
                        }
                    } // Next i

                    return lsPreambles[j].Key;

                NEXT_J_AND_NOT_NEXT_I: continue;
                } // Next j

            } // End if (thorough)

            return null;
        } // End Function BomInfo

        public static System.Text.Encoding DetectOrGuessEncoding(string fileName)
        {
            return DetectOrGuessEncoding(fileName, false);
        }

        public static System.Text.Encoding DetectOrGuessEncoding(string fileName, bool withOutput)
        {
            if (!System.IO.File.Exists(fileName))
                return null;

            System.ConsoleColor origBack = System.ConsoleColor.Black;
            System.ConsoleColor origFore = System.ConsoleColor.White;

            if (withOutput)
            {
                origBack = System.Console.BackgroundColor;
                origFore = System.Console.ForegroundColor;
            }

            // System.Text.Encoding systemEncoding = System.Text.Encoding.Default; // Returns hard-coded UTF8 on .NET Core ...
            System.Text.Encoding systemEncoding = GetSystemEncoding();
            System.Text.Encoding enc = BomInfo(fileName);
            if (enc != null)
            {
                if (withOutput)
                {
                    System.Console.BackgroundColor = System.ConsoleColor.Green;
                    System.Console.ForegroundColor = System.ConsoleColor.White;
                    System.Console.WriteLine(fileName);
                    System.Console.WriteLine(enc);
                    System.Console.BackgroundColor = origBack;
                    System.Console.ForegroundColor = origFore;
                }

                return enc;
            }

            using (System.IO.Stream strm = System.IO.File.OpenRead(fileName))
            {
                UtfUnknown.DetectionResult detect = UtfUnknown.CharsetDetector.DetectFromStream(strm);

                if (detect != null && detect.Details != null && detect.Details.Count > 0 && detect.Details[0].Confidence < 1)
                {
                    if (withOutput)
                    {
                        System.Console.BackgroundColor = System.ConsoleColor.Red;
                        System.Console.ForegroundColor = System.ConsoleColor.White;
                        System.Console.WriteLine(fileName);
                        System.Console.WriteLine(detect);
                        System.Console.BackgroundColor = origBack;
                        System.Console.ForegroundColor = origFore;
                    }

                    foreach (UtfUnknown.DetectionDetail detail in detect.Details)
                    {
                        if (detail.Encoding == systemEncoding
                            || detail.Encoding == System.Text.Encoding.UTF8
                        )
                            return detail.Encoding;
                    }

                    return detect.Details[0].Encoding;
                }
                else if (detect != null && detect.Details != null && detect.Details.Count > 0)
                {
                    if (withOutput)
                    {
                        System.Console.BackgroundColor = System.ConsoleColor.Green;
                        System.Console.ForegroundColor = System.ConsoleColor.White;
                        System.Console.WriteLine(fileName);
                        System.Console.WriteLine(detect);
                        System.Console.BackgroundColor = origBack;
                        System.Console.ForegroundColor = origFore;
                    }

                    return detect.Details[0].Encoding;
                }

                enc = GetSystemEncoding();

                if (withOutput)
                {
                    System.Console.BackgroundColor = System.ConsoleColor.DarkRed;
                    System.Console.ForegroundColor = System.ConsoleColor.Yellow;
                    System.Console.WriteLine(fileName);
                    System.Console.Write("Assuming ");
                    System.Console.Write(enc.WebName);
                    System.Console.WriteLine("...");
                    System.Console.BackgroundColor = origBack;
                    System.Console.ForegroundColor = origFore;
                }

                return systemEncoding;
            } // End Using strm

        } // End Function DetectOrGuessEncoding

        public static System.Text.Encoding GetSystemEncoding()
        {
            // The OEM code page for use by legacy console applications
            // int oem = System.Globalization.CultureInfo.CurrentCulture.TextInfo.OEMCodePage;

            // The ANSI code page for use by legacy GUI applications
            // int ansi = System.Globalization.CultureInfo.InstalledUICulture.TextInfo.ANSICodePage; // Machine
            int ansi = System.Globalization.CultureInfo.CurrentCulture.TextInfo.ANSICodePage; // User

            try
            {
                // https://stackoverflow.com/questions/38476796/how-to-set-net-core-in-if-statement-for-compilation
#if ( NETSTANDARD && !NETSTANDARD1_0 ) || NETCORE || NETCOREAPP3_0 || NETCOREAPP3_1
                System.Text.Encoding.RegisterProvider(System.Text.CodePagesEncodingProvider.Instance);
#endif

                System.Text.Encoding enc = System.Text.Encoding.GetEncoding(ansi);
                return enc;
            }
            catch (System.Exception)
            { }

            try
            {
                foreach (System.Text.EncodingInfo ei in System.Text.Encoding.GetEncodings())
                {
                    System.Text.Encoding e = ei.GetEncoding();

                    // 20'127: US-ASCII
                    if (e.WindowsCodePage == ansi && e.CodePage != 20127)
                    {
                        return e;
                    }
                }
            }
            catch (System.Exception)
            { }

            // return System.Text.Encoding.GetEncoding("iso-8859-1");
            return System.Text.Encoding.UTF8;
        } // End Function GetSystemEncoding

    } // End Class

}
namespace WindowsFormsApp2
{
    public partial class Form1 : Form
    {
        public Form1()
        {
            InitializeComponent();
        }

        private void button1_Click(object sender, EventArgs e)
        {
            List<FilePath> filePaths = new List<FilePath>();
            filePaths = GetLstPaths();
        }

        public static List<FilePath> GetLstPaths()
        {
            #region Getting Files

            DirectoryInfo directoryInfo = new DirectoryInfo(@"C:\Users\Safi\Desktop\ss\");
            DirectoryInfo directoryTargetInfo = new DirectoryInfo(@"C:\Users\Safi\Desktop\ss1\");
            FileInfo[] fileInfos = directoryInfo.GetFiles("*.txt");
            List<FilePath> lstFiles = new List<FilePath>();
            foreach (FileInfo fileInfo in fileInfos)
            {
                Encoding enco = GetLittleIndianFiles(directoryInfo + fileInfo.Name);
                string filePath = directoryInfo + fileInfo.Name;
                string targetFilePath = directoryTargetInfo + fileInfo.Name;
                if (enco != null)
                {
                    FilePath f1 = new FilePath();
                    f1.filePath = filePath;
                    f1.targetFilePath = targetFilePath;
                    lstFiles.Add(f1);
                }
            }
            int count = 0;
            lstFiles.ForEach(d =>
            {
                count++;
            });
            MessageBox.Show(Convert.ToString(count) + " Files are Converted");

            #endregion

            return lstFiles;
        }

        public static Encoding GetLittleIndianFiles(string srcFile)
        {
            byte[] b = new byte[5];
            using (System.IO.FileStream file = new System.IO.FileStream(srcFile, System.IO.FileMode.Open, System.IO.FileAccess.Read, System.IO.FileShare.Read))
            {
                int numRead = file.Read(b, 0, 5);
                if (numRead < 5)
                    System.Array.Resize(ref b, numRead);

                file.Close();
            } // End Using file

            if (b.Length >= 2 && b[0] == 0xFF && b[1] == 0xFE)
                return System.Text.Encoding.Unicode; // UTF-16, little-endian

            return null;
        }
    }

    public class FilePath
    {
        public string filePath { get; set; }
        public string targetFilePath { get; set; }
    }
}
I am looking to read the next UTF8 character from a Stream or BinaryReader. Things that don't work:
BinaryReader::ReadChar -- this will throw on a 3 or 4 byte character. Since it returns a two byte structure, it has no choice.
BinaryReader::ReadChars -- this will throw if you ask it to read 1 character and it encounters a 3 or 4 byte character. Will read multiple characters if you ask it to read more than 1 character.
StreamReader::Read -- this needs to know how many bytes to read, but the number of bytes in a UTF8 character is variable.
The code I have that seems to work:
private char[] ReadUTF8Char(Stream s)
{
    byte[] bytes = new byte[4];
    var enc = new UTF8Encoding(false, true);

    if (1 != s.Read(bytes, 0, 1))
        return null;

    if (bytes[0] <= 0x7F) // Single byte character
    {
        return enc.GetChars(bytes, 0, 1);
    }
    else
    {
        var remainingBytes =
            ((bytes[0] & 240) == 240) ? 3 : (
            ((bytes[0] & 224) == 224) ? 2 : (
            ((bytes[0] & 192) == 192) ? 1 : -1
            ));
        if (remainingBytes == -1)
            return null;

        s.Read(bytes, 1, remainingBytes);
        return enc.GetChars(bytes, 0, remainingBytes + 1);
    }
}
Obviously, this is a bit of a mess, and somewhat specific to UTF8. Is there a more elegant, less custom, easier-to-read solution to this problem?
I know this question is a bit old, but here is another solution. It does not perform as well as the OP's solution (which I also prefer), but it only uses built-in UTF-8 functionality without knowing about the internals of the UTF-8 encoding.
private static char ReadUTF8Char(Stream s)
{
    if (s.Position >= s.Length)
        throw new Exception("Error: Read beyond EOF");

    using (BinaryReader reader = new BinaryReader(s, Encoding.Unicode, true))
    {
        int numRead = Math.Min(4, (int)(s.Length - s.Position));
        byte[] bytes = reader.ReadBytes(numRead);
        char[] chars = Encoding.UTF8.GetChars(bytes);

        if (chars.Length == 0)
            throw new Exception("Error: Invalid UTF8 char");

        int charLen = Encoding.UTF8.GetByteCount(new char[] { chars[0] });
        s.Position += (charLen - numRead);
        return chars[0];
    }
}
The encoding passed to the constructor of BinaryReader doesn't matter. I had to use this version of the constructor to leave the stream open. If you already have a binary reader you can just use this:
private static char ReadUTF8Char(BinaryReader reader)
{
    var s = reader.BaseStream;
    if (s.Position >= s.Length)
        throw new Exception("Error: Read beyond EOF");

    int numRead = Math.Min(4, (int)(s.Length - s.Position));
    byte[] bytes = reader.ReadBytes(numRead);
    char[] chars = Encoding.UTF8.GetChars(bytes);

    if (chars.Length == 0)
        throw new Exception("Error: Invalid UTF8 char");

    int charLen = Encoding.UTF8.GetByteCount(new char[] { chars[0] });
    s.Position += (charLen - numRead);
    return chars[0];
}
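A quick usage sketch for the stream-based overload (the test string and stream are mine). Note that both variants return a single char, so they assume characters within the Basic Multilingual Plane:

var data = Encoding.UTF8.GetBytes("héllo \u3042"); // 1-, 2- and 3-byte characters
using (var ms = new MemoryStream(data))
{
    while (ms.Position < ms.Length)
        Console.Write(ReadUTF8Char(ms));
}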
I am trying to achieve the best possible compression for data that consists of just 1s and 0s in a matrix.
To demonstrate what I mean, here's a sample 6 by 6 matrix:
1,0,0,1,1,1
0,1,0,1,1,1
1,0,0,1,0,0
0,1,1,0,1,1
1,0,0,0,0,1
0,1,0,1,0,1
I'd like to compress that into an as small string or byte array as possible. The matrices I will need to compress are bigger though (always 4096 by 4096 1s and 0s).
I suppose it could be compressed quite heavily, but I'm not sure how. I'll mark the best compression as the answer. Performance does not matter.
I assume that you want to compress strings into other strings even though your data really is binary. I don't know what the best compression algorithm is (and that will vary depending on your data), but you can convert the input text into bits, compress these, and then convert the compressed bytes into a string again using base-64 encoding. This will allow you to go from string to string and still apply a compression algorithm of your choice.
The .NET framework provides the class DeflateStream that will allow you to compress a stream of bytes. The first step is to create a custom Stream that will allow you to read and write your text format. For lack of a better name I have named it TextStream. Note that to simplify matters a bit I use \n as the line ending (instead of \r\n).
class TextStream : Stream {

    readonly String text;
    readonly Int32 bitsPerLine;
    readonly StringBuilder buffer;
    Int32 textPosition;

    // Initialize a readable stream.
    public TextStream(String text) {
        if (text == null)
            throw new ArgumentNullException("text");
        this.text = text;
    }

    // Initialize a writeable stream.
    public TextStream(Int32 bitsPerLine) {
        if (bitsPerLine <= 0)
            throw new ArgumentException();
        this.bitsPerLine = bitsPerLine;
        this.buffer = new StringBuilder();
    }

    public override Boolean CanRead { get { return this.text != null; } }

    public override Boolean CanWrite { get { return this.buffer != null; } }

    public override Boolean CanSeek { get { return false; } }

    public override Int64 Length { get { throw new InvalidOperationException(); } }

    public override Int64 Position {
        get { throw new InvalidOperationException(); }
        set { throw new InvalidOperationException(); }
    }

    public override void Flush() {
    }

    public override Int32 Read(Byte[] buffer, Int32 offset, Int32 count) {
        // TODO: Validate buffer, offset and count.
        if (!CanRead)
            throw new InvalidOperationException();
        var byteCount = 0;
        Byte currentByte = 0;
        var bitCount = 0;
        for (; byteCount < count && this.textPosition < this.text.Length; this.textPosition += 1) {
            if (text[this.textPosition] != '0' && text[this.textPosition] != '1')
                continue;
            currentByte = (Byte) ((currentByte << 1) | (this.text[this.textPosition] == '0' ? 0 : 1));
            bitCount += 1;
            if (bitCount == 8) {
                buffer[offset + byteCount] = currentByte;
                byteCount += 1;
                currentByte = 0;
                bitCount = 0;
            }
        }
        if (bitCount > 0) {
            buffer[offset + byteCount] = currentByte;
            byteCount += 1;
        }
        return byteCount;
    }

    public override void Write(Byte[] buffer, Int32 offset, Int32 count) {
        // TODO: Validate buffer, offset and count.
        if (!CanWrite)
            throw new InvalidOperationException();
        for (var i = 0; i < count; ++i) {
            var currentByte = buffer[offset + i];
            for (var mask = 0x80; mask > 0; mask /= 2) {
                if (this.buffer.Length > 0) {
                    if ((this.buffer.Length + 1)%(2*this.bitsPerLine) == 0)
                        this.buffer.Append('\n');
                    else
                        this.buffer.Append(',');
                }
                this.buffer.Append((currentByte & mask) == 0 ? '0' : '1');
            }
        }
    }

    public override String ToString() {
        if (this.text != null)
            return this.text;
        else
            return this.buffer.ToString();
    }

    public override Int64 Seek(Int64 offset, SeekOrigin origin) {
        throw new InvalidOperationException();
    }

    public override void SetLength(Int64 length) {
        throw new InvalidOperationException();
    }

}
Then you can write methods for compressing and decompressing using DeflateStream. Note that the uncompressed input is a string like the one you have provided in your question, and the compressed output is a base-64 encoded string.
String Compress(String text) {
    using (var inputStream = new TextStream(text))
    using (var outputStream = new MemoryStream()) {
        using (var compressedStream = new DeflateStream(outputStream, CompressionMode.Compress))
            inputStream.CopyTo(compressedStream);
        return Convert.ToBase64String(outputStream.ToArray());
    }
}

String Decompress(String compressedText, Int32 bitsPerLine) {
    var bytes = Convert.FromBase64String(compressedText);
    using (var inputStream = new MemoryStream(bytes))
    using (var outputStream = new TextStream(bitsPerLine)) {
        using (var compressedStream = new DeflateStream(inputStream, CompressionMode.Decompress))
            compressedStream.CopyTo(outputStream);
        return outputStream.ToString();
    }
}
To test it I used a method to create a random string (using a fixed seed to always create the same string):
String CreateRandomString(Int32 width, Int32 height) {
    var random = new Random(0);
    var stringBuilder = new StringBuilder();
    for (var i = 0; i < width; ++i) {
        for (var j = 0; j < height; ++j) {
            if (i > 0 && j == 0)
                stringBuilder.Append('\n');
            else if (j > 0)
                stringBuilder.Append(',');
            stringBuilder.Append(random.Next(2) == 0 ? '0' : '1');
        }
    }
    return stringBuilder.ToString();
}
A random 4,096 x 4,096 string has an uncompressed size of 33,554,431 characters. This is compressed to 2,797,056 characters, a reduction to about 8% of the original size.
Skipping the base-64 encoding would increase the compression ratio even more but the output would be binary and not a string. If you also consider the input as binary you actually get the following result for random data with equal probability of 0 and 1:
Input bytes: 4,096 x 4,096 / 8 = 2,097,152
Output bytes: 2,097,792
Size after compression: 100%
Simply converting to bytes is better than doing that followed by a deflate. However, using random input but with 25% 0 and 75% 1 you get this result:
Input bytes: 4,096 x 4,096 / 8 = 2,097,152
Output bytes: 1,757,846
Size after compression: 84%
How much deflate will compress your data really depends on the nature of the data. If it is completely random you won't be able to get much compression after converting from text to bytes.
Hmm... as small as possible is not really possible without knowing the problem domain.
Here's the general approach:
Represent the ones and zeros in the array using bits not bytes or characters or whatever.
Compress using a general purpose loss-less compression algorithm. The two most common are:
Huffman encoding and some type of LZW.
Huffman coding can be mathematically proven to provide an optimal prefix code for a given symbol distribution; the catch is that in order to decompress the data you also need the Huffman tree, which may be as big as the original data. LZW gives you compression equivalent to Huffman (within a few percent) for most inputs, but performs best on data with repeating segments such as text.
Implementations of these compression algorithms should be easy to come by (GZIP uses LZ77, an earlier close relative of LZW, combined with Huffman coding).
For good implementations of modern compression algorithms, go to 7zip.org. It's open source and there is a C API with a DLL, but you'll have to create the .NET interface (unless someone has already made one).
The non-general approach:
This relies on a known characteristic of the data. For example: if you know most of the data is zeroes, you can encode only the coordinates of the ones.
If the data contains patches of ones and zeros, they can be encoded with RLE or two-dimensional variants of the algorithm.
Trying to create your own algorithm for specifically compressing this data will most likely not yield much.
Create a GZipStream with max CompressionLevel.
Run a 4096x4096 loop:
- set all 64 bits of a ulong to bits of the array
- when 64 bits are done, write the ulong to the compression stream and start at the first bit again
This will very easily pack your matrix into a pretty compressed block of memory (see the sketch below).
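A minimal sketch of that idea, assuming the matrix is held as a bool[4096, 4096] (the method and variable names are mine):

using System.IO;
using System.IO.Compression;

static void WriteCompressed(bool[,] matrix, string path)
{
    using (var output = File.Create(path))
    using (var gzip = new GZipStream(output, CompressionLevel.Optimal))
    using (var writer = new BinaryWriter(gzip))
    {
        ulong word = 0;
        int bits = 0;
        foreach (bool bit in matrix)        // enumerates all 4096 * 4096 cells row-major
        {
            word = (word << 1) | (bit ? 1UL : 0UL);
            if (++bits == 64)
            {
                writer.Write(word);         // flush a fully packed 64-bit word
                word = 0;
                bits = 0;
            }
        }
        // 4096 * 4096 is divisible by 64, so no partial word remains here.
    }
}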
Using Huffman coding you can compress it quite a lot:
0 => 111
1 => 10
, => 0
\r => 1100
\n => 1101
This yields, for your example matrix (in bits):
10011101 11010010 01011001 10111101 00111010 01001011 00110110 01110111
01001110 11111001 10111101 00100111 01001011 00110110 01110111 01110111
01011001 10111101 00111010 0111010
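A minimal sketch of encoding with that fixed table (the helper is mine; a real implementation would pack the bit string into bytes rather than build a string of '0'/'1' characters):

using System.Collections.Generic;
using System.Text;

static string HuffmanEncode(string matrixText)
{
    var codes = new Dictionary<char, string>
    {
        ['0'] = "111", ['1'] = "10", [','] = "0", ['\r'] = "1100", ['\n'] = "1101"
    };

    var bits = new StringBuilder();
    foreach (char c in matrixText)
        bits.Append(codes[c]);      // emit the code for each input character
    return bits.ToString();
}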
If the commas, line feeds, and carriage returns can be excluded, then you only need a BitArray to store each value. Although now you need to know the dimension of the matrix when decoding; if you don't, you could store it as an int followed by the data itself, if you're planning on serializing the data.
Something like:
var input = @"1,0,0,1,1,1
0,1,0,1,1,1
1,0,0,1,0,0
0,1,1,0,1,1
1,0,0,0,0,1
0,1,0,1,0,1";

var values = new List<bool>();
foreach (var c in input)
{
    if (c == '0')
        values.Add(false);
    else if (c == '1')
        values.Add(true);
}

var ba = new BitArray(values.ToArray());
then serialize the BitArray. You'd probably need to store the number of padding bits to decode the data properly (though 4096 * 4096 is divisible by 8, so no padding is needed in your case).
The BitArray approach should get you the most compression unless there is a significant amount of repeating patterns in the matrix (yes I'm assuming the data is mostly random).
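For the serialization step, the BitArray can be copied straight into a byte array (a sketch; the surrounding stream handling is up to you):

// Pack the bits into bytes; BitArray zero-pads the final byte if needed.
var packed = new byte[(ba.Length + 7) / 8];
ba.CopyTo(packed, 0);
// Optionally prefix the output with the matrix dimension so it can be decoded:
// writer.Write(4096); writer.Write(packed);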
Normally, to read characters from a byte stream you use a StreamReader. In this example I'm reading records delimited by '\r' from an infinite stream.
using (var reader = new StreamReader(stream, Encoding.UTF8))
{
    var messageBuilder = new StringBuilder();
    var nextChar = 'x';
    while (reader.Peek() >= 0)
    {
        nextChar = (char)reader.Read();
        messageBuilder.Append(nextChar);
        if (nextChar == '\r')
        {
            ProcessBuffer(messageBuilder.ToString());
            messageBuilder.Clear();
        }
    }
}
The problem is that the StreamReader has a small internal buffer, so if the code is waiting for an 'end of record' delimiter ('\r' in this case), it has to wait until the StreamReader's internal buffer is flushed (usually because more bytes have arrived).
This alternative implementation works for single byte UTF-8 characters, but will fail on multibyte characters.
int byteAsInt = 0;
var messageBuilder = new StringBuilder();
while ((byteAsInt = stream.ReadByte()) != -1)
{
    var nextChar = Encoding.UTF8.GetChars(new[] { (byte)byteAsInt });
    Console.Write(nextChar[0]);
    messageBuilder.Append(nextChar);
    if (nextChar[0] == '\r')
    {
        ProcessBuffer(messageBuilder.ToString());
        messageBuilder.Clear();
    }
}
How can I modify this code so that it works with multi-byte characters?
Rather than Encoding.UTF8.GetChars, which is designed to convert complete buffers, get an instance of Decoder and repeatedly call its member method GetChars. This will make use of the Decoder's internal buffer to handle partial multi-byte sequences from the end of one call to the next.
Thanks to Richard, I now have a working infinite stream reader. As he explained, the trick is to use a Decoder instance and call its GetChars method. I've tested it with multi-byte Japanese text and it works fine.
int byteAsInt = 0;
var messageBuilder = new StringBuilder();
var decoder = Encoding.UTF8.GetDecoder();
var nextChar = new char[1];
while ((byteAsInt = stream.ReadByte()) != -1)
{
    var charCount = decoder.GetChars(new[] { (byte)byteAsInt }, 0, 1, nextChar, 0);
    if (charCount == 0) continue;
    Console.Write(nextChar[0]);
    messageBuilder.Append(nextChar);
    if (nextChar[0] == '\r')
    {
        ProcessBuffer(messageBuilder.ToString());
        messageBuilder.Clear();
    }
}
I don't understand why you're not using the stream reader's ReadLine method. If there's a good reason not to, however, it nonetheless seems to me that repeatedly calling GetChars on the decoder is inefficient. Why not make use of the fact that the byte representation of '\r' can't be part of a multi-byte sequence? (Bytes in a multi-byte sequence must be greater than 127; that is, they have the highest bit set.)
var messageBuilder = new List<byte>();
int byteAsInt;
while ((byteAsInt = stream.ReadByte()) != -1)
{
    messageBuilder.Add((byte)byteAsInt);
    if (byteAsInt == '\r')
    {
        var messageString = Encoding.UTF8.GetString(messageBuilder.ToArray());
        Console.Write(messageString);
        ProcessBuffer(messageString);
        messageBuilder.Clear();
    }
}
Mike,
I found your solution perfect for my situation as well. But I noticed that sometimes it takes four GetChars() calls to determine the characters to be returned. This meant that charCount was 2 while my nextChar buffer size was 1, so I got the error "The output character buffer is too small to contain the decoded characters, encoding Unicode fallback System.Text.DecoderReplacementFallback".
I changed my code to:
// ...
var nextChar = new char[4]; // 2 might suffice

for (var i = startPos; i < bytesRead; i++)
{
    int charCount;
    // ...
    charCount = decoder.GetChars(buffer, i, 1, nextChar, 0);
    if (charCount == 0)
    {
        bytesSkipped++;
        continue;
    }

    for (int ic = 0; ic < charCount; ic++)
    {
        char c = nextChar[ic];
        charPos++;
        // Process character here...
    }
}