I'm trying to make a DNA analysis tool, but I'm facing a big problem here.
Here's a screenshot of what the application looks like.
The problem I'm facing is handling large data. I've used streams and memory-mapped files, but I'm not really sure I'm heading in the right direction.
What I'm trying to achieve is to write a text file with 3 billion random letters, and then use that text file for later purposes.
Currently I'm at 3,000 letters, but generating more than that takes ages. How would you tackle this? Storing the full text file in a string seems like overkill to me. Any ideas?
private void WriteDNASequence(string dnaFile)
{
    Dictionary<int, char> neucleotides = new Dictionary<int, char>();
    neucleotides.Add(0, 'A');
    neucleotides.Add(1, 'T');
    neucleotides.Add(2, 'C');
    neucleotides.Add(3, 'G');

    int BasePairs = 3000;
    using (StreamWriter sw = new StreamWriter(Path.Combine(filepath, dnaFile)))
    {
        for (int i = 0; i < (BasePairs / 2); i++)
        {
            int neucleotide = RandomNumber(0, 4);
            sw.Write(neucleotides[neucleotide]);
        }
    }
}
private string ReadDNASequence(string dnaFile)
{
    _DNAData = "";
    using (StreamReader file = new StreamReader(Path.Combine(filepath, dnaFile)))
    {
        _DNAData = file.ReadToEnd();
    }
    return _DNAData;
}
// Function to get a random number
private static readonly Random random = new Random();
private static readonly object syncLock = new object();

public static int RandomNumber(int min, int max)
{
    lock (syncLock) // synchronize
    {
        return random.Next(min, max);
    }
}
When working with such big amounts of data, every bit matters and you have to pack the data as densely as possible.
As of now, each nucleotide is represented by one char, and one char in the encoding you use (UTF-8 by default) takes 1 byte (for those 4 characters).
But since you have just 4 different characters, each character holds only 2 bits of information, so we can represent them as:
00 - A
01 - T
10 - C
11 - G
That means we can pack 4 nucleotides into one byte, making the output file 4 times smaller.
Assuming you have maps like these:
static readonly Dictionary<char, byte> _neucleotides = new Dictionary<char, byte> {
    { 'A', 0 },
    { 'T', 1 },
    { 'C', 2 },
    { 'G', 3 }
};

static readonly Dictionary<int, char> _reverseNucleotides = new Dictionary<int, char> {
    { 0, 'A' },
    { 1, 'T' },
    { 2, 'C' },
    { 3, 'G' }
};
You can pack 4 nucleotides into one byte like this:
string toPack = "ATCG";
byte packed = 0;
for (int i = 0; i < 4; i++) {
    packed = (byte)(packed | _neucleotides[toPack[i]] << (i * 2));
}
And unpack back like this:
string unpacked = new string(new[] {
    _reverseNucleotides[packed & 0b11],
    _reverseNucleotides[(packed & 0b1100) >> 2],
    _reverseNucleotides[(packed & 0b110000) >> 4],
    _reverseNucleotides[(packed & 0b11000000) >> 6],
});
As for writing bytes to a file, I think that's easy enough. If you need some random data in this case, use:
int chunkSize = 1024 * 1024; // ~4 million nucleotides at once (each byte packs 4)
byte[] chunk = new byte[chunkSize];
random.NextBytes(chunk);
// fileStream is an instance of `FileStream`; no need for `StreamWriter`
fileStream.Write(chunk, 0, chunk.Length);
There are some caveats (e.g. the last byte in a file might store fewer than 4 nucleotides), but I hope you'll figure that out yourself.
With that approach (packing in binary, generating a big random chunk at once, writing big chunks to the file), generating 3 billion pairs took 8 seconds on my very old (7 years) HDD, and the output size is 350 MB. You can even read all that 350 MB into memory at once if necessary.
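For anyone who wants to see the scheme end to end, here is a small sketch of the same 2-bits-per-nucleotide packing in Python (the helper names are mine, not from the C# above; note how the unpacker needs the sequence length, because the last byte may hold fewer than 4 nucleotides):

```python
# Sketch of the 2-bits-per-nucleotide packing scheme described above.
NUC_TO_BITS = {'A': 0, 'T': 1, 'C': 2, 'G': 3}
BITS_TO_NUC = {v: k for k, v in NUC_TO_BITS.items()}

def pack(seq):
    """Pack a nucleotide string into bytes, 4 nucleotides per byte."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        b = 0
        for j, ch in enumerate(seq[i:i + 4]):
            b |= NUC_TO_BITS[ch] << (j * 2)
        out.append(b)
    return bytes(out)

def unpack(data, length):
    """Unpack `length` nucleotides; the caller must remember the length,
    since the final byte may be only partially used."""
    seq = []
    for i in range(length):
        byte = data[i // 4]
        seq.append(BITS_TO_NUC[(byte >> ((i % 4) * 2)) & 0b11])
    return ''.join(seq)
```

The round trip `unpack(pack(s), len(s))` returns the original sequence, including the partial-last-byte case.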
Related
Guys, I need some help converting this C# program to Python, since I am not a C# guy whatsoever. I'm attempting to convert this hex string. I was told the following hex string is a series of position and load integer values. The developer's exact words were:
0x00703D450080474500704A4500E04B4500E04B4500D04A4500E0484500B0464500C044450090434500404345009043450000444500F0434500F0424500F0404500403E4500703B45000039450020374500F03545006035450090354500A0364500E0384500403C450090404500C0444500A04745008047450030434500D0394500702B4500A0184500A002450080D5440000A54400006C440040144400008943000050C100808AC3004003C400803EC4008077C4002096C40040ADC400A0BFC40040CCC40020D3C40040D5C40020D4C40040D1C40000CEC400C0CAC40080C7C40020C4C40000C1C40020BEC40040BCC40060BBC40060BBC400C0BBC400E0BBC40060BBC40080BAC400C0B9C400C0B9C400C0BAC400C0BCC40040BFC400E0C1C40020C4C400A0C5C40060C5C40080C2C400E0BAC40000ADC4002097C400C072C400402AC40080B4C3000014C200808443000007440080444400C07C440040994400E0B3440000CE4400A0E6440000FD440050084500E0104500901845001020450090274500302F4500A0364500703D45295C5A42B81E55423D0A5242C3F54E4233334C4248E149427B144842A47046425C8F44420AD7414214AE3D4214AE3742CDCC2F42E17A2642A4701C425C8F12428FC20942E17A0242A470F94114AEEF419A99E541CDCCD8419A99C741B81EB14100009641EC517041EC5134413333FB40EC51A040C3F5384052B8BE3F3333333F0AD7A33E7B142E3EAE47E13D8FC2753D0AD7233C000000000AD7233D7B142E3EA470BD3E9A99193F295C4F3F8FC2753F713D8A3FF6289C3F9A99B93FCDCCEC3F1F851B4048E14A40000080409A999940C3F5B0408FC2C54048E1DA40CDCCF440D7A30C415C8F2641CDCC484148E1724133339141F628AA416666C241D7A3D841F628EC413333FD416666064233330E42CDCC16428FC22042B81E2C425C8F3842333345423D0A514200005B42A4706242333367425C8F6942F6286A42CDCC69423D0A69426666684285EB67421F856742F628674214AE66427B146642A470654248E16442EC5164420AD76342AE476342D7A362420AD76142C3F5604200006042C3F55E428FC25D42AE475C42295C5A42
"The hex string is a series of 2 byte integers in pairs of Position and Load. The first 2 bytes is Position1, the next 2=Load1, next 2=Position2, next 2=Load2, etc...
A byte is 2 Hex characters. "
Here is the C# that was provided to me by the developer, with little context behind it:
public class PositionLoadPoint
{
    public float Position { get; private set; }
    public float Load { get; private set; }

    public PositionLoadPoint(float position, float load)
    {
        Position = position;
        Load = load;
    }
}
This method should return a list of points from an array of bytes:
public static IList<PositionLoadPoint> GetPositionLoadPoints(byte[] bytes)
{
    IList<PositionLoadPoint> result = new List<PositionLoadPoint>();
    int midIndex = bytes.Length / 2;
    for (int i = 0; i < midIndex; i += 4)
    {
        byte[] load = new byte[4];
        byte[] position = new byte[4];
        Array.Copy(bytes, i, load, 0, 4);
        Array.Copy(bytes, midIndex + i, position, 0, 4);
        var point = new PositionLoadPoint(BitConverter.ToSingle(load, 0),
                                          BitConverter.ToSingle(position, 0));
        result.Add(point);
    }
    return result;
}
I'm struggling with this and it's driving me crazy, because I believe it should be crazy simple. Here is the Python I wrote, but I don't believe the results are correct, since the plot is sporadic!
#INSERT LIBRARIES
import matplotlib.pyplot as plt
hex_string = '00703D450080474500704A4500E04B4500E04B4500D04A4500E0484500B0464500C044450090434500404345009043450000444500F0434500F0424500F0404500403E4500703B45000039450020374500F03545006035450090354500A0364500E0384500403C450090404500C0444500A04745008047450030434500D0394500702B4500A0184500A002450080D5440000A54400006C440040144400008943000050C100808AC3004003C400803EC4008077C4002096C40040ADC400A0BFC40040CCC40020D3C40040D5C40020D4C40040D1C40000CEC400C0CAC40080C7C40020C4C40000C1C40020BEC40040BCC40060BBC40060BBC400C0BBC400E0BBC40060BBC40080BAC400C0B9C400C0B9C400C0BAC400C0BCC40040BFC400E0C1C40020C4C400A0C5C40060C5C40080C2C400E0BAC40000ADC4002097C400C072C400402AC40080B4C3000014C200808443000007440080444400C07C440040994400E0B3440000CE4400A0E6440000FD440050084500E0104500901845001020450090274500302F4500A0364500703D45295C5A42B81E55423D0A5242C3F54E4233334C4248E149427B144842A47046425C8F44420AD7414214AE3D4214AE3742CDCC2F42E17A2642A4701C425C8F12428FC20942E17A0242A470F94114AEEF419A99E541CDCCD8419A99C741B81EB14100009641EC517041EC5134413333FB40EC51A040C3F5384052B8BE3F3333333F0AD7A33E7B142E3EAE47E13D8FC2753D0AD7233C000000000AD7233D7B142E3EA470BD3E9A99193F295C4F3F8FC2753F713D8A3FF6289C3F9A99B93FCDCCEC3F1F851B4048E14A40000080409A999940C3F5B0408FC2C54048E1DA40CDCCF440D7A30C415C8F2641CDCC484148E1724133339141F628AA416666C241D7A3D841F628EC413333FD416666064233330E42CDCC16428FC22042B81E2C425C8F3842333345423D0A514200005B42A4706242333367425C8F6942F6286A42CDCC69423D0A69426666684285EB67421F856742F628674214AE66427B146642A470654248E16442EC5164420AD76342AE476342D7A362420AD76142C3F5604200006042C3F55E428FC25D42AE475C42295C5A42'
# Convert each two-digit hex value to decimal and place in a list
hex_list = []
for i in range(0, len(hex_string), 2):
    hex_list.append(int(hex_string[i:i+2], 16))

# Group consecutive decimals in hex_list into pairs
DEC_list_pair = []
for i in range(0, len(hex_list), 2):
    DEC_list_pair.append(hex_list[i:i+2])

# Create an x and y axis from the DEC_list_pair list
x_axis = []
y_axis = []
for i in range(0, len(DEC_list_pair)):
    x_axis.append(DEC_list_pair[i][0])
    y_axis.append(DEC_list_pair[i][1])

# Plot x_axis and y_axis
plt.plot(x_axis, y_axis)
plt.show()
Looks like the description doesn't match the C# code.
int midIndex = bytes.Length / 2;
for (int i = 0; i < midIndex; i += 4)
{
    // Reading into 'load' from offset i from the front of the array
    Array.Copy(bytes, i, load, 0, 4);
    // Reading into 'position' from offset midIndex + i of the array
    Array.Copy(bytes, midIndex + i, position, 0, 4);
    var point = new PositionLoadPoint(BitConverter.ToSingle(load, 0),
                                      BitConverter.ToSingle(position, 0));
All the loads are coming from the front of the array; all the positions are coming after midIndex. And each value is read as a 4-byte float (BitConverter.ToSingle), not a 2-byte integer.
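Given that layout — first half of the payload is loads, second half is positions, each a little-endian 32-bit float (what BitConverter.ToSingle reads on typical hardware) — the Python side could be sketched with struct.unpack like this (the function name is mine):

```python
import struct

def parse_position_load(hex_string):
    """Split the payload in half: first half = loads, second half = positions,
    each a sequence of little-endian 32-bit floats."""
    data = bytes.fromhex(hex_string)
    mid = len(data) // 2
    n = mid // 4  # number of 4-byte floats per half
    loads = struct.unpack('<%df' % n, data[:mid])
    positions = struct.unpack('<%df' % n, data[mid:])
    return positions, loads

# Usage with the question's data would then be roughly:
# positions, loads = parse_position_load(hex_string)
# plt.plot(positions, loads); plt.show()
```

The `<` prefix forces little-endian, matching BitConverter on x86/x64.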
I want to edit a text file so that each line occurs only once in it. Each line contains exactly 10 characters. I generally work with 5-6 million lines, so the code I am currently using consumes too much RAM.
My code:
File.WriteAllLines(targetpath, File.ReadAllLines(sourcepath).Distinct());
So how can I make it consume less RAM and take less time at the same time?
Taking into account how much memory a string takes in C#, and assuming 10 characters per line for 6 million records, we get:
size in bytes ~= 20 + (length / 2) * 4
total size in bytes ~= (20 + (10 / 2) * 4) * 6000000 = 240,000,000
total size in MB ~= 230
Now, 230 MB of space is not really a problem, even on x86 (a 32-bit system), so you can load all that data into memory.
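As a quick sanity check of the arithmetic above (Python, using the answer's own heuristic formula):

```python
def string_size_bytes(length):
    # Heuristic .NET string footprint from the answer: 20 + (length / 2) * 4
    return 20 + (length // 2) * 4

total = string_size_bytes(10) * 6_000_000
print(total)                          # 240,000,000 bytes
print(round(total / 1024 / 1024))     # ~229 MB, i.e. roughly the 230 MB quoted
```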
For this, I would use a HashSet<string>, which will let you easily eliminate the duplicates by doing a lookup before adding an element.
In terms of big-O notation for time complexity, the average performance of a lookup in a hash set is O(1), which is the best you can get. In total, you would use lookup N times, totalling to N * O(1) = O(N)
In terms of big-O notation for space complexity, you would have O(N) space used, meaning that you use up memory proportional to number of elements, which is also the best you can get.
I'm not sure it is even possible to use less space if you implement the algorithm in C# without relying on any external components (which would also use at least O(N)).
That being said, you can optimize for some scenarios by reading your file sequentially, line by line, see here.
This would give a better result if you have lots of duplicates, but worst case scenario when all the lines are distinct would consume the same amount of memory.
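A minimal sketch of that sequential idea (shown in Python for brevity; the C# version would pair File.ReadLines with a HashSet<string> in the same way):

```python
def dedupe_file(src, dst):
    """Stream the file line by line, writing each line the first time it is seen.
    Memory is proportional to the number of *distinct* lines, not total lines."""
    seen = set()
    with open(src) as fin, open(dst, 'w') as fout:
        for line in fin:
            if line not in seen:
                seen.add(line)
                fout.write(line)
```

With many duplicates the set stays small; with all-distinct input it degrades to the same O(N) memory as loading everything.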
On a final note, if you look how Distinct method is implemented, you will see it also uses an implementation of hash table, although it's not the same class, but the performance is still roughly the same, check out this question for more details.
As ironstone13 corrected me, HashSet is OK, but it does store the data.
Then this works fine too:
string[] arr = File.ReadAllLines("file.txt");
HashSet<string> hashes = new HashSet<string>();
for (int i = 0; i < arr.Length; i++)
{
    if (!hashes.Add(arr[i])) arr[i] = null;
}
File.WriteAllLines("file2.txt", arr.Where(x => x != null));
This implementation was motivated by memory performance and hash conflicts.
The main idea was to keep just the hashes; of course, it would have to go back to the file to fetch any line it sees as a hash conflict/duplicate, to detect which one it is (that part is not implemented).
class Program
{
    static string[] arr;
    static Dictionary<int, int>[] hashes = new Dictionary<int, int>[1]
        { new Dictionary<int, int>() };
    static int[] file_indexes = { -1 };

    static void AddHash(int hash, int index)
    {
        for (int h = 0; h < hashes.Length; h++)
        {
            Dictionary<int, int> dict = hashes[h];
            if (!dict.ContainsKey(hash))
            {
                dict[hash] = index;
                return;
            }
        }
        hashes = hashes.Union(new[] { new Dictionary<int, int>() { { hash, index } } }).ToArray();
        file_indexes = Enumerable.Range(0, hashes.Length).Select(x => -1).ToArray();
    }

    static int UpdateFileIndexes(int hash)
    {
        int updates = 0;
        for (int h = 0; h < hashes.Length; h++)
        {
            int index;
            if (hashes[h].TryGetValue(hash, out index))
            {
                file_indexes[h] = index;
                updates++;
            }
            else
            {
                file_indexes[h] = -1;
            }
        }
        return updates;
    }

    static bool IsDuplicate(int index)
    {
        string str1 = arr[index];
        for (int h = 0; h < hashes.Length; h++)
        {
            int i = file_indexes[h];
            if (i == -1 || index == i) continue;
            string str0 = arr[i];
            if (str0 == null) continue;
            if (string.CompareOrdinal(str0, str1) == 0) return true;
        }
        return false;
    }

    static void Main(string[] args)
    {
        arr = File.ReadAllLines("file.txt");
        for (int i = 0; i < arr.Length; i++)
        {
            int hash = arr[i].GetHashCode();
            if (UpdateFileIndexes(hash) == 0) AddHash(hash, i);
            else if (IsDuplicate(i)) arr[i] = null;
            else AddHash(hash, i);
        }
        File.WriteAllLines("file2.txt", arr.Where(x => x != null));
        Console.WriteLine("DONE");
        Console.ReadKey();
    }
}
Before you write your data, if it is in a list or dictionary, you could run a LINQ query and use group by to group all like keys, then write each group to the output file.
Your question is a little vague as well. Are you creating a new text file every time, and do you have to store the data as text? There are better formats to use, such as XML and JSON.
In this answer, the code below was posted for creating unique random alphanumeric strings. Could someone clarify for me how exactly they are ensured to be unique in this code, and to what extent? If I rerun this method on different occasions, would I still get unique strings?
Or did I just misunderstand the reply, and these are not generating unique keys at all, only random ones?
I already asked this in a comment on that answer, but the user seems to be inactive.
public static string GetUniqueKey()
{
    int maxSize = 8;
    char[] chars = new char[62];
    string a;
    a = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890";
    chars = a.ToCharArray();
    int size = maxSize;
    byte[] data = new byte[1];
    RNGCryptoServiceProvider crypto = new RNGCryptoServiceProvider();
    crypto.GetNonZeroBytes(data);
    size = maxSize;
    data = new byte[size];
    crypto.GetNonZeroBytes(data);
    StringBuilder result = new StringBuilder(size);
    foreach (byte b in data)
    {
        result.Append(chars[b % (chars.Length - 1)]);
    }
    return result.ToString();
}
There is nothing in the code that guarantees that the result is unique. To get a unique value you either have to keep all previous values so that you can check for duplicates, or use a lot longer codes so that duplicates are practically impossible (e.g. a GUID). The code contains less than 48 bits of information, which is a lot less than the 128 bits of a GUID.
The string is just random, and although a crypto-strength random generator is used, that is undermined by how the code is generated from the random data. There are some issues in the code:
A char array is created, that is just thrown away and replaced with another.
A one byte array of random data is created for no apparent reason at all, as it's not used for anything.
The GetNonZeroBytes method is used instead of the GetBytes method, which adds a skew to the distribution of characters as the code does nothing to handle the lack of zero values.
The modulo (%) operator is used to reduce the random number down to the number of characters used, but the random number can't be evenly divided into the number of characters, which also adds a skew to the distribution of characters.
chars.Length - 1 is used instead of chars.Length when the number is reduced, which means that only 61 of the predefined 62 characters can occur in the string.
Although those issues are minor, they are important when you are dealing with crypto-strength randomness.
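To make the skew concrete, here is a quick count (Python) of how often each of the 61 indices is hit when the nonzero byte values 1-255 are reduced modulo 61, mirroring what GetNonZeroBytes plus `b % (chars.Length - 1)` does:

```python
from collections import Counter

# Nonzero bytes are 1..255; the code above reduces them modulo 61.
counts = Counter(b % 61 for b in range(1, 256))

# 255 values don't divide evenly into 61 buckets, so some indices
# come up 5 times while others come up only 4 - a 25% skew.
print(min(counts.values()), max(counts.values()))
```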
Here is a version of the code that avoids those issues, and gives a code with enough information to be considered practically unique:
public static string GetUniqueKey()
{
    int size = 16;
    byte[] data = new byte[size];
    RNGCryptoServiceProvider crypto = new RNGCryptoServiceProvider();
    crypto.GetBytes(data);
    return BitConverter.ToString(data).Replace("-", String.Empty);
}
Uniqueness and randomness are mutually exclusive concepts. If a random number generator is truly random, it can return the same value twice. If values are truly unique, they certainly aren't truly random, because every value generated removes a value from the pool of allowed values; every run affects the outcome of subsequent runs, and at a certain point the pool is exhausted. (Barring, of course, an infinitely sized pool of allowed values, but the only way to avoid collisions in such a pool would be a deterministic method of choosing values.)
The code you're showing generates values that are very random, but not 100% guaranteed to be unique. After enough runs, there will be a collision.
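If you just want random keys without the skew discussed above, here is a sketch using a generator that does unbiased selection internally (Python's stdlib `secrets` module; the alphabet mirrors the 62 characters from the question):

```python
import secrets
import string

ALPHABET = string.ascii_letters + string.digits  # 62 symbols, as in the C# code

def random_key(size=8):
    """Random key with no modulo bias: secrets.choice selects uniformly."""
    return ''.join(secrets.choice(ALPHABET) for _ in range(size))
```

This still does not guarantee uniqueness; it only makes every key equally likely.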
I need to generate a 7-character alphanumeric string. After a small search, I wrote the code below; performance results are shown in the console output.
I used the Hashtable class to guarantee uniqueness, and also used the RNGCryptoServiceProvider class to get high-quality random chars.
Results are for generating 100,000 / 1,000,000 / 10,000,000 samples.
Generating unique strings (thanks to nipul parikh):
public static Tuple<List<string>, List<string>> GenerateUniqueList(int count)
{
    uniqueHashTable = new Hashtable();
    nonUniqueList = new List<string>();
    uniqueList = new List<string>();
    for (int i = 0; i < count; i++)
    {
        isUniqueGenerated = false;
        while (!isUniqueGenerated)
        {
            uniqueStr = GetUniqueKey();
            try
            {
                uniqueHashTable.Add(uniqueStr, "");
                isUniqueGenerated = true;
            }
            catch (Exception ex)
            {
                nonUniqueList.Add(uniqueStr);
                // Non-unique generated
            }
        }
    }
    uniqueList = uniqueHashTable.Keys.Cast<string>().ToList();
    return new Tuple<List<string>, List<string>>(uniqueList, nonUniqueList);
}

public static string GetUniqueKey()
{
    int size = 7;
    char[] chars = new char[62];
    string a = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890";
    chars = a.ToCharArray();
    RNGCryptoServiceProvider crypto = new RNGCryptoServiceProvider();
    byte[] data = new byte[size];
    crypto.GetNonZeroBytes(data);
    StringBuilder result = new StringBuilder(size);
    foreach (byte b in data)
        result.Append(chars[b % (chars.Length - 1)]);
    return Convert.ToString(result);
}
Whole console application code below:
class Program
{
    static string uniqueStr;
    static Stopwatch stopwatch;
    static bool isUniqueGenerated;
    static Hashtable uniqueHashTable;
    static List<string> uniqueList;
    static List<string> nonUniqueList;
    static Tuple<List<string>, List<string>> generatedTuple;

    static void Main(string[] args)
    {
        int i = 0, y = 0, count = 100000;
        while (i < 10 && y < 4)
        {
            stopwatch = new Stopwatch();
            stopwatch.Start();
            generatedTuple = GenerateUniqueList(count);
            stopwatch.Stop();
            Console.WriteLine("Time elapsed: {0} --- {1} Unique --- {2} nonUnique",
                stopwatch.Elapsed,
                generatedTuple.Item1.Count().ToFormattedInt(),
                generatedTuple.Item2.Count().ToFormattedInt());
            i++;
            if (i == 9)
            {
                Console.WriteLine(string.Empty);
                y++;
                count *= 10;
                i = 0;
            }
        }
        Console.ReadLine();
    }

    public static Tuple<List<string>, List<string>> GenerateUniqueList(int count)
    {
        uniqueHashTable = new Hashtable();
        nonUniqueList = new List<string>();
        uniqueList = new List<string>();
        for (int i = 0; i < count; i++)
        {
            isUniqueGenerated = false;
            while (!isUniqueGenerated)
            {
                uniqueStr = GetUniqueKey();
                try
                {
                    uniqueHashTable.Add(uniqueStr, "");
                    isUniqueGenerated = true;
                }
                catch (Exception ex)
                {
                    nonUniqueList.Add(uniqueStr);
                    // Non-unique generated
                }
            }
        }
        uniqueList = uniqueHashTable.Keys.Cast<string>().ToList();
        return new Tuple<List<string>, List<string>>(uniqueList, nonUniqueList);
    }

    public static string GetUniqueKey()
    {
        int size = 7;
        char[] chars = new char[62];
        string a = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890";
        chars = a.ToCharArray();
        RNGCryptoServiceProvider crypto = new RNGCryptoServiceProvider();
        byte[] data = new byte[size];
        crypto.GetNonZeroBytes(data);
        StringBuilder result = new StringBuilder(size);
        foreach (byte b in data)
            result.Append(chars[b % (chars.Length - 1)]);
        return Convert.ToString(result);
    }
}

public static class IntExtensions
{
    public static string ToFormattedInt(this int value)
    {
        return string.Format(CultureInfo.InvariantCulture, "{0:0,0}", value);
    }
}
Using strictly alphanumeric characters restricts the pool you draw from to 62. Using the complete printable ASCII character set (codes 32-126) increases your pool to 94, decreasing the likelihood of collisions and eliminating the need to build the pool separately.
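To put numbers on that, a back-of-the-envelope sketch (Python): bits of entropy per key, and a birthday-bound estimate of collision probability. Both are approximations, not guarantees:

```python
import math

def entropy_bits(pool, length):
    """Bits of information in a random string of `length` symbols from a `pool`."""
    return length * math.log2(pool)

def birthday_collision_prob(pool, length, n):
    """Approximate probability of at least one collision among n random strings."""
    space = pool ** length
    return 1 - math.exp(-n * (n - 1) / (2 * space))

print(entropy_bits(62, 7))   # ~41.7 bits for the 62-symbol pool
print(entropy_bits(94, 7))   # ~45.9 bits for the 94-symbol pool
```

With 7 characters from a 62-symbol pool, collisions become likely well before 10 million keys, which matches the retry loop in the question's code.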
If I have a byte array and want to convert a contiguous 16-byte block of that array, containing .NET's representation of a Decimal, into a proper Decimal struct, what is the most efficient way to do it?
Here's the code that showed up in my profiler as the biggest CPU consumer in a case that I'm optimizing.
public static decimal ByteArrayToDecimal(byte[] src, int offset)
{
    using (MemoryStream stream = new MemoryStream(src))
    {
        stream.Position = offset;
        using (BinaryReader reader = new BinaryReader(stream))
            return reader.ReadDecimal();
    }
}
To get rid of MemoryStream and BinaryReader, I thought feeding an array of BitConverter.ToInt32(src, offset + x)s into the Decimal(Int32[]) constructor would be faster than the solution I present below, but the version below is, strangely enough, twice as fast.
const byte DecimalSignBit = 128;

public static decimal ByteArrayToDecimal(byte[] src, int offset)
{
    return new decimal(
        BitConverter.ToInt32(src, offset),
        BitConverter.ToInt32(src, offset + 4),
        BitConverter.ToInt32(src, offset + 8),
        src[offset + 15] == DecimalSignBit,
        src[offset + 14]);
}
This is 10 times as fast as the MemoryStream/BinaryReader combo, and I tested it with a bunch of extreme values to make sure it works, but the decimal representation is not as straightforward as that of other primitive types, so I'm not yet convinced it works for 100% of the possible decimal values.
In theory, however, there could be a way to copy those 16 contiguous bytes to some other place in memory and declare that to be a Decimal, without any checks. Is anyone aware of a method to do this?
(There's only one problem: although decimals are represented as 16 bytes, some of the possible bit patterns do not constitute valid decimals, so doing an unchecked memcpy could potentially break things...)
Or is there any other faster way?
Even though this is an old question, I was a bit intrigued, so decided to run some experiments. Let's start with the experiment code.
static void Main(string[] args)
{
    byte[] serialized = new byte[16 * 10000000];
    Stopwatch sw = Stopwatch.StartNew();
    for (int i = 0; i < 10000000; ++i)
    {
        decimal d = i;
        // Serialize
        using (var ms = new MemoryStream(serialized))
        {
            ms.Position = (i * 16);
            using (var bw = new BinaryWriter(ms))
            {
                bw.Write(d);
            }
        }
    }
    var ser = sw.Elapsed.TotalSeconds;

    sw = Stopwatch.StartNew();
    decimal total = 0;
    for (int i = 0; i < 10000000; ++i)
    {
        // Deserialize
        using (var ms = new MemoryStream(serialized))
        {
            ms.Position = (i * 16);
            using (var br = new BinaryReader(ms))
            {
                total += br.ReadDecimal();
            }
        }
    }
    var dser = sw.Elapsed.TotalSeconds;

    Console.WriteLine("Time: {0:0.00}s serialization, {1:0.00}s deserialization", ser, dser);
    Console.ReadLine();
}
Result: Time: 1.68s serialization, 1.81s deserialization. This is our baseline. I also tried Buffer.BlockCopy to an int[4], which gives us 0.42s for deserialization. Using the method described in the question, deserialization goes down to 0.29s.
In theory however, there could be a way to copy those 16 contiguous bytes to some other place in memory and declare that to be a Decimal, without any checks. Is anyone aware of a method to do this?
Well yes, the fastest way to do this is to use unsafe code, which is okay here because decimals are value types:
static unsafe void Main(string[] args)
{
    byte[] serialized = new byte[16 * 10000000];
    Stopwatch sw = Stopwatch.StartNew();
    for (int i = 0; i < 10000000; ++i)
    {
        decimal d = i;
        fixed (byte* sp = serialized)
        {
            *(decimal*)(sp + i * 16) = d;
        }
    }
    var ser = sw.Elapsed.TotalSeconds;

    sw = Stopwatch.StartNew();
    decimal total = 0;
    for (int i = 0; i < 10000000; ++i)
    {
        // Deserialize
        decimal d;
        fixed (byte* sp = serialized)
        {
            d = *(decimal*)(sp + i * 16);
        }
        total += d;
    }
    var dser = sw.Elapsed.TotalSeconds;

    Console.WriteLine("Time: {0:0.00}s serialization, {1:0.00}s deserialization", ser, dser);
    Console.ReadLine();
}
At this point, our result is: Time: 0.07s serialization, 0.16s deserialization. Pretty sure that's the fastest this is going to get... still, you have to accept unsafe code here, and I assume data is written the same way it's read.
@Eugene Beresovksy: reading from a stream is very costly. MemoryStream is certainly a powerful and versatile tool, but it carries a pretty high cost compared to reading a binary array directly. Perhaps that is why the second method performs better.
I have a third solution for you, but before I write it, it is necessary to say that I haven't tested its performance.
public static decimal ByteArrayToDecimal(byte[] src, int offset)
{
    var i1 = BitConverter.ToInt32(src, offset);
    var i2 = BitConverter.ToInt32(src, offset + 4);
    var i3 = BitConverter.ToInt32(src, offset + 8);
    var i4 = BitConverter.ToInt32(src, offset + 12);
    return new decimal(new int[] { i1, i2, i3, i4 });
}
This is a way to build the value from binary without worrying about the canonical form of System.Decimal. It is the inverse of the default .NET bit-extraction method:
System.Int32[] bits = Decimal.GetBits((decimal)10);
EDIT:
This solution may not perform better, but it also doesn't have this problem: "Although decimals are represented as 16 bytes, some of the possible values do not constitute valid decimals, so doing an unchecked memcpy could potentially break things."
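For reference, the 16-byte layout those int constructors rely on can be decoded outside .NET as well. A Python sketch, assuming the little-endian lo/mid/hi/flags order that BinaryWriter and Decimal.GetBits produce (scale in bits 16-23 of the flags word, sign in bit 31 — which is why the C# code above checks byte 15 against 128 and reads the scale from byte 14):

```python
import struct
from decimal import Decimal

def bytes_to_decimal(b, offset=0):
    """Decode 16 bytes in .NET decimal layout: int32 lo, mid, hi, then a flags
    word whose bits 16-23 hold the scale and bit 31 the sign."""
    lo, mid, hi, flags = struct.unpack_from('<IIII', b, offset)
    mantissa = (hi << 64) | (mid << 32) | lo
    scale = (flags >> 16) & 0xFF
    sign = -1 if flags & 0x80000000 else 1
    return Decimal(sign * mantissa).scaleb(-scale)
```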
I do not know much about compression algorithms. I am looking for a simple compression algorithm (or code snippet) which can reduce the size of a byte[,,] or byte[]. I cannot make use of System.IO.Compression. Also, the data has lots of repetition.
I tried implementing the RLE algorithm (posted below for your inspection). However, it produces arrays 1.2 to 1.8 times larger than the input.
public static class RLE
{
    public static byte[] Encode(byte[] source)
    {
        List<byte> dest = new List<byte>();
        byte runLength;
        for (int i = 0; i < source.Length; i++)
        {
            runLength = 1;
            while (runLength < byte.MaxValue
                && i + 1 < source.Length
                && source[i] == source[i + 1])
            {
                runLength++;
                i++;
            }
            dest.Add(runLength);
            dest.Add(source[i]);
        }
        return dest.ToArray();
    }

    public static byte[] Decode(byte[] source)
    {
        List<byte> dest = new List<byte>();
        byte runLength;
        for (int i = 1; i < source.Length; i += 2)
        {
            runLength = source[i - 1];
            while (runLength > 0)
            {
                dest.Add(source[i]);
                runLength--;
            }
        }
        return dest.ToArray();
    }
}
I have also found a Java, string- and integer-based LZW implementation. I have converted it to C# and the results look good (code posted below). However, I am not sure how it works, nor how to make it work with bytes instead of strings and integers.
public class LZW
{
    /* Compress a string to a list of output symbols. */
    public static int[] compress(string uncompressed)
    {
        // Build the dictionary.
        int dictSize = 256;
        Dictionary<string, int> dictionary = new Dictionary<string, int>();
        for (int i = 0; i < dictSize; i++)
            dictionary.Add("" + (char)i, i);

        string w = "";
        List<int> result = new List<int>();
        for (int i = 0; i < uncompressed.Length; i++)
        {
            char c = uncompressed[i];
            string wc = w + c;
            if (dictionary.ContainsKey(wc))
                w = wc;
            else
            {
                result.Add(dictionary[w]);
                // Add wc to the dictionary.
                dictionary.Add(wc, dictSize++);
                w = "" + c;
            }
        }
        // Output the code for w.
        if (w != "")
            result.Add(dictionary[w]);
        return result.ToArray();
    }

    /* Decompress a list of output ks to a string. */
    public static string decompress(int[] compressed)
    {
        int dictSize = 256;
        Dictionary<int, string> dictionary = new Dictionary<int, string>();
        for (int i = 0; i < dictSize; i++)
            dictionary.Add(i, "" + (char)i);

        string w = "" + (char)compressed[0];
        string result = w;
        for (int i = 1; i < compressed.Length; i++)
        {
            int k = compressed[i];
            string entry = "";
            if (dictionary.ContainsKey(k))
                entry = dictionary[k];
            else if (k == dictSize)
                entry = w + w[0];

            result += entry;
            // Add w + entry[0] to the dictionary.
            dictionary.Add(dictSize++, w + entry[0]);
            w = entry;
        }
        return result;
    }
}
Have a look here. I used this code as a basis for compression in one of my work projects. Not sure how much of the .NET Framework is accessible in the Xbox 360 SDK, so not sure how well this will work for you.
The problem with that RLE algorithm is that it is too simple. It prefixes every byte with how many times it is repeated, but that does mean that in long ranges of non-repeating bytes, each single byte is prefixed with a "1". On data without any repetitions this will double the file size.
This can be avoided by using Code-type RLE instead; the 'Code' (also called 'Token') will be a byte that can have two meanings; either it indicates how many times the single following byte is repeated, or it indicates how many non-repeating bytes follow that should be copied as they are. The difference between those two codes is made by enabling the highest bit, meaning there are still 7 bits available for the value, meaning the amount to copy or repeat per such code can be up to 127.
This means that even in worst-case scenarios, the final size can only be about 1/127th larger than the original file size.
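The code-type scheme described above can be sketched like this (Python for brevity; the run thresholds and the 127 cap are one possible choice, not the exact code from the linked page):

```python
def rle_encode(data):
    """Token-based RLE: if the token's high bit is set, (token & 0x7F) copies of
    the next byte follow; otherwise the token counts literal bytes copied verbatim."""
    out = bytearray()
    i = 0
    while i < len(data):
        # Measure a run of identical bytes (capped at 127, the 7-bit maximum).
        run = 1
        while run < 127 and i + run < len(data) and data[i + run] == data[i]:
            run += 1
        if run >= 2:
            out += bytes([0x80 | run, data[i]])
            i += run
        else:
            # Collect literals until a run of at least 3 starts (or 127 collected).
            j = i
            while (j < len(data) and j - i < 127
                   and not (j + 2 < len(data) and data[j] == data[j + 1] == data[j + 2])):
                j += 1
            out.append(j - i)
            out += data[i:j]
            i = j
    return bytes(out)

def rle_decode(data):
    out = bytearray()
    i = 0
    while i < len(data):
        token = data[i]
        if token & 0x80:
            out += bytes([data[i + 1]]) * (token & 0x7F)
            i += 2
        else:
            out += data[i + 1:i + 1 + token]
            i += 1 + token
    return bytes(out)
```

On all-distinct input, the overhead is one token byte per 127 literals, matching the roughly 1/127th worst case mentioned above.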
A good explanation of the whole concept, plus full working (and, in fact, heavily optimised) C# code, can be found here:
http://www.shikadi.net/moddingwiki/RLE_Compression
Note that sometimes, the data will end up larger than the original anyway, simply because there are not enough repeating bytes in it for RLE to work. A good way to deal with such compression failures is by adding a header to your final data. If you simply add an extra byte at the start that's on 0 for uncompressed data and 1 for RLE compressed data, then, when RLE fails to give a smaller result, you just save it uncompressed, with the 0 in front, and your final data will be exactly one byte larger than the original. The system at the other side can then read that starting byte and use that to determine if the following data should be uncompressed or just copied.
Look into Huffman codes; it's a pretty simple algorithm. Basically, use fewer bits for patterns that show up more often, and keep a table of how each is encoded. You also have to account in your codewords for the fact that there are no separators to help you decode.
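A minimal sketch of building such a code table (Python; the dict-merging approach is mine, chosen for brevity over an explicit tree):

```python
import heapq
from collections import Counter

def huffman_codes(data):
    """Build a Huffman code table: more frequent bytes get shorter bit strings.
    The resulting codes are prefix-free, so no separators are needed to decode."""
    freq = Counter(data)
    # Each heap entry: (frequency, tiebreak, {symbol: code-so-far}).
    heap = [(f, i, {sym: ''}) for i, (sym, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    if not heap:
        return {}
    if len(heap) == 1:  # degenerate case: only one distinct symbol
        return {sym: '0' for sym in heap[0][2]}
    tiebreak = len(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)
        f2, _, c2 = heapq.heappop(heap)
        # Merging two subtrees prepends one bit to every code inside them.
        merged = {s: '0' + c for s, c in c1.items()}
        merged.update({s: '1' + c for s, c in c2.items()})
        heapq.heappush(heap, (f1 + f2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]
```

Encoding is then just concatenating the code strings; decoding walks bits until a complete code is matched, which is unambiguous because no code is a prefix of another.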