Fixed length string from an integer - reduce allocation? - c#

I am currently trying to write a fixed-length file that is multiple gigabytes in size, and I would like to reduce the string allocations in the process.
Currently I am doing this:
using (var streamWriter = new StreamWriter(fileName))
{
    for (int i = 0; i < rowsCount; i++)
    {
        foreach (var cellInfo in cellInfos)
            streamWriter.Write($"{cellInfo.Info1[i],-10}{cellInfo.Info2[i],-10}");
        streamWriter.WriteLine();
    }
}
Since it is a fixed-length file, I have considered calling Write(byte[]) instead of Write(string) and somehow turning the int values into a byte array with padding.
cellInfo.Info1[i] and cellInfo.Info2[i] are both of type int, so the fixed-length file will look something like this at the end:
114001 2979993 1 2238001 324585 2985606 2980884 2532097 2532097 2980884 1 2985606 1 2985606 1492803 1 2791025 2485160 1492803 1 2106095 2720047 971452 2952105 1 2985606 1 2985606 324585 2985606 1195045 2985606
365001 2973993 1402746 1442502 223569 2738205 1121779 2744137 2744137 1121779 1990399 995210 1990399 995210 1759696 1944745 1598017 2485161 1703338 1195696 979974 2720052 251107 2975092 1990399 995210 1990399 995210 223569 2738205 1130022 995240
Pseudocode I have in mind:
//each cellInfo will output 2 values and each value will have a maxLength of 10
var byteArray = new byte[cellInfos.Count * 2 * 10];
var newLineBytes = Encoding.Default.GetBytes(Environment.NewLine);
using (FileStream fs = File.Create(fileName))
{
    for (int i = 0; i < rowsCount; i++)
    {
        foreach (var cellInfo in cellInfos)
        {
            //set values into the array (at the offset belonging to this cellInfo)
            byteArray[0] = cellInfo.Info1[i].GetFirstByte();
            byteArray[1] = cellInfo.Info1[i].GetSecondByte();
            ...
        }
        fs.Write(byteArray);
        fs.Write(newLineBytes);
    }
}
I currently have no idea how I could turn my pseudocode into reality.

PipeWriter backed by MemoryPool should reduce allocations and GC pressure. See example below...
NOTE - THIS HAS NEVER BEEN RUN. MAKE SURE TO DO YOUR OWN DUE DILIGENCE ON THE CODE...
// set path and adapt cellInfos / rowsCount / colCount to your data in the code below.
var path = "file to write";
Span<byte> newLine = new byte[] { (byte)'\r', (byte)'\n' };
Span<char> charBuffer = new char[10];
Span<byte> byteBuffer = new byte[10];
using var writeStream = new FileStream(path, FileMode.Create,
    FileAccess.Write, FileShare.None, bufferSize: 4096, FileOptions.SequentialScan);
// adjust buffer from 4096 -> 512 or increase to 2 << 20 for higher throughput
PipeWriter writer = PipeWriter.Create(writeStream,
    new StreamPipeWriterOptions(MemoryPool<byte>.Shared, 4096, false));
// loop and write your data...
for (int row = 0; row < rowsCount; row++)
{
    for (int col = 0; col < colCount; col++)
    {
        // format the int into the char buffer, pad the rest with spaces to keep the fixed width
        cellInfos[row][col].Info1.TryFormat(charBuffer, out var written);
        charBuffer[written..].Fill(' ');
        Encoding.UTF8.GetBytes(charBuffer, byteBuffer);
        writer.Write(byteBuffer);

        cellInfos[row][col].Info2.TryFormat(charBuffer, out written);
        charBuffer[written..].Fill(' ');
        Encoding.UTF8.GetBytes(charBuffer, byteBuffer);
        writer.Write(byteBuffer);
    }
    writer.Write(newLine);
    // consider an occasional writer.FlushAsync() here for very large files
}
writer.Complete(); // completes the writer and flushes the remaining buffered data to the stream
If this isn't sufficient, I would suggest StringPool from the CommunityToolkit.HighPerformance package (formerly Microsoft.Toolkit.HighPerformance), but it's probably overkill.
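For reference, an untested sketch of what the StringPool route might look like; 'value' here stands in for one of your int cell values, and pooling only pays off if the same numbers repeat often:
// intern repeated numeric strings via StringPool (CommunityToolkit.HighPerformance)
using CommunityToolkit.HighPerformance.Buffers;
...
Span<char> chars = stackalloc char[10];
value.TryFormat(chars, out int written);
string cached = StringPool.Shared.GetOrAdd(chars[..written]); // returns a cached string for this char sequence
streamWriter.Write(cached);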

Use Span<char> and the TryFormat method.
Span<char> span = stackalloc char[10];
int charsWritten;
using (var streamWriter = new StreamWriter(fileName))
{
    ...
    cellInfo.Info1.TryFormat(span, out charsWritten);
    span[charsWritten..].Fill(' '); // pad the remainder with spaces to keep the fixed width
    streamWriter.Write(span);
    cellInfo.Info2.TryFormat(span, out charsWritten);
    span[charsWritten..].Fill(' ');
    streamWriter.Write(span);
    streamWriter.WriteLine();
    ...
}
This will eliminate memory allocations for strings.
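For a fuller picture, here is an untested sketch of this approach applied to the question's loop (assuming, as in the question, that each cellInfo exposes int arrays Info1 and Info2 indexed by row):
Span<char> span = stackalloc char[10];
using (var streamWriter = new StreamWriter(fileName))
{
    for (int i = 0; i < rowsCount; i++)
    {
        foreach (var cellInfo in cellInfos)
        {
            // format each value directly into the span and pad to the fixed width of 10
            cellInfo.Info1[i].TryFormat(span, out int charsWritten);
            span[charsWritten..].Fill(' ');
            streamWriter.Write(span);

            cellInfo.Info2[i].TryFormat(span, out charsWritten);
            span[charsWritten..].Fill(' ');
            streamWriter.Write(span);
        }
        streamWriter.WriteLine();
    }
}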

Related

Convert byte array to array segments of a certain length

I have a byte array and I would like to return sequential chunks (in the form of new byte arrays) of a certain size.
I tried:
originalArray = BYTE_ARRAY
var segment = new ArraySegment<byte>(originalArray, 0, 640);
byte[] newArray = new byte[640];
for (int i = segment.Offset; i < segment.Count; i++)
{
    newArray[i] = segment.Array[i];
}
Obviously this only creates an array of the first 640 bytes from the original array. Ultimately, I want a loop that goes through the first 640 bytes and returns an array of those bytes, then it goes through the NEXT 640 bytes and returns an array of THOSE bytes. The purpose of this is to send messages to a server and each message must contain 640 bytes. I cannot guarantee that the original array length is divisible by 640.
Thanks
If speed isn't a concern:
var bytes = new byte[640 * 6];
for (var i = 0; i < bytes.Length; i += 640)
{
    var chunk = bytes.Skip(i).Take(640).ToArray();
    ...
}
Alternatively, you could use:
Span.Slice Method
Buffer.BlockCopy(Array, Int32, Array, Int32, Int32) Method
Span
Span<byte> bytes = arr; // Implicit cast from T[] to Span<T>
...
var slicedBytes = bytes.Slice(i, 640);
BlockCopy
Note this will probably be the fastest of the 3
var chunk = new byte[640];
Buffer.BlockCopy(bytes, i, chunk, 0, 640);
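Putting that together, an untested sketch of a copying loop that also handles a shorter final chunk (sourceBytes is just placeholder data here):
// walk the array in 640-byte steps, copying a shorter final chunk if the length is not a multiple of 640
byte[] sourceBytes = new byte[2000];
const int chunkSize = 640;
for (int offset = 0; offset < sourceBytes.Length; offset += chunkSize)
{
    int count = Math.Min(chunkSize, sourceBytes.Length - offset);
    var chunk = new byte[count];
    Buffer.BlockCopy(sourceBytes, offset, chunk, 0, count);
    // send 'chunk' to the server here
}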
If you truly want to make new arrays from each 640 byte chunk, then you're looking for .Skip and .Take
Here's a working example (and a repl of the example) that I hacked together.
using System;
using System.Linq;
using System.Text;
using System.Collections;
using System.Collections.Generic;
class MainClass {
public static void Main (string[] args) {
// mock up a byte array from something
var seedString = String.Join("", Enumerable.Range(0, 1024).Select(x => x.ToString()));
var byteArrayInput = Encoding.ASCII.GetBytes(seedString);
var skip = 0;
var take = 640;
var total = byteArrayInput.Length;
var output = new List<byte[]>();
while (skip + take < total) {
output.Add(byteArrayInput.Skip(skip).Take(take).ToArray());
skip += take;
}
output.ForEach(c => Console.WriteLine($"chunk: {BitConverter.ToString(c)}"));
}
}
It's really probably better to actually use the ArraySegment properly --unless this is an assignment to learn LINQ extensions.
You can write a generic helper method like this:
public static IEnumerable<T[]> AsBatches<T>(T[] input, int n)
{
for (int i = 0, r = input.Length; r >= n; r -= n, i += n)
{
var result = new T[n];
Array.Copy(input, i, result, 0, n);
yield return result;
}
}
Then you can use it in a foreach loop:
byte[] byteArray = new byte[123456];
foreach (var batch in AsBatches(byteArray, 640))
{
Console.WriteLine(batch.Length); // Do something with the batch.
}
Or if you want a list of batches just do this:
List<byte[]> listOfBatches = AsBatches(byteArray, 640).ToList();
If you want to get fancy you could make it an extension method, but this is only recommended if you will be using it a lot (don't make an extension method for something you'll only be calling in one place!).
Here I've changed the name to InChunksOf() to make it more readable:
public static class ArrayExt
{
public static IEnumerable<T[]> InChunksOf<T>(this T[] input, int n)
{
for (int i = 0, r = input.Length; r >= n; r -= n, i += n)
{
var result = new T[n];
Array.Copy(input, i, result, 0, n);
yield return result;
}
}
}
Which you could use like this:
byte[] byteArray = new byte[123456];
// ... initialise byteArray[], then:
var listOfChunks = byteArray.InChunksOf(640).ToList();
[EDIT] Corrected loop terminator from r > n to r >= n.
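Note that as written this helper stops once fewer than n elements remain, so a trailing partial chunk is dropped. Since the question says the array length may not be divisible by 640, here is an untested variant that also yields a shorter final batch:
public static IEnumerable<T[]> AsBatchesWithRemainder<T>(T[] input, int n)
{
    // same idea, but the last batch may be shorter than n
    for (int i = 0; i < input.Length; i += n)
    {
        int size = Math.Min(n, input.Length - i);
        var result = new T[size];
        Array.Copy(input, i, result, 0, size);
        yield return result;
    }
}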

Large data table to multiple csv files of specific size in .net

I have one large DataTable with some millions of records. I need to export it into multiple CSV files of a specific size. So, for example, if I choose a file size of 5MB, when I say export, the DataTable will get exported to 4 CSV files, each of size 5MB, and the last file's size may vary due to the remaining records. I went through many solutions here and also had a look at the CsvHelper library, but they all deal with splitting large files into multiple CSVs, not with exporting an in-memory DataTable to multiple CSV files based on the specified file size. I want to do this in C#. Any help in this direction would be great.
Thanks
Jay
Thanks @H.G.Sandhagen and @jdweng for the inputs. Currently I have written the following code, which does the work needed. I know it is not perfect; some enhancements can surely be made, and it could be more efficient if we could pre-determine the length from the data table's item array, as pointed out by Nick.McDermaid. For now, I will go with this code to unblock myself and will post the final optimized version when I have it coded.
public void WriteToCsv(DataTable table, string path, int size)
{
int fileNumber = 0;
StreamWriter sw = new StreamWriter(string.Format(path, fileNumber), false);
//headers
for (int i = 0; i < table.Columns.Count; i++)
{
sw.Write(table.Columns[i]);
if (i < table.Columns.Count - 1)
{
sw.Write(",");
}
}
sw.Write(sw.NewLine);
foreach (DataRow row in table.AsEnumerable())
{
sw.WriteLine(string.Join(",", row.ItemArray.Select(x => x.ToString())));
if (sw.BaseStream.Length > size) // Time to create new file!
{
sw.Close();
sw.Dispose();
fileNumber ++;
sw = new StreamWriter(string.Format(path, fileNumber), false);
}
}
sw.Close();
}
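One small gap in the code above: only the first file gets the header row. A sketch of a fix (untested) is to pull the header loop into a helper and call it every time a new StreamWriter is opened, including the first:
void WriteHeader(StreamWriter writer, DataTable table)
{
    // write the column names as the first row of each file
    for (int i = 0; i < table.Columns.Count; i++)
    {
        writer.Write(table.Columns[i].ColumnName);
        if (i < table.Columns.Count - 1)
            writer.Write(",");
    }
    writer.Write(writer.NewLine);
}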
I had a similar problem and this is how I solved it with CsvHelper.
The answer could easily be adapted to use a DataTable as the source.
public void SplitCsvTest()
{
var inventoryRecords = new List<InventoryCsvItem>();
for (int i = 0; i < 100000; i++)
{
inventoryRecords.Add(new InventoryCsvItem { ListPrice = i + 1, Quantity = i + 1 });
}
const decimal MAX_BYTES = 5 * 1024 * 1024; // 5 MB
List<byte[]> parts = new List<byte[]>();
using (var memoryStream = new MemoryStream())
{
using (var streamWriter = new StreamWriter(memoryStream))
using (var csvWriter = new CsvWriter(streamWriter))
{
csvWriter.WriteHeader<InventoryCsvItem>();
csvWriter.NextRecord();
csvWriter.Flush();
streamWriter.Flush();
var headerSize = memoryStream.Length;
foreach (var record in inventoryRecords)
{
csvWriter.WriteRecord(record);
csvWriter.NextRecord();
csvWriter.Flush();
streamWriter.Flush();
if (memoryStream.Length > (MAX_BYTES - headerSize))
{
parts.Add(memoryStream.ToArray());
memoryStream.SetLength(0);
memoryStream.Position = 0;
csvWriter.WriteHeader<InventoryCsvItem>();
csvWriter.NextRecord();
}
}
if (memoryStream.Length > headerSize)
{
parts.Add(memoryStream.ToArray());
}
}
}
for(int i = 0; i < parts.Count; i++)
{
var part = parts[i];
File.WriteAllBytes($"C:/Temp/Part {i + 1} of {parts.Count}.csv", part);
}
}
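To adapt it to a DataTable as mentioned, an untested sketch of the header and per-row writing with CsvHelper's WriteField/NextRecord (the helper names WriteHeaderRow/WriteDataRow are mine):
void WriteHeaderRow(CsvWriter csv, DataTable table)
{
    // one field per column name, then end the record
    foreach (DataColumn column in table.Columns)
        csv.WriteField(column.ColumnName);
    csv.NextRecord();
}

void WriteDataRow(CsvWriter csv, DataRow row)
{
    // one field per cell value, then end the record
    foreach (var item in row.ItemArray)
        csv.WriteField(item);
    csv.NextRecord();
}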

What is the best way to store a sbyte[,] to disk and back [closed]

Closed 7 years ago. This question needs to be more focused and is not currently accepting answers.
I want to store an sbyte[,] on my disk using as little space as possible (it can't take more than a few seconds to save or load, though) and get it back at a later time.
I can't serialize it to xml: Cannot serialize object of type System.SByte[,]. Multidimensional arrays are not supported.
And I can't convert it to a MemoryStream: Cannot convert from 'sbyte[,]' to 'int'
Besides creating a text file and looping it out piece by piece, what options are there?
If it makes any difference the array can be upwards of 100 000 x 100 000 in size. The file also needs to be usable by different operating systems and computers.
Update.
I went with flattening my array down to a 1D sbyte[], then converting the sbyte[] to a stream and saving it to disk, along with a separate file containing the dimensions.
Stream stream = new MemoryStream(byteArray);
Used this as a base for saving the stream to disk. https://stackoverflow.com/a/5515894/937131
This is a test case I wrote for the flattening and unflattening, if anyone else finds it useful.
[TestMethod]
public void sbyteTo1dThenBack()
{
sbyte[,] start = new sbyte[,]
{
{1, 2},
{3, 4},
{5, 6},
{7, 8},
{9, 10}
};
sbyte[] flattened = new sbyte[start.Length];
System.Buffer.BlockCopy(start, 0, flattened, 0, start.Length * sizeof(sbyte));
sbyte[,] andBackAgain = new sbyte[5, 2];
Buffer.BlockCopy(flattened, 0, andBackAgain, 0, flattened.Length * sizeof(sbyte));
var equal =
start.Rank == andBackAgain.Rank &&
Enumerable.Range(0, start.Rank).All(dimension => start.GetLength(dimension) == andBackAgain.GetLength(dimension)) &&
start.Cast<sbyte>().SequenceEqual(andBackAgain.Cast<sbyte>()); // compare the original with the round-tripped array
Assert.IsTrue(equal);
}
As per my comments, I feel that writing out the byte array equivalents of everything is the way to go here. This may not be the most efficient way to do it, and it lacks a lot of error-handling code that you will need to supply, but it works in my tests.
Edit: Also, BitConverter.ToInt32() may depend on the "Endianness" of your processor. See Scott Chamberlain's comments on how to fix this if you intend to use this code on ARM or other non-x86 systems.
public static class ArraySerializer
{
public static void SaveToDisk(string path, SByte[,] input)
{
var length = input.GetLength(1);
var height = input.GetLength(0);
using (var fileStream = File.OpenWrite(path))
{
fileStream.Write(BitConverter.GetBytes(length), 0, 4);//Store the length
fileStream.Write(BitConverter.GetBytes(height), 0, 4);//Store the height
var lineBuffer = new byte[length];
for (int h = 0; h < height; h++)
{
for (int l = 0; l < length; l++)
{
unchecked //Preserve sign bit
{
lineBuffer[l] = (byte)input[h,l];
}
}
fileStream.Write(lineBuffer,0,length);
}
}
}
public static SByte[,] ReadFromDisk(string path)
{
using (var fileStream = File.OpenRead(path))
{
int length;
int height;
var intBuffer = new byte[4];
fileStream.Read(intBuffer, 0, 4);
length = BitConverter.ToInt32(intBuffer, 0);
fileStream.Read(intBuffer, 0, 4);
height = BitConverter.ToInt32(intBuffer, 0);
var output = new SByte[height, length]; //Note, for large allocations, this can fail... Would fail regardless of how you read it back
var lineBuffer = new byte[length];
for (int h = 0; h < height; h++)
{
fileStream.Read(lineBuffer, 0, length);
for (int l = 0; l < length; l++)
unchecked //Preserve sign bit
{
output[h,l] = (SByte)lineBuffer[l];
}
}
return output;
}
}
}
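Regarding the endianness note above, an untested sketch of one way to pin the byte order of the two header ints (assuming System.Buffers.Binary is available, i.e. .NET Core 2.1 or later):
// write the header ints in an explicit little-endian order so the file is portable across platforms
Span<byte> header = stackalloc byte[8];
BinaryPrimitives.WriteInt32LittleEndian(header, length);
BinaryPrimitives.WriteInt32LittleEndian(header.Slice(4), height);
fileStream.Write(header);
// ...and use BinaryPrimitives.ReadInt32LittleEndian when reading the header back in.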
Here's how I tested it:
void Main()
{
var test = new SByte[20000, 25000];
var length = test.GetLength(1);
var height = test.GetLength(0);
var lineBuffer = new byte[length];
var random = new Random();
//Populate with random data
for (int h = 0; h < height; h++)
{
random.NextBytes(lineBuffer);
for (int l = 0; l < length; l++)
{
unchecked //Let's use first bit as a sign bit for SByte
{
test[h,l] = (SByte)lineBuffer[l];
}
}
}
var sw = Stopwatch.StartNew();
ArraySerializer.SaveToDisk(@"c:\users\ed\desktop\test.bin", test);
Console.WriteLine(sw.Elapsed);
sw.Restart();
var test2 = ArraySerializer.ReadFromDisk(@"c:\users\ed\desktop\test.bin");
Console.WriteLine(sw.Elapsed);
Console.WriteLine(test.GetLength(0) == test2.GetLength(0));
Console.WriteLine(test.GetLength(1) == test2.GetLength(1));
Console.WriteLine(Enumerable.Cast<SByte>(test).SequenceEqual(Enumerable.Cast<SByte>(test2))); //Dirty hack to compare contents... takes a very long time
}
On my system (with an SSD), that test takes ~2.7s to write or read the contents of the 20kx25k array. To add compression, you can just wrap the FileStream in a GZipStream.
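For the compression variant, an untested sketch of how the stream setup in SaveToDisk would change (System.IO.Compression; ReadFromDisk would wrap its FileStream in a GZipStream with CompressionMode.Decompress in the same way):
using (var fileStream = File.OpenWrite(path))
using (var gzip = new GZipStream(fileStream, CompressionLevel.Fastest))
{
    // write the header and the row buffers to gzip instead of fileStream, exactly as above
    gzip.Write(BitConverter.GetBytes(length), 0, 4);
    gzip.Write(BitConverter.GetBytes(height), 0, 4);
    // ...
}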

Intersect and Union in byte array of 2 files

I have 2 files: one is the source file and the second is the destination file.
Below is my code to intersect and union the two files using byte arrays.
FileStream frsrc = new FileStream("Src.bin", FileMode.Open);
FileStream frdes = new FileStream("Des.bin", FileMode.Open);
int length = 24; // get file length
byte[] src = new byte[length];
byte[] des = new byte[length]; // create buffer
int Counter = 0; // actual number of bytes read
int subcount = 0;
while (frsrc.Read(src, 0, length) > 0)
{
    try
    {
        Counter = 0;
        frdes.Position = subcount * length;
        while (frdes.Read(des, 0, length) > 0)
        {
            var data = src.Intersect(des);
            var data1 = src.Union(des);
            Counter++;
        }
        subcount++;
        Console.WriteLine(subcount.ToString());
    }
    catch (Exception ex)
    {
    }
}
It works fine and is fast.
But now the problem is that I want the count of the results, and when I use the code below it becomes very slow.
var data = src.Intersect(des).Count();
var data1 = src.Union(des).Count();
So, is there any solution for that?
If yes, then please let me know as soon as possible.
Thanks
Intersect and Union are not the fastest operations. The reason you see it being fast is that you never actually enumerate the results!
Both return an enumerable, not the actual results of the operation. You're supposed to go through that and enumerate the enumerable, otherwise nothing happens - this is called "deferred execution". Now, when you do Count, you actually enumerate the enumerable, and incur the full cost of the Intersect and Union - believe me, the Count itself is relatively trivial (though still an O(n) operation!).
You'll need to make your own methods, most likely. You want to avoid the enumerable overhead, and more importantly, you'll probably want a lookup table.
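An untested sketch of that lookup-table idea for two byte arrays, counting distinct byte values (which is what Intersect and Union over bytes produce):
static (int intersectCount, int unionCount) CountIntersectUnion(byte[] a, byte[] b)
{
    // mark which of the 256 possible byte values occur in each array
    var inA = new bool[256];
    var inB = new bool[256];
    foreach (var x in a) inA[x] = true;
    foreach (var x in b) inB[x] = true;

    int intersect = 0, union = 0;
    for (int i = 0; i < 256; i++)
    {
        if (inA[i] && inB[i]) intersect++;
        if (inA[i] || inB[i]) union++;
    }
    return (intersect, union);
}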
A few points: the comment // get file length is misleading as it is the buffer size. Counter is not the number of bytes read, it is the number of blocks read. data and data1 will end up with the result of the last block read, ignoring any data before them. That is assuming that nothing goes wrong in the while loop - you need to remove the try structure to see if there are any errors.
What you can do is count the number of occurrences of each byte in each file. Then, if the count of a byte is greater than zero in any file, it is a member of the union of the files, and if its count is greater than zero in all the files, it is a member of the intersection of the files.
It is just as easy to write the code for more than two files as it is for two files, whereas LINQ is easy for two but a little bit more fiddly for more than two. (I put in a comparison with using LINQ in a naïve fashion for only two files at the end.)
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
namespace ConsoleApplication1
{
class Program
{
static void Main(string[] args)
{
var file1 = @"C:\Program Files (x86)\Electronic Arts\Crysis 3\Bin32\Crysis3.exe"; // 26MB
var file2 = @"C:\Program Files (x86)\Electronic Arts\Crysis 3\Bin32\d3dcompiler_46.dll"; // 3MB
List<string> files = new List<string> { file1, file2 };
var sw = System.Diagnostics.Stopwatch.StartNew();
// Prepare array of counters for the bytes
var nFiles = files.Count;
int[][] count = new int[nFiles][];
for (int i = 0; i < nFiles; i++)
{
count[i] = new int[256];
}
// Get the counts of bytes in each file
int bufLen = 32768;
byte[] buffer = new byte[bufLen];
int bytesRead;
for (int fileNum = 0; fileNum < nFiles; fileNum++)
{
using (var sr = new FileStream(files[fileNum], FileMode.Open, FileAccess.Read))
{
bytesRead = bufLen;
while (bytesRead > 0)
{
bytesRead = sr.Read(buffer, 0, bufLen);
for (int i = 0; i < bytesRead; i++)
{
count[fileNum][buffer[i]]++;
}
}
}
}
// Find which bytes are in any of the files or in all the files
var inAny = new List<byte>(); // union
var inAll = new List<byte>(); // intersect
for (int i = 0; i < 256; i++)
{
Boolean all = true;
for (int fileNum = 0; fileNum < nFiles; fileNum++)
{
if (count[fileNum][i] > 0)
{
if (!inAny.Contains((byte)i)) // avoid adding same value more than once
{
inAny.Add((byte)i);
}
}
else
{
all = false;
}
};
if (all)
{
inAll.Add((byte)i);
};
}
sw.Stop();
Console.WriteLine(sw.ElapsedMilliseconds);
// Display the results
Console.WriteLine("Union: " + string.Join(",", inAny.Select(x => x.ToString("X2"))));
Console.WriteLine();
Console.WriteLine("Intersect: " + string.Join(",", inAll.Select(x => x.ToString("X2"))));
Console.WriteLine();
// Compare to using LINQ.
// N/B. Will need adjustments for more than two files.
var srcBytes1 = File.ReadAllBytes(file1);
var srcBytes2 = File.ReadAllBytes(file2);
sw.Restart();
var intersect = srcBytes1.Intersect(srcBytes2).ToArray().OrderBy(x => x);
var union = srcBytes1.Union(srcBytes2).ToArray().OrderBy(x => x);
Console.WriteLine(sw.ElapsedMilliseconds);
Console.WriteLine("Union: " + String.Join(",", union.Select(x => x.ToString("X2"))));
Console.WriteLine();
Console.WriteLine("Intersect: " + String.Join(",", intersect.Select(x => x.ToString("X2"))));
Console.ReadLine();
}
}
}
The counting-the-byte-occurrences method is roughly five times faster than the LINQ method on my computer, even without the latter loading the files, and across a range of file sizes (a few KB to a few MB).

How can I write and read using a BinaryWriter?

I have this code, which works when writing a binary file:
using (BinaryWriter binWriter =
new BinaryWriter(File.Open(f.fileName, FileMode.Create)))
{
for (int i = 0; i < f.histogramValueList.Count; i++)
{
binWriter.Write(f.histogramValueList[(int)i]);
}
binWriter.Close();
}
And this code to read back from the DAT file on the hard disk:
fileName = Options_DB.get_histogramFileDirectory();
if (File.Exists(fileName))
{
BinaryReader binReader =
new BinaryReader(File.Open(fileName, FileMode.Open));
try
{
//byte[] testArray = new byte[3];
int pos = 0;
int length = (int)binReader.BaseStream.Length;
binReader.BaseStream.Seek(0, SeekOrigin.Begin);
while (pos < length)
{
long[] l = new long[256];
for (int i = 0; i < 256; i++)
{
if (pos < length)
l[i] = binReader.ReadInt64();
else
break;
pos += sizeof(Int64);
}
list_of_histograms.Add(l);
}
}
catch
{
}
finally
{
binReader.Close();
}
But what I want to do is extend the writing code so that it writes three more lists to the file, like this:
binWriter.Write(f.histogramValueList[(int)i]);
binWriter.Write(f.histogramValueListR[(int)i]);
binWriter.Write(f.histogramValueListG[(int)i]);
binWriter.Write(f.histogramValueListB[(int)i]);
The first thing is: how can I write all of this and mark each list in the file (with a string or something) so that when I read the file back I can put each list into a new one?
The second thing is: how do I read the file back so that each list is added to a new list?
Right now it's easy: I'm writing one list, then reading it and adding it to a list.
But now I've added three more lists, so how can I do it?
Thanks.
To get the answer, think about how to get the number of items in the list that you've just serialized.
Cheat code: write the number of items in the collection before the items. When reading, do the reverse.
writer.Write(items.Count());
// write items.Count() items.
Reading:
int count = reader.ReadInt32();
items = new List<ItemType>();
// read count item objects and add to items collection.
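An untested sketch of that pattern applied to the four lists in the question (assuming they are List<long>, since the reading code uses ReadInt64):
// each list is written as a count followed by its items; lists are read back in the same order
static void WriteList(BinaryWriter writer, List<long> items)
{
    writer.Write(items.Count);
    foreach (var item in items)
        writer.Write(item);
}

static List<long> ReadList(BinaryReader reader)
{
    int count = reader.ReadInt32();
    var items = new List<long>(count);
    for (int i = 0; i < count; i++)
        items.Add(reader.ReadInt64());
    return items;
}

// Writing:
using (var binWriter = new BinaryWriter(File.Open(fileName, FileMode.Create)))
{
    WriteList(binWriter, f.histogramValueList);
    WriteList(binWriter, f.histogramValueListR);
    WriteList(binWriter, f.histogramValueListG);
    WriteList(binWriter, f.histogramValueListB);
}

// Reading, in the same order the lists were written:
using (var binReader = new BinaryReader(File.Open(fileName, FileMode.Open)))
{
    var values = ReadList(binReader);
    var valuesR = ReadList(binReader);
    var valuesG = ReadList(binReader);
    var valuesB = ReadList(binReader);
}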
