I have two files: a source file and a destination file.
Below is my code for intersecting and unioning the two files using byte arrays.
FileStream frsrc = new FileStream("Src.bin", FileMode.Open);
FileStream frdes = new FileStream("Des.bin", FileMode.Open);
int length = 24; // get file length
byte[] src = new byte[length];
byte[] des = new byte[length]; // create buffer
int Counter = 0; // actual number of bytes read
int subcount = 0;
while (frsrc.Read(src, 0, length) > 0)
{
    try
    {
        Counter = 0;
        frdes.Position = subcount * length;
        while (frdes.Read(des, 0, length) > 0)
        {
            var data = src.Intersect(des);
            var data1 = src.Union(des);
            Counter++;
        }
        subcount++;
        Console.WriteLine(subcount.ToString());
    }
    catch (Exception ex)
    {
    }
}
It works fine, and very fast.
But now the problem is that I want the count of the results, and when I use the code below it becomes very slow.
var data = src.Intersect(des).Count();
var data1 = src.Union(des).Count();
So, is there any solution for that?
If yes, then please let me know as soon as possible.
Thanks
Intersect and Union are not the fastest operations. The reason you see it being fast is that you never actually enumerate the results!
Both return an enumerable, not the actual results of the operation. Nothing happens until you enumerate that result - this is called "deferred execution". When you do Count, you actually enumerate it and incur the full cost of the Intersect and Union - believe me, the Count itself is relatively trivial (though still an O(n) operation!).
You'll need to make your own methods, most likely. You want to avoid the enumerable overhead, and more importantly, you'll probably want a lookup table.
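For instance, a minimal sketch of such a lookup table (an assumption about what you need: only the counts for a pair of buffers, with the byte counts returned by Read passed in) could use two 256-entry flag arrays. Because Intersect and Union operate on distinct values, the counts below match what Intersect(...).Count() and Union(...).Count() would report:
// Sketch: count the distinct byte values present in both buffers (intersect)
// and in either buffer (union) without LINQ.
static void CountIntersectUnion(byte[] src, int srcRead, byte[] des, int desRead,
                                out int intersectCount, out int unionCount)
{
    var inSrc = new bool[256];
    var inDes = new bool[256];
    for (int i = 0; i < srcRead; i++) inSrc[src[i]] = true;
    for (int i = 0; i < desRead; i++) inDes[des[i]] = true;

    intersectCount = 0;
    unionCount = 0;
    for (int v = 0; v < 256; v++)
    {
        if (inSrc[v] && inDes[v]) intersectCount++; // value occurs in both buffers
        if (inSrc[v] || inDes[v]) unionCount++;     // value occurs in at least one buffer
    }
}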
A few points: the comment // get file length is misleading as it is the buffer size. Counter is not the number of bytes read, it is the number of blocks read. data and data1 will end up with the result of the last block read, ignoring any data before them. That is assuming that nothing goes wrong in the while loop - you need to remove the try structure to see if there are any errors.
What you can do is count the number of occurrences of each byte value in each file; then if the count of a byte value is greater than zero in every file, it is a member of the intersection of the files, and if the count is greater than zero in any of the files, it is a member of the union of the files.
It is just as easy to write the code for more than two files as it is for two files, whereas LINQ is easy for two but a little bit more fiddly for more than two. (I put in a comparison with using LINQ in a naïve fashion for only two files at the end.)
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

namespace ConsoleApplication1
{
    class Program
    {
        static void Main(string[] args)
        {
            var file1 = @"C:\Program Files (x86)\Electronic Arts\Crysis 3\Bin32\Crysis3.exe"; // 26MB
            var file2 = @"C:\Program Files (x86)\Electronic Arts\Crysis 3\Bin32\d3dcompiler_46.dll"; // 3MB
            List<string> files = new List<string> { file1, file2 };

            var sw = System.Diagnostics.Stopwatch.StartNew();

            // Prepare array of counters for the bytes
            var nFiles = files.Count;
            int[][] count = new int[nFiles][];
            for (int i = 0; i < nFiles; i++)
            {
                count[i] = new int[256];
            }

            // Get the counts of bytes in each file
            int bufLen = 32768;
            byte[] buffer = new byte[bufLen];
            int bytesRead;
            for (int fileNum = 0; fileNum < nFiles; fileNum++)
            {
                using (var sr = new FileStream(files[fileNum], FileMode.Open, FileAccess.Read))
                {
                    bytesRead = bufLen;
                    while (bytesRead > 0)
                    {
                        bytesRead = sr.Read(buffer, 0, bufLen);
                        for (int i = 0; i < bytesRead; i++)
                        {
                            count[fileNum][buffer[i]]++;
                        }
                    }
                }
            }

            // Find which bytes are in any of the files or in all the files
            var inAny = new List<byte>(); // union
            var inAll = new List<byte>(); // intersect
            for (int i = 0; i < 256; i++)
            {
                Boolean all = true;
                for (int fileNum = 0; fileNum < nFiles; fileNum++)
                {
                    if (count[fileNum][i] > 0)
                    {
                        if (!inAny.Contains((byte)i)) // avoid adding same value more than once
                        {
                            inAny.Add((byte)i);
                        }
                    }
                    else
                    {
                        all = false;
                    }
                }
                if (all)
                {
                    inAll.Add((byte)i);
                }
            }

            sw.Stop();
            Console.WriteLine(sw.ElapsedMilliseconds);

            // Display the results
            Console.WriteLine("Union: " + string.Join(",", inAny.Select(x => x.ToString("X2"))));
            Console.WriteLine();
            Console.WriteLine("Intersect: " + string.Join(",", inAll.Select(x => x.ToString("X2"))));
            Console.WriteLine();

            // Compare to using LINQ.
            // N.B. Will need adjustments for more than two files.
            var srcBytes1 = File.ReadAllBytes(file1);
            var srcBytes2 = File.ReadAllBytes(file2);
            sw.Restart();
            var intersect = srcBytes1.Intersect(srcBytes2).ToArray().OrderBy(x => x);
            var union = srcBytes1.Union(srcBytes2).ToArray().OrderBy(x => x);
            Console.WriteLine(sw.ElapsedMilliseconds);
            Console.WriteLine("Union: " + String.Join(",", union.Select(x => x.ToString("X2"))));
            Console.WriteLine();
            Console.WriteLine("Intersect: " + String.Join(",", intersect.Select(x => x.ToString("X2"))));
            Console.ReadLine();
        }
    }
}
The counting-the-byte-occurrences method is roughly five times faster than the LINQ method on my computer, across a range of file sizes (a few KB to a few MB), even though the timing for the LINQ method excludes loading the files.
I have a byte array and I would like to return sequential chunks of it (in the form of new byte arrays) of a certain size.
I tried:
originalArray = BYTE_ARRAY
var segment = new ArraySegment<byte>(originalArray, 0, 640);
byte[] newArray = new byte[640];
for (int i = segment.Offset; i <= segment.Count; i++)
{
    newArray[i] = segment.Array[i];
}
Obviously this only creates an array of the first 640 bytes from the original array. Ultimately, I want a loop that goes through the first 640 bytes and returns an array of those bytes, then goes through the NEXT 640 bytes and returns an array of THOSE bytes, and so on. The purpose of this is to send messages to a server, and each message must contain 640 bytes. I cannot guarantee that the original array length is divisible by 640.
Thanks
If speed isn't a concern:
var bytes = new byte[640 * 6];
for (var i = 0; i < bytes.Length; i += 640)
{
    var chunk = bytes.Skip(i).Take(640).ToArray();
    ...
}
Alternatively you could use
Span.Slice Method
Buffer.BlockCopy(Array, Int32, Array, Int32, Int32) Method
Span
Span<byte> bytes = arr; // Implicit cast from T[] to Span<T>
...
slicedBytes = bytes.Slice(i, 640);
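For instance, a minimal sketch of chunking with Span<T> that also handles a trailing slice shorter than 640 bytes (arr is assumed to be the source byte array, as in the snippet above):
// Sketch: iterate 640-byte slices over a Span<byte>; the last slice may be shorter.
Span<byte> all = arr;                  // implicit cast from byte[] to Span<byte>
for (int i = 0; i < all.Length; i += 640)
{
    var slice = all.Slice(i, Math.Min(640, all.Length - i));
    byte[] chunk = slice.ToArray();    // copy out if a standalone byte[] is needed
    // ... send chunk
}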
BlockCopy
Note this will probably be the fastest of the 3
var chunk = new byte[640];
Buffer.BlockCopy(bytes, i, chunk, 0, 640);
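A sketch of the same idea as a full loop, keeping a shorter final chunk when the array length is not a multiple of 640 (bytes is assumed to be the source array; this is an illustration, not part of the original answer):
// Sketch: split the source array into 640-byte chunks with Buffer.BlockCopy,
// producing a shorter final chunk for the remainder.
var chunks = new List<byte[]>();
for (int offset = 0; offset < bytes.Length; offset += 640)
{
    int chunkSize = Math.Min(640, bytes.Length - offset);
    var chunk = new byte[chunkSize];
    Buffer.BlockCopy(bytes, offset, chunk, 0, chunkSize);
    chunks.Add(chunk);
}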
If you truly want to make new arrays from each 640 byte chunk, then you're looking for .Skip and .Take
Here's a working example (and a repl of the example) that I hacked together.
using System;
using System.Linq;
using System.Text;
using System.Collections;
using System.Collections.Generic;
class MainClass {
    public static void Main (string[] args) {
        // mock up a byte array from something
        var seedString = String.Join("", Enumerable.Range(0, 1024).Select(x => x.ToString()));
        var byteArrayInput = Encoding.ASCII.GetBytes(seedString);

        var skip = 0;
        var take = 640;
        var total = byteArrayInput.Length;
        var output = new List<byte[]>();

        while (skip + take < total) {
            output.Add(byteArrayInput.Skip(skip).Take(take).ToArray());
            skip += take;
        }

        output.ForEach(c => Console.WriteLine($"chunk: {BitConverter.ToString(c)}"));
    }
}
It's probably better to actually use the ArraySegment properly, unless this is an assignment to learn LINQ extensions.
You can write a generic helper method like this:
public static IEnumerable<T[]> AsBatches<T>(T[] input, int n)
{
    for (int i = 0, r = input.Length; r >= n; r -= n, i += n)
    {
        var result = new T[n];
        Array.Copy(input, i, result, 0, n);
        yield return result;
    }
}
Then you can use it in a foreach loop:
byte[] byteArray = new byte[123456];
foreach (var batch in AsBatches(byteArray, 640))
{
    Console.WriteLine(batch.Length); // Do something with the batch.
}
Or if you want a list of batches just do this:
List<byte[]> listOfBatches = AsBatches(byteArray, 640).ToList();
If you want to get fancy you could make it an extension method, but this is only recommended if you will be using it a lot (don't make an extension method for something you'll only be calling in one place!).
Here I've changed the name to InChunksOf() to make it more readable:
public static class ArrayExt
{
    public static IEnumerable<T[]> InChunksOf<T>(this T[] input, int n)
    {
        for (int i = 0, r = input.Length; r >= n; r -= n, i += n)
        {
            var result = new T[n];
            Array.Copy(input, i, result, 0, n);
            yield return result;
        }
    }
}
Which you could use like this:
byte[] byteArray = new byte[123456];
// ... initialise byteArray[], then:
var listOfChunks = byteArray.InChunksOf(640).ToList();
[EDIT] Corrected loop terminator from r > n to r >= n.
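If the trailing partial chunk also matters (the question noted the array length may not be divisible by 640), a small variation of the same helper could yield it as a shorter final array. This is a sketch, not part of the original answer:
// Sketch: like InChunksOf, but also yields the final chunk even when it is
// shorter than n. (Goes in the same static class, e.g. ArrayExt above.)
public static IEnumerable<T[]> InChunksOfWithRemainder<T>(this T[] input, int n)
{
    for (int i = 0; i < input.Length; i += n)
    {
        int len = Math.Min(n, input.Length - i); // last chunk may be shorter than n
        var result = new T[len];
        Array.Copy(input, i, result, 0, len);
        yield return result;
    }
}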
The output should be a large text file where each line has the form "Number. String" and the text is random:
347. Bus
20175. Yes Yes
15. The same
2. Hello world
178. Tree
The file size must be specified in bytes. I am interested in the fastest way to generate files of about 1000 MB and larger.
Here is my code for generating random text:
public string[] GetRandomTextWithIndexes(int size)
{
    var result = new string[size];

    var sw = Stopwatch.StartNew();
    var indexes = Enumerable.Range(0, size).AsParallel().OrderBy(g => GenerateRandomNumber(0, 5)).ToList();
    sw.Stop();
    Console.WriteLine("Queue fill: " + sw.Elapsed);

    sw = Stopwatch.StartNew();
    Parallel.For(0, size, i =>
    {
        var text = GetRandomText(GenerateRandomNumber(1, 20));
        result[i] = $"{indexes[i]}. {text}";
    });
    sw.Stop();
    Console.WriteLine("Text fill: " + sw.Elapsed);

    return result;
}

public string GetRandomText(int size)
{
    var builder = new StringBuilder();
    for (var i = 0; i < size; i++)
    {
        var character = LegalCharacters[GenerateRandomNumber(0, LegalCharacters.Length)];
        builder.Append(character);
    }
    return builder.ToString();
}

private int GenerateRandomNumber(int min, int max)
{
    lock (_synlock)
    {
        if (_random == null)
            _random = new Random();
        return _random.Next(min, max);
    }
}
I don't know how to make this code work with a target size in MB rather than a number of strings. When I set size to about 1000000000 I receive an OutOfMemoryException. Also, maybe there is a faster way to generate the indexes?
Disk is your bottleneck, no need for parallel processing
No need to store everything in memory before writing
using (var fs = File.OpenWrite(@"c:\w\test.txt"))
using (var w = new StreamWriter(fs))
{
    for (var i = 0; i < size; i++)
    {
        var text = GetRandomText(GenerateRandomNumber(1, 20));
        var number = GenerateRandomNumber(0, 5);
        var line = $"{number}. {text}";
        w.WriteLine(line);
    }
}
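If the requirement is a target size in bytes (roughly 1000 MB) rather than a line count, a sketch along the same lines could track the bytes written per line. This assumes the GetRandomText and GenerateRandomNumber helpers from the question, and the number range used for the line prefix is an arbitrary choice:
// Sketch: keep writing "number. random text" lines until roughly targetBytes
// have been written. Requires using System.Text for Encoding.
long targetBytes = 1000L * 1024 * 1024; // ~1000 MB
long written = 0;
using (var fs = File.OpenWrite(@"c:\w\test.txt"))
using (var w = new StreamWriter(fs))
{
    while (written < targetBytes)
    {
        var line = $"{GenerateRandomNumber(0, 100000)}. {GetRandomText(GenerateRandomNumber(1, 20))}";
        w.WriteLine(line);
        // StreamWriter defaults to UTF-8, so count the bytes of the line plus the newline.
        written += Encoding.UTF8.GetByteCount(line) + Environment.NewLine.Length;
    }
}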
It's better to put the full exception in the question. I bet it shows at
var result = new string[size];
1000000000 for the size of a string array is too much; try running this dotnetfiddle and you'll get:
Run-time exception (line 12): Array dimensions exceeded supported range.

Stack Trace:
[System.OutOfMemoryException: Array dimensions exceeded supported range.]
   at Program.Main() :line 12
Please have a look at the following to see why you are getting that exception and what the workaround is:
What is the Maximum Size that an Array can hold?
Can't create huge arrays
Error when Dictionary count is bigger as 89478457
I have a function which creates multiple CSV files from a DataTable in smaller chunks, based on a size passed through an app.config key/value pair.
Issues with the code below:
I've hardcoded the unit to 1 KB; when I pass a value of 20, it should create CSV files of 20 KB each. Currently it's creating files of 5 KB for the same value.
It does not create any file for the last remaining records.
Kindly help me to fix this. Thanks!
Code:
public static void CreateCSVFile(DataTable dt, string CSVFileName)
{
    int size = Int32.Parse(ConfigurationManager.AppSettings["FileSize"]);
    size *= 1024; //1 KB size
    string CSVPath = ConfigurationManager.AppSettings["CSVPath"];

    StringBuilder FirstLine = new StringBuilder();
    StringBuilder records = new StringBuilder();
    int num = 0;
    int length = 0;

    IEnumerable<string> columnNames = dt.Columns.Cast<DataColumn>().Select(column => column.ColumnName);
    FirstLine.AppendLine(string.Join(",", columnNames));
    records.AppendLine(FirstLine.ToString());
    length += records.ToString().Length;

    foreach (DataRow row in dt.Rows)
    {
        //Putting field values in double quotes
        IEnumerable<string> fields = row.ItemArray.Select(field =>
            string.Concat("\"", field.ToString().Replace("\"", "\"\""), "\""));
        records.AppendLine(string.Join(",", fields));
        length += records.ToString().Length;

        if (length > size)
        {
            //Create a new file
            num++;
            File.WriteAllText(CSVPath + CSVFileName + DateTime.Now.ToString("yyyyMMddHHmmss") + num.ToString("_000") + ".csv", records.ToString());
            records.Clear();
            length = 0;
            records.AppendLine(FirstLine.ToString());
        }
    }
}
Use File.ReadLines; like LINQ, it uses deferred execution, so lines are read lazily as you enumerate them.
foreach (var line in File.ReadLines(FilePath))
{
    // logic here.
}
From MSDN
The ReadLines and ReadAllLines methods differ as follows: When you use
ReadLines, you can start enumerating the collection of strings before
the whole collection is returned; when you use ReadAllLines, you must
wait for the whole array of strings be returned before you can access
the array. Therefore, when you are working with very large files,
ReadLines can be more efficient.
So you could rewrite your method as below.
public static void SplitCSV(string FilePath, string FileName)
{
    //Read specified file size
    int size = Int32.Parse(ConfigurationManager.AppSettings["FileSize"]);
    size *= 1024 * 1024; //1 MB size

    int total = 0;
    int num = 0;
    string FirstLine = null; // header to new file
    var writer = new StreamWriter(GetFileName(FileName, num));

    // Loop through all source lines
    foreach (var line in File.ReadLines(FilePath))
    {
        if (string.IsNullOrEmpty(FirstLine)) FirstLine = line;

        // Length of current line
        int length = line.Length;

        // See if adding this line would exceed the size threshold
        if (total + length >= size)
        {
            // Create a new file
            num++;
            total = 0;
            writer.Dispose();
            writer = new StreamWriter(GetFileName(FileName, num));
            writer.WriteLine(FirstLine);
            length += FirstLine.Length;
        }

        // Write the line to the current file
        writer.WriteLine(line);

        // Add length of line in bytes to running size
        total += length;

        // Add size of newlines
        total += Environment.NewLine.Length;
    }

    // Flush and close the last file
    writer.Dispose();
}
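The snippet above assumes a GetFileName helper that is not shown; a minimal sketch (the naming scheme is an assumption, not part of the original answer) might be:
// Sketch of the assumed GetFileName helper: base name plus a numeric part suffix,
// placed in the configured CSV output folder.
static string GetFileName(string fileName, int num)
{
    string csvPath = ConfigurationManager.AppSettings["CSVPath"];
    return Path.Combine(csvPath, $"{fileName}_{num:000}.csv");
}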
The solution is quite simple... you don't need to put all your lines in memory (as you do in string[] arr = File.ReadAllLines(FilePath);).
Instead, create a StreamReader on the input file and read it line by line into a line buffer. When the buffer goes over your "threshold size", write it to disk as a single CSV file. The code should be something like this:
using (var sr = new System.IO.StreamReader(filePath))
{
    var linesBuffer = new List<string>();
    while (sr.Peek() >= 0)
    {
        linesBuffer.Add(sr.ReadLine());
        if (linesBuffer.Count > yourThreshold)
        {
            // TODO: implement function WriteLinesToPartialCsv
            WriteLinesToPartialCsv(linesBuffer);
            // Clear the buffer:
            linesBuffer.Clear();
            // Try forcing c# to clear the memory:
            GC.Collect();
        }
    }
}
As you can see, by reading the stream line by line (instead of the whole CSV input file, as your code did) you have better control over the memory.
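For completeness, a minimal sketch of the WriteLinesToPartialCsv helper left as a TODO above; the output folder and naming scheme are illustrative assumptions:
// Sketch: write the buffered lines to the next numbered partial CSV file.
static int partNumber = 0;

static void WriteLinesToPartialCsv(List<string> linesBuffer)
{
    var path = Path.Combine(@"c:\output", $"part_{partNumber++:000}.csv");
    File.WriteAllLines(path, linesBuffer);
}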
I am trying to take a PDF document and upload it via an MVC website to be stored in an SAP structure. The SAP structure requires the byte array to be broken up into sections of 1022 bytes. The program seems to work well up to the point where I try to view the PDF document out of SAP. Unfortunately, I cannot view the PDF data stored in SAP due to access rights. So, I created a sort of MOCK program to compare the byte array from before it is sent to SAP (fileContent) with what it should look like once it is returned from SAP (fileContentPostSAP).
The program compares the byte arrays and finds mismatching values at array location 1022.
Is there a bug in my program that is causing the byte arrays to not match? They are supposed to match exactly, right?
ClaimsIdentityMgr claimIdentityMgr = new ClaimsIdentityMgr();
ClaimsIdentity currentClaimsIdentity = claimIdentityMgr.GetCurrentClaimsIdentity();
var subPath = "~/App_Data/" + currentClaimsIdentity.EmailAddress;
var destinationPath = Path.Combine(Server.MapPath(subPath), "LG WM3455H Spec Sheet.pdf");
byte[] fileContent = System.IO.File.ReadAllBytes(destinationPath);

//pretend this is going to SAP
var arrList = SAPServiceRequestRepository.CreateByteListForStructure(fileContent);
var mockStructureList = new List<byte[]>();
foreach (byte[] b in arrList)
    mockStructureList.Add(b);

//now get it back from Mock SAP
var fileContentPostSAP = new byte[fileContent.Count()];
var rowCounter = 0;
var prevLength = 0;
foreach (var item in mockStructureList)
{
    if (rowCounter == 0)
        System.Buffer.BlockCopy(item, 0, fileContentPostSAP, 0, item.Length);
    else
        System.Buffer.BlockCopy(item, 0, fileContentPostSAP, prevLength, item.Length);
    rowCounter++;
    prevLength = item.Length;
}

//compare the original array with the new one
var areEqual = (fileContent == fileContentPostSAP);
for (var i = 0; i < fileContent.Length; i++)
{
    if (fileContent[i] != fileContentPostSAP[i])
        throw new Exception("i = " + i + " | fileContent[i] = " + fileContent[i] + " | fileContentPostSAP[i] = " + fileContentPostSAP[i]);
}
And here is the CreateByteListForStructure function:
public static List<byte[]> CreateByteListForStructure(byte[] fileContent)
{
    var returnList = new List<byte[]>();
    for (var i = 0; i < fileContent.Length; i += 1022)
    {
        if (fileContent.Length - i >= 1022)
        {
            var localByteArray = new byte[1022];
            System.Buffer.BlockCopy(fileContent, i, localByteArray, 0, 1022);
            returnList.Add(localByteArray);
        }
        else
        {
            var localByteArray = new byte[fileContent.Length - i];
            System.Buffer.BlockCopy(fileContent, i, localByteArray, 0, fileContent.Length - i);
            returnList.Add(localByteArray);
        }
    }
    return returnList;
}
There seems to be a simple bug in the code.
This loop, which reconstructs the contents of the array from the blocks:
var prevLength = 0;
foreach (var item in mockStructureList)
{
    if (rowCounter == 0)
        System.Buffer.BlockCopy(item, 0, fileContentPostSAP, 0, item.Length);
    else
        System.Buffer.BlockCopy(item, 0, fileContentPostSAP, prevLength, item.Length);
    rowCounter++;
    prevLength = item.Length;
}
By the description of the blocks, every block is 1022 bytes, which means that after the first iteration, prevLength is set to 1022, but after the next iteration it is set to 1022 again.
The more correct assignment of prevLength would be this:
prevLength += item.Length;
           ^
           |
           +-- added this
This will correctly move the pointer in the output array forward one block at a time, instead of moving it to the second block and then leaving it there.
Basically you write block 0 in the correct place, but all the other blocks are written on top of block 1, leaving block 2 onwards in the output array as zeroes.
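An alternative that sidesteps the offset bookkeeping entirely (a sketch, not part of the original fix) is to append each block to a MemoryStream and take the resulting array:
// Sketch: rebuild the original byte array by appending each block in order,
// letting the stream track the write position instead of prevLength.
byte[] fileContentPostSAP;
using (var ms = new MemoryStream(fileContent.Length))
{
    foreach (var block in mockStructureList)
        ms.Write(block, 0, block.Length);
    fileContentPostSAP = ms.ToArray();
}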
I have this code, which works for writing a binary file:
using (BinaryWriter binWriter =
    new BinaryWriter(File.Open(f.fileName, FileMode.Create)))
{
    for (int i = 0; i < f.histogramValueList.Count; i++)
    {
        binWriter.Write(f.histogramValueList[(int)i]);
    }
    binWriter.Close();
}
And this code to read back from the DAT file on the hard disk:
fileName = Options_DB.get_histogramFileDirectory();
if (File.Exists(fileName))
{
    BinaryReader binReader =
        new BinaryReader(File.Open(fileName, FileMode.Open));
    try
    {
        //byte[] testArray = new byte[3];
        int pos = 0;
        int length = (int)binReader.BaseStream.Length;
        binReader.BaseStream.Seek(0, SeekOrigin.Begin);
        while (pos < length)
        {
            long[] l = new long[256];
            for (int i = 0; i < 256; i++)
            {
                if (pos < length)
                    l[i] = binReader.ReadInt64();
                else
                    break;
                pos += sizeof(Int64);
            }
            list_of_histograms.Add(l);
        }
    }
    catch
    {
    }
    finally
    {
        binReader.Close();
    }
}
But what I want to do is extend the writing code so it writes three more lists to the file, like this:
binWriter.Write(f.histogramValueList[(int)i]);
binWriter.Write(f.histogramValueListR[(int)i]);
binWriter.Write(f.histogramValueListG[(int)i]);
binWriter.Write(f.histogramValueListB[(int)i]);
The first thing is: how can I write all of this so that each list is identified in the file by a string or something, so that when I read the file back I can put each list into a new one?
The second thing is: how do I read the file back so that each list is added to a new list?
Right now it's easy: I'm writing one list, then reading it and adding it to a list.
But now that I've added three more lists, how can I do it?
Thanks.
To get the answer, think about how you would know the number of items in the list you've just serialized.
Cheat code: write the number of items in the collection before the items themselves. When reading, do the reverse.
writer.Write(items.Count());
// write items.Count() items.
Reading:
int count = reader.ReadInt32();
items = new List<ItemType>();
// read count item objects and add to items collection.
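Putting that together for the four histogram lists in the question (a sketch; the list names come from the question, and the values are assumed to be Int64, as the existing reading code suggests):
// Sketch: write the four lists, each preceded by its item count, then read them
// back in the same order into four new lists.
using (var binWriter = new BinaryWriter(File.Open(f.fileName, FileMode.Create)))
{
    foreach (var list in new[] { f.histogramValueList, f.histogramValueListR,
                                 f.histogramValueListG, f.histogramValueListB })
    {
        binWriter.Write(list.Count);      // item count first
        foreach (long value in list)
            binWriter.Write(value);       // then the items
    }
}

var listsReadBack = new List<List<long>>();
using (var binReader = new BinaryReader(File.Open(fileName, FileMode.Open)))
{
    for (int listIndex = 0; listIndex < 4; listIndex++)
    {
        int count = binReader.ReadInt32();
        var list = new List<long>(count);
        for (int i = 0; i < count; i++)
            list.Add(binReader.ReadInt64());
        listsReadBack.Add(list);
    }
}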