SUMMARY
I'm reading bytes from a file in chunks (I haven't settled on a chunk size yet; somewhere between 128 and 1024 bytes), and I want to search each buffer to see whether it contains a signature (pattern) given as another byte array. If part of the pattern is found at the very end of the buffer, it should read the next few bytes from the file to see whether they complete a match.
What I've Tried
public static bool Contains(byte[] buffer, byte[] signiture, FileStream file)
{
for (var i = buffer.Length - 1; i >= signiture.Length - 1; i--) //move backwards through array stop if < signature
{
var found = true; //set found to true at start
for (var j = signiture.Length - 1; j >= 0 && found; j--) //loop backwards through signature
{
found = buffer[i - (signiture.Length - 1 - j)] == signiture[j];// compare signature's element with corresponding element of buffer
}
if (found)
return true; //if signature is found return true
}
//checking end of buffer for partial signiture
for (var x = signiture.Length - 1; x >= 1; x--)
{
if (buffer.Skip(buffer.Length - x).Take(x).SequenceEqual(signiture.Skip(0).Take(x))) //check if partial is equal to partial signiture
{
byte[] nextBytes = new byte[signiture.Length - x];
file.Read(nextBytes, 0, signiture.Length - x); //read next needed bytes from file
if (!signiture.Skip(0).Take(x).ToArray().Concat(nextBytes).SequenceEqual(signiture))
return false; //return false if not a match
return true; //return true if a match
}
}
return false; //if not found return false
}
This works, but I've been told LINQ is slow and that I should use Array.IndexOf() instead. I've tried that but can't figure out how to implement it.
You can make use of Span<T>, AsSpan and MemoryExtensions.SequenceEqual. The latter is not LINQ; it is optimized, especially for byte arrays. It unrolls the loop and uses unsafe code to essentially do a memcmp.
If you aren't using a framework that includes these types and methods by default (.NET Core 2.1+, .NET Standard 2.1), you can add the System.Memory NuGet package. Its implementation of SequenceEqual is a bit different (the so-called "slow version"), but it is still faster than LINQ's SequenceEqual.
Note that you also need to check the return value of FileStream.Read.
// Requires: using System; using System.Buffers; using System.IO;
public static bool Contains(byte[] buffer, byte[] signiture, FileStream file)
{
var sigSpan = signiture.AsSpan();
//move backwards through buffer and check if signature found
for (var i = buffer.Length - signiture.Length; i >= 0; i--)
{
if (buffer.AsSpan(i, signiture.Length).SequenceEqual(sigSpan))
return true;
}
for (var x = signiture.Length - 1; x >= 1; x--)
{
var sig = sigSpan.Slice(0, x);
if (buffer.AsSpan(buffer.Length - x).SequenceEqual(sig)) //check if partial is equal to partial signiture
{
var sigLen = signiture.Length;
byte[] nextBytes = ArrayPool<byte>.Shared.Rent(sigLen - x);
// need to store number of bytes read
var read = file.Read(nextBytes, 0, sigLen - x); //read next needed bytes from file
var next = nextBytes.AsSpan(0, read);
// don't need to concat with signature, because obviously signature is going to
// start with signature.Skip(0).Take(...)
// just test that the number of bytes we read, plus the number we will skip equals
// the actual length, then check the remainder
var result = (read + x == signiture.Length
&& signiture.AsSpan(x).SequenceEqual(next));
ArrayPool<byte>.Shared.Return(nextBytes);
return result;
}
}
return false; //if not found return false
}
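The backward loop over the buffer can also be collapsed into a single span search. This is a minimal sketch, assuming .NET Core 2.1+ or the System.Memory package is available; `SpanSearch` and `IndexOfSignature` are hypothetical names, not part of the answer above. The `count` parameter is there because a read may fill only part of the buffer.

```csharp
using System;

public static class SpanSearch
{
    // Returns the index of the first occurrence of signature within the first
    // `count` bytes of buffer, or -1 if it does not occur there.
    public static int IndexOfSignature(byte[] buffer, int count, byte[] signature)
    {
        var haystack = new ReadOnlySpan<byte>(buffer, 0, count);
        ReadOnlySpan<byte> needle = signature;
        // MemoryExtensions.IndexOf searches for the whole needle sequence.
        return haystack.IndexOf(needle);
    }
}
```

A call such as `SpanSearch.IndexOfSignature(buffer, bytesRead, signature) >= 0` would then stand in for the first loop of `Contains`; the partial match at the end of the buffer still has to be handled separately.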
Related
I'm reading string data from inside a file. When I search the string data I read, the value I want does not seem to exist. Can you help with this topic?
The word I'm trying to search is: GTA:SA:MP
The code I use is:
static byte[] ReadFile(string filePath)
{
byte[] buffer;
FileStream fileStream = new FileStream(filePath, FileMode.Open, FileAccess.Read);
try
{
int length = (int)fileStream.Length; // get file length
buffer = new byte[length]; // create buffer
int count; // actual number of bytes read
int sum = 0; // total number of bytes read
// read until Read method returns 0 (end of the stream has been reached)
while ((count = fileStream.Read(buffer, sum, length - sum)) > 0)
sum += count; // sum is a buffer offset for next reading
}
finally
{
fileStream.Close();
}
return buffer;
}
static void Main(string[] args)
{
byte[] data = ReadFile(@"FILE.exe");
string result = Encoding.ASCII.GetString(data);
if (result.Contains("GTA:SA:MP"))
{
Console.WriteLine("Found");
}
else
{
Console.WriteLine("Not found");
}
Console.ReadLine();
}
The output I get: Not found
You've got a couple of problems. First, as others have pointed out, if your source is bytes then you should compare bytes, not strings; otherwise you have encoding issues. Second, you're using a buffer but not checking for any boundary conditions, that is, cases where the pattern you're searching for is split across the buffer size boundary. One simple way to handle this is to treat the source as a stream and check byte by byte. I'll include an example using a simple state machine made from local functions.
I used local functions just because it seemed fun; you can do this in a myriad of ways.
static void Main(string[] _)
{
byte[] target = Encoding.UTF8.GetBytes("2:30pm");
long offsetInSource = 0;
int indexOfTarget = 0;
long current = 0;
bool found = false;
Func<byte, byte, bool> match = CheckStart;
using (BinaryReader reader = new BinaryReader(File.Open("foo.txt", FileMode.Open)))
{
while (current < reader.BaseStream.Length)
{
var b = reader.ReadByte();
var t = target[indexOfTarget];
if (match(t, b))
{
found = true;
break;
}
++current;
}
}
if (found)
{
Console.WriteLine($"Found matching pattern at: {offsetInSource}");
}
else
{
Console.WriteLine("Did not find pattern");
}
bool CheckStart(byte t, byte b)
{
if (t == b)
{
offsetInSource = current;
if (++indexOfTarget == target.Length)
return true;
match = CheckRest;
}
return false;
}
bool CheckRest(byte t, byte b)
{
if (t == b)
{
if (++indexOfTarget == target.Length)
return true;
}
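else
{
// Mismatch: restart, then re-test this byte against the first target byte
// so that a match beginning at the current position is not skipped.
indexOfTarget = 0;
match = CheckStart;
return CheckStart(target[0], b);
}
return false;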
}
}
}
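A byte-at-a-time matcher that restarts from the first target byte on a mismatch can miss matches when the target has a repeated prefix (for example, finding "aab" in "aaab"). Below is a minimal sketch of a streaming search that handles this with a Knuth-Morris-Pratt failure table. `StreamSearch`, `IndexOf` and `BuildFailureTable` are hypothetical names, not part of the answer above, and the returned offset is relative to where the stream was positioned when the call started.

```csharp
using System;
using System.IO;

public static class StreamSearch
{
    // fail[i] is the length of the longest proper prefix of target[0..i]
    // that is also a suffix of target[0..i].
    private static int[] BuildFailureTable(byte[] target)
    {
        var fail = new int[target.Length];
        for (int i = 1, k = 0; i < target.Length; i++)
        {
            while (k > 0 && target[i] != target[k]) k = fail[k - 1];
            if (target[i] == target[k]) k++;
            fail[i] = k;
        }
        return fail;
    }

    // Returns the offset of the first match, or -1 if the target never occurs.
    public static long IndexOf(Stream source, byte[] target)
    {
        if (target.Length == 0) return 0;
        int[] fail = BuildFailureTable(target);
        int matched = 0;   // number of target bytes matched so far
        long position = 0; // number of bytes consumed from the stream
        int next;
        while ((next = source.ReadByte()) != -1)
        {
            byte b = (byte)next;
            // On a mismatch, fall back to the longest prefix that still fits.
            while (matched > 0 && b != target[matched]) matched = fail[matched - 1];
            if (b == target[matched]) matched++;
            position++;
            if (matched == target.Length) return position - target.Length;
        }
        return -1;
    }
}
```

Because the state is just `matched` and `position`, buffer boundaries never matter: the stream is consumed one byte at a time, whatever chunk size the underlying stream uses internally.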
If your file is huge, you can read it in overlapping chunks: read 500 characters (for example) into a string variable and search it for your phrase. If the phrase is not found, advance by 450 characters (500 - 50), so that the next 500-character chunk overlaps the previous one by 50, and search again. The overlap has to be at least one character shorter than the phrase itself or longer, otherwise a phrase split across two chunks can be missed. Repeat until the phrase is found or EOF is reached.
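The same overlapping-chunk idea can be applied to raw bytes, which avoids the encoding issues of searching text. This is a minimal sketch, assuming .NET Core 2.1+ or the System.Memory package; `ChunkedSearch`, its `Contains` method and the default `chunkSize` of 512 are hypothetical choices for illustration. Instead of seeking backwards, it carries the last `pattern.Length - 1` bytes of each pass over to the next one.

```csharp
using System;
using System.IO;

public static class ChunkedSearch
{
    // Returns true if pattern occurs anywhere in the stream. Each pass keeps
    // the tail of the previous window so a match split across two reads is seen.
    public static bool Contains(Stream source, byte[] pattern, int chunkSize = 512)
    {
        if (pattern.Length == 0) return true;
        int overlap = pattern.Length - 1;
        var window = new byte[overlap + chunkSize];
        int carried = 0; // bytes kept from the previous pass
        int read;
        while ((read = source.Read(window, carried, chunkSize)) > 0)
        {
            int filled = carried + read;
            var haystack = new ReadOnlySpan<byte>(window, 0, filled);
            ReadOnlySpan<byte> needle = pattern;
            if (haystack.IndexOf(needle) >= 0) return true;

            // Move the last `overlap` bytes (or fewer) to the front of the window.
            carried = Math.Min(overlap, filled);
            Array.Copy(window, filled - carried, window, 0, carried);
        }
        return false;
    }
}
```

The return value of `Read` is used on every pass, so short reads are handled the same way as full ones.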
I am currently working on writing my own disorganized P2P network (just for fun) and am trying to learn async. I want to call the following function which will return a byte[] chunk of the file being sent. Then, the main function will delegate the transmission of that chunk. Is it possible to have the function return data multiple times, waiting for the transmission of the previous chunk to finish before reading the next chunk into memory? I know I can write the function to simply be called several times with index-tracking arguments but would like to use async if possible since this seems like a perfect use-case.
public async Task<byte[]> ChunkData(string str)
{
byte[] tempData = new byte[BUFFER_SIZE]; //Stores one chunk of data
byte[] dataBytes = Encoding.Default.GetBytes(str); //Stores all data TODO: convert to filestream
int extraBytes = 0; //Track any extra bytes after we have exhausted full chunks
int dataChunks = dataBytes.Length / BUFFER_SIZE; //Determine how many chunks are in the file
for (int i = 0; i <= dataChunks; i++) //Loop for all dataChunks and once more for extraBytes
{
if (i == dataChunks) //Last iteration
{
extraBytes = dataBytes.Length - dataChunks * BUFFER_SIZE;
if (extraBytes == 0) break; //Break if no extra bytes
}
int endPoint = dataChunks > 0 ? (BUFFER_SIZE * (i != dataChunks ? (i + 1) : i)) : 0;
int frontPoint = i == 0 ? 0 : (BUFFER_SIZE * i);
tempData = dataBytes[frontPoint..(endPoint + extraBytes)]; //Assign tempData to range
Console.WriteLine(Encoding.Default.GetString(tempData)); //For debugging
}
}
I have two files: a source file and a destination file.
Below is my code to compute the intersection and union of the two files using byte arrays.
FileStream frsrc = new FileStream("Src.bin", FileMode.Open);
FileStream frdes = new FileStream("Des.bin", FileMode.Open);
int length = 24; // get file length
byte[] src = new byte[length];
byte[] des = new byte[length]; // create buffer
int Counter = 0; // actual number of bytes read
int subcount = 0;
while (frsrc.Read(src, 0, length) > 0)
{
try
{
Counter = 0;
frdes.Position = subcount * length;
while (frdes.Read(des, 0, length) > 0)
{
var data = src.Intersect(des);
var data1 = src.Union(des);
Counter++;
}
subcount++;
Console.WriteLine(subcount.ToString());
}
catch (Exception ex)
{
}
}
It works fine and runs very fast.
But now the problem is that I want the count of the results, and when I use the code below it becomes very slow.
var data = src.Intersect(des).Count();
var data1 = src.Union(des).Count();
So, is there any solution for that?
If yes, then please let me know as soon as possible.
Thanks
Intersect and Union are not the fastest operations. The reason you see it being fast is that you never actually enumerate the results!
Both return an enumerable, not the actual results of the operation. You're supposed to go through that and enumerate the enumerable, otherwise nothing happens - this is called "deferred execution". Now, when you do Count, you actually enumerate the enumerable, and incur the full cost of the Intersect and Union - believe me, the Count itself is relatively trivial (though still an O(n) operation!).
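To make the deferred execution concrete, here is a minimal sketch that counts how often the source sequence is actually pulled from. `DeferredDemo`, `Source`, `Pulls` and `Run` are hypothetical names invented for this illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class DeferredDemo
{
    // Incremented every time an element is pulled from Source.
    public static int Pulls;

    public static IEnumerable<byte> Source(byte[] data)
    {
        foreach (var b in data)
        {
            Pulls++;
            yield return b;
        }
    }

    public static (int before, int after, int count) Run()
    {
        Pulls = 0;
        var left = new byte[] { 1, 2, 3, 3 };
        var right = new byte[] { 3, 4 };

        var query = Source(left).Intersect(right); // builds the query only
        int before = Pulls;                        // nothing has been enumerated yet
        int count = query.Count();                 // the real work happens here
        return (before, Pulls, count);
    }
}
```

`Run()` reports zero pulls before `Count()` is called and four afterwards, which is why the loop without `Count()` appears to be so fast.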
You'll need to make your own methods, most likely. You want to avoid the enumerable overhead, and more importantly, you'll probably want a lookup table.
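One possible shape for such a method, as a sketch only: since a byte has just 256 possible values, two 256-entry presence tables are enough to count the distinct values in the intersection and the union of two buffers. `ByteSets` and `Count` are hypothetical names, and the explicit length parameters are an assumption added so that a partially filled buffer (a short `Read`) is not counted past its valid data.

```csharp
using System;

public static class ByteSets
{
    // Counts the distinct byte values present in both buffers (intersection)
    // and in at least one buffer (union).
    public static (int intersectCount, int unionCount) Count(
        byte[] a, int aLen, byte[] b, int bLen)
    {
        var inA = new bool[256];
        var inB = new bool[256];
        for (int i = 0; i < aLen; i++) inA[a[i]] = true;
        for (int i = 0; i < bLen; i++) inB[b[i]] = true;

        int intersectCount = 0, unionCount = 0;
        for (int v = 0; v < 256; v++)
        {
            if (inA[v] && inB[v]) intersectCount++;
            if (inA[v] || inB[v]) unionCount++;
        }
        return (intersectCount, unionCount);
    }
}
```

In the question's loop this could be called as `ByteSets.Count(src, srcRead, des, desRead)`, where `srcRead` and `desRead` would be the values returned by the two `Read` calls.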
A few points: the comment // get file length is misleading as it is the buffer size. Counter is not the number of bytes read, it is the number of blocks read. data and data1 will end up with the result of the last block read, ignoring any data before them. That is assuming that nothing goes wrong in the while loop - you need to remove the try structure to see if there are any errors.
What you can do is count the number of occurrences of each byte in each file. If the count of a byte in any file is greater than zero, then it is a member of the union of the files, and if the count of a byte in all the files is greater than zero, then it is a member of the intersection of the files.
It is just as easy to write the code for more than two files as it is for two files, whereas LINQ is easy for two but a little bit more fiddly for more than two. (I put in a comparison with using LINQ in a naïve fashion for only two files at the end.)
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
namespace ConsoleApplication1
{
class Program
{
static void Main(string[] args)
{
var file1 = @"C:\Program Files (x86)\Electronic Arts\Crysis 3\Bin32\Crysis3.exe"; // 26MB
var file2 = @"C:\Program Files (x86)\Electronic Arts\Crysis 3\Bin32\d3dcompiler_46.dll"; // 3MB
List<string> files = new List<string> { file1, file2 };
var sw = System.Diagnostics.Stopwatch.StartNew();
// Prepare array of counters for the bytes
var nFiles = files.Count;
int[][] count = new int[nFiles][];
for (int i = 0; i < nFiles; i++)
{
count[i] = new int[256];
}
// Get the counts of bytes in each file
int bufLen = 32768;
byte[] buffer = new byte[bufLen];
int bytesRead;
for (int fileNum = 0; fileNum < nFiles; fileNum++)
{
using (var sr = new FileStream(files[fileNum], FileMode.Open, FileAccess.Read))
{
bytesRead = bufLen;
while (bytesRead > 0)
{
bytesRead = sr.Read(buffer, 0, bufLen);
for (int i = 0; i < bytesRead; i++)
{
count[fileNum][buffer[i]]++;
}
}
}
}
// Find which bytes are in any of the files or in all the files
var inAny = new List<byte>(); // union
var inAll = new List<byte>(); // intersect
for (int i = 0; i < 256; i++)
{
Boolean all = true;
for (int fileNum = 0; fileNum < nFiles; fileNum++)
{
if (count[fileNum][i] > 0)
{
if (!inAny.Contains((byte)i)) // avoid adding same value more than once
{
inAny.Add((byte)i);
}
}
else
{
all = false;
}
};
if (all)
{
inAll.Add((byte)i);
};
}
sw.Stop();
Console.WriteLine(sw.ElapsedMilliseconds);
// Display the results
Console.WriteLine("Union: " + string.Join(",", inAny.Select(x => x.ToString("X2"))));
Console.WriteLine();
Console.WriteLine("Intersect: " + string.Join(",", inAll.Select(x => x.ToString("X2"))));
Console.WriteLine();
// Compare to using LINQ.
// N/B. Will need adjustments for more than two files.
var srcBytes1 = File.ReadAllBytes(file1);
var srcBytes2 = File.ReadAllBytes(file2);
sw.Restart();
var intersect = srcBytes1.Intersect(srcBytes2).ToArray().OrderBy(x => x);
var union = srcBytes1.Union(srcBytes2).ToArray().OrderBy(x => x);
Console.WriteLine(sw.ElapsedMilliseconds);
Console.WriteLine("Union: " + String.Join(",", union.Select(x => x.ToString("X2"))));
Console.WriteLine();
Console.WriteLine("Intersect: " + String.Join(",", intersect.Select(x => x.ToString("X2"))));
Console.ReadLine();
}
}
}
The byte-occurrence counting method is roughly five times faster than the LINQ method on my computer, even though the LINQ timing excludes loading the files, and this holds across a range of file sizes (a few KB to a few MB).
I am currently working in an environment where performance is critical and this is what I am doing :
var iso_8859_5 = System.Text.Encoding.GetEncoding("iso-8859-5");
var dataToSend = iso_8859_5.GetBytes(message);
Then I need to group the bytes in threes, so I have a for loop that does this (i being the loop iterator):
byte[] dataByteArray = { dataToSend[i], dataToSend[i + 1], dataToSend[i + 2], 0 };
I then get an integer out of these 4 bytes
BitConverter.ToUInt32(dataByteArray, 0)
and finally the integer is converted to a hexadecimal string that I can place in a network packet.
The last two lines repeat about 150 times.
I am currently hitting 50 milliseconds of execution time and ideally I would want to get as close to 0 as possible. Is there a faster way to do this that I am not aware of?
UPDATE
Just tried
string hex = BitConverter.ToString(dataByteArray);
hex.Replace("-", "")
to get the hex string directly but it is 3 times slower
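The grouping and conversion steps described above can also be written without allocating a 4-byte array per group. This is only a sketch under assumptions: `HexPacker` and `Pack` are hypothetical names, the `"X8"` format (eight uppercase hex digits) is a guess at the packet format, and the bytes are combined in little-endian order, which is what `BitConverter.ToUInt32` produces on little-endian machines. Whether it is actually faster needs to be measured.

```csharp
using System;
using System.Text;

public static class HexPacker
{
    // Packs each group of three bytes into a 32-bit value (fourth byte zero,
    // little-endian order) and appends it as eight uppercase hex digits.
    // Trailing bytes that do not fill a whole group are ignored in this sketch.
    public static string Pack(byte[] data)
    {
        var sb = new StringBuilder((data.Length / 3) * 8);
        for (int i = 0; i + 2 < data.Length; i += 3)
        {
            uint value = (uint)(data[i] | (data[i + 1] << 8) | (data[i + 2] << 16));
            sb.Append(value.ToString("X8"));
        }
        return sb.ToString();
    }
}
```

For the bytes 0x30, 0x31, 0x32 this yields the value 0x00323130, the same number `BitConverter.ToUInt32` returns for `{ 0x30, 0x31, 0x32, 0 }` on a little-endian machine.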
Ricardo Silva's answer adapted
public byte[][] GetArrays(byte[] fullMessage, int size)
{
var returnArrays = new byte[(fullMessage.Length / size)+1][];
int i, j;
for (i = 0, j = 0; i < (fullMessage.Length - 2); i += size, j++)
{
returnArrays[j] = new byte[size + 1];
Buffer.BlockCopy(
src: fullMessage,
srcOffset: i,
dst: returnArrays[j],
dstOffset: 0,
count: size);
returnArrays[j][returnArrays[j].Length - 1] = 0x00;
}
switch (fullMessage.Length % size) // use size here: "% i" divides by zero when the message is shorter than one group
{
case 0: {
returnArrays[j] = new byte[] { 0, 0, EOT, 0 };
} break;
case 1: {
returnArrays[j] = new byte[] { fullMessage[i], 0, EOT, 0 };
} break;
case 2: {
returnArrays[j] = new byte[] { fullMessage[i], fullMessage[i + 1], EOT, 0 };
} break;
}
return returnArrays;
}
After the line below you will have the complete byte array.
var dataToSend = iso_8859_5.GetBytes(message);
My suggestion is to work with Buffer.BlockCopy and test whether it is faster than your current method.
Try the code below and tell us if it is faster than your current code:
public byte[][] GetArrays(byte[] fullMessage, int size)
{
var returnArrays = new byte[fullMessage.Length/size][];
for(int i = 0, j = 0; i < fullMessage.Length; i += size, j++)
{
returnArrays[j] = new byte[size + 1];
Buffer.BlockCopy(
src: fullMessage,
srcOffset: i,
dst: returnArrays[j],
dstOffset: 0,
count: size);
returnArrays[j][returnArrays[j].Length - 1] = 0x00;
}
return returnArrays;
}
EDIT 1: I ran the test below and the output was 245900ns (or 0.2459 ms).
[TestClass()]
public class Form1Tests
{
[TestMethod()]
public void GetArraysTest()
{
var expected = new byte[] { 0x30, 0x31, 0x32, 0x00 };
var size = 3;
var stopWatch = new Stopwatch();
stopWatch.Start();
var iso_8859_5 = System.Text.Encoding.GetEncoding("iso-8859-5");
var target = iso_8859_5.GetBytes("012");
var arrays = Form1.GetArrays(target, size);
BitConverter.ToUInt32(arrays[0], 0);
stopWatch.Stop();
foreach(var array in arrays)
{
for(int i = 0; i < expected.Count(); i++)
{
Assert.AreEqual(expected[i], array[i]);
}
}
Console.WriteLine(string.Format("{0}ns", stopWatch.Elapsed.TotalMilliseconds * 1000000));
}
}
EDIT 2
I looked at your code and I have only one suggestion. I understand that you need to add an EOF marker and that the length of the input array will not always be a multiple of the chunk size.
BUT the code now has TWO responsibilities, which breaks the S of the SOLID principles.
The S stands for Single Responsibility: each method has ONE, and only ONE, responsibility.
The code you posted has TWO responsibilities (breaking the input array into N smaller arrays and adding the EOF marker). Try to think of a way to create two totally independent methods (one to break an array into N other arrays, and another to put the EOF marker in any array that you pass). This will allow you to create unit tests for each method (guaranteeing that they work and are not broken by later changes), and to call the two methods from the class that integrates them.
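A rough sketch of what that split could look like, under stated assumptions: `MessageChunker`, `Split` and `AppendEot` are hypothetical names, and the value `0x04` for `EOT` (the ASCII end-of-transmission byte) is a guess, since the posted code does not show how `EOT` is defined.

```csharp
using System;

public static class MessageChunker
{
    // Assumption: ASCII end-of-transmission; the original value is not shown.
    public const byte EOT = 0x04;

    // Responsibility 1: split the message into groups of `size` bytes, each
    // stored in a (size + 1)-byte array whose last byte is zero. The final
    // array holds the leftover bytes (none if the length divides evenly).
    public static byte[][] Split(byte[] message, int size)
    {
        int count = message.Length / size + 1;
        var chunks = new byte[count][];
        for (int j = 0; j < count; j++)
        {
            chunks[j] = new byte[size + 1];
            int offset = j * size;
            int length = Math.Min(size, message.Length - offset);
            if (length > 0)
                Buffer.BlockCopy(message, offset, chunks[j], 0, length);
        }
        return chunks;
    }

    // Responsibility 2: write the end marker into the second-to-last byte.
    public static void AppendEot(byte[] chunk)
    {
        chunk[chunk.Length - 2] = EOT;
    }
}
```

The caller would then do `var chunks = MessageChunker.Split(data, 3);` followed by `MessageChunker.AppendEot(chunks[chunks.Length - 1]);`, and each method can be unit tested on its own.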
I have this piece of code from an open-source C# program.
I'm trying to figure out the purpose behind this snippet.
internal static bool ReadAsDirectoryEntry(BinaryReader br)
{
bool dir;
br.BaseStream.Seek(8, SeekOrigin.Current);
dir = br.ReadInt32() < 0;
br.BaseStream.Seek(-12, SeekOrigin.Current);
return dir;
}
The code on line 6 is unclear to me. Can anyone explain what it does?
How can a bool be assigned the returned Int32 and be smaller than zero?
Thanks!
You read an int and check whether it is smaller than 0. The expression br.ReadInt32() < 0 evaluates to a bool, and that bool result is what you assign to your variable.
internal static bool ReadAsDirectoryEntry(BinaryReader br)
{
bool dir;
// Skip 8 bytes starting from current position
br.BaseStream.Seek(8, SeekOrigin.Current);
// Read the next 4 bytes (32 bits) as an Int32
// Store whether or not this integer value is less than zero
// Possibly it is a flag that holds some information, such as whether this is a directory entry
dir = br.ReadInt32() < 0;
// Up to here, we have read 12 bytes from the stream: 8 skipped + 4 for the Int32
// So reset the stream position to where it was before this method was called
br.BaseStream.Seek(-12, SeekOrigin.Current);
return dir;
}
Basically, that is logically equivalent to (but terser than):
bool dir;
int tmp = br.ReadInt32();
if(tmp < 0)
{
dir = true;
}
else
{
dir = false;
}
It:
does the call to ReadInt32() (which will result in an int)
tests whether the result of that is < 0 (which will result in either true or false)
and assigns that result (true or false) to dir
So basically, it will return true if and only if the call to ReadInt32() gives a negative number.
Line 6 means: read an Int32, compare it to 0, and store the comparison result in a Boolean.
It's equivalent to:
Int32 tmp = br.ReadInt32();
dir = tmp < 0;
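To see the peek-and-restore behaviour in isolation, here is a small sketch that runs the same logic against an in-memory stream. `PeekDemo`, `IsNegativeAtOffset8` and `Run` are hypothetical names, and the 12-byte record layout (8 filler bytes followed by the Int32) is made up for the demonstration.

```csharp
using System;
using System.IO;
using System.Text;

public static class PeekDemo
{
    // Same steps as ReadAsDirectoryEntry: skip 8 bytes, test the sign of the
    // next Int32, then rewind the 12 bytes consumed so far.
    public static bool IsNegativeAtOffset8(BinaryReader br)
    {
        br.BaseStream.Seek(8, SeekOrigin.Current);
        bool negative = br.ReadInt32() < 0;
        br.BaseStream.Seek(-12, SeekOrigin.Current);
        return negative;
    }

    // Builds a 12-byte record holding `flag` at offset 8 and peeks at it.
    public static (bool negative, long positionAfter) Run(int flag)
    {
        var ms = new MemoryStream();
        using (var bw = new BinaryWriter(ms, Encoding.UTF8, leaveOpen: true))
        {
            bw.Write(0L);   // 8 filler bytes
            bw.Write(flag); // the Int32 being tested
        }
        ms.Position = 0;
        using (var br = new BinaryReader(ms))
        {
            bool negative = IsNegativeAtOffset8(br);
            return (negative, ms.Position);
        }
    }
}
```

`Run(-1)` reports a negative value and `Run(5)` does not, and in both cases the stream position afterwards is back at 0. A negative Int32 is one whose highest bit is set, which fits the guess that the field is being used as a flag.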