ensure two char arrays are not the same - c#

I am randomly generating a grid of characters and storing it in a char[,] array.
I need a way to ensure that I haven't already generated a grid before serializing it to a database in binary format. What is the best way to compare two grids based on their bytes? The last thing I want to do is loop through their contents, as I am already pulling one of them from the db in byte form.
I was thinking of a checksum, but I'm not sure whether that would work.
char[,] grid = new char[8,8];
char[,] secondgrid = new char[8,8];//gets its data from db

From what I can see, you are going to have to loop over the contents (or at least a portion of them); there is no other way of talking about an array's contents.
As a fast "definitely not the same" check, you could compute a hash over the array - i.e. something like:
int hash = 7;
foreach (char c in data) {
    hash = (hash * 17) + c.GetHashCode();
}
This has the risk of some false positives (reporting a dup when it is actually unique), but is otherwise quite cheap. Any use? You could store the hash alongside the data in the database to allow fast checks - but if you do that, you should pick your own hash algorithm for char (since char.GetHashCode() isn't guaranteed to stay the same) - perhaps just convert to an int, for example - or, to re-use the existing implementation:
int hash = 7;
foreach (char c in data) {
    hash = (hash * 17) + (c | (c << 0x10));
}
As an aside - for 8x8, you could always just think in terms of a 64-character string and check ==. That would work equally well at the database and in the application.
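For illustration, a minimal sketch of that string approach (the helper name and the flattening order are my own assumptions, not from the original answer):
static string GridToString(char[,] grid)
{
    // foreach over a 2D array walks it in row-major order, giving a 64-character string for 8x8.
    var sb = new System.Text.StringBuilder(grid.Length);
    foreach (char c in grid)
        sb.Append(c);
    return sb.ToString();
}

// Two grids are identical exactly when their flattened strings are equal:
bool same = GridToString(grid) == GridToString(secondgrid);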

Can't you get the database to do it? Make the grid column UNIQUE. Then, if you need to detect that you've generated a duplicate grid, either check the number of rows affected by the insert or test for the error the database raises.
Also, if each byte is simply picked at random from [0, 255], then hashing the grid down to a 4-byte number is no better than taking the first four bytes out of the grid: the chance of collisions is the same.

I'd go with a checksum/hash mechanism to catch a large percentage of the matches, then do a full comparison if you get a match.
What is the range of characters used to fill your grid? If you're using just letters (not mixed case, or case not important) and an 8x8 grid, you're only talking about 7 or so possible collisions per item within your problem space (a very rare occurrence), assuming a good hashing function. You could do something like the following (a sketch of the loop follows the list):
1. Generate a grid.
2. Load any matching grids from the DB.
3. If step 2 found a match, go back to step 1.
4. Use your new grid.
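A minimal sketch of that loop, assuming hypothetical GenerateRandomGrid, ComputeGridHash, LoadGridsByHash and GridsEqual helpers (and System.Linq for Any), with the hash computed as in the first answer:
char[,] grid;
bool isDuplicate;
do
{
    grid = GenerateRandomGrid();            // hypothetical generator
    int hash = ComputeGridHash(grid);       // e.g. the (hash * 17) + c loop shown above
    // Pull only the candidates with the same stored hash, then confirm with a full compare.
    isDuplicate = LoadGridsByHash(hash).Any(existing => GridsEqual(existing, grid));
} while (isDuplicate);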

Try this (invoke ComputeHash for every matrix and compare the resulting Guids):
private static MD5 md5 = MD5.Create();

public static Guid ComputeHash(object value)
{
    Guid g = Guid.Empty;
    BinaryFormatter bf = new BinaryFormatter();
    using (MemoryStream stm = new MemoryStream())
    {
        // Serialize the array and hash the resulting bytes; a 128-bit MD5 fits exactly into a Guid.
        bf.Serialize(stm, value);
        g = new Guid(md5.ComputeHash(stm.ToArray()));
    }
    return g;
}
Note: generating the byte array could be done a lot more simply, since you already have a char array - for example:
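A minimal sketch of that simplification (the flattening order is my own assumption): hash the UTF-16 bytes of the characters directly instead of serializing with BinaryFormatter.
public static Guid ComputeGridChecksum(char[,] grid)
{
    // Two bytes per char (UTF-16 code unit), row-major order.
    byte[] bytes = new byte[grid.Length * sizeof(char)];
    int i = 0;
    foreach (char c in grid)
    {
        bytes[i++] = (byte)c;          // low byte
        bytes[i++] = (byte)(c >> 8);   // high byte
    }
    using (var md5 = System.Security.Cryptography.MD5.Create())
        return new Guid(md5.ComputeHash(bytes));
}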


Most efficient way to store and retrieve a 512-bit number?

I have a string of 512 characters that contains only 0 and 1. I'm trying to represent it in a data structure that saves space. Is BitArray the most efficient way?
I'm also thinking about using 16 Int32 values to store the number, which would be 16 * 4 = 64 bytes.
Most efficient can mean many different things...
Most efficient from a memory management perspective?
Most efficient from a CPU calculation perspective?
Most efficient from a usage perspective? (In respect to writing code that uses the numbers for calculations)
For 1, use byte[64] or long[8] - if you aren't doing calculations, or don't mind writing your own.
For 3, BigInteger is definitely the way to go: your math functions are already defined and you just need to turn your binary string into a number.
EDIT: Sounds like you don't want BigInteger due to size concerns... however I think you will find that you will, of course, have to parse this a bit at a time (an enumerable/yield combination) rather than hold the entire data structure in memory at once.
That being said, I can help you somewhat with parsing your string into an array of Int64s... Thanks to King King for part of this LINQ statement:
// Convert the string into an array of 64-bit values.
// Note that the MSB ends up in result[0].
var result = input.Select((x, i) => i)
                  .Where(i => i % 64 == 0)
                  .Select(i => input.Substring(i, input.Length - i >= 64 ? 64 : input.Length - i))
                  .Select(x => Convert.ToUInt64(x, 2))
                  .ToArray();
If you decide you want a different array structure - byte[64] or whatever - it should be easy to modify.
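For example, a minimal sketch of the byte[64] variant (the packing order - first character becomes the most significant bit of the first byte - is my own assumption):
// Pack a 512-character "0"/"1" string into 64 bytes, MSB-first within each byte.
byte[] packed = new byte[input.Length / 8];
for (int i = 0; i < input.Length; i++)
{
    if (input[i] == '1')
        packed[i / 8] |= (byte)(0x80 >> (i % 8));
}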
EDIT 2: OK, I got bored, so I wrote an EditDifference function for fun... here you go:
static public int GetEditDistance(ulong[] first, ulong[] second)
{
    int editDifference = 0;
    var smallestArraySize = Math.Min(first.Length, second.Length);
    for (var i = 0; i < smallestArraySize; i++)
    {
        long signedDifference;
        var f = first[i];
        var s = second[i];
        var biggest = Math.Max(f, s);
        var smallest = Math.Min(f, s);
        var difference = biggest - smallest;
        if (difference > long.MaxValue)
        {
            editDifference += 1;
            signedDifference = Convert.ToInt64(difference - long.MaxValue - 1);
        }
        else
            signedDifference = Convert.ToInt64(difference);
        editDifference += Convert.ToString(signedDifference, 2)
                                 .Count(x => x == '1');
    }
    // If the arrays are different sizes, every extra bit is considered to be different.
    var differenceOfArraySize =
        Math.Max(first.Length, second.Length) - smallestArraySize;
    if (differenceOfArraySize > 0)
        editDifference += differenceOfArraySize * 64;
    return editDifference;
}
Use BigInteger from .NET. It can easily support 512-bit numbers as well as operations on those numbers.
BigInteger.Parse("your huge number");
BitArray (with 512 bits), byte[64], int[16], long[8] (or List<> variants of those), or BigInteger will all be much more efficient than your String. I'd say that byte[] is the most idiomatic/typical way of representing data such as this, in general. For example, ComputeHash uses byte[] and Streams deal with byte[]s, and if you store this data as a BLOB in a DB, byte[] will be the most natural way to work with that data. For that reason, it'd probably make sense to use this.
On the other hand, if this data represents a number that you might do numeric things to like addition and subtraction, you probably want to use a BigInteger.
These approaches have roughly the same performance as each other, so you should choose between them based primarily on things like what makes sense, and secondarily on performance benchmarked in your usage.
The most efficient would be having eight UInt64/ulong or Int64/long typed variables (or a single array), although this might not be optimal for querying/setting. One way to get around this is, indeed, to use a BitArray (which is basically a wrapper around the former method, including additional overhead [1]). It's a matter of choice either for easy use or efficient storage.
If this isn't sufficient, you can always choose to apply compression, such as RLE-encoding or various other widely available encoding methods (gzip/bzip/etc...). This will require additional processing power though.
It depends on your definition of efficient.
[1] Additional overhead, as in storage overhead. BitArray internally uses an Int32 array to store the values. In addition, BitArray stores its current mutation version, the number of ints 'allocated', and a SyncRoot. Even though the overhead is negligible for a small number of values, it can be an issue if you keep a lot of these in memory.

Best way to store / retrieve bits C# [duplicate]

This question already has answers here: Best way to store long binary (up to 512 bit) in C# (5 answers). Closed 9 years ago.
I am modifying an existing C# solution, wherein data is validated and its status is stored as below:
a) A given record is validated against a certain number of conditions (say 5). Passed/failed status is represented by a bit value (0 - passed; 1 - failed).
b) So, if a record failed all 5 validations, the value will be 11111. This is converted to a decimal and stored in the DB.
Later, this decimal value is converted back to binary (using the bitwise & operator), which is used to show the passed/failed records.
The issue is that the long datatype is used in C# to handle the decimal value, and the 'decimal' datatype in SQL Server 2008 to store it. A long converted to binary can hold only up to 64 bits, so the validation count is currently restricted to 64.
My requirement is to remove this limit and allow any number of validations.
How do I store a large number of bits and also retrieve them? Also, please keep in mind that this is an existing (.NET 2.0) solution; I can't afford to upgrade or use any 3rd party libraries, and changes must be minimal.
Latest update
Yes, this solution seems to be OK from an application perspective, i.e. if only I (a.k.a. the present solution) were using C# alone. However, the designers of the existing solution made things complicated by storing the binary value (11111 means all 5 validations failed, 10111 - all but the 4th failed, and so on...) converted into decimal in the SQL Server DB. A stored procedure takes this value to arrive at the number of records failed for each validation.
OPEN sValidateCUR
FETCH NEXT FROM sValidateCUR INTO @ValidationOID, @ValidationBit, @ValidationType
WHILE @@FETCH_STATUS = 0
BEGIN
    -- Fetch the error record count.
    -- @ValidationBit is the number of a type of validation (say up to 60);
    -- the first time through the loop, @ValidationBit will be 0.
    SET @nBitVal = ABS(RPT.fGetPowerValue(@ValidationBit))
    -- For @ValidationBit = 3, @nBitVal = 2^3 = 8
    SELECT @ErrorRecordCount = COUNT(1)
    FROM <<Error_Table_Where_Flags_are available in decimal values>> WITH (NOLOCK)
    WHERE ExpressionValidationFlags & CAST(CAST(@nBitVal AS VARCHAR(20)) AS BIGINT) = CAST(@nBitVal AS VARCHAR(20))
Now, in the application, using BitArray, I managed to store the passed/failed records in a BitArray, converted it to byte[] and stored it in SQL Server as VARBINARY(100)... (the same column, ExpressionValidationFlags, which was earlier BIGINT, is now VARBINARY and holds the byte array). However, to complete my changes, I need to modify the SP above.
Again, looking forward to your help!
Thanks
Why not use the specially designed BitArray class?
http://msdn.microsoft.com/query/dev11.query?appId=Dev11IDEF1&l=EN-US&k=k(System.Collections.BitArray);k(TargetFrameworkMoniker-.NETFramework,Version%3Dv4.5);k(DevLang-csharp)&rd=true
e.g.
BitArray array = new BitArray(150); // <- up to 150 bits
...
array[140] = true; // <- set 140th bit
array[130] = false; // <- reset 130th bit
...
if (array[120]) { // <- if 120th bit is set
...
There are several ways to go about this, depending on the limitations of the database you are using.
If you are able to store byte arrays in the database, you can use the BitArray class. You can pass its constructor a byte array, use it to easily check and set each bit by index, and then use its built-in CopyTo method to copy it back out into a byte array.
Example:
byte[] statusBytes = yourDatabase.Get("passed_failed_bits");
BitArray statusBits = new BitArray(statusBytes);
...
statusBits[65] = false;
statusBits[66] = true;
...
statusBits.CopyTo(statusBytes, 0);
yourDatabase.Set("passed_failed_bits", statusBytes);
If the database is unable to deal with raw byte arrays, you can always encode the byte array as a hex string:
string hex = BitConverter.ToString(statusBytes);
hex = hex.Replace("-", ""); // Replace returns a new string; assigning it back is required
and then get it back into a byte array again:
int numberChars = hex.Length;
byte[] statusBytes = new byte[numberChars / 2];
for (int i = 0; i < numberChars; i += 2) {
    statusBytes[i / 2] = Convert.ToByte(hex.Substring(i, 2), 16);
}
And if you can't even store strings, there are more creative ways to turn the byte array into multiple longs or doubles.
Also, if space efficiency is an issue, there are other, more efficient (but more complicated) ways to encode bytes as ASCII text, by using more of the character range while avoiding control characters. You may also want to look into run-length encoding the byte array if you find the data stays at the same value for long stretches.
Hope this helps!
Why not use a string instead? You could put a very large number of characters into the database (use VARCHAR rather than NVARCHAR, since you control the input).
Using your example, if you had "11111", you could skip bitwise operations and just do things like this:
string myBits = "11111";
bool failedPosition0 = myBits[0] == '1';
bool failedPosition1 = myBits[1] == '1';
bool failedPosition2 = myBits[2] == '1';
bool failedPosition3 = myBits[3] == '1';
bool failedPosition4 = myBits[4] == '1';
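As a small complement (my own sketch, not part of the original answer), the string can be built directly from the validation results without LINQ, which keeps it .NET 2.0-friendly:
// results[i] == true means validation i failed.
bool[] results = { true, false, true, true, true };
char[] bits = new char[results.Length];
for (int i = 0; i < results.Length; i++)
    bits[i] = results[i] ? '1' : '0';
string myBits = new string(bits);   // "10111"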

Fast byte array masking in C#

I have a struct with some properties (like int A1, int A2, ...). I store a list of these structs as binary in a file.
Now I'm reading the bytes from the file into a Buffer using a BinaryReader, and I want to apply a filter based on the struct's properties (like .A1 = 100 & .A2 = 12).
Performance is very important in my scenario, so I convert the filter criteria to a byte array (Filter) and then I want to mask Buffer with Filter. If the result of the masking equals Filter, the Buffer is converted to the struct.
The question: what is the fastest way to mask and compare two byte arrays?
Update: the Buffer size is more than 256 bytes. I'm wondering if there is a better way than iterating over each byte of Buffer and Filter.
The way I would usually approach this is with unsafe code. You can use the fixed keyword to treat a byte[] as a long*, which you can then iterate in 1/8th of the iterations - but using the same bit operations. You will typically have a few bytes left over (from the length not being an exact multiple of 8 bytes) - just clean those up manually afterwards.
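A minimal sketch of that approach (assuming both arrays are at least as long as the filter; the leftover-byte loop at the end is the manual clean-up mentioned above):
// Returns true if (buffer & filter) == filter, comparing 8 bytes at a time.
static unsafe bool MatchesFilter(byte[] buffer, byte[] filter)
{
    int len = filter.Length;
    fixed (byte* pBuf = buffer, pFil = filter)
    {
        long* b = (long*)pBuf;
        long* f = (long*)pFil;
        int longCount = len / 8;
        for (int i = 0; i < longCount; i++)
            if ((b[i] & f[i]) != f[i])
                return false;
        // Clean up the remaining 0-7 bytes one at a time.
        for (int i = longCount * 8; i < len; i++)
            if ((pBuf[i] & pFil[i]) != pFil[i])
                return false;
    }
    return true;
}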
Try a simple loop with System.BitConverter.ToInt64(). Something like this:
byte[] arr1;
byte[] arr2;
// Assumes both arrays have the same length, which is a multiple of 8.
for (int i = 0; i < arr1.Length; i += 8)
{
    var P1 = System.BitConverter.ToInt64(arr1, i);
    var P2 = System.BitConverter.ToInt64(arr2, i);
    if ((P1 & P2) != P1) // or whatever comparison you need
        break;           // break the loop if you need to
}
My assumption is that comparing/masking two Int64s will be much faster (especially on 64-bit machines) than masking one byte at a time.
Once you've got the two arrays - one from reading the file and one from the filter - all you then need is a fast comparison of the arrays. Check out the following posts, which use unsafe or P/Invoke methods:
What is the fastest way to compare two byte arrays?
Comparing two byte arrays in .NET

Another class instead of SHA1Managed for making checksums with fewer than 128 bytes

I have a table with one column (AbsoluteUrl NVARCHAR(2048)), and I want to query on this column, so comparing each record with my own string takes a long time. The table has at least 1,000,000 records.
Now I think there is a better solution: make a checksum for each AbsoluteUrl and compare checksums instead of the AbsoluteUrl column. So I'm using the method below to generate the checksum, but I want another class for making checksums fewer than 128 bytes in length.
public static byte[] GenerateChecksumAsByte(string content)
{
    var buffer = Encoding.UTF8.GetBytes(content);
    return new SHA1Managed().ComputeHash(buffer);
}
And is this approach good for my work?
UPDATE
According to the answers, I want to explain in more depth. I'm actually working on a very simple web search engine. To explain briefly: when all of the URLs of a web page have been extracted (the collection of found URLs), I index them into the Urls table:
UrlId uniqueidentifier NotNull Primary Key (Clustered Index)
AbsoluteUrl nvarchar(2048) NotNull
Checksum varbinary(128) NotNull
So I first search the table to see whether I have the same URL indexed before; if not, I create a new record.
public Url Get(byte[] checksum)
{
    return _dataContext.Urls.SingleOrDefault(url => url.Checksum == checksum);
    // Or query by the AbsoluteUrl field
}
And the Save method:
public void Save(Url url)
{
    if (url == null)
        throw new ArgumentNullException("url");
    var origin = _dataContext.Urls.GetOriginalEntityState(url);
    if (origin == null)
    {
        _dataContext.Urls.Attach(url);
        _dataContext.Refresh(RefreshMode.KeepCurrentValues, url);
    }
    else
        _dataContext.Urls.InsertOnSubmit(url);
    _dataContext.SubmitChanges();
}
For example, if on one page I find 2000 URLs, I must search 2000 times.
You want to use a hash with p possible values as a key, expecting at most 1M records (u). To answer this question you first have to do the math...
Solve the following for each hash size under consideration: 1 - e^(-u^2 / (2 * p))
32-bit: 100% chance of collision
64-bit: 0.00000271% chance of collision
128-bit: 0% (too small to calculate with double precision)
Now you should have enough information to make an informed decision. Here is the code to produce the above calculation on the 64-bit key:
double keySize = 64;
double possibleKeys = Math.Pow(2, keySize);
double universeSize = 1000000;
double v1, v2;
v1 = -Math.Pow(universeSize, 2);
v2 = 2.0 * possibleKeys;
v1 = v1 / v2;
v1 = Math.Pow(2.718281828, v1);
v1 = 1.0 - v1;
Console.WriteLine("The resulting percentage is {0:n40}%", v1 * 100.0);
Personally, I'd stick with at least a 128-bit hash myself. Moreover, if collisions can cause any form of security hole, you need to use at least a v2 SHA hash (SHA256/SHA512).
Now, if this is just an optimization for the database, consider the following (a query sketch follows the list):
add a 32-bit hash code to the table;
create a composite key containing both the 32-bit hash AND the original string;
ALWAYS seek on both the hash and the original string;
assume the hash is only an optimization and never unique.
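A minimal sketch of what that seek could look like with the LINQ context from the question (the Hash32 column and the ComputeHash32 helper are my own assumptions):
int hash = ComputeHash32(absoluteUrl);   // hypothetical 32-bit hash of the URL
// Seek on the indexed hash, but always confirm on the full string.
var existing = _dataContext.Urls
    .SingleOrDefault(u => u.Hash32 == hash && u.AbsoluteUrl == absoluteUrl);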
I agree with Steven that you should first try an index on the field to see if "comparing each record" really is the bottleneck.
However, depending on your database, indexing an NVARCHAR(2048) may not be possible, and the comparison really could be the bottleneck. In that case generating checksums actually could improve your search performance if:
You do many more comparisons than inserts.
Comparing the checksum is faster than comparing NVARCHARs.
Most of your checksums are different.
You have not shown us any queries or sample data, so I have no way of knowing if these are true. If they are true, you can indeed improve performance by generating a checksum for each AbsoluteUrl and assuming values are different where these checksums are different. If the checksums are the same, you will have to do a string comparison to see if values match, but if checksums are different you can be sure the strings are different.
In this case a cryptographic checksum is not necessary; you can use a smaller, faster checksum algorithm like CRC64.
As Steven points out, if your checksums are the same you cannot assume your values are the same. However, if most of your values are different and you have a good checksum, most of your checksums will be different and will not require string comparisons.
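.NET has no built-in CRC64, so as one possible stand-in (my own choice, not from the answer), here is a minimal sketch of a 64-bit FNV-1a checksum - fast, non-cryptographic, and small enough to fit in a BIGINT column:
// 64-bit FNV-1a over the UTF-8 bytes of the URL.
public static long ComputeChecksum64(string absoluteUrl)
{
    const ulong fnvOffset = 14695981039346656037UL;
    const ulong fnvPrime = 1099511628211UL;
    ulong hash = fnvOffset;
    foreach (byte b in System.Text.Encoding.UTF8.GetBytes(absoluteUrl))
    {
        hash ^= b;
        hash = unchecked(hash * fnvPrime);
    }
    return unchecked((long)hash);
}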
No, this is not a good approach.
A million records is no big deal for an indexed field. On the other hand, any checksum/hash/whatever you generate is capable of false positives due to the pigeonhole principle (a.k.a. the birthday paradox). Making it bigger reduces, but does not eliminate, this chance, while slowing things down to the point where there is no speed increase.
Just slap an index on the field and see what happens.

Convert ten character classification string into four character one in C#

What's the best way to convert (to hash) a string like 3800290030, which represents an id for a classification, into a four-character one like 3450? (I need to support at most 9999 classes.) We will only have fewer than 1000 classes in the 10-character space, and it will never grow to more than 10k.
The hash needs to be unique and always the same for the same input.
The resulting string should be numeric (but it will be saved as char(4) in SQL Server).
I removed the requirement for reversibility.
This is my solution, please comment:
string classTIC = "3254002092";
MD5 md5Hasher = MD5.Create();
byte[] classHash = md5Hasher.ComputeHash(Encoding.Default.GetBytes(classTIC));
StringBuilder sBuilder = new StringBuilder();
foreach (byte b in classHash)
{
    sBuilder.Append(b.ToString());
}
string newClass = (double.Parse(sBuilder.ToString()) % 9999 + 1).ToString();
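One note on this (my own sketch, not part of the question): the digit string built above is far too long for double.Parse to handle without losing precision, so a safer way to fold the MD5 result into the 1-9999 range is to read a fixed slice of the hash bytes as an integer:
// Use the first four bytes of the MD5 hash as an unsigned integer, then fold into 1..9999.
uint head = BitConverter.ToUInt32(classHash, 0);
string newClass = (head % 9999 + 1).ToString();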
You can do something like
str.GetHashCode() % 9999 + 1;
The hash can't be unique, since there are more than 9,999 possible strings.
Since it is not unique, it cannot be reversible.
And of course my answer is wrong if you don't have more than 9999 different 10-character classes.
If you don't have more than 9999 classes, you need a mapping from each string id to its 4-character representation - for example, save the strings in a list, and each string's key will be its index in the list.
When you want to reverse the process, and have no knowledge about the ids apart from there being at most 9999 of them, I think you need a translation dictionary to map each id to its short version.
Even without the need to reverse the process, I don't think there is a way to guarantee unique ids without such a dictionary.
The short version could then simply be incremented by one with each new id.
You do not want a hash. Hashing by design allows for collisions; there is no possible hashing function for the kind of strings you work with that won't have collisions.
You need to build a persistent mapping table to convert the string to a number, logically similar to a Dictionary<string, int>. The first string you add gets number 0. When you need to map, look up the string and return its associated number; if it is not present, add the string and simply assign it a number equal to the count.
Making this mapping table persistent is what you'll need to think about. Trivially done with a database, of course. A sketch of the in-memory half follows.
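A minimal sketch of the in-memory side of that mapping (persistence to the database is left out; the class name is my own):
public class ClassificationMap
{
    private readonly Dictionary<string, int> _map = new Dictionary<string, int>();

    // Returns the existing number for the string, or assigns the next one.
    public int GetOrAdd(string classification)
    {
        int number;
        if (!_map.TryGetValue(classification, out number))
        {
            number = _map.Count;          // first string gets 0, next gets 1, ...
            _map.Add(classification, number);
        }
        return number;
    }

    // Four-digit, zero-padded form for the char(4) column (supports up to 9999 classes).
    public string GetShortCode(string classification)
    {
        return GetOrAdd(classification).ToString("D4");
    }
}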
Honestly, I don't see a clean way to do this.
Unique is difficult: with 4 characters you have, per your requirement, at most 9999 values, so collisions will occur.
A hash is not reversible; data is lost (obviously).
I think you might need to create and store a lookup table to support your requirements. In that case you don't even need a hash - you could just increment the last used 4-digit lookup code.
Use MD5 or SHA, e.g. (pseudocode):
shortId = substring(md5("05910395410"), 0, 4)
Or write your own simple method, for example:
int sum = 0;
foreach (char c in input)   // input is the 10-character id
{
    sum += (int)c;
}
sum %= 9999;
Convert the number to base 35/base 36.
ex: 3800290030 decimal = 22CGHK5 base-35 // length: 7
Or maybe convert to base 60 [ignoring capital O and small o so they aren't confused with 0].
ex: 3800290030 decimal = 4tDw7A base-60 // length: 6
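A minimal sketch of the base-36 variant (my own helper, just to illustrate the idea; note the result is still longer than 4 characters for a 10-digit id):
// Encode a non-negative number in base 36 using digits 0-9 and A-Z.
static string ToBase36(ulong value)
{
    const string digits = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ";
    if (value == 0) return "0";
    var sb = new System.Text.StringBuilder();
    while (value > 0)
    {
        sb.Insert(0, digits[(int)(value % 36)]);
        value /= 36;
    }
    return sb.ToString();
}

// ToBase36(3800290030) == "1QULEJY" (7 characters)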
Convert your int to binary and then base64 encode it. It won't be numbers then, but it will be a reversible hash.
Edit:
As far as I can tell, you are asking for the impossible.
You cannot take totally random data and somehow reduce the amount of data it takes to encode it (some values might come out shorter, others longer), so your requirement that the result be unique cannot be met; there has to be some data loss somewhere, and no matter how you do it, it won't ensure uniqueness.
Second, because of the above, it is also not possible to make it reversible, so that is out of the question.
Therefore, the only possible way I can see is if you have an enumerable data source, i.e. you know all the values before calculating the short value. In that case you can simply assign them sequential ids.
