Text Hashing trick produces different results in Python and C#


I am trying to move a trained model into a production environment and have run into an issue replicating the behavior of the Keras hashing_trick() function in C#. When I encode the sentence, the output in C# differs from the output in Python:
Text: "Information - The configuration processing is completed."
Python: [ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 217 142 262 113 319 413]
C#: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 433, 426, 425, 461, 336, 146, 52]
(copied from debugger, both sequences have length 30)
What I've tried:
changing the encoding of the text bytes in C# to match the python string.encode() function default (UTF8)
Changing capitalization of letters to lowercase and upper case
Tried using Convert.ToUInt32 instead of BitConverter (resulted in overflow error)
My code (below) is my implementation of the Keras hashing_trick function. A single input sentence is given and then the function will return the corresponding encoded sequence.
public uint[] HashingTrick(string data)
{
const int VOCAB_SIZE = 534; //Determined through python debugging of model
var filters = "!#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n".ToCharArray().ToList();
filters.ForEach(x =>
{
data = data.Replace(x, '\0');
});
string[] parts = data.Split(' ');
var encoded = new List<uint>();
parts.ToList().ForEach(x =>
{
using (System.Security.Cryptography.MD5 md5 = System.Security.Cryptography.MD5.Create())
{
byte[] inputBytes = System.Text.Encoding.UTF8.GetBytes(x);
byte[] hashBytes = md5.ComputeHash(inputBytes);
uint val = BitConverter.ToUInt32(hashBytes, 0);
encoded.Add(val % (VOCAB_SIZE - 1) + 1);
}
});
return PadSequence(encoded, 30);
}
private uint[] PadSequence(List<uint> seq, int maxLen)
{
if (seq.Count < maxLen)
{
while (seq.Count < maxLen)
{
seq.Insert(0, 0);
}
return seq.ToArray();
}
else if (seq.Count > maxLen)
{
return seq.GetRange(seq.Count - maxLen - 1, maxLen).ToArray();
}
else
{
return seq.ToArray();
}
}
The keras implementation of the hashing trick can be found here
If it helps, I am using an ASP.NET Web API as my solution type.

The biggest problem with your code is that it fails to account for the fact that Python's int is an arbitrary precision integer, while C#'s uint has only 32 bits. This means that Python is calculating the modulo over all 128 bits of the hash, while C# is not (and BitConverter.ToUInt32 is the wrong thing to do in any case, as the endianness is wrong). The other problem that trips you up is that \0 does not terminate strings in C#, and \0 can't just be added to an MD5 hash without changing the outcome.
Translated in as straightforward a manner as possible:
int[] hashingTrick(string text, int n, string filters, bool lower, string split) {
var splitWords = String.Join("", text.Where(c => !filters.Contains(c)))
.Split(new[] { split }, StringSplitOptions.RemoveEmptyEntries);
return (
from word in splitWords
let bytes = Encoding.UTF8.GetBytes(lower ? word.ToLower() : word)
let hash = MD5.Create().ComputeHash(bytes)
// add a 0 byte to force a non-negative result, per the BigInteger docs
let w = new BigInteger(hash.Reverse().Concat(new byte[] { 0 }).ToArray())
select (int) (w % (n - 1) + 1)
).ToArray();
}
Sample use:
const int vocabSize = 534;
Console.WriteLine(String.Join(" ",
hashingTrick(
text: "Information - The configuration processing is completed.",
n: vocabSize,
filters: "!#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n",
lower: true,
split: " "
).Select(i => i.ToString())
));
217 142 262 113 319 413
This code has various inefficiencies: filtering characters with LINQ is very inefficient compared to using a StringBuilder and we don't really need BigInteger here since MD5 is always exactly 128 bits, but optimizing (if necessary) is left as an exercise to the reader, as is padding the outcome (which you already have a function for).
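For cross-checking the C# port, here is a minimal Python sketch of what the md5 variant of the hashing trick computes. It is not the Keras source: the names hashing_trick_md5, md5_int and low32_little_endian are hypothetical, and the KERAS_FILTERS default is an assumption written from memory. It also shows why reading only the first 4 digest bytes as a little-endian uint gives a different number than the full 128-bit value.

```python
import hashlib

# Assumed default filter set (written from memory, not copied from Keras).
KERAS_FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'


def md5_int(word):
    """Interpret the whole 16-byte MD5 digest as one big-endian integer."""
    digest = hashlib.md5(word.encode("utf-8")).digest()
    return int.from_bytes(digest, "big")


def low32_little_endian(word):
    """What BitConverter.ToUInt32(hashBytes, 0) reads on a little-endian machine."""
    digest = hashlib.md5(word.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "little")


def hashing_trick_md5(text, n, filters=KERAS_FILTERS, lower=True, split=" "):
    """Map each word to an index in [1, n - 1] using the full MD5 value."""
    if lower:
        text = text.lower()
    # Replace every filtered character with the split string, then drop empties.
    table = str.maketrans({c: split for c in filters})
    words = [w for w in text.translate(table).split(split) if w]
    return [md5_int(w) % (n - 1) + 1 for w in words]


if __name__ == "__main__":
    sentence = "Information - The configuration processing is completed."
    print(hashing_trick_md5(sentence, 534))
    print(md5_int("the") % 533 + 1, low32_little_endian("the") % 533 + 1)
```

Note that this sketch replaces filtered characters with the split string, whereas the C# translation above removes them; for the sample sentence both approaches yield the same six words.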

Instead of fighting with C# to get the hashing right, I took a different approach to the problem. When building the data set to train the model (this is a machine learning project, after all), I decided to use @Jeron Mostert's implementation of the hashing function to pre-hash the data set before feeding it into the model.
This solution was much easier to implement and ended up working just as well as the original text hashing. Word of advice for those attempting to do cross language hashing like me: Don't do it, it's a lot of headache! Use one language for hashing your text data and find a way to create a valid data set with all of the information required.

Related

Speeding up lookup of pattern within byte array

I have a large binary file that is around 70MB in size. In my program, I have a method that looks up byte[] array patterns against the file, to see if they exist within the file or not. I have around 1-10 millions patterns to run against the file. The options I see are the following:
Read the file into memory by doing byte[] file = File.ReadAllBytes(path) then perform byte[] lookup of byte[] pattern(s) against the file bytes. I have used multiple methods for doing that from different topics on SO such as:
byte[] array pattern search
Find an array (byte[]) inside another array?
Best way to check if a byte[] is contained in another byte[]
Though, byte[] versus byte[] lookups are extremely slow when the source is large. It would take weeks to run 1 million patterns on normal computers.
Convert both the file and patterns into hex strings then do the comparisons using contains() method to perform the lookup. This one is faster than byte[] lookups but converting bytes to hex would result in the file being larger in memory which results in more processing time.
Convert both the file and pattern into strings using Encoding.GetEncoding(1252).GetBytes() and perform the lookups. Then, compensate for the limitation of binary to string conversion (I know they are incompatible) by running the matches of contains() against another method which performs byte[] lookups (suggested first option). This one is the fastest option for me.
Using the third approach, which is the fastest, 1 million patterns would take 2/3 of a day to a day depending on CPU. I need information on how to speed up the lookups.
Thank you.
Edit: Thanks to @MySkullCaveIsADarkPlace I now have a fourth approach, which is faster than the three approaches above. I was using limited byte[] lookup algorithms, and now I am using the MemoryExtensions.IndexOf() byte[] lookup method, which is slightly faster than the three approaches above. Even so, the lookups are still slow: it takes 1 minute for 1000 pattern lookups.
The patterns are 12-20 bytes each.

I assume that you are looking up one pattern after the other. I.e., you are doing 1 to 10 million pattern searches at every position in the file!
Consider doing it the other way round. Loop once through your file bytes and determine if the current position is the start of a pattern.
To do this efficiently, I suggest organizing the patterns in an array of list of patterns. Each pattern is stored in a list at array index 256 * byte[0] + byte[1].
With 10 million patterns you will have an average of 152 patterns in the lists at each array position. This allows a fast lookup.
You could also use the first 3 bytes (256 * (256 * byte[0] + byte[1]) + byte[2]), resulting in an array of length 256^3 ~ 16 million (I have worked with longer arrays; no problem for C#). Then you would have less than one pattern per array position on average. This would result in a nearly linear search time O(n) with respect to the file length. A huge improvement compared to the quadratic O(num_of_patterns * file_length) of a straightforward algorithm.
We can use a simple byte by byte comparison to compare the patterns, since we can compare starting at a known position. (Boyer Moore is of no use here.)
2 bytes index (patterns must be at least 2 bytes long)
byte[] file = { 23, 36, 43, 76, 125, 56, 34, 234, 12, 3, 5, 76, 8, 0, 6, 125, 234, 56, 211, 122, 22, 4, 7, 89, 76, 64, 12, 3, 5, 76, 8, 0, 6, 125 };
byte[][] patterns = {
new byte[]{ 12, 3, 5, 76, 8, 0, 6, 125, 11 },
new byte[]{ 211, 122, 22, 4 },
new byte[]{ 17, 211, 5, 8 },
new byte[]{ 22, 4, 7, 89, 76, 64 },
};
var patternMatrix = new List<byte[]>[256 * 256];
// Add patterns to matrix.
// We assume pattern.Length >= 2.
foreach (byte[] pattern in patterns) {
int index = 256 * pattern[0] + pattern[1];
patternMatrix[index] ??= new List<byte[]>(); // Ensure we have a list.
patternMatrix[index].Add(pattern);
}
// The search. Loop through the file
for (int fileIndex = 0; fileIndex < file.Length - 1; fileIndex++) { // Length - 1 because we need 2 bytes.
int patternIndex = 256 * file[fileIndex] + file[fileIndex + 1];
List<byte[]> candidatePatterns = patternMatrix[patternIndex];
if (candidatePatterns != null) {
foreach (byte[] candidate in candidatePatterns) {
if (fileIndex + candidate.Length <= file.Length) {
bool found = true;
// We know that the 2 first bytes are matching,
// so let's start at the 3rd
for (int i = 2; i < candidate.Length; i++) {
if (candidate[i] != file[fileIndex + i]) {
found = false;
break;
}
}
if (found) {
Console.WriteLine($"pattern {{{candidate[0]}, {candidate[1]} ..}} found at file index {fileIndex}");
}
}
}
}
}
Same algorithm with 3 bytes (even faster!)
3 bytes index (patterns must be at least 3 bytes long)
var patternMatrix = new List<byte[]>[256 * 256 * 256];
// Add patterns to matrix.
// We assume pattern.Length >= 3.
foreach (byte[] pattern in patterns) {
int index = 256 * 256 * pattern[0] + 256 * pattern[1] + pattern[2];
patternMatrix[index] ??= new List<byte[]>(); // Ensure we have a list.
patternMatrix[index].Add(pattern);
}
// The search. Loop through the file
for (int fileIndex = 0; fileIndex < file.Length - 2; fileIndex++) { // Length - 2 because we need 3 bytes.
int patternIndex = 256 * 256 * file[fileIndex] + 256 * file[fileIndex + 1] + file[fileIndex + 2];
List<byte[]> candidatePatterns = patternMatrix[patternIndex];
if (candidatePatterns != null) {
foreach (byte[] candidate in candidatePatterns) {
if (fileIndex + candidate.Length <= file.Length) {
bool found = true;
// We know that the 3 first bytes are matching,
// so let's start at the 4th
for (int i = 3; i < candidate.Length; i++) {
if (candidate[i] != file[fileIndex + i]) {
found = false;
break;
}
}
if (found) {
Console.WriteLine($"pattern {{{candidate[0]}, {candidate[1]} ..}} found at file index {fileIndex}");
}
}
}
}
}
Why is it faster?
A simple nested-loops algorithm makes up to ~ (70 * 10^6) * (10 * 10^6) = 7 * 10^14 (700 trillion) pattern comparisons! 70 * 10^6 is the length of the file; 10 * 10^6 is the number of patterns.
My algorithm with a 2-byte index makes ~ (70 * 10^6) * 152 ~ 10^10 pattern comparisons. The number 152 comes from the fact that there are on average 152 patterns for a given 2-byte index: (10 * 10^6) / (256 * 256) ~ 152. This is 65,536 times faster.
With 3 bytes you get fewer than about 70 * 10^6 pattern comparisons. This is more than 10 million times faster. This is the case because we store all the patterns in an array whose length (16 million) is greater than the number of patterns (10 million or less). Therefore, at any byte position plus the 2 following positions within the file, we pick up only the patterns starting with the same 3 bytes. And this is on average less than one pattern. Sometimes there may be 0 or 1, sometimes 2 or 3, but rarely more patterns at any array position.
Try it. The shift is from O(n^2) to near O(n). The initialization time is O(n). The assumption is that the first 2 or 3 bytes of the patterns are more or less randomly distributed. If this were not the case, my algorithm would degrade to O(n^2) in the worst case.
Okay, that's the theory. Since the 3 bytes index version is slower at initialization it may have only an advantage with huge data sets. Other improvements could be made by using Span<byte>.
See: Big O notation - Wikipedia.
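The prefix-index idea from this answer can also be sketched compactly in Python, which is handy for checking results against the C# version on small inputs. This is only an illustration under assumptions: find_patterns is a hypothetical name, and a dict keyed by the first two bytes stands in for the 256 * 256 array.

```python
from collections import defaultdict


def find_patterns(data, patterns):
    """Return (position, pattern) pairs, indexing patterns by their first 2 bytes."""
    data = bytes(data)
    index = defaultdict(list)
    for pattern in patterns:
        pattern = bytes(pattern)
        if len(pattern) < 2:
            raise ValueError("patterns must be at least 2 bytes long")
        index[pattern[:2]].append(pattern)

    matches = []
    # Single pass over the data: only patterns sharing the 2-byte prefix are compared.
    for pos in range(len(data) - 1):
        for candidate in index.get(data[pos:pos + 2], ()):
            if data.startswith(candidate, pos):
                matches.append((pos, candidate))
    return matches


if __name__ == "__main__":
    sample = bytes([1, 2, 3, 4, 5, 3, 4])
    print(find_patterns(sample, [bytes([3, 4]), bytes([4, 5, 3])]))
```

Run on the sample file and patterns from the C# code above, it should report the same two hits as the C# loop.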

One idea is to group the patterns by their length, put each group in a HashSet<byte[]> for searching with O(1) complexity, and then scan the source byte[] index by index for all groups. Since the number of groups in your case is small (only 9 groups), this optimization should yield significant performance improvements. Here is an implementation:
IEnumerable<byte[]> FindMatches(byte[] source, byte[][] patterns)
{
Dictionary<int, HashSet<ArraySegment<byte>>> buckets = new();
ArraySegmentComparer comparer = new();
foreach (byte[] pattern in patterns)
{
HashSet<ArraySegment<byte>> bucket;
if (!buckets.TryGetValue(pattern.Length, out bucket))
{
bucket = new(comparer);
buckets.Add(pattern.Length, bucket);
}
bucket.Add(pattern); // Implicit cast byte[] => ArraySegment<byte>
}
for (int i = 0; i < source.Length; i++)
{
foreach (var (length, bucket) in buckets)
{
if (i + length > source.Length) continue;
ArraySegment<byte> slice = new(source, i, length);
if (bucket.TryGetValue(slice, out var pattern))
{
yield return pattern.Array;
bucket.Remove(slice);
}
}
}
}
Currently (.NET 6) there is no equality comparer for sequences available in the standard libraries, so you'll have to provide a custom one:
class ArraySegmentComparer : IEqualityComparer<ArraySegment<byte>>
{
public bool Equals(ArraySegment<byte> x, ArraySegment<byte> y)
{
return x.AsSpan().SequenceEqual(y);
}
public int GetHashCode(ArraySegment<byte> obj)
{
HashCode hashcode = new();
hashcode.AddBytes(obj);
return hashcode.ToHashCode();
}
}
This algorithm assumes that there are no duplicates in the patterns. If there are, only one of the duplicates will be emitted.
On my (not very speedy) PC this algorithm takes around 10 seconds to create the buckets dictionary (for 10,000,000 patterns of size 12-20), and then an additional 5-6 minutes to scan a source byte[] of size 70,000,000 (it scans around 200,000 bytes per second). The number of patterns does not affect the scanning phase (as long as the number of groups does not increase).
Parallelizing this algorithm is not trivial, because the buckets are mutated during the scan.
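For comparison, the length-bucket approach reduces to a few lines in Python, where bytes objects are hashable and no custom comparer is needed. This is a rough sketch with a hypothetical function name, not a port of the C# code; like the original, it emits each pattern at most once.

```python
def find_by_length_buckets(data, patterns):
    """Group patterns by length, then test every window of each length via set lookup."""
    data = bytes(data)
    buckets = {}
    for pattern in patterns:
        buckets.setdefault(len(pattern), set()).add(bytes(pattern))

    found = []
    for pos in range(len(data)):
        for length, bucket in buckets.items():
            window = data[pos:pos + length]
            # Slices near the end of the data are shorter than `length`; skip them.
            if len(window) == length and window in bucket:
                found.append(window)
                bucket.discard(window)  # report each pattern only once
    return found


if __name__ == "__main__":
    print(find_by_length_buckets(b"abcabc", [b"bc", b"cab", b"zz"]))
```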

Converting C# BitConverter.GetBytes() to PHP

I'm trying to port this C# code to PHP:
var headerList = new List<byte>();
headerList.AddRange(Encoding.ASCII.GetBytes("Hello\n"));
headerList.AddRange(BitConverter.GetBytes(1));
byte[] header = headerList.ToArray();
If I output header, what does it look like?
My progress so far:
$in_raw = "Hello\n";
for($i = 0; $i < mb_strlen($in_raw, 'ASCII'); $i++){
$in.= ord($in_raw[$i]);
}
$k=1;
$byteK=array(8); // should be 16? 32?...
for ($i = 0; $i < 8; $i++){
$byteK[$i] = (( $k >> (8 * $i)) & 0xFF); // Don't know if it is a valid PHP bitwise op
}
$in.=implode($byteK);
print_r($in);
Which gives me this output: 721011081081111010000000
I'm pretty confident that the first part of converting the string to ASCII bytes is correct, but these BitConverter... I don't know what to expect as output...
This string (or byte array) is used as a handshake for a socket connection. I know that the C# version works, but my ported code doesn't.

If you don't have access to a machine/tool that can run C#, there are a couple of REPL websites that you can use. I've taken your code, qualified a couple of the namespaces (just for convenience), wrapped it in a main() method to just run once as a CLI and put it here. It also includes a for loop that writes the contents of the array out so that you can see what is at each index.
Here's the same code for reference:
using System;
class MainClass {
public static void Main (string[] args) {
var headerList = new System.Collections.Generic.List<byte>();
headerList.AddRange(System.Text.Encoding.ASCII.GetBytes("Hello\n"));
headerList.AddRange(System.BitConverter.GetBytes(1));
byte[] header = headerList.ToArray();
foreach(byte b in header){
Console.WriteLine(b);
}
}
}
When you run this code, the following output is generated:
72
101
108
108
111
10
1
0
0
0

Encoding.ASCII.GetBytes("Hello\n").ToArray()
gives byte[6] { 72, 101, 108, 108, 111, 10 }
BitConverter.GetBytes((Int64)1).ToArray()
gives byte[8] { 1, 0, 0, 0, 0, 0, 0, 0 }
BitConverter.GetBytes((Int32)1).ToArray()
gives byte[4] { 1, 0, 0, 0 }
The last one is what you get by default, because the compiler treats the literal 1 as an Int32.
In the PHP code, try $byteK=array(4); and $i < 4.

The string "Hello\n" is already encoded in ASCII so you have nothing to do.
BitConverter.GetBytes() gives the binary representation of a 32-bit integer in machine byte order, which can be done in PHP with the pack() function and the l format.
So the PHP code is simply:
$in = "Hello\n";
$in .= pack('l', 1);
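As a neutral cross-check of the expected bytes, the same header can be built with Python's struct module. This is a small sketch under one assumption: it forces little-endian order with "<i", which matches BitConverter.GetBytes only on little-endian machines (the usual case on x86/x64). The function name build_header is hypothetical.

```python
import struct


def build_header():
    """ASCII text followed by the 32-bit integer 1 in little-endian byte order."""
    return "Hello\n".encode("ascii") + struct.pack("<i", 1)


if __name__ == "__main__":
    # Prints the individual byte values, one list entry per byte.
    print(list(build_header()))
```

The result is 10 raw bytes, not a string of decimal digits, which is the main difference from the PHP attempt in the question.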

Convert large number to two bytes in C#

I'm trying to convert a number from a textbox into 2 bytes which can then be sent over serial. The numbers range from 500 to -500. I already have a setup so I can simply send a string which is then converted to a byte. Here's an example:
send_serial("137", "1", "244", "128", "0")
The textbox number will go in the 2nd and 3rd bytes
This will make my Roomba (The robot that all this code is for) drive forward at a velocity of 500 mm/s. The 1st number sent tells the roomba to drive, 2nd and 3rd numbers are the velocity and the 4th and 5th numbers are the radius of the turn (between 2000 and -2000, also has a special case where 32768 is straight).

var value = "321";
var shortNumber = Convert.ToInt16(value);
var bytes = BitConverter.GetBytes(shortNumber);
Alternatively, if you require Big-Endian ordering:
var bigEndianBytes = new[]
{
(byte) (shortNumber >> 8),
(byte) (shortNumber & byte.MaxValue)
};
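To sanity-check the byte values before sending them, here is a short Python sketch of the big-endian, two's-complement split. The helper names int16_be and drive_command are hypothetical; the opcode 137 and the special radius value 32768 are taken from the question.

```python
def int16_be(value):
    """Return the low 16 bits of value as two bytes, high byte first."""
    # Masking with 0xFFFF gives the two's-complement form for negative numbers.
    return (value & 0xFFFF).to_bytes(2, "big")


def drive_command(velocity, radius):
    """Opcode 137 followed by velocity and radius, each as two big-endian bytes."""
    return bytes([137]) + int16_be(velocity) + int16_be(radius)


if __name__ == "__main__":
    print(list(drive_command(500, 32768)))
    print(list(int16_be(-500)))
```

For a velocity of 500 and the "straight" radius this reproduces the 137, 1, 244, 128, 0 sequence from the question.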

Assuming you are using System.IO.Ports.SerialPort, you can send the data with SerialPort.Write(byte[], int, int).
If your input looks like 99,255, you can extract the two bytes like this:
// Split the string into two parts
string[] strings = textBox1.Text.Split(',');
byte byte1, byte2;
// Make sure it has only two parts,
// and parse the string into a byte, safely
if (strings.Length == 2
&& byte.TryParse(strings[0], System.Globalization.NumberStyles.Integer, System.Globalization.CultureInfo.InvariantCulture, out byte1)
&& byte.TryParse(strings[1], System.Globalization.NumberStyles.Integer, System.Globalization.CultureInfo.InvariantCulture, out byte2))
{
// Form the bytes to send
byte[] bytes_to_send = new byte[] { 137, byte1, byte2, 128, 0 };
// Writes the data to the serial port.
serialPort1.Write(bytes_to_send, 0, bytes_to_send.Length);
}
else
{
// Show some kind of error message?
}
Here I assume your "byte" is from 0 to 255, which is the same as C#'s byte type. I used byte.TryParse to parse the string into a byte.

Calculating the number of bits in a Subnet Mask in C#

I have a task to complete in C#. I have a Subnet Mask: 255.255.128.0.
I need to find the number of bits in the Subnet Mask, which would be, in this case, 17.
However, I need to be able to do this in C# WITHOUT the use of the System.Net library (the system I am programming in does not have access to this library).
It seems like the process should be something like:
1) Split the Subnet Mask into Octets.
2) Convert the Octets to be binary.
3) Count the number of Ones in each Octet.
4) Output the total number of found Ones.
However, my C# is pretty poor. Does anyone have the C# knowledge to help?

Bit counting algorithm taken from:
http://www.necessaryandsufficient.net/2009/04/optimising-bit-counting-using-iterative-data-driven-development/
string mask = "255.255.128.0";
int totalBits = 0;
foreach (string octet in mask.Split('.'))
{
byte octetByte = byte.Parse(octet);
while (octetByte != 0)
{
totalBits += octetByte & 1; // logical AND on the LSB
octetByte >>= 1; // do a bitwise shift to the right to create a new LSB
}
}
Console.WriteLine(totalBits);
The most simple algorithm from the article was used. If performance is critical, you might want to read the article and use a more optimized solution from it.
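The four steps from the question (split, convert, count, sum) can be verified quickly with a Python sketch like the one below. The function name mask_bits is hypothetical, and the sketch only counts set bits; it does not check that the ones are contiguous, as they would be in a valid subnet mask.

```python
def mask_bits(mask):
    """Count the 1 bits across the four octets of a dotted-decimal mask."""
    octets = [int(part) for part in mask.split(".")]
    if len(octets) != 4 or any(not 0 <= o <= 255 for o in octets):
        raise ValueError("expected four octets in the range 0-255")
    # bin(128) == '0b10000000', so counting '1' characters counts set bits.
    return sum(bin(o).count("1") for o in octets)


if __name__ == "__main__":
    print(mask_bits("255.255.128.0"))
```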

string ip = "255.255.128.0";
string a = "";
ip.Split('.').ToList().ForEach(x => a += Convert.ToString(int.Parse(x), 2)); // build the binary digits of each octet
int ones_found = a.Replace("0", "").Length;

A complete sample:
public int CountBit(string mask)
{
int ones=0;
Array.ForEach(mask.Split('.'),(s)=>Array.ForEach(Convert.ToString(int.Parse(s),2).Where(c=>c=='1').ToArray(),(k)=>ones++));
return ones;
}

You can convert a number to binary like this:
string ip = "255.255.128.0";
string[] tokens = ip.Split('.');
string result = "";
foreach (string token in tokens)
{
int tokenNum = int.Parse(token);
string octet = Convert.ToString(tokenNum, 2).PadLeft(8, '0'); // left-pad to a full 8 bits
result += octet;
}
int mask = result.LastIndexOf('1') + 1;

The solution is to use a binary operation like
foreach(string octet in ipAddress.Split('.'))
{
int oct = int.Parse(octet);
while(oct !=0)
{
total += oct & 1; // {1}
oct >>=1; //{2}
}
}
The trick is that on line {1} the bitwise AND acts like a per-bit multiplication: 1 x 0 = 0, 1 x 1 = 1. So if we take some hypothetical number
0000101001 and AND it with 1, which is nothing other than 0000000001, we get
0000101001
0000000001
The rightmost digit is 1 in both numbers, so the bitwise AND returns 1; if the lowest digit of either number were 0, the result would be 0.
So on the line total += oct & 1 we add either 1 or 0 to total, depending on that digit.
On line {2} we shift the lowest bit out to the right, in effect dividing the number by 2, and repeat until the number becomes 0.
Easy.
EDIT
This is valid for integer and byte types, but do not use this technique on floating-point numbers. By the way, it's a pretty valuable solution for this question.

Reversing a hash function

I have the following hash function, and I'm trying to find a way to reverse it, so that I can find the key from a hashed value.
uint Hash(string s)
{
uint result = 0;
for (int i = 0; i < s.Length; i++)
{
result = ((result << 5) + result) + s[i];
}
return result;
}
The code is in C# but I assume it is clear.
I am aware that for one hashed value, there can be more than one key, but my intent is not to find them all, just one that satisfies the hash function suffices.
EDIT :
The string that the function accepts consists only of the digits 0 to 9 and the chars '*' and '#', hence the Unhash function must respect this criterion too.
Any ideas? Thank you.

This should reverse the operations:
string Unhash(uint hash)
{
List<char> s = new List<char>();
while (hash != 0)
{
s.Add((char)(hash % 33));
hash /= 33;
}
s.Reverse();
return new string(s.ToArray());
}
This should return a string that gives the same hash as the original string, but it is very unlikely to be the exact same string.

Characters 0-9,*,# have ASCII values 48-57,42,35, or binary: 00110000 ... 00111001, 00101010, 00100011
First 5 bits of those values are different, and 6th bit is always 1. This means that you can deduce your last character in a loop by taking current hash:
uint lastChar = hash & 0x1F - ((hash >> 5) - 1) & 0x1F + 0x20;
(if this doesn't work, I don't know who wrote it)
Now roll back hash,
hash = (hash - lastChar) / 33;
and repeat the loop until hash becomes zero. I don't have C# on me, but I'm 70% confident that this should work with only minor changes.

Brute force should work if uint is 32 bits. Try at least 2^32 strings and one of them is likely to hash to the same value. Should only take a few minutes on a modern pc.
You have 12 possible characters, and 12^9 is about 2^32, so if you try 9 character strings you're likely to find your target hash. I'll do 10 character strings just to be safe.
(simple recursive implementation in C++, don't know C# that well)
#define NUM_VALID_CHARS 12
#define STRING_LENGTH 10
const char valid_chars[NUM_VALID_CHARS] = {'0', ..., '#' ,'*'};
void unhash(uint hash_value, char *string, int nchars) {
if (nchars == STRING_LENGTH) {
string[STRING_LENGTH] = 0;
if (Hash(string) == hash_value) { printf("%s\n", string); }
} else {
for (int i = 0; i < NUM_VALID_CHARS; i++) {
string[nchars] = valid_chars[i];
unhash(hash_value, string, nchars + 1);
}
}
}
Then call it with:
char string[STRING_LENGTH + 1];
unhash(hash_value, string, 0);
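The brute-force idea can be tried out on a small scale in Python before committing to a full 2^32-sized search. This sketch assumes the C# arithmetic wraps at 32 bits (the default unchecked behavior for uint); hash33 and find_key are hypothetical names, and a plain loop like this is only practical for short keys.

```python
from itertools import product

ALPHABET = "0123456789*#"  # the 12 characters allowed by the question


def hash33(s):
    """Python port of the question's Hash(): result = result * 33 + char, mod 2^32."""
    result = 0
    for ch in s:
        result = (((result << 5) + result) + ord(ch)) & 0xFFFFFFFF
    return result


def find_key(target, max_len):
    """Try all keys up to max_len characters; return the first one that hashes to target."""
    for length in range(1, max_len + 1):
        for combo in product(ALPHABET, repeat=length):
            candidate = "".join(combo)
            if hash33(candidate) == target:
                return candidate
    return None


if __name__ == "__main__":
    target = hash33("12#")
    key = find_key(target, 3)
    print(key, hash33(key) == target)
```

The key that comes back is only guaranteed to produce the same hash; it need not be the original string.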

Hash functions are designed to be difficult or impossible to reverse, hence the name (visualize meat + potatoes being ground up)

I would start out by writing each step that result = ((result << 5) + result) + s[i]; does on a separate line. This will make solving a lot easier. Then all you have to do is the opposite of each line (in the opposite order too).
