How to calculate HASHBYTES SHA1 using C#? - c#

In a table I have a column URL which I am using to save URLs. I am calculating the hash in another column using the formula (CONVERT([varbinary](20),hashbytes('SHA1',[URL]))). It's working fine.
Now I need an equivalent function in C# to compute the hash, so that I can check that a similar row doesn't exist before I insert a new one. I tried a few links but had no luck.
Here are the links:
http://weblogs.sqlteam.com/mladenp/archive/2009/04/28/Comparing-SQL-Server-HASHBYTES-function-and-.Net-hashing.aspx
How do I calculate the equivalent to SQL Server (hashbytes('SHA1',[ColumnName])) in C#?
**I found this link working. All I need to do is change the formula in the db, but is it possible to do it in one line?**
http://forums.asp.net/t/1782626.aspx
DECLARE @HashThis nvarchar(4000);
DECLARE @BinHash varbinary(4000);
SELECT @HashThis = CONVERT(nvarchar(4000),'Password#Test');
SELECT @BinHash = HashBytes('SHA1', @HashThis);
SELECT cast(N'' as xml).value('xs:base64Binary(xs:hexBinary(sql:variable("@BinHash")))', 'nvarchar(4000)');
In C#:
string pwd = "Password#Test";
var sha1Provider = HashAlgorithm.Create("SHA1");
var binHash = sha1Provider.ComputeHash(Encoding.Unicode.GetBytes(pwd));
Console.WriteLine(Convert.ToBase64String(binHash));
I am using SQL Server 2012. The collation for the database is SQL_Latin1_General_CP1_CI_AS.
Thanks
Paraminder

It's an encoding issue:
C#/.Net/CLR strings are, internally, UTF-16 encoded strings. That means each character is at least two bytes.
Sql Server is different:
char and varchar represent each character as a single byte using the code page tied to the collation used by that column
nchar and nvarchar represent each character as 2 bytes using the [old and obsolete] UCS-2 encoding for Unicode — something which was deprecated in 1996 with the release of Unicode 2.0 and UTF-16.
The big difference between UTF-16 and UCS-2 is that UCS-2 can only represent characters within the Unicode BMP (Basic Multilingual Plane); UTF-16 can represent any Unicode character. Within the BMP, as I understand it, UCS-2 and UTF-16 representations are identical.
That means that to compute a hash that is identical to the one that SQL Server computes, you're going to have to get a byte representation that is identical to the one that SQL Server has. Since it sounds like you're using char or varchar with the collation SQL_Latin1_General_CP1_CI_AS, per the documentation, the CP1 part means code page 1252 and the rest means case-insensitive, accent-sensitive. So...
You can get the encoding for code page 1252 by:
Encoding enc = Encoding.GetEncoding(1252);
Using that information, and given this table:
create table dbo.hash_test
(
id int not null identity(1,1) primary key clustered ,
source_text varchar(2000) collate SQL_Latin1_General_CP1_CI_AS not null ,
hash as ( hashbytes( 'SHA1' , source_text ) )
)
go
insert dbo.hash_test ( source_text ) values ( 'the quick brown fox jumped over the lazy dog.' )
insert dbo.hash_test ( source_text ) values ( 'She looked like something that might have occured to Ibsen in one of his less frivolous moments.' )
go
You'll get this output
1: the quick brown fox jumped over the lazy dog.
sql: 6039D100 3323D483 47DDFDB5 CE2842DF 758FAB5F
c#: 6039D100 3323D483 47DDFDB5 CE2842DF 758FAB5F
2: She looked like something that might have occured to Ibsen in one of his less frivolous moments.
sql: D92501ED C462E331 B0E129BF 5B4A854E 8DBC490C
c#: D92501ED C462E331 B0E129BF 5B4A854E 8DBC490C
from this program
using System;
using System.Data;
using System.Data.SqlClient;
using System.Diagnostics;
using System.Linq;
using System.Security.Cryptography;
using System.Text;

class Program
{
static byte[] Sha1Hash( string s )
{
SHA1 sha1 = SHA1.Create() ;
Encoding windows1252 = Encoding.GetEncoding(1252) ;
byte[] octets = windows1252.GetBytes(s) ;
byte[] hash = sha1.ComputeHash( octets ) ;
return hash ;
}
static string HashToString( byte[] bytes )
{
StringBuilder sb = new StringBuilder() ;
for ( int i = 0 ; i < bytes.Length ; ++i )
{
byte b = bytes[i] ;
if ( i > 0 && 0 == i % 4 ) sb.Append( ' ' ) ;
sb.Append( b.ToString("X2") ) ;
}
string s = sb.ToString() ;
return s ;
}
private static DataTable ReadDataFromSqlServer()
{
DataTable dt = new DataTable();
using ( SqlConnection conn = new SqlConnection( "Server=localhost;Database=sandbox;Trusted_Connection=True;"))
using ( SqlCommand cmd = conn.CreateCommand() )
using ( SqlDataAdapter sda = new SqlDataAdapter(cmd) )
{
cmd.CommandText = "select * from dbo.hash_test" ;
cmd.CommandType = CommandType.Text;
conn.Open();
sda.Fill( dt ) ;
conn.Close() ;
}
return dt ;
}
static void Main()
{
DataTable dt = ReadDataFromSqlServer() ;
foreach ( DataRow row in dt.Rows )
{
int id = (int) row[ "id" ] ;
string sourceText = (string) row[ "source_text" ] ;
byte[] sqlServerHash = (byte[]) row[ "hash" ] ;
byte[] myHash = Sha1Hash( sourceText ) ;
Console.WriteLine();
Console.WriteLine( "{0:##0}: {1}" , id , sourceText ) ;
Console.WriteLine( " sql: {0}" , HashToString( sqlServerHash ) ) ;
Console.WriteLine( " c#: {0}" , HashToString( myHash ) ) ;
Debug.Assert( sqlServerHash.SequenceEqual(myHash) ) ;
}
return ;
}
}
Easy!

I would suggest that anytime a hash is created, it be done in a single place: either in code or in the database. It will make your life easier in the long run. That means either changing your C# code to create the hash before inserting the record, or doing the duplication check within a stored procedure instead.
Regardless, the duplication check and insert should be synchronized so that no other inserts can occur between the time you check for duplicates and the time the record is actually inserted. The easiest way to do that is to perform both within the same transaction, as in the sketch below.
If you insist on leaving the logic as it stands, I would suggest you create the hash in the database but expose it via a stored procedure or user-defined function that can be called from your C# code.
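If you go the C#-side route, here is a minimal sketch of that approach, hedged accordingly: the table and column names (dbo.urls, url, hash) are made up, and the Windows-1252 encoding assumes a varchar URL column under the SQL_Latin1_General_CP1_CI_AS collation, as described in the answer above.
using System.Data;
using System.Data.SqlClient;
using System.Security.Cryptography;
using System.Text;

static class UrlInserter
{
    // Compute the same SHA1 hash SQL Server produces for a varchar
    // column under a code-page-1252 collation.
    static byte[] Sha1Hash(string s)
    {
        using (SHA1 sha1 = SHA1.Create())
        {
            return sha1.ComputeHash(Encoding.GetEncoding(1252).GetBytes(s));
        }
    }

    // Hypothetical table: dbo.urls ( url varchar(2000), hash varbinary(20) ).
    public static void InsertIfAbsent(string connectString, string url)
    {
        byte[] hash = Sha1Hash(url);
        using (SqlConnection conn = new SqlConnection(connectString))
        {
            conn.Open();
            // Serializable keeps a competing insert from slipping in between
            // the existence check and the insert.
            using (SqlTransaction tx = conn.BeginTransaction(IsolationLevel.Serializable))
            using (SqlCommand cmd = conn.CreateCommand())
            {
                cmd.Transaction = tx;
                cmd.CommandText =
                    @"insert dbo.urls ( url , hash )
                      select @url , @hash
                      where not exists ( select 1 from dbo.urls where hash = @hash )";
                cmd.Parameters.AddWithValue("@url", url);
                cmd.Parameters.AddWithValue("@hash", hash);
                cmd.ExecuteNonQuery();
                tx.Commit();
            }
        }
    }
}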

Related

How do you add a line break after the first `=` in PowerBI measures when using Tabular Editor c# scripts?

The FormatDax function in Tabular Editor doesn't put a newline after the measure name definition. How do you update all the measure formulas to include a newline after running FormatDax?
For example. Turning this...
Count = DIVIDE (
COUNT ( Tests[Lot] ),
DISTINCTCOUNT ( Tests[Part Number] )
)
Into this...
Count =
DIVIDE (
COUNT ( Tests[Lot] ),
DISTINCTCOUNT ( Tests[Part Number] )
)
This took some time to figure out, so I'm leaving a Q&A to make it easier to find next time.
// C# Script for Tabular Editor
// Format every measure, then prepend a newline to each expression so the
// DAX body starts on the line after "MeasureName =".
FormatDax(Model.AllMeasures);
foreach (var measure in Model.AllMeasures)
{
    measure.Expression = Environment.NewLine + measure.Expression;
}

TSQL MD5 generation with UTF8

I have a .NET function MD5 that when run on "146.185.59.178acu-cell.com" it returns f36674ed3dbcb151e1c0dfe4acdbb9f5
public static String MD5(String s)
{
using (var provider = System.Security.Cryptography.MD5.Create())
{
StringBuilder builder = new StringBuilder();
foreach (Byte b in provider.ComputeHash(Encoding.UTF8.GetBytes(s)))
builder.Append(b.ToString("x2").ToLower());
return builder.ToString();
}
}
I wrote the same code in T-SQL, but for some reason only the varchar returns the expected result. The nvarchar returns a different MD5: f04b83328560f1bd1c08104b83bc30ea
declare @v varchar(150) = '146.185.59.178acu-cell.com'
declare @nv nvarchar(150) = '146.185.59.178acu-cell.com'
select LOWER(CONVERT(VARCHAR(32), HashBytes('MD5', @v), 2))
--f36674ed3dbcb151e1c0dfe4acdbb9f5
select LOWER(CONVERT(VARCHAR(32), HashBytes('MD5', @nv), 2))
--f04b83328560f1bd1c08104b83bc30ea
Not sure what is going on here, because I expect the nvarchar to return f36674ed3dbcb151e1c0dfe4acdbb9f5 as it does in .NET.
You get different hashes because the binary representation of the text is different. The following query demonstrates this:
declare @v varchar(150) = '146.185.59.178acu-cell.com'
declare @nv nvarchar(150) = '146.185.59.178acu-cell.com'
select convert(varbinary(max), @v) -- 0x3134362E3138352E35392E3137386163752D63656C6C2E636F6D
select convert(varbinary(max), @nv) -- 0x3100340036002E003100380035002E00350039002E003100370038006100630075002D00630065006C006C002E0063006F006D00
The extra 0 bytes for the nvarchar are due to the fact that it's a 2-byte Unicode datatype. Refer to MSDN for more information on Unicode in SQL Server.
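You can see the same thing from the .NET side by hashing the string under both encodings: Encoding.UTF8 reproduces the varchar bytes (the string is plain ASCII), while Encoding.Unicode (UTF-16LE, which is how nvarchar is stored) reproduces the doubled-up bytes above. A small sketch, which should print the two hashes from the question:
using System;
using System.Security.Cryptography;
using System.Text;

class Md5Demo
{
    static string Md5Hex(byte[] bytes)
    {
        using (var md5 = MD5.Create())
            return BitConverter.ToString(md5.ComputeHash(bytes)).Replace("-", "").ToLower();
    }

    static void Main()
    {
        string s = "146.185.59.178acu-cell.com";
        // Matches HashBytes('MD5', @v) for the varchar:
        Console.WriteLine(Md5Hex(Encoding.UTF8.GetBytes(s)));    // f36674ed3dbcb151e1c0dfe4acdbb9f5
        // Matches HashBytes('MD5', @nv) for the nvarchar:
        Console.WriteLine(Md5Hex(Encoding.Unicode.GetBytes(s))); // f04b83328560f1bd1c08104b83bc30ea
    }
}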
Turns out I need to explicitly convert the NVarChar to UTF-8.
Found this code on the net (note: the version commonly posted omits the third byte of three-byte UTF-8 sequences; that is fixed below):
CREATE FUNCTION [dbo].[fnUTF8] (
    @String NVarChar(max)
) RETURNS VarChar(max) AS BEGIN
    DECLARE @Result VarChar(max)
           ,@Counter Int
           ,@Len Int
    SELECT @Result = ''
          ,@Counter = 1
          ,@Len = Len(@String)
    WHILE (@@RowCount > 0)
        SELECT @Result = @Result
             + CASE WHEN Code < 128 THEN ''
                    WHEN Code < 2048 THEN Char(192 + Code / 64)
                    ELSE Char(224 + Code / 4096)
               END
             + CASE WHEN Code < 128 THEN Char(Code)
                    WHEN Code < 2048 THEN Char(128 + Code % 64)
                    ELSE Char(128 + Code / 64 % 64)
               END
             -- trailing byte of three-byte sequences (code points >= 2048),
             -- missing from the version commonly found online
             + CASE WHEN Code < 2048 THEN ''
                    ELSE Char(128 + Code % 64)
               END
            ,@Counter = @Counter + 1
        FROM (SELECT UniCode(SubString(@String, @Counter, 1)) AS Code) C
        WHERE @Counter <= @Len
    RETURN @Result
END
GO
And now I use it like this:
select LOWER(CONVERT(VARCHAR(32), HashBytes('MD5', [dbo].[fnUTF8](@nv)), 2))

Getting binary column from SQL Server database

I need to import a column from a SQL Server database using C#.
When I use SQL Server Enterprise, the column is shown as <binary>, and when I run a query on the SQL Server it returns the right binary values.
However, when I try coding with C# like so:
SqlConnection conn = new SqlConnection("Server=portable;Database=data;Integrated Security=true;");
conn.Open();
SqlCommand cmd = new SqlCommand("SELECT bnry FROM RawData", conn);
SqlDataReader reader = cmd.ExecuteReader();
while(reader.Read())
{
Console.WriteLine(reader. //I do not know what to put here
}
reader.Close();
conn.Close();
When I use reader.GetSqlBinary(0), I only get many SqlBinary<4096> values as output.
In SQL Server Query Analyzer, the same query returns values like 0x0000..
What should I put after reader. or is there another method of getting this data from the database?
You're getting back an array of bytes, so use:
byte[] bytes = (byte[])reader[0];
What you do with them from there depends on what the bytes represent.
The better way to do this is with a stream, so that the underlying resources (the data is probably coming in over a socket connection) are properly disposed of when finished. So I would use GetStream():
using ( Stream stream = reader.GetStream(0) )
{
//do your work on the stream here
}
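If the column holds large blobs, a fuller sketch (assuming the same RawData table from the question; the output file names are made up) pairs GetStream with CommandBehavior.SequentialAccess so the whole row isn't buffered first:
using System;
using System.Data;
using System.Data.SqlClient;
using System.IO;

static class BlobStreamer
{
    public static void StreamBinaryColumn(string connectString)
    {
        using (var conn = new SqlConnection(connectString))
        using (var cmd = new SqlCommand("SELECT bnry FROM RawData", conn))
        {
            conn.Open();
            // SequentialAccess streams column data instead of buffering
            // the entire row in memory -- important for large blobs.
            using (SqlDataReader reader = cmd.ExecuteReader(CommandBehavior.SequentialAccess))
            {
                int row = 0;
                while (reader.Read())
                {
                    using (Stream stream = reader.GetStream(0))
                    using (FileStream file = File.Create("blob" + row + ".bin")) // hypothetical destination
                    {
                        stream.CopyTo(file);
                    }
                    ++row;
                }
            }
        }
    }
}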
Given this table
create table dbo.bin_test
(
c1 binary(8) not null ,
c2 binary(8) null ,
c3 varbinary(8) not null ,
c4 varbinary(8) null
)
insert dbo.bin_test values ( 0x1234 , null , 0x1234 , null )
insert dbo.bin_test values ( 0x012345678 , 0x12345678 , 0x12345678 , 0x12345678 )
This code (IF you're going to use SqlBinary)
string connectString = "Server=localhost;Database=sandbox;Trusted_Connection=True;" ;
using ( SqlConnection connection = new SqlConnection(connectString) )
using ( SqlCommand cmd = connection.CreateCommand() )
{
cmd.CommandText = "select * from dbo.bin_test" ;
cmd.CommandType = CommandType.Text ;
connection.Open() ;
using ( SqlDataReader reader = cmd.ExecuteReader() )
{
int row = 0 ;
while ( reader.Read() )
{
for ( int col = 0 ; col < reader.FieldCount ; ++col )
{
Console.Write( "row{0,2}, col{1,2}: " , row , col ) ;
SqlBinary octets = reader.GetSqlBinary(col) ;
if ( octets.IsNull )
{
Console.WriteLine( "{null}");
}
else
{
Console.WriteLine( "length={0:##0}, {{ {1} }}" , octets.Length , string.Join( " , " , octets.Value.Select(x => string.Format("0x{0:X2}",x)))) ;
}
}
Console.WriteLine() ;
++row ;
}
}
connection.Close() ;
}
should produce:
row 0, col 0: length=8, { 0x12 , 0x34 , 0x00 , 0x00 , 0x00 , 0x00 , 0x00 , 0x00 }
row 0, col 1: {null}
row 0, col 2: length=2, { 0x12 , 0x34 }
row 0, col 3: {null}
row 1, col 0: length=8, { 0x00 , 0x12 , 0x34 , 0x56 , 0x78 , 0x00 , 0x00 , 0x00 }
row 1, col 1: length=8, { 0x12 , 0x34 , 0x56 , 0x78 , 0x00 , 0x00 , 0x00 , 0x00 }
row 1, col 2: length=4, { 0x12 , 0x34 , 0x56 , 0x78 }
row 1, col 3: length=4, { 0x12 , 0x34 , 0x56 , 0x78 }
But as noted, it's probably cleaner to simply do this:
byte[] octets = reader[0] as byte[] ;
if ( octets == null )
{
Console.WriteLine( "{null}");
}
else
{
Console.WriteLine( "length={0:##0}, {{ {1} }}" , octets.Length , string.Join( " , " , octets.Select(x => string.Format("0x{0:X2}",x)))) ;
}
And get the same result.
You should access the data through the SqlDataReader indexer (MSDN).
So it would look like this:
//..
Console.WriteLine((byte[])reader["bnry"]);
//..
UPDATE:
I guess I finally got where your problem is. We are not on the same page here. I will try to be as simple as possible.
For a start, you need to understand that all information in a computer is stored in memory as a bunch of bytes. It is quite cumbersome to work directly with bytes in memory, so different data types were introduced (int, string, Image, etc.) to ease programmers' lives. Most objects in .NET can still be converted to their internal representation as a byte array in one way or another. During this conversion you lose the information about what the byte array contains - it could just as easily be an Image, a string, or even an int array. To get back from the binary representation, you need to know what the byte array contains.
In your example you are trying to write out the byte array directly. As output always needs to be text, the byte array somehow needs to be converted to a string. This is done by calling the .ToString() function on the byte array. Unfortunately for you, the default implementation of .ToString() for complex objects just returns the type name. That's where all those System.Byte[] and SqlBinary<4096> lines are coming from.
To get past this issue, before displaying the result you need to convert the byte array to the necessary type. It seems that in your case the byte array contains textual information, so I guess you need to convert it to a string. To do that, you need to know the encoding (the way the string is stored in memory) of the text.
Basically, your code should look like this:
SqlConnection conn = new SqlConnection("Server=portable;Database=data;Integrated Security=true;");
conn.Open();
SqlCommand cmd = new SqlCommand("SELECT bnry FROM RawData", conn);
SqlDataReader reader = cmd.ExecuteReader();
while(reader.Read())
{
var valueAsArray = (byte[])reader["bnry"];
//as there are different encodings possible, you need to find encoding what works for you
var valueAsStringDefault = System.Text.Encoding.Default.GetString(valueAsArray);
Console.WriteLine(valueAsStringDefault);
//...or...
var valueAsStringUTF8 = System.Text.Encoding.UTF8.GetString(valueAsArray);
Console.WriteLine(valueAsStringUTF8);
//...or...
var valueAsStringUTF7 = System.Text.Encoding.UTF7.GetString(valueAsArray);
Console.WriteLine(valueAsStringUTF7);
//...or any other encoding. Most of them you can find in System.Text.Encoding namespace...
}
reader.Close();
conn.Close();

Convert a file full of "INSERT INTO xxx VALUES" into something BULK INSERT can parse

This is a followup to my first question "Porting “SQL” export to T-SQL".
I am working with a 3rd-party program that I have no control over and cannot change. This program exports its internal database into a set of .sql files, each with the format:
INSERT INTO [ExampleDB] ( [IntField] , [VarcharField], [BinaryField])
VALUES
(1 , 'Some Text' , 0x123456),
(2 , 'B' , NULL),
--(SNIP, it does this for 1000 records)
(999, 'E' , null);
(1000 , 'F' , null);
INSERT INTO [ExampleDB] ( [IntField] , [VarcharField] , BinaryField)
VALUES
(1001 , 'asdg', null),
(1002 , 'asdf' , 0xdeadbeef),
(1003 , 'dfghdfhg' , null),
(1004 , 'sfdhsdhdshd' , null),
--(SNIP 1000 more lines)
This pattern continues until the .sql file has reached a file size set during the export. The export files are grouped as EXPORT_PATH\%Table_Name%\Export#.sql, where the # is a counter starting at 1.
Currently I have about 1.3GB of data and I have it exporting in 1MB chunks (1407 files across 26 tables; all but 5 tables have only one file, and the largest table has 207 files).
Right now I just have a simple C# program that reads each file into RAM and then calls ExecuteNonQuery. The issue is I am averaging 60 sec/file, which means it will take about 23 hrs to process the entire export.
I assume that if I could somehow format the files to be loaded with BULK INSERT instead of INSERT INTO, it could go much faster. Is there an easy way to do this, or do I have to write some kind of find & replace and keep my fingers crossed that it does not fail on some corner case and blow up my data?
Any other suggestions on how to speed up the INSERT INTO would also be appreciated.
UPDATE:
I ended up going with the parse and do a SqlBulkCopy method. It went from 1 file/min. to 1 file/sec.
Well, here is my "solution" for helping convert the data into a DataTable or otherwise (run it in LINQPad):
var i = "(null, 1 , 'Some''\n Text' , 0x123.456)";
var pat = @",?\s*(?:(?<n>null)|(?<w>[\w.]+)|'(?<s>.*)'(?!'))";
Regex.Matches(i, pat,
    RegexOptions.IgnoreCase | RegexOptions.Singleline).Dump();
The match should be run once per value group (e.g. (a,b,etc)). Parsing of the results (e.g. type conversion) is left to the caller and I have not tested it [much]. I would recommend creating the correctly-typed DataTable first -- although it may be possible to pass everything "as a string" to the database? -- and then use the information in the columns to help with the extraction process (possibly using type converters). For the captures: n is null, w is word (e.g. number), s is string.
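To make that concrete, here is a rough sketch (no better tested than the regex itself) of turning one value group into an object array that could be handed to DataTable.Rows.Add; the unescaping and type handling are deliberately naive:
using System;
using System.Text.RegularExpressions;

class ValueGroupParser
{
    // Turn one "(a, b, ...)" value group into an object array suitable
    // for DataTable.Rows.Add. n = null, w = word (number/hex), s = string.
    static object[] ParseValueGroup(string valueGroup)
    {
        var pat = @",?\s*(?:(?<n>null)|(?<w>[\w.]+)|'(?<s>.*)'(?!'))";
        var matches = Regex.Matches(valueGroup.Trim('(', ')'), pat,
            RegexOptions.IgnoreCase | RegexOptions.Singleline);
        var values = new object[matches.Count];
        for (int i = 0; i < matches.Count; i++)
        {
            Match m = matches[i];
            if (m.Groups["n"].Success)
                values[i] = DBNull.Value;
            else if (m.Groups["w"].Success)
                values[i] = m.Groups["w"].Value;                     // numeric/hex token, still text
            else
                values[i] = m.Groups["s"].Value.Replace("''", "'");  // unescape embedded quotes
        }
        return values;
    }

    static void Main()
    {
        foreach (var v in ParseValueGroup("(null, 1 , 'Some'' Text' , 0x123456)"))
            Console.WriteLine(v is DBNull ? "<null>" : v);
    }
}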
Happy coding.
Apparently your data is always wrapped in parentheses and starts with a left parenthesis. You could use this rule to Split (with RemoveEmptyEntries) each of those lines and load them into a DataTable. Then you can use SqlBulkCopy to copy it all at once into the database (a sketch of that step follows the code below).
This approach would not necessarily be fail-safe, but it would certainly be faster.
Edit: Here's how you could get the schema for every table:
private static DataTable extractSchemaTable(IEnumerable<String> lines)
{
DataTable schema = null;
var insertLine = lines.SkipWhile(l => !l.StartsWith("INSERT INTO [")).Take(1).First();
var startIndex = insertLine.IndexOf("INSERT INTO [") + "INSERT INTO [".Length;
var endIndex = insertLine.IndexOf("]", startIndex);
var tableName = insertLine.Substring(startIndex, endIndex - startIndex);
using (var con = new SqlConnection("CONNECTION"))
{
using (var schemaCommand = new SqlCommand("SELECT * FROM " + tableName, con))
{
con.Open();
using (var reader = schemaCommand.ExecuteReader(CommandBehavior.SchemaOnly))
{
// DataTable.Load with a SchemaOnly reader creates the columns without
// loading any rows. (GetSchemaTable would instead return a table
// *describing* the columns, which you couldn't add data rows to.)
schema = new DataTable(tableName);
schema.Load(reader);
}
}
}
return schema;
}
Then you simply need to iterate each line in the file, check if it starts with ( and split that line by Split(new[] { ',' }, StringSplitOptions.RemoveEmptyEntries). Then you could add the resulting array into the created schema-table.
Something like this:
var allLines = System.IO.File.ReadAllLines(path);
DataTable result = extractSchemaTable(allLines);
for (int i = 0; i < allLines.Length; i++)
{
String line = allLines[i];
if (line.StartsWith("("))
{
String data = line.Substring(1, line.Length - (line.Length - line.LastIndexOf(")")) - 1);
var fields = data.Split(new[] { ',' }, StringSplitOptions.RemoveEmptyEntries);
// you might need to parse it to correct DataColumn.DataType
result.Rows.Add(fields);
}
}
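For completeness, a minimal sketch of the SqlBulkCopy step once the DataTable is filled (the connection string, destination table name, and batch size are placeholders):
using System.Data;
using System.Data.SqlClient;

static class BulkLoader
{
    public static void BulkLoad(DataTable table, string connectString, string destinationTable)
    {
        using (SqlBulkCopy bulk = new SqlBulkCopy(connectString))
        {
            bulk.DestinationTableName = destinationTable;
            bulk.BatchSize = 1000;     // rows per round-trip; tune to taste
            bulk.BulkCopyTimeout = 0;  // no timeout for large loads
            // Map columns by name so ordinal differences can't scramble the data.
            foreach (DataColumn col in table.Columns)
                bulk.ColumnMappings.Add(col.ColumnName, col.ColumnName);
            bulk.WriteToServer(table);
        }
    }
}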

SQL/C# - Primary Key error on UPSERT

UPDATE (simplified the problem, removed C# from the issue)
How can I write an UPSERT that can recognize when two rows are the same in the following case...
See how there's a \b [backspace] encoded there (the weird little character)? SQL sees these as the same, while my UPSERT sees this as new data and attempts an INSERT where there should be an UPDATE.
--UPSERT
INSERT INTO [table]
SELECT [col1] = @col1, [col2] = @col2, [col3] = @col3, [col4] = @col4
FROM [table]
WHERE NOT EXISTS
-- race condition risk here?
( SELECT 1 FROM [table]
WHERE
[col1] = @col1
AND [col2] = @col2
AND [col3] = @col3)
UPDATE [table]
SET [col4] = @col4
WHERE
[col1] = @col1
AND [col2] = @col2
AND [col3] = @col3
You need the @ sign (the C# verbatim string prefix), otherwise a C# character escape sequence is hit.
C# defines the following character escape sequences:
\' - single quote, needed for character literals
\" - double quote, needed for string literals
\\ - backslash
\0 - Unicode character 0
\a - Alert (character 7)
\b - Backspace (character 8)
\f - Form feed (character 12)
\n - New line (character 10)
\r - Carriage return (character 13)
\t - Horizontal tab (character 9)
\v - Vertical tab (character 11)
\uxxxx - Unicode escape sequence for character with hex value xxxx
\xn[n][n][n] - Unicode escape sequence for character with hex value nnnn (variable length version of \uxxxx)
\Uxxxxxxxx - Unicode escape sequence for character with hex value xxxxxxxx (for generating surrogates)
After hours of tinkering it turns out I've been on a wild goose chase. The problem is very simple. I pulled my UPSERT from a popular SO post. The code is no good: the SELECT will sometimes return more than one row on INSERT, thereby inserting a row and then inserting the same row again.
The fix is to remove the FROM:
--UPSERT
INSERT INTO [table]
SELECT [col1] = @col1, [col2] = @col2, [col3] = @col3, [col4] = @col4
--FROM [table] (don't use FROM.. not a race condition, just a bad SELECT)
WHERE NOT EXISTS
( SELECT 1 FROM [table]
WHERE
[col1] = @col1
AND [col2] = @col2
AND [col3] = @col3)
UPDATE [table]
SET [col4] = @col4
WHERE
[col1] = @col1
AND [col2] = @col2
AND [col3] = @col3
Problem is gone.
Thanks to all of you.
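For reference, here is a sketch of calling that corrected UPSERT from C# with parameters, keeping the question's placeholder table and column names; running both statements in a serializable transaction, as an earlier answer here recommends, also covers the commented race-condition worry:
using System.Data;
using System.Data.SqlClient;

static class Upserter
{
    public static void Upsert(string connectString, string c1, string c2, string c3, string c4)
    {
        const string sql =
            @"INSERT INTO [table]
              SELECT [col1] = @col1, [col2] = @col2, [col3] = @col3, [col4] = @col4
              WHERE NOT EXISTS ( SELECT 1 FROM [table]
                                 WHERE [col1] = @col1 AND [col2] = @col2 AND [col3] = @col3 );
              UPDATE [table] SET [col4] = @col4
              WHERE [col1] = @col1 AND [col2] = @col2 AND [col3] = @col3;";
        using (var conn = new SqlConnection(connectString))
        {
            conn.Open();
            // Serializable blocks a competing insert between the existence
            // check and the insert.
            using (SqlTransaction tx = conn.BeginTransaction(IsolationLevel.Serializable))
            using (var cmd = new SqlCommand(sql, conn, tx))
            {
                cmd.Parameters.AddWithValue("@col1", c1);
                cmd.Parameters.AddWithValue("@col2", c2);
                cmd.Parameters.AddWithValue("@col3", c3);
                cmd.Parameters.AddWithValue("@col4", c4);
                cmd.ExecuteNonQuery();
                tx.Commit();
            }
        }
    }
}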
You are using '\u', which generates a Unicode character.
Your column is a varchar, which does not support Unicode characters; nvarchar would support the character.
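A quick way to see that narrowing from the C# side (a sketch; the substitution mirrors what SQL Server does when Unicode text is stored into a varchar under a code-page collation):
using System;
using System.Text;

class VarcharNarrowingDemo
{
    static void Main()
    {
        string s = "a\u4E2Db";  // U+4E2D (a CJK character) has no mapping in code page 1252
        byte[] narrowed = Encoding.GetEncoding(1252).GetBytes(s);
        // The unmappable character falls back to '?' (0x3F), just as it
        // would when stored in a varchar column.
        Console.WriteLine(BitConverter.ToString(narrowed)); // should print 61-3F-62
    }
}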
