Fast encryption for a large Unicode text file - C#

I have a large Unicode text file (35 MB) containing words separated by punctuation marks. I need to somehow hide the content of the file (at least from the majority of people, who are not specialised in cracking).
The best approach so far seems to be encryption, but I know almost nothing about it. I tried the solution to a similar question, "Simple 2 way encryption for C#", but the encryption takes a long time to run.
What is the fastest algorithm that works out of the box (i.e. is included in the .NET library)? A short example of how to use it would be nice :)
I don't care how strong the encryption is; if you open the encrypted file in a text editor and don't see the words, it's perfect. The important part is speed.

AES is still pretty fast; here's some help implementing it: Using AES encryption in C#
Anything other than industry-standard encryption is asking for problems sooner or later.
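Since the question asks for a short example, here is a minimal sketch of streaming a file through AES with the classes in System.Security.Cryptography. The class name AesFileSketch and the method Transform are made up for illustration, and how the key and IV are generated and stored is left open; treat it as a starting point rather than a vetted implementation.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

// Hypothetical helper, not part of .NET: copies one stream to another through AES.
public static class AesFileSketch
{
    // key must be 16, 24 or 32 bytes long; iv must be 16 bytes long.
    // The same key and IV are needed again to decrypt.
    public static void Transform(Stream input, Stream output, byte[] key, byte[] iv, bool encrypt)
    {
        using (Aes aes = Aes.Create())
        {
            aes.Key = key;
            aes.IV = iv;
            using (ICryptoTransform transform = encrypt ? aes.CreateEncryptor() : aes.CreateDecryptor())
            using (var crypto = new CryptoStream(output, transform, CryptoStreamMode.Write))
            {
                // Data is processed in chunks, so the whole 35 MB file
                // never has to sit in memory as one string.
                input.CopyTo(crypto);
            }
            // Disposing the CryptoStream writes the final padded block
            // and also closes the output stream.
        }
    }
}
```

For files, input and output would typically come from File.OpenRead and File.Create. Working on bytes rather than on a decoded string also avoids any Unicode conversion step.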

What have you tried so far? Are standard encryptions like AES and blowfish too slow?
You can always do something simple like xor-ing the contents against some pass-code repeated to the same length as the file.

As tilleryj said, xor-ing the contents against some pass-code repeated to the same length as the file is simple and fast, but it is less safe than other encryption types.
I wrote a simple class that encrypts a string with the XOR method, using another string as the password. I hope someone else can use it.
using System;
using System.Text;

namespace MyEncriptionNameSpace
{
    // XORs every character of a string with a repeating password.
    class XorStringEncripter
    {
        private readonly string __passWord;

        public XorStringEncripter(string password)
        {
            if (string.IsNullOrEmpty(password))
            {
                throw new ArgumentException("invalid password", "password");
            }
            __passWord = password;
        }

        public string encript(string stringToEncript)
        {
            return __encript(stringToEncript);
        }

        public string decript(string encripTedString)
        {
            return __encript(encripTedString);
        }

        // XOR is its own inverse, so one routine serves both directions.
        private string __encript(string stringToEncript)
        {
            var encriptedStringBuilder = new StringBuilder(stringToEncript.Length);
            int positionInPassword = 0;
            for (int i = 0; i < stringToEncript.Length; i++)
            {
                __corectPositionInPassWord(ref positionInPassword);
                encriptedStringBuilder.Append((char)((int)stringToEncript[i] ^ (int)__passWord[positionInPassword]));
                ++positionInPassword;
            }
            return encriptedStringBuilder.ToString();
        }

        // Wraps around to the start of the password once its end is reached.
        private void __corectPositionInPassWord(ref int positionInPassword)
        {
            if (positionInPassword == __passWord.Length)
            {
                positionInPassword = 0;
            }
        }
    }
}
Actually, encript and decript do the same thing; I provided both to avoid confusion over using the same function for both encryption and decryption. This works because if you XOR a number A with B and obtain C, then XOR-ing C with B gives back A.
A xor B = C ---> C xor B = A
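For a 35 MB file it may be simpler to apply the same idea to raw bytes in a stream instead of to a string. The following is only a sketch under that assumption; XorStreamSketch and Transform are hypothetical names, and the result is obfuscation rather than real encryption.

```csharp
using System;
using System.IO;

// Hypothetical helper: XORs a stream against a repeating key, chunk by chunk.
public static class XorStreamSketch
{
    public static void Transform(Stream input, Stream output, byte[] key)
    {
        if (key == null || key.Length == 0)
        {
            throw new ArgumentException("key must not be empty", nameof(key));
        }

        var buffer = new byte[81920];
        long position = 0; // absolute offset, so the key keeps cycling across chunks
        int read;
        while ((read = input.Read(buffer, 0, buffer.Length)) > 0)
        {
            for (int i = 0; i < read; i++)
            {
                buffer[i] ^= key[(int)((position + i) % key.Length)];
            }
            output.Write(buffer, 0, read);
            position += read;
        }
    }
}
```

Running Transform a second time with the same key restores the original bytes, for the reason given above.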

Related

Removing punctuation from an extremely long string

I'm working on a book encryption program for one of my courses and I've run into a problem. Our professor gave us the example of using, say, Pride and Prejudice as the book used to encrypt, so I chose that one to test my program. The function I'm currently using to remove the punctuation from the string takes so long that the program is forced into break mode. It works for smaller strings, even ones that are pages long, but when I feed it Pride and Prejudice it takes way too long.
public void removePunctuation(ref string s) {
    string result = "";
    for (int i = 0; i < s.Length; i++) {
        if (Char.IsWhiteSpace(s[i])) {
            result += ' ';
        } else if (!Char.IsLetter(s[i]) && !Char.IsNumber(s[i])) {
            // do nothing
        } else {
            result += s[i];
        }
    }
    s = result;
}
So I think I need a faster way to remove punctuation from this string if anyone has any suggestions? I know looping through every character is horrible, but I'm stumped and I was never taught Regex in depth.
Edit: I was asked how I was storing the string in the dictionary class! This is the constructor for another class that actually uses the formatted string.
public CodeBook(string book)
{
    BookMap = new Dictionary<string, List<int>>();
    Key = book.Split(null).ToList(); // split string into words
    foreach(string s in Key)
    {
        if (!BookMap.Keys.Contains(s))
        {
            BookMap.Add(s, Enumerable.Range(0, Key.Count).Where(i => Key[i] == s).ToList());
            // add word and add list of occurrences of word
        }
    }
}

This is slow because you construct string by concatenations in a loop. You have several approaches that are more performant:
Use StringBuilder - unlike string concatenation which constructs a new object each time you add a character, this approach expands the string under construction by larger chunks, preventing excessive garbage creation.
Use LINQ's filtering with Where - this approach constructs an array of chars in a single shot, then constructs a single string from it.
Use regular expression's Replace - this method is optimized to deal with strings of virtually unlimited sizes.
Roll your own algorithm - create an array of chars that corresponds to the length of the original string. Walk through the string, and add the characters that you wish to keep to the array. Use string's constructor that takes the array, the initial index, and the length to construct the string at once.
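Two of the approaches listed above, the hand-rolled char array and the LINQ filter, could look roughly like this. It is a sketch that keeps the question's rules (whitespace becomes a space, letters and numbers are kept, everything else is dropped); the class and method names are invented for the example.

```csharp
using System;
using System.Linq;

// Hypothetical helper illustrating two of the approaches described above.
public static class PunctuationSketch
{
    // "Roll your own": fill a preallocated char array, then build the string once.
    public static string RemoveWithArray(string s)
    {
        var kept = new char[s.Length];
        int count = 0;
        foreach (char c in s)
        {
            if (char.IsWhiteSpace(c))
            {
                kept[count++] = ' ';
            }
            else if (char.IsLetter(c) || char.IsNumber(c))
            {
                kept[count++] = c;
            }
        }
        // Only the first 'count' slots were filled.
        return new string(kept, 0, count);
    }

    // LINQ: filter the characters, normalise whitespace, build the string once.
    public static string RemoveWithLinq(string s)
    {
        char[] kept = s
            .Where(c => char.IsWhiteSpace(c) || char.IsLetter(c) || char.IsNumber(c))
            .Select(c => char.IsWhiteSpace(c) ? ' ' : c)
            .ToArray();
        return new string(kept);
    }
}
```

Both versions touch each character once and allocate the result a single time, which is the property the slow concatenation loop lacks.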

Looping through every character once is not that bad. You're doing it all in one pass, that's not trivial to avoid.
The problem lies in the fact that the framework will need to allocate a new copy of the (partial) string whenever you do something like
result += s[i];
You can avoid that by using a StringBuilder to append the non-punctuation characters as you go.
public string removePunctuation(string s)
{
    var result = new StringBuilder();
    for (int i = 0; i < s.Length; i++) {
        if (Char.IsWhiteSpace(s[i])) {
            result.Append(" ");
        } else if (!Char.IsLetter(s[i]) && !Char.IsNumber(s[i])) {
            // do nothing
        } else {
            result.Append(s[i]);
        }
    }
    return result.ToString();
}
You could further reduce the number of necessary Append calls with a refined algorithm, for example by looking ahead to the next punctuation mark and appending larger portions at once, or by using an existing string manipulation facility like Regex. But the introduction of StringBuilder above should already give you a noticeable performance gain.
"I was never taught Regex in depth"
Use the search provider of your choice; you may end up with a tested solution which you can just study and use: https://stackoverflow.com/a/5871826/1132334

You can use Regex (from System.Text.RegularExpressions) to remove punctuation as below.
public string removePunctuation(string s)
{
    string result = Regex.Replace(s, @"[^\w\s]", "");
    return result;
}
^ Means: at the start of a bracketed set, match any character that is not in the set.
\w Means: word characters (letters, digits and the underscore).
\s Means: whitespace characters.

I need a clean way to identify record types based on contents

I have strings of eight characters like the ones below. The number of trailing zeros identifies each one as a pr, pa, fo or it record:
01020304
01020300
01020000
01000000
I already coded the following but it looks clumsy to me.
if ( id.Substring(2) == "000000") {
    // pr record
} else if ( id.Substring(4) == "0000") {
    // pa record
} else if ( id.Substring(6) == "00") {
    // fo record
} else {
    // it record
}
Can anyone think of a cleaner way to code this?

Not massively different to what you've got, just a bit more readable IMO.
const string PR = "000000";
const string PA = "0000";
const string FO = "00";
if (id.EndsWith(PR))
{
    // pr record
}
else if (id.EndsWith(PA))
{
    // pa record
}
else if (id.EndsWith(FO))
{
    // fo record
}
else
{
    // it record
}

How about branching on value.TrimEnd('0').Length?

Nothing wrong with testing the strings, however:
Substring probably produces a new string object each time.
Is there no EndsWith in C#?
It is fairly easy to write a function that counts the number of trailing zeros. Then you could do a switch(trailingZeros(id)) { case 0: ... }
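The trailing-zero idea could be sketched as follows. RecordTypeSketch, TrailingZeros and Classify are illustrative names, and returning the type as a string is an assumption made only to keep the example short; the switch works on pairs of zeros so that it gives the same answer as the original Substring tests for eight-character ids.

```csharp
using System;

// Hypothetical helper: classifies an id by its number of trailing zeros.
public static class RecordTypeSketch
{
    public static int TrailingZeros(string id)
    {
        int count = 0;
        for (int i = id.Length - 1; i >= 0 && id[i] == '0'; i--)
        {
            count++;
        }
        return count;
    }

    public static string Classify(string id)
    {
        // Integer division groups the count into whole pairs of zeros.
        switch (TrailingZeros(id) / 2)
        {
            case 4:
            case 3:
                return "pr"; // at least six trailing zeros
            case 2:
                return "pa"; // four or five trailing zeros
            case 1:
                return "fo"; // two or three trailing zeros
            default:
                return "it"; // fewer than two trailing zeros
        }
    }
}
```

No intermediate strings are allocated, which addresses the concern about Substring above.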

You may want to take a look at the Filehelpers library -- it's got all sorts of infrastructure to read and process records, including a way to determine different record types -- see "Multirecords".
Note: the main Filehelpers website is based on the 2.0 release of this library. A newer version exists in a public repository. In either case, I'd recommend grabbing the source code for the library, as I haven't seen a ton of activity in terms of new development for this library.

Hash Digest / Array Comparison in C#

I'm writing an application that needs to verify HMAC-SHA256 checksums. The code I currently have looks something like this:
static bool VerifyIntegrity(string secret, string checksum, string data)
{
    // Verify HMAC-SHA256 Checksum
    byte[] key = System.Text.Encoding.UTF8.GetBytes(secret);
    byte[] value = System.Text.Encoding.UTF8.GetBytes(data);
    byte[] checksum_bytes = System.Text.Encoding.UTF8.GetBytes(checksum);
    using (var hmac = new HMACSHA256(key))
    {
        byte[] expected_bytes = hmac.ComputeHash(value);
        return checksum_bytes.SequenceEqual(expected_bytes);
    }
}
I know that this is susceptible to timing attacks.
Is there a message digest comparison function in the standard library? I realize I could write my own time hardened comparison method, but I have to believe that this is already implemented elsewhere.

EDIT: Original answer is below - still worth reading IMO, but regarding the timing attack...
The page you referenced gives some interesting points about compiler optimizations. Given that you know the two byte arrays will be the same length (assuming the size of the checksum isn't particularly secret, you can immediately return if the lengths are different) you might try something like this:
public static bool CompareArraysExhaustively(byte[] first, byte[] second)
{
    if (first.Length != second.Length)
    {
        return false;
    }
    bool ret = true;
    for (int i = 0; i < first.Length; i++)
    {
        ret = ret & (first[i] == second[i]);
    }
    return ret;
}
Now that still won't take the same amount of time for all inputs - if the two arrays are both in L1 cache for example, it's likely to be faster than if it has to be fetched from main memory. However, I suspect that is unlikely to cause a significant issue from a security standpoint.
Is this okay? Who knows. Different processors and different versions of the CLR may take different amounts of time for an & operation depending on the two operands. Basically this is the same as the conclusion of the page you referenced - that it's probably as good as we'll get in a portable way, but that it would require validation on every platform you try to run on.
At least the above code only uses relatively simple operations. I would personally avoid using LINQ operations here as there could be sneaky optimizations going on in some cases. I don't think there would be in this case - or they'd be easy to defeat - but you'd at least have to think about them. With the above code, there's at least a reasonably close relationship between the source code and IL - leaving "only" the JIT compiler and processor optimizations to worry about :)
Original answer
There's one significant problem with this: in order to provide the checksum, you have to have a string whose UTF-8 encoded form is the same as the checksum. There are plenty of byte sequences which simply don't represent UTF-8-encoded text. Basically, trying to encode arbitrary binary data as text using UTF-8 is a bad idea.
Base64, on the other hand, is basically designed for this:
static bool VerifyIntegrity(string secret, string checksum, string data)
{
    // Verify HMAC-SHA256 Checksum
    byte[] key = Encoding.UTF8.GetBytes(secret);
    byte[] value = Encoding.UTF8.GetBytes(data);
    byte[] checksumBytes = Convert.FromBase64String(checksum);
    using (var hmac = new HMACSHA256(key))
    {
        byte[] expectedBytes = hmac.ComputeHash(value);
        return checksumBytes.SequenceEqual(expectedBytes);
    }
}
On the other hand, instead of using SequenceEqual on the byte array, you could Base64 encode the actual hash, and see whether that matches:
static bool VerifyIntegrity(string secret, string checksum, string data)
{
    // Verify HMAC-SHA256 Checksum
    byte[] key = Encoding.UTF8.GetBytes(secret);
    byte[] value = Encoding.UTF8.GetBytes(data);
    using (var hmac = new HMACSHA256(key))
    {
        return checksum == Convert.ToBase64String(hmac.ComputeHash(value));
    }
}
I don't know of anything better within the framework. It wouldn't be too hard to write a specialized SequenceEqual operator for arrays (or general ICollection<T> implementations) which checked for equal lengths first... but given that the hashes are short, I wouldn't worry about that.

If you're worried about the timing of the SequenceEqual, you could always replace it with something like this:
checksum_bytes.Zip( expected_bytes, (a,b) => a == b ).Aggregate( true, (a,r) => a && r );
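For arrays of equal length this returns the same result as SequenceEqual, but it always checks every element before giving an answer, so there is less chance of revealing anything through a timing attack. Note that Zip stops at the end of the shorter sequence, so the lengths need to be compared separately.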
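On newer runtimes (.NET Core 2.1 and later, which is an assumption about the target platform and not something the question states) the framework does ship a fixed-time comparison: CryptographicOperations.FixedTimeEquals. A sketch of the verification built on it follows; HmacCheckSketch is a made-up name, and the checksum is assumed to arrive as Base64 text, as suggested in the answer above.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

// Hypothetical helper: verifies an HMAC-SHA256 checksum with a fixed-time comparison.
public static class HmacCheckSketch
{
    // checksumBase64 is assumed to be the Base64 encoding of the expected HMAC bytes.
    public static bool VerifyIntegrity(string secret, string checksumBase64, string data)
    {
        byte[] key = Encoding.UTF8.GetBytes(secret);
        byte[] value = Encoding.UTF8.GetBytes(data);

        byte[] supplied;
        try
        {
            supplied = Convert.FromBase64String(checksumBase64);
        }
        catch (FormatException)
        {
            return false; // not valid Base64, so it cannot match
        }

        using (var hmac = new HMACSHA256(key))
        {
            byte[] expected = hmac.ComputeHash(value);
            // Returns false for different lengths; for equal lengths the running
            // time does not depend on where the first differing byte is.
            return CryptographicOperations.FixedTimeEquals(supplied, expected);
        }
    }
}
```

On older frameworks without this class, a hand-written loop like CompareArraysExhaustively remains the fallback.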

How is it susceptible to timing attacks? Your code takes the same amount of time whether the digest is valid or invalid. And calculating the digest and then comparing it looks like the easiest way to check this.

.NET Regular expressions on bytes instead of chars

I'm trying to do some parsing that will be easier using regular expressions.
The input is an array (or enumeration) of bytes.
I don't want to convert the bytes to chars for the following reasons:
Computation efficiency
Memory consumption efficiency
Some non-printable bytes might be complex to convert to chars. Not all the bytes are printable.
So I can't use Regex.
The only solution I know of is Boost.Regex (which works on bytes, i.e. C chars), but this is a C++ library, and wrapping it with C++/CLI would take considerable work.
How can I use regular expressions on bytes in .NET directly, without working with .NET strings and chars?
Thank you.

There is a bit of impedance mismatch going on here. You want to work with Regular expressions in .Net which use strings (multi-byte characters), but you want to work with single byte characters. You can't have both at the same time using .Net as per usual.
However, to break this mismatch down, you could deal with a string in a byte oriented fashion and mutate it. The mutated string can then act as a re-usable buffer. In this way you will not have to convert bytes to chars, or convert your input buffer to a string (as per your question).
An example:
//BLING
byte[] inputBuffer = { 66, 76, 73, 78, 71 };
string stringBuffer = new string('\0', 1000);
Regex regex = new Regex("ING", RegexOptions.Compiled);
unsafe
{
    fixed (char* charArray = stringBuffer)
    {
        byte* buffer = (byte*)(charArray);
        //Hard-coded example of string mutation, in practice you would
        //loop over your input buffers and regex\match so that the string
        //buffer is re-used.
        buffer[0] = inputBuffer[0];
        buffer[2] = inputBuffer[1];
        buffer[4] = inputBuffer[2];
        buffer[6] = inputBuffer[3];
        buffer[8] = inputBuffer[4];
        Console.WriteLine("Mutated string:'{0}'.",
            stringBuffer.Substring(0, inputBuffer.Length));
        Match match = regex.Match(stringBuffer, 0, inputBuffer.Length);
        Console.WriteLine("Position:{0} Length:{1}.", match.Index, match.Length);
    }
}
Using this technique you can allocate a string "buffer" which can be re-used as the input to Regex, but you can mutate it with your bytes each time. This avoids the overhead of converting\encoding your byte array into a new .Net string each time you want to do a match. This could prove to be very significant as I have seen many an algorithm in .Net try to go at a million miles an hour only to be brought to its knees by string generation and the subsequent heap spamming and time spent in GC.
Obviously this is unsafe code, but it is .Net.
The results of the Regex will generate strings though, so you have an issue here. I'm not sure if there is a way of using Regex that will not generate new strings. You can certainly get at the match index and length information but the string generation violates your requirements for memory efficiency.
Update
Actually after disassembling Regex\Match\Group\Capture, it looks like it only generates the captured string when you access the Value property, so you may at least not be generating strings if you only access index and length properties. However, you will be generating all the supporting Regex objects.

Well, if I faced this problem, I would write the C++/CLI wrapper, except I'd create specialized code for what I want to achieve. The wrapper could eventually be developed over time to do general things, but this is just an option.
The first step is to wrap the Boost::Regex input and output only. Create specialized functions in C++ that do all the stuff you want and use CLI just to pass the input data to the C++ code and then fetch the result back with the CLI. This doesn't look to me like too much work to do.
Update:
Let me try to clarify my point. Even though I may be wrong, I believe you won't be able to find any .NET binary Regex implementation that you could use. That is why, whether you like it or not, you will be forced to choose between a CLI wrapper and a bytes-to-chars conversion to use .NET's Regex. In my opinion the wrapper is the better choice, because it will work faster. I did not do any benchmarking; this is just an assumption based on:
Using the wrapper you just have to cast the pointer type (bytes <-> chars).
Using .NET's Regex you have to convert each byte of the input.
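For comparison, the bytes-to-chars route mentioned above can be sketched with the ISO-8859-1 (Latin-1) encoding, which maps every byte value 0x00 to 0xFF onto the char with the same code, so non-printable bytes survive and can be written in a pattern as \xNN escapes. It does pay the conversion and memory cost the question wants to avoid; ByteRegexSketch and IndexOfPattern are hypothetical names.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper: runs a .NET Regex over bytes via a one-to-one byte/char mapping.
public static class ByteRegexSketch
{
    // ISO-8859-1 maps each byte 0x00-0xFF to the char with the same code point.
    private static readonly Encoding Latin1 = Encoding.GetEncoding("ISO-8859-1");

    // Returns the byte offset of the first match, or -1 if there is none.
    // Because the mapping is one char per byte, match indexes are byte offsets.
    public static int IndexOfPattern(byte[] data, string pattern)
    {
        string text = Latin1.GetString(data);
        Match match = Regex.Match(text, pattern);
        return match.Success ? match.Index : -1;
    }
}
```

A call such as ByteRegexSketch.IndexOfPattern(bytes, @"\x07\xE8") would then search for the byte pair 07 E8.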

As an alternative to using unsafe, just consider writing a simple, recursive comparer like:
static bool Evaluate(byte[] data, byte[] sequence, int dataIndex=0, int sequenceIndex=0)
{
    if (sequence[sequenceIndex] == data[dataIndex])
    {
        if (sequenceIndex == sequence.Length - 1)
            return true;
        else if (dataIndex == data.Length - 1)
            return false;
        else
            return Evaluate(data, sequence, dataIndex + 1, sequenceIndex + 1);
    }
    else
    {
        // Restart one byte after where the current attempt began, so that
        // overlapping candidates (e.g. "AB" inside "AAB") are not skipped.
        int nextStart = dataIndex - sequenceIndex + 1;
        if (nextStart < data.Length)
            return Evaluate(data, sequence, nextStart, 0);
        else
            return false;
    }
}
You could improve efficiency in a number of ways (e.g. seeking the first byte match instead of iterating, etc.), and since the recursion depth grows with the length of the data this only suits small inputs, but this could get you started... hope it helps.

I personally took a different approach and wrote a small state machine that can be extended. I believe that for parsing protocol data this is much more readable than regex.
bool ParseUDSResponse(PassThruMsg rxMsg, UDScmd.Mode txMode, byte txSubFunction, out UDScmd.Response functionResponse, out byte[] payload)
{
    payload = new byte[0];
    functionResponse = UDScmd.Response.UNKNOWN;
    bool positiveReponse = false;
    var rxMsgBytes = rxMsg.GetBytes();
    //Iterate the reply bytes to find the echoed ECU index, response code, function response and payload data if there is any
    //If we could use some kind of HEX regex this would be a bit neater
    //Iterate until we get past any and all null padding
    int stateMachine = 0;
    for (int i = 0; i < rxMsgBytes.Length; i++)
    {
        switch (stateMachine)
        {
            case 0:
                if (rxMsgBytes[i] == 0x07) stateMachine = 1;
                break;
            case 1:
                if (rxMsgBytes[i] == 0xE8) stateMachine = 2;
                else return false;
                break; //C# does not allow falling through into the next case
            case 2:
                if (rxMsgBytes[i] == (byte)txMode + (byte)OBDcmd.Reponse.SUCCESS)
                {
                    //Positive response to the requested mode
                    positiveReponse = true;
                }
                else if(rxMsgBytes[i] != (byte)OBDcmd.Reponse.NEGATIVE_RESPONSE)
                {
                    //This is an invalid response, give up now
                    return false;
                }
                stateMachine = 3;
                break;
            case 3:
                functionResponse = (UDScmd.Response)rxMsgBytes[i];
                if (positiveReponse && rxMsgBytes[i] == txSubFunction)
                {
                    //We have a positive response and a positive subfunction code (subfunction is reflected)
                    int payloadLength = rxMsgBytes.Length - i;
                    if(payloadLength > 0)
                    {
                        payload = new byte[payloadLength];
                        Array.Copy(rxMsgBytes, i, payload, 0, payloadLength);
                    }
                    return true;
                } else
                {
                    //We had a positive response but a negative subfunction error
                    //we return the function error code so it can be relayed
                    return false;
                }
            default:
                return false;
        }
    }
    return false;
}

Convert C# CUSTOM getSHA512 function into Ruby

I was wondering if someone could help me to get this method converted to ruby, is this possible at all?
public static string getSHA512(string str){
    UnicodeEncoding UE = new UnicodeEncoding();
    byte[] HashValue = null;
    byte[] MessageBytes = UE.GetBytes(str);
    System.Security.Cryptography.SHA512Managed SHhash = new System.Security.Cryptography.SHA512Managed();
    string strHex = "";
    HashValue = SHhash.ComputeHash(MessageBytes);
    foreach (byte b in HashValue){
        strHex += string.Format("{0:x2}", b);
    }
    return strHex;
}
Thanks in advance
UPDATE:
I would just like to make it clear that unfortunately this method is not plain SHA512 generation but a custom one. I believe that Digest::SHA512.hexdigest would be the equivalent of just the SHhash instance, but if you look carefully at the method you can see that it differs a bit from a simple hash generation.
The results of both functions follow.
# in C#
getSHA512("hello") => "5165d592a6afe59f80d07436e35bd513b3055429916400a16c1adfa499c5a8ce03a370acdd4dc787d04350473bea71ea8345748578fc63ac91f8f95b6c140b93"
# in Ruby
Digest::SHA512.hexdigest("hello") || Digest::SHA2 => "9b71d224bd62f3785d96d46ad3ea3d73319bfbc2890caadae2dff72519673ca72323c3d99ba5c11d7c7acc6e14b8c5da0c4663475c2e5c3adef46f73bcdec043"

require 'digest/sha2'
class String
  def sha512
    Digest::SHA2.new(512).hexdigest(encode('UTF-16LE'))
  end
end
'hello'.sha512 # => '5165d592a6afe59f80d07436e35bd…5748578fc63ac91f8f95b6c140b93'
As with all my code snippets on StackOverflow, I always assume the latest version of Ruby. Here's one that also works with Ruby 1.8:
require 'iconv'
require 'digest/sha2'
class String
  def sha512(src_enc='UTF-8')
    # Iconv.conv takes the target encoding first, then the source encoding.
    Digest::SHA2.new(512).hexdigest(Iconv.conv('UTF-16LE', src_enc, self))
  end
end
'hello'.sha512 # => '5165d592a6afe59f80d07436e35bd…5748578fc63ac91f8f95b6c140b93'
Note that in this case, you have to know and tell Ruby about the encoding the string is in explicitly. In Ruby 1.9, Ruby always knows what encoding a string is in, and will convert it accordingly, when required. I chose UTF-8 as default encoding because it is backwards-compatible with ASCII, is the standard encoding on the internet and also otherwise widely used. However, for example both .NET and Java use UTF-16LE, not UTF-8. If your string is not UTF-8 or ASCII-encoded, you will have to pass in the encoding name into the sha512 method.
Off-Topic: 9 lines of code reduced to 1. I love Ruby!
Well, actually that is a little bit unfair. You could have written it something like this:
var messageBytes = new UnicodeEncoding().GetBytes(str);
var hashValue = new System.Security.Cryptography.SHA512Managed().
ComputeHash(messageBytes);
return hashValue.Aggregate<byte, string>("",
(s, b) => s += string.Format("{0:x2}", b)
);
Which is really only 3 lines (broken into 5 for StackOverflow's layout) and most importantly gets rid of that ugly 1950s-style explicit for loop for a nice 1960s-style fold (aka. reduce aka. inject aka. Aggregate aka. inject:into: … it's all the same).
There's probably an even more elegant way to write this, but a) I don't actually know C# and .NET and b) this question is about Ruby. Focus, Jörg, focus! :-)
Aaand … found it:
return string.Join("", from b in hashValue select string.Format("{0:x2}", b));
I knew there had to be an equivalent to Ruby's Enumerable#join somewhere, I just was looking in the wrong place.

Use the Digest::SHA2 class.
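The difference between the two digests in the question comes down to which bytes are hashed: UnicodeEncoding produces UTF-16LE, two bytes per character for "hello", while a UTF-8 encoder produces one. A small C# sketch makes this visible; EncodingHashSketch and Sha512Hex are names invented for the illustration.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

// Hypothetical helper: hashes the same text under a caller-chosen encoding.
public static class EncodingHashSketch
{
    public static string Sha512Hex(string text, Encoding encoding)
    {
        byte[] bytes = encoding.GetBytes(text);
        using (SHA512 sha = SHA512.Create())
        {
            byte[] hash = sha.ComputeHash(bytes);
            // BitConverter yields "AB-CD-..."; strip the dashes and lower-case it.
            return BitConverter.ToString(hash).Replace("-", "").ToLowerInvariant();
        }
    }
}
```

Calling Sha512Hex("hello", Encoding.Unicode) should reproduce the C# digest from the question, and Sha512Hex("hello", Encoding.UTF8) the one Ruby printed for the unconverted string.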
