Regex Match string with non-alphanumberic chars from a binary file

Regex Match string with non-alphanumberic chars from a binary file - c#

I am trying to extract some information from a binary file. It looks like this:
AUTHCODE(here goes 3 bytes, that I don't need)part_that_i_need(here goes a NULL byte).
How to I match the portion of alpha-numeric characters qszjlbnkmctkkezgd_qyzkyptqigudilzpkp_qgetefvmigwimrihudk that is between bytes {11} {00} {38} and {00}.
Here's what I've done so far:
string ReadFileMF;
using (StreamReader reader = new StreamReader(pathCopy))
{
ReadFileMF = reader.ReadToEnd();
}
///match the whole string
Match passMF = Regex.Match(ReadFileMF, #"(AUTHCODE).+?(www)");
String passMFs = passMF.Value;
//convert to array of bytes
byte[] bpass = StrToByteArray(passMFs);
//replace the 3 bytes after AUTHCODE with spaces
bpass[8] = 0x20;
bpass[9] = 0x20;
bpass[10] = 0x20;
Ok, so now I have just to match the nullbyte at the end. Somthing like (AUTHCODE).+?(NULL_BYTE). Any ideas?

This might be easiest with a few simple for-loops or Copy() actions over the byte data. Too few specs to be exact. Like:
// untested, I could be off-by-1 somewhere
int start = "AUTHCODE".Length + 3;
int end = text.Indexof('\0', start);
string result = text.SubString(start, end-start);
If you do want/need Regex, you'll have to turn it into a string first. Your only safe bet seems to be ASCII encoding.
string text = Encoding.ASCII.GetString(data);
and then (untested)
Regex.Match(text, "AUTHCODE.{3}([^\0x00]+)\0x00);

Related

Replace() working with hex value

I would like to use the Replace() method but using hex values instead of string value.
I have a programm in C# who write text file.
I don't know why, but when the programm write the '°' (-> Number) it's wrotten Â° ( in hex : C2 B0 instead of B0).
I just would like to patch it, in order to corect this.
Is it possible to do re place in order to replace C2B0 by B0 ? How doing this ?
Thanks a lot :)

Not sure if this is the best solution for your problem but if you want a replace function for a string using hex values this will work:
var newString = HexReplace(sourceString, "C2B0", "B0");
private static string HexReplace(string source, string search, string replaceWith) {
var realSearch = string.Empty;
var realReplace = string.Empty;
if(search.Length % 2 == 1) throw new Exception("Search parameter incorrect!");
for (var i = 0; i < search.Length / 2; i++) {
var hex = search.Substring(i * 2, 2);
realSearch += (char)int.Parse(hex, System.Globalization.NumberStyles.HexNumber);
}
for (var i = 0; i < replaceWith.Length / 2; i++) {
var hex = replaceWith.Substring(i * 2, 2);
realReplace += (char)int.Parse(hex, System.Globalization.NumberStyles.HexNumber);
}
return source.Replace(realSearch, realReplace);
}

C# strings are Unicode. When they are written to a file, an encoding must be applied. The default encoding used by File.WriteAllText is utf-8 with no byte order mark.
The two-byte sequence 0xC2B0 is the representation of the ° degree sign U+00B0 codepoint in utf-8.
To get rid of the 0xC2 part, apply a different encoding, for example latin-1:
var latin1 = Encoding.GetEncoding(1252);
File.WriteAllText(path, text, latin1);
To address the "hex replace" idea of the question: Best practice to remove the utf-8 leading byte from existing files would be to do a ReadAllText with utf-8, followed by a WriteAllText as shown above (or stream chunking if the files are too big to read to memory as a whole).
Single-byte character encodings cannot represent all Unicode characters, so substitution will happen for any such character in your DataTable.
The rendition as Â° must be blamed on the viewer/editor you are using to display the file.
Further reading: https://stackoverflow.com/a/17269952/1132334

C# ByteString to ASCII String

I am looking for a smart way to convert a string of hex-byte-values into a string of 'real text' (ASCII Characters).
For example I have the word "Hello" written in Hexadecimal ASCII: 48 45 4C 4C 4F. And using some method I want to receive the ASCII text of it (in this case "Hello").
// I have this string (example: "Hello") and want to convert it to "Hello".
string strHexa = "48454C4C4F";
// I want to convert the strHexa to an ASCII string.
string strResult = ConvertToASCII(strHexa);
I am sure there is a framework method. If this is not the case of course I could implement my own method.
Thanks!

var str = Encoding.UTF8.GetString(SoapHexBinary.Parse("48454C4C4F").Value); //HELLO
PS: SoapHexBinary is in System.Runtime.Remoting.Metadata.W3cXsd2001 namespace

I am sure there is a framework method.
A a single framework method: No.
However the second part of this: converting a byte array containing ASCII encoded text into a .NET string (which is UTF-16 encoded Unicode) does exist: System.Text.ASCIIEncoding and specifically the method GetString:
string result = ASCIIEncoding.GetString(byteArray);
The First part is easy enough to do yourself: take two hex digits at a time, parse as hex and cast to a byte to store in the array. Seomthing like:
byte[] HexStringToByteArray(string input) {
Debug.Assert(input.Length % 2 == 0, "Must have two digits per byte");
var res = new byte[input.Length/2];
for (var i = 0; i < input.Length/2; i++) {
var h = input.Substring(i*2, 2);
res[i] = Convert.ToByte(h, 16);
}
return res;
}
Edit: Note: L.B.'s answer identifies a method in .NET that will do the first part more easily: this is a better approach that writing it yourself (while in a, perhaps, obscure namespace it is implemented in mscorlib rather than needing an additional reference).

StringBuilder sb = new StringBuilder();
for (int i = 0; i < hexStr.Length; i += 2)
{
string hs = hexStr.Substring(i, 2);
sb.Append(Convert.ToByte(hs, 16));
}

How to get data off of a character

I am working on a project in Unity which uses Assembly C#. I try to get special character such as é, but in the console it just displays a blank character: "". For instance translating "How are you?" Should return "Cómo Estás?", but it returns "Cmo Ests". I put the return string "Cmo Ests" in a character array and realized that it is a non-null blank character. I am using Encoding.UTF8, and when I do:
char ch = '\u00e9';
print (ch);
It will print "é". I have tried getting the bytes off of a given string using:
byte[] utf8bytes = System.Text.Encoding.UTF8.GetBytes(temp);
While translating "How are you?", it will return a byte string, but for the special characters such as é, I get the series of bytes 239, 191, 189, which is a replacement character.
What type of information do I need to retrieve from the characters in order to accurately determining what character it is? Do I need to do something with the information that Google gives me, or is it something else? I am need a general case that I can place in my program and will work for any input string. If anyone can help, it would be greatly appreciated.
Here is the code that is referenced:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using UnityEngine;
using System.Collections;
using System.Net;
using HtmlAgilityPack;
public class Dictionary{
string[] formatParams;
HtmlDocument doc;
string returnString;
char[] letters;
public char[] charString;
public Dictionary(){
formatParams = new string[2];
doc = new HtmlDocument();
returnString = "";
}
public string Translate(String input, String languagePair, Encoding encoding)
{
formatParams[0]= input;
formatParams[1]= languagePair;
string url = String.Format("http://www.google.com/translate_t?hl=en&ie=UTF8&text={0}&langpair={1}", formatParams);
string result = String.Empty;
using (WebClient webClient = new WebClient())
{
webClient.Encoding = encoding;
result = webClient.DownloadString(url);
}
doc.LoadHtml(result);
input = alter (input);
string temp = doc.DocumentNode.SelectSingleNode("//span[#title='"+input+"']").InnerText;
charString = temp.ToCharArray();
return temp;
}
// Use this for initialization
void Start () {
}
string alter(string inputString){
returnString = "";
letters = inputString.ToCharArray();
for(int i=0; i<inputString.Length;i++){
if(letters[i]=='\''){
returnString = returnString + "'";
}else{
returnString = returnString + letters[i];
}
}
return returnString;
}
}

Maybe you should use another API/URL. This function below uses a different url that returns JSON data and seems to work better:
public static string Translate(string input, string fromLanguage, string toLanguage)
{
using (WebClient webClient = new WebClient())
{
string url = string.Format("http://translate.google.com/translate_a/t?client=j&text={0}&sl={1}&tl={2}", Uri.EscapeUriString(input), fromLanguage, toLanguage);
string result = webClient.DownloadString(url);
// I used JavaScriptSerializer but another JSON parser would work
JavaScriptSerializer serializer = new JavaScriptSerializer();
Dictionary<string, object> dic = (Dictionary<string, object>)serializer.DeserializeObject(result);
Dictionary<string, object> sentences = (Dictionary<string, object>)((object[])dic["sentences"])[0];
return (string)sentences["trans"];
}
}
If I run this in a Console App:
Console.WriteLine(Translate("How are you?", "en", "es"));
It will display
¿Cómo estás?

I don't know much about the GoogleTranslate API, but my first thought is that you've got a Unicode Normalization problem.
Have a look at System.String.Normalize() and it's friends.
Unicode is very complicated, so I'll over simplify! Many symbols can be represented in different ways in Unicode, that is: 'é' could be represented as 'é' (one character), or as an 'e' + 'accent character' (two characters), or, depending what comes back from the API, something else altogether.
The Normalize function will convert your string to one with the same Textual meaning, but potentially a different binary value which may fix your output problem.

You actually pretty much have it. Just insert the coded letter with a \u and it works.
string mystr = "C\u00f3mo Est\u00e1s?";

There are several issues with your approach. First of all the UTF8 encoding is a multibyte encoding. This means that if you use any non-ASCII character (having char code > 127), you will get a series of special characters that indicate to the system that this is an Unicode char. So actually your sequence 239, 191, 189 indicates a single character which is not an ASCII character. If you use UTF16, then you get fixed-size encodings (2-byte encodings) which actually map a character to an unsigned short (0-65535).
The char type in c# is a two-byte type, so it is actually an unsigned short. This contrasts with other languages, such as C/C++ where the char type is a 1-byte type.
So in your case, unless you really need to be using byte[] arrays, you should use char[] arrays. Or if you want to encode the characters so that they can be used in HTML, then you can just iterate through the characters and check if the character code is > 128, then you can replace it with the &hex; character code.

I had the same problem working one of my project [Language Resource Localization Translation]
I was doing the same thing and was using.. System.Text.Encoding.UTF8.GetBytes() and because of utf8 encoding was receiving special characters like your
e.g 239, 191, 189 in result string.
please take a look of my solution... hope this helps
Don't Use encoding at all Google translation will return correct like á as it self in the string. do some string manipulation and read the string as it is...
Generic Solution [works for every language translation which google support]
try
{
//Don't use UtF Encoding
// use default webclient encoding
var url = String.Format("http://www.google.com/translate_t?hl=en&text={0}&langpair={1}", "►" + txtNewResourceValue.Text.Trim() + "◄", "en|" + item.Text.Substring(0, 2));
var webClient = new WebClient();
string result = webClient.DownloadString(url); //get all data from google translate in UTF8 coding..
int start = result.IndexOf("id=result_box");
int end = result.IndexOf("id=spell-place-holder");
int length = end - start;
result = result.Substring(start, length);
result = reverseString(result);
start = result.IndexOf(";8669#&");//◄
end = result.IndexOf(";8569#&"); //►
length = end - start;
result = result.Substring(start +7 , length - 8);
objDic2.Text = reverseString(result);
//hard code substring; finding the correct translation within the string.
dictList.Add(objDic2);
}
catch (Exception ex)
{
lblMessages.InnerHtml = "<strong>Google translate exception occured no resource saved..." + ex.Message + "</strong>";
error = true;
}
public static string reverseString(string s)
{
char[] arr = s.ToCharArray();
Array.Reverse(arr);
return new string(arr);
}
as you can see from the code no encoding has been performed and i am sending 2 special key charachters as "►" + txtNewResourceValue.Text.Trim() + "◄"to determine the start and end of the return translation from google.
Also i have checked hough my language utility tool I am getting "Cómo Estás?" when sending
How are you to google translation... :)
Best regards
[Shaz]
---------------------------Edited-------------------------
public string Translate(String input, String languagePair)
{
try
{
//Don't use UtF Encoding
// use default webclient encoding
//input [string to translate]
//Languagepair [eg|es]
var url = String.Format("http://www.google.com/translate_t?hl=en&text={0}&langpair={1}", "►" + input.Trim() + "◄", languagePair);
var webClient = new WebClient();
string result = webClient.DownloadString(url); //get all data from google translate
int start = result.IndexOf("id=result_box");
int end = result.IndexOf("id=spell-place-holder");
int length = end - start;
result = result.Substring(start, length);
result = reverseString(result);
start = result.IndexOf(";8669#&");//◄
end = result.IndexOf(";8569#&"); //►
length = end - start;
result = result.Substring(start + 7, length - 8);
//return transalted string
return reverseString(result);
}
catch (Exception ex)
{
return "Google translate exception occured no resource saved..." + ex.Message";
}
}

how to convert string of unicodes to the char?

i have a text file in which sets of Unicodes are written as
"'\u0641'","'\u064A','\u0649','\u0642','\u0625','\u0644','\u0627','\u0647','\u0631','\u062A','\u0643','\u0645','\u0639','\u0648','\u0623','\u0646','\u0636','\u0635','\u0633','\u0641','\u062D','\u0628','\u0650','\u064E','\u062C','\u0626"
"'\u0622'","'\u062E','\u0644','\u064A','\u0645".
I opened the file and started reading of file by using readline method. I got the above line shown as a line now i want to convert all Unicode to char so that i could get a readable string. i tried some logic but that doesn't worked i stuck with converting string "'\u00641'" to char.

You can extract strings containing individual numbers (using Regex for example), apply Int16.Parse to each and then convert it to a char.
string num = "0641"; // replace it with extracting logic of your preference
char c = (char)Int16.Parse(num, System.Globalization.NumberStyles.HexNumber);

You could parse the line to get each unicode char. To convert unicode to readable character you could do
char MyChar = '\u0058';
Hope this help

What if you do something like this:
string codePoints = "\u0641 \u064A \u0649 \u0642 \u0625";
UnicodeEncoding uEnc = new UnicodeEncoding();
byte[] bytesToWrite = uEnc.GetBytes(codePoints);
System.IO.File.WriteAllBytes(#"yadda.txt", bytesToWrite);
byte[] readBytes = System.IO.File.ReadAllBytes(#"yadda.txt");
string val = uEnc.GetString(readBytes);
//daniel

How to convert an MD5 hash to a string and use it as a file name

I am taking the MD5 hash of an image file and I want to use the hash as a filename.
How do I convert the hash to a string that is valid filename?
EDIT: toString() just gives "System.Byte[]"

How about this:
string filename = BitConverter.ToString(yourMD5ByteArray);
If you prefer a shorter filename without hyphens then you can just use:
string filename =
BitConverter.ToString(yourMD5ByteArray).Replace("-", string.Empty);

System.Convert.ToBase64String
As a commenter pointed out -- normal base 64 encoding can contain a '/' character, which obivously will be a problem with filenames. However, there are other characters that are usable, such as an underscore - just replace all the '/' with an underscore.
string filename = Convert.ToBase64String(md5HashBytes).Replace("/","_");

Try this:
Guid guid = new Guid(md5HashBytes);
string hashString = guid.ToString("N");
// format: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
string hashString = guid.ToString("D");
// format: xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
string hashString = guid.ToString("B");
// format: {xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx}
string hashString = guid.ToString("P");
// format: (xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx)

This is probably the safest for file names. You always get a hex string and never worry about / or +, etc.
byte[] data = md5Hash.ComputeHash(Encoding.UTF8.GetBytes(inputString));
StringBuilder sBuilder = new StringBuilder();
for (int i = 0; i < data.Length; i++)
{
sBuilder.Append(data[i].ToString("x2"));
}
string hashed = sBuilder.ToString();

Try this:
string Hash = Convert.ToBase64String(MD5.Create().ComputeHash(Encoding.UTF8.GetBytes("sample")));
//input "sample" returns = Xo/5v1W6NQgZnSLphBKb5g==
or
string Hash = BitConverter.ToString(MD5.Create().ComputeHash(Encoding.UTF8.GetBytes("sample")));
//input "sample" returns = 5E-8F-F9-BF-55-BA-35-08-19-9D-22-E9-84-12-9B-E6

Technically using Base64 is bad if this is Windows, filenames are case insensitive (at least in explorers view).. but in base64, 'a' is different to 'A', this means that perhaps unlikely but you end up with even higher rate of collision..
A better alternative is hexadecimal like the bitconverter class, or if you can- use base32 encoding (which after removing the padding from both base64 and base32, and in the case of 128bit, will give you similar length filenames).

Try Base32 hash of MD5. It gives filename-safe case insensitive strings.
string Base32Hash(string input)
{
byte[] buf = MD5.Create().ComputeHash(Encoding.UTF8.GetBytes(input));
return String.Join("", buf.Select(b => "abcdefghijklmonpqrstuvwxyz234567"[b & 0x1F]));
}

Develop Reference

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.

Regex Match string with non-alphanumberic chars from a binary file - c#

Related

Replace() working with hex value

C# ByteString to ASCII String

How to get data off of a character

how to convert string of unicodes to the char?

How to convert an MD5 hash to a string and use it as a file name

Categories

Resources