How to remove specific characters with Encoding.Convert - C#

I'm sending signed XML via WebClient to a gateway. Now I have to ensure that the node values only contain German letters. I have two test words. The first one converts just fine using:
string foreignString = "Łůj꣥ü";
Encoding utf8 = Encoding.UTF8;
Encoding iso = Encoding.GetEncoding("ISO-8859-1");
byte[] utfBytes = Encoding.Convert(iso, utf8, iso.GetBytes(foreignString));
string result = utf8.GetString(utfBytes);
But the second string contains a character that is also part of the UTF-8 encoding:
ç (Latin small letter c with cedilla)
After testing a bit with other encodings I always got the same result: the character was still there. Which makes sense, because it is part of the UTF-8 table :)
So my question is: is there a way to mask out all the French, Portuguese and Spanish characters without dropping the German umlauts?
Thanks in advance!

You can create your own Encoding class based on the ISO-8859-1 encoding with your additional special rules:
class GermanEncoding : Encoding {
    static readonly Encoding iso88591Encoding = Encoding.GetEncoding("ISO-8859-1");
    static readonly Dictionary<Char, Char> charMappingTable = new Dictionary<Char, Char> {
        { 'À', 'A' },
        { 'Á', 'A' },
        { 'Â', 'A' },
        { 'ç', 'c' },
        // Add more mappings
    };
    static readonly Dictionary<Byte, Byte> byteMappingTable = charMappingTable
        .ToDictionary(kvp => MapCharToByte(kvp.Key), kvp => MapCharToByte(kvp.Value));

    public override Int32 GetByteCount(Char[] chars, Int32 index, Int32 count) {
        return iso88591Encoding.GetByteCount(chars, index, count);
    }

    public override Int32 GetBytes(Char[] chars, Int32 charIndex, Int32 charCount, Byte[] bytes, Int32 byteIndex) {
        var count = iso88591Encoding.GetBytes(chars, charIndex, charCount, bytes, byteIndex);
        // Replace the bytes of unwanted characters by the bytes of their mapped replacements.
        for (var i = byteIndex; i < byteIndex + count; ++i)
            if (byteMappingTable.ContainsKey(bytes[i]))
                bytes[i] = byteMappingTable[bytes[i]];
        return count;
    }

    public override Int32 GetCharCount(Byte[] bytes, Int32 index, Int32 count) {
        return iso88591Encoding.GetCharCount(bytes, index, count);
    }

    public override Int32 GetChars(Byte[] bytes, Int32 byteIndex, Int32 byteCount, Char[] chars, Int32 charIndex) {
        return iso88591Encoding.GetChars(bytes, byteIndex, byteCount, chars, charIndex);
    }

    public override Int32 GetMaxByteCount(Int32 charCount) {
        return iso88591Encoding.GetMaxByteCount(charCount);
    }

    public override Int32 GetMaxCharCount(Int32 byteCount) {
        return iso88591Encoding.GetMaxCharCount(byteCount);
    }

    static Byte MapCharToByte(Char c) {
        // NOTE: Assumes that each character encodes as a single byte.
        return iso88591Encoding.GetBytes(new[] { c })[0];
    }
}
This encoding is based on the fact that you want to use the ISO-8859-1 encoding with some additional restrictions, where you want to map "non-German" characters to their ASCII equivalents. The built-in ISO-8859-1 encoding already knows how to map Ł to L, and because ISO-8859-1 is a single-byte character set you can do the additional mapping on the bytes, since each byte corresponds to exactly one character. This is done in the GetBytes method.
You can "clean" a string using this code:
var encoding = new GermanEncoding();
string foreignString = "Łůj꣥üç";
var bytes = encoding.GetBytes(foreignString);
var result = encoding.GetString(bytes);
The resulting string is LujeLAüc.
Note that the implementation is quite simplistic: it uses a dictionary to perform the additional mapping step on the bytes. If that turns out to be too slow you can consider alternatives like a 256-entry byte mapping array. Also, you need to expand charMappingTable to contain all the additional mappings you want to perform.
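For reference, a minimal sketch of what such a 256-entry lookup table could look like inside the class above (the names byteMappingArray and BuildByteMappingArray are illustrative, not part of the original answer):
static readonly Byte[] byteMappingArray = BuildByteMappingArray(); // declare after charMappingTable so the static initializers run in order

static Byte[] BuildByteMappingArray() {
    var map = new Byte[256];
    for (var i = 0; i < map.Length; ++i)
        map[i] = (Byte)i;                                      // identity mapping by default
    foreach (var kvp in charMappingTable)
        map[MapCharToByte(kvp.Key)] = MapCharToByte(kvp.Value);
    return map;
}

// In GetBytes, the dictionary lookup then becomes a plain array index:
// bytes[i] = byteMappingArray[bytes[i]];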

Related

Helpful byte array extensions to handle BigEndian data

I recently came across the issue of needing to frequently (and easily) convert and extract bytes from a big-endian byte array that was being received over TCP. For those unfamiliar with big/little endian, the short version is that each number was sent as MSB then LSB.
When processing and extracting values from the byte array using the default BitConverter on a little-endian system (read: just about any C#/Windows machine), this is a problem.
Here are some handy extension methods that I created in my project and thought I would share with fellow readers. The trick here is to only reverse the bytes required for the requested data size.
As a bonus, the byte[] extension methods make the code tidy (in my opinion).
Comments & improvements are welcome.
Syntax to use:
var bytes = new byte[] { 0x01, 0x02, 0x03, 0x04 };
var uint16bigend = bytes.GetUInt16BigE(2); // MSB-LSB ... 03-04 = 772
var uint16bitconv = bytes.GetUInt16(2); // LSB-MSB ... 04-03 = 1027
Here is the class with the initial set of extensions. Easy for readers to extend & customize:
// Requires using System; and using System.Linq; (for Skip/Take/Reverse).
public static class BitConverterExtensions
{
    public static UInt16 GetUInt16BigE(this byte[] bytes, int startIndex)
    {
        return BitConverter.ToUInt16(bytes.Skip(startIndex).Take(2).Reverse().ToArray(), 0);
    }

    public static UInt32 GetUInt32BigE(this byte[] bytes, int startIndex)
    {
        return BitConverter.ToUInt32(bytes.Skip(startIndex).Take(4).Reverse().ToArray(), 0);
    }

    public static Int16 GetInt16BigE(this byte[] bytes, int startIndex)
    {
        return BitConverter.ToInt16(bytes.Skip(startIndex).Take(2).Reverse().ToArray(), 0);
    }

    public static Int32 GetInt32BigE(this byte[] bytes, int startIndex)
    {
        return BitConverter.ToInt32(bytes.Skip(startIndex).Take(4).Reverse().ToArray(), 0);
    }

    public static UInt16 GetUInt16(this byte[] bytes, int startIndex)
    {
        return BitConverter.ToUInt16(bytes, startIndex);
    }

    public static UInt32 GetUInt32(this byte[] bytes, int startIndex)
    {
        return BitConverter.ToUInt32(bytes, startIndex);
    }

    public static Int16 GetInt16(this byte[] bytes, int startIndex)
    {
        return BitConverter.ToInt16(bytes, startIndex);
    }

    public static Int32 GetInt32(this byte[] bytes, int startIndex)
    {
        return BitConverter.ToInt32(bytes, startIndex);
    }
}
You might want to look at using an endian-aware BinaryReader/BinaryWriter implementation. Here are links to some (a minimal sketch of the idea follows the list):
Jon Skeet's answer to the question, BinaryWriter Endian issue
Anculus.Core.IO has an endian-aware BinaryReader, BinaryWriter and BitConverter: https://code.google.com/p/libanculus-sharp/source/browse/trunk/src/Anculus.Core/IO/?r=227
The Iso-Parser project (parser for parsing ISO disk images) has an endian-aware BitConverter
Rabbit MQ (Rabbit Message Queue) has a big-endian BinaryReader and BinaryWriter on GitHub at https://github.com/rabbitmq/rabbitmq-dotnet-client.
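For illustration only (this is not code from any of the libraries above), an endian-aware reader can be as simple as a BinaryReader subclass that reverses the bytes it reads on little-endian machines:
public class BigEndianBinaryReader : BinaryReader
{
    public BigEndianBinaryReader(Stream input) : base(input) { }

    private byte[] ReadBigEndianBytes(int count)
    {
        byte[] buffer = ReadBytes(count);
        if (BitConverter.IsLittleEndian)
            Array.Reverse(buffer);   // MSB-first data must be flipped for BitConverter
        return buffer;
    }

    public override short ReadInt16() { return BitConverter.ToInt16(ReadBigEndianBytes(2), 0); }
    public override ushort ReadUInt16() { return BitConverter.ToUInt16(ReadBigEndianBytes(2), 0); }
    public override int ReadInt32() { return BitConverter.ToInt32(ReadBigEndianBytes(4), 0); }
    public override uint ReadUInt32() { return BitConverter.ToUInt32(ReadBigEndianBytes(4), 0); }
}
It needs using System and using System.IO, and the same idea extends to the 64-bit and floating-point reads.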

Are there any rules for the XOR cipher?

I have the following method which takes the plain text and the key text. It is supposed to return a string encrypted with the XOR method, as ASCII.
public static string encryptXOREng(string plainText, string keyText)
{
    StringBuilder chiffreText = new StringBuilder();
    byte[] binaryPlainText = System.Text.Encoding.ASCII.GetBytes(plainText);
    byte[] binaryKeyText = System.Text.Encoding.ASCII.GetBytes(keyText);
    for (int i = 0; i < plainText.Length; i++)
    {
        int result = binaryPlainText[i] ^ binaryKeyText[i];
        chiffreText.Append(Convert.ToChar(result));
    }
    return chiffreText.ToString();
}
For some characters it runs just fine. But if, for example, it performs XOR on 'G' and 'M', which is 71 XOR 77, it returns 10. And 10 stands for line feed, which is not represented by a visible character in my output. This leads to a plain text of a certain length being encrypted to a cipher string that is only 2 characters long, in some cases. I suppose this would make decryption impossible, even with the key? Or are the ASCII characters 0 - 31 there but simply not visible?
To avoid non-printable chars, use Convert.ToBase64String:
public static string encryptXOREng(string plainText, string keyText)
{
    List<byte> chiffreText = new List<byte>();
    byte[] binaryPlainText = System.Text.Encoding.ASCII.GetBytes(plainText);
    byte[] binaryKeyText = System.Text.Encoding.ASCII.GetBytes(keyText);
    for (int i = 0; i < plainText.Length; i++)
    {
        int result = binaryPlainText[i] ^ binaryKeyText[i % binaryKeyText.Length];
        chiffreText.Add((byte)result);
    }
    return Convert.ToBase64String(chiffreText.ToArray());
}
PS: In your code you assume keyText is not shorter than plainText; I fixed that as well.
As far as I know there are no rules specific to XOR ciphers. Cryptographic functions often output values that are not printable, which makes sense: the result is not supposed to be readable. Instead you may want to use the output bytes directly, or a Base64-encoded result.
I would do something like:
public static byte[] XORCipher(string plainText, string keyText)
{
    byte[] binaryPlainText = System.Text.Encoding.ASCII.GetBytes(plainText);
    byte[] binaryKeyText = System.Text.Encoding.ASCII.GetBytes(keyText);
    // Note: assumes keyText is at least as long as plainText.
    for (int i = 0; i < plainText.Length; i++)
    {
        binaryPlainText[i] ^= binaryKeyText[i];
    }
    return binaryPlainText;
}
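Since XOR is its own inverse, decryption is just the same operation again with the same key. A minimal round-trip sketch building on the Base64 variant above (the strings are made up for the example):
string cipher = encryptXOREng("HELLO", "MYKEYMYKEY");            // Base64 text, safe to print or transmit
byte[] cipherBytes = Convert.FromBase64String(cipher);
byte[] keyBytes = System.Text.Encoding.ASCII.GetBytes("MYKEYMYKEY");

var plain = new byte[cipherBytes.Length];
for (int i = 0; i < cipherBytes.Length; i++)
    plain[i] = (byte)(cipherBytes[i] ^ keyBytes[i % keyBytes.Length]);

Console.WriteLine(System.Text.Encoding.ASCII.GetString(plain));  // HELLO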

C# Encoding Conversion

I have an encrypt routine in C++ that I translated to C#:
example:
public void main()
{
    string myPwd = "ÖFÖæ6";
    string pwdCoded = XEncrypt.EncryptData_Patch_x_Net(myPwd);
    //Result OK: ÖFÖæ–6
}

public static string EncryptData_Patch_x_Net(string Data)
{
    byte[] bytes = new byte[Data.Length];
    for (int n = 0; n < Data.Length; n++)
    {
        bytes[n] = (byte)Data[n];
    }
    System.Text.Encoding MyEncoding = System.Text.Encoding.Default;
    String MyResult = MyEncoding.GetString(bytes);
    return MyResult;
}
I need to write the inverse routine, converting from:
ÖFÖæ–6 to ÖFÖæ6 (notice there's a dash in the left string)
I wrote this last function, but it performs the encoding incorrectly:
public static string DecryptData_Patch_x_Net(string Data)
{
    byte[] bytes = new byte[Data.Length];
    for (int n = 0; n < Data.Length; n++)
    {
        bytes[n] = (byte)Data[n];
    }
    System.Text.Encoding MyEncoding = System.Text.Encoding.GetEncoding(1252);
    String MyResult = MyEncoding.GetString(bytes);
    return MyResult;
}
This is not encryption, and you are seriously overcomplicating what it actually is.
Encoding iso88591 = Encoding.GetEncoding(28591);
Encoding w1252 = Encoding.GetEncoding(1252);

string pwd = "ÖFÖæ\u00966"; // The SPA control character will not survive a Stack Overflow post,
                            // so I use \u0096 to represent it.
string result = w1252.GetString(iso88591.GetBytes(pwd));       // "ÖFÖæ–6"
string original = iso88591.GetString(w1252.GetBytes(result));  // "ÖFÖæ6" with the hidden control character before 6

Console.WriteLine(result == "ÖFÖæ–6");        // True
Console.WriteLine(original == "ÖFÖæ\u00966"); // True
Your misnamed ...Encrypt... function makes a fundamental error. You take a string, which you treat as a char[] (that's fine), then explicitly cast each char to a byte. That is a narrowing conversion: you'll lose the high bits and the ability to round-trip more unusual chars. If you look at this question it should help you understand.
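To illustrate the loss (my example, not from the original post): a char above 255 simply has its high byte thrown away by the cast:
char ch = 'Ł';               // U+0141
byte b = (byte)ch;           // narrowing: 0x0141 -> 0x41
Console.WriteLine((int)ch);  // 321
Console.WriteLine(b);        // 65, i.e. 'A' - the original character cannot be recovered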
You could use this function to get the bytes without loss of information,
static byte[] GetBytes(string str)
{
    byte[] bytes = new byte[str.Length * sizeof(char)];
    System.Buffer.BlockCopy(str.ToCharArray(), 0, bytes, 0, bytes.Length);
    return bytes;
}
The byte array will round trip on systems that share endianness.
As Esailija states, because it's simple and explicitly returns little-endian results, you're better off calling
byte[] Encoding.Unicode.GetBytes(string)
to achieve the same.
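A minimal round trip with Encoding.Unicode (UTF-16 LE), for reference:
byte[] bytes = Encoding.Unicode.GetBytes("ÖFÖæ\u00966");   // two bytes per char, little-endian
string roundTripped = Encoding.Unicode.GetString(bytes);
Console.WriteLine(roundTripped == "ÖFÖæ\u00966");           // True - nothing is lost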

ASCIIEncoding In Windows Phone 7

Is there a way to use ASCIIEncoding in Windows Phone 7?
Unless I'm doing something wrong, Encoding.ASCII doesn't exist, and I need it for C# -> PHP encryption (as PHP only uses ASCII in its SHA1 hashing).
Any suggestions?
It is easy to implement yourself, Unicode never messed with the ASCII codes:
public static byte[] StringToAscii(string s) {
    byte[] retval = new byte[s.Length];
    for (int ix = 0; ix < s.Length; ++ix) {
        char ch = s[ix];
        if (ch <= 0x7f) retval[ix] = (byte)ch;
        else retval[ix] = (byte)'?';
    }
    return retval;
}
Not really seeing any detail in your question, this could be off track. You are right that Silverlight has no support for the ASCII encoding.
However, I suspect that in fact UTF-8 will do what you need. It's worth bearing in mind that a sequence of single-byte ASCII-only characters and the same set of characters encoded as UTF-8 are identical; that is, the complete ASCII character set is repeated verbatim by the first 128 single-byte code points in UTF-8.
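A quick check of that claim (my own example): for ASCII-only input, UTF-8 produces exactly one byte per character, with the same values PHP's SHA1 would expect:
string asciiOnly = "password123";
byte[] utf8Bytes = Encoding.UTF8.GetBytes(asciiOnly);
Console.WriteLine(utf8Bytes.Length == asciiOnly.Length);   // True
Console.WriteLine(utf8Bytes[0] == (byte)'p');              // True - each byte is the ASCII code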
I have a Silverlight app that writes CSV files, which have to be encoded in ASCII (using UTF-8 causes accented characters to show up wrong when you open the files in Excel).
Since Silverlight doesn't have an Encoding.ASCII class, I implemented one as follows. It works for me, hope it's useful to you as well:
/// <summary>
/// Silverlight doesn't have an ASCII encoder, so here is one:
/// </summary>
public class AsciiEncoding : System.Text.Encoding
{
    public override int GetMaxByteCount(int charCount)
    {
        return charCount;
    }

    public override int GetMaxCharCount(int byteCount)
    {
        return byteCount;
    }

    public override int GetByteCount(char[] chars, int index, int count)
    {
        return count;
    }

    public override byte[] GetBytes(char[] chars)
    {
        return base.GetBytes(chars);
    }

    public override int GetCharCount(byte[] bytes)
    {
        return bytes.Length;
    }

    public override int GetBytes(char[] chars, int charIndex, int charCount, byte[] bytes, int byteIndex)
    {
        for (int i = 0; i < charCount; i++)
        {
            bytes[byteIndex + i] = (byte)chars[charIndex + i];
        }
        return charCount;
    }

    public override int GetCharCount(byte[] bytes, int index, int count)
    {
        return count;
    }

    public override int GetChars(byte[] bytes, int byteIndex, int byteCount, char[] chars, int charIndex)
    {
        for (int i = 0; i < byteCount; i++)
        {
            chars[charIndex + i] = (char)bytes[byteIndex + i];
        }
        return byteCount;
    }
}
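Usage is the same as with any other Encoding; a short round trip for reference:
var ascii = new AsciiEncoding();
byte[] bytes = ascii.GetBytes("Name;Amount");   // one byte per character
string text = ascii.GetString(bytes);           // "Name;Amount" again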
I had a similar problem using Xamarin (Mono) for Android, where I'm using a Portable Class Library and it doesn't support Encoding.ASCII.
Instead, the only working solution (apart from doing it manually) is this one:
Uri.EscapeDataString(yourString);
See this answer, which provides additional information.
I started from Hans Passant's answer and rewrote it with LINQ:
/// <summary>
/// Encodes a string as ASCII (7-bit) bytes; characters outside ASCII become '?'.
/// </summary>
/// <see cref="http://stackoverflow.com/a/4022893/1248177"/>
/// <param name="s">The string to encode.</param>
/// <returns>The ASCII (7-bit) encoded bytes.</returns>
public static byte[] StringToAscii(string s)
{
    return (from char c in s select (byte)((c <= 0x7f) ? c : '?')).ToArray();
}
You may want to remove the call to ToArray() and return an IEnumerable<byte> instead of byte[].
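That variant could look like this (my sketch of the suggestion above, with a made-up method name):
public static IEnumerable<byte> StringToAsciiLazy(string s)
{
    // Deferred execution: bytes are produced only when the sequence is enumerated.
    return from char c in s select (byte)((c <= 0x7f) ? c : '?');
}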
According to this MS forum thread, Windows Phone 7 does not support Encoding.ASCII.

How to convert UTF-8 byte[] to string

I have a byte[] array that is loaded from a file that I happen to know contains UTF-8.
In some debugging code, I need to convert it to a string. Is there a one-liner that will do this?
Under the covers it should be just an allocation and a memcopy, so even if it is not implemented, it should be possible.
string result = System.Text.Encoding.UTF8.GetString(byteArray);
There are at least four different ways of doing this conversion.
Encoding's GetString. But you won't be able to get the original bytes back if those bytes have non-ASCII characters.
BitConverter.ToString. The output is a "-"-delimited string, but there's no built-in .NET method to convert the string back to a byte array.
Convert.ToBase64String. You can easily convert the output string back to a byte array using Convert.FromBase64String. Note: the output string can contain '+', '/' and '='. If you want to use the string in a URL, you need to explicitly encode it.
HttpServerUtility.UrlTokenEncode. You can easily convert the output string back to a byte array using HttpServerUtility.UrlTokenDecode. The output string is already URL-friendly! The downside is that it needs the System.Web assembly if your project is not a web project.
A full example:
byte[] bytes = { 130, 200, 234, 23 }; // A byte array contains non-ASCII (or non-readable) characters

string s1 = Encoding.UTF8.GetString(bytes);      // ���
byte[] decBytes1 = Encoding.UTF8.GetBytes(s1);   // decBytes1.Length == 10 !!
// decBytes1 not same as bytes
// Using UTF-8 or other Encoding object will get similar results

string s2 = BitConverter.ToString(bytes);        // 82-C8-EA-17
String[] tempAry = s2.Split('-');
byte[] decBytes2 = new byte[tempAry.Length];
for (int i = 0; i < tempAry.Length; i++)
    decBytes2[i] = Convert.ToByte(tempAry[i], 16);
// decBytes2 same as bytes

string s3 = Convert.ToBase64String(bytes);       // gsjqFw==
byte[] decByte3 = Convert.FromBase64String(s3);
// decByte3 same as bytes

string s4 = HttpServerUtility.UrlTokenEncode(bytes);     // gsjqFw2
byte[] decBytes4 = HttpServerUtility.UrlTokenDecode(s4);
// decBytes4 same as bytes
A general solution for converting from a byte array to a string when you don't know the encoding (StreamReader defaults to UTF-8 and honors a BOM if one is present):
static string BytesToStringConverted(byte[] bytes)
{
    using (var stream = new MemoryStream(bytes))
    {
        using (var streamReader = new StreamReader(stream))
        {
            return streamReader.ReadToEnd();
        }
    }
}
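If you do know or suspect the encoding, you can pass it explicitly and still let a BOM override it (a small variation on the snippet above):
static string BytesToStringConverted(byte[] bytes, Encoding fallback)
{
    using (var stream = new MemoryStream(bytes))
    using (var reader = new StreamReader(stream, fallback, detectEncodingFromByteOrderMarks: true))
    {
        return reader.ReadToEnd();
    }
}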
Definition:
public static string ConvertByteToString(this byte[] source)
{
    return source != null ? System.Text.Encoding.UTF8.GetString(source) : null;
}
Using:
string result = input.ConvertByteToString();
Converting a byte[] to a string seems simple, but any kind of encoding is likely to mess up the output string. This little function just works without any unexpected results:
private string ToString(byte[] bytes)
{
    string response = string.Empty;

    foreach (byte b in bytes)
        response += (Char)b;

    return response;
}
I saw some answers in this post that can be considered good background knowledge, because there are several approaches in C# to solve the same problem. The only thing that still needs to be considered is the difference between pure UTF-8 and UTF-8 with a BOM.
Last week, at my job, I needed to develop a feature that outputs CSV files with a BOM and other CSV files in pure UTF-8 (without a BOM). Each CSV encoding type is consumed by a different non-standardized API: one API reads UTF-8 with a BOM and the other reads it without a BOM. I researched the concept, reading the "What's the difference between UTF-8 and UTF-8 without BOM?" Stack Overflow question and the Wikipedia article "Byte order mark", to build my approach.
Finally, my C# code for both UTF-8 encoding types (with BOM and pure) ended up similar to the example below:
// For UTF-8 with a BOM, same as the answer shared by Zanoni (at the top):
string result = System.Text.Encoding.UTF8.GetString(byteArray);

// For pure UTF-8 (without a BOM):
string result = (new UTF8Encoding(false)).GetString(byteArray);
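For reference, the BOM itself is just the three bytes EF BB BF, which UTF8Encoding exposes via GetPreamble():
byte[] withBom = new UTF8Encoding(true).GetPreamble();   // { 0xEF, 0xBB, 0xBF }
byte[] noBom = new UTF8Encoding(false).GetPreamble();    // empty array
// When writing a file "with BOM", these preamble bytes go in front of the encoded text.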
Using b.ToString("x2") for each byte outputs a hex string such as b4b5dfe475e58b67:
public static class Ext {
    public static string ToHexString(this byte[] hex)
    {
        if (hex == null) return null;
        if (hex.Length == 0) return string.Empty;

        var s = new StringBuilder();
        foreach (byte b in hex) {
            s.Append(b.ToString("x2"));
        }
        return s.ToString();
    }

    public static byte[] ToHexBytes(this string hex)
    {
        if (hex == null) return null;
        if (hex.Length == 0) return new byte[0];

        int l = hex.Length / 2;
        var b = new byte[l];
        for (int i = 0; i < l; ++i) {
            b[i] = Convert.ToByte(hex.Substring(i * 2, 2), 16);
        }
        return b;
    }

    public static bool EqualsTo(this byte[] bytes, byte[] bytesToCompare)
    {
        if (bytes == null && bytesToCompare == null) return true; // ?
        if (bytes == null || bytesToCompare == null) return false;
        if (object.ReferenceEquals(bytes, bytesToCompare)) return true;
        if (bytes.Length != bytesToCompare.Length) return false;

        for (int i = 0; i < bytes.Length; ++i) {
            if (bytes[i] != bytesToCompare[i]) return false;
        }
        return true;
    }
}
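Round-trip usage of these extensions, for reference:
byte[] data = { 0xb4, 0xb5, 0xdf, 0xe4 };
string hex = data.ToHexString();          // "b4b5dfe4"
byte[] back = hex.ToHexBytes();
Console.WriteLine(back.EqualsTo(data));   // True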
There is also the UnicodeEncoding class, which is quite simple to use:
UnicodeEncoding ByteConverter = new UnicodeEncoding();

string stringDataForEncoding = "My Secret Data!";
byte[] dataEncoded = ByteConverter.GetBytes(stringDataForEncoding);

Console.WriteLine("Data after decoding: {0}", ByteConverter.GetString(dataEncoded));
In addition to the selected answer, if you're using .NET 3.5 or .NET 3.5 CE, you have to specify the index of the first byte to decode, and the number of bytes to decode:
string result = System.Text.Encoding.UTF8.GetString(byteArray, 0, byteArray.Length);
Alternatively:
var byteStr = Convert.ToBase64String(bytes);
The BitConverter class can be used to convert a byte[] to a string.
var convertedString = BitConverter.ToString(byteArray);
Documentation for the BitConverter class can be found on MSDN.
To my knowledge none of the given answers guarantees correct behavior with null termination. Until someone shows me differently, I wrote my own static class for handling this with the following methods:
// Mimics the functionality of strlen() in C/C++.
// Needed because neither StringBuilder nor Encoding.*.GetString() handle \0 well.
static int StringLength(byte[] buffer, int startIndex = 0)
{
    int strlen = 0;
    while
    (
        (startIndex + strlen + 1) < buffer.Length // Make sure incrementing won't break any bounds
        && buffer[startIndex + strlen] != 0       // The typical null termination check
    )
    {
        ++strlen;
    }
    return strlen;
}

// This is messy, but I haven't found a built-in way in C# that guarantees null termination.
public static string ParseBytes(byte[] buffer, out int strlen, int startIndex = 0)
{
    strlen = StringLength(buffer, startIndex);
    byte[] c_str = new byte[strlen];
    Array.Copy(buffer, startIndex, c_str, 0, strlen);
    return Encoding.UTF8.GetString(c_str);
}
The reason for the startIndex parameter is that in the example I was working on, I specifically needed to parse a byte[] as an array of null-terminated strings. It can be safely ignored in the simple case.
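Reading several null-terminated strings out of one buffer with these methods might look like this (my usage sketch, not part of the original answer):
byte[] buffer = Encoding.UTF8.GetBytes("first\0second\0");
int offset = 0;
while (offset < buffer.Length)
{
    string s = ParseBytes(buffer, out int len, offset);
    if (len == 0) break;       // nothing left to read
    Console.WriteLine(s);      // "first", then "second"
    offset += len + 1;         // skip the terminating \0
}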
A LINQ one-liner for converting a byte array byteArrFilename read from a file to a pure ASCII, C-style zero-terminated string would be this. It's handy for reading things like file index tables in old archive formats.
String filename = new String(byteArrFilename.TakeWhile(x => x != 0)
.Select(x => x < 128 ? (Char)x : '?').ToArray());
I use '?' as the default character for anything not pure ASCII here, but that can be changed, of course. If you want to be sure you can detect it, just use '\0' instead, since the TakeWhile at the start ensures that a string built this way cannot possibly contain '\0' values from the input source.
Try this console application:
static void Main(string[] args)
{
    //Encoding _UTF8 = Encoding.UTF8;
    string[] _mainString = { "Hello, World!" };
    Console.WriteLine("Main String: " + _mainString[0]);

    // Convert a string to UTF-8 bytes.
    byte[] _utf8Bytes = Encoding.UTF8.GetBytes(_mainString[0]);

    // Convert UTF-8 bytes to a string.
    string _stringuUnicode = Encoding.UTF8.GetString(_utf8Bytes);
    Console.WriteLine("String Unicode: " + _stringuUnicode);
}
Here is a solution where you don't have to bother with encoding. I used it in my network class to send binary objects as strings.
public static byte[] String2ByteArray(string str)
{
    char[] chars = str.ToArray();
    byte[] bytes = new byte[chars.Length * 2];

    for (int i = 0; i < chars.Length; i++)
        Array.Copy(BitConverter.GetBytes(chars[i]), 0, bytes, i * 2, 2);

    return bytes;
}

public static string ByteArray2String(byte[] bytes)
{
    char[] chars = new char[bytes.Length / 2];

    for (int i = 0; i < chars.Length; i++)
        chars[i] = BitConverter.ToChar(bytes, i * 2);

    return new string(chars);
}
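A quick round trip with these two helpers, for reference:
byte[] data = String2ByteArray("ÖFÖæ–6");   // two bytes per char (UTF-16 code units)
string original = ByteArray2String(data);
Console.WriteLine(original == "ÖFÖæ–6");    // True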
string result = ASCIIEncoding.UTF8.GetString(byteArray);
