Convert a Unicode string to an escaped ASCII string - c#

How can I convert this string:
This string contains the Unicode character Pi(π)
into an escaped ASCII string:
This string contains the Unicode character Pi(\u03a0)
and vice versa?
The current Encoding available in C# converts the π character to "?". I need to preserve that character.

This goes back and forth to and from the \uXXXX format.
class Program {
static void Main( string[] args ) {
string unicodeString = "This function contains a unicode character pi (\u03a0)";
Console.WriteLine( unicodeString );
string encoded = EncodeNonAsciiCharacters(unicodeString);
Console.WriteLine( encoded );
string decoded = DecodeEncodedNonAsciiCharacters( encoded );
Console.WriteLine( decoded );
}
static string EncodeNonAsciiCharacters( string value ) {
StringBuilder sb = new StringBuilder();
foreach( char c in value ) {
if( c > 127 ) {
// This character is too big for ASCII
string encodedValue = "\\u" + ((int) c).ToString( "x4" );
sb.Append( encodedValue );
}
else {
sb.Append( c );
}
}
return sb.ToString();
}
static string DecodeEncodedNonAsciiCharacters( string value ) {
return Regex.Replace(
value,
#"\\u(?<Value>[a-zA-Z0-9]{4})",
m => {
return ((char) int.Parse( m.Groups["Value"].Value, NumberStyles.HexNumber )).ToString();
} );
}
}
Outputs:
This function contains a unicode character pi (π)
This function contains a unicode character pi (\u03a0)
This function contains a unicode character pi (π)

For Unescape You can simply use this functions:
System.Text.RegularExpressions.Regex.Unescape(string)
System.Uri.UnescapeDataString(string)
I suggest using this method (It works better with UTF-8):
UnescapeDataString(string)

string StringFold(string input, Func<char, string> proc)
{
return string.Concat(input.Select(proc).ToArray());
}
string FoldProc(char input)
{
if (input >= 128)
{
return string.Format(#"\u{0:x4}", (int)input);
}
return input.ToString();
}
string EscapeToAscii(string input)
{
return StringFold(input, FoldProc);
}

As a one-liner:
var result = Regex.Replace(input, #"[^\x00-\x7F]", c =>
string.Format(#"\u{0:x4}", (int)c.Value[0]));

class Program
{
static void Main(string[] args)
{
char[] originalString = "This string contains the unicode character Pi(π)".ToCharArray();
StringBuilder asAscii = new StringBuilder(); // store final ascii string and Unicode points
foreach (char c in originalString)
{
// test if char is ascii, otherwise convert to Unicode Code Point
int cint = Convert.ToInt32(c);
if (cint <= 127 && cint >= 0)
asAscii.Append(c);
else
asAscii.Append(String.Format("\\u{0:x4} ", cint).Trim());
}
Console.WriteLine("Final string: {0}", asAscii);
Console.ReadKey();
}
}
All non-ASCII chars are converted to their Unicode Code Point representation and appended to the final string.

Here is my current implementation:
public static class UnicodeStringExtensions
{
public static string EncodeNonAsciiCharacters(this string value) {
var bytes = Encoding.Unicode.GetBytes(value);
var sb = StringBuilderCache.Acquire(value.Length);
bool encodedsomething = false;
for (int i = 0; i < bytes.Length; i += 2) {
var c = BitConverter.ToUInt16(bytes, i);
if ((c >= 0x20 && c <= 0x7f) || c == 0x0A || c == 0x0D) {
sb.Append((char) c);
} else {
sb.Append($"\\u{c:x4}");
encodedsomething = true;
}
}
if (!encodedsomething) {
StringBuilderCache.Release(sb);
return value;
}
return StringBuilderCache.GetStringAndRelease(sb);
}
public static string DecodeEncodedNonAsciiCharacters(this string value)
=> Regex.Replace(value,/*language=regexp*/#"(?:\\u[a-fA-F0-9]{4})+", Decode);
static readonly string[] Splitsequence = new [] { "\\u" };
private static string Decode(Match m) {
var bytes = m.Value.Split(Splitsequence, StringSplitOptions.RemoveEmptyEntries)
.Select(s => ushort.Parse(s, NumberStyles.HexNumber)).SelectMany(BitConverter.GetBytes).ToArray();
return Encoding.Unicode.GetString(bytes);
}
}
This passes a test:
public void TestBigUnicode() {
var s = "\U00020000";
var encoded = s.EncodeNonAsciiCharacters();
var decoded = encoded.DecodeEncodedNonAsciiCharacters();
Assert.Equals(s, decoded);
}
with the encoded value: "\ud840\udc00"
This implementation makes use of a StringBuilderCache (reference source link)

A small patch to #Adam Sills's answer which solves FormatException on cases where the input string like "c:\u00ab\otherdirectory\" plus RegexOptions.Compiled makes the Regex compilation much faster:
private static Regex DECODING_REGEX = new Regex(#"\\u(?<Value>[a-fA-F0-9]{4})", RegexOptions.Compiled);
private const string PLACEHOLDER = #"#!#";
public static string DecodeEncodedNonAsciiCharacters(this string value)
{
return DECODING_REGEX.Replace(
value.Replace(#"\\", PLACEHOLDER),
m => {
return ((char)int.Parse(m.Groups["Value"].Value, NumberStyles.HexNumber)).ToString(); })
.Replace(PLACEHOLDER, #"\\");
}

To store actual Unicode codepoints, you have to first decode the String's UTF-16 codeunits to UTF-32 codeunits (which are currently the same as the Unicode codepoints). Use System.Text.Encoding.UTF32.GetBytes() for that, and then write the resulting bytes to the StringBuilder as needed,i.e.
static void Main(string[] args)
{
String originalString = "This string contains the unicode character Pi(π)";
Byte[] bytes = Encoding.UTF32.GetBytes(originalString);
StringBuilder asAscii = new StringBuilder();
for (int idx = 0; idx < bytes.Length; idx += 4)
{
uint codepoint = BitConverter.ToUInt32(bytes, idx);
if (codepoint <= 127)
asAscii.Append(Convert.ToChar(codepoint));
else
asAscii.AppendFormat("\\u{0:x4}", codepoint);
}
Console.WriteLine("Final string: {0}", asAscii);
Console.ReadKey();
}

You need to use the Convert() method in the Encoding class:
Create an Encoding object that represents ASCII encoding
Create an Encoding object that represents Unicode encoding
Call Encoding.Convert() with the source encoding, the destination encoding, and the string to be encoded
There is an example here:
using System;
using System.Text;
namespace ConvertExample
{
class ConvertExampleClass
{
static void Main()
{
string unicodeString = "This string contains the unicode character Pi(\u03a0)";
// Create two different encodings.
Encoding ascii = Encoding.ASCII;
Encoding unicode = Encoding.Unicode;
// Convert the string into a byte[].
byte[] unicodeBytes = unicode.GetBytes(unicodeString);
// Perform the conversion from one encoding to the other.
byte[] asciiBytes = Encoding.Convert(unicode, ascii, unicodeBytes);
// Convert the new byte[] into a char[] and then into a string.
// This is a slightly different approach to converting to illustrate
// the use of GetCharCount/GetChars.
char[] asciiChars = new char[ascii.GetCharCount(asciiBytes, 0, asciiBytes.Length)];
ascii.GetChars(asciiBytes, 0, asciiBytes.Length, asciiChars, 0);
string asciiString = new string(asciiChars);
// Display the strings created before and after the conversion.
Console.WriteLine("Original string: {0}", unicodeString);
Console.WriteLine("Ascii converted string: {0}", asciiString);
}
}
}

Related

HttpContext.Session.TryGetValue gives nonsense [duplicate]

I am trying to "decode" this following Base64 string:
OBFZDTcPCxlCKhdXCQ0kMQhKPh9uIgYIAQxALBtZAwUeOzcdcUEeW0dMO1kbPElWCV1ISFFKZ0kdWFlLAURPZhEFQVseXVtPOUUICVhMAzcfZ14AVEdIVVgfAUIBWVpOUlAeaUVMXFlKIy9rGUN0VF08Oz1POxFfTCcVFw1LMQNbBQYWAQ==
This is what I know about the string itself:
The original string is first passed through the following code:
private static string m000493(string p0, string p1)
{
StringBuilder builder = new StringBuilder(p0);
StringBuilder builder2 = new StringBuilder(p1);
StringBuilder builder3 = new StringBuilder(p0.Length);
int num = 0;
Label_0084:
while (num < builder.Length)
{
int num2 = 0;
while (num2 < p1.Length)
{
if ((num == builder.Length) || (num2 == builder2.Length))
{
MessageBox.Show("EH?");
goto Label_0084;
}
char ch = builder[num];
char ch2 = builder2[num2];
ch = (char)(ch ^ ch2);
builder3.Append(ch);
num2++;
num++;
}
}
return m0001cd(builder3.ToString());
}
The p1 part in the code is supposed to be the string "_p0lizei.".
It is then converted to a Base64 string by the following code:
private static string m0001cd(string p0)
{
string str2;
try
{
byte[] buffer = new byte[p0.Length];
str2 = Convert.ToBase64String(Encoding.UTF8.GetBytes(p0));
}
catch (Exception exception)
{
throw new Exception("Error in base64Encode" + exception.Message);
}
return str2;
}
The question is, how do I decode the Base64 string so that I can find out what the original string is?
Simple:
byte[] data = Convert.FromBase64String(encodedString);
string decodedString = Encoding.UTF8.GetString(data);
The m000493 method seems to perform some kind of XOR encryption. This means that the same method can be used for both encoding and decoding the text. All you have to do is reverse m0001cd:
string p0 = Encoding.UTF8.GetString(Convert.FromBase64String("OBFZDT..."));
string result = m000493(p0, "_p0lizei.");
// result == "gaia^unplugged^Ta..."
with return m0001cd(builder3.ToString()); changed to return builder3.ToString();.
// Decode a Base64 string to a string
public static string DecodeBase64(string value)
{
if(string.IsNullOrEmpty(value))
return string.Empty;
var valueBytes = System.Convert.FromBase64String(value);
return System.Text.Encoding.UTF8.GetString(valueBytes);
}

Convert string to GSM 7-bit using C#

How can I convert a string to correct GSM encoded value to be sent to a mobile operator?
Below is a port of gnibbler's answer, slightly modified and with a detailed explantation.
Example:
string output = GSMConverter.StringToGSMHexString("Hello World");
// output = "48-65-6C-6C-6F-20-57-6F-72-6C-64"
Implementation:
// Data/info taken from http://en.wikipedia.org/wiki/GSM_03.38
public static class GSMConverter
{
// The index of the character in the string represents the index
// of the character in the respective character set
// Basic Character Set
private const string BASIC_SET =
"#£$¥èéùìòÇ\nØø\rÅåΔ_ΦΓΛΩΠΨΣΘΞ\x1bÆæßÉ !\"#¤%&'()*+,-./0123456789:;<=>?" +
"¡ABCDEFGHIJKLMNOPQRSTUVWXYZÄÖÑܧ¿abcdefghijklmnopqrstuvwxyzäöñüà";
// Basic Character Set Extension
private const string EXTENSION_SET =
"````````````````````^```````````````````{}`````\\````````````[~]`" +
"|````````````````````````````````````€``````````````````````````";
// If the character is in the extension set, it must be preceded
// with an 'ESC' character whose index is '27' in the Basic Character Set
private const int ESC_INDEX = 27;
public static string StringToGSMHexString(string text, bool delimitWithDash = true)
{
// Replace \r\n with \r to reduce character count
text = text.Replace(Environment.NewLine, "\r");
// Use this list to store the index of the character in
// the basic/extension character sets
var indicies = new List<int>();
foreach (var c in text)
{
int index = BASIC_SET.IndexOf(c);
if(index != -1) {
indicies.Add(index);
continue;
}
index = EXTENSION_SET.IndexOf(c);
if(index != -1) {
// Add the 'ESC' character index before adding
// the extension character index
indicies.Add(ESC_INDEX);
indicies.Add(index);
continue;
}
}
// Convert indicies to 2-digit hex
var hex = indicies.Select(i => i.ToString("X2")).ToArray();
string delimiter = delimitWithDash ? "-" : "";
// Delimit output
string delimited = string.Join(delimiter, hex);
return delimited;
}
}
The code of Omar Didn't work for me. But I found The actual code that works:
public static string Encode7bit(string s)
{
string empty = string.Empty;
for (int index = s.Length - 1; index >= 0; --index)
empty += Convert.ToString((byte)s[index], 2).PadLeft(8, '0').Substring(1);
string str1 = empty.PadLeft((int)Math.Ceiling((Decimal)empty.Length / new Decimal(8)) * 8, '0');
List<byte> byteList = new List<byte>();
while (str1 != string.Empty)
{
string str2 = str1.Substring(0, str1.Length > 7 ? 8 : str1.Length).PadRight(8, '0');
str1 = str1.Length > 7 ? str1.Substring(8) : string.Empty;
byteList.Add(Convert.ToByte(str2, 2));
}
byteList.Reverse();
var messageBytes = byteList.ToArray();
var encodedData = "";
foreach (byte b in messageBytes)
{
encodedData += Convert.ToString(b, 16).PadLeft(2, '0');
}
return encodedData.ToUpper();
}

Convert from string ascii to string Hex

Suppose I have this string
string str = "1234"
I need a function that convert this string to this string:
"0x31 0x32 0x33 0x34"
I searched online and found a lot of similar things, but not an answer to this question.
string str = "1234";
char[] charValues = str.ToCharArray();
string hexOutput="";
foreach (char _eachChar in charValues )
{
// Get the integral value of the character.
int value = Convert.ToInt32(_eachChar);
// Convert the decimal value to a hexadecimal value in string form.
hexOutput += String.Format("{0:X}", value);
// to make output as your eg
// hexOutput +=" "+ String.Format("{0:X}", value);
}
//here is the HEX hexOutput
//use hexOutput
This seems the job for an extension method
void Main()
{
string test = "ABCD1234";
string result = test.ToHex();
}
public static class StringExtensions
{
public static string ToHex(this string input)
{
StringBuilder sb = new StringBuilder();
foreach(char c in input)
sb.AppendFormat("0x{0:X2} ", (int)c);
return sb.ToString().Trim();
}
}
A few tips.
Do not use string concatenation. Strings are immutable and thus every time you concatenate a string a new one is created. (Pressure on memory usage and fragmentation) A StringBuilder is generally more efficient for this case.
Strings are array of characters and using a foreach on a string already gives access to the character array
These common codes are well suited for an extension method included in a utility library always available for your projects (code reuse)
Convert to byte array and then to hex
string data = "1234";
// Convert to byte array
byte[] retval = System.Text.Encoding.ASCII.GetBytes(data);
// Convert to hex and add "0x"
data = "0x" + BitConverter.ToString(retval).Replace("-", " 0x");
System.Diagnostics.Debug.WriteLine(data);
static void Main(string[] args)
{
string str = "1234";
char[] array = str.ToCharArray();
string final = "";
foreach (var i in array)
{
string hex = String.Format("{0:X}", Convert.ToInt32(i));
final += hex.Insert(0, "0X") + " ";
}
final = final.TrimEnd();
Console.WriteLine(final);
}
Output will be;
0X31 0X32 0X33 0X34
Here is a DEMO.
A nice declarative way to solve this would be:
var str = "1234"
string.Join(" ", str.Select(c => $"0x{(int)c:X}"))
// Outputs "0x31 0x32 0x33 0x34"
[TestMethod]
public void ToHex()
{
string str = "1234A";
var result = str.Select(s => string.Format("0x{0:X2}", ((byte)s)));
foreach (var item in result)
{
Debug.WriteLine(item);
}
}
This is one I've used:
private static string ConvertToHex(byte[] bytes)
{
var builder = new StringBuilder();
var hexCharacters = new[] { '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', 'A', 'B', 'C', 'D', 'E', 'F' };
for (var i = 0; i < bytes.Length; i++)
{
int firstValue = (bytes[i] >> 4) & 0x0F;
int secondValue = bytes[i] & 0x0F;
char firstCharacter = hexCharacters[firstValue];
char secondCharacter = hexCharacters[secondValue];
builder.Append("0x");
builder.Append(firstCharacter);
builder.Append(secondCharacter);
builder.Append(' ');
}
return builder.ToString().Trim(' ');
}
And then used like:
string test = "1234";
ConvertToHex(Encoding.UTF8.GetBytes(test));

Determine if string has all unique characters

I'm working through an algorithm problem set which poses the following question:
"Determine if a string has all unique characters. Assume you can only use arrays".
I have a working solution, but I would like to see if there is anything better optimized in terms of time complexity. I do not want to use LINQ. Appreciate any help you can provide!
static void Main(string[] args)
{
FindDupes("crocodile");
}
static string FindDupes(string text)
{
if (text.Length == 0 || text.Length > 256)
{
Console.WriteLine("String is either empty or too long");
}
char[] str = new char[text.Length];
char[] output = new char[text.Length];
int strLength = 0;
int outputLength = 0;
foreach (char value in text)
{
bool dupe = false;
for (int i = 0; i < strLength; i++)
{
if (value == str[i])
{
dupe = true;
break;
}
}
if (!dupe)
{
str[strLength] = value;
strLength++;
output[outputLength] = value;
outputLength++;
}
}
return new string(output, 0, outputLength);
}
If time complexity is all you care about you could map the characters to int values, then have an array of bool values which remember if you've seen a particular character value previously.
Something like ... [not tested]
bool[] array = new bool[256]; // or larger for Unicode
foreach (char value in text)
if (array[(int)value])
return false;
else
array[(int)value] = true;
return true;
try this,
string RemoveDuplicateChars(string key)
{
string table = string.Empty;
string result = string.Empty;
foreach (char value in key)
{
if (table.IndexOf(value) == -1)
{
table += value;
result += value;
}
}
return result;
}
usage
Console.WriteLine(RemoveDuplicateChars("hello"));
Console.WriteLine(RemoveDuplicateChars("helo"));
Console.WriteLine(RemoveDuplicateChars("Crocodile"));
output
helo
helo
Crocdile
public boolean ifUnique(String toCheck){
String str="";
for(int i=0;i<toCheck.length();i++)
{
if(str.contains(""+toCheck.charAt(i)))
return false;
str+=toCheck.charAt(i);
}
return true;
}
EDIT:
You may also consider to omit the boundary case where toCheck is an empty string.
The following code works:
static void Main(string[] args)
{
isUniqueChart("text");
Console.ReadKey();
}
static Boolean isUniqueChart(string text)
{
if (text.Length == 0 || text.Length > 256)
{
Console.WriteLine(" The text is empty or too larg");
return false;
}
Boolean[] char_set = new Boolean[256];
for (int i = 0; i < text.Length; i++)
{
int val = text[i];//already found this char in the string
if (char_set[val])
{
Console.WriteLine(" The text is not unique");
return false;
}
char_set[val] = true;
}
Console.WriteLine(" The text is unique");
return true;
}
If the string has only lower case letters (a-z) or only upper case letters (A-Z) you can use a very optimized O(1) solution.Also O(1) space.
c++ code :
bool checkUnique(string s){
if(s.size() >26)
return false;
int unique=0;
for (int i = 0; i < s.size(); ++i) {
int j= s[i]-'a';
if(unique & (1<<j)>0)
return false;
unique=unique|(1<<j);
}
return true;
}
Remove Duplicates in entire Unicode Range
Not all characters can be represented by a single C# char. If you need to take into account combining characters and extended unicode characters, you need to:
parse the characters using StringInfo
normalize the characters
find duplicates amongst the normalized strings
Code to remove duplicate characters:
We keep track of the entropy, storing the normalized characters (each character is a string, because many characters require more than 1 C# char). In case a character (normalized) is not yet stored in the entropy, we append the character (in specified form) to the output.
public static class StringExtension
{
public static string RemoveDuplicateChars(this string text)
{
var output = new StringBuilder();
var entropy = new HashSet<string>();
var iterator = StringInfo.GetTextElementEnumerator(text);
while (iterator.MoveNext())
{
var character = iterator.GetTextElement();
if (entropy.Add(character.Normalize()))
{
output.Append(character);
}
}
return output.ToString();
}
}
Unit Test:
Let's test a string that contains variations on the letter A, including the Angstrom sign Å. The Angstrom sign has unicode codepoint u212B, but can also be constructed as the letter A with the diacritic u030A. Both represent the same character.
// ÅÅAaA
var input = "\u212BA\u030AAaA";
// ÅAa
var output = input.RemoveDuplicateChars();
Further extensions could allow for a selector function that determines how to normalize characters. For instance the selector (x) => x.ToUpperInvariant().Normalize() would allow for case-insensitive duplicate removal.
public static bool CheckUnique(string str)
{
int accumulator = 0;
foreach (int asciiCode in str)
{
int shiftedBit = 1 << (asciiCode - ' ');
if ((accumulator & shiftedBit) > 0)
return false;
accumulator |= shiftedBit;
}
return true;
}

Replacing a char at a given index in string? [duplicate]

This question already has answers here:
how do I set a character at an index in a string in c#?
(7 answers)
Closed 5 years ago.
String does not have ReplaceAt(), and I'm tumbling a bit on how to make a decent function that does what I need. I suppose the CPU cost is high, but the string sizes are small so it's all ok
Use a StringBuilder:
StringBuilder sb = new StringBuilder(theString);
sb[index] = newChar;
theString = sb.ToString();
The simplest approach would be something like:
public static string ReplaceAt(this string input, int index, char newChar)
{
if (input == null)
{
throw new ArgumentNullException("input");
}
char[] chars = input.ToCharArray();
chars[index] = newChar;
return new string(chars);
}
This is now an extension method so you can use:
var foo = "hello".ReplaceAt(2, 'x');
Console.WriteLine(foo); // hexlo
It would be nice to think of some way that only required a single copy of the data to be made rather than the two here, but I'm not sure of any way of doing that. It's possible that this would do it:
public static string ReplaceAt(this string input, int index, char newChar)
{
if (input == null)
{
throw new ArgumentNullException("input");
}
StringBuilder builder = new StringBuilder(input);
builder[index] = newChar;
return builder.ToString();
}
... I suspect it entirely depends on which version of the framework you're using.
string s = "ihj";
char[] array = s.ToCharArray();
array[1] = 'p';
s = new string(array);
Strings are immutable objects, so you can't replace a given character in the string.
What you can do is you can create a new string with the given character replaced.
But if you are to create a new string, why not use a StringBuilder:
string s = "abc";
StringBuilder sb = new StringBuilder(s);
sb[1] = 'x';
string newS = sb.ToString();
//newS = "axc";
I suddenly needed to do this task and found this topic.
So, this is my linq-style variant:
public static class Extensions
{
public static string ReplaceAt(this string value, int index, char newchar)
{
if (value.Length <= index)
return value;
else
return string.Concat(value.Select((c, i) => i == index ? newchar : c));
}
}
and then, for example:
string instr = "Replace$dollar";
string outstr = instr.ReplaceAt(7, ' ');
In the end I needed to utilize .Net Framework 2, so I use a StringBuilder class variant though.
If your project (.csproj) allow unsafe code probably this is the faster solution:
namespace System
{
public static class StringExt
{
public static unsafe void ReplaceAt(this string source, int index, char value)
{
if (source == null)
throw new ArgumentNullException("source");
if (index < 0 || index >= source.Length)
throw new IndexOutOfRangeException("invalid index value");
fixed (char* ptr = source)
{
ptr[index] = value;
}
}
}
}
You may use it as extension method of String objects.
public string ReplaceChar(string sourceString, char newChar, int charIndex)
{
try
{
// if the sourceString exists
if (!String.IsNullOrEmpty(sourceString))
{
// verify the lenght is in range
if (charIndex < sourceString.Length)
{
// Get the oldChar
char oldChar = sourceString[charIndex];
// Replace out the char ***WARNING - THIS CODE IS WRONG - it replaces ALL occurrences of oldChar in string!!!***
sourceString.Replace(oldChar, newChar);
}
}
}
catch (Exception error)
{
// for debugging only
string err = error.ToString();
}
// return value
return sourceString;
}

Categories

Resources