I want to convert Latin characters into HTML entity codes in C#.
For example, Th‚rŠse Ramdally should convert into
Th&#8218;r&#352;se Ramdally
Thanks,
vela
A possible solution is to encode every character that's outside the printable ASCII range (i.e. character >= 128 or character < 32):
String source = @"Th‚rŠse Ramdally";
String result = String.Concat(source
    .Select(c => (c < 128 && c > 31)
        ? c.ToString()
        : String.Format("&#{0};", (int)c)));
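Running this (it needs using System.Linq in scope) yields the entity-encoded form, since ‚ is U+201A (8218) and Š is U+0160 (352):
Console.WriteLine(result); // prints: Th&#8218;r&#352;se Ramdally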
I convert my hex dump to get special characters like the £ symbol, but when I try to convert "0x18" I get "\u0018" as the value. Can anyone give me a solution regarding this matter?
Here is my code:
public static string FromHexDump(string sText)
{
    Int32 lIdx;
    string prValue = "";
    for (lIdx = 1; lIdx < sText.Length; lIdx += 2)
    {
        // Mid is VB's 1-based Strings.Mid; the plain C# equivalent is sText.Substring(lIdx - 1, 2)
        string prString = "0x" + Mid(sText, lIdx, 2);
        string prUniCode = Convert.ToChar(Convert.ToInt64(prString, 16)).ToString();
        prValue = prValue + prUniCode;
    }
    return prValue;
}
I am using VB. I have a database that already stores my password as encrypted text, and the value looks like BAA37D40186D, so I loop over it in steps of 2 to get 0xBA, 0xA3, 0x7D, 0x40, 0x18, 0x6D, and the VB result comes out like º£}@m.
You can use this code:
char myHex = '\x0633';
string formattedString = string.Format(@"\x{0:x4}", (int)myHex);
Or you can use this code from MSDN (https://learn.microsoft.com/en-us/dotnet/csharp/programming-guide/types/how-to-convert-between-hexadecimal-strings-and-numeric-types):
string hexValues = "48 65 6C 6C 6F 20 57 6F 72 6C 64 21";
string[] hexValuesSplit = hexValues.Split(' ');
foreach (string hex in hexValuesSplit)
{
    // Convert the number expressed in base-16 to an integer.
    int value = Convert.ToInt32(hex, 16);
    // Get the character corresponding to the integral value.
    string stringValue = Char.ConvertFromUtf32(value);
    char charValue = (char)value;
    Console.WriteLine("hexadecimal value = {0}, int value = {1}, char value = {2} or {3}",
        hex, value, stringValue, charValue);
}
The question is unclear - what is the database column's type? Does it contain 6 bytes, or 12 characters with the hex encoding of the bytes? In any case, this has nothing to do with special characters or encodings.
First, 0x18 is the byte value of the Cancel Character in the Latin 1 codepage, not the pound sign. That's 0xA3. It seems that the byte values in the question are just the Latin 1 bytes for the string in hex.
.NET strings are Unicode (UTF16LE specifically). There's no UTF8 string or Latin1 string. Encodings and codepages apply when converting bytes to strings or vice versa. This is done using the Encoding class, e.g. Encoding.GetBytes.
In this case, this code will convert the bytes to the expected string form, including the unprintable character:
var dbBytes = new byte[] { 0xBA, 0xA3, 0x7D, 0x40, 0x18, 0x6D };
var latinEncoding = Encoding.GetEncoding(1252);
var result = latinEncoding.GetString(dbBytes);
The result is:
º£}@m
With the Cancel character between @ and m.
If the database column contains the byte values as strings:
it takes double the required space and
the hex values have to be converted back to bytes before converting to strings
The x format specifier is used to convert numbers or bytes to their hex form. For each byte value, ToString("x2") returns the two-digit hex string (the 2 pads a leading zero for values below 0x10, which keeps the string parseable two characters at a time).
The hex string can be produced from the original buffer with:
var dbBytes = new byte[] { 0xBA, 0xA3, 0x7D, 0x40, 0x18, 0x6D };
var hexString = String.Join("", dbBytes.Select(c => c.ToString("x2")));
There are many questions that show how to parse a byte string into a byte array. I'll just steal Jared Parsons' LINQ answer:
public static byte[] StringToByteArray(string hex) {
    return Enumerable.Range(0, hex.Length)
                     .Where(x => x % 2 == 0)
                     .Select(x => Convert.ToByte(hex.Substring(x, 2), 16))
                     .ToArray();
}
With that, we can parse the hex string into a byte array and convert it back to the original string:
var bytes = StringToByteArray(hexString);
var latinEncoding = Encoding.GetEncoding(1252);
var result = latinEncoding.GetString(bytes);
First of all, you don't need a hex dump but Unicode. I would recommend reading about Unicode, encodings, etc., and why this is a problem with strings.
PS: solution: StackOverflow
I'm checking whether a string value contains only digits, using the following LINQ query:
bool isAllDigits = !query.Any(ch => ch < '0' || ch > '9');
But if the user enters a space along with digits (e.g. "123 456", "123456 ") this check fails. Is there a way that I can check for whitespace as well?
I don't want to trim out the white spaces, because I use this text box to search with text, which contains spaces in between.
Try this:
bool isAllDigits = query.All(c => char.IsWhiteSpace(c) || char.IsDigit(c));
These methods are built into the framework and instead of rolling your own checks, I would suggest using these.
I made a fiddle here to demonstrate.
EDIT
As pointed out in the comments, char.IsDigit() will return true for other "digit" characters as well (i.e. not just '0'-'9', but also other language/culture number representations). The full list (I believe) can be found here.
In addition, char.IsWhiteSpace() will also return true for various types of whitespace characters (see the docs for more details).
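For instance (a quick illustration of both points; U+0663 is ARABIC-INDIC DIGIT THREE and U+00A0 is a no-break space):
Console.WriteLine(char.IsDigit('\u0663'));      // True - a Unicode decimal digit, but not '0'-'9'
Console.WriteLine(char.IsWhiteSpace('\u00A0')); // True - no-break space counts as whitespace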
If you want to only allow 0-9 and a regular 'ol space character, you can do this:
bool isAllDigits = s.All(c => (c >= 48 && c <= 57) || c == 32);
I am using the decimal values of the ASCII characters (for reference), but you could also continue doing it the way you are currently:
bool isAllDigits = s.All(c => (c >= '0' && c <= '9') || c == ' ');
Either way is fine; the important thing to note is the parentheses and the inclusive bounds. You want any character from '0' through '9' inclusive, OR a single space.
bool isAllDigits = !query.Any(ch => (ch < '0' || ch > '9') && ch != ' ');
I read that there should be no difference between Latin-1 and UTF-8 for printable characters. I thought that a Latin-1 'Ä' would map into UTF-8 twice:
once to the multi-byte version and once directly.
Why does it seem like this is not the case?
It certainly seems like the standard could treat anything that looks like a continuation byte, but is not actually a continuation, as having its Latin-1 meaning, without losing anything.
Am I just missing a flag or something that would allow me to convert the data as described, or am I missing the bigger picture?
Here is a C# example; the output on my system is shown after the code.
static void Main(string[] args)
{
    DecodeTest("ascii7", " ~", new byte[] { 0x20, 0x7E });
    DecodeTest("Latin-1", "Ä", new byte[] { 0xC4 });
    DecodeTest("UTF-8", "Ä", new byte[] { 0xC3, 0x84 });
}

private static void DecodeTest(string testname, string expected, byte[] encoded)
{
    var utf8 = Encoding.UTF8;
    string ascii7_actual = utf8.GetString(encoded, 0, encoded.Length);
    //Console_Write(encoded);
    AssertEqual(testname, expected, ascii7_actual);
}

private static void AssertEqual(string testname, string expected, string actual)
{
    Console.WriteLine("Test: " + testname);
    if (actual != expected)
    {
        Console.WriteLine("\tFail");
        Console.WriteLine("\tExpected: '" + expected + "' but was '" + actual + "'");
    }
    else
    {
        Console.WriteLine("\tPass");
    }
}

private static void Console_Write(byte[] ascii7_encoded)
{
    bool more = false;
    foreach (byte b in ascii7_encoded)
    {
        if (more)
        {
            Console.Write(", ");
        }
        Console.Write("0x{0:X}", b);
        more = true;
    }
}
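The output: the ascii7 and UTF-8 tests pass, but the Latin-1 test fails, because the lone byte 0xC4 is not valid UTF-8 and the decoder substitutes the replacement character U+FFFD:
Test: ascii7
    Pass
Test: Latin-1
    Fail
    Expected: 'Ä' but was '�'
Test: UTF-8
    Pass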
I read that there should be no difference between Latin-1 and UTF-8 for printable characters.
You read wrong. There is no difference between Latin-1 (and many other encodings including the rest of the ISO 8859 family) and UTF-8 for characters in the US-ASCII range (U+0000 to U+007F). They are different for all other characters.
I thought that a Latin-1 'Ä' would map into UTF-8 twice. Once to the multi-byte version and once directly.
For this to be possible, UTF-8 would have to be stateful, or to otherwise use information earlier in the stream to know whether to interpret an octet as a direct mapping or as part of a multibyte encoding. One of the great advantages of UTF-8 is that it is not stateful.
Why does it seem like this is not the case?
Because it's just plain wrong.
It certainly seems like the standard could treat anything that looks like a continuation byte, but is not actually a continuation, as having its Latin-1 meaning, without losing anything.
It couldn't do so without losing the quality of not being stateful, which would mean corruption would destroy the entire text following the error rather than just one character.
Am I just missing a flag or something that would allow me to convert the data as described, or am I missing the bigger picture?
No, you just have a completely incorrect idea about how UTF-8 and/or Latin-1 works.
A flag would remove UTF-8's simplicity in being non-stateful and self-synchronising (you can always tell immediately if you are at a single-octet character, the start of a character or part-way into a character) as mentioned above. It would also remove UTF-8's simplicity in being algorithmic. All UTF-8 encodings map as follows.
To map from code-point to encoding:
Consider the bits of the character xxxx…; e.g. for U+0027 they are 100111, and for U+1F308 they are 11111001100001000.
Find the smallest of the following patterns the bits will fit into:
0xxxxxxx
110xxxxx 10xxxxxx
1110xxxx 10xxxxxx 10xxxxxx
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
So U+0027 is 00100111, i.e. 0x27, and U+1F308 is 11110000 10011111 10001100 10001000, i.e. 0xF0 0x9F 0x8C 0x88.
To go from octets to code-points you undo this.
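As a minimal sketch of that table (not code from the answer; in practice Encoding.UTF8 does this for you), the code-point-to-octets direction can be written as:
// Encode one code point (0x0..0x10FFFF) to UTF-8 octets, following the patterns above.
static byte[] EncodeUtf8(int cp)
{
    if (cp < 0x80) return new[] { (byte)cp };                            // 0xxxxxxx
    if (cp < 0x800) return new[] { (byte)(0xC0 | (cp >> 6)),            // 110xxxxx 10xxxxxx
                                   (byte)(0x80 | (cp & 0x3F)) };
    if (cp < 0x10000) return new[] { (byte)(0xE0 | (cp >> 12)),         // 1110xxxx 10xxxxxx 10xxxxxx
                                     (byte)(0x80 | ((cp >> 6) & 0x3F)),
                                     (byte)(0x80 | (cp & 0x3F)) };
    return new[] { (byte)(0xF0 | (cp >> 18)),                           // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
                   (byte)(0x80 | ((cp >> 12) & 0x3F)),
                   (byte)(0x80 | ((cp >> 6) & 0x3F)),
                   (byte)(0x80 | (cp & 0x3F)) };
}
// EncodeUtf8(0x27)    -> { 0x27 }
// EncodeUtf8(0x1F308) -> { 0xF0, 0x9F, 0x8C, 0x88 }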
To map to Latin 1 you just put the character into an octet (which obviously only works if it is in the range U+0000 to U+00FF).
As you can see, there's no way that a character outside of the range U+0000 to U+007F can have matching encodings in UTF-8 and Latin-1. ("Latin 1" is also used as a name for CP-1252, a Microsoft encoding that adds further printable characters, but still covers only a tiny fraction of those covered by UTF-8.)
There is a way that a character could theoretically have more than one UTF-8 encoding, but it is explicitly banned. Consider that instead of putting the bits of U+0027 into the single unit 00100111 we could also zero-pad and put it into 11000000 10100111, encoding it as 0xC0 0xA7. The same decoding algorithm would bring us back to U+0027 (try it and see). However, as well as introducing needless complexity in having such synonym encodings, this also introduces security issues, and indeed there have been real-world security holes caused by code that would accept over-long UTF-8.
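A quick check of this in .NET (the default UTF8Encoding uses replacement rather than throwing, so each rejected byte comes back as U+FFFD):
// 0xC0 0xA7 is the banned over-long encoding of U+0027 (').
var overlong = new byte[] { 0xC0, 0xA7 };
Console.WriteLine(Encoding.UTF8.GetString(overlong)); // prints "��" (two U+FFFD), not "'"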
Maybe you need a scan function to decide which decoder is required?
Try this:
/// <summary>
/// Count valid UTF8-Bytes
/// </summary>
/// <returns>
/// -1 = invalid UTF8-Bytes (may be Latin1)
/// 0 = ASCII only 7-Bit
/// n = count of multi-byte UTF8 characters
/// </returns>
public static int Utf8CodedCharCounter(byte[] value)
{
    int utf8Count = 0;
    for (int i = 0; i < value.Length; i++)
    {
        byte c = value[i];
        if ((c & 0x80) == 0) continue; // valid 7-bit ASCII -> skip
        if ((c & 0xc0) == 0x80) return -1; // stray continuation byte -> wrong UTF8-Char
        // 2-byte UTF8
        i++; if (i >= value.Length || (value[i] & 0xc0) != 0x80) return -1; // wrong UTF8-Char
        if ((c & 0xe0) == 0xc0) { utf8Count++; continue; }
        // 3-byte UTF8
        i++; if (i >= value.Length || (value[i] & 0xc0) != 0x80) return -1; // wrong UTF8-Char
        if ((c & 0xf0) == 0xe0) { utf8Count++; continue; }
        // 4-byte UTF8
        i++; if (i >= value.Length || (value[i] & 0xc0) != 0x80) return -1; // wrong UTF8-Char
        if ((c & 0xf8) == 0xf0) { utf8Count++; continue; }
        return -1; // invalid UTF8 length
    }
    return utf8Count;
}
and update your code:
private static void DecodeTest(string testname, string expected, byte[] encoded)
{
    var decoder = Utf8CodedCharCounter(encoded) >= 0 ? Encoding.UTF8 : Encoding.Default;
    string ascii7_actual = decoder.GetString(encoded, 0, encoded.Length);
    //Console_Write(encoded);
    AssertEqual(testname, expected, ascii7_actual);
}
result:
Test: ascii7
    Pass
Test: Latin-1
    Pass
Test: UTF-8
    Pass
I have a 9-character string I am trying to run multiple checks on. I want to first check whether the first 1-7 characters are digits, and then, say for example the first 3 characters are digits, how would I check the 5th character for a letter in the range G through T?
I am using C# and have tried this so far...
string checkString = "123H56789";
Regex charactorSetOne = new Regex("[G-T]");
Match matchSetOne = charactorSetOne.Match(checkString, 3);
if (Char.IsNumber(checkString[0]) && Char.IsNumber(checkString[1]) && Char.IsNumber(checkString[2]))
{
    if (matchSetOne.Success)
    {
        Console.WriteLine("4th char is a letter");
    }
}
But am not sure if this is the best way to handle the validations.
UPDATE:
The digits can be 0-9, and anywhere from one to seven of them can run together before the letter, like "12345T789" or "1R3456789", etc.
It's easy with LINQ.
Check if the first 1-7 characters are numbers:
var condition1 = input.Take(7).All(c => Char.IsDigit(c));
Check the 5th character for a letter in the range G through T:
var condition2 = input.ElementAt(4) >= 'G' && input.ElementAt(4) <= 'T';
As it is, both conditions can't be true at the same time (if the first 7 chars are digits, then the 5th char can't be a letter).
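Putting the two together, a sketch that follows the UPDATE (count the leading digits, then range-check the character right after the run; the 1-7 bound and the sample input are taken from the question, and System.Linq is assumed to be in scope):
string input = "123H56789";
int leadingDigits = input.TakeWhile(char.IsDigit).Count(); // length of the digit run
bool ok = leadingDigits >= 1 && leadingDigits <= 7
          && leadingDigits < input.Length
          && input[leadingDigits] >= 'G' && input[leadingDigits] <= 'T';
// ok is true for "123H56789" and "1R3456789", false for "123456789"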
How can I get this ƒ character from the ASCII table? I have tried it like this:
txt2 = (char)131;
I am only able to get values up to 127. If I use anything above 127 it returns a null-like value. So how can I get up to 255?
ƒ isn't ASCII... it is Unicode, Unicode Character 'LATIN SMALL LETTER F WITH HOOK' (U+0192).
char ch = 'ƒ';
or
char ch = (char)0x0192;
or
char ch = '\x0192';
There are only 128 characters in the ASCII set (0-127), and there are no non-English letters (there are only A-Z and a-z).
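As a side note, and only a guess about where 131 came from: in the Windows-1252 code page, byte 0x83 (131) is ƒ, so if you are starting from that byte value you can decode it with that code page instead of casting (a bare cast gives the C1 control character U+0083):
// Assumes the value 131 is a Windows-1252 byte; on .NET Core you may first need
// Encoding.RegisterProvider(CodePagesEncodingProvider.Instance) from System.Text.Encoding.CodePages.
var cp1252 = Encoding.GetEncoding(1252);
string s = cp1252.GetString(new byte[] { 131 }); // "ƒ" (U+0192)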